A major issue in cluster analysis is deciding on the number of clusters. Although there are no hard and fast rules, some guidelines are available.
- Theoretical, conceptual, or practical considerations may suggest a certain number of clusters. For example, if the purpose of clustering is to identify market segments, management may want a particular number of clusters.
- In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram. In our case, we see from the “Agglomeration Schedule” in Table 20.2 that the value in the “Coefficients” column suddenly more than doubles between stages 17 (three clusters) and 18 (two clusters). Likewise, at the last two stages of the dendrogram in Figure 20.8, the clusters are being combined at large distances. Therefore, it appears that a three-cluster solution is appropriate.
- In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow, or sharp bend, occurs indicates an appropriate number of clusters. Increasing the number of clusters beyond this point is usually not worthwhile.
- The relative sizes of the clusters should be meaningful. In Table 20.2, by making a simple frequency count of cluster membership, we see that a three-cluster solution results in clusters with eight, six, and six elements. However, if we go to a four-cluster solution, the sizes of the clusters are eight, six, five, and one. It is not meaningful to have a cluster with only one case, so a three-cluster solution is preferable in this situation.
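The agglomeration-schedule and cluster-size guidelines above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular statistics package. The function names `suggest_k_from_schedule` and `has_tiny_cluster` are hypothetical, and the coefficient values are made up so that the jump falls between stages 17 and 18. Only the case-to-cluster membership is taken from the three-cluster solution reported for this example.

```python
from collections import Counter


def suggest_k_from_schedule(coefficients):
    """Return (k, ratio) for the largest relative jump between stages.

    coefficients[i] is the merge coefficient at stage i + 1. With n cases
    there are n - 1 stages, and n - s clusters remain after stage s.
    """
    n_cases = len(coefficients) + 1
    best_stage, best_ratio = None, 0.0
    for stage in range(1, len(coefficients)):
        prev, curr = coefficients[stage - 1], coefficients[stage]
        if prev <= 0:
            continue  # the ratio is undefined for a zero coefficient
        ratio = curr / prev
        if ratio > best_ratio:
            best_stage, best_ratio = stage, ratio
    if best_stage is None:
        raise ValueError("no usable pair of consecutive coefficients")
    # Stop before the big jump: keep the clusters that exist after best_stage.
    return n_cases - best_stage, best_ratio


def has_tiny_cluster(sizes, min_size=2):
    """True if any cluster has fewer than min_size members."""
    return any(size < min_size for size in sizes)


# Hypothetical coefficients for 20 cases (19 stages); NOT the Table 20.2 values.
hypothetical_coefficients = [
    10, 11, 12, 13.5, 15, 17, 19, 21, 24, 27,
    30, 34, 38, 43, 48, 54, 60, 130, 170,
]
k, ratio = suggest_k_from_schedule(hypothetical_coefficients)
print(f"suggested clusters: {k} (coefficient ratio {ratio:.2f})")

# Membership of the three-cluster solution as listed in the text.
membership = {}
for cluster, cases in {
    1: (1, 3, 6, 7, 8, 12, 15, 17),
    2: (2, 5, 9, 11, 13, 20),
    3: (4, 10, 14, 16, 18, 19),
}.items():
    for case in cases:
        membership[case] = cluster

sizes = sorted(Counter(membership.values()).values(), reverse=True)
print("three-cluster sizes:", sizes, "tiny cluster:", has_tiny_cluster(sizes))
print("four-cluster sizes: [8, 6, 5, 1] tiny cluster:",
      has_tiny_cluster([8, 6, 5, 1]))
```

The ratio rule is only one way to read the schedule. A jump in the absolute difference between coefficients could be used instead, and either rule should be checked against the dendrogram and against the practical considerations listed above.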
Interpret and Profile the Clusters
Interpreting and profiling clusters involves examining the cluster centroids. The centroids represent the mean values of the objects contained in the cluster on each of the variables. The centroids enable us to describe each cluster by assigning it a name or label. If the clustering program does not print this information, it may be obtained through discriminant analysis. Table 20.3 gives the centroids or mean values for each cluster in our example. Cluster 1 has relatively high values on variables V1 (shopping is fun) and V3 (I combine shopping with eating out). It also has a low value on V5 (I don’t care about shopping). Hence, cluster 1 could be labeled “fun-loving and concerned shoppers.” This cluster consists of cases 1, 3, 6, 7, 8, 12, 15, and 17. Cluster 2 is just the opposite, with low values on V1 and V3 and a high value on V5, and this cluster could be labeled “apathetic shoppers.” Members of cluster 2 are cases 2, 5, 9, 11, 13, and 20. Cluster 3 has high values on V2 (shopping upsets my budget), V4 (I try to get the best buys when shopping), and V6 (you can save a lot of money by comparing prices). Thus, this cluster could be labeled “economical shoppers.” Cluster 3 comprises cases 4, 10, 14, 16, 18, and 19.
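As a rough sketch of how centroids are obtained, the snippet below averages each variable within each cluster. The ratings are hypothetical seven-point responses on V1 through V6 invented for illustration, not the data behind Table 20.3, and `cluster_centroids` is an assumed helper name.

```python
from collections import defaultdict


def cluster_centroids(rows, labels):
    """Return {cluster label: list of per-variable means}.

    rows   -- one list of ratings per case, all of the same length
    labels -- the cluster label assigned to each case, in the same order
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in grouped.items()
    }


# Hypothetical ratings on V1..V6 for six cases (two per cluster).
ratings = [
    [6, 4, 7, 3, 2, 3],
    [7, 3, 6, 4, 1, 4],
    [2, 3, 1, 4, 7, 4],
    [1, 3, 2, 4, 6, 3],
    [4, 6, 4, 7, 3, 7],
    [3, 7, 3, 6, 4, 6],
]
labels = [1, 1, 2, 2, 3, 3]

centroids = cluster_centroids(ratings, labels)
for label in sorted(centroids):
    print(f"cluster {label}:", centroids[label])
```

Reading across a row of the result shows which variables are high or low for that cluster, which is the basis for choosing a descriptive label.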
Often it is helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables. For example, the clusters may have been derived based on benefits sought. Further profiling may be done in terms of demographic and psychographic variables to target marketing efforts for each cluster. The variables that significantly differentiate between clusters can be identified via discriminant analysis and one-way analysis of variance.
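To make the one-way analysis of variance step concrete, the sketch below computes the F statistic for a single profiling variable across clusters. The respondent ages are hypothetical, and `one_way_f` is an assumed helper name. The code returns only the F ratio; judging significance also requires the F distribution with the stated degrees of freedom.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups -- a list of lists, one list of observations per cluster
    """
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Variation of the cluster means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, means)
    )
    # Variation of the cases around their own cluster mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, means)
        for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical ages of respondents in each of three clusters.
ages_by_cluster = [
    [24, 27, 31, 29],
    [45, 52, 48, 50],
    [36, 33, 39, 35],
]
f_ratio, df_b, df_w = one_way_f(ages_by_cluster)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

A large F ratio suggests that the variable differs across clusters and is therefore useful for profiling. In practice a library routine such as `scipy.stats.f_oneway` would typically be used, since it also reports the p-value.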
Assess Reliability and Validity
Given the several judgments entailed in cluster analysis, no clustering solution should be accepted without some assessment of its reliability and validity. Formal procedures for assessing the reliability and validity of clustering solutions are complex and not fully defensible. Hence, we omit them here. However, the following procedures provide adequate checks on the quality of clustering results.
- Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.
- Use different methods of clustering and compare the results.
- Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.
- Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.
- In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using a different order of cases until the solution stabilizes.
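Several of the checks above come down to comparing two cluster assignments of the same cases, for example from two distance measures, two methods, or two orderings of the data. The text does not prescribe a measure of agreement. One common choice, used here as an assumption, is the Rand index: the share of case pairs that both solutions treat the same way (together in both, or apart in both). The function name `rand_index` and the label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the share of case pairs on which two clusterings agree.

    A pair counts as an agreement when both solutions put the two cases in
    the same cluster, or both put them in different clusters. Cluster labels
    themselves need not match across solutions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must cover the same cases")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two cases")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical assignments of six cases under two distance measures.
solution_a = [1, 1, 1, 2, 2, 3]
solution_b = [2, 2, 2, 1, 1, 3]   # same grouping, different label numbers
solution_c = [1, 1, 2, 2, 3, 3]   # a genuinely different grouping

print("a vs b:", rand_index(solution_a, solution_b))
print("a vs c:", round(rand_index(solution_a, solution_c), 3))
```

A value near 1 indicates a stable solution. The index is not corrected for chance agreement, so values should be compared across runs rather than read as absolute measures of quality.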
We further illustrate hierarchical clustering with a study of differences in marketing strategy among American, Japanese, and British firms.