Before discussing the statistics associated with cluster analysis, it should be mentioned that most clustering methods are relatively simple procedures that are not supported by an extensive body of statistical reasoning. Rather, most clustering methods are heuristics, which are based on algorithms. Thus, cluster analysis contrasts sharply with analysis of variance, regression, discriminant analysis, and factor analysis, which are based upon an extensive body of statistical reasoning. Although many clustering methods have important statistical properties, the fundamental simplicity of these methods needs to be recognized. The following statistics and concepts are associated with cluster analysis.
Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
Cluster centroid. The cluster centroid consists of the mean values of the variables for all the cases or objects in a particular cluster.
Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers or seeds.
Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.
Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right. Figure 20.8 is a dendrogram.
Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.
Icicle plot. An icicle plot is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle plot is read from bottom to top. Figure 20.7 is an icicle plot.
Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangular matrix containing pairwise distances between objects or cases.
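Two of the concepts above, the similarity/distance coefficient matrix and the cluster centroid, can be made concrete with a short sketch. The ratings below are hypothetical values invented for illustration (they are not taken from Table 20.1), the function names are our own, and the squared euclidean distance is assumed as the distance measure.

```python
# Hypothetical ratings for four respondents on three 7-point items
# (illustrative values only, not taken from Table 20.1).
ratings = [
    [6, 3, 7],
    [2, 4, 1],
    [5, 3, 6],
    [1, 5, 2],
]


def squared_euclidean(a, b):
    """Sum of squared differences across the variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lower_triangle(cases, dist):
    """Row i holds the distances from case i to every earlier case."""
    return [[dist(cases[i], cases[j]) for j in range(i)]
            for i in range(len(cases))]


def centroid(cases):
    """Mean of each variable over the cases assigned to one cluster."""
    n = len(cases)
    return [sum(column) / n for column in zip(*cases)]


matrix = lower_triangle(ratings, squared_euclidean)
for case_number, row in enumerate(matrix, start=1):
    print(case_number, row)
# 1 []
# 2 [53]
# 3 [2, 35]
# 4 [54, 3, 36]

# Centroid of a hypothetical cluster made up of respondents 1 and 3.
print(centroid([ratings[0], ratings[2]]))  # [5.5, 3.0, 6.5]
```

Only the lower triangle is stored because the distance from case i to case j equals the distance from case j to case i, and the distance from a case to itself is zero.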
Conducting Cluster Analysis
The steps involved in conducting cluster analysis are listed in Figure 20.3. The first step is to formulate the clustering problem by defining the variables on which the clustering will be based. Then an appropriate distance measure must be selected. The distance measure determines how similar or dissimilar the objects being clustered are. Several clustering procedures have been developed, and the researcher should select one that is appropriate for the problem at hand. Deciding on the number of clusters requires judgment on the part of the researcher. The derived clusters should be interpreted in terms of the variables used to cluster them and profiled in terms of additional salient variables. Finally, the researcher must assess the validity of the clustering process.
Formulate the Problem
Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based. Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution. Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem. The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.
To illustrate, we consider a clustering of consumers based on attitudes toward shopping. Based on past research, six attitudinal variables were identified. Consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = disagree, 7 = agree):
V1: Shopping is fun.
V2: Shopping is bad for your budget.
V3: I combine shopping with eating out.
V4: I try to get the best buys when shopping.
V5: I don’t care about shopping.
V6: You can save a lot of money by comparing prices.
Data obtained from a pretest sample of 20 respondents are shown in Table 20.1. A small sample size has been used to illustrate the clustering process. In actual practice, cluster analysis is performed on a much larger sample such as that in the Dell running case and other cases with real data that are presented in this book.
Select a Distance or Similarity Measure
Because the objective of clustering is to group similar objects together, some measure is needed to assess how similar or different the objects are. The most common approach is to measure similarity in terms of distance between pairs of objects. Objects with smaller distances between them are more similar to each other than are those at larger distances. There are several ways to compute the distance between two objects.
The most commonly used measure of similarity is the euclidean distance or its square. The euclidean distance is the square root of the sum of the squared differences in values for each variable. Other distance measures are also available. The city-block or Manhattan distance between two objects is the sum of the absolute differences in values for each variable. The Chebychev distance between two objects is the maximum absolute difference in values for any variable. For our example, we will use the squared euclidean distance.
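The four measures just described can be sketched in a few lines. The two respondents below are hypothetical (their ratings on V1 through V6 are invented for illustration and are not rows of Table 20.1), and the function names are our own.

```python
import math


def squared_euclidean(a, b):
    """Sum of the squared differences in values for each variable."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the squared euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))


def city_block(a, b):
    """Sum of the absolute differences in values for each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Maximum absolute difference in values for any variable."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical ratings on V1..V6 (7-point scale), for illustration only.
respondent_a = [6, 3, 7, 4, 2, 4]
respondent_b = [3, 4, 3, 5, 5, 4]

print(squared_euclidean(respondent_a, respondent_b))  # 36
print(euclidean(respondent_a, respondent_b))          # 6.0
print(city_block(respondent_a, respondent_b))         # 12
print(chebychev(respondent_a, respondent_b))          # 4
```

Squaring the differences gives more weight to the variables on which two respondents differ most, whereas the Chebychev distance looks only at the single largest difference.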
If the variables are measured in vastly different units, the clustering solution will be influenced by the units of measurement. In a supermarket shopping study, attitudinal variables may be measured on a 9-point Likert-type scale; patronage, in terms of frequency; and brand loyalty, in terms of the percentage of grocery shopping expenditure allocated to the favorite supermarket. In these cases, before clustering respondents, we must standardize the data by rescaling each variable to have a mean of zero and a standard deviation of unity. Although standardization can remove the influence of the unit of measurement, it can also reduce the differences between groups on variables that may best discriminate groups or clusters. It is also desirable to eliminate outliers (cases with atypical values).
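Standardization of this kind can be sketched as follows. The three variables and their values are hypothetical, chosen only to mimic measurements in different units, and the use of the sample standard deviation (divisor n - 1) is an assumption; the text does not say which divisor is intended.

```python
import statistics


def standardize(column):
    """Rescale one variable to mean 0 and standard deviation 1.

    Assumes the sample standard deviation (n - 1 divisor) and a
    variable that is not constant (a zero deviation cannot be rescaled).
    """
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


# Hypothetical data for five respondents, for illustration only.
variables = {
    "attitude (9-point scale)": [7, 3, 9, 5, 6],
    "patronage (frequency)": [4, 1, 8, 2, 5],
    "loyalty (% of expenditure)": [60.0, 15.0, 90.0, 35.0, 50.0],
}

standardized = {name: standardize(col) for name, col in variables.items()}

for name, z_scores in standardized.items():
    print(name, [round(z, 2) for z in z_scores])
```

After rescaling, each variable contributes to the distance on a comparable footing, so the percentage variable no longer dominates simply because its numbers are larger.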
Use of different distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results. Having selected a distance or similarity measure, we can next select a clustering procedure.
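A small sketch shows how the choice of measure can change which objects appear most similar. The three respondents and their two ratings are hypothetical, constructed so that the measures disagree; the helper names are our own.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def nearest(target, others, dist):
    """Label of the object in `others` closest to `target` under `dist`."""
    return min(others, key=lambda label: dist(target, others[label]))


# Hypothetical ratings on two 7-point items, for illustration only.
target = [2, 2]
others = {"B": [5, 5], "C": [6, 2]}

print(nearest(target, others, euclidean))   # C
print(nearest(target, others, city_block))  # C
print(nearest(target, others, chebychev))   # B
```

Respondent B differs moderately from the target on both items, while C differs strongly on one item only. The Chebychev distance penalizes C's single large difference, so it pairs the target with B, whereas the other two measures pair it with C.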