Hierarchical Clustering is an unsupervised learning technique that builds a hierarchy of clusters by either merging smaller clusters into larger ones (agglomerative approach) or splitting larger clusters into smaller ones (divisive approach). The result is a tree-like structure called a dendrogram that shows the nested relationships between clusters at different levels of granularity. Hierarchical clustering allows exploration of cluster structures at multiple levels simultaneously. The algorithm uses distance metrics (Euclidean, Manhattan) and linkage criteria (single, complete, average) to determine which clusters to merge or split. This technique is particularly valuable for understanding hierarchical relationships in data, such as biological taxonomies, customer segments, or document organization.
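As a quick orientation, the following minimal Python sketch (assuming NumPy, SciPy, and Matplotlib are installed) runs the agglomerative workflow end to end; the small feature matrix X and point labels are purely illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative 2-D points; in practice X is a prepared feature matrix.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.5, 8.2], [9.0, 1.0]])

# Build the complete merge hierarchy with Euclidean distance and average linkage.
Z = linkage(X, method="average", metric="euclidean")

# Each row of Z records one merger: the two clusters joined, the merge
# height (dissimilarity), and the size of the resulting cluster.
print(Z)

# Draw the hierarchy as a dendrogram.
dendrogram(Z, labels=["p1", "p2", "p3", "p4", "p5"])
plt.ylabel("dissimilarity")
plt.show()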
Objectives of Hierarchical Clustering:
1. Build a Hierarchy of Clusters
The primary objective of hierarchical clustering is to build a hierarchy of clusters that reveals nested relationships at multiple levels of granularity. Unlike partitioning methods that produce a single flat partition, hierarchical clustering generates a tree-like structure called a dendrogram. This hierarchy shows how clusters relate to each other, with smaller clusters nested within larger ones. For example, in biological taxonomy, species cluster into genera, genera into families, families into orders, and so on. This hierarchical structure enables analysts to understand data organization at different abstraction levels simultaneously. The hierarchy preserves information about cluster relationships that flat methods discard, providing richer insight into data structure. Users can cut the dendrogram at different heights to obtain clusterings with varying numbers of clusters, depending on their analytical needs.
2. Eliminate Need to Pre-specify Cluster Count
Eliminating the need to pre-specify the cluster count is a major advantage of hierarchical clustering over partitioning methods like k-means. The algorithm builds the complete hierarchy without requiring the user to decide K (the number of clusters) in advance. After examining the dendrogram, analysts can choose the appropriate level of granularity based on domain knowledge or application requirements. For example, a marketing analyst might cut the dendrogram at a higher level for broad customer segments, and at a lower level for more targeted micro-segments. This flexibility is particularly valuable when the optimal number of clusters is unknown or when multiple clusterings at different granularities are useful. The dendrogram provides visual guidance for selecting a cluster count based on natural gaps or on business needs.
3. Reveal Nested Relationships in Data
Revealing nested relationships in data enables understanding of how clusters relate to each other hierarchically. Many real-world phenomena exhibit natural hierarchies: customer segments containing sub-segments, product categories containing sub-categories, or biological classifications. Hierarchical clustering captures these structures, showing which clusters are most similar and how they combine at higher levels. For example, in customer segmentation, the dendrogram might reveal that “budget shoppers” and “occasional buyers” are more similar to each other than either is to “premium loyalists,” informing marketing strategy. Understanding these nested relationships provides deeper insight than flat clustering, revealing not just that groups exist but how they are related. This hierarchical perspective often aligns with how humans naturally categorize and understand complex domains.
4. Provide Visual Understanding Through Dendrograms
Providing visual understanding through dendrograms transforms abstract clustering results into intuitive, interpretable visualizations. The dendrogram displays the entire clustering hierarchy as a tree, where branches represent clusters and branch lengths indicate dissimilarity between merged clusters. Analysts can visually identify natural groupings, outliers, and the level at which clusters stabilize. For example, a dendrogram might show two main branches separating clearly, with smaller sub-branches within each, suggesting two primary clusters with meaningful sub-structure. The visual nature of dendrograms makes hierarchical clustering particularly accessible to non-technical stakeholders, who can see and understand cluster relationships at a glance. This visualization capability supports exploratory analysis, hypothesis generation, and communication of findings to business audiences.
5. Handle Different Data Types and Distance Metrics
Handling different data types and distance metrics makes hierarchical clustering versatile across diverse applications. The algorithm works with any data type for which a distance or similarity measure can be defined, including numerical, categorical, binary, or mixed data. Different distance metrics capture different aspects of similarity: Euclidean for continuous data, Manhattan for grid-based data, cosine for text, Jaccard for sets. This flexibility enables hierarchical clustering in domains ranging from customer analytics to bioinformatics to document organization. For example, in text analysis, cosine distance captures document similarity based on word frequencies. In genetics, specialized distance metrics reflect evolutionary relationships. The ability to choose an appropriate distance measure for each data type ensures that hierarchical clustering adapts to domain-specific similarity concepts.
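As a hedged illustration of this flexibility, the sketch below (assuming SciPy) plugs several metrics into the same pairwise-distance call; the toy numeric and binary matrices are made up for the example.

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0, 3.0], [2.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
print(squareform(pdist(X, metric="euclidean")))   # continuous data
print(squareform(pdist(X, metric="cityblock")))   # Manhattan distance
print(squareform(pdist(X, metric="cosine")))      # orientation-based, common for text vectors

# Jaccard distance on binary (set-membership) data.
B = np.array([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=bool)
print(squareform(pdist(B, metric="jaccard")))

# Any of these condensed distance vectors can be passed directly to
# scipy.cluster.hierarchy.linkage in place of the raw observations.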
6. Discover Clusters of Arbitrary Shape
Discovering clusters of arbitrary shape overcomes a limitation of many partitioning algorithms that assume spherical clusters. Hierarchical clustering makes no assumptions about cluster geometry, instead basing groupings on proximity relationships between individual points. This enables discovery of clusters with complex, non-convex shapes that reflect natural data structure. For example, hierarchical clustering can identify crescent-shaped clusters, elongated clusters, or clusters with unusual contours that k-means would merge or split incorrectly. The algorithm achieves this flexibility by working directly with distance matrices rather than assuming parametric cluster shapes. This capability is particularly valuable in spatial data analysis, image segmentation, and any domain where clusters may have irregular forms that reflect underlying generative processes rather than convenient mathematical assumptions.
7. Identify Outliers and Noise
Identifying outliers and noise is facilitated by hierarchical clustering’s detailed view of data structure. In the dendrogram, outliers often appear as single-point clusters that join the main hierarchy only at very high dissimilarity levels. These late-joining points stand out visually, alerting analysts to potential anomalies requiring investigation. For example, in customer data, a point that joins only at the top of the dendrogram might represent an unusual high-value customer or a data error. This outlier detection capability supports both data quality improvement and discovery of rare but important cases. Unlike some partitioning methods that force outliers into clusters, distorting cluster characteristics, hierarchical clustering reveals their distinctive nature, enabling appropriate handling whether through removal, separate analysis, or focused investigation.
8. Enable Multi-Level Analysis
Enabling multi-level analysis allows examination of data at different granularities within a single analytical framework. The dendrogram can be cut at any height to produce clusterings with different numbers of clusters, each representing a valid perspective on data organization. For example, a retail chain might examine store clusters at the national level for broad strategy, the regional level for territory planning, and the local level for individual store benchmarking. This multi-level capability supports both strategic and operational decision-making from the same underlying analysis. Analysts can explore how patterns emerge and dissolve as granularity changes, gaining insight into data structure unavailable from single-level methods. This flexibility is particularly valuable when different stakeholders require different levels of detail or when the appropriate granularity is not known in advance.
9. Support Exploratory Data Analysis
Supporting exploratory data analysis positions hierarchical clustering as a discovery tool rather than just a confirmatory method. Before formal modeling or hypothesis testing, analysts can use hierarchical clustering to understand data structure, generate hypotheses about groupings, and identify promising directions for further investigation. The dendrogram provides an intuitive map of data relationships, revealing natural groupings that might correspond to meaningful categories. For example, exploring customer survey data might reveal unexpected clusters corresponding to distinct psychographic segments, generating hypotheses for targeted marketing research. This exploratory capability is essential in the early stages of analysis when questions are open-ended and patterns unknown. Hierarchical clustering’s flexibility and visual output make it particularly suitable for this discovery-oriented role.
10. Complement Other Clustering Methods
Complementing other clustering methods enables more comprehensive analysis by combining hierarchical clustering with other techniques. Hierarchical clustering can determine the number of clusters K for methods like k-means by examining the dendrogram for natural cutting points. It can provide initial centroids for k-means, improving algorithm stability and results. Hierarchical clustering can also validate results from other methods, showing whether discovered clusters have meaningful hierarchical relationships. For example, after k-means clustering, hierarchical clustering of the cluster centroids reveals relationships between clusters not apparent in the flat partition. This complementarity enhances analytical capabilities, combining the strengths of different approaches. Rather than competing, hierarchical and partitioning methods work together in integrated analytical workflows, each contributing unique insights, as in the sketch below.
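One way this complementarity can be realized is sketched below, assuming SciPy and scikit-learn are available: a Ward hierarchy is cut into K clusters and the resulting cluster means seed k-means; the synthetic blob data is illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in [(0, 0), (6, 0), (3, 5)]])

k = 3
Z = linkage(X, method="ward")                    # hierarchy on the raw points
labels = fcluster(Z, t=k, criterion="maxclust")  # flat cut into k clusters
init_centroids = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])

# Seed k-means with the hierarchical cluster means instead of random starts.
km = KMeans(n_clusters=k, init=init_centroids, n_init=1).fit(X)
print(np.round(km.cluster_centers_, 2))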
Types of Hierarchical Clustering:
1. Agglomerative Clustering (Bottom–Up)
Agglomerative clustering is the most common hierarchical clustering approach, starting with each data point as its own individual cluster and progressively merging the closest pairs until all points belong to a single cluster. The process begins with N clusters for N data points, then repeatedly combines the two most similar clusters based on a chosen distance metric and linkage criterion. This creates a bottom-up hierarchy recorded in a dendrogram. The algorithm’s computational complexity is O(n³) in its basic form, though optimized implementations achieve O(n² log n). Agglomerative clustering is intuitive and widely implemented in statistical software. Its step-by-step merging process provides a complete record of cluster formation, enabling analysts to understand exactly how groups combine. Applications range from biological taxonomy to document organization to customer segmentation, where natural hierarchical structures exist.
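A minimal scikit-learn sketch of agglomerative clustering, run here on the five-customer data from the worked example later in this section; it assumes a recent scikit-learn release in which the distance argument is named metric (older versions call it affinity).

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Customers A-E as (income, spending score), taken from the worked example below.
X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)

model = AgglomerativeClustering(n_clusters=2, linkage="average", metric="euclidean")
print(model.fit_predict(X))   # expected to separate {C, E} from {A, B, D}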
2. Divisive Clustering (Top–Down)
Divisive clustering takes the opposite approach, starting with all data points in a single cluster and recursively splitting clusters until each point forms its own cluster. This top-down method begins by considering how to best partition the entire dataset into two groups, then further splits each resulting cluster, continuing until individual points remain. Divisive clustering is computationally more complex than agglomerative because evaluating all possible splits at each step is expensive; O(2ⁿ) possibilities make exhaustive search infeasible. Practical implementations use heuristic approaches to choose splits. Despite higher computational cost, divisive clustering can produce more accurate hierarchies for certain data structures because it makes splitting decisions based on the global data view rather than local merges. It is particularly useful when the focus is on identifying major divisions in data before examining finer structure.
3. Single-Linkage Clustering
Single-linkage clustering defines the distance between two clusters as the minimum distance between any point in the first cluster and any point in the second cluster. This “nearest neighbor” approach tends to produce long, chain-like clusters, a phenomenon known as “chaining.” The algorithm merges clusters based on the closest pair of points, regardless of other points’ positions. Single-linkage excels at capturing clusters of arbitrary, non-elliptical shapes, including elongated or irregular forms that other methods miss. However, it is sensitive to noise and outliers, which can act as bridges between distinct clusters, causing premature merging. The method’s chaining tendency can be useful for certain applications like identifying connected components in spatial data but problematic when compact, well-separated clusters are expected. Single-linkage is computationally efficient and has interesting theoretical connections to graph theory and minimum spanning trees.
4. Complete-Linkage Clustering
Complete-linkage clustering defines cluster distance as the maximum distance between any point in the first cluster and any point in the second cluster. This “farthest neighbor” approach tends to produce compact, spherical clusters of roughly equal diameter. By considering the furthest points, complete-linkage ensures that all points in a merged cluster are within a certain distance of each other, promoting homogeneity. This method is less susceptible to chaining than single-linkage and generally produces more balanced dendrograms. However, it is sensitive to outliers, which can artificially inflate distances and prevent natural merges. Complete-linkage works well when clusters are expected to be compact and well-separated, but may force unnatural splits when clusters have varying densities or irregular shapes. It is widely used in biological and social sciences where relatively homogeneous groups are expected.
5. Average-Linkage Clustering
Average-linkage clustering (also called UPGMA, Unweighted Pair Group Method with Arithmetic Mean) defines cluster distance as the average of all pairwise distances between points in the two clusters. This compromise between single and complete linkage considers all point relationships, not just extremes. Average-linkage tends to produce clusters with moderate compactness and is less sensitive to outliers than complete-linkage while avoiding the chaining problem of single-linkage. It is one of the most popular hierarchical methods, widely used in bioinformatics for constructing phylogenetic trees and in social sciences for survey analysis. The method assumes that all points contribute equally to cluster similarity, which is reasonable for many applications. Average-linkage often yields balanced dendrograms that reflect overall data structure well, though it can be computationally more intensive than extreme-based methods due to calculating all pairwise distances.
6. Ward’s Method
Ward’s method takes a different approach, merging clusters based on minimizing the increase in total within-cluster variance after merging. At each step, it combines the two clusters that result in the smallest increase in the sum of squared errors (ESS), essentially finding mergers that create the most homogeneous clusters. Ward’s method tends to produce compact, spherical clusters of roughly equal size, similar to k-means objectives. It is based on a minimum-variance criterion, making it particularly suitable when clusters are expected to be Gaussian-distributed. The method is computationally efficient and often produces clear, interpretable dendrograms. However, it is sensitive to outliers and tends to create balanced clusters even when natural cluster sizes are unequal. Ward’s method is widely used in social sciences, market research, and any domain where compact, homogeneous clusters are desired and data roughly meets normality assumptions.
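To see how the choice of linkage changes the hierarchy, the short sketch below (assuming SciPy) runs several criteria on the five-customer data from the worked example at the end of this section and prints the height of the final merger under each.

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)

for method in ["single", "complete", "average", "ward", "centroid", "median"]:
    Z = linkage(X, method=method)
    # Column 2 of each linkage row is the merge height; the last row is the final merger.
    print(f"{method:9s} final merge height: {Z[-1, 2]:.2f}")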
7. Centroid-Linkage Clustering
Centroid-linkage clustering defines cluster distance as the distance between cluster centroids (mean vectors). After each merger, a new centroid is calculated for the combined cluster. This approach is intuitive, representing clusters by their central points. However, centroid-linkage can produce inversions in the dendrogram where the distance between merged clusters appears smaller than distances at previous levels, violating the ultrametric property and complicating interpretation. These inversions occur because centroid positions can shift in ways that don’t preserve monotonic distance increases. Despite this limitation, centroid-linkage is computationally efficient and can work well when clusters are roughly spherical and well-separated. It is implemented in various statistical packages and used in applications where cluster centers have meaningful interpretations, such as in geographic analysis where centroids represent average locations.
8. Median-Linkage Clustering
Median-linkage clustering modifies centroid-linkage by giving equal weight to the two clusters being merged, regardless of their sizes. This weighting prevents larger clusters from dominating centroid calculations, addressing a potential bias in centroid-linkage. The method computes the new cluster center as the midpoint between the two cluster centroids, treating both clusters equally. This approach reduces the size bias that can occur in centroid-linkage, where large clusters have disproportionate influence. However, median-linkage can also produce dendrogram inversions. It is less commonly used than other methods but valuable when cluster sizes vary substantially and equal representation is desired. Median-linkage provides a compromise between centroid-based approaches and those that consider all points, focusing on cluster centers while controlling for size effects that might otherwise distort the hierarchy.
9. Flexible Method
The flexible method introduces a parameter β that controls the space-dilating or space-contracting nature of the clustering process. By adjusting β, analysts can create a continuum of methods between extreme behaviors. Negative β values produce space-dilating behavior (toward complete-linkage) that tends to create compact clusters, while positive β values produce space-contracting behavior (toward single-linkage) that allows chaining. The flexible method, proposed by Lance and Williams, generalizes many hierarchical approaches through a recurrence formula that updates cluster distances after each merger without returning to the original data points. This parametric flexibility allows analysts to tune clustering behavior to match domain expectations about cluster structure. The method is particularly valuable in exploratory analysis when the appropriate linkage is unknown, enabling systematic exploration of how clustering results change with the linkage criterion. However, choosing the optimal β requires experimentation and domain knowledge.
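The Lance-Williams recurrence behind the flexible method reduces to a one-line update; the plain Python helper below (a hypothetical function written for illustration, not part of any library) applies it with α_i = α_j = (1 − β)/2 and γ = 0.

def flexible_update(d_ik, d_jk, d_ij, beta):
    """Lance-Williams flexible (beta) update: distance from the merged cluster
    (i ∪ j) to another cluster k, with alpha_i = alpha_j = (1 - beta) / 2 and gamma = 0."""
    return (1.0 - beta) / 2.0 * (d_ik + d_jk) + beta * d_ij

# beta = 0 reduces to the weighted-average (WPGMA) update; negative beta is
# space-dilating (compact clusters), positive beta is space-contracting (chaining).
print(flexible_update(d_ik=3.0, d_jk=5.0, d_ij=2.0, beta=-0.25))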
10. Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)
BIRCH is a scalable hierarchical clustering algorithm designed for large datasets. It incrementally constructs a Clustering Feature Tree (CF-tree) that summarizes data points into compact cluster representations, requiring only a single database scan. Each clustering feature stores summary statistics (count, linear sum, sum of squares) for a cluster, enabling efficient distance calculations and updates. BIRCH first builds the CF-tree, then optionally applies additional clustering algorithms to refine results. This approach handles millions of points efficiently, overcoming the O(n²) limitations of traditional hierarchical methods. BIRCH is particularly valuable for very large datasets where traditional hierarchical clustering would be computationally prohibitive. It identifies cluster structure while using limited memory, making it suitable for streaming data and resource-constrained environments. Applications include network intrusion detection, customer segmentation on large databases, and real-time data analysis.
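A brief scikit-learn sketch of BIRCH on synthetic data; the threshold and branching_factor values are illustrative and would normally be tuned.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(42)
# Simulate a larger dataset: three Gaussian blobs of 10,000 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10000, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# threshold bounds the radius of each CF sub-cluster; n_clusters applies a
# final global clustering to the CF-tree leaf summaries.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(np.bincount(labels))   # number of points assigned to each cluster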
Steps of Hierarchical Clustering:
Hierarchical clustering follows a systematic step-by-step process to build a hierarchy of clusters. The algorithm begins with individual data points and progressively merges (agglomerative) or splits (divisive) them based on similarity. Each step involves calculating distances, identifying closest clusters, and updating the cluster structure. The process continues until all points belong to a single cluster or individual clusters, depending on approach. Understanding these steps is essential for interpreting results, choosing appropriate parameters, and applying hierarchical clustering effectively across different domains. The following steps outline the agglomerative approach, which is the most commonly implemented.
1. Data Preparation
The hierarchical clustering process begins with data preparation, ensuring the dataset is suitable for analysis. This step involves selecting relevant features, handling missing values, and removing outliers that could distort distance calculations. For numerical data, features should be standardized or normalized so that variables with larger scales don’t dominate distance measures. For example, in customer data, income (measured in lakhs) and age (measured in tens of years) need scaling to contribute equally. Categorical variables require appropriate encoding or distance measures designed for categorical data. Data preparation may also involve dimensionality reduction if the feature space is very large, as distance calculations become less meaningful in high dimensions. Quality preparation ensures that the clustering reflects genuine patterns rather than artifacts of measurement scales or data quality issues.
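A small standardization sketch with scikit-learn; the income and age values are invented for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Income (in lakhs) and age (in years) sit on very different scales.
X = np.array([[12.0, 25.0], [4.5, 62.0], [8.0, 34.0], [20.0, 41.0]])

X_scaled = StandardScaler().fit_transform(X)
print(np.round(X_scaled.mean(axis=0), 6))   # each feature now has mean ~0
print(np.round(X_scaled.std(axis=0), 6))    # and standard deviation ~1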
2. Distance Matrix Calculation
Distance matrix calculation computes pairwise distances between all data points, forming the foundation for cluster merging decisions. For N points, this creates an N×N symmetric matrix where entry (i,j) represents the distance between point i and point j. Common distance metrics include Euclidean distance for continuous data, Manhattan distance for grid-based data, cosine distance for text, and correlation-based distances for time series. The choice of distance metric significantly affects clustering results and should reflect domain-specific notions of similarity. For example, Euclidean distance works well for spatial data, while cosine distance captures document similarity regardless of document length. Distance matrix calculation is computationally intensive, requiring O(N²) time and memory, which limits hierarchical clustering’s scalability to very large datasets. This step produces the fundamental similarity information that guides all subsequent merging decisions.
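In code, the matrix is typically computed with SciPy's pdist, which returns a condensed upper-triangle vector that squareform expands into the full N×N form; the three toy points below are illustrative.

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

condensed = pdist(X, metric="euclidean")   # upper-triangle distances, length N*(N-1)/2
D = squareform(condensed)                  # full N x N symmetric matrix with zero diagonal
print(condensed)                           # [ 5. 10.  5.]
print(D)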
3. Initialize Each Point as a Cluster
Initializing each point as a cluster sets the starting state for agglomerative clustering. With N data points, the algorithm begins with N clusters, each containing exactly one point. At this initial level, within-cluster distance is zero, and between-cluster distances are simply the pairwise distances between points. This initialization represents the most granular level of the hierarchy, where every observation is its own cluster. The algorithm records this state as the bottom of the dendrogram. This starting point reflects the assumption that each observation is distinct at the finest level of analysis, with merging only occurring when evidence supports grouping. The initialization establishes the baseline from which the hierarchy will be built through successive mergers.
4. Find Closest Cluster Pair
Finding the closest cluster pair identifies the two clusters with minimum distance according to the chosen linkage criterion. The algorithm examines all current cluster pairs and selects the pair with the smallest inter-cluster distance. This step requires maintaining and updating cluster distances as merging progresses. The “closest” definition depends on linkage: single-linkage uses minimum pointwise distance, complete-linkage uses maximum, average-linkage uses the mean, and Ward’s method uses variance increase. For example, in single-linkage, the algorithm finds the pair of clusters with the nearest two points across cluster boundaries. This identification step determines the next merger in the hierarchy and is repeated at each level until all points are merged. The pair selected represents the most similar clusters at the current hierarchical level.
5. Merge Closest Clusters
Merging the closest clusters combines the identified cluster pair into a single new cluster. The two selected clusters are removed from the active cluster set, and their union is added as a new cluster. This merger creates the next level in the hierarchy, recorded in the dendrogram with a node joining the two clusters at a height corresponding to their dissimilarity. The merged cluster inherits all points from both original clusters. For example, merging a cluster containing points {A,B} with cluster {C,D} creates the new cluster {A,B,C,D}. This step reduces the total number of clusters by one. The merger represents the algorithm’s decision that these two groups are more similar to each other than to any other clusters at the current level, forming a natural grouping in the data hierarchy.
6. Update Distance Matrix
Updating the distance matrix recalculates distances between the newly formed cluster and all remaining clusters. This step is critical for subsequent iterations, as the algorithm needs current distance information to identify the next closest pair. The update uses the chosen linkage criterion to compute distances from the new cluster to each existing cluster. For example, with average-linkage, the distance between new cluster C and existing cluster D is the average of all pairwise distances between points in C and points in D. Efficient implementations use Lance-Williams recurrence formulas to update distances without recomputing all pairwise distances, significantly improving computational performance. The updated distance matrix reflects the current cluster configuration and feeds into the next iteration’s closest-pair search.
7. Record Dendrogram Information
Recording dendrogram information captures the merger details for visualization and interpretation. For each merger, the algorithm records which clusters were combined, the distance (height) at which they merged, and the size of the resulting cluster. This information builds the dendrogram structure that will be used for visualization and analysis. The recording includes the hierarchical relationships showing which clusters contain which subclusters. For example, if clusters {A,B} and {C,D} merge, the dendrogram will show these as branches joining at a node. The merger height reflects the dissimilarity between the merged clusters, with earlier mergers (lower height) indicating stronger similarity. This recorded hierarchy enables cutting the dendrogram at different heights to obtain clusterings with varying numbers of clusters and understanding the complete nested structure of the data.
8. Repeat Until Single Cluster
Repeating until a single cluster remains continues the iterative process of finding closest pairs, merging, and updating distances until all points belong to one cluster. This creates the complete hierarchy from N individual points to one all-encompassing cluster. Each iteration reduces the cluster count by one, requiring N-1 total mergers for N points. The progression of merger heights reveals the structure of the data: gradual increases indicate relatively homogeneous data, while large jumps suggest natural cluster boundaries. The complete sequence of mergers provides a full picture of data relationships at all levels of granularity. This exhaustive approach ensures that no potential grouping is missed, but also means the algorithm always produces a hierarchy even when no natural clustering exists, requiring analyst judgment to identify meaningful levels.
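Steps 4 through 8 can be condensed into a didactic loop; the sketch below (a hypothetical helper written for illustration, assuming NumPy and SciPy, and not optimized) performs naive average-linkage merging on the five-customer data from the worked example and reproduces its merge heights.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def naive_average_linkage(X):
    """Naive O(n^3) agglomerative loop with average linkage, for illustration only.
    Returns a list of (cluster_a, cluster_b, merge_height) tuples."""
    clusters = [[i] for i in range(len(X))]   # start: one point per cluster
    D = squareform(pdist(X))                  # full pairwise distance matrix
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average pairwise distance.
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], round(float(d), 2)))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges

# Customers A-E as (income, spending score); indices 0-4 correspond to A-E.
X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)
for left, right, height in naive_average_linkage(X):
    print(left, "+", right, "at height", height)   # heights 5.1, 5.39, 17.63, 35.98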
9. Dendrogram Visualization
Dendrogram visualization creates a graphical representation of the clustering hierarchy, enabling intuitive interpretation of results. The dendrogram displays clusters as branches that join at nodes, with branch lengths proportional to the distance at which mergers occurred. The vertical axis represents dissimilarity, with earlier mergers (more similar clusters) joining lower on the tree. Horizontal ordering of leaves (individual points) is arranged to avoid crossing branches, often reflecting underlying data structure. Analysts can visually identify natural groupings by looking for long vertical lines before major merges, indicating distinct cluster separation. The dendrogram also reveals outliers as points that join the main tree only at very high levels. This visualization is hierarchical clustering’s greatest strength, transforming abstract distance calculations into an accessible, interpretable map of data relationships.
10. Cut Dendrogram to Obtain Clusters
Cutting the dendrogram to obtain clusters is the final step, where analysts select a specific level of granularity for their application. By drawing a horizontal line across the dendrogram at a chosen height, branches below the line form distinct clusters. The cutting height determines the number and composition of clusters: lower cuts produce more, smaller clusters; higher cuts produce fewer, larger clusters. Analysts choose the cutting height based on domain knowledge, application requirements, or by identifying natural gaps in the dendrogram where long vertical stretches suggest meaningful cluster boundaries. For example, in customer segmentation, a higher cut might yield broad segments for company-wide strategy, while a lower cut provides micro-segments for targeted marketing. This flexibility is a key advantage of hierarchical clustering, enabling multiple clustering solutions from a single analysis. The final clusters can then be characterized, validated, and deployed for business applications.
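Both cutting strategies can be sketched with SciPy's fcluster on the five-customer data from the worked example: cutting by height with criterion='distance', or requesting a fixed cluster count with criterion='maxclust'.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Customers A-E as (income, spending score).
X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)
Z = linkage(X, method="average", metric="euclidean")

# Cut by height: mergers above t = 25 are undone, leaving two flat clusters.
print(fcluster(Z, t=25, criterion="distance"))
# Cut by count: request exactly three flat clusters instead.
print(fcluster(Z, t=3, criterion="maxclust"))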
Example of Hierarchical Clustering:
1. Data Preparation
The dataset contains five customers with two features: annual income in lakhs and spending score on a 1-100 scale. The customers are A (income 5, spending 45), B (income 7, spending 50), C (income 8, spending 75), D (income 4, spending 30), and E (income 9, spending 80). Both features are already on comparable scales (income 4-9, spending 30-80), so no additional normalization is required for this simple example. In real applications, features would typically be standardized to ensure equal contribution. The data represents typical customer profiling variables used in marketing analytics: income indicates purchasing power, while spending score reflects actual spending behavior. These two dimensions often reveal meaningful customer segments like “high-income high-spenders,” “budget-conscious,” or “cautious spenders.” The small dataset size allows us to trace each step of the clustering process manually, building intuition for how hierarchical clustering works before scaling to larger real-world applications.
2. Distance Matrix Calculation
First, we calculate the Euclidean distance between each pair of customers. Euclidean distance between two points (x₁,y₁) and (x₂,y₂) is √[(x₁-x₂)² + (y₁-y₂)²].
Distances:
- A to B: √[(5-7)² + (45-50)²] = √[4 + 25] = √29 = 5.39
- A to C: √[(5-8)² + (45-75)²] = √[9 + 900] = √909 = 30.15
- A to D: √[(5-4)² + (45-30)²] = √[1 + 225] = √226 = 15.03
- A to E: √[(5-9)² + (45-80)²] = √[16 + 1225] = √1241 = 35.23
- B to C: √[(7-8)² + (50-75)²] = √[1 + 625] = √626 = 25.02
- B to D: √[(7-4)² + (50-30)²] = √[9 + 400] = √409 = 20.22
- B to E: √[(7-9)² + (50-80)²] = √[4 + 900] = √904 = 30.07
- C to D: √[(8-4)² + (75-30)²] = √[16 + 2025] = √2041 = 45.18
- C to E: √[(8-9)² + (75-80)²] = √[1 + 25] = √26 = 5.10
- D to E: √[(4-9)² + (30-80)²] = √[25 + 2500] = √2525 = 50.25
This distance matrix captures all pairwise similarities, with smaller distances indicating more similar customers.
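As a check on the hand arithmetic, the same matrix can be computed with SciPy's pdist (a sketch assuming SciPy is installed).

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Customers A-E as (income, spending score).
X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))   # rows/columns ordered A, B, C, D, E; e.g. D[0, 1] ≈ 5.39 and D[2, 4] ≈ 5.10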
3. Initial Clusters
We begin with each customer as its own cluster: {A}, {B}, {C}, {D}, {E}. At this initial level, there are five clusters, representing the most granular view of the data. The dendrogram would show five separate branches at the bottom. The distances between clusters at this stage are simply the pairwise distances between the individual points, as calculated in the distance matrix. This initialization reflects the assumption that each customer is distinct at the finest level of analysis, with merging only occurring when evidence supports grouping. The clustering algorithm will now proceed to merge the closest pair of clusters, gradually building the hierarchy from these five individual points up to a single cluster containing all customers.
4. First Merger: C and E
Examining the distance matrix, the smallest distance is between C and E at 5.10. These two customers are the most similar: C has income 8 and spending 75, E has income 9 and spending 80, both representing high-income, high-spending individuals. The algorithm merges them into a new cluster {C,E}. The merger height is recorded as 5.10, indicating the dissimilarity at which they joined. This first merger makes intuitive sense: customers with similar high spending patterns naturally group together. The dendrogram will show C and E joining at a low height, indicating strong similarity. After this merger, we now have four clusters: {A}, {B}, {D}, and {C,E}. The number of clusters has reduced from five to four.
5. Update Distance Matrix (After First Merger)
Now we need distances between the new cluster {C,E} and all remaining clusters {A}, {B}, {D}. Using average linkage, we compute the average of all pairwise distances between points in each cluster.
For {C,E} to {A}: average of distance(C,A)=30.15 and distance(E,A)=35.23 = (30.15+35.23)/2 = 32.69
For {C,E} to {B}: average of distance(C,B)=25.02 and distance(E,B)=30.07 = (25.02+30.07)/2 = 27.55
For {C,E} to {D}: average of distance(C,D)=45.18 and distance(E,D)=50.25 = (45.18+50.25)/2 = 47.72
Distances among remaining single-point clusters remain as originally calculated:
A-B: 5.39, A-D: 15.03, B-D: 20.22
This updated distance matrix reflects the current cluster configuration and will guide the next merger decision.
6. Second Merger: A and B
Examining the updated distance matrix, the smallest distance is between A and B at 5.39. These two customers, A (income 5, spending 45) and B (income 7, spending 50), represent moderate-income, moderate-spending individuals. The algorithm merges them into the new cluster {A,B} at height 5.39. This merger also makes intuitive sense: customers with similar mid-range spending patterns naturally group together. After this merger, we have three clusters: {A,B}, {C,E}, and {D}. Customer D (income 4, spending 30) remains separate, representing a lower-income, lower-spending individual. The dendrogram now shows A and B joining at a low height, with {C,E} already formed, and D still isolated.
7. Update Distance Matrix (After Second Merger)
Compute distances between new cluster {A,B} and remaining clusters {C,E} and {D}.
For {A,B} to {C,E}: average of distances A-C=30.15, A-E=35.23, B-C=25.02, B-E=30.07 = (30.15+35.23+25.02+30.07)/4 = 120.47/4 = 30.12
For {A,B} to {D}: average of distances A-D=15.03 and B-D=20.22 = (15.03+20.22)/2 = 35.25/2 = 17.63
Distance between {C,E} and {D} remains 47.72 from previous calculation.
The updated matrix now shows distances between three clusters, setting up the next merger decision.
8. Third Merger: {A,B} and D
Examining the updated distances, the smallest is between {A,B} and D at 17.63. Customer D (low income, low spending) is closer to the moderate cluster {A,B} than to the high cluster {C,E}. The algorithm merges {A,B} and D into the new cluster {A,B,D} at height 17.63. This merger groups the lower and moderate spending customers together, distinct from the high spenders. After this merger, we have two clusters: {A,B,D} and {C,E}. The dendrogram now shows {A,B} and D joining, with {C,E} as a separate branch. This structure reveals two main groups in the data: moderate/low spenders versus high spenders.
9. Update Distance Matrix (After Third Merger)
Compute distance between the two remaining clusters {A,B,D} and {C,E}. Using average linkage, we calculate the average of all pairwise distances between points in the two clusters.
Pairs and distances:
A-C: 30.15, A-E: 35.23, B-C: 25.02, B-E: 30.07, D-C: 45.18, D-E: 50.25
Sum = 30.15+35.23+25.02+30.07+45.18+50.25 = 215.90
Number of pairs = 3 points × 2 points = 6
Average distance = 215.90 / 6 = 35.98
This single distance value will determine the final merger height.
10. Final Merger
The algorithm performs the final merger, combining {A,B,D} and {C,E} into a single cluster containing all five customers at height 35.98. This completes the hierarchy, with all points now in one cluster. The dendrogram shows the two main branches joining at this height, with {C,E} on one side and {A,B,D} on the other. The merger height of 35.98 is substantially larger than previous merger heights (5.10, 5.39, 17.63), indicating that these two groups are quite distinct. This separation reflects the natural division in the data between high spenders (C,E) and moderate/low spenders (A,B,D). The complete hierarchy now shows the nested relationships among all customers.
11. Dendrogram Construction
The dendrogram visually represents the complete clustering hierarchy. At the bottom, five leaves represent individual customers. The first merger joins C and E at height 5.10, creating a small branch. The second merger joins A and B at height 5.39, creating another branch. The third merger joins this {A,B} branch with D at height 17.63, forming a larger branch containing A,B,D. Finally, the two main branches {A,B,D} and {C,E} join at height 35.98, forming the complete tree. The vertical axis represents dissimilarity, with lower joins indicating stronger similarity. The dendrogram clearly shows two natural clusters: {C,E} as high spenders, and {A,B,D} as moderate/low spenders. Within the moderate group, A and B are most similar, with D somewhat more distant.
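The same dendrogram can be reproduced with SciPy (a sketch assuming SciPy and Matplotlib are installed); the printed linkage matrix contains the four merge heights derived by hand above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Customers A-E as (income, spending score).
X = np.array([[5, 45], [7, 50], [8, 75], [4, 30], [9, 80]], dtype=float)

Z = linkage(X, method="average", metric="euclidean")
print(np.round(Z, 2))   # merge heights 5.10, 5.39, 17.63, 35.98

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.ylabel("dissimilarity (average linkage)")
plt.show()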
12. Cluster Interpretation
Cutting the dendrogram at a height of around 25 yields two distinct clusters. Cluster 1: {C,E}, representing “High Spenders” with high income and high spending scores; these customers are prime targets for premium products and loyalty programs. Cluster 2: {A,B,D}, representing “Moderate/Low Spenders” with lower income and spending. Within this cluster, A and B form a sub-group of “Moderate Spenders” (income 5-7, spending 45-50), while D is a “Low Spender” (income 4, spending 30). This hierarchical structure reveals both broad segments and finer sub-segments. Marketing strategies can be tailored accordingly: premium offers for Cluster 1, value propositions for the moderate sub-group, and perhaps engagement campaigns to increase spending from the low spender. The hierarchical view provides flexibility to choose segmentation granularity based on business needs.