Grid-Based Clustering: Functions, Types, Process, and Example

Grid-Based Clustering partitions the data space into a finite number of cells (a grid) and performs clustering on the grid cells rather than on individual data points. This approach quantizes the space into rectangular cells using a multi-resolution grid structure, then groups adjacent dense cells to form clusters. By working with cell summaries instead of individual points, grid-based methods achieve high computational efficiency: clustering time depends on the number of grid cells rather than on the number of data points, apart from the initial pass that assigns points to cells. Popular algorithms include STING (Statistical Information Grid), CLIQUE (Clustering In QUEst), and WaveCluster, which uses wavelet transforms. Grid-based clustering excels at handling large datasets, discovering arbitrary-shaped clusters, and supporting incremental updates. Applications include spatial data mining, image segmentation, and processing massive datasets where point-based methods become computationally prohibitive.

Functions of Grid-Based Clustering:

1. Quantize Data Space into Grid Cells

The primary function of grid-based clustering is to quantize the data space into a finite number of grid cells, creating a multi-resolution structure. The algorithm partitions each dimension into intervals, forming a grid of cells that cover the entire data space. Each data point falls into exactly one cell based on its coordinate values. For example, in two-dimensional space with ranges divided into 10 intervals each, 100 grid cells are created. This quantization transforms the clustering problem from operating on individual points to operating on cell summaries. The grid resolution (number of intervals per dimension) is a key parameter that affects both computational efficiency and clustering quality. Finer grids capture more detail but increase computation; coarser grids are faster but may miss fine-grained patterns. This space quantization is the foundation for all subsequent grid-based operations.
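A minimal sketch of this quantization step, assuming NumPy (the function name and the equal-width scheme are illustrative choices, not a specific library's API):

```python
import numpy as np

def assign_cells(X, n_intervals=10):
    """Quantize points into grid-cell coordinates using equal-width intervals.
    X: (n_points, n_dims) array. Assumes each dimension has a nonzero range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_intervals                 # interval width per dimension
    cells = np.floor((X - lo) / width).astype(int)  # interval index per dimension
    return np.clip(cells, 0, n_intervals - 1)       # fold max values into last cell
```

Each point is now identified by one integer index per dimension, e.g. (4, 7) in 2-D, and all later steps operate on these cell coordinates.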

2. Summarize Cell Statistics

Summarize cell statistics computes aggregate information for each grid cell, replacing individual points with compact summaries. Common statistics include cell population (number of points), mean, variance, minimum, maximum, and distribution information for each dimension. For example, a cell containing 100 customer records might store average income, income variance, and spending score distribution. These summaries capture essential cell characteristics without storing individual points. Statistical information grids like STING store these summaries at multiple resolution levels, enabling hierarchical analysis. This summarization function dramatically reduces data volume from potentially millions of points to thousands of cells. It also supports privacy-preserving analysis by working with aggregates rather than individual records. The summaries provide sufficient information for clustering decisions while enabling efficient processing.
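Building on the sketch above, per-cell summaries might be accumulated as follows (again a hedged illustration, not a reference implementation):

```python
from collections import defaultdict
import numpy as np

def cell_summaries(X, cells):
    """Replace individual points with per-cell summaries (count, mean, variance).
    `cells` holds one integer coordinate tuple per point, as from assign_cells."""
    groups = defaultdict(list)
    for point, cell in zip(X, cells):
        groups[tuple(cell)].append(point)
    return {cell: {"count": len(pts),
                   "mean": np.mean(pts, axis=0),   # per-dimension mean
                   "var": np.var(pts, axis=0)}     # per-dimension variance
            for cell, pts in groups.items()}
```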

3. Identify Dense Cells

Identify dense cells determines which grid cells contain enough points to be considered for cluster formation. A density threshold parameter specifies the minimum number of points required for a cell to be considered dense. Cells below this threshold are treated as sparse and ignored in cluster formation, effectively performing noise filtering. For example, in customer segmentation, cells with fewer than 10 customers might be considered sparse and excluded from major segments. This density identification is the grid-based equivalent of core point identification in density-based clustering. The threshold can be absolute (minimum points) or relative (percentage of average cell density). Dense cells become the building blocks for clusters, with adjacent dense cells merged to form contiguous regions. This function separates significant data regions from noise and low-density areas.
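Continuing the sketch, both the absolute and relative threshold variants can be expressed in a few lines (parameter values are illustrative):

```python
def dense_cells(summaries, min_points=10, relative=None):
    """Return the set of dense cells. The threshold is absolute (min_points)
    or, if `relative` is given, that multiple of the average cell population."""
    if relative is not None:
        avg = sum(s["count"] for s in summaries.values()) / len(summaries)
        min_points = relative * avg
    return {c for c, s in summaries.items() if s["count"] >= min_points}
```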

4. Merge Adjacent Dense Cells

Merge adjacent dense cells connects neighboring dense cells to form contiguous clusters. The algorithm examines grid connectivity, typically considering cells that share faces (4-connectivity in 2D, 6-connectivity in 3D). All connected dense cells are grouped into a single cluster, creating regions of arbitrary shape that follow the grid structure. For example, in spatial analysis, a cluster might follow an irregular shape as connected cells trace a pattern. This merging function is the grid-based equivalent of density connectivity in DBSCAN. The resulting clusters can have complex shapes, limited only by grid resolution. Merging decisions are based purely on cell adjacency and density, requiring no distance calculations between points. This efficiency is a key advantage of grid-based methods, as connectivity checks are simple compared to pointwise distance computations.
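A flood fill over the set of dense cells implements this merging; the sketch below assumes cells are integer coordinate tuples, as in the earlier snippets:

```python
from collections import deque

def merge_dense_cells(dense):
    """Flood-fill over face-adjacent dense cells (4-connectivity in 2D,
    6-connectivity in 3D). `dense` is a set of cell-coordinate tuples.
    Returns a {cell: cluster_id} mapping."""
    labels, next_id = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for dim in range(len(cell)):
                for step in (-1, 1):   # one neighbour per face of the cell
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = next_id
                        queue.append(nb)
        next_id += 1
    return labels
```

Note that no point-to-point distance is ever computed; adjacency tests on integer tuples are all that is needed.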

5. Support Multi-Resolution Analysis

Support multi-resolution analysis enables examination of cluster structure at different levels of granularity. Grid-based algorithms often build hierarchical grid structures, with coarse grids at higher levels and finer grids at lower levels. Analysts can explore clusters from coarse overview to fine detail, understanding how patterns emerge at different scales. For example, STING stores statistical information at multiple grid resolutions, allowing queries like “show major population centers at state level” then drilling down to city level. WaveCluster uses wavelet transforms to automatically identify significant regions across scales. This multi-resolution capability is particularly valuable when data contains hierarchical structure or when different applications require different granularities. It provides flexibility impossible in single-resolution methods and supports interactive exploration from strategic overview to operational detail.

6. Achieve High Computational Efficiency

Achieve high computational efficiency is a defining function of grid-based clustering. Processing time depends on the number of grid cells, not the number of data points. With G grid cells and N data points, complexity is typically O(G) for clustering after cell summarization, plus O(N) for assigning points to cells. For large datasets where N >> G, this represents dramatic efficiency gains. For example, clustering 10 million points with a 100×100 grid (10,000 cells) processes only 10,000 cells after the initial assignment, rather than comparing 10 million points pairwise. This efficiency makes grid-based methods suitable for very large datasets where point-based algorithms would be computationally prohibitive. The approach also supports incremental updates: adding new points only requires updating the affected cell statistics, not reclustering all data.

7. Handle High-Dimensional Data

Handle high-dimensional data through careful grid design and dimensionality reduction techniques. While grid partitioning becomes exponentially expensive with dimensions (the “curse of dimensionality”), algorithms like CLIQUE address this by identifying dense regions in subspaces rather than full-dimensional space. CLIQUE examines projections onto lower-dimensional subspaces, finding dense units and combining them across dimensions. This subspace clustering approach discovers patterns that may not be visible in full-dimensional space. For example, in customer data, a cluster might be defined by specific income and age ranges regardless of location. Grid-based subspace clustering automatically identifies which dimensions are relevant for each cluster. This function enables meaningful clustering in high-dimensional domains like text analysis, genomics, and customer analytics where traditional distance-based methods fail.
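The bottom-up, Apriori-style pruning can be illustrated in miniature. This is a toy sketch of the idea for 1-D and 2-D units only, not the full CLIQUE algorithm; the bin count and density threshold tau are illustrative:

```python
from collections import Counter
from itertools import combinations
import numpy as np

def clique_style_units(X, bins=10, tau=5):
    """Dense units in each 1-D subspace, then 2-D candidates kept only
    when both of their 1-D projections are dense (Apriori pruning)."""
    lo = X.min(axis=0)
    span = np.ptp(X, axis=0) + 1e-12                    # avoid zero-width dims
    idx = np.clip((X - lo) / span * bins, 0, bins - 1).astype(int)
    d = X.shape[1]
    dense1 = {j: {u for u, c in Counter(idx[:, j]).items() if c >= tau}
              for j in range(d)}
    dense2 = {}
    for j, k in combinations(range(d), 2):
        counts = Counter(zip(idx[:, j], idx[:, k]))
        dense2[(j, k)] = {u for u, c in counts.items()
                          if c >= tau and u[0] in dense1[j] and u[1] in dense1[k]}
    return dense1, dense2
```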

8. Discover Arbitrarily Shaped Clusters

Discover arbitrarily shaped clusters by connecting dense grid cells without shape constraints. Since clusters are formed by merging adjacent dense cells, they can assume any shape that the grid can represent, from simple rectangles to complex irregular forms. The resolution of the grid determines the precision of shape representation: finer grids capture more detailed shapes but require more computation. For example, in geographic analysis, population clusters following river valleys or coastlines can be accurately represented by connected cells tracing these features. This shape flexibility is comparable to density-based clustering but achieved through grid connectivity rather than pointwise distance calculations. The grid approximation may miss fine details at cell boundaries but provides good shape representation for most applications. This function is essential for spatial data, image segmentation, and any domain where clusters have natural, non-geometric forms.

9. Support Incremental Clustering

Support incremental clustering enables efficient updates as new data arrives. When new points are added, only the affected grid cells need their statistics, and potentially their density classification, updated. If density status changes (sparse to dense or vice versa), local cluster adjustments may be needed, but the global grid structure remains stable. This incremental capability is crucial for streaming data, real-time monitoring, and dynamic databases where complete reclustering would be prohibitively expensive. For example, in network traffic analysis, new connection records can be incrementally incorporated into the clustering without reprocessing all historical data. The grid structure provides natural partitioning that localizes updates. This function makes grid-based methods suitable for applications with continuously arriving data where maintaining up-to-date clustering is essential.
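An incremental update might look like the following sketch, which maintains a running count and mean for a single cell (a simplified illustration; a real system would also update variances and affected cluster labels):

```python
def add_point(summaries, cell, point, min_points=10, dense=None):
    """Fold one new point into its cell's running summary.
    Only this cell is touched; the rest of the grid is unaffected."""
    s = summaries.setdefault(cell, {"count": 0, "mean": 0.0})
    s["count"] += 1
    s["mean"] = s["mean"] + (point - s["mean"]) / s["count"]  # running mean
    if dense is not None and s["count"] == min_points:
        dense.add(cell)   # the cell just crossed the density threshold;
                          # only clusters touching this cell need revisiting
```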

10. Provide Statistical Information for Each Cell

Provide statistical information for each cell enriches clustering with detailed cell characteristics that support interpretation and further analysis. Beyond simple point counts, cells can store means, variances, distributions, and other statistics for each dimension. This information enables understanding of what characterizes each cluster at the cell level. For example, in customer segmentation, cells within a cluster might share similar income and age profiles; examining cell statistics reveals the typical characteristics of the cluster. STING (Statistical Information Grid) emphasizes this function, storing multiple levels of statistical summaries that support fast query processing. This statistical enrichment transforms clusters from simple point groupings into information-rich regions that support characterization, comparison, and decision-making. The cell statistics also enable quality assessment, helping identify cells that may be outliers within otherwise homogeneous clusters.

Types of Grid-Based Clustering:

1. STING (Statistical Information Grid)

STING is a grid-based clustering method that partitions the spatial area into rectangular cells at multiple resolution levels, forming a hierarchical structure. Each cell stores statistical information about the points within it, including count, mean, standard deviation, minimum, maximum, and distribution type. Higher-level cells are partitioned into smaller lower-level cells, creating a pyramid structure. To answer clustering queries, STING starts at a coarse resolution, identifies relevant cells, and progressively drills down to finer resolutions only where needed. This hierarchical approach enables efficient processing, as only cells relevant to the query are examined. STING’s complexity is O(K) where K is the number of grid cells at the finest resolution, independent of data size. It supports incremental updates and provides fast query response, making it suitable for large spatial databases and interactive exploration. However, cluster boundaries follow grid lines, potentially reducing accuracy.

2. CLIQUE (Clustering In QUEst)

CLIQUE is a grid-based clustering algorithm designed to automatically find clusters in high-dimensional data by identifying dense regions in subspaces. It partitions each dimension into equal-width intervals and identifies dense units where point density exceeds a threshold. CLIQUE then uses an Apriori-like approach to generate candidate subspaces of increasing dimensionality, retaining only those with dense units. Clusters are formed by connecting adjacent dense units within each subspace. The algorithm excels at discovering clusters that exist in subspaces of the original feature space, not requiring all dimensions to be relevant. For example, in customer data, a cluster might be defined by specific income and age ranges regardless of location. CLIQUE automatically determines which dimensions characterize each cluster. Its output includes both the cluster descriptions and the relevant subspaces, providing interpretable results. However, the grid resolution parameter significantly affects results.

3. WaveCluster

WaveCluster applies wavelet transform to grid-based clustering, using signal processing techniques to identify clusters in multi-dimensional data. The algorithm first overlays a grid on the data space and computes the number of points in each cell, creating a multi-dimensional signal. Wavelet transform is then applied to this signal, which acts as a multi-resolution filter, highlighting regions of high density while suppressing noise. In the transformed space, clusters appear as connected components that can be easily identified. The wavelet transform's ability to detect clusters at different scales enables automatic identification of clusters without requiring the number of clusters as input. WaveCluster is computationally efficient, with O(N) complexity, and robust to noise and outliers. It discovers arbitrarily shaped clusters and handles boundary points effectively. Applications include image processing, spatial data mining, and large-scale data analysis where both accuracy and efficiency are critical.
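A much-simplified 2-D sketch of this pipeline, assuming the third-party PyWavelets (pywt) and SciPy packages are available; the bin count and threshold are illustrative choices, not values from the WaveCluster paper:

```python
import numpy as np
import pywt                    # PyWavelets
from scipy import ndimage

def wavecluster_sketch(X, bins=64, keep=0.5):
    """Histogram the points, smooth with a wavelet transform, then label
    connected dense regions in the transformed (half-resolution) grid."""
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    approx, _detail = pywt.dwt2(counts, "haar")   # low-frequency band denoises
    mask = approx > keep * approx.max()           # keep strongly dense regions
    labels, n_clusters = ndimage.label(mask)      # connected components
    return labels, n_clusters
```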

4. MAFIA (Merging of Adaptive Finite Intervals)

MAFIA extends CLIQUE by using adaptive grid partitioning rather than fixed equal-width intervals. It analyzes the data distribution in each dimension to determine optimal interval boundaries, creating grid cells of varying sizes based on data density. Dense regions receive finer partitioning, while sparse regions are represented by coarser cells. This adaptive approach better captures cluster structure, especially when data distribution varies significantly across dimensions. MAFIA also introduces parallelism for efficiency and uses scalable architectures for high-dimensional data. The algorithm generates candidate clusters in subspaces, merging those that overlap or are connected. By adapting grid resolution to data density, MAFIA reduces the number of cells in sparse regions while maintaining detail where needed. This adaptability improves both accuracy and efficiency compared to fixed-grid approaches. MAFIA is particularly valuable for high-dimensional data where uniform grids would be prohibitively large.

5. ENCLUS (ENtropy-based CLUStering)

ENCLUS is a subspace clustering algorithm that uses entropy to identify interesting subspaces for clustering. It partitions each dimension into intervals and evaluates the entropy of each subspace, where low entropy indicates strong clustering structure. Subspaces with entropy below a threshold are considered potentially interesting. ENCLUS then generates higher-dimensional subspaces using an Apriori-like approach, pruning those that cannot contain interesting clusters. The algorithm also incorporates a connectivity requirement, ensuring that clusters in the identified subspaces are formed by connected dense units. ENCLUS automatically determines both the relevant subspaces and the clusters within them. The entropy-based approach provides a statistical foundation for assessing subspace quality, distinguishing it from density-based methods. This approach is particularly valuable when clusters may exist in different subspaces and the goal is to discover both the clusters and their characterizing dimensions.
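The entropy criterion itself is easy to state in code; a sketch assuming NumPy, with the subspace given as a list of column indices:

```python
import numpy as np

def subspace_entropy(X, dims, bins=10):
    """Shannon entropy of the cell-occupancy distribution in one subspace.
    Low entropy means points concentrate in few cells: a candidate subspace."""
    counts, _ = np.histogramdd(X[:, dims], bins=bins)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                          # ignore empty cells
    return float(-(p * np.log2(p)).sum())
```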

6. OptiGrid

OptiGrid is an optimal grid-based clustering algorithm that constructs an adaptive grid by finding optimal cutting planes perpendicular to the coordinate axes. It identifies dense regions in one-dimensional projections to determine where to place grid boundaries, creating cells that align with natural cluster boundaries. The algorithm builds a hierarchical grid structure, recursively partitioning cells that contain multiple clusters. OptiGrid includes built-in noise detection and can discover clusters of arbitrary shape. Its adaptive approach avoids the fixed resolution problem of uniform grids, automatically adjusting grid density to data distribution. The algorithm is particularly effective for datasets with varying density across different regions. OptiGrid’s complexity is O(n log n) with appropriate indexing, making it scalable to large datasets. The resulting clusters are defined by the adaptive grid cells, with cluster boundaries that more accurately reflect data structure than uniform grid approaches.

7. GDILC (Grid-based Density-Isoline Clustering)

GDILC combines grid-based partitioning with density isolines to identify clusters of arbitrary shape. It first overlays a grid on the data space and estimates density at each grid cell. The algorithm then constructs density isolines connecting cells with similar density values, similar to contour lines on a topographic map. Clusters are identified as regions bounded by density isolines that exceed a threshold. This approach naturally handles clusters of different densities and shapes, as isolines adapt to local density variations. GDILC also identifies nested cluster structures where lower-density regions contain higher-density subregions. The isoline representation provides intuitive visualization of cluster structure, with clusters appearing as “mountains” on a density landscape. GDILC is computationally efficient due to grid-based processing and handles noise effectively by filtering low-density regions. Applications include spatial data analysis, image segmentation, and any domain where density variations reveal natural groupings.

8. GRIDCLUS

GRIDCLUS is a grid-based clustering algorithm that first partitions the data space into grid cells and then performs clustering on the cells rather than individual points. It identifies dense cells based on a density threshold, then groups adjacent dense cells into clusters using a region-growing approach. GRIDCLUS uses a multi-resolution grid hierarchy, enabling clusters to be detected at different granularities. The algorithm efficiently handles large datasets by working with cell summaries rather than individual points. Cluster boundaries are approximated by the grid cells, which can be refined by using finer grids in regions of interest. GRIDCLUS supports incremental updates and parallel processing, making it suitable for dynamic databases and large-scale applications. The algorithm’s simplicity and efficiency make it a practical choice for many real-world applications where approximate but fast clustering is acceptable, such as customer segmentation, spatial data analysis, and image processing.

9. BANG-clustering

BANG-clustering uses a balanced adaptive grid structure based on a kd-tree-like partitioning of the data space. It recursively splits dimensions at points that best separate the data, creating grid cells of varying sizes adapted to data distribution. This adaptive approach generates fewer but more meaningful cells compared to uniform grids. BANG-clustering identifies dense cells and merges adjacent dense cells to form clusters, similar to other grid-based methods. The algorithm also includes techniques for handling clusters that span multiple cells at different grid resolutions. BANG-clustering’s adaptive grid structure provides efficient processing while maintaining cluster quality. The kd-tree organization enables fast cell access and neighbor finding, supporting scalability to high-dimensional data. BANG-clustering is particularly effective when data distribution is highly non-uniform, as the adaptive grid allocates more cells to dense regions while using fewer cells in sparse areas, improving both accuracy and efficiency.

10. Adaptive Grid-Based Clustering

Adaptive Grid-Based Clustering encompasses methods that dynamically adjust grid resolution based on local data density. These algorithms typically start with a coarse grid and refine it only in regions where finer detail is needed to separate clusters. The adaptation can be based on statistical tests, density gradients, or cluster validity measures. Adaptive approaches address the fundamental limitation of uniform grids: the trade-off between resolution and computational cost. By allocating more cells to dense, complex regions and fewer cells to sparse, simple regions, adaptive grids achieve better cluster quality with fewer cells overall. Techniques include recursive splitting of cells exceeding density thresholds, quad-tree structures, and Voronoi-based partitioning. Adaptive grid methods are particularly valuable for datasets with widely varying densities, where uniform grids would either miss fine structure in dense regions or waste cells in sparse areas. These algorithms combine the efficiency of grid-based processing with the accuracy of density-adaptive methods.

Process of Grid-Based Clustering:

1. Data Preparation

The grid-based clustering process begins with data preparation, ensuring the dataset is suitable for grid partitioning. This step involves selecting relevant features, handling missing values, and potentially normalizing or standardizing the data so that all dimensions contribute equally to grid cell assignment. For example, if one dimension ranges from 0-100 and another from 0-1,000,000, normalization ensures both dimensions receive appropriate grid resolution. Data preparation also includes understanding the range of each attribute to define the overall data space boundaries. Outlier detection may identify extreme values that could distort grid partitioning; these outliers can be handled separately or used to extend grid boundaries appropriately. Quality preparation ensures that the subsequent grid structure covers the data space effectively and that cell assignments reflect genuine patterns rather than scaling artifacts.

2. Define Grid Structure

Define grid structure partitions each dimension of the data space into intervals, creating a multi-dimensional grid of cells. The number of intervals per dimension (grid resolution) is a critical parameter that affects both computational efficiency and clustering quality. For example, in two-dimensional space with 10 intervals per dimension, 100 grid cells are created. Resolution selection involves trade-offs: finer grids capture more detail but multiply the number of cells; coarser grids are faster but may merge distinct clusters. Some algorithms use adaptive resolution, varying cell size based on data density. The grid defines cell boundaries that will determine which points fall into which cells. This structure is the foundation for all subsequent processing, as all operations will work with cell summaries rather than individual points. The grid essentially quantizes the continuous data space into discrete units.

3. Map Points to Grid Cells

Map points to grid cells assigns each data point to its corresponding cell based on coordinate values. For each point, the algorithm determines which interval it falls into for each dimension, identifying the unique cell containing that point. For example, a point with coordinates (45, 75) falls into cell (4,7) (using zero-based interval indices) if dimension 1 is divided into intervals 0-10, 10-20, …, 90-100 and dimension 2 is divided similarly. This mapping is computationally efficient, requiring O(N) time, and can be done in a single pass through the data. After mapping, each point is associated with a cell identifier. This step transforms the original dataset into a compact representation where individual points are no longer needed for clustering decisions. The mapping also creates the foundation for cell statistics by accumulating points per cell.
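The interval lookup can be done with a binary search over the interval edges; a sketch assuming NumPy that reproduces the (45, 75) example:

```python
import numpy as np

# Ten 0-100 intervals per dimension, as in the text; interior edges only.
edges = np.arange(10, 100, 10)            # 10, 20, ..., 90
point = np.array([45, 75])
cell = tuple(np.digitize(point, edges))   # (4, 7): zero-based interval indices
```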

4. Compute Cell Statistics

Compute cell statistics aggregates information for each grid cell based on the points assigned to it. Basic statistics include cell population (number of points). More detailed statistics may include mean, variance, minimum, maximum, and distribution information for each dimension within the cell. For example, a cell might store count = 150, mean income = ₹45,000, income standard deviation = ₹5,000, and an age distribution summary. These statistics characterize the cell's data content without storing individual points. The level of statistical detail depends on the algorithm and application requirements. STING, for example, stores comprehensive statistical summaries to support fast query processing. This summarization dramatically reduces data volume from potentially millions of points to thousands of cells. The cell statistics become the input for all subsequent clustering decisions, enabling efficient processing.

5. Identify Dense Cells

Identify dense cells determines which grid cells contain enough points to be considered for cluster formation. A density threshold parameter specifies the minimum number of points required for a cell to be considered dense. This threshold can be absolute (e.g., at least 10 points) or relative (e.g., above average cell density). Cells meeting the threshold are marked as dense; those below are considered sparse and treated as potential noise. For example, in customer segmentation, cells with fewer than 5 customers might be considered sparse and excluded from major segments. This density identification is analogous to core point identification in density-based clustering. The threshold choice significantly affects results: too high may miss meaningful clusters in sparse regions; too low may include noise cells in clusters. Some algorithms use adaptive thresholds based on local density characteristics.

6. Merge Adjacent Dense Cells

Merge adjacent dense cells connects neighboring dense cells to form contiguous clusters. The algorithm examines grid connectivity, typically considering cells that share faces (4-connectivity in 2D, 6-connectivity in 3D). All connected dense cells are grouped into a single cluster, creating regions that follow the grid structure. For example, in spatial analysis, a cluster might follow an irregular shape as connected cells trace a pattern. This merging function is the grid-based equivalent of density connectivity. Merging can be implemented through simple region growing or connected component labeling on the grid. The resulting clusters can have complex shapes, limited only by grid resolution. Merging decisions are based purely on cell adjacency and density, requiring no distance calculations between points. This efficiency is a key advantage of grid-based methods.
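When the grid is stored as an array, this merging step reduces to standard connected-component labeling; a sketch assuming SciPy:

```python
import numpy as np
from scipy import ndimage

dense = np.zeros((10, 10), dtype=bool)  # boolean grid marking dense cells
dense[2:4, 2:5] = True                  # one dense region
dense[7:9, 6:8] = True                  # a second, not face-connected region
labels, n = ndimage.label(dense)        # 4-connectivity by default -> n == 2
# labels[i, j] is the cluster id (1..n) of cell (i, j); 0 marks sparse cells
```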

7. Handle Boundary Cells

Handle boundary cells refines cluster boundaries to improve accuracy. Since grid cells are discrete units, clusters defined by merged cells have jagged boundaries following grid lines. Points near a cell boundary may be assigned to the wrong cluster, and adjacent dense cells may merge groups that are actually separate. Boundary handling techniques include examining points near cell boundaries, using finer grids near boundaries, or applying post-processing to adjust cluster membership based on actual point positions. Some algorithms use overlapping or shifted grids to detect boundary points that should belong to different clusters. For example, points in cells at cluster boundaries might be reassigned based on the nearest dense cell or local density estimation. This step improves clustering quality while retaining most of the grid-based efficiency.

8. Multi-Resolution Refinement

Multi-resolution refinement examines cluster structure at different grid resolutions to validate and refine results. Using a hierarchical grid structure, the algorithm can start with coarse resolution to identify major clusters, then examine finer resolutions within those clusters to detect sub-structure or refine boundaries. For example, STING stores statistical information at multiple grid levels, allowing queries to drill down for detail only where needed. WaveCluster uses wavelet transforms to identify significant regions across scales. This refinement identifies whether coarse clusters actually contain separable sub-clusters at finer resolution. It also validates that clusters found at coarse resolution persist across scales, indicating stable structure. Multi-resolution refinement provides confidence in clustering results and reveals hierarchical organization.

9. Cluster Labeling

Cluster labeling assigns each original data point to its corresponding cluster based on cell membership. Since points were mapped to cells, and cells were assigned to clusters through merging, each point inherits the cluster label of its containing cell. Points in sparse cells that were not merged into any cluster are typically labeled as noise or outliers. This labeling produces the final clustering result. Some algorithms provide additional output, such as probability of cluster membership for boundary points. The labeling step also generates cluster statistics: overall cluster size, average point characteristics, and cluster boundaries. These labeled results can be visualized, analyzed, and deployed for business applications such as customer segmentation, anomaly detection, or spatial analysis.

10. Output and Interpretation

Output and interpretation presents the clustering results in formats useful for business analysis and decision-making. Typical outputs include cluster assignments for each data point, cluster centroids or representative points, cluster boundaries, and cluster characteristics derived from cell statistics. For example, in customer segmentation, output might show that Cluster 1 represents “high-income urban professionals” with average income ₹12 lakhs, average age 35, and cluster size 15,000 customers. Visualization tools display clusters in the original data space, with grid cells colored by cluster membership. Statistical summaries characterize each cluster’s properties. Domain experts interpret these results, validating that clusters align with business knowledge and identifying actionable insights. This interpretation transforms technical clustering output into business intelligence that drives decisions about targeting, personalization, and strategy. The process concludes with documentation and, if appropriate, deployment of cluster definitions for ongoing application.

Example of Grid-Based Clustering:

Consider 20 customers with data on two features: annual income (in lakhs) and spending score (1-100). The dataset contains customers with incomes ranging from 3 to 12 lakhs and spending scores from 20 to 90. We will apply grid-based clustering using a 4×4 grid structure (4 intervals per dimension) to identify customer segments. This example demonstrates the complete process from grid creation to cluster interpretation, showing how grid-based methods efficiently handle data by working with cell summaries rather than individual points.

1. Data Understanding

The dataset contains 20 customers with two features: annual income (3-12 lakhs) and spending score (20-90). These features capture customer purchasing power and actual spending behavior, two dimensions commonly used for market segmentation. The data ranges are: income minimum 3, maximum 12, range 9; spending minimum 20, maximum 90, range 70. Understanding these ranges is essential for defining grid boundaries. The data appears to have natural groupings: some customers with high income and high spending, others with moderate income and moderate spending, and some with low income and low spending. This structure makes it suitable for clustering analysis. Grid-based clustering will partition this two-dimensional space into cells and identify dense regions corresponding to customer segments.

2. Define Grid Structure

We define a 4×4 grid by partitioning each dimension into 4 equal-width intervals. For Income (3-12 lakhs): interval width = (12-3)/4 = 2.25 lakhs. Intervals: I1: 3.00-5.25, I2: 5.25-7.50, I3: 7.50-9.75, I4: 9.75-12.00. For Spending (20-90): interval width = (90-20)/4 = 17.5. Intervals: S1: 20.0-37.5, S2: 37.5-55.0, S3: 55.0-72.5, S4: 72.5-90.0. This creates 16 grid cells, each representing a unique combination of income and spending intervals. For example, cell (I1,S1) contains customers with income 3.00-5.25 lakhs and spending 20.0-37.5. This grid structure quantizes the continuous data space into discrete units that will form the basis for clustering.

Income \ Spending   S1 (20-37.5)   S2 (37.5-55)   S3 (55-72.5)   S4 (72.5-90)
I4 (9.75-12)        Cell (4,1)     Cell (4,2)     Cell (4,3)     Cell (4,4)
I3 (7.5-9.75)       Cell (3,1)     Cell (3,2)     Cell (3,3)     Cell (3,4)
I2 (5.25-7.5)       Cell (2,1)     Cell (2,2)     Cell (2,3)     Cell (2,4)
I1 (3-5.25)         Cell (1,1)     Cell (1,2)     Cell (1,3)     Cell (1,4)

3. Map Points to Grid Cells

Each customer is assigned to a grid cell based on their income and spending values:

Customer 1: Income 4.2 (I1), Spending 30 (S1) → Cell (1,1)
Customer 2: Income 5.0 (I1), Spending 42 (S2) → Cell (1,2)
Customer 3: Income 6.8 (I2), Spending 48 (S2) → Cell (2,2)
Customer 4: Income 8.5 (I3), Spending 70 (S3) → Cell (3,3)
Customer 5: Income 10.2 (I4), Spending 85 (S4) → Cell (4,4)
Customer 6: Income 4.5 (I1), Spending 35 (S1) → Cell (1,1)
Customer 7: Income 5.8 (I2), Spending 45 (S2) → Cell (2,2)
Customer 8: Income 7.2 (I2), Spending 52 (S2) → Cell (2,2)
Customer 9: Income 9.0 (I3), Spending 68 (S3) → Cell (3,3)
Customer 10: Income 11.0 (I4), Spending 82 (S4) → Cell (4,4)
Customer 11: Income 3.8 (I1), Spending 28 (S1) → Cell (1,1)
Customer 12: Income 6.2 (I2), Spending 50 (S2) → Cell (2,2)
Customer 13: Income 8.2 (I3), Spending 65 (S3) → Cell (3,3)
Customer 14: Income 10.5 (I4), Spending 88 (S4) → Cell (4,4)
Customer 15: Income 4.8 (I1), Spending 40 (S2) → Cell (1,2)
Customer 16: Income 7.5 (I3), Spending 60 (S3) → Cell (3,3)
Customer 17: Income 9.5 (I3), Spending 73 (S4) → Cell (3,4)
Customer 18: Income 5.2 (I1), Spending 38 (S2) → Cell (1,2)
Customer 19: Income 6.5 (I2), Spending 55 (S3) → Cell (2,3)
Customer 20: Income 8.8 (I3), Spending 75 (S4) → Cell (3,4)

4. Compute Cell Statistics

After mapping, we count the number of customers in each cell and optionally compute additional statistics like average income and spending within each cell.

Cell populations:

  • Cell (1,1): Customers 1, 6, 11 → Count = 3

  • Cell (1,2): Customers 2, 15, 18 → Count = 3

  • Cell (2,2): Customers 3, 7, 8, 12 → Count = 4

  • Cell (2,3): Customer 19 → Count = 1

  • Cell (3,3): Customers 4, 9, 13, 16 → Count = 4

  • Cell (3,4): Customers 17, 20 → Count = 2

  • Cell (4,4): Customers 5, 10, 14 → Count = 3

  • All remaining cells (1,3), (1,4), (2,1), (2,4), (3,1), (3,2), (4,1), (4,2), (4,3): Count = 0

Average income and spending can also be computed per cell, but count alone suffices for the density-threshold clustering used here.

5. Set Density Threshold

We set a density threshold of 2 points per cell. Cells with 2 or more points are considered dense; cells with 0 or 1 point are considered sparse. This threshold is chosen based on the total data size (20 points across 16 cells, average 1.25 per cell) and our interest in finding meaningful customer groups rather than individual outliers. A threshold of 2 ensures that only cells with multiple customers contribute to clusters, filtering out isolated points that might represent rare customer types or data errors. This threshold selection balances cluster formation with noise filtering. In practice, threshold selection may involve domain knowledge or experimentation.

Dense cells (count ≥ 2):

  • Cell (1,1): count 3 → Dense

  • Cell (1,2): count 3 → Dense

  • Cell (2,2): count 4 → Dense

  • Cell (3,3): count 4 → Dense

  • Cell (3,4): count 2 → Dense

  • Cell (4,4): count 3 → Dense

Sparse cells (count < 2):

  • Cell (2,3): count 1 → Sparse

  • All other cells: count 0 → Sparse

6. Identify Dense Cells

The dense cells identified are:

  1. Cell (1,1): Income 3.0-5.25, Spending 20-37.5 (Low income, low spending)

  2. Cell (1,2): Income 3.0-5.25, Spending 37.5-55 (Low income, moderate spending)

  3. Cell (2,2): Income 5.25-7.5, Spending 37.5-55 (Moderate income, moderate spending)

  4. Cell (3,3): Income 7.5-9.75, Spending 55-72.5 (High income, high spending)

  5. Cell (3,4): Income 7.5-9.75, Spending 72.5-90 (High income, very high spending)

  6. Cell (4,4): Income 9.75-12, Spending 72.5-90 (Very high income, very high spending)

These dense cells represent regions of the customer space with multiple customers, indicating potential cluster centers. Sparse cells are treated as noise or transitional areas between clusters.

7. Merge Adjacent Dense Cells

We merge adjacent dense cells (sharing edges, not just corners) to form clusters:

Cluster A (Value Segment): Cells (1,1), (1,2), and (2,2) form a connected region: (1,1) and (1,2) share an edge along the spending axis, and (1,2) and (2,2) share an edge along the income axis. Although (1,1) and (2,2) touch only at a corner, the dense bridge cell (1,2) links them into a single L-shaped cluster spanning income 3.0-7.5 lakhs and spending 20-55. Total points: 3+3+4 = 10 customers.

Cluster B (Premium Segment): Cells (3,3), (3,4), and (4,4) form a second connected region: (3,3) is adjacent to (3,4) along the spending axis, and (3,4) is adjacent to (4,4) along the income axis. This L-shaped cluster spans high to very high income and high to very high spending. Total points: 4+2+3 = 9 customers.

Sparse cell (2,3) with 1 customer remains unassigned and is treated as noise. Note how both clusters are non-rectangular: connected dense cells trace arbitrary shapes, exactly as the merging function promises.

8. Resulting Clusters

The grid-based clustering produces two distinct customer segments:

Cluster A (Value Shoppers): 10 customers with income 3.0-7.5 lakhs and spending 20-55. This segment chains together budget customers (low income, low spending) and mainstream customers (moderate income, moderate spending) because the dense cell (1,2) bridges the two regions. These customers range from price-sensitive to middle-market and respond to value-oriented offers; a finer grid could reveal whether the bridge cell reflects genuine continuity or an artifact of the coarse 4×4 resolution.

Cluster B (Premium Shoppers): 9 customers with income 7.5-12 lakhs and spending 55-90. These are high-value customers with significant purchasing power, likely interested in premium products, loyalty programs, and personalized service.

One customer in cell (2,3) with income 6.5 and spending 55 is identified as noise/outlier, potentially representing a transitional customer.

9. Cluster Characterization

Each cluster is characterized using the cell statistics:

Cluster A (Value Shoppers): Cells (1,1), (1,2), and (2,2). Average income approximately 5.4 lakhs, average spending approximately 41. Total 10 customers. This segment represents 50% of the customer base; within it, cells (1,1) and (1,2) hold the budget sub-group and cell (2,2) the mainstream sub-group.

Cluster B (Premium Shoppers): Cells (3,3), (3,4), and (4,4). Average income approximately 9.2 lakhs, average spending approximately 74. Total 9 customers. This segment represents 45% of the customer base.

Noise: 1 customer (5%) with income 6.5 and spending 55, falling between the value and premium segments.

10. Business Interpretation

The grid-based clustering reveals meaningful customer segments for targeted marketing:

Cluster A (Value Shoppers): Target with value promotions, discounts, and bundle offers, emphasizing price advantage and essential products. The cell statistics distinguish a budget sub-group (likely students or entry-level workers) from a mainstream sub-group (established middle-class consumers), allowing tiered messaging from price-led offers to balanced value-quality positioning.

Cluster B (Premium Shoppers): Target with premium products, exclusive offers, and personalized service. Implement high-value loyalty programs and early access to new products. These are affluent customers with high lifetime value.

Noise Customer: Investigate individually; this customer may represent an opportunity to move into the premium segment with appropriate engagement, or may be a data error.

This segmentation enables tailored marketing strategies, optimized inventory allocation, and personalized customer experiences that maximize revenue and satisfaction across all customer groups.
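As a closing check, the worked example can be reproduced end to end with a short script (a sketch assuming NumPy; cell ids are 1-based to match the tables above, and the two printed groups correspond to the value and premium segments):

```python
import numpy as np
from collections import Counter, deque

# The 20 customers from the worked example: (income in lakhs, spending score).
data = np.array([
    (4.2, 30), (5.0, 42), (6.8, 48), (8.5, 70), (10.2, 85),
    (4.5, 35), (5.8, 45), (7.2, 52), (9.0, 68), (11.0, 82),
    (3.8, 28), (6.2, 50), (8.2, 65), (10.5, 88), (4.8, 40),
    (7.5, 60), (9.5, 73), (5.2, 38), (6.5, 55), (8.8, 75),
])
# Interior edges of the 4x4 grid (I1..I4 by S1..S4).
income_edges = [5.25, 7.5, 9.75]
spend_edges = [37.5, 55.0, 72.5]
cells = [(int(np.digitize(i, income_edges)) + 1,
          int(np.digitize(s, spend_edges)) + 1) for i, s in data]

counts = Counter(cells)
dense = {c for c, n in counts.items() if n >= 2}     # density threshold = 2

# Merge face-adjacent dense cells (4-connectivity) with a flood fill.
labels, lab = {}, 0
for start in sorted(dense):
    if start in labels:
        continue
    labels[start] = lab
    queue = deque([start])
    while queue:
        i, s = queue.popleft()
        for nb in ((i - 1, s), (i + 1, s), (i, s - 1), (i, s + 1)):
            if nb in dense and nb not in labels:
                labels[nb] = lab
                queue.append(nb)
    lab += 1

print(sorted(dense))   # the 6 dense cells listed in step 5
print(labels)          # two connected groups: value {A} and premium {B}
```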
