Clustering data into groups is an important task for dimensionality reduction and for identifying important properties of a data set. A wide range of clustering algorithms has been devised, each using a similarity or distance measure built into the algorithm to establish groupings of the data with a range of properties. These algorithms group the data by attempting to maximize the distance between clusters while minimizing the variance within each cluster. However, traditional clustering algorithms do not provide an efficient mechanism to fine-tune the final clustering solution given sparse information about the desired properties of the grouping.
The need for semi-unsupervised clustering arises, for example, in data sets with large numbers of attributes where most of the attributes are not semantically relevant but, due to their number, dominate any distance metric used by traditional clustering algorithms. In these cases, sparse information about the quality of clusters or about relations between a small number of data points might be available and could be used to guide cluster formation.
Semi-unsupervised clustering defines pairwise constraints on the input data in order to direct the clustering algorithm towards a solution that satisfies the given constraints. There are two possible types of constraints: same-cluster (must-link) constraints, which indicate that two points should be in the same cluster, and different-cluster (must-not-link) constraints, which indicate that two points should be in different clusters.
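As an illustration, such constraints can be encoded as pairs of sample indices and checked against a candidate clustering. The following is a minimal sketch; the names `must_link`, `must_not_link`, and `constraints_satisfied` are hypothetical, not from the paper:

```python
# Each constraint is a pair of sample indices (illustrative values).
must_link = [(0, 1), (2, 3)]        # pairs that should share a cluster
must_not_link = [(0, 4), (1, 5)]    # pairs that should be separated

def constraints_satisfied(labels, must_link, must_not_link):
    """Check a clustering (one cluster label per sample) against
    both constraint sets."""
    ok_ml = all(labels[i] == labels[j] for i, j in must_link)
    ok_mnl = all(labels[i] != labels[j] for i, j in must_not_link)
    return ok_ml and ok_mnl
```

Given the label assignment produced by any clustering algorithm, such a check returns True only if every must-link and must-not-link constraint holds.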
Given the input samples, it is often not possible to cluster the data according to the constraints in their original feature space using unmodified distance measures as indications of similarity. Thus, the feature space has to be modified, usually by scaling the dimensions, so that an unmodified clustering algorithm can cluster based on its own distance and variance criteria. To solve this problem, this paper presents a novel approach which first learns a policy to compute the scaling factors using reinforcement learning on a set of training problems, and subsequently applies the learned policy to compute the scaling factors for new problems. The goal is that, by working on the scaled dimensions, a traditional clustering algorithm yields results that satisfy the constraints.
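A minimal sketch of the idea follows, assuming the per-dimension scaling factors `w` are already given (in the approach above they would come from the learned policy); an off-the-shelf k-means from scikit-learn then clusters in the rescaled space. The data and weight values are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 5)                # 100 samples, 5 attributes
w = np.array([1.0, 1.0, 0.1, 0.1, 0.1])   # hypothetical scaling factors

# Dimensions with small weights barely influence the distance metric,
# so the unmodified clustering algorithm effectively ignores them.
X_scaled = X * w
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_scaled)
```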
Clustering Methods:
- Density-Based Methods: These methods treat clusters as dense regions of the space separated by regions of lower density. They can discover clusters of arbitrary shape and are robust to noise points. Examples include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure); both density-based and hierarchical clustering are illustrated in the sketch after this list.
- Hierarchical Methods: These methods build clusters in a tree-like hierarchy, forming new clusters from previously formed ones. They fall into two categories:
- Agglomerative (bottom-up approach)
- Divisive (top-down approach)
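Both families are available off the shelf in scikit-learn; the sketch below runs them on toy data, with parameter values chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

X = np.random.rand(200, 2)  # toy 2-D data

# Density-based: eps is the neighborhood radius, min_samples the
# density threshold; the label -1 marks noise points.
db_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative, bottom-up): repeatedly merges the two
# closest clusters until n_clusters remain.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```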
Applications of Clustering in Different Fields:
Marketing: It can be used to characterize and discover customer segments for marketing purposes.
Biology: It can be used for the classification of different species of plants and animals.
Libraries: It is used to cluster books on the basis of topic and content.
Insurance: It is used to group customers by their policies and to identify fraud.
City Planning: It is used to group houses and to study their values based on geographical location and other factors.
Earthquake studies: By clustering earthquake-affected areas, dangerous zones can be determined.
Clustering Algorithms:
K-means clustering algorithm: K-means is one of the simplest unsupervised learning algorithms for clustering. It partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
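A minimal from-scratch sketch of the standard Lloyd iteration for k-means follows; random initialization and the stopping rule are kept deliberately simple:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate between assigning each
    point to its nearest mean and recomputing the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points,
        # keeping the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```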