K-Nearest Neighbors (KNN) is a non-parametric, instance-based machine learning algorithm used for both classification and regression tasks. It is one of the simplest and most effective algorithms in machine learning, widely applied in data mining and pattern recognition across domains such as healthcare, finance, and social media analysis. In this article, we will discuss the KNN algorithm, its working principle, and its applications.
Working Principle of KNN:
The working principle of the KNN algorithm is based on the assumption that similar things are close to each other. KNN is a distance-based algorithm: it measures the distance between the new data point and all the data points in the training set, then assigns the new data point to the majority class among its k nearest neighbors.
The KNN algorithm involves the following steps:
- Choose the value of k: Select k, the number of nearest neighbors used to classify the new data point. The value of k is typically chosen using cross-validation.
- Calculate the distance: Compute the distance between the new data point and every point in the training set, using a metric such as Euclidean, Manhattan, or Minkowski distance.
- Find the k-nearest neighbors: Identify the k training points closest to the new data point under the chosen metric.
- Assign the class or value: In classification tasks, assign the most frequent class among the k nearest neighbors to the new data point. In regression tasks, assign the average of the values of the k nearest neighbors.
KNN is a lazy algorithm because it does not learn a model from the training data. Instead, it memorizes the training data and uses it to classify new data points. KNN is also an instance-based algorithm because it uses the entire training set as its model.
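The steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the names `euclidean` and `knn_classify` are ours, and the toy data is made up.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, query, k=3):
    # 1. Measure the distance from the query to every training point
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Keep the k closest neighbours
    nearest = sorted(distances)[:k]
    # 3. Majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_X, train_y, (1.5, 1.5), k=3))  # → "A"
```

Note that all the work happens at prediction time: nothing is precomputed beyond storing the training points, which is exactly the "lazy" behavior described above.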
Applications of KNN:
The KNN algorithm has various applications in data mining and pattern recognition. Some of the applications of KNN are as follows:
- Healthcare: diagnosing diseases and predicting patient outcomes from data such as medical images, clinical records, and genetic data.
- Finance: predicting stock prices and identifying fraudulent transactions from data such as prices, trading volumes, and transaction logs.
- Social Media Analysis: analyzing user behavior and sentiment in data such as tweets, posts, and comments.
- Image Recognition: classifying images based on features such as pixels, textures, and shapes.
- Recommender Systems: recommending products and services by analyzing user preferences, ratings, and reviews.
Advantages of KNN:
The KNN algorithm has the following advantages:
- Simple and Easy to Implement: The KNN algorithm is simple to implement and has no explicit training phase beyond storing the data.
- Non-Parametric: The KNN algorithm is a non-parametric algorithm, which means that it does not make any assumptions about the underlying distribution of the data.
- Competitive Accuracy: The KNN algorithm can achieve high accuracy on many problems, especially when the value of k is chosen appropriately and features are properly scaled.
- Robust to Noisy Data: With a sufficiently large k, majority voting smooths out the influence of individual noisy examples.
- Can Handle Multi-Class Problems: The KNN algorithm can handle multi-class problems because it assigns the new data point to the class with the highest frequency among its k-nearest neighbors.
Disadvantages of KNN:
The KNN algorithm also has some disadvantages, which are as follows:
- Computationally Expensive: Classifying each new point requires computing its distance to every training example, which becomes costly when the training set is large.
- Sensitive to the Choice of Distance Metric: The predictions depend directly on how similarity between points is measured.
- Requires Feature Scaling: Features on larger numeric scales dominate the distance calculation unless the data is normalized.
- Imbalanced Data: The KNN algorithm is sensitive to imbalanced data because it assigns the new data point to the class with the highest frequency among its k-nearest neighbors, which may not be representative of the true class distribution.
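The feature-scaling issue above is usually addressed by standardizing each column before computing distances. Here is a minimal z-score sketch; `zscore_columns` is an illustrative helper, and the age/income numbers are made up.

```python
def zscore_columns(rows):
    # Standardise each feature column to mean 0 and unit variance so that
    # no single large-scale feature (e.g. income vs. age) dominates the
    # Euclidean distance.
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # `or 1.0` guards against a constant column (zero standard deviation)
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

# Income (tens of thousands) would swamp age (tens) in raw distances
raw = [(25, 40000), (30, 42000), (60, 41000)]
scaled = zscore_columns(raw)
```

After scaling, a one-unit difference means "one standard deviation" in every column, so both features contribute comparably to the distance.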
Background and Formal Description:
The k-nearest neighbors algorithm (k-NN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:
- In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
- In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
k-NN is a type of instance-based learning in which the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales, then normalizing the training data can improve its accuracy dramatically.
Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
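The 1/d scheme can be sketched for regression in a few lines. This is an illustrative one-dimensional example, not a reference implementation; the special case where the query coincides with a training point is handled explicitly since 1/d is undefined at d = 0.

```python
def weighted_knn_regress(xs, ys, query, k=3):
    # 1/d weighting: nearer neighbours pull the prediction more strongly
    # than distant ones.
    nearest = sorted((abs(x - query), y) for x, y in zip(xs, ys))[:k]
    if nearest[0][0] == 0:
        return nearest[0][1]  # query coincides with a training point
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]
print(weighted_knn_regress(xs, ys, 2.5, k=2))  # neighbours 2.0 and 3.0 weigh equally → 2.5
```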
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data.
Figure (not shown): Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
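The first three metrics mentioned here are short enough to write out directly. These are textbook definitions sketched in plain Python; the function names are ours.

```python
import math

def euclidean(a, b):
    # Straight-line distance for continuous features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Overlap metric for discrete features: count of mismatched positions
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(hamming("karolin", "kathrin"))      # 3
```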
A drawback of the basic “majority voting” classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. K-NN can then be applied to the SOM.
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
The KNN Algorithm
1. Load the data.
2. Initialize K to your chosen number of neighbours.
3. For each example in the data:
   3.1 Calculate the distance between the query example and the current example from the data.
   3.2 Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (ascending) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels; if classification, return the mode of the K labels.
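The steps listed above translate almost line-for-line into Python. This is a sketch, not a library routine; `knn_predict` and the toy points are illustrative, and `math.dist` (Python 3.8+) does the distance calculation.

```python
from collections import Counter
import math

def knn_predict(data, query, k, task="classification"):
    # data: list of (feature_vector, label_or_value) pairs
    # Distance from the query to each example, kept with its index
    ordered = [(math.dist(features, query), index)
               for index, (features, _) in enumerate(data)]
    # Sort by distance, ascending
    ordered.sort()
    # Labels of the first K entries
    k_labels = [data[index][1] for _, index in ordered[:k]]
    # Mean for regression, mode for classification
    if task == "regression":
        return sum(k_labels) / k
    return Counter(k_labels).most_common(1)[0][0]

points = [((1, 1), "A"), ((2, 2), "A"), ((9, 9), "B"), ((8, 9), "B")]
print(knn_predict(points, (1, 2), 3))  # majority vote of the 3 nearest
```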
Choosing the right value for K
To select the K that’s right for your data, run the KNN algorithm several times with different values of K and choose the K that minimizes the number of errors on data the algorithm hasn’t seen before, while still letting it make accurate predictions.
Here are some things to keep in mind:
- As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a query point surrounded by several red points and a single green one, with the green point happening to be the nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
- Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far.
- In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
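This trial-and-error selection can be sketched as a small validation loop. The helper names and the held-out split are illustrative assumptions; in practice cross-validation is more common than a single held-out set.

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    # train: list of (point, label) pairs
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def error_rate(train, held_out, k):
    # Fraction of held-out points the classifier gets wrong
    wrong = sum(knn_classify(train, x, k) != y for x, y in held_out)
    return wrong / len(held_out)

def pick_k(train, held_out, candidates):
    # Lowest validation error wins; ties break toward the smaller (simpler) k
    return min(candidates, key=lambda k: (error_rate(train, held_out, k), k))

train = [((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
         ((5, 5), "green"), ((5, 6), "green"), ((6, 5), "green")]
held_out = [((0.5, 0.5), "red"), ((5.5, 5.5), "green")]
best_k = pick_k(train, held_out, [1, 3, 5])
```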
Advantages:
- The algorithm is simple and easy to implement.
- There’s no need to build a model, tune several parameters, or make additional assumptions.
- The algorithm is versatile. It can be used for classification, regression, and search (as we will see in the next section).
Disadvantages:
- The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
KNN in practice
KNN’s main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to be made rapidly. Moreover, there are faster algorithms that can produce more accurate classification and regression results.
However, provided you have sufficient computing resources to speedily handle the data you are using to make predictions, KNN can still be useful in solving problems that have solutions that depend on identifying similar objects. An example of this is using the KNN algorithm in recommender systems, an application of KNN-search.
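A toy sketch of this KNN-search flavor of recommendation: find the users whose rating vectors are most similar to yours, then suggest an item they rated that you have not. The rating matrix, user names, and the "0 means unrated" convention are all illustrative assumptions.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two rating vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recommend(ratings, user, k=2):
    # Rank the other users by similarity to `user`, then score each item the
    # user has not rated (0 = unrated) by the k nearest neighbours' ratings.
    others = sorted(((cosine_sim(ratings[user], vec), name)
                     for name, vec in ratings.items() if name != user),
                    reverse=True)
    neighbours = [name for _, name in others[:k]]
    scores = {i: sum(ratings[n][i] for n in neighbours)
              for i, r in enumerate(ratings[user]) if r == 0}
    return max(scores, key=scores.get) if scores else None

ratings = {  # rows: users, columns: items (hypothetical data)
    "ann": [5, 4, 0, 0],
    "bob": [5, 5, 1, 0],
    "eve": [1, 0, 5, 4],
}
print(recommend(ratings, "ann", k=1))  # ann's tastes match bob's → item index 2
```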