An outlier is an object in a data set that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
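There are also different degrees of outliers, conventionally defined by "fences" placed at a multiple of the interquartile range (IQR) below the first quartile and above the third quartile (commonly 1.5 × IQR for the inner fences and 3 × IQR for the outer fences):
- Extreme outliers lie beyond an “outer fence” on either side.
- Mild outliers lie beyond an “inner fence” on either side, but not beyond the outer fence.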
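The two degrees can be illustrated with a small sketch. It assumes the common Tukey convention of 1.5 and 3 times the IQR for the inner and outer fences; the function name `classify_outliers`, the parameters `inner_k` and `outer_k`, and the sample data are made up for illustration.

```python
import statistics

def classify_outliers(values, inner_k=1.5, outer_k=3.0):
    """Label each value 'normal', 'mild', or 'extreme' using IQR-based fences.

    inner_k and outer_k are assumed multipliers (the usual 1.5 and 3.0).
    """
    # Quartiles of the data; 'inclusive' treats the data as the full population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    inner_low, inner_high = q1 - inner_k * iqr, q3 + inner_k * iqr
    outer_low, outer_high = q1 - outer_k * iqr, q3 + outer_k * iqr

    labels = []
    for v in values:
        if v < outer_low or v > outer_high:
            labels.append("extreme")   # beyond an outer fence
        elif v < inner_low or v > inner_high:
            labels.append("mild")      # beyond an inner fence only
        else:
            labels.append("normal")
    return labels

# Illustrative data: Q1 = 11, Q3 = 14, so IQR = 3.
# Inner fences: 6.5 and 18.5; outer fences: 2 and 23.
data = [10, 12, 11, 13, 12, 11, 14, 19, 40]
for value, label in zip(data, classify_outliers(data)):
    print(value, label)
```

With this sample, 19 falls between the upper inner and outer fences and is labeled mild, while 40 lies beyond the upper outer fence and is labeled extreme.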
The easiest way to detect outliers is to plot the data. Box plots, scatterplots, and histograms can all help reveal them. Alternatively, outliers can be listed using the mean and standard deviation, or using the quartiles and the interquartile range (IQR).
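As a rough sketch of the mean and standard deviation approach, the snippet below flags values that sit more than a chosen number of standard deviations from the mean. The function name `zscore_outliers` and the `threshold` parameter are assumptions for illustration; the text does not prescribe a cutoff, and 2 or 3 are merely common choices.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    The threshold is an assumed, adjustable cutoff, not a fixed rule.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Illustrative data; a cutoff of 2 is used because the sample is small.
data = [10, 12, 11, 13, 12, 11, 14, 19, 40]
print(zscore_outliers(data, threshold=2.0))
```

One caveat with this approach: an extreme value inflates the mean and the standard deviation it is measured against, so in small samples a milder outlier such as 19 can go unflagged. Quartile-based rules are less sensitive to this.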
Detecting Outliers:
Clustering-based outlier detection using distance to the closest cluster:
In the K-Means clustering technique, each cluster has a mean value, and each object belongs to the cluster whose mean is closest to it. To identify outliers, first choose a threshold value: any data point whose distance from its nearest cluster mean exceeds the threshold is treated as an outlier. Then compute the distance from the test data point to each cluster mean. If the distance between the test data point and its closest cluster mean is greater than the threshold, classify the point as an outlier.
How to detect outliers in data
Data visualization is a core discipline for analysts and optimizers, not just to better communicate results with executives, but to explore the data fully.
As such, outliers are often detected through graphical means, though you can also do so by a variety of statistical methods using your favorite tool. (Excel and R will be referenced heavily here, though SAS, Python, etc., all work).
Two of the most common graphical ways of detecting outliers are the boxplot and the scatterplot. A boxplot is my favorite way.
Algorithm:
- Calculate the mean of each cluster
- Initialize the Threshold value
- Calculate the distance of the test data from each cluster mean
- Find the nearest cluster to the test data
- If the distance to the nearest cluster mean is greater than the threshold, classify the test data as an outlier
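The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full K-Means implementation: it assumes the points have already been grouped into clusters (for example by a prior K-Means run), uses Euclidean distance, and picks an arbitrary threshold of 3.0. The names `cluster_means` and `is_outlier` and the sample points are hypothetical.

```python
import math

def cluster_means(clusters):
    """Step 1: compute the mean point of each cluster.

    `clusters` is a list of clusters, each a list of equal-length coordinate tuples.
    """
    means = []
    for points in clusters:
        dims = len(points[0])
        means.append(tuple(sum(p[d] for p in points) / len(points) for d in range(dims)))
    return means

def is_outlier(point, means, threshold):
    """Steps 3 to 5: measure the distance to every mean, pick the nearest, compare.

    Returns (outlier_flag, index_of_nearest_cluster, distance_to_it).
    """
    distances = [math.dist(point, m) for m in means]         # Euclidean distance (assumed metric)
    nearest = min(range(len(means)), key=distances.__getitem__)
    return distances[nearest] > threshold, nearest, distances[nearest]

# Assumed, pre-clustered sample data.
clusters = [
    [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
]
means = cluster_means(clusters)
threshold = 3.0  # Step 2: an assumed value; in practice it is tuned to the data

for test_point in [(2.0, 2.0), (5.0, 5.0)]:
    flagged, nearest, distance = is_outlier(test_point, means, threshold)
    print(test_point, "nearest cluster:", nearest, "distance:", round(distance, 2), "outlier:", flagged)
```

Here the point (2.0, 2.0) lies close to the first cluster mean and is not flagged, while (5.0, 5.0) sits between the two clusters, farther than the threshold from either mean, and is flagged. The result depends heavily on the threshold, so it is worth choosing it with the spread of each cluster in mind.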