There are several graphical and numerical methods for identifying outliers in a dataset. Some commonly used methods include:
- Box plots: This method uses a box-and-whisker plot to display the distribution of a dataset. Outliers are typically identified as points outside of the upper and lower whiskers of the plot.
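The box plot consists of a box and a set of whiskers that extend from the box. The box represents the middle 50% of the data. In the common Tukey-style plot, each whisker extends to the most extreme data point that still lies within 1.5 × IQR of the box; in simpler variants the whiskers run all the way to the minimum and maximum values. The box and whiskers are constructed using the following elements: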
- The lower quartile (Q1): This is the value that separates the lowest 25% of the data from the rest. It is also known as the first quartile.
- The median (Q2): This is the value that separates the lowest 50% of the data from the highest 50%. It is also known as the second quartile.
- The upper quartile (Q3): This is the value that separates the lowest 75% of the data from the highest 25%. It is also known as the third quartile.
- The minimum value: This is the smallest value in the dataset.
- The maximum value: This is the largest value in the dataset.
- The interquartile range (IQR): This is the difference between the upper and lower quartiles (Q3 – Q1). It measures the spread of the middle 50% of the data and sets the width of the outlier fences.
- Outliers: These are values below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR. They are typically drawn as individual points beyond the whiskers.
- Extreme values: These are values below Q1 – 3 × IQR or above Q3 + 3 × IQR. They are typically drawn as individual points even farther from the whiskers, often with a different marker.
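The box plot elements listed above can be computed directly. The following is a minimal sketch using only Python's standard library; the function name `box_plot_summary`, the dictionary keys, and the sample data are illustrative assumptions, and the quartile convention (`method="inclusive"`) is one of several in use, so other tools may report slightly different quartiles. Here values beyond the 1.5 × IQR fences but within the 3 × IQR fences are reported separately from the extreme values.

```python
import statistics

def box_plot_summary(data):
    """Return the elements of a Tukey-style box plot for a list of numbers."""
    values = sorted(data)
    # quantiles(n=4) returns the three cut points [Q1, Q2, Q3].
    # method="inclusive" is an assumption: it interpolates between the
    # sorted values; other quartile conventions give slightly different results.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # outlier fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme-value fences
    # Whiskers end at the most extreme data points inside the inner fences.
    inside = [v for v in values if inner_low <= v <= inner_high]
    return {
        "min": values[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values[-1],
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        # Beyond the 1.5 x IQR fences but not beyond the 3 x IQR fences.
        "mild_outliers": [
            v for v in values
            if (v < inner_low or v > inner_high) and outer_low <= v <= outer_high
        ],
        # Beyond the 3 x IQR fences.
        "extremes": [v for v in values if v < outer_low or v > outer_high],
    }

# Made-up sample data for illustration.
summary = box_plot_summary([2, 4, 5, 6, 7, 8, 9, 16, 30])
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the upper whisker stops at 9 rather than at the maximum, because 16 and 30 both lie beyond the upper fence.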
- Scatter plots: This method uses a scatter plot to display the relationship between two variables. Outliers are typically identified as points that lie far away from the main cluster of points.
A scatter plot consists of a set of points plotted on a two-dimensional coordinate system, where the x-axis represents one variable and the y-axis represents the other variable.
- Identifying Outliers: Look for points that sit far from the main cluster or far from the trend that the other points follow. A point can be unremarkable on each axis separately and still be an outlier when the two variables are considered together.
- Identifying Patterns: Scatter plots can be used to identify patterns such as linear or non-linear relationships and positive or negative correlation. A relationship is linear when the points fall roughly along a straight line. The correlation is positive when both variables tend to increase together and negative when one tends to decrease as the other increases.
- Identifying clusters: Scatter plots can also be used to identify clusters or groups of data points, either by eye or with clustering algorithms such as K-means or DBSCAN.
- Identifying outliers with Mahalanobis Distance: Mahalanobis distance is a measure of the distance between a point and a distribution. It takes into account the covariance of the data. Points with a large Mahalanobis distance can be considered as outliers.
- Identifying clusters with Density-Based Clustering: Density-based clustering methods such as DBSCAN identify clusters as dense regions of points and treat points that do not belong to any cluster as outliers.
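The idea of a point lying "far away from the main cluster" can be made concrete by counting how many neighbours each point has within a fixed radius. The sketch below is a simplified neighbour-count check inspired by the density idea behind DBSCAN; it is not DBSCAN itself, and the function name `sparse_points`, the parameters `eps` and `min_neighbors`, and the sample points are all assumptions chosen for illustration.

```python
import math

def sparse_points(points, eps, min_neighbors):
    """Return indices of points with fewer than min_neighbors other points within eps."""
    flagged = []
    for i, p in enumerate(points):
        # Count the other points that lie within distance eps of p.
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps
        )
        if neighbors < min_neighbors:
            flagged.append(i)
    return flagged

# Made-up (x, y) data: five points close together and one far away.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (1.0, 0.8), (5.0, 5.0)]
isolated = sparse_points(points, eps=0.5, min_neighbors=2)
print(isolated)
```

The result depends heavily on `eps` and `min_neighbors`, and on the scale of each axis, so these values need to be chosen with the data in mind.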
- Z-score: This method calculates the number of standard deviations away from the mean a data point is. Outliers are typically identified as points with a z-score greater than 3 or less than -3.
The Z-score is calculated by subtracting the mean of the dataset from a given data point and then dividing the result by the standard deviation of the dataset.
- Identifying Outliers: A cutoff of |Z| > 3 is a common convention. For approximately normal data, about 99.7% of values lie within three standard deviations of the mean, so values beyond that are rare. Because the mean and standard deviation are themselves pulled by extreme values, a large outlier can inflate the standard deviation and partly hide itself or others, especially in small samples.
- Normality check: Z-scores give only a rough check for normality. In normally distributed data, about 68% of Z-scores fall within ±1, about 95% within ±2, and about 99.7% within ±3. Having every Z-score between -3 and 3 does not by itself show that the data is normal; a histogram, a Q-Q plot, or a formal test such as Shapiro-Wilk is more reliable.
- Z-score calculation: Z-score can be calculated by using the following formula:
Z = (x – mean) / std_dev
Where x is a data point, mean is the mean of the dataset, and std_dev is the standard deviation of the dataset.
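As an illustration of the formula above, here is a minimal sketch using Python's standard library. The function names `z_scores` and `z_score_outliers` and the sample data are assumptions; the sketch uses the population standard deviation (`statistics.pstdev`), and swapping in `statistics.stdev` would give the sample version.

```python
import statistics

def z_scores(data):
    """Return the Z-score of every value: (x - mean) / std_dev."""
    mean = statistics.fmean(data)
    std_dev = statistics.pstdev(data)  # population standard deviation (assumption)
    if std_dev == 0:
        raise ValueError("all values are identical, so Z-scores are undefined")
    return [(x - mean) / std_dev for x in data]

def z_score_outliers(data, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

# Made-up data: nineteen values near 10 and one value of 50.
data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
        10, 11, 9, 10, 12, 8, 10, 11, 9, 50]
print(z_score_outliers(data))
```

With very small datasets this rule can never fire: when the population standard deviation is used, the largest possible |Z| in a dataset of n points is the square root of n – 1, so at least 11 points are needed before any value can exceed 3.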
- Interquartile range (IQR): This method calculates the range between the first and third quartiles (Q1 and Q3) of a dataset. Outliers are typically identified as points below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR.
The IQR gives an idea of the spread of the middle 50% of the data.
- Identifying Outliers: The two limits Q1 – 1.5 × IQR and Q3 + 1.5 × IQR are often called fences. Values beyond the fences lie outside the typical range of the data and are flagged as potential outliers.
- Identifying outliers with Tukey’s method: Tukey’s method flags a value x as an outlier when:
x < Q1 – 1.5 × IQR or x > Q3 + 1.5 × IQR
Where Q1 is the lower quartile, Q3 is the upper quartile and IQR is the interquartile range.
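- Identifying skewness: The quartiles can also hint at skewness. A dataset is roughly symmetric if Q1 and Q3 are about the same distance from the median. If Q3 is farther from the median than Q1, the data is skewed to the right; if Q1 is farther, it is skewed to the left.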
- IQR calculation: IQR can be calculated by using the following formula:
IQR = Q3 – Q1
Where Q1 is the lower quartile and Q3 is the upper quartile.
Note that the 1.5 × IQR rule works best when the data is moderately symmetric; in strongly skewed data it tends to flag many legitimate values in the long tail. Use it alongside other methods and visualizations, and consider the context of the data when interpreting the result.
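Tukey’s rule and the quartile-based skewness check described in this section can be sketched as follows. The function names, the multiplier parameter `k`, and the sample data are assumptions for illustration. The skewness measure used here is the quartile (Bowley) skewness coefficient, which is one way to turn "how far is each quartile from the median" into a number between -1 and 1; the text above does not prescribe a specific formula.

```python
import statistics

def tukey_outliers(data, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    # method="inclusive" is an assumed quartile convention.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

def quartile_skewness(data):
    """Quartile (Bowley) skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1).

    Near 0 for symmetric data, positive when Q3 is farther from the
    median than Q1 (right skew), negative for left skew.
    Undefined (division by zero) when Q1 equals Q3.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Made-up right-skewed data.
data = [1, 2, 3, 4, 5, 7, 10, 15, 40]
outliers = tukey_outliers(data)          # k=1.5: the usual outlier fences
extremes = tukey_outliers(data, k=3.0)   # k=3: the extreme-value fences
skew = quartile_skewness(data)
print(outliers, extremes, round(skew, 3))
```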
- Mahalanobis distance: This method calculates the distance of a point from the mean of a dataset, taking into account the covariance of the data.
It is commonly used in multivariate data analysis to identify outliers that deviate from the overall pattern of the data.
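- Identifying Outliers: Outliers are typically identified as points with a large Mahalanobis distance, which indicates that they are unlikely to belong to the same distribution as the rest of the data. A common convention, which assumes roughly multivariate normal data, compares the squared distance with a high quantile of the chi-squared distribution whose degrees of freedom equal the number of variables.
- Identifying clusters: Mahalanobis distance does not find clusters on its own, but it can serve as the distance measure inside a clustering algorithm, or be computed separately for each cluster once groups have been found with an algorithm such as K-means or DBSCAN.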
- Mahalanobis distance calculation: Mahalanobis distance can be calculated by using the following formula:
D^2 = (X – M)^T * S^-1 * (X – M)
Where X is a data point (as a vector), M is the mean vector of the dataset, S is the covariance matrix of the dataset, S^-1 is its inverse, and T denotes the transpose. D^2 is the squared distance; take its square root to obtain D.
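For two variables the formula above can be written out by hand, because the inverse of a 2 × 2 covariance matrix has a simple closed form. The sketch below uses only the standard library and is limited to the two-variable case; the function name `mahalanobis_squared_2d` and the sample points are assumptions, and for more variables a linear algebra library would be the practical choice.

```python
import statistics

def mahalanobis_squared_2d(points):
    """Return D^2 = (X - M)^T * S^-1 * (X - M) for each (x, y) point."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)  # mean vector M
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]] (divides by n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular and cannot be inverted")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # S^-1 = (1 / det) * [[syy, -sxy], [-sxy, sxx]], expanded by hand.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2)
    return distances

# Made-up data: eight points close to the line y = x and one point off the trend.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
          (6, 6.0), (7, 6.8), (8, 8.1), (3, 7)]
d2 = mahalanobis_squared_2d(points)
most_unusual = max(range(len(points)), key=lambda i: d2[i])
print(points[most_unusual])
```

The point (3, 7) is not extreme on either axis by itself, but it breaks the strong positive relationship between the two variables, which is the kind of outlier that covariance-aware distance is meant to catch.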
These methods should be used together with domain knowledge and not as a standalone solution, since a flagged point may be a data-entry error or a genuine and important observation. It is also worth checking whether the same points are flagged on different subsets of the data.