Noisy data is meaningless data. It includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text. Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis.
Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer sizes for coordinating synchronized data transfer, inconsistencies in naming conventions or data codes, and inconsistent formats for input fields (e.g., dates).
Noisy data can be handled using the following techniques:
Binning:
Binning methods smooth a sorted data value by consulting the values around it.
- The sorted values are distributed into a number of “buckets,” or bins (for example, equal-frequency bins, each holding the same number of values).
- Because binning methods consult the neighborhood of values around each data value, they perform local smoothing.
- In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
- In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.
- Each bin value is then replaced by the closest boundary value.
- In general, the larger the width, the greater the effect of the smoothing.
- Alternatively, bins may be equal-width, where the interval range of values in each bin is constant.
- Binning is also used as a discretization technique.
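As a rough illustration of these smoothing variants, the sketch below partitions sorted values into equal-frequency bins and replaces each value by its bin mean, bin median, or closest bin boundary. The sample values, the bin size of 3, and the function name smooth_by_bins are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of binning-based smoothing (equal-frequency bins).
# The data values and bin size are illustrative, not from the text.

def smooth_by_bins(values, bin_size, method="means"):
    """Sort the values, partition them into equal-frequency bins, and
    replace each value by its bin mean, bin median, or closest boundary."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_vals = data[start:start + bin_size]
        if method == "means":
            rep = sum(bin_vals) / len(bin_vals)
            smoothed.extend([rep] * len(bin_vals))
        elif method == "medians":
            rep = bin_vals[len(bin_vals) // 2]
            smoothed.extend([rep] * len(bin_vals))
        elif method == "boundaries":
            lo, hi = bin_vals[0], bin_vals[-1]
            # Each value is replaced by whichever bin boundary is closer.
            smoothed.extend([lo if v - lo <= hi - v else hi for v in bin_vals])
    return smoothed

prices = [8, 15, 21, 21, 24, 25, 28, 34, 4]   # illustrative values
print(smooth_by_bins(prices, bin_size=3, method="means"))
print(smooth_by_bins(prices, bin_size=3, method="boundaries"))
```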
Regression:
Here, data can be smoothed by fitting it to a function.
- Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
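The sketch below shows regression-based smoothing under the assumption that scikit-learn is available; the synthetic attributes and the noise level are made up for illustration. A line is fit to the two attributes, and the noisy values are replaced by the fitted values.

```python
# Minimal sketch of regression-based smoothing with scikit-learn.
# The synthetic data and the library choice are assumptions, not from the text.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)          # predictor attribute
y = 3.0 * x.ravel() + 10 + rng.normal(0, 5, size=50)   # noisy target attribute

model = LinearRegression().fit(x, y)   # find the "best" line for the two attributes
y_smoothed = model.predict(x)          # replace noisy values with fitted values

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```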
Clustering:
- Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
- Intuitively, values that fall outside of the set of clusters may be considered outliers.
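A minimal sketch of clustering-based outlier detection follows, assuming scikit-learn's KMeans is available; the sample points, the number of clusters, and the distance threshold are illustrative choices rather than anything prescribed above.

```python
# Minimal sketch of clustering-based outlier detection with k-means.
# Data, number of clusters, and the distance threshold are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two dense groups of points plus a couple of isolated values.
data = np.concatenate([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(10, 1, size=(50, 2)),
    [[30.0, 30.0], [-20.0, 15.0]],     # points that fall outside the clusters
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
centers = kmeans.cluster_centers_[kmeans.labels_]
dist = np.linalg.norm(data - centers, axis=1)   # distance to each point's centroid

threshold = dist.mean() + 3 * dist.std()        # simple distance-based cutoff
outliers = data[dist > threshold]
print("flagged outliers:\n", outliers)
```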
Cross Validation:
Cross-validation is a technique that helps tackle noisy data by preventing overfitting. In cross-validation, the dataset is split into three sets (rather than two):
- Training data
- Cross validation data
- Testing data
The algorithm is trained using the training data. However, the hyper-parameters are tuned using the cross-validation data, which is separate from the training data. This ensures that the algorithm avoids learning the noise present in the training data and instead generalizes. Finally, the fresh test data can be used to evaluate how well the algorithm was able to generalize.
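The sketch below shows one way to realize such a three-way split using scikit-learn's train_test_split; the dataset, the Ridge model, the candidate hyper-parameter values, and the split ratios are all illustrative assumptions.

```python
# Minimal sketch of a train / cross-validation / test split with scikit-learn.
# Dataset, model, and split ratios are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1, size=200)  # noisy labels

# First carve off the test set, then split the rest into training and cross-validation data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune a hyper-parameter (here Ridge's alpha) on the cross-validation data.
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    score = Ridge(alpha=alpha).fit(X_train, y_train).score(X_cv, y_cv)
    if score > best_score:
        best_alpha, best_score = alpha, score

# Evaluate generalization once, on the untouched test data.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("chosen alpha:", best_alpha, "test R^2:", final_model.score(X_test, y_test))
```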
Regularization:
The core of a machine learning algorithm is its ability to learn and generalize from the dataset it has seen. However, if the algorithm is given enough flexibility (more parameters), it may “overfit” the noisy data: the algorithm is fooled into believing that the noise in the data also represents a pattern. A commonly used technique to avoid this is regularization. In regularization, a penalty term that depends on the size of the weights (parameters) of the algorithm is added to the algorithm’s cost function. To minimize the cost, the weights must therefore stay small, which leaves the algorithm less freedom and greatly helps in avoiding overfitting. There are two commonly used techniques in regularization:
- L1 regularization: In L1 regularization, a term of |w_i| is added for each weight w_i, so the penalty is the sum of the absolute values of the weights. The absolute value is always non-negative, so larger weights increase the cost function.
- L2 regularization: In L2 regularization, a term of w_i^2 is added instead. Since the square is also non-negative, larger weights again increase the cost function.
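To make the penalty terms concrete, the sketch below adds an L1 or L2 term to a mean-squared-error cost; the cost form, the data, and the regularization strength lam are illustrative assumptions. (In scikit-learn, Lasso and Ridge provide L1- and L2-regularized linear regression, respectively.)

```python
# Minimal sketch of L1 and L2 penalty terms added to a squared-error cost.
# The cost form and the regularization strength lam are illustrative assumptions.
import numpy as np

def cost(w, X, y, lam=0.1, penalty="l2"):
    """Mean squared error plus a regularization term on the weights w."""
    residual = X @ w - y
    mse = np.mean(residual ** 2)
    if penalty == "l1":
        reg = lam * np.sum(np.abs(w))   # L1: sum of |w_i|
    else:
        reg = lam * np.sum(w ** 2)      # L2: sum of w_i^2
    return mse + reg

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 0.5, size=100)

w_small = np.array([2.0, 0.0, -1.0])
w_large = np.array([5.0, 4.0, -6.0])
print("L1 cost, small vs large weights:",
      cost(w_small, X, y, penalty="l1"), cost(w_large, X, y, penalty="l1"))
print("L2 cost, small vs large weights:",
      cost(w_small, X, y, penalty="l2"), cost(w_large, X, y, penalty="l2"))
```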