Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. Data reduction techniques are used to address this problem. They aim to increase storage efficiency and to reduce data storage and analysis costs.
The main techniques of data reduction are:
Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube, so that summarized values are stored in place of the detailed records.
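As an illustration of aggregation, the following minimal sketch rolls detailed records up into coarser summaries. The sales records, the field layout and the `aggregate` helper are hypothetical and chosen only for this example; a real data cube would usually be built with a database or OLAP tool.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, region, amount)
sales = [
    (2023, "Q1", "North", 120.0),
    (2023, "Q2", "North", 150.0),
    (2023, "Q1", "South", 80.0),
    (2023, "Q2", "South", 95.0),
    (2024, "Q1", "North", 130.0),
    (2024, "Q1", "South", 90.0),
]

def aggregate(records, key_indices):
    """Sum the amount (last field) for each combination of the chosen dimensions."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in key_indices)
        totals[key] += record[-1]
    return dict(totals)

# Roll up the quarters: one cell per (year, region).
by_year_region = aggregate(sales, key_indices=(0, 2))

# Roll up further: one cell per year.
by_year = aggregate(sales, key_indices=(0,))
```

Six detailed records shrink to four (year, region) cells, and then to two yearly totals. Each roll-up trades detail for a smaller data set.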
Attribute Subset Selection:
Only the highly relevant attributes should be kept; the rest can be discarded. To perform attribute selection, one can compare the p-value of each attribute with a chosen significance level: an attribute whose p-value is greater than the significance level can be discarded.
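The selection rule can be sketched as follows, assuming the p-values have already been computed elsewhere (for example, from a regression summary). The attribute names, the p-values and the `select_attributes` helper are hypothetical, and the 0.05 significance level is only a common default.

```python
def select_attributes(p_values, significance_level=0.05):
    """Split attributes into kept and discarded by comparing p-values to the level."""
    kept = [name for name, p in p_values.items() if p <= significance_level]
    discarded = [name for name, p in p_values.items() if p > significance_level]
    return kept, discarded

# Hypothetical p-values for four attributes.
p_values = {"age": 0.003, "income": 0.021, "zip_code": 0.47, "shoe_size": 0.81}

kept, discarded = select_attributes(p_values)
print("kept:", kept)
print("discarded:", discarded)
```

Only the kept attributes would be carried forward into the reduced data set.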
Numerosity Reduction:
This makes it possible to store a model of the data instead of the whole data, for example a regression model.
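A minimal sketch of this idea, assuming the data is well described by a straight line: the 1000 points are replaced by two numbers, a slope and an intercept. The data set and the `fit_line` and `predict` helpers are hypothetical. Real data rarely fits this cleanly, so the stored model would normally only approximate the original values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data set: 1000 points that lie exactly on a line.
xs = list(range(1000))
ys = [2.0 * x + 1.0 for x in xs]

# The stored "model" is just these two parameters.
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Reconstruct a value from the stored parameters instead of the raw data."""
    return slope * x + intercept
```

Instead of keeping every (x, y) pair, only `slope` and `intercept` need to be stored, and individual values are regenerated with `predict` when required.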
Dimensionality Reduction:
This reduces the size of the data by means of encoding mechanisms. The reduction can be lossy or lossless: if the original data can be retrieved exactly after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
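To make the lossless and lossy cases concrete, here is a minimal sketch of a single level of the Haar wavelet transform, the simplest wavelet. The signal, the threshold and the function names are hypothetical. The transform on its own does not shrink the data; the reduction comes from dropping small detail coefficients, which can then be stored sparsely.

```python
def haar_forward(values):
    """One level of the unnormalized Haar transform on an even-length list."""
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details

def haar_inverse(averages, details):
    """Rebuild each pair of values from its average and detail coefficient."""
    values = []
    for a, d in zip(averages, details):
        values.extend([a + d, a - d])
    return values

# Hypothetical signal with an even number of samples.
signal = [8.0, 6.0, 7.0, 7.0, 2.0, 4.0, 9.0, 9.5]
averages, details = haar_forward(signal)

# Lossless: keeping every coefficient reproduces the signal exactly.
lossless = haar_inverse(averages, details)

# Lossy: small detail coefficients are set to zero before reconstruction.
threshold = 0.5  # assumed cut-off, chosen only for this example
kept_details = [d if abs(d) >= threshold else 0.0 for d in details]
lossy = haar_inverse(averages, kept_details)
```

With all coefficients kept, `lossless` equals `signal`. After thresholding, the last pair of samples is reconstructed as their average, so `lossy` is close to the original but not identical to it.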
Data Compression
Data compression is the process of encoding, restructuring or otherwise modifying data in order to reduce its size. Fundamentally, it involves re-encoding information using fewer bits than the original representation.
Compression is performed by a program that uses an algorithm to find ways of reducing the size of the data. For example, an algorithm might represent a string of bits with a shorter string of bits by using a ‘reference dictionary’ to convert between them. Another approach inserts a reference or pointer to a string of data that the program has already seen. A good example of this occurs in image compression: when a sequence of colours such as ‘blue, red, red, blue’ recurs throughout the image, each occurrence can be replaced with a much shorter code, while still preserving the underlying information.
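The dictionary idea can be sketched with a toy encoder that replaces every occurrence of one known pattern with a short token. This is a simplified illustration, not how any particular image format works; the pixel list, the token value `0` and the `encode` and `decode` helpers are hypothetical.

```python
def encode(pixels, pattern, token):
    """Replace each non-overlapping occurrence of `pattern` with `token`."""
    encoded = []
    i = 0
    n = len(pattern)
    while i < len(pixels):
        if pixels[i:i + n] == pattern:
            encoded.append(token)
            i += n
        else:
            encoded.append(pixels[i])
            i += 1
    return encoded

def decode(encoded, pattern, token):
    """Expand every `token` back into the full pattern."""
    decoded = []
    for item in encoded:
        if item == token:
            decoded.extend(pattern)
        else:
            decoded.append(item)
    return decoded

# The "reference dictionary" here has a single entry: token 0 -> pattern.
pattern = ["blue", "red", "red", "blue"]
pixels = pattern + ["green"] + pattern + pattern + ["white"]

encoded = encode(pixels, pattern, token=0)
restored = decode(encoded, pattern, token=0)
```

The 14 colour values shrink to 5 symbols, and decoding with the same dictionary entry restores the original sequence exactly. Practical algorithms build the dictionary automatically rather than being handed the pattern.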
Compression is often broken down into two major forms, “Lossy” and “Lossless”. When choosing between the two methods, it is important to understand their strengths and weaknesses:
- Lossless Compression: Reduces size by locating and eliminating statistical redundancy. Because only redundancy is removed, no information is lost and the original data can be reconstructed exactly. Lossless compression will often have a smaller compression ratio, with the benefit of not losing any data in the file. This is often very important when absolute quality must be maintained, as with database information or professional media files. Formats such as FLAC and PNG offer lossless compression options.
- Lossy Compression: Reduces size by discarding less important information and reducing the complexity of the information that remains. Lossy compression can achieve much higher compression ratios, at the cost of possible degradation of file quality. JPEG offers lossy compression options, and MP3 is based on lossy compression.
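The difference between the two forms can be sketched with Python's standard `zlib` module, which is lossless. The sensor readings are hypothetical, and rounding to one decimal place stands in for the kind of detail a lossy method discards; real lossy formats such as JPEG and MP3 use far more sophisticated models.

```python
import zlib

# Hypothetical sensor readings, serialized as text with two decimal places.
readings = [20.13, 20.17, 20.21, 20.18, 20.16] * 200
raw = ",".join(f"{r:.2f}" for r in readings).encode("ascii")

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(raw)
assert zlib.decompress(packed) == raw

# Lossy step (an assumption for illustration): drop one decimal place first.
rounded = ",".join(f"{r:.1f}" for r in readings).encode("ascii")
packed_lossy = zlib.compress(rounded)

print(len(raw), len(packed))            # original size vs. lossless size
print(len(rounded), len(packed_lossy))  # size after the lossy step, before and after zlib
```

The lossless path reproduces every byte. The rounded version is shorter before compression even starts, but the second decimal place cannot be recovered from it.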