Data Cleaning: Missing Values, Noisy Data, Binning, Clustering, Regression, Computer and Human inspection, Inconsistent Data, Data Integration and Transformation

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.

After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.

Missing Values

This situation arises when some data is missing in the data. It can be handled in various ways.

Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the most probable value.

Noisy Data

Noisy data is a meaningless data that can’t be interpreted by machines. It can be generated due to faulty data collection, data entry errors etc. It can be handled in following ways:

Binning Method:

This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segmented is handled separately. One can replace all data in a segment by its mean or boundary values can be used to complete the task.

Regression:

Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

Clustering:

This approach groups the similar data in a cluster. The outliers may be undetected or it will fall outside the clusters.

Computer and Human inspection

Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure’s value reflected the “Surprise” content of the predicted character label with respect to the known label.

Inconsistent Data

There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect the violation of known data constraints. For example, known functional dependencies between attributes can be used to find values contradicting the functional constraints.

There may also be inconsistencies due to data integration, where a given attribute can have different names in different databases.

Data Integration

Data Integration is a data pre-processing technique that involves combining data from multiple heterogeneous data sources into a coherent data store and provide a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

The data integration approaches are formally defined as triple <G, S, M> where,

G stand for the global schema,

S stands for the heterogeneous source of schema,

M stands for mapping between the queries of source and global schema.

Approaches:

Tight Coupling:

Data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location through the process of ETL; Extraction, Transformation, and Loading.

Loose Coupling:

An interface is provided that takes the query from the user, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain the result.
And the data only remains in the actual source databases.

Data Transformation

This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves following ways:

Normalization:

It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

Discretization:

This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.

Concept Hierarchy Generation:

Here attributes are converted from lower level to higher level in hierarchy. For Example: The attribute “City” can be converted to “Country”.

How to Create a Good Landing Page for Your Coaching Business - espressocoder - Importance of Marketing as a Business Function and in the Economy

[…] So, nowadays, marketing matters more than ever: https://theintactone.com/2019/10/27/pom-u1-topic-2-importance-of-marketing-as-a-business-function-an… […]

How Businesses Can Ensure Smooth Collaboration with Their 3PL Fulfillment Partner: Key Strategies for Success - Transportation: Function, Cost & Mode of Transportation

[…] overall supply chain efficiency. Technological innovation also supports international shipping and transportation functions, promoting a more robust distribution […]

Tree planting, a hot topic… – Plant Cuttings - Forest resources, Use and over-exploitation, deforestation, Timber extraction, mining, dams and their effects on forest and tribal people, case studies

[…] trees to humankind has – sadly, but predictably – resulted in their unsustainable over-exploitation in nature (Kathleen Hermans-Neumann et…

The Service Concept: Design and Delivery – ctrade bot - Service Marketing Mix

[…] Image source: https://theintactone.com/2019/03/07/mm2-u1-topic-3-service-marketing-mix/ […]

Kayla Kapp - Foundations of individual behavior

please confirm who the author is for this article - I would like to put together a reference for my…

Leave a ReplyCancel reply

Dr. APJ Abdul Kalam Technical University MBA Notes (KMBN, KMB & RMB Series Notes)

Guru Gobind Singh Indraprastha University (BBA) Notes

Chaudhary Charan Singh University BBA Notes (Old & New Syllabus)

GGSIPU (MS113) Legal Aspects of Business

Difference between Memorandum and Articles of Association, Prospectus

Computation of Taxable income from Salary

Exemptions from Salary

Deductions from Salary

Meaning of Salary, Perquisites, Allowances

Residential Status, Determination, Types, Importance

Share this:

Like this:

Related

You might also like

Leave a ReplyCancel reply