Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily about seeing what the data can tell us beyond formal modeling, and in that respect it contrasts with traditional hypothesis testing. John Tukey promoted exploratory data analysis from 1970 onward to encourage statisticians to explore their data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.

The objectives of EDA are to:

  • Enable unexpected discoveries in the data.
  • Assess assumptions on which statistical inference will be based.
  • Suggest hypotheses about the causes of observed phenomena.
  • Support the selection of appropriate statistical tools and techniques.
  • Provide a basis for further data collection through surveys or experiments.

Steps Involved in Exploratory Data Analysis

  • Data Collection

Data collection is an essential part of exploratory data analysis. It refers to the process of finding data and loading it into your system. Good, reliable data can be found on public sites or bought from private organizations. Commonly used sources include Kaggle, GitHub, and the UCI Machine Learning Repository.
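As a rough illustration of the loading step, the sketch below parses CSV text with Python's standard csv module and counts the missing entries per column. The sample data, the column names, and the load_rows helper are all hypothetical; a real project would read a downloaded file instead of an in-memory string.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
RAW_CSV = """id,age,income
1,34,52000
2,,48000
3,29,
4,41,61000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, turning blank cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({key: (float(value) if value != "" else None)
                     for key, value in record.items()})
    return rows


rows = load_rows(RAW_CSV)
# Count missing entries per column as a first look at data quality.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}
print(len(rows), "rows; missing per column:", missing)
```

A first look at row counts and missing values like this usually decides which cleaning steps are needed next.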

  • Data Cleaning

Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results. Common cleaning steps include:

  • Removing missing values, outliers, and unnecessary rows or columns.
  • Re-indexing and reformatting your data.
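A minimal sketch of two of these cleaning steps, assuming records stored as dictionaries with None marking a missing value. The 1.5 × IQR fence used to flag outliers is one common convention rather than something prescribed above, and the records and helper names are hypothetical.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": 51000},
    {"age": 41, "income": 61000},
    {"age": 38, "income": 950000},  # suspiciously large value
    {"age": 45, "income": 58000},
    {"age": 31, "income": 49500},
]


def drop_missing(rows):
    """Keep only rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences located k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


complete = drop_missing(records)
low, high = iqr_bounds([r["income"] for r in complete])
cleaned = [r for r in complete if low <= r["income"] <= high]
print(len(records), "->", len(complete), "->", len(cleaned), "rows")
```

Whether a flagged value is a genuine error or a real but extreme observation is a judgment call, so it is worth inspecting outliers before discarding them.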

 

  • Univariate Analysis

In univariate analysis, you analyze one variable at a time. A variable is a single feature (column) of your dataset. You can do this graphically, or non-graphically by computing summary statistics such as the mean, median, and quartiles. Some visual methods include:

  • Box plots: summarize a distribution as a box spanning the first to third quartiles with a line at the median, plus whiskers extending toward the extremes.
  • Histograms: bar plots in which each rectangular bar shows how many values fall within a given range (bin).
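The numbers behind both plots can be computed without a plotting library. The sketch below, on hypothetical values, derives the five-number summary that a box plot draws and the per-bin counts that a histogram draws. The "inclusive" quartile method is an assumption, since several quartile conventions exist and give slightly different results.

```python
import statistics
from collections import Counter

values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 12]  # hypothetical measurements


def five_number_summary(data):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


def histogram_counts(data, bin_width):
    """Count values per bin; the bin [start, start + bin_width) is keyed by start."""
    counts = Counter((v // bin_width) * bin_width for v in data)
    return dict(sorted(counts.items()))


summary = five_number_summary(values)
bins = histogram_counts(values, bin_width=3)
print("five-number summary:", summary)
for start, count in bins.items():
    # Text rendering of the histogram: one '#' per observation in the bin.
    print(f"[{start}, {start + 3}) {'#' * count}")
```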

 

  • Bivariate Analysis

Here, you compare two variables to see how one feature relates to the other. This is typically done with scatter plots, which plot individual data points, or with correlation matrices, which show the strength of each pairwise correlation as a color hue. You can also use box plots.
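The values shown in a correlation matrix can be sketched as follows, assuming the Pearson correlation coefficient is the measure of interest. The three columns are hypothetical, and pearson_r is an illustrative helper written out by hand rather than a library call.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns of a small dataset.
columns = {
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
    "errors": [9, 7, 6, 4, 2],
}

# Pairwise correlation matrix as a nested dict.
matrix = {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
          for a in columns}
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

A coefficient near +1 or -1 indicates a strong linear association, but it does not by itself show that one variable causes the other.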

Techniques and tools

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.

Typical graphical techniques used in EDA are:

  • Box plot
  • Histogram
  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Scatter plot (2D/3D)
  • Stem-and-leaf plot
  • Parallel coordinates
  • Odds ratio
  • Targeted projection pursuit
  • Heat map
  • Bar chart
  • Horizon graph
  • Glyph-based visualization methods such as PhenoPlot and Chernoff faces
  • Projection methods such as grand tour, guided tour and manual tour
  • Interactive versions of these plots
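
Most of these displays need a plotting library, but a stem-and-leaf plot is plain text. The sketch below is one simple way to build it, assuming non-negative integers with the tens as the stem and the units as the leaf. The sample data and the stem_and_leaf helper are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines for non-negative integers (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


data = [12, 15, 21, 23, 23, 28, 34, 47, 49, 61]  # hypothetical observations
lines = stem_and_leaf(data)
print("\n".join(lines))
```

Each row keeps the individual values readable while the row lengths show the shape of the distribution.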

Dimensionality reduction:

  • Multidimensional scaling
  • Principal component analysis (PCA)
  • Multilinear PCA
  • Nonlinear dimensionality reduction (NLDR)
  • Iconography of correlations
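
To make the idea behind PCA concrete, here is a sketch restricted to two variables, where the eigenvalues and principal direction of the 2 × 2 covariance matrix have a closed form. This restriction is an assumption made to keep the example dependency-free; real analyses would normally use a numerical library. The pca_2d helper and the sample points are hypothetical.

```python
import math


def pca_2d(points):
    """First principal direction, its share of variance, and 1-D scores."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    largest, smallest = half_trace + radius, half_trace - radius
    if largest + smallest == 0:
        raise ValueError("all points are identical; no variance to explain")
    # Angle of the eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    # Project the centered points onto the first principal direction.
    scores = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
              for x, y in points]
    return direction, largest / (largest + smallest), scores


# Hypothetical, roughly linear measurements.
sample = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained, scores = pca_2d(sample)
print("direction:", direction, "explained variance ratio:", round(explained, 4))
```

When the explained variance ratio is close to 1, the single score per point preserves almost all of the spread in the original two columns.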

Typical quantitative techniques are:

  • Median polish
  • Trimean
  • Ordination
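
Two of these techniques are short enough to sketch directly. The trimean is the weighted average (Q1 + 2 × median + Q3) / 4, and median polish repeatedly sweeps row and column medians out of a two-way table so that each cell equals overall + row effect + column effect + residual. The quartile method ("inclusive"), the fixed number of sweeps, and the sample data are assumptions made for illustration.

```python
import statistics


def trimean(data):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * median + q3) / 4


def median_polish(table, sweeps=10):
    """Split a two-way table into overall, row, column, and residual parts."""
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols
    for _ in range(sweeps):
        # Move each row's median from the residuals into the row effects.
        for i in range(n_rows):
            m = statistics.median(residuals[i])
            residuals[i] = [v - m for v in residuals[i]]
            row_effects[i] += m
        m = statistics.median(col_effects)
        col_effects = [c - m for c in col_effects]
        overall += m
        # Move each column's median from the residuals into the column effects.
        for j in range(n_cols):
            m = statistics.median(row[j] for row in residuals)
            for row in residuals:
                row[j] -= m
            col_effects[j] += m
        m = statistics.median(row_effects)
        row_effects = [r - m for r in row_effects]
        overall += m
    return overall, row_effects, col_effects, residuals


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical data with one outlier
print("mean:", statistics.mean(skewed), "trimean:", trimean(skewed))

table = [[6, 7, 9], [9, 10, 12], [10, 11, 13]]  # hypothetical two-way table
overall, row_effects, col_effects, residuals = median_polish(table)
print(overall, row_effects, col_effects)
```

Both techniques rely on medians and quartiles rather than means, which is why a single extreme value barely moves them.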

Software

  • JMP, an EDA package from SAS Institute.
  • KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
  • Minitab, an EDA and general statistics package widely used in industrial and corporate settings.
  • Orange, an open-source data mining and machine learning software suite.
  • Python, an open-source programming language widely used in data mining and machine learning.
  • R, an open-source programming language for statistical computing and graphics. Together with Python, it is one of the most popular languages for data science.
  • TinkerPlots, EDA software for upper elementary and middle school students.
  • Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.

Exploratory data analysis tools

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means clustering, an unsupervised learning method that assigns data points to K groups (K being the number of clusters) based on their distance from each group’s centroid. The data points closest to a particular centroid are placed in the same cluster. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression.
  • Predictive models, such as linear regression, which use statistics and data to predict outcomes.
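
The K-means procedure described above can be sketched in a few lines. This version uses the first K points as starting centroids, which is a simplifying assumption that keeps the run deterministic; practical implementations usually pick the starting centroids at random or with a smarter seeding scheme. The points and the kmeans helper are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain K-means: assign each point to its nearest centroid, then update."""
    centroids = [points[i] for i in range(k)]  # naive start: the first k points
    labels = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            # Index of the centroid closest to p (Euclidean distance).
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
            labels.append(nearest)
        new_centroids = []
        for c in range(k):
            if clusters[c]:
                # Mean of each coordinate over the cluster's members.
                new_centroids.append(tuple(sum(coord) / len(clusters[c])
                                           for coord in zip(*clusters[c])))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

The result depends on K and on the starting centroids, so in exploratory work it is common to try several values of K and compare the resulting groupings.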

Types of exploratory data analysis

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Non-graphical methods don’t provide a full picture of the data, so graphical methods are also required. Common types of univariate graphics include:
  1. Stem-and-leaf plots, which show all data values and the shape of the distribution.
  2. Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
  3. Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot (bar chart), in which each group represents one level of one variable and each bar within a group represents a level of the other variable.
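
Cross-tabulation, the main multivariate non-graphical technique mentioned above, amounts to counting how often each combination of categories occurs. A minimal sketch follows, with hypothetical observations and a hypothetical crosstab helper. The same counts are also what a grouped bar plot would draw.

```python
from collections import Counter

# Hypothetical (region, purchased) observations.
observations = [
    ("north", "yes"), ("north", "no"), ("south", "yes"),
    ("south", "yes"), ("north", "no"), ("south", "no"),
]


def crosstab(pairs):
    """Nested dict of counts: table[row_level][column_level]."""
    counts = Counter(pairs)
    row_levels = sorted({a for a, _ in pairs})
    col_levels = sorted({b for _, b in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}


table = crosstab(observations)
for region, row in table.items():
    print(region, row)
```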
