Data analysis can be defined as the process of reviewing and evaluating data gathered from different sources. Data cleaning is very important because it eliminates redundant information and helps in reaching accurate conclusions. Data analysis is the systematic process of cleaning, inspecting and transforming data with the help of various tools and techniques. Its objective is to identify useful information that supports the decision-making process. There are various methods of data analysis, including data mining, data visualization and Business Intelligence. Analysis of data helps in summarizing results through the examination and interpretation of useful information. It also helps in determining the quality of the data and in answering the questions that are of use to the researcher.
To solve the problem and reach specific, high-quality results, various statistical techniques can be applied. These techniques help the researcher obtain accurate results by drawing relationships between different variables. The statistical techniques can mainly be divided into two types:
A) Parametric Test
B) Non-Parametric Test
Parametric statistics assume that the sample data follow a distribution defined by certain fixed parameters. They take the properties of the population into consideration: the sample is assumed to be drawn from the population, and the population is assumed to be normally distributed. Every member of the population is assumed to have an equal chance of being included in the sample. A parametric test rests on these assumptions, which must hold for its results to be valid. Parametric tests include Analysis of Variance (ANOVA), the Z test, the T test, the Chi Square test (which is often classed as non-parametric, since it does not assume a normal population), Pearson’s coefficient of correlation and Regression analysis.
The T-test helps in identifying whether a sample mean differs significantly from a given value, or whether the means of two samples differ significantly from each other. It is based on the T-distribution (Student's t-distribution). The T-test is conducted when the sample size is small and the variance of the population is not known. As a rule of thumb, it is used when the sample size (n) is not larger than 30. There are two types of T-Test:
- Dependent mean T Test - It is used when the same subjects or groups are measured twice, for example before and after a treatment.
- Independent mean T Test - It is used when two different groups are tested. The two groups have faced different conditions.
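The formula for the T-Test on a single sample mean is:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu$ is the hypothesized population mean, $s$ is the sample standard deviation and $n$ is the sample size.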
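The two types of T-Test described above can be sketched in Python using only the standard library. This is a minimal illustration, not a full test procedure: the function names and the scores are made up, the independent version assumes both groups share the same variance (the pooled-variance form), and only the t statistic and degrees of freedom are computed, which would then be compared with a critical value from a t table.

```python
import math
import statistics


def paired_t(first, second):
    """t statistic and degrees of freedom for a dependent (paired) t-test."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample (n - 1) deviation.
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Assumption: both groups have equal population variance.
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / standard_error, n_a + n_b - 2


# Made-up scores: the same five subjects measured before and after.
t_paired, df_paired = paired_t([10, 12, 9, 11, 13], [12, 14, 10, 13, 16])
# Made-up scores for two unrelated groups.
t_indep, df_indep = independent_t([5, 7, 9], [1, 2, 3])
print(round(t_paired, 3), df_paired)
print(round(t_indep, 3), df_indep)
```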
Z Test
The Z test is used when the population is normally distributed. The sample may be large or small, but the variance of the population must be known. It is used for comparing a sample mean with the population mean, or for identifying whether the difference between the means of two independent samples is significant. For a given significance level, the Z test relies on a single critical value that does not change with the sample size, which makes the test more convenient.
The formula for the Z test is:
$$z = \frac{X - \mu}{\sigma}$$
X - Value being tested
µ - Population mean
σ - Standard deviation of the population
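As a rough illustration, the Z statistic can be computed with Python's standard library. The function names and the figures in the example are invented for illustration. The second function assumes the value being tested is a sample mean, so σ is divided by √n, and it reports a two-sided p-value from the standard normal distribution.

```python
import math
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Z value of a single observation x, given population mean and deviation."""
    return (x - mu) / sigma


def one_sample_z_test(sample_mean, mu, sigma, n):
    """Z statistic and two-sided p-value for a sample mean with known sigma."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical figures: population mean 50, known sigma 10, sample of 100.
z, p_value = one_sample_z_test(sample_mean=52, mu=50, sigma=10, n=100)
print(round(z, 2), round(p_value, 4))
```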
Analysis of Variance (ANOVA)
Analysis of Variance is used to compare the means of groups that are defined by one or more categorical variables. It is mainly of two types: a) One-way ANOVA, b) Two-way ANOVA. One-way ANOVA is used when the means of three or more groups are compared, with the same variable measured in each group. Two-way ANOVA is used to discover whether two independent variables, individually or in combination, have an effect on a dependent variable. Analysis of Variance is based on several assumptions. It assumes that the dependent variable is measured on a continuous scale. The independent variables are categorical, and each should have at least two categories. It also assumes that the population is normally distributed and that no significant outliers are present.
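A one-way ANOVA can be sketched as follows. This is a minimal version under the assumptions listed above; the function name and the three groups of values are hypothetical. It returns the F statistic together with its degrees of freedom, which would then be compared with a critical value from an F table.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up measurements of the same variable in three groups.
f_stat, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2), df_between, df_within)
```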
Chi Square Test
This test is also known as Pearson’s chi-square test. It is used to find out whether there is a relationship between two categorical variables. Both variables should be measured at the categorical level, and each should consist of two or more independent groups.
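The chi-square statistic compares the observed counts in a contingency table with the counts expected if the two variables were unrelated. The sketch below is illustrative only: the function name and the table are made up, and no continuity correction is applied.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows are two groups, columns are two response categories.
chi2, df = chi_square_independence([[10, 20], [20, 10]])
print(round(chi2, 3), df)
```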
Coefficient of Correlation
Pearson’s coefficient of correlation is used to measure the association between two variables. It is denoted by ‘r’. The value of r ranges from -1 to +1. The coefficient of correlation is used to identify whether there is a positive association, a negative association or no association between two variables. When the value is 0, it indicates that there is no linear association between the two variables. When it is less than 0, it indicates a negative association, and when it is more than 0, it indicates a positive association.
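The coefficient can be computed directly from its definition, as the covariance of the two variables divided by the product of their spreads. The function name and the data below are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson's coefficient of correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Made-up paired observations with a positive association.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(r, 2))
```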
Regression Analysis
Regression analysis is used to predict the value of one variable based on the value of another variable. The variable whose value is predicted is called the dependent variable, and the variable used to predict it is called the independent variable. The assumptions of regression analysis are that the variables are measured at the continuous level and that there is a linear relationship between the two variables.
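For a single independent variable, the prediction line can be fitted by ordinary least squares. The following is a minimal sketch with a hypothetical function name and made-up data; it assumes the linear relationship described above.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: x is the independent variable, y the dependent variable.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 5, 8])
predicted = slope * 5 + intercept  # predicted y for a new x of 5
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```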
Non-parametric statistics do not make assumptions about the parameters of the population. The data may be ordinal and need not be normally distributed. A non-parametric test is also known as a distribution-free test. These tests are comparatively simpler than parametric tests. Non-parametric tests include the Fisher-Irwin Test, the Wilcoxon Matched-Pairs Test (Signed rank test), the Wilcoxon rank-sum test, the Kruskal-Wallis Test and Spearman’s Rank Correlation test.
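As one example of a non-parametric technique, Spearman's rank correlation replaces each value by its rank and then computes an ordinary correlation on the ranks, so only the ordering of the data matters. The sketch below is illustrative: the function names and data are assumptions, and tied values are given the average of the ranks they span.

```python
import math


def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: the correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cross / (spread_x * spread_y)


# Made-up ordinal scores; the second list contains a tie.
rho = spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(round(rho, 3))
```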