Statistical Measures in Large Databases
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. In data mining, any situation can be analyzed in two ways:
Statistical Analysis: In statistical analysis, data is collected, analyzed, explored, and presented to identify patterns and trends. It is also referred to as quantitative analysis.
Non-statistical Analysis: This analysis yields generalized information and covers non-numeric data such as sound, still images, and video.
In statistics, there are two main categories:
Descriptive Statistics: The purpose of descriptive statistics is to organize data and summarize its main characteristics, using graphs or numbers. The mean, mode, standard deviation (SD), and correlation are some of the commonly used descriptive statistical measures.
Inferential Statistics: The process of drawing conclusions about a population by applying probability theory to generalize from the data. By analyzing sample statistics, you can infer population parameters and build models of relationships within the data.
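The descriptive measures named above (mode, standard deviation, correlation) can be computed with Python's standard `statistics` module. This is a minimal sketch; the age and income values are made-up illustrative data, and the Pearson correlation is computed directly from its definition:

```python
import statistics

ages = [25, 30, 35, 40, 45]
incomes = [20000, 26000, 31000, 35000, 42000]

mode = statistics.mode([1, 2, 2, 3])  # most frequent value -> 2
sd = statistics.stdev(incomes)        # sample standard deviation

# Pearson correlation computed from its definition:
# covariance divided by the product of the deviations' magnitudes
def correlation(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(mode, sd, correlation(ages, incomes))
```

Here the correlation comes out close to 1, reflecting that income rises almost linearly with age in this toy data set.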
There are various statistical terms that one should be aware of while dealing with statistics. Some of these are:
- Quantitative Variable
- Qualitative Variable
- Discrete Variable
- Continuous Variable
Mean: The arithmetic average is computed simply by adding together all the values and dividing by the number of values; it therefore uses data from every single value. Let x1, x2, …, xn be a set of n values or observations, such as salaries. Then

Mean = (x1 + x2 + … + xn) / n
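The formula above translates directly into code. A minimal sketch, using made-up salary figures:

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by their count."""
    return sum(values) / len(values)

salaries = [18000, 20000, 22000, 24000]
print(mean(salaries))  # 21000.0
```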
There are two types of statistics-based algorithms, which are as follows:
Regression: Regression problems deal with the estimation of an output value based on input values. When used for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification problems, but it is also used for other applications such as forecasting. The elementary form of regression is simple linear regression, which involves only a single predictor and a prediction.
Regression can be used to implement classification in two different ways:
Division: The data are divided into regions based on class.
Prediction: Formulas are created to predict the output class’s value.
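The two approaches above can be sketched with simple linear regression fit by least squares. This is an illustrative toy example, not a production classifier: the incomes and class labels are hypothetical, and class 1 stands for "will buy":

```python
# Fit y = a + b*x by ordinary least squares.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# income (in thousands) -> class label: 0 = will not buy, 1 = will buy
incomes = [10, 15, 20, 30, 40, 50]
labels = [0, 0, 0, 1, 1, 1]

a, b = fit(incomes, labels)

# Prediction: the fitted formula predicts the output class's value.
def predict_class(income):
    return 1 if a + b * income >= 0.5 else 0

# Division: the income where a + b*x = 0.5 splits the input into
# two regions, one per class.
boundary = (0.5 - a) / b
print(predict_class(12), predict_class(45), boundary)
```

With this data, low incomes fall in the class-0 region and high incomes in the class-1 region, with the boundary at the point where the fitted line crosses 0.5.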
Bayesian Classification: Bayesian classifiers are statistical classifiers based on Bayes' theorem. They exhibit high accuracy and speed when applied to large databases.
Let X be a data tuple. In the Bayesian method, X is treated as “evidence.” Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence,” or observed data tuple, X.
P(H|X) is the posterior probability of H conditioned on X. For instance, suppose the world of data tuples is confined to users described by the attributes age and income, and that X is a 30-year-old user with an income of Rs. 20,000. Suppose that H is the hypothesis that the user will purchase a computer. Then P(H|X) reflects the probability that user X will purchase a computer given that the user’s age and income are known.
P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age, income, or any other information. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.
P(H), P(X|H), and P(X) can be estimated from the given data. Bayes’ theorem provides a way of computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by
P(H|X) = P(X|H)P(H) / P(X)
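The computer-purchase example above can be worked through numerically. The counts below are hypothetical, chosen only to illustrate the arithmetic of Bayes' theorem:

```python
# Hypothetical counts: out of 100 users, 40 purchased a computer (H).
# 20 of the 100 users are "30 years old with Rs. 20,000 income" (X),
# and 10 of the 40 buyers match X.
p_h = 40 / 100          # prior P(H)
p_x = 20 / 100          # evidence P(X)
p_x_given_h = 10 / 40   # likelihood P(X|H)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.5
```

So although only 40% of users buy a computer overall, knowing that a user matches X raises the estimated probability of a purchase to 50%.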