Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Data analysis is rooted in statistics, which has a long history. Statistics is said to have begun in ancient Egypt, where periodic censuses were taken to organize the building of the pyramids. Throughout history, statistics has played an important role for governments across the world, which conducted censuses for various planning activities (including, of course, taxation). Once data has been collected, the next step is its analysis. Data analysis is a process that begins with retrieving data from various sources and then analyzing it with the goal of discovering beneficial information. For example, an analysis of population growth by district can help a government determine how many hospitals will be needed in a given area.
Data mining is a particular data analysis technique that focuses on statistical modelling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).
EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a form of unstructured data. All of the above are varieties of data analysis. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.
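As a small illustration of the descriptive side of this divide, the following Python sketch (standard library only; the sales figures are invented for the example) summarizes a dataset before any hypothesis is confirmed or falsified:

```python
import statistics

# Hypothetical daily sales figures for one week (invented data)
sales = [120, 135, 128, 150, 149, 90, 95]

# Descriptive statistics: summarize what the data looks like
mean_sales = statistics.mean(sales)      # central tendency
median_sales = statistics.median(sales)  # robust to outliers
spread = statistics.stdev(sales)         # variability

print(f"mean={mean_sales:.1f} median={median_sales} stdev={spread:.1f}")
```

Exploratory work would continue from such summaries toward plots and new questions; confirmatory analysis would instead test a pre-stated hypothesis (for instance, that weekend sales differ from weekday sales) against the same data.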
Notable free software for data analysis includes:
- DevInfo: A database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
- ELKI: Data mining framework in Java with data mining oriented visualization functions.
- KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
- Orange: A visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning.
- Pandas: Python library providing data structures and tools for data analysis.
- PAW: FORTRAN/C data analysis framework developed at CERN.
- R: A programming language and software environment for statistical computing and graphics.
- ROOT: C++ data analysis framework developed at CERN.
- SciPy: Python library for scientific computing, with modules for statistics, optimization, and signal processing.
- Julia: A programming language well-suited for numerical analysis and computational science.
Quickly zooming over the next five millennia
Other concepts, such as that of zero, came into place over thousands of years. Zero itself was only invented about 1,400 years ago in India, and took some 500 years to reach the West (Pythagoras was able to devise his theorem without a functional zero). A few earlier civilisations had used a sort of zero, but not one rigorous enough to stand as a number of equal importance to other numbers. More than a few people attribute many of the great advances of the last 1,400 years to this invention.
Relational Databases were invented by Edgar F. Codd in the 1970s and became quite popular in the 1980s. Relational database management systems (RDBMSs), in turn, allowed users to write queries in SQL (Structured Query Language, originally called SEQUEL) and retrieve data from their databases. Relational Databases and SQL provided the advantage of being able to analyze data on demand, and are still used extensively. They are easy to work with, and very useful for maintaining accurate records. On the negative side, RDBMSs are generally quite rigid and were not designed to handle unstructured data.
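As a minimal sketch of what "analyzing data on demand" looks like in practice, the following example uses Python's built-in sqlite3 module; the table, columns, and rows are invented for illustration:

```python
import sqlite3

# An in-memory relational database (invented schema for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Edgar", "London")],
)

# On-demand analysis: a declarative SQL query answers a question
# that was not anticipated when the table was designed
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('London', 2), ('New York', 1)]
```

The query is independent of how the rows were inserted, which is exactly the flexibility the relational model introduced; the rigidity lies in the fixed schema, which every row must conform to.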
During the mid-1990s, the internet became extremely popular, but relational databases could not keep up. The immense flow of information, combined with the variety of data types coming from many different sources, led to non-relational databases, also referred to as NoSQL. A NoSQL database can store and process data in many different formats, and avoids SQL's rigidity by trading rigidly organized storage for greater flexibility.
In the late 1980s, the amount of data being collected continued to grow significantly, in part due to the lower costs of hard disk drives. During this time, the architecture of Data Warehouses was developed to help transform data coming from operational systems into decision-making support systems. Data Warehouses are normally hosted in the Cloud or on an organization’s mainframe server. Unlike operational databases, a Data Warehouse is normally optimized for quick response times to queries. In a Data Warehouse, data is often stored with a timestamp, and operational commands, such as DELETE or UPDATE, are used less frequently. If all sales transactions were stored with timestamps, an organization could use a Data Warehouse to compare the sales trends of each month.
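The monthly sales comparison above can be sketched with a simplified, append-only table queried through Python's built-in sqlite3 module; the schema and figures are invented for the example:

```python
import sqlite3

# Simplified warehouse-style table: append-only sales facts with timestamps
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-05 10:00:00", 100.0),
        ("2024-01-20 15:30:00", 50.0),
        ("2024-02-03 09:10:00", 75.0),
    ],
)

# Monthly trend: aggregate by the month derived from each timestamp;
# rows are only ever inserted, never updated or deleted
trend = conn.execute(
    "SELECT strftime('%Y-%m', sold_at) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(trend)  # [('2024-01', 150.0), ('2024-02', 75.0)]
```

A production warehouse would hold far more history and pre-aggregate common queries, but the pattern is the same: timestamps plus aggregation, with few in-place modifications.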
The term Business Intelligence (BI) was first used in 1865, and was later adapted by Howard Dresner at Gartner in 1989, to describe making better business decisions through searching, gathering, and analyzing the accumulated data saved by an organization. Using the term “Business Intelligence” as a description of decision-making based on data technologies was both novel and far-sighted. Large companies first embraced BI in the form of analyzing customer data systematically, as a necessary step in making business decisions.
Data Mining began in the 1990s and is the process of discovering patterns within large data sets. Analyzing data in non-traditional ways provided results that were both surprising and beneficial. The use of Data Mining came about directly from the evolution of database and Data Warehouse technologies, which allowed organizations to store more data while still analyzing it quickly and efficiently. As a result, businesses started predicting the potential needs of customers, based on an analysis of their historical purchasing patterns.
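A toy version of mining purchasing patterns is counting which items are bought together; the sketch below uses only the Python standard library, and the baskets are invented data (real data mining would also compute support and confidence over far larger sets):

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (invented data): one basket per customer
baskets = [
    {"jeans", "belt"},
    {"jeans", "belt", "socks"},
    {"socks", "shirt"},
    {"jeans", "belt"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('belt', 'jeans') 3
```

From such co-occurrence counts a retailer might infer that jeans buyers are likely to want belts, which is the kind of surprising-but-useful pattern the paragraph describes.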
However, data can be misinterpreted. Someone in the trades, having purchased two pairs of blue jeans online, probably won’t want to buy jeans for another two or three years. Targeting this person with blue jean advertisements is both a waste of time and an irritant to the potential customer.
In 2005, Big Data was given that name by Roger Magoulas. He was describing a large amount of data, which seemed almost impossible to cope with using the Business Intelligence tools available at the time. In the same year, Hadoop, which could process Big Data, was developed. Hadoop grew out of Nutch, an open-source web crawler, combined with ideas from Google’s MapReduce paper.
Apache Hadoop is an open-source software framework, which can process both structured and unstructured data, streaming in from almost all digital sources. This flexibility allows Hadoop (and its sibling open-source frameworks) to process Big Data. During the late 2000s, several open-source projects, such as Apache Spark and Apache Cassandra, came about to deal with this challenge.
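Hadoop's processing model is MapReduce, and its essence can be sketched in plain Python; this is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop itself (which distributes each phase across a cluster), and the documents are invented:

```python
from collections import defaultdict

# Toy documents (invented) standing in for data split across many machines
documents = ["big data big ideas", "data streams in from many sources"]

# Map phase: each document independently emits (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values into a final count
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts["big"], word_counts["data"])  # 2 2
```

Because the map step needs no coordination between documents and the reduce step only sees one key at a time, each phase can be spread across many machines, which is what lets frameworks like Hadoop scale to Big Data.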
Analytics in the Cloud
In its early form, the Cloud was a phrase used to describe the “empty space” between users and provider. Then, in 1997, Emory University professor Ramnath Chellappa described Cloud Computing as a new “computing paradigm where the boundaries of computing will be determined by economic rationale, rather than technical limits alone.”