Data requirements definition establishes the process used to identify, prioritize, precisely formulate, and validate the data needed to achieve business objectives. When documenting data requirements, data should be referenced in business language, reusing approved standard business terms if available. If business terms have not yet been standardized and approved for the data within scope, the data requirements process provides the occasion to develop them.
The data requirements analysis process employs a top-down approach that emphasizes business-driven needs, and the analysis is conducted to ensure that the identified requirements are relevant and feasible. The process incorporates data discovery and assessment in the context of explicitly qualified business data consumer needs.
A business requirements document (BRD) details the business solution for a project including the documentation of customer needs and expectations. If an initiative intends to modify existing (or introduce new) hardware/software, a new BRD should be created.
The BRD process can be incorporated within a Six Sigma DMAIC (Define, Measure, Analyze, Improve, Control) culture.
The most common objectives of the BRD are:
- To gain agreement with stakeholders.
- To provide input into the next phase of the project.
- To provide a foundation for communicating to a technology service provider what the solution needs to do to satisfy the customer's and the business's needs.
- To describe what, not how, the customer's and business's needs will be met by the solution.
Steps in this phase of the process:
Identify relevant stakeholders: Stakeholders may be identified through a review of existing system documentation or may be identified by the data quality team through discussions with business analysts, enterprise analysts, and enterprise architects. The pool of relevant stakeholders may include business program sponsors, business application owners, business process managers, senior management, information consumers, system owners, as well as frontline staff members who are the beneficiaries of shared or reused data.
Acquire documentation: The data quality analyst must become familiar with overall goals and objectives of the target information platforms to provide context for identifying and assessing specific information and data requirements. To do this, it is necessary to review existing artifacts that provide details about the consuming systems, requiring a review of project charters, project scoping documents, requirements, design, and testing documentation. At this stage, the analysts should accumulate any available documentation artifacts that can help in determining collective data use.
Document goals and objectives: Determining existing performance measures and success criteria provides a baseline representation of high-level system requirements for summarization and categorization. Conceptual data models may exist that can provide further clarification and guidance regarding the functional and operational expectations of the collection of target systems.
Summarize scope of capabilities: Create graphic representations that convey the high-level functions and capabilities of the targeted systems, as well as providing detail of functional requirements and target user profiles. When combined with other context knowledge, one may create a business context diagram or document that summarizes and illustrates the key data flows, functions, and capabilities of the downstream information consumers.
Document impacts and constraints: Constraints are conditions that affect or prevent the implementation of system functionality, whereas impacts are potential changes to characteristics of the environment to accommodate the implementation of system functionality. Identifying and understanding all relevant impacts and constraints to the target systems is critical, because the impacts and constraints often define, limit, and frame the data controls and rules that will be managed as part of the data quality environment. In addition, source-to-target mappings may be affected by constraints or dependencies associated with the selection of candidate data sources.
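The artifacts gathered in the steps above can be captured in a simple structured form so that requirements, stakeholders, impacts, and constraints stay together. The following is a minimal sketch in Python; the field names and the example requirement are illustrative assumptions, not part of any standard template.

```python
from dataclasses import dataclass, field

@dataclass
class DataRequirement:
    """One documented data requirement, phrased in business language."""
    business_term: str      # approved standard business term, if one exists
    definition: str         # precise business-language formulation
    stakeholders: list      # who needs this data and why
    constraints: list = field(default_factory=list)  # conditions limiting implementation
    impacts: list = field(default_factory=list)      # environment changes needed

# Hypothetical example entry
req = DataRequirement(
    business_term="Customer Lifetime Value",
    definition="Projected net revenue attributed to a customer relationship",
    stakeholders=["business program sponsor", "information consumers"],
    constraints=["source system retains only 24 months of order history"],
)
```

Keeping each requirement in one record makes it straightforward to review constraints and impacts per requirement during the documentation step.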
Data acquisition is the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems, abbreviated by the initialisms DAS, DAQ, or DAU, typically convert analog waveforms into digital values for processing. The components of data acquisition systems include:
- Sensors, to convert physical parameters to electrical signals.
- Signal conditioning circuitry, to convert sensor signals into a form that can be converted to digital values.
- Analog-to-digital converters, to convert conditioned sensor signals to digital values.
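The chain above (sensor, signal conditioning, analog-to-digital conversion) can be illustrated with a toy simulation. This is a minimal sketch assuming an ideal n-bit converter over a fixed input range; the signal, sample rate, and voltage range are made up for illustration.

```python
import math

def quantize(voltage, v_min=-5.0, v_max=5.0, bits=12):
    """Ideal n-bit ADC: map a conditioned analog voltage to a digital code."""
    levels = 2 ** bits
    # Clamp to the converter's input range, then scale to an integer code.
    clamped = max(v_min, min(v_max, voltage))
    return round((clamped - v_min) / (v_max - v_min) * (levels - 1))

# Sample a 50 Hz sine "sensor signal" at 1 kHz for one period.
fs, f, amplitude = 1000, 50, 3.0
samples = [quantize(amplitude * math.sin(2 * math.pi * f * n / fs))
           for n in range(fs // f)]
```

Each element of `samples` is a digital numeric value a computer can manipulate; a real DAQ system would additionally account for converter nonlinearity, noise, and anti-aliasing filtering in the signal-conditioning stage.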
Data acquisition applications are usually controlled by software developed in various general-purpose programming languages and environments such as Assembly, BASIC, C, C++, C#, Fortran, Java, LabVIEW, Lisp, Pascal, etc. Stand-alone data acquisition systems are often called data loggers.
There are also open-source software packages providing all the necessary tools to acquire data from different, typically specific, hardware equipment. These tools come from the scientific community, where complex experiments require fast, flexible, and adaptable software. Such packages are usually custom-fit, but more general DAQ packages like the Maximum Integrated Data Acquisition System (MIDAS) can be easily tailored and are used in several physics experiments.
Data preparation is the process of gathering, combining, structuring and organizing data so it can be used in business intelligence (BI), analytics and data visualization applications. The components of data preparation include data pre-processing, profiling, cleansing, validation and transformation; it often also involves pulling together data from different internal systems and external sources.
Data preparation work is done by information technology (IT), BI and data management teams as they integrate data sets to load into a data warehouse, NoSQL database or data lake repository or when new analytics applications are developed. In addition, data scientists, other data analysts and business users can use self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It’s also known as data wrangling, although some practitioners use that term in a narrower sense to refer to cleansing, structuring and transforming data as part of the overall data preparation process, distinguishing it from the data pre-processing stage.
Data preparation is the first step in data analytics projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery.
The issues to be dealt with fall into two main categories:
- Systematic errors involving large numbers of data records, probably because they have come from different sources;
- Individual errors affecting small numbers of data records, probably due to errors in the original data entry.
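Both categories can be handled in one preparation pass: systematic, source-wide differences are corrected by rule, while individual entry errors are caught by validation. The sketch below assumes two hypothetical sources with different date formats and one mistyped quantity; the records, field names, and the flag-for-review policy are illustrative.

```python
from datetime import datetime

# Hypothetical records: source A uses DD/MM/YYYY, source B uses
# YYYY-MM-DD -- a systematic, source-wide format difference.
records = [
    {"source": "A", "order_date": "03/01/2024", "qty": "5"},
    {"source": "A", "order_date": "17/02/2024", "qty": "2"},
    {"source": "B", "order_date": "2024-03-09", "qty": "three"},  # individual entry error
]

FORMATS = {"A": "%d/%m/%Y", "B": "%Y-%m-%d"}

def prepare(record):
    """Normalize the systematic format difference; flag individual errors."""
    clean = dict(record)
    # Systematic fix: parse each source's known date format into one canonical form.
    parsed = datetime.strptime(record["order_date"], FORMATS[record["source"]])
    clean["order_date"] = parsed.strftime("%Y-%m-%d")
    # Individual fix: quantities that fail validation are set aside for review.
    clean["qty"] = int(record["qty"]) if record["qty"].isdigit() else None
    return clean

cleaned = [prepare(r) for r in records]
```

The systematic error is fixed mechanically for every record from a source, while the individual error is merely flagged, since correcting it usually requires going back to the original data entry.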
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for data processing and analysis is accurate and consistent, so the results of BI and analytics applications will be valid. Data is commonly created with missing values, inaccuracies or other errors. Additionally, separate data sets often have different formats that need to be reconciled. Correcting data errors, verifying data quality and joining data sets constitute a big part of the data preparation process.
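The joining step mentioned above typically also requires reconciling formats, because separate systems rarely agree on how a shared key is written. The following sketch joins two hypothetical data sets on a customer id; the systems, the records, and the key-normalization rule are all assumptions for illustration.

```python
# Hypothetical data sets describing the same customers with differently
# formatted keys: "C-001" in the CRM versus "c001" in billing.
crm = {"C-001": {"name": "Acme Ltd"}, "C-002": {"name": "Globex"}}
billing = [{"cust": "c001", "balance": 120.0}, {"cust": "c002", "balance": 0.0}]

def normalize_key(raw):
    """Reconcile the two systems' id formats to one canonical form (assumed rule)."""
    return "C-" + raw.lstrip("cC").lstrip("-").zfill(3)

joined = []
for row in billing:
    key = normalize_key(row["cust"])
    profile = crm.get(key)
    if profile is not None:  # keep only rows that match a CRM record
        joined.append({"id": key, **profile, "balance": row["balance"]})
```

Rows whose keys cannot be matched after normalization would, in practice, be routed to an exception report rather than silently dropped.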
The first step is to set out a full and detailed specification of the format of each data field and what the entries mean. This should take careful account of:
- Most importantly, consultation with the users of the data;
- Any available specification of the system that will use the data to perform the analysis;
- A full understanding of the information available, and any gaps, in the source data.
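Such a field specification can itself be made executable, so that every entry is checked against the documented format. Below is a minimal sketch; the field names, patterns, and meanings are illustrative assumptions, not a standard.

```python
import re

# Hypothetical specification: for each field, its expected format and meaning.
SPEC = {
    "customer_id": {"pattern": r"^C-\d{3}$", "meaning": "canonical customer key"},
    "country":     {"pattern": r"^[A-Z]{2}$", "meaning": "ISO 3166-1 alpha-2 code"},
    "qty":         {"pattern": r"^\d+$",      "meaning": "units ordered, non-negative"},
}

def validate(record):
    """Return the list of fields whose entries violate the specification."""
    return [name for name, rule in SPEC.items()
            if not re.match(rule["pattern"], str(record.get(name, "")))]

good = {"customer_id": "C-001", "country": "DE", "qty": "5"}
bad  = {"customer_id": "001",   "country": "Germany", "qty": "-2"}
```

Writing the specification down in this checkable form keeps it in step with the consultation and system-specification inputs listed above, since any disagreement surfaces as a validation failure.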