The problem concerning “Don’t know” (or DK) responses: While processing the data, the researcher often comes across responses that are difficult to handle. One category of such responses is the ‘Don’t Know’ response, or simply the DK response. When the DK response group is small, it is of little significance. But when it is relatively big, it becomes a matter of major concern, and the question arises: is the question which elicited the DK response useless? The answer depends on which of two things happened: the respondent actually may not have known the answer, or the researcher may have failed to obtain the appropriate information. In the first case the question concerned is all right and the DK response is taken as a legitimate DK response. In the second case, the DK response is more likely a failure of the questioning process.
How should researchers deal with DK responses? The best way is to design better questions: good rapport between interviewers and respondents will minimise DK responses. But what about the DK responses that have already taken place? One way to tackle the issue is to estimate the allocation of DK answers from other data in the questionnaire. Another way is to keep DK responses as a separate category in tabulation: if the DK responses are legitimate, we treat them as a reply category in their own right; otherwise we let the reader make his own decision. Yet another way is to assume that DK responses occur more or less randomly, and to distribute them among the other answers in the ratio in which the latter have occurred (sketched below). Similar results are achieved if all DK replies are simply excluded from tabulation, without inflating the actual number of other responses.
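As a minimal illustration of the redistribution approach, the sketch below distributes DK responses among the substantive answers in proportion to their observed frequencies; the answer categories and counts are made-up example values.

```python
# Minimal sketch: redistribute "Don't know" (DK) responses among the
# substantive answers in the ratio in which those answers occurred.
# The categories and counts below are made-up example values.

responses = {"Yes": 120, "No": 60, "Maybe": 20}  # substantive answers
dk_count = 40                                    # DK responses to allocate

total = sum(responses.values())
redistributed = {
    answer: count + dk_count * count / total
    for answer, count in responses.items()
}

print(redistributed)  # {'Yes': 144.0, 'No': 72.0, 'Maybe': 24.0}
```

Note that the resulting proportions are the same as those obtained by simply excluding the DK replies from the base, which is why the two approaches give similar results.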
Use of percentages: Percentages are often used in data presentation because they simplify numbers, reducing them all to a 0 to 100 range. Through the use of percentages, the data are reduced to a standard form with a base equal to 100, which facilitates relative comparisons. While using percentages, researchers should keep the following rules in view:
- Two or more percentages must not be averaged unless each is weighted by the size of the group from which it has been derived (a worked example follows this list).
- Use of too large percentages should be avoided, since a large percentage is difficult to understand and tends to confuse, defeating the very purpose for which percentages are used.
- Percentages hide the base from which they have been computed. If this is not kept in view, the real differences may not be correctly read.
- Percentage decreases can never exceed 100 per cent and as such for calculating the percentage of decrease, the higher figure should invariably be taken as the base.
- Percentages should generally be worked out in the direction of the causal factor in the case of two-dimension tables, and for this purpose we must select the more significant of the two given factors as the causal factor.
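To see why the first rule matters, here is a small sketch (with made-up group sizes) contrasting a naive average of percentages with one weighted by group size:

```python
# Sketch: averaging percentages correctly requires weighting each
# percentage by the size of the group it was computed from.
# Group sizes and percentages are made-up example values.

groups = [
    (50, 80.0),    # (group size, percentage observed in that group)
    (950, 20.0),
]

naive = sum(p for _, p in groups) / len(groups)
weighted = sum(n * p for n, p in groups) / sum(n for n, _ in groups)

print(f"naive average:    {naive:.1f}%")     # 50.0% -- misleading
print(f"weighted average: {weighted:.1f}%")  # 23.0% -- correct overall rate
```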
Data Processing
Data processing is an operation performed on data (raw facts) to manipulate and convert it into meaningful information.
Data processing involves three activities: input, processing, and output.
Input
In this stage, the collected data is transformed into a form that the computer can understand. It is the most important step because the output depends entirely on the data provided as input.
The data input process involves the following set of activities:
- Collection: In this, we gather the raw data from various data sources and prepare it for the input process.
- Encoding: In this process, we convert the collected data into a form that is easier to enter into a data processing system (a short sketch follows this list).
- Transmission: In this stage, we send the input data to the processors and carry it across the various components of the system.
- Communication: A set of activities which allows the sending of data from one data processing system to another.
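As a minimal illustration of the collection-and-encoding steps, the sketch below turns raw text records (an assumed comma-separated layout with made-up fields) into structured rows a processing system can work with:

```python
# Sketch: encode raw collected records into a structured form.
# The field layout (name, age, city) is an assumption for the example.

raw_records = [
    "Alice, 34, Delhi",
    "Bob, 29, Mumbai",
]

def encode(record: str) -> dict:
    """Convert one raw comma-separated record into a typed dictionary."""
    name, age, city = (field.strip() for field in record.split(","))
    return {"name": name, "age": int(age), "city": city}

encoded = [encode(r) for r in raw_records]
print(encoded)  # [{'name': 'Alice', 'age': 34, ...}, ...]
```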
Processing
This is the stage where raw data is transformed into information by applying actual data manipulation techniques.
The techniques that we use in the process stage are as follows:
- Classification: In this stage, we classify the data into groups and subgroups so that it is easier to handle (see the sketch after this list).
- Storing: In the storage technique, we store the data in an arranged order so that we can access the data quickly whenever we need it.
- Calculation: We apply this technique to the numeric data to calculate the required output from the raw figures.
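Here is a small sketch of the classification and calculation techniques together, grouping records by a field and computing an aggregate per group (the records and field names are made-up examples):

```python
# Sketch: classify records into groups, then calculate an aggregate
# per group. The records and field names are made-up example values.
from collections import defaultdict

records = [
    {"city": "Delhi", "age": 34},
    {"city": "Mumbai", "age": 29},
    {"city": "Delhi", "age": 40},
]

# Classification: group records by city.
groups = defaultdict(list)
for rec in records:
    groups[rec["city"]].append(rec["age"])

# Calculation: average age per group.
averages = {city: sum(ages) / len(ages) for city, ages in groups.items()}
print(averages)  # {'Delhi': 37.0, 'Mumbai': 29.0}
```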
Output
The information that we get as a result of data processing is the output. We can present the output information in a visual form, and on its basis we can take further actions or make decisions.
Challenges in Data Processing
So far we have understood the difference between data and information, and we have seen how data processing works.
But it is not as simple as it appears: several challenges come up while processing the data. Let me shed some light on the key ones.
1. Collection of Data
The very first challenge in data processing comes in the collection or acquisition of the correct data for the input. We have the following data sources from which we can acquire data:
- Administrative data sources
- Mobile and website data
- Social media
- Support calls
- Statistical surveys
- Purchasing data from third parties
There are many more; sometimes a data collection agent even goes door to door to collect the data we need.
The challenge here is to collect the right data, because the result depends directly on the input data. Hence, it is vital to collect correct and exact data to get the desired result.
Choosing the right data collection technique can help to overcome this challenge. Below are four different data collection techniques:
- Observation: Making direct observation is a quick and effective way to collect simple data with minimal intrusion.
- Questionnaire: Surveys can be carried out in every corner of the globe, and with them the researcher can structure and precisely formulate the data collection plan.
- Interview: The most suitable technique for interpreting and understanding the respondents.
- Focus group session: The presence of several relevant people simultaneously debating on the topic gives the researcher a chance to view both sides of the coin and build a balanced perspective.
2. Duplication of Data
As the data is collected from different data sources, it often happens that data is duplicated: the same entries and entities may be present a number of times at the data encoding stage. This duplicate data is redundant and may produce an incorrect result.
Hence, we need to check the data for duplicates and proactively remove them.
Data deduplication is adopted to reduce cost and free storage space. Deduplication technology identifies identical data blocks and eliminates the redundant copies.
This technique significantly reduces disk usage and disk I/O traffic. Hence, it enhances processing performance and helps in achieving precise, high-accuracy data (a minimal sketch follows).
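Here is a minimal sketch of block-level deduplication: split the data into fixed-size blocks, fingerprint each block with a hash, and store each distinct block only once. The block size and data are made-up values for the example.

```python
# Sketch: block-level deduplication. Identical blocks are detected by
# their SHA-256 fingerprint and stored only once.
import hashlib

BLOCK_SIZE = 8  # bytes; real systems use much larger blocks

data = b"ABCDEFGH" * 3 + b"12345678"  # repeated blocks to deduplicate

store = {}      # fingerprint -> block (each distinct block kept once)
sequence = []   # fingerprints, in order, to reconstruct the data
for i in range(0, len(data), BLOCK_SIZE):
    block = data[i:i + BLOCK_SIZE]
    digest = hashlib.sha256(block).hexdigest()
    store.setdefault(digest, block)
    sequence.append(digest)

print(f"blocks written: {len(sequence)}, distinct stored: {len(store)}")
# blocks written: 4, distinct stored: 2
reconstructed = b"".join(store[d] for d in sequence)
assert reconstructed == data  # lossless: original data is recoverable
```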
3. Inconsistency of Data
When we collect a huge amount of data, there is no guarantee that the data will be complete or that all the fields we need are filled correctly. Moreover, the data may be ambiguous.
As the input/raw data is heterogeneous in nature and is collected from autonomous data sources, the data may conflict at three different levels:
- Schema Level: Different data sources have different data models and different schemas within the same data model.
- Data representation level: Data in different sources are represented in different structures, languages, and measurements.
- Data value level: Sometimes the same data objects have factual discrepancies among various data sources. This occurs when two data objects obtained from different sources are identified as versions of each other, but the values of their attributes differ.
In this situation, we need to check the data for completeness. We also have to assess how much the inconsistent field matters to the desired result. Furthermore, we need to proactively hunt for defects to ensure consistency in the database (a minimal check is sketched below).
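Here is a minimal sketch of such a check: flag records that are missing required fields, and detect value-level conflicts where two sources disagree about the same entity. All field names and records are made-up examples.

```python
# Sketch: check merged records for completeness and for value-level
# conflicts between sources. All fields and values are made-up examples.

REQUIRED = {"id", "name", "email"}

records = [
    {"source": "crm",    "id": 1, "name": "Alice", "email": "a@x.com"},
    {"source": "survey", "id": 1, "name": "Alice", "email": "alice@x.com"},
    {"source": "web",    "id": 2, "name": "Bob"},  # email missing
]

# Completeness: which records lack required fields?
incomplete = [r for r in records if not REQUIRED <= r.keys()]

# Consistency: do two sources disagree about the same entity's email?
seen = {}
conflicts = []
for r in records:
    key, email = r["id"], r.get("email")
    if email and key in seen and seen[key] != email:
        conflicts.append((key, seen[key], email))
    elif email:
        seen[key] = email

print("incomplete:", incomplete)
print("conflicts:", conflicts)  # [(1, 'a@x.com', 'alice@x.com')]
```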
4. Variety of Data
The input data, as it is collected from different sources, can come in different forms. The data is not limited to the rows and columns of a relational database; it varies from application to application and source to source. Much of this data is unstructured and cannot fit into a spreadsheet or a relational database.
The collected data may be in text or tabular format; on the other hand, it may be a collection of photographs and videos, or sometimes just audio.
Sometimes to get the desired result, there is a need to process different forms of data altogether.
There are different techniques for resolving and managing data variety, some of them are as follows:
- Indexing: Different and incompatible data types can be related together with the indexing technique.
- Data profiling: This technique helps in identifying the abnormalities and interrelationship between the different data sources.
- Metadata: Meta description of data and its management helps in achieving contextual consistency in the data.
- Universal format conversion: In this technique, we convert the collected data into a universally accepted format such as Extensible Markup Language (XML), as sketched below.
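Here is a minimal sketch of universal format conversion using Python's standard library, turning heterogeneous records into XML (the record fields are assumptions for the example):

```python
# Sketch: convert collected records into a universally accepted
# format (XML) using the standard library. Fields are made-up examples.
import xml.etree.ElementTree as ET

records = [
    {"name": "Alice", "city": "Delhi"},
    {"name": "Bob", "city": "Mumbai"},
]

root = ET.Element("records")
for rec in records:
    node = ET.SubElement(root, "record")
    for field, value in rec.items():
        ET.SubElement(node, field).text = str(value)

print(ET.tostring(root, encoding="unicode"))
# <records><record><name>Alice</name>...</record>...</records>
```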
5. Data Integration
Data integration means combining data from various sources and presenting it in a unified view.
With the increased variety and differing formats of data, the challenge of integrating it grows.
Data integration involves several challenges, as follows:
- Isolation: The majority of applications are developed and deployed in isolation, which makes it difficult to integrate data across applications.
- Technological Advancements: As technology advances, the ways to store and retrieve data change. The problem here is integrating newer data with legacy data.
- Data Problems: The challenge in data integration arises when the data is incorrect, incomplete, or in the wrong format.
We then have to figure out the right approach to integrate the data so that it remains consistent.
There are mainly three techniques for integrating data:
- Consolidation: It captures data from multiple sources and integrates it into a single persistent data store (see the sketch after this list).
- Federation: It gives a single virtual view of multiple data sources. When a query is fired, it returns data from the most appropriate data source.
- Propagation: Data propagation applications copy data from one source to another. Furthermore, propagation can support two-way data exchange, regardless of the type of data synchronization.
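Here is a minimal sketch of the consolidation technique: records from several sources are captured into a single persistent store, tagged with their origin. The source names, fields, and output file are made-up for the example.

```python
# Sketch: data consolidation -- capture records from multiple sources
# into one persistent store, keeping track of where each came from.
import json

sources = {
    "crm":    [{"id": 1, "name": "Alice"}],
    "survey": [{"id": 2, "name": "Bob"}],
}

consolidated = []
for source_name, records in sources.items():
    for rec in records:
        # Tag each record with its origin before unifying.
        consolidated.append({**rec, "_source": source_name})

# Persist the unified view as a single store (a JSON file here).
with open("consolidated.json", "w") as f:
    json.dump(consolidated, f, indent=2)

print(f"{len(consolidated)} records consolidated")
```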
6. Volume and Storage of Data
When processing big data, the volume of the data is considerably large. Big data consists of both structured and unstructured data: data from social networking sites, company records, surveillance sources, research and development, and much more. Here comes the challenge of storing and managing this sheer volume of data, and of deciding how much of it to keep in RAM so that processing is faster and resources are used intelligently.
We also need to back up the data to protect it from any sort of loss. Data loss can occur due to software or hardware failure, natural disaster, or human error.
Since the data itself is huge in volume and we must keep a copy or backup of it for safety, the amount of stored data can grow by 150% or even more.
Below are the possible approaches that we may use to store a large amount of data:
- Object storage: With this approach it is easier to store very large sets of data. It is a replacement for the traditional tree-like file system.
- Scale-out NAS: Capable of scaling storage capacity, it usually has its own distributed or clustered file system.
- Distributed nodes: Most often implemented on low-cost commodity hardware, with storage attached directly to the compute server or even to server memory.
7. Poor Description and Metadata
One of the major sources of input data is data that has been stored over time in a relational database. Often this data is not properly formatted, and there is no meta description of its storage, its structure, or the relations of the data entities with each other.
The scenario becomes even worse when the amount of data is large and the database itself links to other databases. Without proper documentation of the database, it is quite difficult to extract the correct input data.
Possible ways to cope with such databases include:
- De-normalize the database for querying purposes (a small sketch follows this list).
- Use stored procedures for complex data management tasks.
- Use a NoSQL database for storing the data.
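As a small illustration of the first remedy, the sketch below denormalizes two related tables into one query-friendly table using SQLite; the schema and values are made-up examples.

```python
# Sketch: denormalize a normalized pair of tables into a single
# query-friendly table. The schema is a made-up example.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 40.0);

    -- Denormalized copy: the customer name is repeated on every order
    -- row, so reporting queries no longer need a join.
    CREATE TABLE orders_denorm AS
        SELECT o.id, c.name AS customer_name, o.amount
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

for row in con.execute("SELECT * FROM orders_denorm"):
    print(row)  # (10, 'Alice', 99.0) ...
```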
8. Modification of Network Data
Network data is distributed, yet its elements relate to each other in a complex structure. The challenge here is to modify the structure of the data or to add data to it.
The internet is a network containing a huge variety of data; many applications and websites generate data of different forms and characteristics, and schemas interconnect all of it.
A schema is the definition of the indexes, packages, tables/rows, and metadata of a database.
It is difficult to transport data if a database does not handle schemas.
SQL Server Data Tools (SSDT) includes a schema compare utility that we can use to compare two database definitions; it can compare any combination of source and target databases.
Moreover, it reports any discrepancies between schemas and detects mismatched column data types and defaults.
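SSDT is specific to SQL Server, but the underlying idea of a schema compare can be sketched in a tool-agnostic way. The example below compares the column definitions of the same table in two SQLite databases; the table and column names are made-up.

```python
# Sketch: a tiny, tool-agnostic schema compare. Reports columns whose
# name or declared type differs between two database definitions.
import sqlite3

def table_schema(con, table):
    """Map column name -> declared type, via SQLite's table_info pragma."""
    return {row[1]: row[2] for row in
            con.execute(f"PRAGMA table_info({table})")}

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
target.execute("CREATE TABLE users (id INTEGER, name TEXT, age TEXT)")

src, tgt = table_schema(source, "users"), table_schema(target, "users")
for col in src.keys() | tgt.keys():
    if src.get(col) != tgt.get(col):
        print(f"mismatch on {col!r}: source={src.get(col)} "
              f"target={tgt.get(col)}")
# mismatch on 'age': source=INTEGER target=TEXT
```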
9. Security of Data
Security plays the most important role in the data field. A successful attack might result in a data leak, which can cost the data processing firm dearly. An attacker might even change or delete data that we acquired and processed after a lot of effort.
Security breaches in a database occur mainly for these reasons:
- Most of the data processing systems have a single level of protection.
- No encryption of either the raw data or the result/output data.
- Data access granted to unethical IT professionals, which risks data loss.
To ensure the security of the data we should follow the below-mentioned practices:
- Do not connect to public networks.
- Keep personal information safe and secure with a strong password.
- Limit human access to the data.
- Encrypt and back up the data (a minimal sketch follows).
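Here is a minimal sketch of the last practice, assuming the third-party `cryptography` package is installed (`pip install cryptography`); the file names are made-up for the example.

```python
# Sketch: encrypt data before storing it, then keep a backup copy.
# Requires the third-party 'cryptography' package (an assumption).
from cryptography.fernet import Fernet
import shutil

key = Fernet.generate_key()       # store this key somewhere safe!
cipher = Fernet(key)

data = b"processed results worth protecting"
token = cipher.encrypt(data)      # ciphertext; useless without the key

with open("results.enc", "wb") as f:
    f.write(token)
shutil.copy("results.enc", "results.enc.bak")   # simple backup copy

# Only a holder of the key can recover the plaintext.
assert cipher.decrypt(token) == data
```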
10. Cost of Data Processing
Cost is a matter of consideration: as the amount of data increases, the cost at each stage of data processing gradually increases.
The cost of data processing depends on the following factors:
- The type of the processing data
- Turnaround time to complete the processing of data and get the required result.
- The accuracy of the data.
- Workforce working on data processing.
The stakeholders or the management overseeing the data processing must consider the budget and expenses. Compressing the data reduces its size, so it occupies less disk space (a small sketch follows). With proper planning of costs and expenses, a firm can do well with data processing.
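As a small illustration of the compression point, the sketch below uses the standard library's zlib to shrink repetitive data before storage:

```python
# Sketch: compress data before storing it to cut disk-space costs.
import zlib

data = b"the same log line repeated many times\n" * 1000
compressed = zlib.compress(data, level=9)

print(f"original:   {len(data)} bytes")
print(f"compressed: {len(compressed)} bytes")

assert zlib.decompress(compressed) == data  # lossless round trip
```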