Data Cleansing: Definition, Use Cases, and Challenges
This page explains the definition of data cleaning or data cleansing, along with use cases and challenges.
Data Cleansing Definition
Figure 1: Data quality problems
Data cleansing is the process of transforming sourced data containing errors, duplicates, and inconsistencies into clean, usable data. It’s a vital method used in data analytics.
As Figure 1 shows, real-world data is often “dirty,” meaning it suffers from several issues:
- Incomplete data: Missing values due to unavailability during recording or errors (human, hardware, or software).
- Noisy data: Errors introduced during data transmission or caused by faulty equipment or human/computer mistakes.
- Duplicate data: Redundant entries originating from various data sources.
Dirty data often exhibits these specific problems:
- Incomplete: Lacking attribute values.
  - Example: occupation = " "
- Noisy: Containing errors like spelling mistakes, phonetic/typing errors, transpositions, or multiple values in a single field.
  - Example: Salary = "-10"
- Inconsistent: Discrepancies in codes or names (synonyms, nicknames, prefix/suffix variations, abbreviations, truncation, initials).
  - Example #1: Age = "42", Birthday = "03/07/1997" (the stated age does not match the birth date)
  - Example #2: Rating changed from "1, 2, 3" to "A, B, C"
  - Example #3: Discrepancies between approximate duplicate records.
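The three problem types above can be flagged with simple rule-based checks. The following is a minimal sketch, not a complete validator: the function name `find_issues`, the field names, the MM/DD/YYYY date format, and the fixed reference date are all assumptions made for illustration.

```python
from datetime import date

def find_issues(record, today):
    """Return (field, problem) pairs for one record given as a dict of strings."""
    issues = []

    # Incomplete: the attribute is absent or contains only whitespace.
    for field in ("occupation", "salary", "age", "birthday"):
        if not record.get(field, "").strip():
            issues.append((field, "incomplete"))

    # Noisy: the salary is not a number or is implausible (negative).
    salary = record.get("salary", "").strip()
    if salary:
        try:
            if float(salary) < 0:
                issues.append(("salary", "noisy"))
        except ValueError:
            issues.append(("salary", "noisy"))

    # Inconsistent: the stated age contradicts the age implied by the birthday.
    age = record.get("age", "").strip()
    birthday = record.get("birthday", "").strip()
    if age and birthday:
        try:
            # Assumed date format: MM/DD/YYYY.
            month, day, year = (int(part) for part in birthday.split("/"))
            born = date(year, month, day)
            stated = int(age)
        except ValueError:
            issues.append(("age/birthday", "noisy"))
        else:
            had_birthday = (today.month, today.day) >= (born.month, born.day)
            implied = today.year - born.year - (0 if had_birthday else 1)
            if stated != implied:
                issues.append(("age", "inconsistent"))
    return issues

# Hypothetical record combining the examples from the list above.
record = {"occupation": " ", "salary": " -10 ", "age": "42", "birthday": "03/07/1997 "}
print(find_issues(record, today=date(2024, 1, 1)))
# [('occupation', 'incomplete'), ('salary', 'noisy'), ('age', 'inconsistent')]
```

Passing `today` explicitly keeps the age check reproducible. In practice the rules themselves (which fields are required, which ranges are plausible) depend on the dataset.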
To tackle these data quality problems, data cleansing is employed as a key method in data analytics, alongside data quality checking, data normalization, data standardization, data analysis, and data deduplication.
Data cleansing performs various functions to improve data quality. One such function is “string matching,” used to identify the same entity from two different datasets (i.e., tables), as illustrated in Figure 3.
Figure 3: Data cleansing using string matching
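As a rough illustration of string matching across two tables, the sketch below pairs names whose normalized spellings are sufficiently similar. It uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure. The helper names, the sample names, and the 0.8 threshold are assumptions chosen for this example, not values taken from the article.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_entities(left, right, threshold=0.8):
    """For each name in `left`, keep its best match in `right` if it is similar enough."""
    matches = {}
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            matches[name] = best
    return matches

# Hypothetical name columns from two different tables.
table_a = ["Jon Smith", "ACME Corp.", "Maria Garcia"]
table_b = ["John Smith", "Acme Corp", "Peter Miller"]

print(match_entities(table_a, table_b))
# {'Jon Smith': 'John Smith', 'ACME Corp.': 'Acme Corp'}
```

The threshold trades missed matches against false matches, so it usually needs tuning on a sample of the actual data.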
Data Cleansing Use Cases
Data cleansing finds application in various areas within data analytics:
- MDM - Master Data Management
- CRM - Customer Relationship Management
- DWH - Data Warehousing
- BI - Business Intelligence
Typical problems caused by dirty data in these areas, and addressed by data cleansing, include inaccurate inventory levels, banking risks, IT overhead, incorrect KPIs, and poor publicity.
Data Cleansing Challenges
Data cleansing presents several challenges:
- How to define data quality?
  - This is addressed through data profiling.
- Semantic complexity
  - Domain experts are often required to validate the correctness of data values.
  - The specific dataset and the desired outcome determine the choice of techniques. Achieving optimal results often requires fine-tuning.
- Computational complexity
  - Naive duplicate detection compares every record with every other record, which takes n(n-1)/2 comparisons for n records. This quadratic growth makes it computationally intensive on large datasets.
- Evaluation is difficult
  - The absence of a universally accepted “gold standard” makes evaluating the effectiveness of data cleansing challenging.
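To make the computational cost of duplicate detection concrete, the sketch below counts the candidate pairs produced by naive all-pairs comparison. It then compares that count with "blocking", a common mitigation that this page does not discuss and that is shown here only as an assumption. The function names, the sample names, and the first-letter blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs(records):
    """Naive approach: every record is compared with every other record."""
    return list(combinations(records, 2))

def blocked_pairs(records, key):
    """Blocking: only records that share the same blocking key are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

# Hypothetical sample of names to deduplicate.
names = ["anna", "anne", "bob", "bobby", "carl", "karl"]

print(len(all_pairs(names)))                              # 15, i.e. 6 * 5 / 2
print(len(blocked_pairs(names, key=lambda n: n[0])))      # 2
```

Blocking on the first letter reduces the work from 15 comparisons to 2, but it never compares "carl" with "karl". Reducing comparisons this way can therefore miss true duplicates, which is part of why technique choice and fine-tuning depend on the dataset.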