Data Cleansing: Definition, Use Cases, and Challenges

This page defines data cleansing (also called data cleaning) and describes its use cases and challenges.

Data Cleansing Definition

Figure 1: Data quality problems

Data cleansing is the process of transforming sourced data containing errors, duplicates, and inconsistencies into clean, usable data. It’s a vital method used in data analytics.

As Figure 1 shows, real-world data is often “dirty,” meaning it suffers from several issues:

  • Incomplete data: Missing values due to unavailability during recording or errors (human, hardware, or software).
  • Noisy data: Errors introduced by faulty equipment, data transmission problems, or human/computer mistakes.
  • Duplicate data: Redundant entries originating from various data sources.

Dirty data often exhibits these specific problems:

  • Incomplete: Lacking attribute values.
    • Example: occupation = " "
  • Noisy: Containing errors like spelling mistakes, phonetic/typing errors, transpositions, or multiple values in a single field.
    • Example: Salary = "-10" (a negative salary is not a valid value)
  • Inconsistent: Discrepancies in codes or names (synonyms, nicknames, prefix/suffix variations, abbreviations, truncation, initials).
    • Example #1: Age = "42", Birthday = "03/07/1997" (the age does not match the birth date)
    • Example #2: Rating changed from "1, 2, 3" to "A, B, C"
    • Example #3: Discrepancies between approximate duplicate records.

To tackle these data quality problems, data cleansing is employed as a key method in data analytics, alongside data quality checking, data normalization, data standardization, data analysis, and data deduplication.

Figure 2: Data cleansing chart

Data cleansing performs various functions to improve data quality. One such function is “string matching,” used to identify the same entity from two different datasets (i.e., tables), as illustrated in Figure 3.

Figure 3: Data cleansing using string matching
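As a rough illustration of string matching between two tables, the sketch below compares names with `difflib.SequenceMatcher` from the Python standard library. The helper names (`normalize`, `similarity`, `match_tables`), the sample names, and the 0.8 threshold are assumptions made for this example. Production systems often use other similarity measures, such as edit distance or phonetic codes.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity between 0.0 and 1.0 for two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_tables(left, right, threshold=0.8):
    """Map each name in `left` to its most similar name in `right`.

    Names whose best score is below `threshold` are left unmatched.
    """
    matches = {}
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            matches[name] = best
    return matches

# Hypothetical name columns from two different tables.
customers = ["Jon Smith", "ACME Corp.", "Maria Garcia"]
accounts = ["John Smith", "Acme Corp", "Peter Jones"]
print(match_tables(customers, accounts))
```

Here "Jon Smith" is paired with "John Smith" and "ACME Corp." with "Acme Corp", while "Maria Garcia" has no sufficiently similar counterpart and stays unmatched. The threshold controls the trade-off between missed matches and false matches and usually needs tuning per dataset.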

Data Cleansing Use Cases

Data cleansing finds application in various areas within data analytics:

  • MDM - Master Data Management
  • CRM - Customer Relationship Management
  • DWH - Data Warehousing
  • BI - Business Intelligence

Typical problems that data cleansing helps to address in these areas include inaccurate inventory levels, banking risks, IT overhead, incorrect KPIs, and bad publicity.

Data Cleansing Challenges

Data cleansing presents several challenges:

  • How to define data quality?
    • This is addressed through data profiling.
  • Semantic complexity
    • Domain experts are often required to validate the correctness of data values.
    • The specific dataset and the desired outcome determine the choice of techniques. Achieving optimal results often requires fine-tuning.
  • Computational complexity
    • Naive duplicate detection compares every record with every other record, so n records require n(n-1)/2 comparisons. This quadratic growth makes it computationally intensive on large datasets.
  • Evaluation is difficult
    • The absence of a universally accepted “gold standard” makes evaluating the effectiveness of data cleansing challenging.
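One common way to reduce the quadratic cost of duplicate detection is blocking: records are grouped by a cheap key, and only records within the same group are compared. The sketch below contrasts the number of candidate pairs with and without blocking. Blocking is not described in the text above; it is shown here as one possible technique, and the function names, sample records, and the choice of ZIP code as the blocking key are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs(records):
    """Every unordered pair of record indices: n * (n - 1) / 2 pairs."""
    return list(combinations(range(len(records)), 2))

def blocked_pairs(records, key):
    """Only pairs of record indices that share the same blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key(record)].append(index)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Hypothetical customer records; the ZIP code serves as the blocking key.
records = [
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "Maria Garcia", "zip": "94105"},
    {"name": "M. Garcia", "zip": "94105"},
    {"name": "Peter Jones", "zip": "60601"},
]
print(len(all_pairs(records)))                          # 10 candidate pairs
print(blocked_pairs(records, key=lambda r: r["zip"]))   # [(0, 1), (2, 3)]
```

The surviving pairs would then go through a more expensive comparison such as string matching. The trade-off is that true duplicates whose blocking keys differ, for example because of a typo in the ZIP code, are never compared.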