Data Cleansing: Definition, Use Cases, and Challenges

This page explains the definition of data cleaning or data cleansing, along with use cases and challenges.

Data Cleansing Definition

Figure 1: Data quality problems

Data cleansing is the process of transforming sourced data containing errors, duplicates, and inconsistencies into clean, usable data. It’s a vital method used in data analytics.

As Figure 1 shows, real-world data is often “dirty,” meaning it suffers from several issues:

  • Incomplete data: Values are missing because they were unavailable at recording time or were lost through human, hardware, or software errors.
  • Noisy data: Errors introduced by faulty equipment, problems during data transmission, or human/computer mistakes.
  • Duplicate data: Redundant entries originating from various data sources.

Dirty data often exhibits these specific problems:

  • Incomplete: Lacking attribute values.
    • Example: occupation = " "
  • Noisy: Containing errors like spelling mistakes, phonetic/typing errors, transpositions, or multiple values in a single field.
    • Example: Salary = "-10" (a salary cannot be negative)
  • Inconsistent: Discrepancies in codes or names (synonyms, nicknames, prefix/suffix variations, abbreviations, truncation, initials).
    • Example #1: Age = "42", Birthday = "03/07/1997" (the stated age does not match the birth date)
    • Example #2: Rating scale changed from "1, 2, 3" to "A, B, C"
    • Example #3: Discrepancies between approximate duplicate records.
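
The three example problems above can be detected with simple rule-based checks. The following is a minimal sketch, not a definitive implementation: the helper name find_issues, the field names, and the reference date are hypothetical, and the birthday "03/07/1997" is assumed to mean 3 July 1997 (the original format is ambiguous).

```python
from datetime import date

def find_issues(record, today):
    """Return a list of (kind, field) problems found in one record."""
    issues = []

    # Incomplete: the attribute value is missing or only whitespace.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        issues.append(("incomplete", "occupation"))

    # Noisy: the value is outside the plausible range (negative salary).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append(("noisy", "salary"))

    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append(("inconsistent", "age/birthday"))

    return issues

# Hypothetical record combining the examples from the list above.
record = {
    "occupation": " ",
    "salary": -10,
    "age": 42,
    "birthday": date(1997, 7, 3),  # assumed reading of "03/07/1997"
}
issues = find_issues(record, today=date(2024, 1, 1))
print(issues)
# [('incomplete', 'occupation'), ('noisy', 'salary'), ('inconsistent', 'age/birthday')]
```

Checks like these only flag suspicious values. Deciding how to repair them (drop, impute, or ask a domain expert) is a separate step.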

To tackle these data quality problems, data cleansing is employed as a key method in data analytics, alongside data quality checking, data normalization, data standardization, data analysis, and data deduplication.

Figure 2: Data cleansing chart

Data cleansing performs various functions to improve data quality. One such function is “string matching,” used to identify the same entity from two different datasets (i.e., tables), as illustrated in Figure 3.

Figure 3: Data cleansing using string matching
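
String matching between two tables can be sketched with Python's standard difflib module. This is one possible approach under stated assumptions, not the method used in the figure: the sample names, the normalize and match_entities helpers, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_entities(left, right, threshold=0.8):
    """For each name in left, keep its best match in right if it is similar enough."""
    matches = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        score = similarity(name, best)
        if score >= threshold:
            matches.append((name, best, round(score, 2)))
    return matches

# Hypothetical name columns from two different tables.
customers = ["Jon Smith", "ACME Corp.", "Maria Garcia"]
accounts = ["John Smith", "Acme Corporation", "Wei Chen"]

matches = match_entities(customers, accounts)
print(matches)  # [('Jon Smith', 'John Smith', 0.95)]
```

Note that "ACME Corp." and "Acme Corporation" score only 0.72 and are missed at this threshold. Abbreviations like this are one reason the choice of technique and threshold usually needs tuning per dataset.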

Data Cleansing Use Cases

Data cleansing is applied in several areas of data analytics:

  • MDM - Master Data Management
  • CRM - Customer Relationship Management
  • DWH - Data Warehousing
  • BI - Business Intelligence

Dirty data in these systems typically shows up as inaccurate inventory levels, banking risks, IT overhead, incorrect KPIs, and poor publicity. Data cleansing addresses these issues.

Data Cleansing Challenges

Data cleansing presents several challenges:

  • How to define data quality?
    • This is addressed through data profiling.
  • Semantic complexity
    • Domain experts are often required to validate the correctness of data values.
    • The specific dataset and the desired outcome determine the choice of techniques. Achieving optimal results often requires fine-tuning.
  • Computational complexity
    • Duplicate detection can have quadratic time complexity, making it computationally intensive.
  • Evaluation is difficult
    • The absence of a universally accepted “gold standard” makes evaluating the effectiveness of data cleansing challenging.
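
To illustrate the computational complexity point above: a naive duplicate detector compares every record with every other record, so n records need n(n-1)/2 comparisons. The sketch below counts those comparisons. The records, the find_duplicates helper, and the same_email rule are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def find_duplicates(records, is_duplicate):
    """Compare every pair of records; return (duplicate pairs, comparisons made)."""
    pairs = []
    comparisons = 0
    for a, b in combinations(records, 2):
        comparisons += 1
        if is_duplicate(a, b):
            pairs.append((a, b))
    return pairs, comparisons

def same_email(a, b):
    """Hypothetical duplicate rule: e-mails match after trimming and lowercasing."""
    return a["email"].strip().lower() == b["email"].strip().lower()

records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "Ann@Example.com "},
    {"id": 3, "email": "bob@example.com"},
    {"id": 4, "email": "eve@example.com"},
]

pairs, comparisons = find_duplicates(records, same_email)
print([(a["id"], b["id"]) for a, b in pairs])  # [(1, 2)]
print(comparisons)  # 6, i.e. 4 * 3 / 2
```

With 4 records this is only 6 comparisons, but 10,000 records would already need 49,995,000, which is why duplicate detection becomes expensive on large datasets.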