6  Data Quality Assessment

6.1 Why Data Quality Matters

"Garbage in, garbage out" is one of the oldest aphorisms in computing, and the one analytics teams most often relearn.

Every analytical result, every dashboard, every machine-learning model rests on the data underneath it. When that data is wrong, late, incomplete, or inconsistent, the analysis is wrong in ways that are often invisible until the decision has been made. Poor data quality is the largest single reason analytics projects fail to deliver business value.

The cost of poor data quality is rarely visible on a single line of the profit and loss statement, but it shows up everywhere — in misdirected marketing spend, in regulatory fines, in operational rework, in management decisions made on the wrong information. A disciplined approach to data quality assessment is therefore not a back-office concern; it is a precondition for every form of analytics covered in this book.

6.2 Defining Data Quality

Data Quality is the degree to which a set of data is fit for the purpose for which it is being used. The seminal work of Richard Y. Wang and Diane M. Strong (1996) in the Journal of Management Information Systems established the modern, multi-dimensional view of data quality, arguing that quality cannot be reduced to accuracy alone. It depends on the consumer of the data, the task, and the context.

Data quality is therefore relative. A customer dataset that is perfectly adequate for a monthly newsletter mail-out may be entirely inadequate for a regulatory return.

6.3 The Dimensions of Data Quality

```mermaid
flowchart TD
    DQ["Data<br>Quality"] --> A["Accuracy"]
    DQ --> B["Completeness"]
    DQ --> C["Consistency"]
    DQ --> D["Timeliness"]
    DQ --> E["Validity"]
    DQ --> F["Uniqueness"]
    DQ --> G["Integrity"]
    DQ --> H["Relevance"]
    style DQ fill:#e3f2fd,stroke:#1976D2
    style A fill:#e8f5e9,stroke:#388E3C
    style B fill:#e8f5e9,stroke:#388E3C
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#fff8e1,stroke:#F9A825
    style E fill:#fff3e0,stroke:#EF6C00
    style F fill:#fff3e0,stroke:#EF6C00
    style G fill:#fce4ec,stroke:#AD1457
    style H fill:#ede7f6,stroke:#4527A0
```

Tip: The Eight Core Dimensions of Data Quality

| Dimension | Question Answered | Typical Failure |
|---|---|---|
| Accuracy | Does the data correctly describe the real-world entity it represents? | A customer’s address records a city in which they no longer live |
| Completeness | Are all the required values present? | Twenty per cent of customer records have no email address |
| Consistency | Does the data agree with itself across systems? | Customer status is “active” in CRM but “closed” in billing |
| Timeliness | Is the data current enough for the decision it supports? | Yesterday’s stock position is used to plan tomorrow’s promotion |
| Validity | Does each value conform to the defined format and rules? | Date of birth recorded as 31 February or as a free-text string |
| Uniqueness | Is each real-world entity represented exactly once? | The same customer appears as three records under three name spellings |
| Integrity | Are relationships across tables and systems preserved? | An order references a customer ID that does not exist |
| Relevance | Is the data appropriate to the question being asked? | A national dataset is used to support a city-level decision |

The eight dimensions above are the practical core. The full Wang and Strong framework groups dimensions into four broader categories — intrinsic, contextual, representational, and accessibility — which is useful for academic and governance work but more than is needed for day-to-day assessment.

6.3.1 Quantifying Each Dimension

Leo L. Pipino et al. (2002) in Communications of the ACM proposed a small set of standard formulas for turning each dimension into a measurable score, typically a number between zero and one. The most widely used are:

  • Completeness ratio = number of non-missing values / total expected values.
  • Uniqueness ratio = number of distinct entities / total records.
  • Validity ratio = number of records passing format and rule checks / total records.
  • Consistency ratio = number of records that agree across the systems compared / total records.
  • Timeliness ratio = a function of the gap between the data’s reference date and the decision’s reference date, scaled by the maximum acceptable lag.

Once each dimension carries a score, the organisation can set thresholds — for example, no dataset is permitted to enter the production warehouse with a completeness ratio below 0.95 — and treat data quality as something that is monitored, not just complained about.
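
As a minimal illustration, the ratios above can be computed directly on a tabular dataset. The sketch below assumes a pandas DataFrame with illustrative column names (customer_id, email) and uses one common formulation of the timeliness score; it is not a prescribed implementation.

```python
import pandas as pd
from datetime import date

def completeness_ratio(series: pd.Series) -> float:
    """Non-missing values divided by total expected values."""
    return series.notna().mean()

def uniqueness_ratio(df: pd.DataFrame, key_columns: list[str]) -> float:
    """Distinct entities (by candidate key) divided by total records."""
    return df.drop_duplicates(subset=key_columns).shape[0] / len(df)

def validity_ratio(series: pd.Series, rule) -> float:
    """Records passing a format or rule check divided by total records."""
    return series.apply(rule).mean()

def timeliness_ratio(data_date, decision_date, max_acceptable_lag_days: float) -> float:
    """One minus the lag scaled by the maximum acceptable lag, floored at zero."""
    lag_days = (decision_date - data_date).days
    return max(0.0, 1.0 - lag_days / max_acceptable_lag_days)

# Illustrative customer table with one duplicate, one missing email, one bad format.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@y.com", "not-an-email"],
})
print(completeness_ratio(customers["email"]))                             # 0.75
print(uniqueness_ratio(customers, ["customer_id"]))                       # 0.75
print(validity_ratio(customers["email"].fillna(""), lambda v: "@" in v))  # 0.5
print(timeliness_ratio(date(2024, 1, 1), date(2024, 1, 3), 7))            # ~0.71
```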

6.4 The Data Quality Assessment Process

```mermaid
flowchart LR
    A["1. Define quality<br>requirements"] --> B["2. Profile<br>the data"]
    B --> C["3. Measure<br>each dimension"]
    C --> D["4. Diagnose<br>root causes"]
    D --> E["5. Remediate<br>and improve"]
    E --> F["6. Monitor<br>continuously"]
    F -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C
```

6.4.1 Step 1 — Define Quality Requirements

Quality is fitness for purpose, and purpose differs by use case. The first step is therefore to specify, with the business owner of the dataset, what good enough means for the decisions the data is supporting. This produces a per-dimension expectation — for example, 95 per cent completeness on customer email, 99 per cent validity on tax identifier, daily refresh latency of no more than 24 hours.
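
One way to make these expectations explicit and machine-checkable is a small requirements specification. The sketch below is purely illustrative; the dataset names, fields, and thresholds are assumptions agreed with the business owner, not prescriptions.

```python
# Per-dimension quality requirements for the datasets that matter most.
# Dataset names, fields, and thresholds are illustrative assumptions.
QUALITY_REQUIREMENTS = {
    "customer_master": {
        "completeness": {"email": 0.95, "date_of_birth": 0.99},
        "validity":     {"tax_identifier": 0.99},
        "uniqueness":   {"customer_id": 1.00},
        "timeliness":   {"max_refresh_lag_hours": 24},
    },
    "orders": {
        "completeness": {"order_date": 1.00, "customer_id": 1.00},
        "integrity":    {"customer_id_has_parent_record": 0.999},
    },
}
```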

6.4.2 Step 2 — Profile the Data

Data profiling is the diagnostic examination of a dataset to discover its actual structure, content, and quality. A typical profiling exercise asks:

  • What is the distribution of each field? Are there unexpected values?
  • What proportion of values is missing in each field?
  • How many distinct values does each field carry? Is the cardinality plausible?
  • Are there duplicates by candidate keys, by name, by address, by hash?
  • Do dates and numerics fall in plausible ranges?
  • Do referenced foreign keys exist in the parent tables?

Modern data platforms ship with profiling tools — Power BI’s column profiling, Tableau Prep’s profile pane, Python’s ydata-profiling, R’s skimr and DataExplorer, and dedicated platforms such as Informatica Data Quality and Talend Data Quality.
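
A first-pass profile needs nothing more than a general-purpose data library. The sketch below uses pandas to answer the questions above for a hypothetical customer file; the file names and key columns are assumptions, and a dedicated tool such as ydata-profiling wraps the same checks, and many more, into a single HTML report.

```python
import pandas as pd

# Illustrative file and column names; substitute the dataset under review.
customers = pd.read_csv("customer_master.csv", parse_dates=["date_of_birth"])
orders = pd.read_csv("orders.csv")

# Distributions: summary statistics for every field, numeric or not.
print(customers.describe(include="all").T)

# Missingness: proportion of missing values per field.
print(customers.isna().mean().sort_values(ascending=False))

# Cardinality: number of distinct values each field carries.
print(customers.nunique())

# Duplicates by a candidate key.
print("duplicate customer_id rows:", customers.duplicated(subset=["customer_id"]).sum())

# Plausible ranges for dates.
print(customers["date_of_birth"].min(), customers["date_of_birth"].max())

# Referential integrity: orders whose customer_id has no parent record.
orphans = ~orders["customer_id"].isin(customers["customer_id"])
print("orphan orders:", orphans.sum())
```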

6.4.3 Step 3 — Measure Each Dimension

Apply the formulas from the previous section to compute a numerical score for each of the eight dimensions on the dataset under review. Compare the scores to the requirements set in Step 1 and identify the gaps.

A data quality scorecard is a useful artefact: a one-page summary that lists each critical dataset, the score against each dimension, and the trend over time.
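
A scorecard can be as simple as a small table of dataset, dimension, measured score, and target. The values below are illustrative; in practice the scores come from the measurement step and the targets from Step 1.

```python
import pandas as pd

# Illustrative measured scores (Step 3) and agreed targets (Step 1).
rows = [
    ("customer_master", "completeness", 0.93,  0.95),
    ("customer_master", "uniqueness",   0.96,  1.00),
    ("customer_master", "validity",     0.99,  0.99),
    ("orders",          "integrity",    0.997, 0.999),
]
scorecard = pd.DataFrame(rows, columns=["dataset", "dimension", "score", "target"])

# Flag each row as meeting or missing its target.
scorecard["status"] = (scorecard["score"] >= scorecard["target"]).map(
    {True: "pass", False: "gap"}
)
print(scorecard.to_string(index=False))
```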

6.4.4 Step 4 — Diagnose Root Causes

Once a gap is identified, the goal is to find the cause of the gap, not just to clean its symptoms. Common causes include:

  • Missing or inadequate validation at the point of data entry.
  • Manual processes that allow free-text fields where coded values were intended.
  • Integration mismatches between source systems.
  • Definitional drift, where the same field name carries different meanings in different systems.
  • Inadequate or missing master data — customer, product, employee, location reference data.
  • Process changes that were not propagated to data consumers.

A clean dataset that re-fills with bad data each week has not been remediated; it has been swept.

6.4.5 Step 5 — Remediate and Improve

Remediation has two layers: cleaning the existing data, and fixing the upstream cause so the problem does not recur.

  • Cleansing corrects, completes, deduplicates, and standardises the existing data.
  • Process redesign changes the upstream process so the cleansed state becomes the default. Examples include adding validation to data-entry forms, introducing reference-data lookups, mandating coded fields, and reconciling cross-system definitions.

A common heuristic is the one-ten-hundred rule: it costs roughly one unit to prevent a data-quality defect at the source, ten units to correct it after capture, and one hundred units to deal with the consequences after a decision has been taken on the bad data.
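
As an illustration of the cleansing layer, here is a minimal sketch assuming a pandas customer table with free-text name and city fields; the standardisation rules shown are illustrative, not exhaustive, and real cleansing would be driven by the rules agreed with the data steward.

```python
import pandas as pd

# Illustrative records with whitespace, case, and duplicate-key problems.
customers = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "name": ["  Asha  Verma ", "ASHA VERMA", "R. Iyer"],
    "city": ["Mumbai", "mumbai ", "Chennai"],
})

# Standardise: trim whitespace, collapse internal spaces, normalise case.
for col in ["name", "city"]:
    customers[col] = (
        customers[col]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.title()
    )

# Deduplicate: keep one record per customer_id after standardisation.
cleansed = customers.drop_duplicates(subset=["customer_id"], keep="first")
print(cleansed)
```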

6.4.6 Step 6 — Monitor Continuously

Data quality is not a project; it is a continuous concern. Mature organisations operate continuous data-quality monitoring, in which the eight dimensions are scored automatically on every refresh, alerts fire when scores fall below thresholds, and trends are reviewed by the data-governance committee on a regular cadence.

Monitoring closes the loop, so that the assessment process feeds back into the quality requirements as the business evolves.
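
A sketch of the threshold check at the heart of such monitoring follows. The thresholds and the alerting mechanism (here only a log message) are assumptions; in practice the same check would raise an alert or fail the pipeline on each refresh.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq-monitor")

# Thresholds from Step 1; values are illustrative.
THRESHOLDS = {"completeness": 0.95, "validity": 0.99, "uniqueness": 0.98}

def check_refresh(dataset_name: str, scores: dict) -> bool:
    """Compare freshly computed dimension scores against thresholds.

    Returns True if every threshold is met; otherwise logs each breach
    (a production version would alert or halt the pipeline instead).
    """
    ok = True
    for dimension, threshold in THRESHOLDS.items():
        score = scores.get(dimension)
        if score is None or score < threshold:
            log.warning("%s: %s = %s below threshold %s",
                        dataset_name, dimension, score, threshold)
            ok = False
    return ok

# Example: scores produced by the measurement step for today's refresh.
check_refresh("customer_master",
              {"completeness": 0.93, "validity": 0.995, "uniqueness": 0.99})
```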

6.5 Common Causes of Poor Data Quality

  • Manual data entry without validation: Free-text fields invite spelling variants, abbreviations, and typos. The same customer becomes three records.

  • System-to-system integration without reconciliation: When two systems were never designed to talk to each other, their identifiers, codes, and definitions diverge.

  • Definitional disagreement: Marketing’s active customer and Finance’s active customer are not the same customer. Reports built on either definition disagree without anyone being technically wrong.

  • Unmaintained reference data: Product, location, and organisational hierarchies that are not actively curated drift out of date as the business changes.

  • Migration scars: Each system migration leaves behind orphan records, fields used for purposes their original designers did not anticipate, and codes whose meaning is lost.

  • Process changes not reflected in data: A change to the operational process that is not communicated to data consumers produces invisible breaks in time series.

  • Sensor and instrumentation drift: Industrial and IoT data degrades silently as sensors age, are replaced, or are recalibrated.

  • Cultural neglect: When data quality is no one’s job, it is no one’s success. Without ownership, it decays.

6.6 Data Quality Governance

Sustained data quality requires more than tools. It requires a governance structure that assigns clear responsibility:

  • Data owner: A senior business leader accountable for the quality of a domain of data — customer, product, finance, employee.
  • Data steward: A subject-matter expert responsible for definitions, rules, and quality monitoring within a domain.
  • Data custodian: The technical role that operates the platforms and pipelines on which the data lives.
  • Data quality forum or council: A cross-functional body that resolves definitional disputes, prioritises remediation, and reports quality trends to leadership.

The DAMA Data Management Body of Knowledge (DAMA-DMBOK) is the most widely adopted industry framework for this governance structure and is often used as a reference by Indian and global organisations setting up enterprise data programmes.

6.7 Common Pitfalls

  • Equating data quality with accuracy alone: Accuracy is one dimension of eight. A perfectly accurate but stale dataset is still poor quality.

  • Cleaning symptoms without fixing causes: Recurring data-quality defects indicate an upstream problem that no amount of downstream cleansing will resolve.

  • Boiling the ocean: Trying to assess every field of every dataset at once. Start with the data that supports the decisions that matter most.

  • Tooling without ownership: Buying a data-quality platform without naming a data owner produces an unloved console of red lights.

  • One-off assessments: Conducting a quality assessment, fixing the issues, and never measuring again. Quality decays without continuous monitoring.

  • Measurement without thresholds: Reporting completeness ratios without saying which datasets must clear which thresholds turns the scorecard into wallpaper.

  • Confusing volume with quality: A larger dataset is not a better dataset. More records of poor quality compound the problem.

  • Ignoring metadata: Lineage, definitions, and provenance are part of data quality. A field whose meaning no one can confirm is, by definition, low-quality.

6.8 Illustrative Cases

The following cases illustrate how the eight dimensions and the six-step process play out in practice. They are based on the kinds of work commonly seen in industry; the framing is the author’s.

Customer Master in a Retail Bank

A retail bank discovers that a single customer is, on average, represented by 1.4 records in its customer master. The cause is years of free-text name entry, branch-level merger of legacy systems, and inconsistent use of the national identifier as a key. Uniqueness is the failed dimension. The bank deploys a deterministic-plus-fuzzy matching pipeline to identify duplicates, introduces strict validation on the national identifier at all entry points, and creates a steward for the customer master. Within six months, the duplicate ratio falls from 0.4 to 0.05, and downstream reports across marketing, risk, and finance begin to agree.
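
The deterministic-plus-fuzzy logic at the core of such a matching pipeline can be sketched with the Python standard library alone. Production pipelines use dedicated matching engines; the identifier field, name field, and similarity threshold below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def same_customer(rec_a: dict, rec_b: dict, fuzzy_threshold: float = 0.85) -> bool:
    """Deterministic match on the national identifier first, then fuzzy match on name."""
    if rec_a.get("national_id") and rec_a.get("national_id") == rec_b.get("national_id"):
        return True  # deterministic rule wins outright
    similarity = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    return similarity >= fuzzy_threshold

print(same_customer({"national_id": "ABC123", "name": "Asha Verma"},
                    {"national_id": None, "name": "Aasha Verma"}))  # True via fuzzy match
```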

Product Master in a Manufacturing Firm

A manufacturer cannot reconcile its monthly sales report with its monthly production report. Investigation reveals that the same physical SKU carries two different codes in two systems, a legacy of a partial migration. Consistency and integrity are the failed dimensions. Remediation is a one-time mapping followed by retirement of the older code, plus a cross-system reference-data process that prevents the issue recurring.

Sensor Data in a Power Plant

A power plant’s predictive-maintenance model begins issuing implausible alerts. Investigation finds that two vibration sensors were replaced during a routine overhaul without the change being recorded; their calibrated baselines no longer match the model’s training data. Accuracy, timeliness, and integrity are simultaneously failed. Remediation includes recalibration, retroactive correction of the affected period, and a sensor-change-control procedure tied directly to the data pipeline.

Loan-Application Data in a Digital Lender

A digital lender’s credit model degrades over a quarter without any change to the model itself. Investigation finds that one upstream third-party data provider has changed the format of a key field from numeric to a comma-formatted string, and the parser has been silently inserting nulls. Validity and completeness are the failed dimensions. The lender introduces schema-validation contracts with all third-party providers and adds automated tests that fail the pipeline when validity drops below threshold.
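
A schema-validation contract of the kind described can be sketched as an explicit check that fails loudly instead of coercing silently. The column names and expected types below are assumptions for illustration.

```python
import pandas as pd

# Expected schema for the third-party feed; names and types are illustrative.
EXPECTED_SCHEMA = {
    "application_id": "int64",
    "declared_income": "float64",
    "applied_on": "datetime64[ns]",
}

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations instead of silently inserting nulls."""
    problems = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems

# The provider starts sending comma-formatted strings instead of numerics.
feed = pd.DataFrame({
    "application_id": [1, 2],
    "declared_income": ["1,20,000", "95,500"],
    "applied_on": pd.to_datetime(["2024-01-03", "2024-01-04"]),
})
print(validate_schema(feed))  # flags declared_income arriving as object, not float64
```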


Summary

Foundations

| Concept | Description |
|---|---|
| Why Data Quality Matters | Poor data quality is the largest single reason analytics projects fail to deliver value |
| Data Quality | The degree to which data is fit for the purpose for which it is being used |

The Eight Dimensions

| Concept | Description |
|---|---|
| Accuracy | Does the data correctly describe the real-world entity it represents |
| Completeness | Are all the required values present in the dataset |
| Consistency | Does the data agree with itself across systems and reports |
| Timeliness | Is the data current enough for the decision it supports |
| Validity | Does each value conform to the defined format, type, and rules |
| Uniqueness | Is each real-world entity represented exactly once |
| Integrity | Are relationships across tables and systems preserved |
| Relevance | Is the data appropriate to the question being asked |

Quantifying the Dimensions

| Concept | Description |
|---|---|
| Completeness Ratio | Non-missing values divided by total expected values |
| Uniqueness Ratio | Distinct entities divided by total records |
| Validity Ratio | Records passing format and rule checks divided by total records |
| Consistency Ratio | Records that agree across compared systems divided by total records |
| Timeliness Ratio | Function of the gap between the data's reference date and the decision's reference date |

The Six-Step Assessment Process

| Concept | Description |
|---|---|
| Define Quality Requirements | Specify per-dimension expectations with the business owner of the dataset |
| Profile the Data | Diagnostically examine a dataset to discover its actual structure, content, and quality |
| Measure Each Dimension | Compute a numerical score for each dimension and compare to the requirements |
| Diagnose Root Causes | Find the upstream cause of a quality gap rather than only treating its symptoms |
| Remediate and Improve | Cleanse existing data and redesign the upstream process so the cleansed state is default |
| Monitor Continuously | Score dimensions automatically on every refresh, alert on threshold breaches, and review trends |

Tools and Heuristics

| Concept | Description |
|---|---|
| Data Profiling | Tools and techniques to surface distributions, missingness, cardinality, duplicates, and outliers |
| One-Ten-Hundred Rule | Heuristic that prevention costs roughly one unit, correction ten units, and downstream consequences one hundred |
| Continuous Data-Quality Monitoring | Mature practice of automated, threshold-driven quality monitoring on every dataset refresh |

Common Causes of Poor Data Quality

| Concept | Description |
|---|---|
| Manual Entry Without Validation | Free-text fields without validation invite spelling variants and duplicates |
| Integration Without Reconciliation | Two systems whose identifiers, codes, and definitions diverge over time |
| Definitional Disagreement | The same field name carrying different meanings in different functions |
| Unmaintained Reference Data | Reference data such as products, locations, and hierarchies drifting out of date |
| Migration Scars | Each migration leaves orphan records, repurposed fields, and codes of lost meaning |
| Sensor and Instrument Drift | Industrial and IoT data degrades silently as sensors age, fail, or are recalibrated |
| Cultural Neglect | When data quality is no one's responsibility, it decays continuously |

Data Quality Governance

| Concept | Description |
|---|---|
| Data Owner | Senior business leader accountable for the quality of a data domain |
| Data Steward | Subject-matter expert responsible for definitions, rules, and monitoring within a domain |
| Data Custodian | Technical role that operates the platforms and pipelines on which the data lives |
| Data Quality Council | Cross-functional body that resolves disputes, prioritises remediation, and reports quality trends |
| DAMA-DMBOK | Data Management Body of Knowledge; the most widely adopted industry data-management framework |

Common Pitfalls

| Concept | Description |
|---|---|
| Equating Quality with Accuracy | Pitfall of treating accuracy as the whole of data quality and ignoring the other seven dimensions |
| Cleaning Symptoms | Pitfall of cleansing recurring defects without addressing the upstream cause |
| Boiling the Ocean | Pitfall of trying to assess every field of every dataset rather than starting with what matters |
| Tooling Without Ownership | Pitfall of buying a quality platform without naming a business owner for the data |
| One-Off Assessments | Pitfall of conducting one assessment, fixing issues, and never measuring again |
| Measurement Without Thresholds | Pitfall of reporting scores without thresholds, which turns the scorecard into wallpaper |
| Confusing Volume with Quality | Pitfall of treating bigger data as better data, when more poor records compound the problem |
| Ignoring Metadata | Pitfall of ignoring lineage, definitions, and provenance, which are part of data quality |