```mermaid
flowchart TD
    DQ["Data<br>Quality"] --> A["Accuracy"]
    DQ --> B["Completeness"]
    DQ --> C["Consistency"]
    DQ --> D["Timeliness"]
    DQ --> E["Validity"]
    DQ --> F["Uniqueness"]
    DQ --> G["Integrity"]
    DQ --> H["Relevance"]
    style DQ fill:#e3f2fd,stroke:#1976D2
    style A fill:#e8f5e9,stroke:#388E3C
    style B fill:#e8f5e9,stroke:#388E3C
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#fff8e1,stroke:#F9A825
    style E fill:#fff3e0,stroke:#EF6C00
    style F fill:#fff3e0,stroke:#EF6C00
    style G fill:#fce4ec,stroke:#AD1457
    style H fill:#ede7f6,stroke:#4527A0
```
6 Data Quality Assessment
6.1 Why Data Quality Matters
Garbage in, garbage out is the oldest aphorism in computing, and the one most often relearned by analytics teams.
Every analytical result, every dashboard, every machine-learning model rests on the data underneath it. When that data is wrong, late, incomplete, or inconsistent, the analysis is wrong in ways that are often invisible until the decision has been made. Poor data quality is the largest single reason analytics projects fail to deliver business value.
The cost of poor data quality is rarely visible on a single line of the profit and loss statement, but it shows up everywhere — in misdirected marketing spend, in regulatory fines, in operational rework, in management decisions made on the wrong information. A disciplined approach to data quality assessment is therefore not a back-office concern; it is a precondition for every form of analytics covered in this book.
6.2 Defining Data Quality
Data Quality is the degree to which a set of data is fit for the purpose for which it is being used. The seminal work of Richard Y. Wang & Diane M. Strong (1996) in the Journal of Management Information Systems established the modern, multi-dimensional view of data quality, arguing that quality cannot be reduced to accuracy alone. It depends on the consumer of the data, the task, and the context.
Data quality is therefore relative. A customer dataset that is perfectly adequate for a monthly newsletter mail-out may be entirely inadequate for a regulatory return.
6.3 The Dimensions of Data Quality
| Dimension | Question Answered | Typical Failure |
|---|---|---|
| Accuracy | Does the data correctly describe the real-world entity it represents? | A customer’s address records a city in which they no longer live |
| Completeness | Are all the required values present? | Twenty per cent of customer records have no email address |
| Consistency | Does the data agree with itself across systems? | Customer status is “active” in CRM but “closed” in billing |
| Timeliness | Is the data current enough for the decision it supports? | Yesterday’s stock position is used to plan tomorrow’s promotion |
| Validity | Does each value conform to the defined format and rules? | Date of birth recorded as 31 February or as a free-text string |
| Uniqueness | Is each real-world entity represented exactly once? | The same customer appears as three records under three name spellings |
| Integrity | Are relationships across tables and systems preserved? | An order references a customer ID that does not exist |
| Relevance | Is the data appropriate to the question being asked? | A national dataset is used to support a city-level decision |
The eight dimensions above are the practical core. The full Wang and Strong framework groups dimensions into four broader categories — intrinsic, contextual, representational, and accessibility — which is useful for academic and governance work but more than is needed for day-to-day assessment.
6.3.1 Quantifying Each Dimension
Leo L. Pipino et al. (2002) in Communications of the ACM proposed a small set of standard formulas for turning each dimension into a measurable score, typically a number between zero and one. The most widely used are:
- Completeness ratio = number of non-missing values / total expected values.
- Uniqueness ratio = number of distinct entities / total records.
- Validity ratio = number of records passing format and rule checks / total records.
- Consistency ratio = number of records that agree across the systems compared / total records.
- Timeliness ratio = a function of the gap between the data’s reference date and the decision’s reference date, scaled by the maximum acceptable lag.
Once each dimension carries a score, the organisation can set thresholds — for example, no dataset is permitted to enter the production warehouse with a completeness ratio below 0.95 — and treat data quality as something that is monitored, not just complained about.
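The ratios above take only a few lines to compute. The following is a minimal sketch over a toy customer table; the field names, the sample records, and the 0.95 threshold are illustrative, not prescriptive:

```python
import re

# Toy customer table. Record 3 has an invalid date format and a duplicate id;
# record 2 is missing its email. All values are invented for illustration.
records = [
    {"id": 1, "email": "a@example.com", "dob": "1990-04-12"},
    {"id": 2, "email": None,            "dob": "1985-11-30"},
    {"id": 3, "email": "c@example.com", "dob": "31-02-1990"},
    {"id": 3, "email": "c@example.com", "dob": "31-02-1990"},
]

def completeness(rows, field):
    """Non-missing values divided by total expected values."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """Distinct entities divided by total records."""
    return len({r[key] for r in rows}) / len(rows)

def validity(rows, field, pattern):
    """Records passing a format check divided by total records."""
    ok = re.compile(pattern)
    return sum(bool(r[field] and ok.fullmatch(r[field])) for r in rows) / len(rows)

scores = {
    "completeness(email)": completeness(records, "email"),                  # 0.75
    "uniqueness(id)":      uniqueness(records, "id"),                       # 0.75
    "validity(dob)":       validity(records, "dob", r"\d{4}-\d{2}-\d{2}"),  # 0.50
}

# Threshold check: every dimension scoring below 0.95 is flagged as a gap.
THRESHOLD = 0.95
failures = {k: v for k, v in scores.items() if v < THRESHOLD}
```

Note that a format check of this kind catches `31-02-1990` as invalid only because it breaks the pattern; a semantic check (31 February does not exist) would need a real date parse on top.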
6.4 The Data Quality Assessment Process
```mermaid
flowchart LR
    A["1. Define quality<br>requirements"] --> B["2. Profile<br>the data"]
    B --> C["3. Measure<br>each dimension"]
    C --> D["4. Diagnose<br>root causes"]
    D --> E["5. Remediate<br>and improve"]
    E --> F["6. Monitor<br>continuously"]
    F -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C
```
6.4.1 Step 1 — Define Quality Requirements
Quality is fitness for purpose, and purpose differs by use case. The first step is therefore to specify, with the business owner of the dataset, what good enough means for the decisions the data is supporting. This produces a per-dimension expectation — for example, 95 per cent completeness on customer email, 99 per cent validity on tax identifier, daily refresh latency of no more than 24 hours.
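One lightweight way to make such per-dimension expectations machine-readable is a requirements mapping that later checks can consult. The dataset name, fields, and thresholds below are hypothetical examples of the kind of agreement reached with the business owner:

```python
# Illustrative per-dimension requirements for a hypothetical customer dataset.
REQUIREMENTS = {
    "customer": {
        "completeness": {"email": 0.95},        # >= 95% of emails present
        "validity":     {"tax_id": 0.99},       # >= 99% pass format checks
        "timeliness":   {"max_lag_hours": 24},  # refreshed within 24 hours
    }
}

def required_threshold(dataset, dimension, field):
    """Look up the agreed threshold, or None if no requirement was set."""
    return REQUIREMENTS.get(dataset, {}).get(dimension, {}).get(field)
```

Keeping the requirements as data, rather than buried in pipeline code, lets the same mapping drive both the measurement step and the monitoring step later in the process.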
6.4.2 Step 2 — Profile the Data
Data profiling is the diagnostic examination of a dataset to discover its actual structure, content, and quality. A typical profiling exercise asks:
- What is the distribution of each field? Are there unexpected values?
- What proportion of values is missing in each field?
- How many distinct values does each field carry? Is the cardinality plausible?
- Are there duplicates by candidate keys, by name, by address, by hash?
- Do dates and numerics fall in plausible ranges?
- Do referenced foreign keys exist in the parent tables?
Modern data platforms ship with profiling tools — Power BI’s column profiling, Tableau Prep’s profile pane, Python’s ydata-profiling, R’s skimr and DataExplorer, and dedicated platforms such as Informatica DQ and Talend DQ.
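Before reaching for a dedicated tool, the core profiling questions can be answered with a few lines of standard-library Python. A minimal sketch, with invented column names and values:

```python
from collections import Counter

def profile(rows):
    """Per-field missingness ratio, cardinality, and most common values."""
    report = {}
    for f in rows[0].keys():
        values = [r[f] for r in rows]
        present = [v for v in values if v not in (None, "")]
        report[f] = {
            "missing_ratio": 1 - len(present) / len(values),
            "cardinality": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report

# Toy input: one blank city and one missing age, both counted as missing.
rows = [
    {"city": "Mumbai", "age": 34},
    {"city": "Mumbai", "age": None},
    {"city": "Pune",   "age": 29},
    {"city": "",       "age": 29},
]
rep = profile(rows)
```

A real profiling pass would add range checks on dates and numerics and duplicate detection by candidate key, but even this skeleton surfaces the missingness and cardinality questions from the list above.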
6.4.3 Step 3 — Measure Each Dimension
Apply the formulas from the previous section to compute a numerical score for each of the eight dimensions on the dataset under review. Compare the scores to the requirements set in Step 1 and identify the gaps.
A data quality scorecard is a useful artefact: a one-page summary that lists each critical dataset, the score against each dimension, and the trend over time.
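A scorecard need not be elaborate; a per-dataset, per-dimension grid with a pass/fail flag against the threshold is enough to start. A toy rendering, with datasets, scores, and the 0.95 cut-off all invented for illustration:

```python
# Toy data-quality scorecard: dataset x dimension scores, flagged against
# a single threshold. All names and numbers are hypothetical.
SCORECARD = {
    "customer": {"completeness": 0.97, "uniqueness": 0.91, "validity": 0.99},
    "orders":   {"completeness": 0.99, "uniqueness": 0.99, "validity": 0.96},
}

def render(scorecard, threshold=0.95):
    lines = []
    for dataset, dims in scorecard.items():
        for dim, score in dims.items():
            flag = "OK " if score >= threshold else "GAP"
            lines.append(f"{flag} {dataset:<10} {dim:<13} {score:.2f}")
    return "\n".join(lines)

out = render(SCORECARD)
```

In practice each cell would also carry its own threshold from Step 1 and a trend arrow, but the one-page, at-a-glance shape is the point.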
6.4.4 Step 4 — Diagnose Root Causes
Once a gap is identified, the goal is to find the cause of the gap, not just to clean its symptoms. Common causes include:
- Missing or inadequate validation at the point of data entry.
- Manual processes that allow free-text fields where coded values were intended.
- Integration mismatches between source systems.
- Definitional drift, where the same field name carries different meanings in different systems.
- Inadequate or missing master data — customer, product, employee, location reference data.
- Process changes that were not propagated to data consumers.
A clean dataset that re-fills with bad data each week has not been remediated; it has been swept.
6.4.5 Step 5 — Remediate and Improve
Remediation has two layers: cleaning the existing data, and fixing the upstream cause so the problem does not recur.
- Cleansing corrects, completes, deduplicates, and standardises the existing data.
- Process redesign changes the upstream process so the cleansed state becomes the default. Examples include adding validation to data-entry forms, introducing reference-data lookups, mandating coded fields, and reconciling cross-system definitions.
A common heuristic is the one-ten-hundred rule: it costs roughly one unit to prevent a data-quality defect at the source, ten units to correct it after capture, and one hundred units to deal with the consequences after a decision has been taken on the bad data.
6.4.6 Step 6 — Monitor Continuously
Data quality is not a project; it is a continuous concern. Mature organisations operate continuous data-quality monitoring, in which the eight dimensions are scored automatically on every refresh, alerts fire when scores fall below thresholds, and trends are reviewed by the data-governance committee on a regular cadence.
Monitoring closes the loop, so that the assessment process feeds back into the quality requirements as the business evolves.
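The monitoring loop itself is simple once the scores exist: on every refresh, compare each dimension's score to its threshold and raise an alert on any breach. A minimal sketch, with hypothetical thresholds and scores:

```python
# Sketch of threshold-driven monitoring. Thresholds and the example
# refresh scores are invented for illustration.
THRESHOLDS = {"completeness": 0.95, "uniqueness": 0.98, "validity": 0.99}

def check_refresh(scores, thresholds=THRESHOLDS):
    """Return an alert string for every dimension scoring below threshold."""
    return [
        f"ALERT: {dim} = {score:.3f} below threshold {thresholds[dim]:.2f}"
        for dim, score in scores.items()
        if dim in thresholds and score < thresholds[dim]
    ]

# Example refresh: uniqueness has slipped below its 0.98 threshold.
alerts = check_refresh(
    {"completeness": 0.97, "uniqueness": 0.96, "validity": 0.995}
)
```

In production the alert list would feed a notification channel and a trend store; the essential pattern is that the check runs on every refresh, not on request.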
6.5 Common Causes of Poor Data Quality
Manual data entry without validation: Free-text fields invite spelling variants, abbreviations, and typos. The same customer becomes three records.
System-to-system integration without reconciliation: When two systems were never designed to talk to each other, their identifiers, codes, and definitions diverge.
Definitional disagreement: Marketing’s active customer and Finance’s active customer are not the same customer. Reports built on either definition disagree without anyone being technically wrong.
Unmaintained reference data: Product, location, and organisational hierarchies that are not actively curated drift out of date as the business changes.
Migration scars: Each system migration leaves behind orphan records, fields used for purposes their original designers did not anticipate, and codes whose meaning is lost.
Process changes not reflected in data: A change to the operational process that is not communicated to data consumers produces invisible breaks in time series.
Sensor and instrumentation drift: Industrial and IoT data degrades silently as sensors age, are replaced, or are recalibrated.
Cultural neglect: When data quality is no one’s job, it is no one’s success. Without ownership, it decays.
6.6 Data Quality Governance
Sustained data quality requires more than tools. It requires a governance structure that assigns clear responsibility:
- Data owner: A senior business leader accountable for the quality of a domain of data — customer, product, finance, employee.
- Data steward: A subject-matter expert responsible for definitions, rules, and quality monitoring within a domain.
- Data custodian: The technical role that operates the platforms and pipelines on which the data lives.
- Data quality forum or council: A cross-functional body that resolves definitional disputes, prioritises remediation, and reports quality trends to leadership.
The DAMA Data Management Body of Knowledge (DAMA-DMBOK) is the most widely adopted industry framework for this governance structure and is often used as a reference by Indian and global organisations setting up enterprise data programmes.
6.7 Common Pitfalls
Equating data quality with accuracy alone: Accuracy is one dimension of eight. A perfectly accurate but stale dataset is still poor quality.
Cleaning symptoms without fixing causes: Recurring data-quality defects indicate an upstream problem that no amount of downstream cleansing will resolve.
Boiling the ocean: Trying to assess every field of every dataset at once. Start with the data that supports the decisions that matter most.
Tooling without ownership: Buying a data-quality platform without naming a data owner produces an unloved console of red lights.
One-off assessments: Conducting a quality assessment, fixing the issues, and never measuring again. Quality decays without continuous monitoring.
Measurement without thresholds: Reporting completeness ratios without saying which datasets must clear which thresholds turns the scorecard into wallpaper.
Confusing volume with quality: A larger dataset is not a better dataset. More records of poor quality compound the problem.
Ignoring metadata: Lineage, definitions, and provenance are part of data quality. A field whose meaning no one can confirm is, by definition, low-quality.
6.8 Illustrative Cases
The following cases illustrate how the eight dimensions and the six-step process play out in practice. They are based on the kinds of work commonly seen in industry; the framing is the author’s.
Customer Master in a Retail Bank
A retail bank discovers that a single customer is, on average, represented by 1.4 records in its customer master. The cause is years of free-text name entry, branch-level merger of legacy systems, and inconsistent use of the national identifier as a key. Uniqueness is the failed dimension. The bank deploys a deterministic-plus-fuzzy matching pipeline to identify duplicates, introduces strict validation on the national identifier at all entry points, and creates a steward for the customer master. Within six months, the duplicate ratio falls from 0.4 to 0.05, and downstream reports across marketing, risk, and finance begin to agree.
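A deterministic-plus-fuzzy matcher of the kind described can be sketched with the standard library: match first on the exact national identifier, then fall back to string similarity on names. The records and the 0.85 similarity cut-off below are invented, and a production pipeline would add blocking, address comparison, and human review of borderline pairs:

```python
from difflib import SequenceMatcher

def same_entity(a, b, fuzzy_cutoff=0.85):
    """Deterministic match on national_id, else fuzzy match on name."""
    if a["national_id"] and a["national_id"] == b["national_id"]:
        return True  # deterministic: shared national identifier
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= fuzzy_cutoff

# Hypothetical records: r2 is a spelling variant of r1 with no identifier.
r1 = {"name": "Anita Sharma",  "national_id": "ABC123"}
r2 = {"name": "Anitha Sharma", "national_id": None}
r3 = {"name": "Rahul Verma",   "national_id": None}
```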
Product Master in a Manufacturing Firm
A manufacturer cannot reconcile its monthly sales report with its monthly production report. Investigation reveals that the same physical SKU carries two different codes in two systems, a legacy of a partial migration. Consistency and integrity are the failed dimensions. Remediation is a one-time mapping followed by retirement of the older code, plus a cross-system reference-data process that prevents the issue recurring.
Sensor Data in a Power Plant
A power plant’s predictive-maintenance model begins issuing implausible alerts. Investigation finds that two vibration sensors were replaced during a routine overhaul without the change being recorded; their calibrated baselines no longer match the model’s training data. Accuracy, timeliness, and integrity are simultaneously failed. Remediation includes recalibration, retroactive correction of the affected period, and a sensor-change-control procedure tied directly to the data pipeline.
Loan-Application Data in a Digital Lender
A digital lender’s credit model degrades over a quarter without any change to the model itself. Investigation finds that one upstream third-party data provider has changed the format of a key field from numeric to a comma-formatted string, and the parser has been silently inserting nulls. Validity and completeness are the failed dimensions. The lender introduces schema-validation contracts with all third-party providers and adds automated tests that fail the pipeline when validity drops below threshold.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Quality Matters | Poor data quality is the largest single reason analytics projects fail to deliver value |
| Data Quality | The degree to which data is fit for the purpose for which it is being used |
| The Eight Dimensions | |
| Accuracy | Does the data correctly describe the real-world entity it represents |
| Completeness | Are all the required values present in the dataset |
| Consistency | Does the data agree with itself across systems and reports |
| Timeliness | Is the data current enough for the decision it supports |
| Validity | Does each value conform to the defined format, type, and rules |
| Uniqueness | Is each real-world entity represented exactly once |
| Integrity | Are relationships across tables and systems preserved |
| Relevance | Is the data appropriate to the question being asked |
| Quantifying the Dimensions | |
| Completeness Ratio | Non-missing values divided by total expected values |
| Uniqueness Ratio | Distinct entities divided by total records |
| Validity Ratio | Records passing format and rule checks divided by total records |
| Consistency Ratio | Records that agree across compared systems divided by total records |
| Timeliness Ratio | Function of the gap between the data's reference date and the decision's reference date |
| The Six-Step Assessment Process | |
| Define Quality Requirements | Specify per-dimension expectations with the business owner of the dataset |
| Profile the Data | Diagnostically examine a dataset to discover its actual structure, content, and quality |
| Measure Each Dimension | Compute a numerical score for each dimension and compare to the requirements |
| Diagnose Root Causes | Find the upstream cause of a quality gap rather than only treating its symptoms |
| Remediate and Improve | Cleanse existing data and redesign the upstream process so the cleansed state is default |
| Monitor Continuously | Score dimensions automatically on every refresh, alert on threshold breaches, and review trends |
| Tools and Heuristics | |
| Data Profiling | Tools and techniques to surface distributions, missingness, cardinality, duplicates, and outliers |
| One-Ten-Hundred Rule | Heuristic that prevention costs roughly one unit, correction ten units, and downstream consequences one hundred |
| Continuous Data-Quality Monitoring | Mature practice of automated, threshold-driven quality monitoring on every dataset refresh |
| Common Causes of Poor Data Quality | |
| Manual Entry Without Validation | Free-text fields without validation invite spelling variants and duplicates |
| Integration Without Reconciliation | Two systems whose identifiers, codes, and definitions diverge over time |
| Definitional Disagreement | The same field name carrying different meanings in different functions |
| Unmaintained Reference Data | Reference data such as products, locations, and hierarchies drifting out of date |
| Migration Scars | Each migration leaves orphan records, repurposed fields, and codes of lost meaning |
| Sensor and Instrument Drift | Industrial and IoT data degrades silently as sensors age, fail, or are recalibrated |
| Cultural Neglect | When data quality is no one's responsibility, it decays continuously |
| Data Quality Governance | |
| Data Owner | Senior business leader accountable for the quality of a data domain |
| Data Steward | Subject-matter expert responsible for definitions, rules, and monitoring within a domain |
| Data Custodian | Technical role that operates the platforms and pipelines on which the data lives |
| Data Quality Council | Cross-functional body that resolves disputes, prioritises remediation, and reports quality trends |
| DAMA-DMBOK | Data Management Body of Knowledge; the most widely adopted industry data-management framework |
| Common Pitfalls | |
| Equating Quality with Accuracy | Pitfall of treating accuracy as the whole of data quality and ignoring the other seven dimensions |
| Cleaning Symptoms | Pitfall of cleansing recurring defects without addressing the upstream cause |
| Boiling the Ocean | Pitfall of trying to assess every field of every dataset rather than starting with what matters |
| Tooling Without Ownership | Pitfall of buying a quality platform without naming a business owner for the data |
| One-Off Assessments | Pitfall of conducting one assessment, fixing issues, and never measuring again |
| Measurement Without Thresholds | Pitfall of reporting scores without thresholds, which turns the scorecard into wallpaper |
| Confusing Volume with Quality | Pitfall of treating bigger data as better data, when more poor records compound the problem |
| Ignoring Metadata | Pitfall of ignoring lineage, definitions, and provenance, which are part of data quality |