4  Analytics Project Lifecycle: CRISP-DM Methodology

4.1 The Need for a Structured Lifecycle

Analytics projects fail more often from poor process than from poor algorithms.

A surprising number of analytics projects deliver no business value, not because the data was bad or the model was wrong, but because the project lacked a disciplined process. Goals were not clearly framed, data quality was discovered too late, the model solved a problem nobody owned, or the result was never deployed.

A structured analytics lifecycle is a sequence of phases that turns a vague business question into a deployed analytical solution. It gives the team a shared vocabulary, makes progress visible, and forces the right questions to be asked at the right time. The most widely adopted of these lifecycles is CRISP-DM.

4.2 CRISP-DM Overview

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was developed in the late 1990s by a consortium of practitioners from DaimlerChrysler, SPSS, NCR, and OHRA, and was published as a step-by-step user guide and reference model. The original consortium paper by Rüdiger Wirth and Jochen Hipp (2000) framed CRISP-DM as a tool-neutral, industry-neutral, and application-neutral process for data mining and analytics. The widely cited blueprint by Colin Shearer (2000) in the Journal of Data Warehousing set out the model in the form most practitioners learn today.

CRISP-DM remains, more than two decades later, the most widely used analytics project methodology in industry. It is the implicit backbone of most modern data-science workflows even when it is not invoked by name.

4.2.1 The Six Phases at a Glance

```mermaid
flowchart LR
    A["1. Business<br>Understanding"] --> B["2. Data<br>Understanding"]
    B --> C["3. Data<br>Preparation"]
    C --> D["4. Modeling"]
    D --> E["5. Evaluation"]
    E --> F["6. Deployment"]
    B -.-> A
    D -.-> C
    E -.-> A
    F -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C
```

Tip: The Six Phases of CRISP-DM

| Phase | Question Answered | Key Output |
|-------|------------------|------------|
| 1. Business Understanding | What is the business problem and what would success look like? | Clear objectives, success criteria, project plan |
| 2. Data Understanding | What data do we have, and is it any good? | Data audit, initial findings, quality assessment |
| 3. Data Preparation | How do we shape the data so it can be modelled? | Clean, integrated, feature-engineered analytical dataset |
| 4. Modeling | Which technique fits the problem and the data? | Trained candidate models with parameters and assumptions |
| 5. Evaluation | Does the model meet the business success criteria? | Validated model, list of decisions confirmed or revised |
| 6. Deployment | How do we put the result into the hands of users? | Production model, dashboards, monitoring, documentation |

4.3 The Six Phases in Detail

4.3.1 Phase 1 — Business Understanding

Business understanding turns a vague request into a precisely framed analytical question. It is the phase most often shortchanged, and the phase whose neglect is most often fatal.

The phase has four tasks:

  • Determine business objectives: Identify the stakeholder, the decision the analytics will support, and the business outcome by which success will be judged.
  • Assess the situation: Review the resources, constraints, assumptions, costs, benefits, risks, and contingencies that surround the project.
  • Determine data-mining goals: Translate the business objective into a precise analytical objective. “Reduce churn” becomes “Predict, for every active customer, the probability of attrition within ninety days.”
  • Produce a project plan: Sequence the remaining phases, allocate resources, and identify the techniques and tools likely to be used.

The phase ends when the team can answer three questions:

  • What decision will be made differently because of this project?
  • What does success look like, expressed as a measurable target?
  • Who owns the action that follows the analysis?

4.3.2 Phase 2 — Data Understanding

Data understanding is an honest audit of the raw material. It is the phase in which optimistic assumptions about data availability and quality meet reality.

The phase has four tasks:

  • Collect initial data: Identify, request, and acquire the datasets the project will draw on, and document their sources.
  • Describe the data: Catalogue the format, volume, structure, and meaning of each variable.
  • Explore the data: Use descriptive statistics, frequency tables, and visualisations to surface initial patterns, distributions, and surprises.
  • Verify data quality: Check for completeness, consistency, accuracy, and timeliness. Flag missing values, outliers, duplicates, and definitional disagreements.

The most important output of this phase is sometimes a list of reasons to redefine the project, because the data needed to answer the original question turns out not to exist or not to be trustworthy.
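The verify-data-quality task can be sketched as a small audit script. The dataset and column names below are purely illustrative, and the checks shown (duplicate rows, missing values, out-of-range values) are a minimal subset of a real audit:

```python
import numpy as np
import pandas as pd

# Hypothetical customer extract; columns and values are illustrative only.
df = pd.DataFrame({
    "customer_id":   [101, 102, 102, 104, 105],
    "tenure_months": [12, 5, 5, np.nan, 240],
    "monthly_spend": [42.0, 18.5, 18.5, 30.0, -7.0],
})

def data_quality_report(frame: pd.DataFrame) -> dict:
    """Summarise completeness, duplication, and obvious range violations."""
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
        "negative_spend_rows": int((frame["monthly_spend"] < 0).sum()),
    }

report = data_quality_report(df)
# One duplicate record, one missing tenure, one impossible negative spend:
# each is a finding to raise with the data owner, not to silently "fix".
```

Findings like these feed the quality-assessment deliverable, and sometimes the list of reasons to redefine the project.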

4.3.3 Phase 3 — Data Preparation

Data preparation produces the analytical dataset on which models will be trained. It is the phase in which most of a project’s time is actually spent — typically sixty to seventy per cent — and the phase in which the eventual quality of the model is largely determined.

The phase has five tasks:

  • Select data: Choose which records and which variables will enter the model and document the rationale.
  • Clean data: Handle missing values, correct errors, resolve inconsistencies, and remove duplicates.
  • Construct data: Engineer derived variables — ratios, lagged values, interaction terms, aggregations — that capture the patterns relevant to the problem.
  • Integrate data: Join data from multiple sources into a single analytical table.
  • Format data: Convert variables to the form required by the selected modelling tool — encoding, scaling, type conversion, partitioning.

A clean, well-engineered dataset is often a more valuable asset than the model that is eventually built on it, because the same dataset will be reused across many subsequent models.
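The construct–integrate–format sequence can be illustrated with a toy example. The tables, columns, and the usage-trend feature are hypothetical; the point is that each preparation step is codified so it can be re-run identically next time:

```python
import pandas as pd

# Hypothetical source tables; names and columns are illustrative only.
usage = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "month":       [1, 2, 1, 2],
    "minutes":     [300, 150, 80, 120],
})
accounts = pd.DataFrame({
    "customer_id": [1, 2],
    "plan":        ["gold", "basic"],
})

# Construct: engineer a usage-trend feature (latest month over first month).
trend = (usage.sort_values("month")
              .groupby("customer_id")["minutes"]
              .apply(lambda s: s.iloc[-1] / s.iloc[0]))
features = trend.rename("usage_trend").reset_index()

# Integrate: join onto the account table -> one analytical row per customer.
analytical = accounts.merge(features, on="customer_id", how="left")

# Format: one-hot encode the categorical plan column for the modelling tool.
analytical = pd.get_dummies(analytical, columns=["plan"])
```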

4.3.4 Phase 4 — Modeling

Modeling applies analytical techniques to the prepared dataset to produce candidate solutions to the analytical question.

The phase has four tasks:

  • Select modelling technique: Choose techniques appropriate to the problem type — regression, classification, clustering, time series, optimisation — and to the data available.
  • Generate test design: Decide how the model’s performance will be measured and how the data will be split into training, validation, and test sets.
  • Build model: Fit the chosen technique to the training data, tuning hyperparameters as required.
  • Assess model: Evaluate the model on the validation set and compare candidate models on the agreed performance metrics.

Several techniques are usually tried in parallel. The result of this phase is a short-list of candidate models, with their parameters, assumptions, and validation performance documented.
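A minimal sketch of the test design and candidate comparison, using a synthetic dataset as a stand-in for the prepared analytical table; the two candidate techniques and the AUC metric are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared analytical dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Test design: a stratified held-out split (k-fold CV is a common alternative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Candidate techniques compared on the agreed performance metric.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# `scores` now holds the documented basis for the short-list decision.
```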

4.3.5 Phase 5 — Evaluation

Evaluation tests whether the technically successful model is also a business success. It is the phase that separates a good experiment from a deliverable solution.

The phase has three tasks:

  • Evaluate results: Test the model against the business success criteria agreed in Phase 1. A model that hits a 0.85 AUC may still fail if the cost of false positives in production is unacceptable.
  • Review process: Conduct a structured retrospective on the project to date. Has anything important been overlooked? Has the data been used appropriately? Are the model’s assumptions defensible?
  • Determine next steps: Decide whether the model is ready for deployment, whether further iterations are needed, or whether a new project should be initiated.

The most honest possible outcome of evaluation is "do not deploy". CRISP-DM treats this not as a failure of the lifecycle but as one of its successes.
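The gap between a technical metric and the business success criterion can be made concrete by costing the model's errors. The unit costs, labels, and scores below are invented for illustration; the pattern is to compare candidate operating points in money, not in AUC alone:

```python
# Hypothetical unit costs: a missed positive vs. a false alarm.
COST_FALSE_NEGATIVE = 500.0   # e.g. loss per missed fraud case
COST_FALSE_POSITIVE = 25.0    # e.g. cost per falsely flagged customer

def business_cost(y_true, y_score, threshold):
    """Total cost of operating the model at a given decision threshold."""
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Toy labels and model scores, invented for illustration.
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]

# Compare candidate operating points on cost rather than on AUC alone.
costs = {thr: business_cost(y_true, y_score, thr) for thr in (0.3, 0.5, 0.7)}
```

The cheapest threshold here is not the one a purely technical metric would pick, which is exactly the kind of finding Phase 5 exists to surface.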

4.3.6 Phase 6 — Deployment

Deployment puts the result into the hands of the people or systems whose decisions it is meant to support. A model that lives only in an analyst’s notebook delivers no business value.

The phase has four tasks:

  • Plan deployment: Decide how the result will be delivered — a dashboard, a scoring API, a report, an automated decision engine — and what infrastructure that requires.
  • Plan monitoring and maintenance: Define how model performance will be tracked, how data drift will be detected, and how often the model will be retrained.
  • Produce final report: Document the project from end to end so that successors can understand, audit, and build on what was done.
  • Review project: Capture lessons learned, including what should be done differently next time.

Deployment is not the end of the project; it is the beginning of the model’s working life.
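The drift-monitoring task can be sketched with the Population Stability Index (PSI), one common way to detect when the distribution of live model scores has moved away from the baseline the model was validated on. The bin edges and the rule-of-thumb alert level of roughly 0.2 are conventional choices, not fixed rules:

```python
import bisect
import math

BIN_EDGES = [0.25, 0.5, 0.75]  # interior cut points on scores in [0, 1]

def bin_shares(scores):
    """Share of scores falling in each bin, floored to avoid log(0)."""
    counts = [0] * (len(BIN_EDGES) + 1)
    for s in scores:
        counts[bisect.bisect_right(BIN_EDGES, s)] += 1
    return [max(c / len(scores), 1e-6) for c in counts]

def psi(baseline, live):
    """Population Stability Index between baseline and live score samples."""
    return sum((a - b) * math.log(a / b)
               for b, a in zip(bin_shares(baseline), bin_shares(live)))

# Scores at validation time vs. a (deliberately shifted) live sample.
baseline = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
live     = [0.7, 0.8, 0.8, 0.9, 0.9, 0.95]
drift = psi(baseline, live)  # values above ~0.2 commonly trigger review
```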

4.4 Iteration and the Cyclical Nature of CRISP-DM

CRISP-DM is presented as a sequence of phases, but the original consortium guide is explicit that the process is iterative rather than linear. In practice, almost every project loops back at least once.

The most common loops are:

  • Data Understanding back to Business Understanding: The data turns out not to support the original question; the question is reshaped to fit the data that exists.
  • Modeling back to Data Preparation: The first round of modelling reveals features that need to be engineered or recoded.
  • Evaluation back to Business Understanding: The model meets the technical specification but not the business intent; the business question is sharpened.
  • Deployment back to Business Understanding: A model in production reveals new questions, new opportunities, or new problems that begin the next cycle.

The dotted arrows in the diagram above represent these feedback paths. A project that never loops is usually a project that has not looked closely enough.

4.5 Other Analytics Lifecycle Methodologies

Tip: Comparison of Analytics Lifecycle Methodologies

| Methodology | Origin | Phases | Distinctive Emphasis |
|-------------|--------|--------|----------------------|
| CRISP-DM | DaimlerChrysler / SPSS / NCR / OHRA consortium, 1996–2000 | Six phases: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment | Tool-neutral, industry-neutral, business-driven framing |
| SEMMA | SAS Institute | Five phases: Sample → Explore → Modify → Model → Assess | Closely tied to SAS Enterprise Miner; data-mining focus |
| KDD Process | Fayyad, Piatetsky-Shapiro, and Smyth, 1996 | Five phases: Selection → Pre-processing → Transformation → Data Mining → Interpretation/Evaluation | Academic origin; emphasis on knowledge discovery |
| TDSP | Microsoft (Team Data Science Process) | Business understanding → data acquisition → modelling → deployment → customer acceptance | Designed for collaborative data-science teams using Azure tooling |
| ASUM-DM | IBM (Analytics Solutions Unified Method for Data Mining) | Iterative, agile-flavoured extension of CRISP-DM | Adds project management, infrastructure, and operations layers |

CRISP-DM remains the broadest and most adopted of these. SEMMA is closer to a tool workflow than a project methodology. KDD is the academic ancestor. TDSP and ASUM-DM are vendor-aligned modernisations that retain CRISP-DM’s six-phase backbone.

4.6 Best Practices for Running CRISP-DM Projects

  • Spend disproportionate time on Business Understanding: A weak Phase 1 is the single largest cause of failed analytics projects. Pay for it up front.

  • Treat Data Understanding as honest auditing, not optimistic confirmation: Look for what is broken, missing, or inconsistent before you fall in love with the data.

  • Budget realistically for Data Preparation: Plan for sixty to seventy per cent of project effort here. Communicate this to sponsors at the start.

  • Run Modeling as a comparison, not a coronation: Try several techniques. Document why one was preferred.

  • Include the deployment owner from Phase 1: A model designed without the eventual operator will fail in deployment.

  • Document as you go: A project that is rigorously documented can be audited, reused, and revisited. One that is not is a future archaeology problem.

  • Build feedback loops: Plan from the start how the deployed model will be monitored, retrained, and eventually retired.

  • Treat ethical and privacy review as part of every phase: Bias entering in Phase 2 is harder to remove in Phase 5.

4.7 Common Pitfalls

  • Skipping Phase 1: Diving into the data without a clear business question produces interesting findings nobody acts on.

  • Treating Data Preparation as a one-off: A data-prep step that is not codified will be redone, inconsistently, by the next analyst.

  • Modeling Heroics: Investing weeks in a marginal improvement to model accuracy when the bottleneck is in deployment or adoption.

  • Optimising the Wrong Metric: Tuning a model to a technical metric (accuracy, AUC, RMSE) that does not map onto the business outcome the project was set up to improve.

  • Throw-It-Over-the-Wall Deployment: Treating deployment as someone else’s problem after Phase 5 ends. A model not designed for deployment will not be deployed.

  • Forgotten Models in Production: Deploying a model and then never monitoring or retraining it. Performance silently decays as the world changes.

  • No Retrospective: Closing a project without writing down what was learned, so that the next project repeats the same mistakes.

  • Methodology Theatre: Applying CRISP-DM as a sequence of templates to be filled in, rather than as a discipline of asking the right questions at the right time.

4.8 Illustrative Cases

The following short cases illustrate how the CRISP-DM phases play out in practice. They are based on the kinds of projects commonly seen in industry; the framing is the author’s.

Customer Churn for a Telecommunications Operator

A telecommunications operator wishes to reduce post-paid customer attrition. Phase 1 translates the goal into a measurable target — reducing the ninety-day attrition rate by two percentage points among customers in their twelfth-to-eighteenth month of tenure. Phase 2 audits billing, usage, and complaint data and discovers a recurring data-quality issue with hand-set inventory records. Phase 3 integrates the cleaned datasets and engineers usage-trend, complaint-recency, and competitive-offer features. Phase 4 compares logistic regression with gradient-boosted trees. Phase 5 confirms that the boosted-tree model meets the business target on a holdout, and that retention-offer cost remains within budget. Phase 6 integrates the model with the customer-care call-handling system so that attrition risk and the recommended offer appear on the agent’s screen for inbound calls. The cycle then loops back to Phase 1 to scope a separate project on inbound-channel optimisation.

Predictive Maintenance in a Manufacturing Plant

A manufacturer wants to reduce unplanned downtime on a critical line. Phase 1 sets the target — a thirty per cent reduction in unplanned stops on a named line over six months. Phase 2 ingests sensor and maintenance-log data and finds that several sensors have been re-instrumented mid-period without the change being recorded. Phase 3 corrects sensor lineage, engineers vibration- and temperature-trend features, and aligns the data on a unified time index. Phase 4 trains a survival-style model on time to next failure. Phase 5 establishes that the model brings useful early warning for two of three failure modes and that the third needs additional sensors. Phase 6 deploys the early-warning system to the plant’s maintenance dispatch board and starts monitoring drift. The third failure mode becomes the seed of a follow-on project.

Fraud Detection in a Retail Bank

A retail bank wants to reduce fraudulent online card-not-present transactions. Phase 1 sets the target as a measurable reduction in fraud loss subject to a maximum tolerable false-positive rate that does not damage genuine customer experience. Phase 2 audits the transaction stream, the customer master, and the disputes ledger and discovers that the disputes ledger does not always carry the original transaction identifier. Phase 3 rebuilds a clean transaction-and-disputes table and engineers velocity, geolocation, and merchant-category features. Phase 4 compares an ensemble model with the existing rules engine. Phase 5 demonstrates a meaningful uplift in precision at fixed recall, but discovers that the model decisions need to be explainable to satisfy regulatory requirements. Phase 6 deploys the model in shadow mode for two months alongside the rules engine, and only then begins to take live action. Monitoring is continuous, and the model is retrained on a regular cadence as fraud patterns evolve.


Summary

| Concept | Description |
|---------|-------------|
| **Foundations** | |
| Analytics Lifecycle | Sequence of phases that turns a vague business question into a deployed analytical solution |
| CRISP-DM | Cross-Industry Standard Process for Data Mining; the most widely adopted analytics methodology in industry |
| **Phase 1: Business Understanding** | |
| Business Understanding | Phase 1: clarifies the business problem, success criteria, and analytical objective |
| Determine Business Objectives | Identify the stakeholder, the decision being supported, and the business outcome |
| Assess the Situation | Review resources, constraints, assumptions, costs, benefits, and risks |
| Determine Data-Mining Goals | Translate the business objective into a precise analytical objective |
| Produce a Project Plan | Sequence the remaining phases, allocate resources, and identify likely techniques |
| **Phase 2: Data Understanding** | |
| Data Understanding | Phase 2: an honest audit of the raw material the project will rely on |
| Collect Initial Data | Identify, request, and acquire the datasets and document their sources |
| Describe the Data | Catalogue the format, volume, structure, and meaning of each variable |
| Explore the Data | Use descriptive statistics and visualisations to surface patterns and surprises |
| Verify Data Quality | Check completeness, consistency, accuracy, and timeliness; flag issues early |
| **Phase 3: Data Preparation** | |
| Data Preparation | Phase 3: shapes the data into the analytical dataset on which models are trained |
| Select Data | Choose which records and which variables will enter the model |
| Clean Data | Handle missing values, correct errors, and remove duplicates |
| Construct Data | Engineer derived variables that capture patterns relevant to the problem |
| Integrate Data | Join data from multiple sources into a single analytical table |
| Format Data | Convert variables to the form required by the selected modelling tool |
| **Phase 4: Modeling** | |
| Modeling | Phase 4: applies analytical techniques to produce candidate solutions |
| Select Modelling Technique | Choose techniques appropriate to the problem type and the data available |
| Generate Test Design | Decide how performance will be measured and how data will be split |
| Build Model | Fit the technique to the training data and tune hyperparameters |
| Assess Model | Evaluate the model on the validation set and compare candidates |
| **Phase 5: Evaluation** | |
| Evaluation | Phase 5: tests whether the technically successful model is also a business success |
| Evaluate Results | Test the model against the business success criteria agreed in Phase 1 |
| Review Process | Conduct a structured retrospective on the project to date |
| Determine Next Steps | Decide whether to deploy, iterate further, or initiate a new project |
| **Phase 6: Deployment** | |
| Deployment | Phase 6: puts the result into the hands of users or operational systems |
| Plan Deployment | Decide how the result will be delivered and what infrastructure is required |
| Plan Monitoring and Maintenance | Define how performance will be tracked, drift detected, and the model retrained |
| Produce Final Report | Document the project end-to-end so successors can audit and build on it |
| Review Project | Capture lessons learned and improvements for the next project |
| **Iteration and Other Methodologies** | |
| Iteration | CRISP-DM is iterative; loops between phases are normal and expected |
| SEMMA | SAS Institute methodology: Sample, Explore, Modify, Model, Assess |
| KDD Process | Knowledge Discovery in Databases process: Selection, Pre-processing, Transformation, Mining, Interpretation |
| TDSP | Microsoft Team Data Science Process; team-oriented and Azure-aligned |
| ASUM-DM | IBM Analytics Solutions Unified Method; agile-flavoured extension of CRISP-DM |
| **Common Pitfalls** | |
| Skipping Phase 1 | Pitfall of diving into the data without a clear business question |
| Modeling Heroics | Pitfall of investing in marginal accuracy gains when the bottleneck is elsewhere |
| Optimising the Wrong Metric | Pitfall of tuning to a technical metric that does not map to the business outcome |
| Throw-It-Over-the-Wall Deployment | Pitfall of treating deployment as someone else's problem after Phase 5 |
| Forgotten Models in Production | Pitfall of deploying a model and never monitoring or retraining it |
| No Retrospective | Pitfall of closing a project without capturing what was learned |
| Methodology Theatre | Pitfall of applying CRISP-DM as templates rather than disciplined questioning |