```mermaid
flowchart LR
A["1. Business<br>Understanding"] --> B["2. Data<br>Understanding"]
B --> C["3. Data<br>Preparation"]
C --> D["4. Modeling"]
D --> E["5. Evaluation"]
E --> F["6. Deployment"]
B -.-> A
D -.-> C
E -.-> A
F -.-> A
style A fill:#fce4ec,stroke:#AD1457
style B fill:#fff3e0,stroke:#EF6C00
style C fill:#fff8e1,stroke:#F9A825
style D fill:#e3f2fd,stroke:#1976D2
style E fill:#ede7f6,stroke:#4527A0
style F fill:#e8f5e9,stroke:#388E3C
```
4 Analytics Project Lifecycle: CRISP-DM Methodology
4.1 The Need for a Structured Lifecycle
Analytics projects fail more often from poor process than from poor algorithms.
A surprising number of analytics projects deliver no business value, not because the data was bad or the model was wrong, but because the project lacked a disciplined process. Goals were not clearly framed, data quality was discovered too late, the model solved a problem nobody owned, or the result was never deployed.
A structured analytics lifecycle is a sequence of phases that turns a vague business question into a deployed analytical solution. It gives the team a shared vocabulary, makes progress visible, and forces the right questions to be asked at the right time. The most widely adopted of these lifecycles is CRISP-DM.
4.2 CRISP-DM Overview
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was developed in the late 1990s by a consortium of practitioners from DaimlerChrysler, SPSS, NCR, and OHRA, and was published as a step-by-step user guide and reference model. The original consortium paper by Rüdiger Wirth & Jochen Hipp (2000) framed CRISP-DM as a tool-neutral, industry-neutral, and application-neutral process for data mining and analytics. The widely cited blueprint by Colin Shearer (2000) in the Journal of Data Warehousing set out the model in the form most practitioners learn today.
CRISP-DM remains, more than two decades later, the most widely used analytics project methodology in industry. It is the implicit backbone of most modern data-science workflows even when it is not invoked by name.
4.2.1 The Six Phases at a Glance
| Phase | Question Answered | Key Output |
|---|---|---|
| 1. Business Understanding | What is the business problem and what would success look like? | Clear objectives, success criteria, project plan |
| 2. Data Understanding | What data do we have, and is it any good? | Data audit, initial findings, quality assessment |
| 3. Data Preparation | How do we shape the data so it can be modelled? | Clean, integrated, feature-engineered analytical dataset |
| 4. Modeling | Which technique fits the problem and the data? | Trained candidate models with parameters and assumptions |
| 5. Evaluation | Does the model meet the business success criteria? | Validated model, list of decisions confirmed or revised |
| 6. Deployment | How do we put the result into the hands of users? | Production model, dashboards, monitoring, documentation |
4.3 The Six Phases in Detail
4.3.1 Phase 1 — Business Understanding
Business understanding turns a vague request into a precisely framed analytical question. It is the phase most often shortchanged, and the phase whose neglect is most often fatal.
The phase has four tasks:
- Determine business objectives: Identify the stakeholder, the decision the analytics will support, and the business outcome by which success will be judged.
- Assess the situation: Review the resources, constraints, assumptions, costs, benefits, risks, and contingencies that surround the project.
- Determine data-mining goals: Translate the business objective into a precise analytical objective. “Reduce churn” becomes “Predict, for every active customer, the probability of attrition within ninety days.”
- Produce a project plan: Sequence the remaining phases, allocate resources, and identify the techniques and tools likely to be used.
The phase ends when the team can answer three questions:
- What decision will be made differently because of this project?
- What does success look like, expressed as a measurable target?
- Who owns the action that follows the analysis?
4.3.2 Phase 2 — Data Understanding
Data understanding is an honest audit of the raw material. It is the phase in which optimistic assumptions about data availability and quality meet reality.
The phase has four tasks:
- Collect initial data: Identify, request, and acquire the datasets the project will draw on, and document their sources.
- Describe the data: Catalogue the format, volume, structure, and meaning of each variable.
- Explore the data: Use descriptive statistics, frequency tables, and visualisations to surface initial patterns, distributions, and surprises.
- Verify data quality: Check for completeness, consistency, accuracy, and timeliness. Flag missing values, outliers, duplicates, and definitional disagreements.
The most important output of this phase is sometimes a list of reasons to redefine the project, because the data needed to answer the original question turns out not to exist or not to be trustworthy.
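The audit tasks above are most useful when they are codified rather than done by eye, so that they can be re-run whenever a fresh extract arrives. Below is a minimal data-audit sketch in Python with pandas; the file name customers.csv and its columns are hypothetical placeholders used only for illustration.

```python
import pandas as pd

# Load the raw extract (hypothetical file and columns, for illustration only).
df = pd.read_csv("customers.csv")

# Describe the data: shape, types, and basic summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Verify data quality: missingness, duplicates, and simple range checks.
missing_share = df.isna().mean().sort_values(ascending=False)
duplicate_rows = df.duplicated().sum()
print("Share of missing values per column:\n", missing_share)
print("Number of fully duplicated rows:", duplicate_rows)

# Flag candidate outliers with a crude rule (beyond three standard deviations).
numeric = df.select_dtypes("number")
outlier_counts = ((numeric - numeric.mean()).abs() > 3 * numeric.std()).sum()
print("Potential outliers per numeric column:\n", outlier_counts)
```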
4.3.3 Phase 3 — Data Preparation
Data preparation produces the analytical dataset on which models will be trained. It is the phase in which most of a project’s time is actually spent — typically sixty to seventy per cent — and the phase in which the eventual quality of the model is largely determined.
The phase has five tasks:
- Select data: Choose which records and which variables will enter the model and document the rationale.
- Clean data: Handle missing values, correct errors, resolve inconsistencies, and remove duplicates.
- Construct data: Engineer derived variables — ratios, lagged values, interaction terms, aggregations — that capture the patterns relevant to the problem.
- Integrate data: Join data from multiple sources into a single analytical table.
- Format data: Convert variables to the form required by the selected modelling tool — encoding, scaling, type conversion, partitioning.
A clean, well-engineered dataset is often a more valuable asset than the model that is eventually built on it, because the same dataset will be reused across many subsequent models.
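The five tasks above can be sketched as a short, repeatable preparation script. The example below uses Python with pandas under assumed, hypothetical inputs (customers.csv and transactions.csv with the columns shown); the particular features engineered are illustrative choices, not something CRISP-DM prescribes.

```python
import pandas as pd

# Hypothetical source tables, for illustration only.
customers = pd.read_csv("customers.csv")        # one row per customer
transactions = pd.read_csv("transactions.csv")  # one row per transaction

# Select data: keep only active customers and the columns we intend to model on.
customers = customers.loc[customers["status"] == "active",
                          ["customer_id", "tenure_months", "plan_type"]]

# Clean data: drop exact duplicates and fill missing tenure with the median.
customers = customers.drop_duplicates()
customers["tenure_months"] = customers["tenure_months"].fillna(
    customers["tenure_months"].median())

# Construct data: engineer aggregate spend features per customer.
spend = (transactions.groupby("customer_id")["amount"]
         .agg(total_spend="sum", avg_spend="mean")
         .reset_index())

# Integrate data: join the engineered features onto the customer table.
analytical = customers.merge(spend, on="customer_id", how="left")

# Format data: one-hot encode the categorical plan type for the modelling tool.
analytical = pd.get_dummies(analytical, columns=["plan_type"], drop_first=True)

print(analytical.head())
```

Because the steps are scripted rather than performed by hand, the same preparation can be re-run when the source data is refreshed, which is part of what makes the analytical dataset a reusable asset.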
4.3.4 Phase 4 — Modeling
Modeling applies analytical techniques to the prepared dataset to produce candidate solutions to the analytical question.
The phase has four tasks:
- Select modelling technique: Choose techniques appropriate to the problem type — regression, classification, clustering, time series, optimisation — and to the data available.
- Generate test design: Decide how the model’s performance will be measured and how the data will be split into training, validation, and test sets.
- Build model: Fit the chosen technique to the training data, tuning hyperparameters as required.
- Assess model: Evaluate the model on the validation set and compare candidate models on the agreed performance metrics.
Several techniques are usually tried in parallel. The result of this phase is a short-list of candidate models, with their parameters, assumptions, and validation performance documented.
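A minimal sketch of the test-design and model-comparison tasks is shown below, using scikit-learn. The synthetic X and y stand in for the analytical dataset produced in Phase 3, and the two candidate techniques and the AUC metric are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Stand-in data; in practice X and y come from the Phase 3 analytical dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Generate test design: hold out a validation set with a fixed random seed.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Build and assess candidate models on the same validation data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_valid)[:, 1]
    print(f"{name}: validation AUC = {roc_auc_score(y_valid, scores):.3f}")
```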
4.3.5 Phase 5 — Evaluation
Evaluation tests whether the technically successful model is also a business success. It is the phase that separates a good experiment from a deliverable solution.
The phase has three tasks:
- Evaluate results: Test the model against the business success criteria agreed in Phase 1. A model that achieves an AUC of 0.85 may still fail if the cost of false positives in production is unacceptable.
- Review process: Conduct a structured retrospective on the project to date. Has anything important been overlooked? Has the data been used appropriately? Are the model’s assumptions defensible?
- Determine next steps: Decide whether the model is ready for deployment, whether further iterations are needed, or whether a new project should be initiated.
An honest possible outcome of evaluation is “do not deploy”. CRISP-DM treats this not as a failure of the lifecycle but as one of its successes.
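One way to make the gap between a technical metric and the business success criterion concrete is a simple expected-value calculation over the confusion matrix, as in the sketch below. The labels, predictions, and cost and benefit figures are hypothetical and exist only to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical validation labels and model decisions at a chosen threshold.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Hypothetical business figures: each correctly flagged case is worth 200,
# each unnecessary intervention costs 20, each missed case costs 200.
value_of_true_positive = 200
cost_of_false_positive = 20
cost_of_false_negative = 200

net_value = (tp * value_of_true_positive
             - fp * cost_of_false_positive
             - fn * cost_of_false_negative)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}, net business value = {net_value}")
```

A model with an attractive technical score can still produce a negative number here, which is exactly the kind of finding that sends a project back to Phase 1 rather than forward to deployment.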
4.3.6 Phase 6 — Deployment
Deployment puts the result into the hands of the people or systems whose decisions it is meant to support. A model that lives only in an analyst’s notebook delivers no business value.
The phase has four tasks:
- Plan deployment: Decide how the result will be delivered — a dashboard, a scoring API, a report, an automated decision engine — and what infrastructure that requires.
- Plan monitoring and maintenance: Define how model performance will be tracked, how data drift will be detected, and how often the model will be retrained.
- Produce final report: Document the project from end to end so that successors can understand, audit, and build on what was done.
- Review project: Capture lessons learned, including what should be done differently next time.
Deployment is not the end of the project; it is the beginning of the model’s working life.
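The monitoring and maintenance task is often operationalised as a periodic drift check on the model's inputs or scores. The sketch below computes a population stability index (PSI) in Python; the 0.2 alert threshold is a common rule of thumb rather than anything mandated by CRISP-DM, and the score distributions are simulated purely for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a reference and a recent sample of model scores using the PSI."""
    # Scores are assumed to lie in [0, 1]; use equal-width bins over that range.
    edges = np.linspace(0.0, 1.0, bins + 1)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids taking the log of zero for empty bins.
    eps = 1e-6
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative data: production scores drift away from the training-time distribution.
rng = np.random.default_rng(0)
training_scores = rng.beta(2, 5, size=5_000)
production_scores = rng.beta(3, 4, size=5_000)

psi = population_stability_index(training_scores, production_scores)
print(f"PSI = {psi:.3f} -> {'investigate drift' if psi > 0.2 else 'stable'}")
```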
4.4 Iteration and the Cyclical Nature of CRISP-DM
CRISP-DM is presented as a sequence of phases, but the original consortium guide is explicit that the process is iterative rather than linear. In practice, almost every project loops back at least once.
The most common loops are:
- Data Understanding back to Business Understanding: The data turns out not to support the original question; the question is reshaped to fit the data that exists.
- Modeling back to Data Preparation: The first round of modelling reveals features that need to be engineered or recoded.
- Evaluation back to Business Understanding: The model meets the technical specification but not the business intent; the business question is sharpened.
- Deployment back to Business Understanding: A model in production reveals new questions, new opportunities, or new problems that begin the next cycle.
The dotted arrows in the diagram above represent these feedback paths. A project that never loops is usually a project that has not looked closely enough.
4.5 Other Analytics Lifecycle Methodologies
| Methodology | Origin | Phases | Distinctive Emphasis |
|---|---|---|---|
| CRISP-DM | DaimlerChrysler / SPSS / NCR / OHRA consortium, 1996–2000 | Six phases: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment | Tool-neutral, industry-neutral, business-driven framing |
| SEMMA | SAS Institute | Five phases: Sample → Explore → Modify → Model → Assess | Closely tied to SAS Enterprise Miner; data-mining focus |
| KDD Process | Fayyad, Piatetsky-Shapiro, and Smyth, 1996 | Five phases: Selection → Pre-processing → Transformation → Data Mining → Interpretation/Evaluation | Academic origin; emphasis on knowledge discovery |
| TDSP | Microsoft Team Data Science Process | Lifecycle of business understanding, data acquisition, modelling, deployment, customer acceptance | Designed for collaborative data-science teams using Azure tooling |
| ASUM-DM | IBM Analytics Solutions Unified Method for Data Mining | Iterative, agile-flavoured extension of CRISP-DM | Adds project management, infrastructure, and operations layers |
CRISP-DM remains the broadest in scope and the most widely adopted of these. SEMMA is closer to a tool workflow than a project methodology. KDD is the academic ancestor. TDSP and ASUM-DM are vendor-aligned modernisations that retain CRISP-DM’s six-phase backbone.
4.6 Best Practices for Running CRISP-DM Projects
- Spend disproportionate time on Business Understanding: A weak Phase 1 is the single largest cause of failed analytics projects. Pay for it up front.
- Treat Data Understanding as honest auditing, not optimistic confirmation: Look for what is broken, missing, or inconsistent before you fall in love with the data.
- Budget realistically for Data Preparation: Plan for sixty to seventy per cent of project effort here. Communicate this to sponsors at the start.
- Run Modeling as a comparison, not a coronation: Try several techniques. Document why one was preferred.
- Include the deployment owner from Phase 1: A model designed without the eventual operator will fail in deployment.
- Document as you go: A project that is rigorously documented can be audited, reused, and revisited. One that is not is a future archaeology problem.
- Build feedback loops: Plan from the start how the deployed model will be monitored, retrained, and eventually retired.
- Treat ethical and privacy review as part of every phase: Bias entering in Phase 2 is harder to remove in Phase 5.
4.7 Common Pitfalls
- Skipping Phase 1: Diving into the data without a clear business question produces interesting findings nobody acts on.
- Treating Data Preparation as a one-off: A data-prep step that is not codified will be redone, inconsistently, by the next analyst.
- Modeling Heroics: Investing weeks in a marginal improvement to model accuracy when the bottleneck is in deployment or adoption.
- Optimising the Wrong Metric: Tuning a model to a technical metric (accuracy, AUC, RMSE) that does not map onto the business outcome the project was set up to improve.
- Throw-It-Over-the-Wall Deployment: Treating deployment as someone else’s problem after Phase 5 ends. A model not designed for deployment will not be deployed.
- Forgotten Models in Production: Deploying a model and then never monitoring or retraining it. Performance silently decays as the world changes.
- No Retrospective: Closing a project without writing down what was learned, so that the next project repeats the same mistakes.
- Methodology Theatre: Applying CRISP-DM as a sequence of templates to be filled in, rather than as a discipline of asking the right questions at the right time.
4.8 Illustrative Cases
The following short cases illustrate how the CRISP-DM phases play out in practice. They are based on the kinds of projects commonly seen in industry; the framing is the author’s.
Customer Churn for a Telecommunications Operator
A telecommunications operator wishes to reduce post-paid customer attrition. Phase 1 translates the goal into a measurable target — reducing the ninety-day attrition rate by two percentage points among customers in their twelfth-to-eighteenth month of tenure. Phase 2 audits billing, usage, and complaint data and discovers a recurring data-quality issue with hand-set inventory records. Phase 3 integrates the cleaned datasets and engineers usage-trend, complaint-recency, and competitive-offer features. Phase 4 compares logistic regression with gradient-boosted trees. Phase 5 confirms that the boosted-tree model meets the business target on a holdout, and that retention-offer cost remains within budget. Phase 6 integrates the model with the customer-care call-handling system so that attrition risk and the recommended offer appear on the agent’s screen for inbound calls. The cycle then loops back to Phase 1 to scope a separate project on inbound-channel optimisation.
Predictive Maintenance in a Manufacturing Plant
A manufacturer wants to reduce unplanned downtime on a critical line. Phase 1 sets the target — a thirty per cent reduction in unplanned stops on a named line over six months. Phase 2 ingests sensor and maintenance-log data and finds that several sensors have been re-instrumented mid-period without the change being recorded. Phase 3 corrects sensor lineage, engineers vibration- and temperature-trend features, and aligns the data on a unified time index. Phase 4 trains a survival-style model on time to next failure. Phase 5 establishes that the model brings useful early warning for two of three failure modes and that the third needs additional sensors. Phase 6 deploys the early-warning system to the plant’s maintenance dispatch board and starts monitoring drift. The third failure mode becomes the seed of a follow-on project.
Fraud Detection in a Retail Bank
A retail bank wants to reduce fraudulent online card-not-present transactions. Phase 1 sets the target as a measurable reduction in fraud loss subject to a maximum tolerable false-positive rate that does not damage genuine customer experience. Phase 2 audits the transaction stream, the customer master, and the disputes ledger and discovers that the disputes ledger does not always carry the original transaction identifier. Phase 3 rebuilds a clean transaction-and-disputes table and engineers velocity, geolocation, and merchant-category features. Phase 4 compares an ensemble model with the existing rules engine. Phase 5 demonstrates a meaningful uplift in precision at fixed recall, but discovers that the model decisions need to be explainable to satisfy regulatory requirements. Phase 6 deploys the model in shadow mode for two months alongside the rules engine, and only then begins to take live action. Monitoring is continuous, and the model is retrained on a regular cadence as fraud patterns evolve.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Analytics Lifecycle | Sequence of phases that turns a vague business question into a deployed analytical solution |
| CRISP-DM | Cross-Industry Standard Process for Data Mining; the most widely adopted analytics methodology in industry |
| Phase 1: Business Understanding | |
| Business Understanding | Phase 1: clarifies the business problem, success criteria, and analytical objective |
| Determine Business Objectives | Identify the stakeholder, the decision being supported, and the business outcome |
| Assess the Situation | Review resources, constraints, assumptions, costs, benefits, and risks |
| Determine Data-Mining Goals | Translate the business objective into a precise analytical objective |
| Produce a Project Plan | Sequence the remaining phases, allocate resources, and identify likely techniques |
| Phase 2: Data Understanding | |
| Data Understanding | Phase 2: an honest audit of the raw material the project will rely on |
| Collect Initial Data | Identify, request, and acquire the datasets and document their sources |
| Describe the Data | Catalogue the format, volume, structure, and meaning of each variable |
| Explore the Data | Use descriptive statistics and visualisations to surface patterns and surprises |
| Verify Data Quality | Check completeness, consistency, accuracy, and timeliness; flag issues early |
| Phase 3: Data Preparation | |
| Data Preparation | Phase 3: shapes the data into the analytical dataset on which models are trained |
| Select Data | Choose which records and which variables will enter the model |
| Clean Data | Handle missing values, correct errors, and remove duplicates |
| Construct Data | Engineer derived variables that capture patterns relevant to the problem |
| Integrate Data | Join data from multiple sources into a single analytical table |
| Format Data | Convert variables to the form required by the selected modelling tool |
| Phase 4: Modeling | |
| Modeling | Phase 4: applies analytical techniques to produce candidate solutions |
| Select Modelling Technique | Choose techniques appropriate to the problem type and the data available |
| Generate Test Design | Decide how performance will be measured and how data will be split |
| Build Model | Fit the technique to the training data and tune hyperparameters |
| Assess Model | Evaluate the model on the validation set and compare candidates |
| Phase 5: Evaluation | |
| Evaluation | Phase 5: tests whether the technically successful model is also a business success |
| Evaluate Results | Test the model against the business success criteria agreed in Phase 1 |
| Review Process | Conduct a structured retrospective on the project to date |
| Determine Next Steps | Decide whether to deploy, iterate further, or initiate a new project |
| Phase 6: Deployment | |
| Deployment | Phase 6: puts the result into the hands of users or operational systems |
| Plan Deployment | Decide how the result will be delivered and what infrastructure is required |
| Plan Monitoring and Maintenance | Define how performance will be tracked, drift detected, and the model retrained |
| Produce Final Report | Document the project end-to-end so successors can audit and build on it |
| Review Project | Capture lessons learned and improvements for the next project |
| Iteration and Other Methodologies | |
| Iteration | CRISP-DM is iterative; loops between phases are normal and expected |
| SEMMA | SAS Institute methodology: Sample, Explore, Modify, Model, Assess |
| KDD Process | Knowledge Discovery in Databases process: Selection, Pre-processing, Transformation, Mining, Interpretation |
| TDSP | Microsoft Team Data Science Process; team-oriented and Azure-aligned |
| ASUM-DM | IBM Analytics Solutions Unified Method; agile-flavoured extension of CRISP-DM |
| Common Pitfalls | |
| Skipping Phase 1 | Pitfall of diving into the data without a clear business question |
| Modeling Heroics | Pitfall of investing in marginal accuracy gains when the bottleneck is elsewhere |
| Optimising the Wrong Metric | Pitfall of tuning to a technical metric that does not map to the business outcome |
| Throw-It-Over-the-Wall Deployment | Pitfall of treating deployment as someone else's problem after Phase 5 |
| Forgotten Models in Production | Pitfall of deploying a model and never monitoring or retraining it |
| No Retrospective | Pitfall of closing a project without capturing what was learned |
| Methodology Theatre | Pitfall of applying CRISP-DM as templates rather than disciplined questioning |