5 Data Types: Structured, Unstructured, Internal, and External

```mermaid
flowchart TD
A["Data"] --> B["Structured<br>Tabular, schema-defined"]
A --> C["Semi-Structured<br>Self-describing, flexible schema"]
A --> D["Unstructured<br>No predefined schema"]
B --> B1["RDBMS, spreadsheets,<br>data warehouses"]
C --> C1["JSON, XML, NoSQL,<br>log files, sensor feeds"]
D --> D1["Text, images, audio,<br>video, social media"]
style A fill:#eceff1,stroke:#455A64
style B fill:#e3f2fd,stroke:#1976D2
style C fill:#fff8e1,stroke:#F9A825
style D fill:#e8f5e9,stroke:#388E3C
```
5.1 Why Data Types Matter
Before any analytical method can be chosen, the analyst must understand what kind of data is on the table.
The choice of analytical technique, the design of the data platform, the cost of storage, the speed of insight, and the legal obligations on the firm are all conditioned by the type of data being handled. A model designed for tabular sales data is useless against a stream of free-text customer reviews; a database that handles point-of-sale transactions cannot store a million product images efficiently.
Data is therefore classified along several intersecting dimensions:
- By structure — structured, semi-structured, or unstructured.
- By source — internal or external to the organisation.
- By measurement nature — quantitative or qualitative.
- By origin — primary or secondary.
- By temporal pattern — cross-sectional, time-series, or streaming.
A given dataset can sit on every dimension at once: an external, semi-structured, qualitative, secondary, streaming social-media feed is a perfectly coherent description.
5.2 Classifying Data by Structure
5.2.1 Structured Data
Structured data is organised in a predefined schema — typically rows and columns — where every record has the same fields and every field has a defined data type. It fits naturally into relational databases and spreadsheets and is queried with SQL.
- Examples: Sales transactions, customer master records, general-ledger entries, employee tables, product catalogues, inventory positions.
- Storage: Relational databases (Oracle, SQL Server, MySQL, PostgreSQL), spreadsheets, enterprise data warehouses (Snowflake, BigQuery, Redshift, Synapse).
- Strengths: Easy to store, query, and aggregate. Mature tools, well-understood governance, fast at scale.
- Limitations: Inflexible to schema change. Cannot natively represent text, images, or signals.
- Estimated share: Historically the dominant form of business data, but Gandomi and Haider (2015) note that structured data accounts for only about ten to twenty per cent of the data generated today; the majority of new data is unstructured.
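To make the fixed-schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its rows are invented for illustration.

```python
import sqlite3

# Create an in-memory relational database with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        product    TEXT    NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_price REAL    NOT NULL
    )
""")

# Every record has the same fields, and every field has a declared type.
conn.executemany(
    "INSERT INTO sales (product, quantity, unit_price) VALUES (?, ?, ?)",
    [("widget", 3, 9.99), ("gadget", 1, 24.50), ("widget", 5, 9.99)],
)

# Aggregation is a one-line SQL query: the payoff of a fixed schema.
for product, revenue in conn.execute(
    "SELECT product, SUM(quantity * unit_price) FROM sales GROUP BY product"
):
    print(product, round(revenue, 2))
```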
5.2.2 Semi-Structured Data
Semi-structured data carries its own schema inside each record — through tags, key–value pairs, or hierarchical markers — but does not require all records to share the same fields. It is flexible enough to represent variability in records, yet structured enough to be parsed programmatically.
- Examples: JSON documents, XML files, NoSQL document stores, web-server logs, sensor and IoT telemetry, email message headers.
- Storage: Document databases (MongoDB, Couchbase), key-value stores (Redis, DynamoDB), data lakes (Amazon S3, Azure Data Lake), search platforms (Elasticsearch).
- Strengths: Flexible schema, natural fit for web and mobile data, scales horizontally.
- Limitations: Joins and aggregations are harder than in pure relational stores; governance and quality controls are less mature.
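What "self-describing, flexible schema" means in practice is easiest to see in code. A minimal sketch using Python's standard json module follows; the three event records are invented for illustration.

```python
import json

# Three records from the same hypothetical event feed: each is self-describing,
# but they do not all share the same fields -- the hallmark of semi-structured data.
raw_events = [
    '{"event": "page_view", "user": "u1", "url": "/home"}',
    '{"event": "purchase", "user": "u2", "amount": 49.90, "currency": "INR"}',
    '{"event": "sensor", "device": "d7", "reading": {"temp_c": 41.2, "rpm": 1180}}',
]

for line in raw_events:
    record = json.loads(line)   # the schema travels inside the record
    event = record.get("event", "unknown")
    user = record.get("user")   # absent in some records -- and that is fine
    print(event, user, sorted(record.keys()))
```

A relational table would force every record into the union of all fields; here each record carries exactly the fields it needs.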
5.2.3 Unstructured Data
Unstructured data has no predefined data model. The information is meaningful to a human reader but does not fit into rows and columns without significant pre-processing.
- Examples: Free-text customer reviews, emails, social-media posts, call-centre transcripts, contracts, scanned documents, photographs, surveillance footage, audio recordings.
- Storage: Object storage (S3, Azure Blob, Google Cloud Storage), data lakes, content-management systems, search and retrieval platforms.
- Strengths: Captures rich human and sensory information that structured data cannot.
- Limitations: Requires natural-language processing, computer vision, or speech-recognition pipelines to extract analysable features. Storage and processing costs are higher.
- Estimated share: Gandomi and Haider (2015) estimate that roughly eighty per cent of enterprise-relevant data is unstructured, and the share is rising.
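Real pipelines use proper NLP libraries, but a deliberately naive sketch illustrates the pre-processing point: free text has to be turned into features before it is analysable. The reviews and keyword lists below are invented.

```python
import re
from collections import Counter

# Two invented free-text reviews: meaningful to a human, but no rows and columns yet.
reviews = [
    "Delivery was late and the box arrived damaged. Very disappointed.",
    "Great product, fast shipping, friendly support. Would buy again!",
]

POSITIVE = {"great", "fast", "friendly", "good", "excellent"}
NEGATIVE = {"late", "damaged", "disappointed", "bad", "slow"}

for text in reviews:
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenise
    counts = Counter(tokens)
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    # The unstructured review is now a (token_count, sentiment_score) feature row.
    print(len(tokens), score)
```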
5.2.4 Comparison: Structured vs Semi-Structured vs Unstructured
| Aspect | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, predefined | Flexible, self-describing | None |
| Examples | Sales records, ledgers | JSON, XML, log files, IoT telemetry | Text, images, audio, video |
| Storage | RDBMS, data warehouse | NoSQL, data lake | Object storage, data lake |
| Query language | SQL | SQL extensions, document queries | Search, NLP, computer vision |
| Tooling maturity | Very mature | Mature | Rapidly evolving |
| Typical share of enterprise data | 10–20 % | 5–10 % | 70–80 % |
| Governance ease | High | Medium | Low |
| Analytical readiness | Immediate | Moderate pre-processing | Substantial pre-processing |
5.3 Classifying Data by Source
A second classification asks where the data comes from. The distinction matters because internal and external sources have different cost, control, and quality profiles, and the most powerful analytical insights usually come from combining the two.
5.3.1 Internal Data
Internal data is generated and held by the organisation itself, in its operational and administrative systems.
- Examples: Sales transactions, CRM customer records, ERP financials, HRIS payroll data, point-of-sale logs, manufacturing-line sensors, website-clickstream logs, internal email and document repositories.
- Strengths: Directly owned and controlled, strongly relevant to the organisation’s own decisions, available at granular detail, governed by internal policies.
- Limitations: Reflects only the firm’s own activity, may have systematic blind spots, can be siloed across business units.
5.3.2 External Data
External data is generated outside the organisation and acquired or accessed through some channel.
- Examples: Government statistics (population, GDP, inflation, agricultural output), market data (stock prices, commodity prices), industry reports, weather feeds, satellite imagery, social-media feeds, third-party demographic and credit data, web-scraped competitor information, paid data services (Bloomberg, Refinitiv, Nielsen, Kantar).
- Strengths: Brings context the organisation could not generate itself, enables benchmarking and external sensing, supports macro and competitive analysis.
- Limitations: Quality and timeliness vary; licensing and privacy obligations apply; integration with internal data requires careful entity resolution and key matching.
5.3.3 Comparison: Internal vs External
| Aspect | Internal | External |
|---|---|---|
| Origin | Generated within the firm | Acquired from outside |
| Cost | Largely sunk in operational systems | Often involves licensing or acquisition fees |
| Control | Full ownership and governance | Limited; subject to provider terms |
| Relevance | Directly tied to the firm’s operations | Provides context, benchmarks, market signals |
| Quality | Reflects internal data discipline | Variable across providers |
| Privacy | Subject to internal policy and applicable law | Subject to source’s terms and applicable law |
| Best used for | Operational decisions, customer-level analysis | Market sizing, benchmarking, environmental sensing |
The most valuable analytical work usually combines both. A bank’s churn model improves when internal transaction data is enriched with external macroeconomic indicators; a retailer’s demand forecast improves when internal sales data is enriched with weather and event data.
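A minimal sketch of such an enrichment, assuming pandas is available; all figures are invented. The internal sales table is joined to an external weather feed on a shared date key.

```python
import pandas as pd

# Internal, structured: daily sales from the firm's own systems (invented figures).
sales = pd.DataFrame({
    "date":  pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-03"]),
    "units": [120, 95, 210],
})

# External: a daily weather feed acquired from a third party (invented figures).
weather = pd.DataFrame({
    "date":       pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-03"]),
    "max_temp_c": [31.0, 28.5, 39.2],
})

# Entity resolution here is trivial (a shared date key); in practice the join
# keys need careful matching, as noted above.
enriched = sales.merge(weather, on="date", how="left")
print(enriched)
```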
5.4 Other Useful Classifications of Data
5.4.1 Quantitative versus Qualitative
- Quantitative data is numerical and supports arithmetic operations. It includes counts, measurements, ratios, and percentages, and subdivides into discrete data (counts of items) and continuous data (measurements on a scale).
- Qualitative data is categorical or descriptive and does not support arithmetic. It includes labels, ranks, and free text, and subdivides into nominal data (categories with no order, such as gender or region) and ordinal data (categories with a natural order, such as satisfaction rating or education level).
The distinction matters because it determines which statistical techniques are valid: means and standard deviations apply to quantitative data, frequencies and modes to qualitative data, and ordinal data sits in between.
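A small sketch, assuming pandas, of which summaries are valid for each measurement type; the customer table is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "spend":  [250.0, 120.5, 399.9, 80.0],          # quantitative (continuous)
    "visits": [3, 1, 7, 2],                         # quantitative (discrete)
    "region": ["north", "south", "north", "east"],  # qualitative (nominal)
    "rating": pd.Categorical(                       # qualitative (ordinal)
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
})

print(df["spend"].mean(), df["spend"].std())  # arithmetic is valid
print(df["region"].value_counts().idxmax())   # frequencies and modes only
print(df["rating"].max())                     # order is meaningful, arithmetic is not
```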
5.4.2 Primary versus Secondary
- Primary data is collected first-hand by the analyst or the organisation, for the specific purpose at hand. Surveys, customer interviews, A/B-test logs, and proprietary sensor readings are primary data.
- Secondary data is collected by someone else, for some other purpose, and reused. Government statistics, syndicated market reports, academic datasets, and web-scraped competitor information are secondary data.
Primary data is more expensive but precisely fit-for-purpose; secondary data is cheaper but always requires the analyst to ask whether the original purpose, definitions, and methods match the current problem.
5.4.3 Cross-Sectional, Time-Series, Panel, and Streaming
- Cross-sectional data is a snapshot at a single point in time across many units — for example, the customer base on the last day of a quarter.
- Time-series data is repeated observations of the same units over time — for example, monthly sales of each product over five years.
- Panel (longitudinal) data combines the two — many units observed over many time periods.
- Streaming data is generated continuously and must be processed in motion — point-of-sale events, sensor telemetry, web clicks, and financial-market ticks.
The temporal pattern determines whether the question is about levels, trends, seasonality, or real-time response, and whether batch or stream-processing infrastructure is appropriate.
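A brief sketch, assuming pandas, contrasting a time-series question (a trend, which needs the temporal ordering) with a cross-sectional one (a snapshot across units); all values are invented.

```python
import pandas as pd

# Time-series: monthly sales of one product over a year (invented values).
ts = pd.Series(
    [100, 110, 95, 130, 150, 160, 140, 170, 180, 165, 190, 210],
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)

# A trend question depends on the ordering: a 3-month rolling mean.
print(ts.rolling(window=3).mean().tail())

# A cross-sectional question ignores ordering: one snapshot across many units.
snapshot = pd.Series({"store_a": 210, "store_b": 175, "store_c": 198})
print(snapshot.describe())
```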
5.5 Big Data and the Vs
The phrase big data describes data whose volume, velocity, or variety exceeds the capacity of conventional data-processing tools. The original three-V framing was introduced by Doug Laney (2001) to characterise the new data-management challenge facing firms; later authors extended it to four and five Vs to capture concerns about quality and business value (Gandomi & Haider, 2015).
```mermaid
flowchart TD
BD["Big Data"] --> V1["Volume<br>Scale of data"]
BD --> V2["Velocity<br>Speed of generation<br>and processing"]
BD --> V3["Variety<br>Structured, semi-,<br>and unstructured"]
BD --> V4["Veracity<br>Trustworthiness<br>and accuracy"]
BD --> V5["Value<br>Business outcome<br>and ROI"]
style BD fill:#e3f2fd,stroke:#1976D2
style V1 fill:#e8f5e9,stroke:#388E3C
style V2 fill:#fff8e1,stroke:#F9A825
style V3 fill:#fff3e0,stroke:#EF6C00
style V4 fill:#fce4ec,stroke:#AD1457
style V5 fill:#ede7f6,stroke:#4527A0
```
| V | Meaning | Implication for Analytics |
|---|---|---|
| Volume | Sheer scale of data, from terabytes into petabytes and beyond | Distributed storage and computation; cost-aware data lifecycle |
| Velocity | Speed at which data is generated, captured, and processed | Streaming platforms; real-time vs batch architecture decisions |
| Variety | Mix of structured, semi-structured, and unstructured forms | Polyglot persistence; pipelines that handle multiple modalities |
| Veracity | Trustworthiness, accuracy, and consistency of data | Data-quality pipelines; provenance, lineage, and validation |
| Value | Business outcome that the data delivers when analysed | Prioritise data investments by expected decision value |
The fifth V — Value — is the one that distinguishes a strategic data programme from a costly hoarding exercise. Volume, velocity, variety, and veracity describe properties of the data; value is the property of the analytical use that justifies the investment.
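In practice, veracity is enforced through validation rules applied as data lands. A minimal sketch of such checks on a hypothetical sensor reading follows; the field names and plausibility thresholds are invented.

```python
import math

def validate(reading: dict) -> list[str]:
    """Return a list of veracity problems for one hypothetical sensor reading."""
    problems = []
    temp = reading.get("temp_c")
    if temp is None:
        problems.append("missing temp_c")                  # completeness
    elif math.isnan(temp):
        problems.append("temp_c is NaN")                   # consistency
    elif not (-40.0 <= temp <= 120.0):
        problems.append(f"temp_c out of range: {temp}")    # plausibility
    if "device" not in reading:
        problems.append("missing device id")               # provenance
    return problems

for r in [{"device": "d1", "temp_c": 42.0},
          {"device": "d2", "temp_c": 900.0},
          {"temp_c": None}]:
    print(r, validate(r) or "ok")
```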
5.6 Implications for Analytics
The classification of data has direct consequences for the rest of the analytics work:
Choice of technique: Tabular structured data is the natural home of regression, classification, and time-series modelling. Unstructured text is the home of natural-language processing. Images and video are the home of computer vision. Mixing types — for example, augmenting a churn model with sentiment scores from call-centre transcripts — often produces better results than either source alone.
Choice of platform: Pure structured data lives well in a data warehouse. Mixed structured and unstructured data lives in a data lake or modern lakehouse. Streaming data needs purpose-built platforms such as Kafka, Flink, or Spark Streaming.
Cost profile: Storage and compute costs are roughly proportional to volume and velocity, but the cost of the human work of cleaning, labelling, and modelling rises faster with variety than with any other dimension.
Governance and ethics: External data brings licence terms; personal data brings privacy law; unstructured data brings copyright and consent considerations. Each type carries its own legal and ethical envelope.
5.7 Common Pitfalls
Confusing volume with value: Hoarding data because it is cheap to store, without ever defining the decisions it is meant to support.
Forcing unstructured data into a relational schema: Discarding most of the signal in the process. The right answer is usually a different storage layer, not a contorted table.
Treating internal data as sufficient: Many strategic questions cannot be answered without external context. Internal data alone produces inward-looking analytics.
Treating external data as gospel: Third-party data has its own quality issues. Always understand how it was collected before relying on it.
Ignoring the temporal dimension: Modelling time-series data with cross-sectional methods, or vice versa, introduces predictable but easily overlooked errors.
Underestimating unstructured pre-processing: Text, image, and audio pipelines require substantial engineering before a model can be trained on them. Plan for it.
Neglecting veracity: Velocity and volume are emphasised in vendor marketing, but veracity determines whether the analytics are trustworthy enough to act on.
Privacy by accident: Internal log files often contain personal information that the firm is legally obliged to protect. Treat sensitive data as sensitive from the moment it lands.
5.8 Illustrative Cases
The following short cases illustrate how data type choices play out in practice. They are based on the kinds of work commonly seen in industry; the framing is the author’s.
Retail Demand Forecasting — Internal Plus External
A retailer wants to forecast daily demand at the store-and-product level. The team starts with internal point-of-sale and inventory data — structured, well-governed, but inward-looking. Forecast accuracy improves when the team adds external weather data (a weather event materially shifts ice-cream and umbrella demand) and external local-events data (a stadium event near a store shifts beverage and snack demand). The dataset becomes a join of internal structured data and external semi-structured feeds, and the resulting forecast outperforms either source alone.
Customer Experience — Structured Scores Plus Unstructured Verbatims
A service business runs a Net Promoter Score programme. The numerical NPS scores are structured, easy to dashboard, and correlate weakly with churn. The free-text verbatim comments accompanying the scores are unstructured and, until recently, were read by analysts in small samples. The team builds a natural-language pipeline that extracts themes and sentiment from the verbatims and joins them to the structured score and to the customer master. The blended dataset produces a churn model meaningfully better than the score-only model, and an actionable map of the specific service failures driving negative sentiment.
Industrial IoT — Streaming Semi-Structured Data
A manufacturer instruments a critical production line with hundreds of sensors. The data is semi-structured (each sensor emits a tagged time-stamped reading), high-velocity (thousands of readings per second), and variable in quality (sensors fail and drift). Storing every reading in a relational database is impractical; the team builds a streaming pipeline that lands raw data in object storage, computes aggregates and anomaly scores in motion, and surfaces alerts to the maintenance dispatch board. The architectural choice is driven directly by the data type: streaming, semi-structured, high-volume, with veracity concerns front and centre.
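The "anomaly scores in motion" step can be illustrated with a toy sliding-window z-score in pure Python; a production job would run on a streaming platform, and the window size, threshold, and feed values below are invented.

```python
from collections import deque
import statistics

def stream_anomaly_scores(readings, window=20, threshold=3.0):
    """Yield (value, z_score, is_anomaly) for each reading in a stream."""
    recent = deque(maxlen=window)   # sliding window of recent readings
    for value in readings:
        if len(recent) >= 2:
            mean = statistics.fmean(recent)
            sd = statistics.pstdev(recent) or 1e-9   # guard a perfectly flat window
            z = (value - mean) / sd
            yield value, z, abs(z) > threshold
        else:
            yield value, 0.0, False   # not enough history yet
        recent.append(value)

# Toy feed: a steady signal with one spike (values invented).
feed = [50.1, 50.3, 49.8, 50.0, 50.2, 49.9, 50.1, 95.0, 50.0, 50.2]
for value, z, flag in stream_anomaly_scores(feed, window=5):
    print(f"{value:6.1f}  z={z:8.2f}  {'ALERT' if flag else ''}")
```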
Credit Scoring — Combining Primary and Secondary, Internal and External
An Indian retail bank builds a credit-scoring model for personal loans. The bank’s own application data and repayment history are internal, primary, and structured. Credit-bureau scores from CIBIL or Experian are external and structured. Account-aggregator-derived bank-statement summaries are external, semi-structured, and require parsing. Demographic and macroeconomic indicators from official sources are external, secondary, and structured. The blended dataset uses every classification this chapter has introduced, and the model’s performance reflects the disciplined integration of all of them.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Types Matter | The choice of analytical technique, platform, cost, and legal envelope all flow from the type of data on the table |
| Structure-Based Classification | |
| Structured Data | Tabular, schema-defined data that fits naturally into relational databases and is queried with SQL |
| Semi-Structured Data | Self-describing data with flexible schema such as JSON, XML, log files, and IoT telemetry |
| Unstructured Data | Data with no predefined model such as text, images, audio, and video; estimated at around eighty per cent of enterprise data |
| Schema | The set of fields and types that defines the structure of a dataset; fixed for structured, flexible for semi-structured, absent for unstructured |
| Object Storage | Storage layer such as Amazon S3 or Azure Blob designed for unstructured and semi-structured data at scale |
| Source-Based Classification | |
| Internal Data | Data generated within the organisation in its operational and administrative systems |
| External Data | Data acquired from outside the organisation, including government, market, and third-party sources |
| Internal-External Combination | The most powerful analyses usually combine internal granularity with external context |
| Measurement Nature | |
| Quantitative Data | Numerical data that supports arithmetic, including counts, measurements, ratios, and percentages |
| Qualitative Data | Categorical or descriptive data such as labels, ranks, and free text |
| Discrete and Continuous | Quantitative subdivisions: discrete counts versus continuous measurements on a scale |
| Nominal and Ordinal | Qualitative subdivisions: nominal categories with no order versus ordinal categories with a natural order |
| Origin | |
| Primary Data | Data collected first-hand for the specific purpose at hand |
| Secondary Data | Data collected by someone else for some other purpose and then reused |
| Temporal Pattern | |
| Cross-Sectional Data | Snapshot at a single point in time across many units |
| Time-Series Data | Repeated observations of the same units over time |
| Panel Data | Many units observed over many time periods, combining cross-sectional and time-series |
| Streaming Data | Data generated continuously and processed in motion |
| Big Data and the Five Vs | |
| Big Data | Data whose volume, velocity, or variety exceeds conventional processing capacity |
| Volume | Sheer scale of data, requiring distributed storage and computation |
| Velocity | Speed at which data is generated, captured, and processed |
| Variety | Mix of structured, semi-structured, and unstructured forms within the same problem |
| Veracity | Trustworthiness, accuracy, and consistency of data |
| Value | Business outcome that the data delivers when analysed; the property that justifies investment |
| Implications for Analytics | |
| Choice of Technique | Tabular methods for structured data, NLP for text, computer vision for images, mixed pipelines for blends |
| Choice of Platform | Warehouse for structured, lake or lakehouse for mixed, purpose-built platforms for streaming |
| Cost Profile | Storage and compute scale with volume and velocity; human effort scales fastest with variety |
| Governance and Ethics | External data brings licence terms; personal data brings privacy law; unstructured data brings copyright |
| Common Pitfalls | |
| Confusing Volume with Value | Pitfall of hoarding data because storage is cheap, without defining the decisions it supports |
| Forcing Unstructured into Tables | Pitfall of contorting unstructured signal into relational tables and discarding most of it |
| Internal-Only Blind Spot | Pitfall of relying only on internal data and producing inward-looking analytics |
| External Data as Gospel | Pitfall of trusting third-party data without auditing how it was collected |
| Ignoring the Temporal Dimension | Pitfall of modelling time-series data with cross-sectional methods or vice versa |
| Underestimating Pre-Processing | Pitfall of underestimating the engineering required to make text, image, and audio analysable |
| Neglecting Veracity | Pitfall of echoing vendor emphasis on volume and velocity while ignoring whether the data is trustworthy enough to act on |
| Privacy by Accident | Pitfall of internal log files quietly capturing personal information without the protections such data legally requires |