5 Data Types: Structured, Unstructured, Internal, and External

```mermaid
flowchart TD
A["Data"] --> B["Structured<br>Tabular, schema-defined"]
A --> C["Semi-Structured<br>Self-describing, flexible schema"]
A --> D["Unstructured<br>No predefined schema"]
B --> B1["RDBMS, spreadsheets,<br>data warehouses"]
C --> C1["JSON, XML, NoSQL,<br>log files, sensor feeds"]
D --> D1["Text, images, audio,<br>video, social media"]
style A fill:#eceff1,stroke:#455A64
style B fill:#e3f2fd,stroke:#1976D2
style C fill:#fff8e1,stroke:#F9A825
style D fill:#e8f5e9,stroke:#388E3C
```
5.1 Why Data Types Matter
Before any analytical method can be chosen, the analyst must understand what kind of data is on the table.
The choice of analytical technique, the design of the data platform, the cost of storage, the speed of insight, and the legal obligations on the firm are all conditioned by the type of data being handled. A model designed for tabular sales data is useless against a stream of free-text customer reviews; a database that handles point-of-sale transactions cannot store a million product images efficiently.
Data is therefore classified along several intersecting dimensions:
- By structure — structured, semi-structured, or unstructured.
- By source — internal or external to the organisation.
- By measurement nature — quantitative or qualitative.
- By origin — primary or secondary.
- By temporal pattern — cross-sectional, time-series, or streaming.
A given dataset can sit on every dimension at once: an external, semi-structured, qualitative, secondary, streaming social-media feed is a perfectly coherent description.
5.2 Classifying Data by Structure
5.2.1 Structured Data
Structured data is organised in a predefined schema — typically rows and columns — where every record has the same fields and every field has a defined data type. It fits naturally into relational databases and spreadsheets and is queried with SQL.
- Examples: Sales transactions, customer master records, general-ledger entries, employee tables, product catalogues, inventory positions.
- Storage: Relational databases (Oracle, SQL Server, MySQL, PostgreSQL), spreadsheets, enterprise data warehouses (Snowflake, BigQuery, Redshift, Synapse).
- Strengths: Easy to store, query, and aggregate. Mature tools, well-understood governance, fast at scale.
- Limitations: Inflexible to schema change. Cannot natively represent text, images, or signals.
- Estimated share: Historically the dominant form of business data, but Gandomi and Haider (2015) note that structured data accounts for only about ten to twenty per cent of the data generated today; the majority of new data is unstructured.
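To make the fixed-schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its rows are invented for illustration.

```python
import sqlite3

# Create an in-memory relational database with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        product    TEXT    NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_price REAL    NOT NULL
    )
""")

# Every record has the same fields, and every field has a declared type.
conn.executemany(
    "INSERT INTO sales (product, quantity, unit_price) VALUES (?, ?, ?)",
    [("widget", 3, 9.99), ("gadget", 1, 24.50), ("widget", 5, 9.99)],
)

# Aggregation is a one-line SQL query: the payoff of a fixed schema.
for product, revenue in conn.execute(
    "SELECT product, SUM(quantity * unit_price) FROM sales GROUP BY product"
):
    print(product, round(revenue, 2))
```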
5.2.2 Semi-Structured Data
Semi-structured data carries its own schema inside each record — through tags, key–value pairs, or hierarchical markers — but does not require all records to share the same fields. It is flexible enough to represent variability in records, yet structured enough to be parsed programmatically.
- Examples: JSON documents, XML files, NoSQL document stores, web-server logs, sensor and IoT telemetry, email message headers.
- Storage: Document databases (MongoDB, Couchbase), key-value stores (Redis, DynamoDB), data lakes (Amazon S3, Azure Data Lake), search platforms (Elasticsearch).
- Strengths: Flexible schema, natural fit for web and mobile data, scales horizontally.
- Limitations: Joins and aggregations are harder than in pure relational stores; governance and quality controls are less mature.
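What "self-describing, flexible schema" means in practice is easiest to see in code. A minimal sketch using Python's standard json module follows; the three event records are invented for illustration.

```python
import json

# Three records from the same hypothetical event feed: each is self-describing,
# but they do not all share the same fields -- the hallmark of semi-structured data.
raw_events = [
    '{"event": "page_view", "user": "u1", "url": "/home"}',
    '{"event": "purchase", "user": "u2", "amount": 49.90, "currency": "INR"}',
    '{"event": "sensor", "device": "d7", "reading": {"temp_c": 41.2, "rpm": 1180}}',
]

for line in raw_events:
    record = json.loads(line)   # the schema travels inside the record
    event = record.get("event", "unknown")
    user = record.get("user")   # absent in some records -- and that is fine
    print(event, user, sorted(record.keys()))
```

A relational table would force every record into the union of all fields; here each record carries exactly the fields it needs.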
5.2.3 Unstructured Data
Unstructured data has no predefined data model. The information is meaningful to a human reader but does not fit into rows and columns without significant pre-processing.
- Examples: Free-text customer reviews, emails, social-media posts, call-centre transcripts, contracts, scanned documents, photographs, surveillance footage, audio recordings.
- Storage: Object storage (S3, Azure Blob, Google Cloud Storage), data lakes, content-management systems, search and retrieval platforms.
- Strengths: Captures rich human and sensory information that structured data cannot.
- Limitations: Requires natural-language processing, computer vision, or speech-recognition pipelines to extract analysable features. Storage and processing costs are higher.
- Estimated share: Gandomi and Haider (2015) estimate that roughly eighty per cent of enterprise-relevant data is unstructured, and the share is rising.
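Real pipelines use proper NLP libraries, but a deliberately naive sketch illustrates the pre-processing point: free text has to be turned into features before it is analysable. The reviews and keyword lists below are invented.

```python
import re
from collections import Counter

# Two invented free-text reviews: meaningful to a human, but no rows and columns yet.
reviews = [
    "Delivery was late and the box arrived damaged. Very disappointed.",
    "Great product, fast shipping, friendly support. Would buy again!",
]

POSITIVE = {"great", "fast", "friendly", "good", "excellent"}
NEGATIVE = {"late", "damaged", "disappointed", "bad", "slow"}

for text in reviews:
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenise
    counts = Counter(tokens)
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    # The unstructured review is now a (token_count, sentiment_score) feature row.
    print(len(tokens), score)
```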
5.2.4 Comparison: Structured vs Semi-Structured vs Unstructured
| Aspect | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, predefined | Flexible, self-describing | None |
| Examples | Sales records, ledgers | JSON, XML, log files, IoT telemetry | Text, images, audio, video |
| Storage | RDBMS, data warehouse | NoSQL, data lake | Object storage, data lake |
| Query language | SQL | SQL extensions, document queries | Search, NLP, computer vision |
| Tooling maturity | Very mature | Mature | Rapidly evolving |
| Typical share of enterprise data | 10–20 % | 5–10 % | 70–80 % |
| Governance ease | High | Medium | Low |
| Analytical readiness | Immediate | Moderate pre-processing | Substantial pre-processing |
5.3 Classifying Data by Source
A second classification asks where the data comes from. The distinction matters because internal and external sources have different cost, control, and quality profiles, and the most powerful analytical insights usually come from combining the two.
5.3.1 Internal Data
Internal data is generated and held by the organisation itself, in its operational and administrative systems.
- Examples: Sales transactions, CRM customer records, ERP financials, HRIS payroll data, point-of-sale logs, manufacturing-line sensors, website-clickstream logs, internal email and document repositories.
- Strengths: Directly owned and controlled, strongly relevant to the organisation’s own decisions, available at granular detail, governed by internal policies.
- Limitations: Reflects only the firm’s own activity, may have systematic blind spots, can be siloed across business units.
5.3.2 External Data
External data is generated outside the organisation and acquired or accessed through some channel.
- Examples: Government statistics (population, GDP, inflation, agricultural output), market data (stock prices, commodity prices), industry reports, weather feeds, satellite imagery, social-media feeds, third-party demographic and credit data, web-scraped competitor information, paid data services (Bloomberg, Refinitiv, Nielsen, Kantar).
- Strengths: Brings context the organisation could not generate itself, enables benchmarking and external sensing, supports macro and competitive analysis.
- Limitations: Quality and timeliness vary; licensing and privacy obligations apply; integration with internal data requires careful entity resolution and key matching.
5.3.3 Comparison: Internal vs External
| Aspect | Internal | External |
|---|---|---|
| Origin | Generated within the firm | Acquired from outside |
| Cost | Largely sunk in operational systems | Often involves licensing or acquisition fees |
| Control | Full ownership and governance | Limited; subject to provider terms |
| Relevance | Directly tied to the firm’s operations | Provides context, benchmarks, market signals |
| Quality | Reflects internal data discipline | Variable across providers |
| Privacy | Subject to internal policy and applicable law | Subject to source’s terms and applicable law |
| Best used for | Operational decisions, customer-level analysis | Market sizing, benchmarking, environmental sensing |
The most valuable analytical work usually combines both. A bank’s churn model improves when internal transaction data is enriched with external macroeconomic indicators; a retailer’s demand forecast improves when internal sales data is enriched with weather and event data.
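A minimal sketch of such an enrichment, assuming pandas is available; all figures are invented. The internal sales table is joined to an external weather feed on a shared date key.

```python
import pandas as pd

# Internal, structured: daily sales from the firm's own systems (invented figures).
sales = pd.DataFrame({
    "date":  pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-03"]),
    "units": [120, 95, 210],
})

# External: a daily weather feed acquired from a third party (invented figures).
weather = pd.DataFrame({
    "date":       pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-03"]),
    "max_temp_c": [31.0, 28.5, 39.2],
})

# Entity resolution here is trivial (a shared date key); in practice the join
# keys need careful matching, as noted above.
enriched = sales.merge(weather, on="date", how="left")
print(enriched)
```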
5.4 Other Useful Classifications of Data
5.4.1 Quantitative versus Qualitative
- Quantitative data is numerical and supports arithmetic operations. It includes counts, measurements, ratios, and percentages, and subdivides into discrete data (counts of items) and continuous data (measurements on a scale).
- Qualitative data is categorical or descriptive and does not support arithmetic. It includes labels, ranks, and free text, and subdivides into nominal data (categories with no order, such as gender or region) and ordinal data (categories with a natural order, such as satisfaction rating or education level).
The distinction matters because it determines which statistical techniques are valid: means and standard deviations apply to quantitative data, frequencies and modes to qualitative data, and ordinal data sits in between.
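A small sketch, assuming pandas, of which summaries are valid for each measurement type; the customer table is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "spend":  [250.0, 120.5, 399.9, 80.0],          # quantitative (continuous)
    "visits": [3, 1, 7, 2],                         # quantitative (discrete)
    "region": ["north", "south", "north", "east"],  # qualitative (nominal)
    "rating": pd.Categorical(                       # qualitative (ordinal)
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
})

print(df["spend"].mean(), df["spend"].std())  # arithmetic is valid
print(df["region"].value_counts().idxmax())   # frequencies and modes only
print(df["rating"].max())                     # order is meaningful, arithmetic is not
```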
5.4.2 Primary versus Secondary
- Primary data is collected first-hand by the analyst or the organisation, for the specific purpose at hand. Surveys, customer interviews, A/B-test logs, and proprietary sensor readings are primary data.
- Secondary data is collected by someone else, for some other purpose, and reused. Government statistics, syndicated market reports, academic datasets, and web-scraped competitor information are secondary data.
Primary data is more expensive but precisely fit-for-purpose; secondary data is cheaper but always requires the analyst to ask whether the original purpose, definitions, and methods match the current problem.
5.4.3 Cross-Sectional, Time-Series, Panel, and Streaming
- Cross-sectional data is a snapshot at a single point in time across many units — for example, the customer base on the last day of a quarter.
- Time-series data is repeated observations of the same units over time — for example, monthly sales of each product over five years.
- Panel (longitudinal) data combines the two — many units observed over many time periods.
- Streaming data is generated continuously and must be processed in motion — point-of-sale events, sensor telemetry, web clicks, and financial-market ticks.
The temporal pattern determines whether the question is about levels, trends, seasonality, or real-time response, and whether batch or stream-processing infrastructure is appropriate.
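A brief sketch, assuming pandas, contrasting a time-series question (a trend, which needs the temporal ordering) with a cross-sectional one (a snapshot across units); all values are invented.

```python
import pandas as pd

# Time-series: monthly sales of one product over a year (invented values).
ts = pd.Series(
    [100, 110, 95, 130, 150, 160, 140, 170, 180, 165, 190, 210],
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)

# A trend question depends on the ordering: a 3-month rolling mean.
print(ts.rolling(window=3).mean().tail())

# A cross-sectional question ignores ordering: one snapshot across many units.
snapshot = pd.Series({"store_a": 210, "store_b": 175, "store_c": 198})
print(snapshot.describe())
```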
5.5 Big Data and the Vs
The phrase big data describes data whose volume, velocity, or variety exceeds the capacity of conventional data-processing tools. The original three-V framing was introduced by Doug Laney (2001) to characterise the new data-management challenge facing firms; later authors extended it to four and five Vs to capture concerns about quality and business value (Gandomi & Haider, 2015).
```mermaid
flowchart TD
BD["Big Data"] --> V1["Volume<br>Scale of data"]
BD --> V2["Velocity<br>Speed of generation<br>and processing"]
BD --> V3["Variety<br>Structured, semi-,<br>and unstructured"]
BD --> V4["Veracity<br>Trustworthiness<br>and accuracy"]
BD --> V5["Value<br>Business outcome<br>and ROI"]
style BD fill:#e3f2fd,stroke:#1976D2
style V1 fill:#e8f5e9,stroke:#388E3C
style V2 fill:#fff8e1,stroke:#F9A825
style V3 fill:#fff3e0,stroke:#EF6C00
style V4 fill:#fce4ec,stroke:#AD1457
style V5 fill:#ede7f6,stroke:#4527A0
```
| V | Meaning | Implication for Analytics |
|---|---|---|
| Volume | Sheer scale of data, from terabytes into petabytes and beyond | Distributed storage and computation; cost-aware data lifecycle |
| Velocity | Speed at which data is generated, captured, and processed | Streaming platforms; real-time vs batch architecture decisions |
| Variety | Mix of structured, semi-structured, and unstructured forms | Polyglot persistence; pipelines that handle multiple modalities |
| Veracity | Trustworthiness, accuracy, and consistency of data | Data-quality pipelines; provenance, lineage, and validation |
| Value | Business outcome that the data delivers when analysed | Prioritise data investments by expected decision value |
The fifth V — Value — is the one that distinguishes a strategic data programme from a costly hoarding exercise. Volume, velocity, variety, and veracity describe properties of the data; value is the property of the analytical use that justifies the investment.
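In practice, veracity is enforced through validation rules applied as data lands. A minimal sketch of such checks on a hypothetical sensor reading follows; the field names and plausibility thresholds are invented.

```python
import math

def validate(reading: dict) -> list[str]:
    """Return a list of veracity problems for one hypothetical sensor reading."""
    problems = []
    temp = reading.get("temp_c")
    if temp is None:
        problems.append("missing temp_c")                  # completeness
    elif math.isnan(temp):
        problems.append("temp_c is NaN")                   # consistency
    elif not (-40.0 <= temp <= 120.0):
        problems.append(f"temp_c out of range: {temp}")    # plausibility
    if "device" not in reading:
        problems.append("missing device id")               # provenance
    return problems

for r in [{"device": "d1", "temp_c": 42.0},
          {"device": "d2", "temp_c": 900.0},
          {"temp_c": None}]:
    print(r, validate(r) or "ok")
```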
5.6 Implications for Analytics
The classification of data has direct consequences for the rest of the analytics work:
Choice of technique: Tabular structured data is the natural home of regression, classification, and time-series modelling. Unstructured text is the home of natural-language processing. Images and video are the home of computer vision. Mixing types — for example, augmenting a churn model with sentiment scores from call-centre transcripts — often produces better results than either source alone.
Choice of platform: Pure structured data lives well in a data warehouse. Mixed structured and unstructured data lives in a data lake or modern lakehouse. Streaming data needs purpose-built platforms such as Kafka, Flink, or Spark Streaming.
Cost profile: Storage and compute costs are roughly proportional to volume and velocity, but the cost of the human work of cleaning, labelling, and modelling rises faster with variety than with any other dimension.
Governance and ethics: External data brings licence terms; personal data brings privacy law; unstructured data brings copyright and consent considerations. Each type carries its own legal and ethical envelope.
5.7 Common Pitfalls
Confusing volume with value: Hoarding data because it is cheap to store, without ever defining the decisions it is meant to support.
Forcing unstructured data into a relational schema: Discarding most of the signal in the process. The right answer is usually a different storage layer, not a contorted table.
Treating internal data as sufficient: Many strategic questions cannot be answered without external context. Internal data alone produces inward-looking analytics.
Treating external data as gospel: Third-party data has its own quality issues. Always understand how it was collected before relying on it.
Ignoring the temporal dimension: Modelling time-series data with cross-sectional methods, or vice versa, introduces predictable but easily overlooked errors.
Underestimating unstructured pre-processing: Text, image, and audio pipelines require substantial engineering before a model can be trained on them. Plan for it.
Neglecting veracity: Velocity and volume are emphasised in vendor marketing, but veracity determines whether the analytics are trustworthy enough to act on.
Privacy by accident: Internal log files often contain personal information that the firm is legally obliged to protect. Treat sensitive data as sensitive from the moment it lands.
5.8 Illustrative Cases
The following short cases illustrate how data type choices play out in practice. They are based on the kinds of work commonly seen in industry; the framing is the author’s.
Retail Demand Forecasting — Internal Plus External
A retailer wants to forecast daily demand at the store-and-product level. The team starts with internal point-of-sale and inventory data — structured, well-governed, but inward-looking. Forecast accuracy improves when the team adds external weather data (a weather event materially shifts ice-cream and umbrella demand) and external local-events data (a stadium event near a store shifts beverage and snack demand). The dataset becomes a join of internal structured data and external semi-structured feeds, and the resulting forecast outperforms either source alone.
Customer Experience — Structured Scores Plus Unstructured Verbatims
A service business runs a Net Promoter Score programme. The numerical NPS scores are structured, easy to dashboard, and correlate weakly with churn. The free-text verbatim comments accompanying the scores are unstructured and, until recently, were read by analysts in small samples. The team builds a natural-language pipeline that extracts themes and sentiment from the verbatims and joins them to the structured score and to the customer master. The blended dataset produces a churn model meaningfully better than the score-only model, and an actionable map of the specific service failures driving negative sentiment.
Industrial IoT — Streaming Semi-Structured Data
A manufacturer instruments a critical production line with hundreds of sensors. The data is semi-structured (each sensor emits a tagged time-stamped reading), high-velocity (thousands of readings per second), and variable in quality (sensors fail and drift). Storing every reading in a relational database is impractical; the team builds a streaming pipeline that lands raw data in object storage, computes aggregates and anomaly scores in motion, and surfaces alerts to the maintenance dispatch board. The architectural choice is driven directly by the data type: streaming, semi-structured, high-volume, with veracity concerns front and centre.
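The "anomaly scores in motion" step can be illustrated with a toy sliding-window z-score in pure Python; a production job would run on a streaming platform, and the window size, threshold, and feed values below are invented.

```python
from collections import deque
import statistics

def stream_anomaly_scores(readings, window=20, threshold=3.0):
    """Yield (value, z_score, is_anomaly) for each reading in a stream."""
    recent = deque(maxlen=window)   # sliding window of recent readings
    for value in readings:
        if len(recent) >= 2:
            mean = statistics.fmean(recent)
            sd = statistics.pstdev(recent) or 1e-9   # guard a perfectly flat window
            z = (value - mean) / sd
            yield value, z, abs(z) > threshold
        else:
            yield value, 0.0, False   # not enough history yet
        recent.append(value)

# Toy feed: a steady signal with one spike (values invented).
feed = [50.1, 50.3, 49.8, 50.0, 50.2, 49.9, 50.1, 95.0, 50.0, 50.2]
for value, z, flag in stream_anomaly_scores(feed, window=5):
    print(f"{value:6.1f}  z={z:8.2f}  {'ALERT' if flag else ''}")
```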
Credit Scoring — Combining Primary and Secondary, Internal and External
An Indian retail bank builds a credit-scoring model for personal loans. The bank’s own application data and repayment history are internal, primary, and structured. Credit-bureau scores from CIBIL or Experian are external and structured. Account-aggregator-derived bank-statement summaries are external, semi-structured, and require parsing. Demographic and macroeconomic indicators from official sources are external, secondary, and structured. The blended dataset uses every classification this chapter has introduced, and the model’s performance reflects the disciplined integration of all of them.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Types Matter | The choice of analytical technique, platform, cost, and legal envelope all flow from the type of data on the table |
| Structure-Based Classification | |
| Structured Data | Tabular, schema-defined data that fits naturally into relational databases and is queried with SQL |
| Semi-Structured Data | Self-describing data with flexible schema such as JSON, XML, log files, and IoT telemetry |
| Unstructured Data | Data with no predefined model such as text, images, audio, and video; estimated at around eighty per cent of enterprise data |
| Schema | The set of fields and types that defines the structure of a dataset; fixed for structured, flexible for semi-structured, absent for unstructured |
| Object Storage | Storage layer such as Amazon S3 or Azure Blob designed for unstructured and semi-structured data at scale |
| Source-Based Classification | |
| Internal Data | Data generated within the organisation in its operational and administrative systems |
| External Data | Data acquired from outside the organisation, including government, market, and third-party sources |
| Internal-External Combination | The most powerful analyses usually combine internal granularity with external context |
| Measurement Nature | |
| Quantitative Data | Numerical data that supports arithmetic, including counts, measurements, ratios, and percentages |
| Qualitative Data | Categorical or descriptive data such as labels, ranks, and free text |
| Discrete and Continuous | Quantitative subdivisions: discrete counts versus continuous measurements on a scale |
| Nominal and Ordinal | Qualitative subdivisions: nominal categories with no order versus ordinal categories with a natural order |
| Origin | |
| Primary Data | Data collected first-hand for the specific purpose at hand |
| Secondary Data | Data collected by someone else for some other purpose and then reused |
| Temporal Pattern | |
| Cross-Sectional Data | Snapshot at a single point in time across many units |
| Time-Series Data | Repeated observations of the same units over time |
| Panel Data | Many units observed over many time periods, combining cross-sectional and time-series |
| Streaming Data | Data generated continuously and processed in motion |
| Big Data and the Five Vs | |
| Big Data | Data whose volume, velocity, or variety exceeds conventional processing capacity |
| Volume | Sheer scale of data, requiring distributed storage and computation |
| Velocity | Speed at which data is generated, captured, and processed |
| Variety | Mix of structured, semi-structured, and unstructured forms within the same problem |
| Veracity | Trustworthiness, accuracy, and consistency of data |
| Value | Business outcome that the data delivers when analysed; the property that justifies investment |
| Implications for Analytics | |
| Choice of Technique | Tabular methods for structured data, NLP for text, computer vision for images, mixed pipelines for blends |
| Choice of Platform | Warehouse for structured, lake or lakehouse for mixed, purpose-built platforms for streaming |
| Cost Profile | Storage and compute scale with volume and velocity; human effort scales fastest with variety |
| Governance and Ethics | External data brings licence terms; personal data brings privacy law; unstructured data brings copyright |
| Common Pitfalls | |
| Confusing Volume with Value | Pitfall of hoarding data because storage is cheap, without defining the decisions it supports |
| Forcing Unstructured into Tables | Pitfall of contorting unstructured signal into relational tables and discarding most of it |
| Internal-Only Blind Spot | Pitfall of relying only on internal data and producing inward-looking analytics |
| External Data as Gospel | Pitfall of trusting third-party data without auditing how it was collected |
| Ignoring the Temporal Dimension | Pitfall of modelling time-series data with cross-sectional methods or vice versa |
| Underestimating Pre-Processing | Pitfall of underestimating the engineering required to make text, image, and audio analysable |
| Neglecting Veracity | Pitfall of echoing vendor emphasis on volume and velocity while ignoring whether the data is trustworthy enough to act on |
| Privacy by Accident | Pitfall of internal log files quietly capturing personal information without the protections such data legally requires |