Think about the last time your team looked at a dashboard, pulled a report or made a data-driven decision. Numbers appeared. Insights surfaced. A choice got made.
None of that happened by accident. Behind every piece of useful data is a pipeline that carried it from where it was created to where it was needed. The pipeline cleaned it, shaped it and delivered it. You probably never noticed it. That is the point.
Algoscale describes it well in their 2026 enterprise pipeline guide: data teams across industries currently spend 60% to 80% of their time maintaining fragile pipeline systems, firefighting instead of building, patching instead of scaling. When pipelines are designed well, they are invisible. When they are designed poorly, they become the most expensive problem in the building.
A data pipeline is a series of processing steps that move data from one or more sources to one or more destinations, transforming it along the way. As Oneuptime's January 2026 engineering post puts it, building a pipeline that works in development and one that survives production traffic are two very different challenges. This article covers both.
If you are new to data engineering overall, our pillar article Modern Data Engineering: The Complete Guide covers the full landscape of tools, concepts, and platforms before you dive into pipeline specifics.
Why Pipelines Are Not Just "Moving Files from A to B"
Most data pipelines are not a straight line. They are networks. A single pipeline might pull from five source systems, join data from three of them, apply quality rules, route cleaned records to two different destinations, and trigger downstream jobs when it finishes.
Monte Carlo Data's pipeline architecture breakdown makes this clear: most data pipelines are not a linear movement of data from source A to target B, but rather consist of a series of highly complex and interdependent processes. The more interdependencies, the more places where a schema change, a late delivery, or a missed validation can cascade into a much bigger failure.
Understanding the structure of a pipeline before you build one is what separates engineers who design systems from engineers who just connect tools.
The 4 Core Stages of Every Data Pipeline
Every data pipeline, regardless of how complex it gets, is built on four stages. Each stage has a specific job. Getting any one of them wrong affects everything that comes after.
Stage 1: Ingestion: Where Data Enters the System
Ingestion is the first step. The pipeline pulls data out of its source. That source could be a REST API, a relational database, a file server, a streaming event queue or a SaaS application.
As CUFinder's data pipeline guide describes it, there are two fundamental ways to ingest data. Batch ingestion collects data at scheduled intervals, like pulling all transactions from the previous day every morning at 2am. Streaming ingestion captures data in real time as events occur, like processing every user click the moment it happens.
The choice between them at the ingestion stage has consequences that ripple through every other stage of the pipeline. More on this in our dedicated guide on Batch vs Streaming Pipelines.
On Databricks, ingestion happens through Lakeflow Connect, which provides fully managed connectors for sources including Salesforce, Workday, SQL Server, Apache Kafka, Amazon Kinesis, Google Pub/Sub, and cloud storage systems like S3 and Azure Data Lake Storage.
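To make the batch-versus-streaming distinction concrete, here is a minimal Python sketch of both ingestion modes as pipeline tables. The S3 path and table names are placeholders, and `spark` is assumed to be the ambient Databricks session.

```python
import dlt

# Batch ingestion: each run reads everything currently in the landing path.
@dlt.table(name="orders_batch")
def orders_batch():
    return spark.read.json("s3://example-bucket/raw/orders/")  # hypothetical path

# Streaming ingestion: Auto Loader processes each new file exactly once.
@dlt.table(name="orders_stream")
def orders_stream():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/raw/orders/")
    )
```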
Stage 2: Transformation: Where Raw Data Gets Shaped
Transformation is where the real work happens. This is the stage where raw records become useful data.
Transformation includes cleaning (removing nulls, fixing formats), filtering (dropping records that do not meet quality criteria), deduplicating (removing repeated records), enriching (joining data from multiple sources), and aggregating (summarizing rows into totals, averages, or counts).
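A rough PySpark sketch of those five operations, assuming two hypothetical DataFrames, `raw_orders` and `customers`, with the column names shown:

```python
from pyspark.sql import functions as F

cleaned = (
    raw_orders
    .dropna(subset=["order_id", "amount"])       # cleaning: drop incomplete rows
    .filter(F.col("amount") > 0)                 # filtering: enforce a quality rule
    .dropDuplicates(["order_id"])                # deduplicating: one row per order
    .join(customers, "customer_id", "left")      # enriching: attach customer attributes
)

daily_revenue = (
    cleaned
    .groupBy(F.to_date("ts").alias("order_date"))  # aggregating: rows become totals
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)
```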
The two main patterns for transformation are ETL and ELT. In ETL (Extract, Transform, Load), you transform data before storing it. In ELT (Extract, Load, Transform), you store raw data first and transform it inside the destination platform. The detailed breakdown of when to use each pattern is covered in ETL vs ELT in Modern Data Engineering.
Modern cloud platforms like Databricks run transformation inside the lakehouse itself, using distributed compute powered by Apache Spark. This makes ELT the natural default for most modern pipelines.
Stage 3: Storage: Where Processed Data Lives
After transformation, data lands somewhere. The storage layer determines how queryable, reliable, and cost-effective your data will be.
The three main storage options are data warehouses, data lakes, and the lakehouse model that combines both. Each has tradeoffs in cost, flexibility, and query performance. The full comparison is covered in Data Warehouse vs Data Lake vs Lakehouse.
On Databricks, the storage layer is built on Delta Lake, which adds ACID transactions, schema enforcement, and time travel on top of cloud object storage. This means your data is protected from partial writes, schema drift, and concurrent update conflicts at the storage layer itself.
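Two of those guarantees are easy to see in a few lines. This sketch assumes a hypothetical Unity Catalog table `main.sales.orders` and an existing DataFrame `new_rows`:

```python
# Schema enforcement: appending rows whose schema does not match the table
# raises an exception instead of silently writing bad data.
new_rows.write.format("delta").mode("append").saveAsTable("main.sales.orders")

# Time travel: query the table exactly as it existed at an earlier version,
# for example to inspect state before a suspect load.
v5 = spark.sql("SELECT * FROM main.sales.orders VERSION AS OF 5")
```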
Stage 4: Serving: Where Data Reaches the People Who Need It
The serving layer is what most stakeholders actually see. This is where clean, transformed data is made available to the people and tools that need it, including BI dashboards, SQL analysts, machine learning models, and downstream applications.
A pipeline that delivers data nobody can access has failed at the last step. The serving layer requires performance optimization (indexed tables, materialized views), access control (who can query what), and freshness guarantees (how current the data needs to be).
This is where Databricks SQL operates, giving analysts fast, governed query access to the Gold layer tables produced by the pipeline.
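In practice, much of the serving layer reduces to a governed table plus a grant. A sketch using hypothetical Unity Catalog names:

```python
# Access control: let the analyst group query the Gold table, nothing else.
spark.sql("GRANT SELECT ON TABLE main.gold.daily_revenue TO `analysts`")

# Analysts (or BI tools) then hit the same governed table via Databricks SQL.
recent = spark.sql("""
    SELECT order_date, revenue
    FROM main.gold.daily_revenue
    WHERE order_date >= current_date() - INTERVAL 30 DAYS
""")
```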
The Most Common Data Pipeline Architecture Patterns in 2026
Understanding the four stages tells you what a pipeline does. Understanding architectural patterns tells you how it is designed to do it reliably at scale.
Batch Architecture: Scheduled, Reliable and Cost-Efficient
Batch pipelines process large amounts of data at scheduled intervals: once an hour, once a day, or once a week. They collect everything that accumulated since the last run and process it together.
Batch is easier to manage, easier to debug, and less expensive to run than streaming. It suits reporting and analytics workloads where near-real-time data is not required.
As Estuary's March 2026 pipeline architecture guide notes, for infrequent or heavy workloads, batch may be more cost-effective and reliable than streaming. Not everything needs to be real-time. The question is whether your use case actually requires low latency, or whether you are adding complexity without a business reason for it.
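A minimal sketch of what a daily batch run looks like in PySpark, assuming hypothetical table names and a `run_date` that a scheduler would normally inject:

```python
from pyspark.sql import functions as F

run_date = "2026-03-01"  # injected by the scheduler in a real job

daily = (
    spark.table("main.bronze.events")
    .filter(F.col("event_date") == run_date)
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("events"))
)

# replaceWhere overwrites only this day's slice, so re-running the job
# for the same date cannot duplicate data.
(daily.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date = '{run_date}'")
    .saveAsTable("main.silver.daily_event_counts"))
```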
Streaming Architecture: Real-Time and Event-Driven
Streaming pipelines process events as they arrive. There is no waiting for the next scheduled run. The moment a record enters the system, the pipeline handles it.
Streaming is the right pattern for fraud detection (you need to catch suspicious transactions within seconds), real-time personalization (serving updated recommendations as a user browses), and operational monitoring (detecting system failures before they escalate).
Algoscale explains it clearly: streaming architectures process events as they arrive, suited for fraud detection, real-time monitoring, or customer-facing personalization. Apache Kafka is the most common event streaming backbone, with Databricks supporting ingestion from Kafka, Amazon Kinesis, Google Pub/Sub, Azure EventHub, and Apache Pulsar natively through Lakeflow.
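Here is a minimal Structured Streaming sketch of that pattern; the broker address, topic name, and checkpoint path are all placeholders:

```python
from pyspark.sql import functions as F

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "payments")
    .load()
    .select(F.col("value").cast("string").alias("payload"), "timestamp")
)

# Each micro-batch is written the moment it arrives; the checkpoint lets
# the stream resume where it left off after a failure.
(events.writeStream
    .option("checkpointLocation", "/Volumes/main/chk/payments")
    .toTable("main.bronze.payments"))
```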
Lambda Architecture: Running Both Paths Together
Most production systems at scale actually need both batch and streaming. The Lambda Architecture pattern handles this by splitting the pipeline into two parallel tracks that serve the same query layer.
The Speed Layer processes incoming events in real time for low-latency reads. The Batch Layer processes the same data more thoroughly in scheduled runs for accuracy and completeness. The Serving Layer merges results from both tracks so that queries can access either recent data or historical data without knowing which path produced it.
As DevOps.dev's April 2026 Lambda Architecture walkthrough explains, consumers do not need to know which pipeline produced a given result. They simply query the Serving Layer and get back either real-time data from the Speed Layer or historical data from the Batch Layer, depending on what they need.
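A toy version of that merge, assuming two hypothetical tables: `batch_view` (complete but hours old) and `speed_view` (recent events only):

```python
batch_view = spark.table("main.serving.batch_view")
speed_view = spark.table("main.serving.speed_view")

# Batch results are authoritative; the speed layer fills in only what the
# last batch run has not covered yet.
last_batch_ts = batch_view.agg({"event_ts": "max"}).first()[0]
merged = batch_view.unionByName(
    speed_view.filter(speed_view.event_ts > last_batch_ts)
)
merged.createOrReplaceTempView("serving_view")  # consumers query this one view
```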
| Pattern | Processing Mode | Best For | Tradeoff |
|---|---|---|---|
| Batch | Scheduled intervals | Analytics, reporting, heavy loads | Higher latency, simpler to manage |
| Streaming | Continuous real-time | Fraud detection, personalization, monitoring | Lower latency, higher complexity |
| Lambda | Both in parallel | Production systems needing both | Most complete, most complex to maintain |
| Kappa | Streaming only, replayable | Simplified streaming without batch layer | Requires reprocessing from log for corrections |
What Makes a Pipeline Reliable vs Fragile?
A pipeline that runs once in a demo is not the same as a pipeline that runs 100,000 times in production without losing or corrupting a single record. The difference comes down to a handful of design decisions that engineers often skip when they are moving fast.
Schema Changes Are the Number One Silent Killer
A schema change happens when a source system changes the structure of data it sends. A column gets renamed. A data type changes. A field that was always populated starts arriving as null.
If your pipeline does not handle schema evolution, that change either breaks the pipeline outright or, worse, silently corrupts downstream tables. A large share of production pipeline failures start exactly this way.
Cygnet's 2026 pipeline architecture guide notes this directly: data sources evolve, business requirements change, and scale assumptions break. Pipelines are not deployed once and forgotten. The defense is testing at every level: integration tests validate end-to-end behavior, unit tests cover transformation logic, and load tests expose bottlenecks before they surface in production.
On Databricks, Lakeflow Spark Declarative Pipelines handles schema evolution automatically for CDC pipelines when you use the APPLY CHANGES INTO or AUTO CDC API patterns. The pipeline tracks which rows changed and applies schema updates without requiring manual intervention.
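In the Python API this comes down to a few declarative lines. A sketch assuming a CDC feed already landed in a hypothetical `bronze_customers_cdc` table, using the `apply_changes` form of the API:

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("silver_customers")

dlt.apply_changes(
    target="silver_customers",
    source="bronze_customers_cdc",
    keys=["customer_id"],       # primary key used to match changed rows
    sequence_by=col("op_ts"),   # ordering column, so late events apply correctly
)
```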
Observability: You Cannot Fix What You Cannot See
Observability means you can see inside your pipeline at any time. You know how much data passed through each stage, whether quality checks passed or failed, how long each step took, and whether anything looks unusual compared to historical norms.
According to Algoscale, automated testing, version control, CI/CD, and pipeline observability used to be what advanced teams did. In 2026, they are what every team is expected to do. The organizations still deploying pipelines manually are already behind.
Databricks provides built-in observability for Lakeflow pipelines through the Lakeflow event log and Lakehouse Monitoring, which gives you real-time metrics on pipeline health, custom alerts when issues occur, and detailed failure traces so you can pinpoint root cause quickly.
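That starts with the event log itself. Below is a sketch of pulling data quality results for one pipeline table; the table name is a placeholder, and both the `event_log()` table-valued function and the exact JSON path under `details` are assumptions to verify against your workspace:

```python
quality = spark.sql("""
    SELECT timestamp, details:flow_progress.data_quality.expectations
    FROM event_log(TABLE(main.silver.orders))
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp DESC
""")
quality.show(truncate=False)
```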
Idempotency: Pipelines Must Survive Failures
An idempotent pipeline is one you can run multiple times without producing different results. If a pipeline fails halfway through and you re-run it, it processes only what it has not already processed. It does not duplicate records or corrupt state.
Oneuptime's engineering guide is blunt about this: you cannot fix what you cannot see. And you cannot safely re-run what was not designed to be re-run. Idempotency and observability are not nice-to-have features. They are requirements for any pipeline that runs in production.
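The standard way to get idempotency on Delta Lake is a MERGE keyed on a natural identifier. A sketch with hypothetical names, assuming `new_batch` may contain records the pipeline already processed:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "main.silver.orders")

(target.alias("t")
    .merge(new_batch.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()      # replayed record: update in place, no duplicate
    .whenNotMatchedInsertAll()   # genuinely new record: insert
    .execute())
```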
How Modern Data Pipelines Work on Databricks with Lakeflow
Understanding the concepts is one thing. Seeing how they connect on a real platform makes it concrete.
Databricks organizes all pipeline functionality under Lakeflow. As the official Lakeflow product page describes, Lakeflow provides a single solution to collect and clean all your data, with built-in unified governance and lineage, declarative transformations, AI-assisted code authoring, and auto-optimized resource usage for both batch and real-time use cases.
The Three Dataset Types in Lakeflow Spark Declarative Pipelines
When building pipelines in Lakeflow, you work with three types of dataset objects. Choosing the right one for each stage of your pipeline avoids wasted compute and keeps your code easy to reason about.
As the Databricks AWS documentation for Lakeflow best practices explains:
| Dataset Type | Best For | How It Works |
|---|---|---|
| Streaming Table | Ingestion and low-latency streaming | Each row is read and processed only once. Ideal for append-only, high-volume, event-driven workloads. |
| Materialized View | Complex transformations and analytics | Results are pre-computed and refreshed incrementally. Fast for downstream queries. |
| Temporary View | Intermediate logic steps | Pipeline-scoped, no data materialized to storage. Used to organize complex transformation logic. |
The practical result is that engineers describe what data should look like at each stage, and Lakeflow manages orchestration, incremental processing, dependency resolution, and operational behavior automatically.
Jamesm.blog's 2026 Databricks engineering guide captures the shift well: the modern Databricks approach is increasingly declarative. Use streaming tables when you want low-latency append or upsert-style ingestion. Use materialized views when correctness on recomputation matters more than row-by-row streaming semantics. This is a meaningful distinction in 2026 because Databricks is giving teams higher-level objects instead of forcing every transformation into a hand-managed Spark job.
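The subtle part is that the same `@dlt.table` decorator produces different objects depending on what the function returns. A compact sketch with placeholder names:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table  # streaming source -> this becomes a streaming table
def raw_clicks():
    return spark.readStream.table("main.bronze.clicks")

@dlt.view  # temporary view: intermediate logic, nothing persisted
def valid_clicks():
    return dlt.read("raw_clicks").filter(F.col("user_id").isNotNull())

@dlt.table  # batch read -> this becomes a materialized view
def clicks_per_user():
    return dlt.read("valid_clicks").groupBy("user_id").count()
```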
The most recent 2026 Lakeflow release notes confirm that pipelines now support queued execution mode (multiple update requests queue automatically instead of failing with conflicts), centralized data quality expectations stored in Unity Catalog, and automatic MANAGE permission propagation to pipeline outputs. These are not minor updates. They reflect how much operational complexity Databricks is absorbing at the platform level so engineers do not have to manage it manually.
How a Lakeflow Pipeline Flows Through Medallion Architecture
In practice, a Lakeflow pipeline on Databricks follows the Medallion Architecture: Bronze, Silver, and Gold layers. A minimal code sketch follows the list below.
- Bronze layer: Streaming tables ingest raw data from Lakeflow Connect connectors (Salesforce, Kafka, S3, and others). Data lands exactly as it arrived, unchanged.
- Silver layer: Materialized views clean, deduplicate, and validate the Bronze data. CDC flows using AUTO CDC API apply changes from source systems incrementally.
- Gold layer: Materialized views aggregate Silver data into business-ready tables optimized for specific analytics, reporting, or ML use cases.
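Here is that sketch: one pipeline file covering all three layers, with a quality expectation guarding the Silver table. Paths, table names, and columns are illustrative, not prescriptive:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table  # Bronze: raw events, exactly as they arrived
def bronze_orders():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/orders/"))

@dlt.table  # Silver: validated and deduplicated
@dlt.expect_or_drop("valid_order", "order_id IS NOT NULL AND amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders").dropDuplicates(["order_id"])

@dlt.table  # Gold: business-ready aggregate for dashboards
def gold_daily_revenue():
    return (dlt.read("silver_orders")
            .groupBy(F.to_date("event_ts").alias("order_date"))
            .agg(F.sum("amount").alias("revenue")))
```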
Lakeflow Jobs then orchestrate the full sequence, scheduling runs, handling failures, and surfacing alerts when something goes wrong. That coordination layer is covered in our upcoming guide on Lakeflow Pipelines for Data Engineering.
For teams ready to move beyond understanding pipelines conceptually and into building production-ready ones, the implementation path is covered in How to Build Production-Grade Data Pipelines on Databricks. And for the broader Databricks platform picture that puts pipelines in context, Databricks for Data Engineering: Architecture, Components, and Best Practices is the right next step.
How to Use the Rest of This Series
This article covered the anatomy of a data pipeline, the four core stages, the main architectural patterns, and how Lakeflow implements them on Databricks.
Each of the concepts introduced here has its own dedicated deep-dive article in this series:
- How data moves: ETL vs ELT in Modern Data Engineering covers the transformation pattern decision in full detail.
- When data moves: Batch vs Streaming Pipelines covers the timing and architecture tradeoffs for each approach.
- Where data lands: Data Warehouse vs Data Lake vs Lakehouse covers the storage architecture decision.
- The full platform: Databricks for Data Engineering: Architecture, Components, and Best Practices covers how all the Lakeflow pieces fit together inside the Databricks stack.
- Building for production: How to Build Production-Grade Data Pipelines on Databricks covers testing, observability, failure handling, and deployment patterns.
If you are reading this series from the beginning, start with Modern Data Engineering: The Complete Guide for the full landscape before diving into any specific topic.
