What is Lakeflow Spark Declarative Pipelines in Databricks?

Lakeflow Spark Declarative Pipelines is the Databricks framework for building batch and streaming pipelines in SQL or Python by declaring what output tables should contain rather than coding how to produce them. It is the renamed and evolved version of Delta Live Tables, with deeper Unity Catalog integration for governance, lineage, and centralized expectation management.

What is the difference between a streaming table and a materialized view in Lakeflow?

A streaming table processes each incoming row exactly once and is suited for low-latency ingestion and append-heavy sources. A materialized view pre-computes query results and refreshes them incrementally from upstream changes. Because it can be fully recomputed when upstream data or logic changes, it is better suited for transformations where correctness on recompute matters more than latency.

What replaced Delta Live Tables in Databricks?

Delta Live Tables was rebranded as Lakeflow Spark Declarative Pipelines in 2025. The underlying engine, syntax, and behavior are unchanged. The key additions are deeper Unity Catalog integration, centralized expectation storage in Unity Catalog tables, automatic permission propagation to pipeline outputs, and expanded AUTO CDC capabilities for SCD Type 1 and Type 2 patterns.

When should you use Lakeflow Pipelines vs Lakeflow Jobs?

Use Lakeflow Pipelines for transformation logic, dataset dependency management, and data quality enforcement inside a pipeline. Use Lakeflow Jobs to schedule pipeline runs, coordinate across multiple pipelines, and build workflows that include notebooks, dbt models, ML jobs, and SQL queries. In production, Lakeflow Jobs typically triggers Lakeflow Pipelines as one task within a larger cross-system workflow.

What is AUTO CDC in Lakeflow Pipelines?

AUTO CDC is the current standard for building CDC pipelines inside Lakeflow. It handles SCD Type 1 (upsert to latest value) and SCD Type 2 (maintain full change history) patterns on streaming tables with minimal configuration, replacing the more verbose APPLY CHANGES INTO syntax. It handles deduplication and natural key matching automatically.

How do you monitor a Lakeflow Pipeline in production?

Query the pipeline event log using the event_log() table-valued function. The event log is a Delta table that records execution progress, expectation pass/fail counts, data lineage, and error details for every pipeline run. Build data quality dashboards against it to track expectation trends over time, and configure event hooks to fire alerts to Slack or PagerDuty when failure thresholds are breached.

Lakeflow Pipelines for Data Engineering: Complete Guide 2026

TL;DR

Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) is the Databricks framework for building batch and streaming pipelines declaratively in SQL or Python. You define what each output table should look like. The platform handles dependency resolution, incremental processing, retries, and schema evolution automatically. The three dataset types engineers work with are streaming tables (exactly-once ingestion), materialized views (incrementally refreshed transformations), and temporary views (intermediate logic, no storage). The biggest mistake teams make: still writing ad hoc Spark jobs for transformation when Lakeflow reduces hundreds of lines of manual orchestration code to a handful of declarative definitions.

Most teams running Databricks still have at least one corner of their stack that looks like this: a scheduled notebook, a manual spark.read, a df.write.mode("overwrite"), and a Lakeflow Job holding it together with duct tape.

It works. Until it does not. Then someone spends two hours figuring out why the Silver table is empty, only to discover the Bronze read job ran before the source data landed.

Lakeflow Spark Declarative Pipelines exists to remove this entire category of problem. Not by adding more orchestration tooling, but by changing the model entirely. You declare what the data should look like. The platform figures out the order, the retries, and the incremental state.

This article is part of the Modern Data Engineering: The Complete Guide series. The architecture context for where Lakeflow fits in the full Databricks stack is in Databricks for Data Engineering: Architecture, Components, and Best Practices. The storage layer that every Lakeflow pipeline writes to is covered in Delta Lake Explained for Data Engineers.

What Lakeflow Spark Declarative Pipelines Is and What It Replaced

Lakeflow Spark Declarative Pipelines, referred to as SDP or Lakeflow Pipelines, is the current name for what was previously called Delta Live Tables (DLT). The branding changed in 2025. The engine is the same.

As the Databricks official SDP documentation describes it: Lakeflow SDP is a framework for creating batch and streaming data pipelines in SQL and Python. It extends and is interoperable with Apache Spark Declarative Pipelines, running on the performance-optimized Databricks Runtime.

The naming matters for one reason: teams searching for DLT documentation, tutorials, or Stack Overflow answers will find content that still uses the old terminology. All of it applies. The concepts, the syntax, and the behavior are the same.

What changed beyond the name: Unity Catalog integration deepened significantly. Expectations are now stored in Unity Catalog tables rather than pipeline-local storage. Pipeline configuration can now be stored and read from Unity Catalog table properties. MANAGE permissions propagate automatically to pipeline outputs. The platform moved from a pipeline-scoped tool to a governance-aware platform feature.

Declarative vs Imperative: Why the Difference Matters

Most engineers learn data engineering imperatively. Write a Spark job. Tell it exactly what to read, how to join, what to write, and in which order. The engineer manages everything: the sequence, the incremental state, the retry logic, the dependency chain.

Declarative pipelines flip this. You define the output. The platform works out how to produce it.

As the Databricks SDP concepts documentation states directly: SDP can reduce hundreds or even thousands of lines of manual Spark and Structured Streaming code to only a few lines. Automatic orchestration ensures the correct execution order and maximum parallelism. Retry logic starts at the Spark task level, escalates to the flow level, then to the full pipeline if needed.

In practice this means:

A Bronze-to-Silver pipeline does not need code to check whether Bronze ran first. SDP resolves the dependency automatically.
A transformation that should process only new rows does not need manual checkpoint tracking. SDP handles incremental state.
A failed flow does not need a human to restart the right subset of tasks. SDP retries at the most granular level that makes sense.

The engineer's job becomes defining the transformation logic and the quality rules. Not the operational plumbing.
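To make the dependency-resolution idea concrete, here is a minimal sketch in plain Python. It is not the Lakeflow API. It only assumes that each dataset declares which upstream datasets it reads, and it uses the standard-library graphlib module to derive an execution order from those declarations. The dataset names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets; each maps to the upstream datasets it reads from.
datasets = {
    "bronze_events": set(),
    "bronze_users": set(),
    "silver_events": {"bronze_events", "bronze_users"},
    "gold_daily_counts": {"silver_events"},
}


def execution_batches(graph):
    """Yield groups of datasets whose upstreams have all completed.

    Datasets in the same batch do not depend on each other, so a
    scheduler would be free to run them in parallel.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


batches = list(execution_batches(datasets))
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

Nothing in the declarations says "run Bronze first". The order falls out of the graph, which is the property the declarative model relies on.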

The Three Dataset Types: When to Use Each

Every Lakeflow pipeline is built from three dataset types. Choosing the right one for each step is the single most consequential pipeline design decision.

Dataset Type	Processing Model	Storage	Best For
Streaming Table	Each row processed exactly once	Delta table in Unity Catalog	Bronze ingestion, low-latency Silver updates
Materialized View	Incrementally refreshed, full correctness	Delta table in Unity Catalog	Silver and Gold transformations, aggregations
Temporary View	Computed at query time, no storage	None	Intermediate logic, reusable SQL within a pipeline

Streaming Tables: Exactly-Once Ingestion

A streaming table processes each incoming row exactly once. It stays running continuously or triggers on new data arrival, depending on your pipeline mode.

Use streaming tables for Bronze layer ingestion from Auto Loader, Kafka, Kinesis, or any append-heavy source. Use them for Silver updates where low-latency propagation from Bronze matters. As jamesm.blog's April 2026 Databricks engineering guide states: use streaming tables when you want low-latency append or upsert-style ingestion.

The exactly-once guarantee matters more than it sounds. Without it, a pipeline restart processes events that already landed, producing duplicate rows in your Silver table. Streaming tables handle this via checkpointing. The engine tracks the last processed offset. On restart it picks up from exactly that point.
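The checkpointing behavior can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not how the engine is implemented: the source is an in-memory append-only list, the checkpoint is a single integer, and the write and the checkpoint update are treated as one atomic step. The class and variable names are hypothetical.

```python
class CheckpointedReader:
    """Toy reader for an append-only source that remembers its position."""

    def __init__(self):
        self.next_offset = 0  # stands in for a durable checkpoint
        self.sink = []        # stands in for the target table

    def run(self, source_log):
        """Process only records at or after the checkpointed offset."""
        new_records = source_log[self.next_offset:]
        self.sink.extend(new_records)
        # Assumption: in a real system this update is committed atomically
        # with the write above; the toy model does not simulate failures.
        self.next_offset = len(source_log)
        return len(new_records)


source_log = ["e1", "e2", "e3"]
reader = CheckpointedReader()

first_run = reader.run(source_log)    # reads the three existing events
restart_run = reader.run(source_log)  # simulated restart: nothing new to read
source_log.append("e4")
third_run = reader.run(source_log)    # reads only the newly arrived event

print(first_run, restart_run, third_run, reader.sink)
```

The restart run adds nothing to the sink, which is the duplicate-free behavior described above.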

Materialized Views: Incremental Correctness

A materialized view pre-computes a query result and stores it as a Delta table. On each pipeline run, the engine processes only new or changed upstream data and updates the materialized view accordingly.

Use materialized views for Silver and Gold transformations: complex joins, window functions, aggregations, and deduplication logic. As jamesm.blog recommends: use materialized views when correctness on recomputation matters more than latency.

The distinction teams miss: a materialized view can be fully recomputed if something upstream changes significantly. A streaming table cannot reprocess history by default. If you discover a Silver transformation bug six months in, a materialized view can be corrected and refreshed. A streaming table may require replaying from Bronze.
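A small sketch may help show what "incrementally refreshed, fully recomputable" means. The plain-Python model below assumes an append-only upstream and a simple sum-per-key aggregate. The real engine also handles updates and deletes and decides for itself when an incremental refresh is possible. Function names and data are hypothetical.

```python
from collections import Counter


def full_recompute(rows):
    """Aggregate every upstream row from scratch."""
    totals = Counter()
    for customer, amount in rows:
        totals[customer] += amount
    return totals


def incremental_refresh(stored_totals, new_rows):
    """Fold only the newly arrived rows into the stored result."""
    updated = Counter(stored_totals)
    for customer, amount in new_rows:
        updated[customer] += amount
    return updated


day1 = [("acme", 100), ("globex", 40)]
day2 = [("acme", 25), ("initech", 60)]

view = full_recompute(day1)             # initial materialization
view = incremental_refresh(view, day2)  # later run touches only day2 rows

# The incremental result must match what a full recompute would produce.
print(view == full_recompute(day1 + day2))
```

The equality check is the contract: an incremental refresh is only a cheaper route to the same answer, and a full recompute stays available whenever the logic or the upstream data has to be corrected.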

Temporary Views: Intermediate Logic Without Storage

Temporary views compute results at query time within the pipeline. They do not write to storage. Use them to break complex transformation logic into readable named steps without paying the cost of materializing intermediate results.

A temporary view that joins three Bronze tables before a deduplication step runs each time it is referenced within the pipeline. Nothing persists. Nothing is queryable outside the pipeline context.

AUTO CDC: The Right Way to Handle Change Data in 2026

Most teams building CDC pipelines in Databricks before 2025 used APPLY CHANGES INTO. It worked. It was also verbose, required careful sequencing, and produced brittle pipelines when source schemas evolved.

AUTO CDC is the current standard. It is available on streaming tables inside Lakeflow Pipelines and handles SCD Type 1 and SCD Type 2 patterns with minimal configuration.

As confirmed in the Databricks 2026 SDP release notes: SCD Type 1 materialization with AUTO CDC is now supported, providing a simpler CDC pattern that upserts the latest value without maintaining full change history. SCD Type 2 operations now automatically coalesce duplicate records with the same natural key, ensuring data consistency.

The practical impact: a CDC pipeline that previously required 40 to 60 lines of explicit APPLY CHANGES INTO logic with sequence columns and key specifications now works with a streamlined AUTO CDC declaration. The platform handles deduplication, key matching, and history tracking based on the pattern you select.

When to use SCD Type 1 with AUTO CDC: Dimension tables where only the current state matters. Customer address, product category, status flags.
When to use SCD Type 2 with AUTO CDC: Any table where analysts or ML models need to reconstruct what a record looked like at a specific historical point. Order state, subscription tier, pricing.
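The difference between the two patterns is easier to see side by side. The following is a plain-Python sketch of the semantics only, not the AUTO CDC API. It assumes each change record carries a key, a value, and a sequence number, and the field names (key, value, seq, start_seq, end_seq) are hypothetical.

```python
def apply_scd1(target, changes):
    """SCD Type 1: keep only the latest value per key (upsert)."""
    for change in sorted(changes, key=lambda c: c["seq"]):
        target[change["key"]] = change["value"]
    return target


def apply_scd2(history, changes):
    """SCD Type 2: close the open version for a key, then append a new one."""
    for change in sorted(changes, key=lambda c: c["seq"]):
        for row in history:
            if row["key"] == change["key"] and row["end_seq"] is None:
                row["end_seq"] = change["seq"]
        history.append({
            "key": change["key"],
            "value": change["value"],
            "start_seq": change["seq"],
            "end_seq": None,  # None marks the currently active version
        })
    return history


# Changes arrive out of order; the sequence number decides what is latest.
changes = [
    {"key": "cust-1", "value": "pro", "seq": 2},
    {"key": "cust-1", "value": "free", "seq": 1},
    {"key": "cust-2", "value": "free", "seq": 1},
]

current = apply_scd1({}, changes)
history = apply_scd2([], changes)

print(current)
for row in history:
    print(row)
```

Type 1 ends with one row per customer. Type 2 ends with one row per version, so the state of cust-1 at sequence 1 can still be reconstructed.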

Incremental Loads, CDC, and Change Data Feed in Delta Lake covers the full implementation of both patterns including late-arriving record handling and how AUTO CDC integrates with Change Data Feed on source tables.

Pipeline Observability: The Event Log Is Your Debug Tool

Every Lakeflow pipeline run writes structured records to an event log. This is the primary observability primitive. It is a Delta table you can query directly.

As the Databricks best practices documentation for SDP explains: every pipeline run writes records covering execution progress, data quality expectation results, data lineage, and error details. Query the event log using the event_log() table-valued function.

SELECT *
FROM event_log('<pipeline-id>')
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 100;

What this gives you that generic Databricks job logs do not:

Expectation pass/fail counts per flow per run, queryable as time-series data
Flow-level lineage showing exactly which source tables fed which output tables in that specific run
Error details at the flow level, not just the job level
Data volume metrics showing rows processed per flow

Teams that build data quality dashboards against the event log catch expectation degradation before it reaches analysts. A Silver table expectation that was passing 99.8% of rows last week and is passing 94% this week is a source system problem in progress, not a random anomaly.
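The trend check described above can be sketched in a few lines of plain Python. This assumes the per-run pass and fail counts have already been extracted from the event log into simple records. The record shape, the field names, and the 2-point drop threshold are all assumptions for illustration, not part of any Databricks API.

```python
def pass_rate(passed, failed):
    """Fraction of rows that passed; an empty run counts as fully passing."""
    total = passed + failed
    return passed / total if total else 1.0


def degraded_runs(runs, max_drop=0.02):
    """Return ids of runs whose pass rate fell by more than max_drop
    compared with the run immediately before them."""
    flagged = []
    previous = None
    for run in runs:
        rate = pass_rate(run["passed_records"], run["failed_records"])
        if previous is not None and previous - rate > max_drop:
            flagged.append(run["run_id"])
        previous = rate
    return flagged


# Hypothetical counts matching the 99.8% and 94% figures in the text.
runs = [
    {"run_id": "week-1", "passed_records": 9980, "failed_records": 20},
    {"run_id": "week-2", "passed_records": 9400, "failed_records": 600},
]

flagged = degraded_runs(runs)
print(flagged)
```

Both runs would show as successful in job status. Only the comparison between them surfaces the drop.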

For automated alerting, event hooks trigger custom webhooks when a pipeline fails or when a specific expectation failure threshold is breached. The hook fires before anyone has to check a dashboard.

Lakeflow Pipelines vs Lakeflow Jobs: Which One for What

This is the question that consistently trips up teams new to the Databricks stack.

As the Databricks pipeline task documentation frames it directly: Lakeflow Jobs provide a procedural approach to defining relationships between tasks. Lakeflow Spark Declarative Pipelines provide a declarative approach to defining relationships between datasets and transformations.

Decision	Use Lakeflow Pipelines	Use Lakeflow Jobs
Defining transformation logic	Yes	No
Scheduling pipeline runs	No (Jobs triggers the pipeline)	Yes
Managing dependencies between datasets	Yes	No
Coordinating across pipelines, notebooks, ML jobs	No	Yes
Data quality enforcement	Yes (expectations)	No native equivalent
Cross-system workflow (dbt + pipeline + SQL query)	No	Yes

The correct mental model: Lakeflow Pipelines owns the data transformation logic and the dataset dependency graph. Lakeflow Jobs owns the scheduling, cross-system coordination, and operational workflow.

A typical production setup: Lakeflow Jobs triggers a Lakeflow Pipeline run as one task in a larger workflow that also runs a dbt model and a downstream SQL query. The pipeline manages its internal dependencies. The job manages the external sequence.
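As a rough illustration of that setup, the sketch below builds a three-task job definition as a Python dictionary. The field names follow the general shape of a Databricks Jobs task list (task_key, depends_on, pipeline_task, dbt_task, sql_task), but treat them as assumptions to check against the current Jobs API reference. The job name, task keys, and placeholder IDs are hypothetical, and a real definition would need more settings, such as the dbt project source and compute.

```python
def build_job(pipeline_id, warehouse_id, query_id):
    """Return a job definition whose three tasks run strictly in sequence."""
    return {
        "name": "nightly_refresh",  # hypothetical job name
        "tasks": [
            {
                # The pipeline resolves its own internal dataset dependencies.
                "task_key": "refresh_pipeline",
                "pipeline_task": {"pipeline_id": pipeline_id},
            },
            {
                "task_key": "run_dbt_models",
                "depends_on": [{"task_key": "refresh_pipeline"}],
                "dbt_task": {"commands": ["dbt run"]},
            },
            {
                "task_key": "publish_report_query",
                "depends_on": [{"task_key": "run_dbt_models"}],
                "sql_task": {
                    "warehouse_id": warehouse_id,
                    "query": {"query_id": query_id},
                },
            },
        ],
    }


def undefined_dependencies(job):
    """List depends_on references that match no task_key in the job."""
    known = {task["task_key"] for task in job["tasks"]}
    return [
        dep["task_key"]
        for task in job["tasks"]
        for dep in task.get("depends_on", [])
        if dep["task_key"] not in known
    ]


job = build_job("<pipeline-id>", "<warehouse-id>", "<query-id>")
print(undefined_dependencies(job))
```

The job only knows that the pipeline is one task that must finish before dbt starts. What happens between Bronze, Silver, and Gold stays inside the pipeline.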

Workflow Orchestration with Lakeflow Jobs covers how to wire pipelines into multi-task job workflows, repair runs, and event-driven triggering for production-grade orchestration.

Three Things That Break Lakeflow Pipelines in Production

Writing manual reads and writes inside a pipeline. Using spark.read.table("bronze.events") and df.write.mode("overwrite") inside a Lakeflow Pipeline bypasses the entire incremental processing engine. The pipeline has no idea what data has already been processed. Every run full-scans and full-overwrites. The pipeline works but delivers none of the cost or reliability benefits of the declarative model.

Not querying the event log for expectation monitoring. Teams that define expectations but never build monitoring against the event log are flying blind. An expectation that fails 0.01% of rows today and 8% of rows in three weeks looks the same in job status (both show green). Only the event log shows the trend.

Running pipelines in triggered mode when continuous mode is needed, or vice versa. Triggered pipelines run on a schedule or external trigger and stop when the update completes. Continuous pipelines run indefinitely, processing new data as it arrives. Teams that run a streaming Bronze ingestion pipeline in triggered mode do not lose the data that arrives between runs, but it sits unprocessed until the next update, so end-to-end latency can never be better than the trigger interval. Teams that run a Gold aggregation materialized view in continuous mode pay for always-on compute for a workload that a scheduled nightly trigger would handle at a fraction of the cost.
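To put a rough number on the first failure mode above, the sketch below compares how many rows get scanned across several runs when every run re-reads the whole table versus when each run reads only new rows. It is simple arithmetic in plain Python. The run count and row volumes are made-up illustration values, and it ignores everything except rows read.

```python
def rows_scanned(new_rows_per_run, incremental):
    """Total rows read across all runs.

    incremental=True  -> each run reads only the rows that arrived since
                         the previous run.
    incremental=False -> each run re-reads the entire table so far, as a
                         manual read plus overwrite would.
    """
    total_scanned = 0
    table_size = 0
    for new_rows in new_rows_per_run:
        table_size += new_rows
        total_scanned += new_rows if incremental else table_size
    return total_scanned


# Hypothetical workload: five runs, 1,000 new rows arriving before each run.
arrivals = [1000] * 5

full_scan_total = rows_scanned(arrivals, incremental=False)
incremental_total = rows_scanned(arrivals, incremental=True)

print(full_scan_total, incremental_total)
```

With these numbers the full-scan approach reads 15,000 rows against 5,000 for the incremental one, and the gap widens with every additional run because the full-scan cost grows with total table size rather than with new data.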

What This Series Covers Next

Designing Scalable ETL Pipelines on Databricks covers the implementation patterns for building transformation pipelines that handle real data volumes, schema evolution, and partition design using Delta Lake inside the Lakeflow framework.
Incremental Loads, CDC, and Change Data Feed in Delta Lake goes deep on AUTO CDC, SCD patterns, and how Change Data Feed powers incremental Silver updates from Bronze source tables.
Medallion Architecture in Databricks covers how streaming tables and materialized views map to Bronze, Silver, and Gold tier responsibilities and how expectations enforce quality at each layer boundary.

Krunal Kanojiya

Technical Content Writer
