Most teams running Databricks still have at least one corner of their stack that looks like this: a scheduled notebook, a manual spark.read, a df.write.mode("overwrite"), and a Lakeflow Job holding it together with duct tape.
It works. Until it does not. Then someone spends two hours figuring out why the Silver table is empty, only to discover the Bronze read job ran before the source data landed.
Lakeflow Spark Declarative Pipelines exists to remove this entire category of problem. Not by adding more orchestration tooling, but by changing the model entirely. You declare what the data should look like. The platform figures out the order, the retries, and the incremental state.
This article is part of the Modern Data Engineering: The Complete Guide series. The architecture context for where Lakeflow fits in the full Databricks stack is in Databricks for Data Engineering: Architecture, Components, and Best Practices. The storage layer that every Lakeflow pipeline writes to is covered in Delta Lake Explained for Data Engineers.
What Lakeflow Spark Declarative Pipelines Is and What It Replaced
Lakeflow Spark Declarative Pipelines, referred to as SDP or Lakeflow Pipelines, is the current name for what was previously called Delta Live Tables (DLT). The branding changed in 2025. The engine is the same.
As the Databricks official SDP documentation describes it: Lakeflow SDP is a framework for creating batch and streaming data pipelines in SQL and Python. It extends and is interoperable with Apache Spark Declarative Pipelines, running on the performance-optimized Databricks Runtime.
The naming matters for one reason: teams searching for DLT documentation, tutorials, or Stack Overflow answers will find content that still uses the old terminology. All of it applies. The concepts, the syntax, and the behavior are the same.
What changed beyond the name: Unity Catalog integration deepened significantly. Expectations are now stored in Unity Catalog tables rather than pipeline-local storage. Pipeline configuration can now be stored and read from Unity Catalog table properties. MANAGE permissions propagate automatically to pipeline outputs. The platform moved from a pipeline-scoped tool to a governance-aware platform feature.

Declarative vs Imperative: Why the Difference Matters
Most engineers learn data engineering imperatively. Write a Spark job. Tell it exactly what to read, how to join, what to write, and in which order. The engineer manages everything: the sequence, the incremental state, the retry logic, the dependency chain.
Declarative pipelines flip this. You define the output. The platform works out how to produce it.
As the Databricks SDP concepts documentation states directly: SDP can reduce hundreds or even thousands of lines of manual Spark and Structured Streaming code to only a few lines. Automatic orchestration ensures the correct execution order and maximum parallelism. Retry logic starts at the Spark task level, escalates to the flow level, then to the full pipeline if needed.
In practice this means:
- A Bronze-to-Silver pipeline does not need code to check whether Bronze ran first. SDP resolves the dependency automatically.
- A transformation that should process only new rows does not need manual checkpoint tracking. SDP handles incremental state.
- A failed flow does not need a human to restart the right subset of tasks. SDP retries at the most granular level that makes sense.
The engineer's job becomes defining the transformation logic and the quality rules. Not the operational plumbing.
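To make the dependency point concrete, here is a minimal sketch in plain Python of how an execution order can be derived from declarations alone. It is not the SDP engine or its API. The dataset names are hypothetical, and the standard library's `graphlib` stands in for whatever planner the platform actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical dataset declarations: each dataset lists only what it reads from.
# Nothing here states an execution order.
datasets = {
    "bronze_events": [],
    "bronze_customers": [],
    "silver_events": ["bronze_events"],
    "silver_customers": ["bronze_customers"],
    "gold_daily_revenue": ["silver_events", "silver_customers"],
}

def execution_order(definitions):
    """Return one valid run order: every dataset comes after its inputs."""
    return list(TopologicalSorter(definitions).static_order())

def parallel_stages(definitions):
    """Group datasets into stages whose members could run at the same time."""
    sorter = TopologicalSorter(definitions)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages

order = execution_order(datasets)
stages = parallel_stages(datasets)
print(order)
print(stages)
```

The declarations say nothing about sequence, yet both Bronze datasets land in the first stage and the Gold dataset lands last. That is the sense in which the order is derived rather than written by hand.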
The Three Dataset Types: When to Use Each
Every Lakeflow pipeline is built from three dataset types. Choosing the right one for each step is the single most consequential pipeline design decision.
| Dataset Type | Processing Model | Storage | Best For |
|---|---|---|---|
| Streaming Table | Each row processed exactly once | Delta table in Unity Catalog | Bronze ingestion, low-latency Silver updates |
| Materialized View | Incrementally refreshed, full correctness | Delta table in Unity Catalog | Silver and Gold transformations, aggregations |
| Temporary View | Computed at query time, no storage | None | Intermediate logic, reusable SQL within a pipeline |
Streaming Tables: Exactly-Once Ingestion
A streaming table processes each incoming row exactly once. Depending on the pipeline mode, it either runs continuously or picks up newly arrived data each time an update is triggered.
Use streaming tables for Bronze layer ingestion from Auto Loader, Kafka, Kinesis, or any append-heavy source. Use them for Silver updates where low-latency propagation from Bronze matters. As jamesm.blog's April 2026 Databricks engineering guide states: use streaming tables when you want low-latency append or upsert-style ingestion.
The exactly-once guarantee matters more than it sounds. Without it, a pipeline restart processes events that already landed, producing duplicate rows in your Silver table. Streaming tables handle this via checkpointing. The engine tracks the last processed offset. On restart it picks up from exactly that point.
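The checkpoint idea can be illustrated with a toy model in plain Python. This is a simplified sketch, not how the engine is implemented: the `source` list, the `checkpoint` dictionary, and both function names are hypothetical, and a real engine also has to commit the write and the offset atomically.

```python
def process_with_checkpoint(source, checkpoint, sink):
    """Append only events past the stored offset, then advance the offset."""
    start = checkpoint.get("offset", 0)
    sink.extend(source[start:])
    checkpoint["offset"] = len(source)

def process_without_checkpoint(source, sink):
    """Naive version: every run re-reads the whole source."""
    sink.extend(source)

source = ["e1", "e2", "e3"]

# With a checkpoint, a restart that sees no new data writes nothing.
checkpoint, silver = {}, []
process_with_checkpoint(source, checkpoint, silver)  # first run
process_with_checkpoint(source, checkpoint, silver)  # restart, no new events

# Without one, the same restart duplicates every row.
naive_silver = []
process_without_checkpoint(source, naive_silver)
process_without_checkpoint(source, naive_silver)

# A new event arrives; only that event is processed on the next run.
source.append("e4")
process_with_checkpoint(source, checkpoint, silver)

print(silver)
print(naive_silver)
```

The checkpointed sink ends up with each event once, while the naive sink holds two copies of the first three events after a single restart.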
Materialized Views: Incremental Correctness
A materialized view pre-computes a query result and stores it as a Delta table. On each pipeline run, the engine processes only new or changed upstream data and updates the materialized view accordingly.
Use materialized views for Silver and Gold transformations: complex joins, window functions, aggregations, and deduplication logic. As jamesm.blog recommends: use materialized views when correctness on recomputation matters more than latency.
The distinction teams miss: a materialized view can be fully recomputed if something upstream changes significantly. A streaming table cannot reprocess history by default. If you discover a Silver transformation bug six months in, a materialized view can be corrected and refreshed. A streaming table may require replaying from Bronze.
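A small sketch in plain Python shows the property being relied on here: an incremental refresh and a full recompute should agree on the result. The function names and the per-customer total are illustrative assumptions, not the engine's refresh logic.

```python
from collections import Counter

def full_recompute(rows):
    """Rebuild the aggregate from every upstream row."""
    totals = Counter()
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)

def incremental_refresh(stored, new_rows):
    """Fold only the newly arrived rows into the stored result."""
    totals = Counter(stored)
    for customer, amount in new_rows:
        totals[customer] += amount
    return dict(totals)

day_one = [("a", 10), ("b", 5)]
day_two = [("a", 7), ("c", 3)]

stored = full_recompute(day_one)               # initial build
stored = incremental_refresh(stored, day_two)  # later run touches new rows only
rebuilt = full_recompute(day_one + day_two)    # what a full refresh would produce

print(stored)
print(rebuilt)
```

Because the stored result can always be rebuilt from the upstream rows, fixing a bug in the transformation comes down to changing the definition and recomputing.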
Temporary Views: Intermediate Logic Without Storage
Temporary views compute results at query time within the pipeline. They do not write to storage. Use them to break complex transformation logic into readable named steps without paying the cost of materializing intermediate results.
A temporary view that joins three Bronze tables before a deduplication step runs each time it is referenced within the pipeline. Nothing persists. Nothing is queryable outside the pipeline context.
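The recompute-on-reference behavior described above can be pictured with an ordinary function standing in for the view. This is an analogy in plain Python, with hypothetical names and data, and it says nothing about how the engine may optimize repeated references.

```python
stats = {"evaluations": 0}

def joined_bronze():
    """Stand-in for a temporary view: evaluated each time it is referenced."""
    stats["evaluations"] += 1
    return [(1, "a"), (2, "b"), (2, "b")]  # hypothetical joined rows, one duplicate

def deduplicated_orders():
    """First downstream step that references the view."""
    return sorted(set(joined_bronze()))

def raw_row_count():
    """Second downstream step that references the same view."""
    return len(joined_bronze())

deduped = deduplicated_orders()
count = raw_row_count()

print(deduped, count, stats["evaluations"])
```

Two references mean two evaluations, and nothing is stored between them. That trade is worth making when the intermediate result is cheap to compute or referenced only once.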

AUTO CDC: The Right Way to Handle Change Data in 2026
Most teams building CDC pipelines in Databricks before 2025 used APPLY CHANGES INTO. It worked. It was also verbose, required careful sequencing, and produced brittle pipelines when source schemas evolved.
AUTO CDC is the current standard. It is available on streaming tables inside Lakeflow Pipelines and handles SCD Type 1 and SCD Type 2 patterns with minimal configuration.
As confirmed in the Databricks 2026 SDP release notes: SCD Type 1 materialization with AUTO CDC is now supported, providing a simpler CDC pattern that upserts the latest value without maintaining full change history. SCD Type 2 operations now automatically coalesce duplicate records with the same natural key, ensuring data consistency.
The practical impact: a CDC pipeline that previously required 40 to 60 lines of explicit APPLY CHANGES INTO logic with sequence columns and key specifications now works with a streamlined AUTO CDC declaration. The platform handles deduplication, key matching, and history tracking based on the pattern you select.
- When to use SCD Type 1 with AUTO CDC: Dimension tables where only the current state matters. Customer address, product category, status flags.
- When to use SCD Type 2 with AUTO CDC: Any table where analysts or ML models need to reconstruct what a record looked like at a specific historical point. Order state, subscription tier, pricing.
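The difference between the two patterns is easier to see on a handful of change events. The sketch below is plain Python, not AUTO CDC syntax. The column names (`customer_id`, `tier`, `updated_at`, `start_at`, `end_at`) and the helper functions are hypothetical and only illustrate the semantics.

```python
def apply_scd1(changes, key, sequence):
    """Type 1: keep only the latest version of each key."""
    current = {}
    for change in sorted(changes, key=lambda c: c[sequence]):
        current[change[key]] = change
    return current

def apply_scd2(changes, key, sequence):
    """Type 2: keep every version with a validity interval [start_at, end_at)."""
    history, open_rows = [], {}
    for change in sorted(changes, key=lambda c: c[sequence]):
        previous = open_rows.get(change[key])
        if previous is not None:
            previous["end_at"] = change[sequence]  # close the prior version
        row = {**change, "start_at": change[sequence], "end_at": None}
        history.append(row)
        open_rows[change[key]] = row
    return history

def as_of(history, key, key_value, point):
    """Return the version of a record that was valid at a given point."""
    for row in history:
        ends_later = row["end_at"] is None or point < row["end_at"]
        if row[key] == key_value and row["start_at"] <= point and ends_later:
            return row
    return None

changes = [
    {"customer_id": 1, "tier": "free", "updated_at": 1},
    {"customer_id": 1, "tier": "pro", "updated_at": 5},
    {"customer_id": 2, "tier": "free", "updated_at": 3},
    {"customer_id": 1, "tier": "team", "updated_at": 2},  # arrives out of order
]

current = apply_scd1(changes, "customer_id", "updated_at")
history = apply_scd2(changes, "customer_id", "updated_at")
tier_at_3 = as_of(history, "customer_id", 1, 3)["tier"]

print(current[1]["tier"], len(history), tier_at_3)
```

Type 1 leaves one row per customer holding the latest tier. Type 2 keeps all four versions, which is what makes the point-in-time lookup possible. Ordering by the sequence column rather than by arrival is what lets the out-of-order event land in the right place.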
Incremental Loads, CDC, and Change Data Feed in Delta Lake covers the full implementation of both patterns including late-arriving record handling and how AUTO CDC integrates with Change Data Feed on source tables.
Pipeline Observability: The Event Log Is Your Debug Tool
Every Lakeflow pipeline run writes structured records to an event log. This is the primary observability primitive. It is a Delta table you can query directly.
As the Databricks best practices documentation for SDP explains: every pipeline run writes records covering execution progress, data quality expectation results, data lineage, and error details. Query the event log using the event_log() table-valued function.
```sql
SELECT * FROM event_log('<pipeline-id>')
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 100;
```
What this gives you that generic Databricks job logs do not:
- Expectation pass/fail counts per flow per run, queryable as time-series data
- Flow-level lineage showing exactly which source tables fed which output tables in that specific run
- Error details at the flow level, not just the job level
- Data volume metrics showing rows processed per flow
Teams that build data quality dashboards against the event log catch expectation degradation before it reaches analysts. A Silver table expectation that was passing 99.8% of rows last week and is passing 94% this week is a source system problem in progress, not a random anomaly.
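A degradation check of this kind could look like the following sketch. It assumes the expectation metrics have already been pulled out of the event log into plain rows. The field names, the expectation names, and the 2-point threshold are hypothetical choices for illustration, not the event log schema.

```python
from collections import defaultdict

# Hypothetical per-run expectation metrics, already extracted from the event log.
runs = [
    {"run_date": "2026-01-05", "expectation": "valid_order_id",
     "passed_records": 9980, "failed_records": 20},
    {"run_date": "2026-01-12", "expectation": "valid_order_id",
     "passed_records": 9400, "failed_records": 600},
    {"run_date": "2026-01-05", "expectation": "non_null_customer",
     "passed_records": 10000, "failed_records": 0},
    {"run_date": "2026-01-12", "expectation": "non_null_customer",
     "passed_records": 9995, "failed_records": 5},
]

def pass_rate(row):
    """Fraction of records that passed the expectation in one run."""
    total = row["passed_records"] + row["failed_records"]
    return row["passed_records"] / total if total else 1.0

def degraded_expectations(rows, max_drop=0.02):
    """Return expectations whose pass rate fell by more than max_drop
    between their earliest and latest run."""
    by_expectation = defaultdict(list)
    for row in rows:
        by_expectation[row["expectation"]].append(row)
    alerts = {}
    for name, history in by_expectation.items():
        history.sort(key=lambda r: r["run_date"])  # ISO dates sort correctly as text
        first, last = pass_rate(history[0]), pass_rate(history[-1])
        if first - last > max_drop:
            alerts[name] = (first, last)
    return alerts

alerts = degraded_expectations(runs)
print(alerts)
```

With these sample rows, only `valid_order_id` is flagged, matching the 99.8% to 94% drop described above. The second expectation moved by a fraction of a point and stays quiet.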
For automated alerting, event hooks trigger custom webhooks when a pipeline fails or when a specific expectation failure threshold is breached. The hook fires before anyone has to check a dashboard.
Lakeflow Pipelines vs Lakeflow Jobs: Which One for What
This is the question that consistently trips up teams new to the Databricks stack.
As the Databricks pipeline task documentation frames it directly: Lakeflow Jobs provide a procedural approach to defining relationships between tasks. Lakeflow Spark Declarative Pipelines provide a declarative approach to defining relationships between datasets and transformations.
| Decision | Use Lakeflow Pipelines | Use Lakeflow Jobs |
|---|---|---|
| Defining transformation logic | Yes | No |
| Scheduling pipeline runs | No (a Job triggers the pipeline) | Yes |
| Managing dependencies between datasets | Yes | No |
| Coordinating across pipelines, notebooks, ML jobs | No | Yes |
| Data quality enforcement | Yes (expectations) | No native equivalent |
| Cross-system workflow (dbt + pipeline + SQL query) | No | Yes |
The correct mental model: Lakeflow Pipelines owns the data transformation logic and the dataset dependency graph. Lakeflow Jobs owns the scheduling, cross-system coordination, and operational workflow.
A typical production setup: Lakeflow Jobs triggers a Lakeflow Pipeline run as one task in a larger workflow that also runs a dbt model and a downstream SQL query. The pipeline manages its internal dependencies. The job manages the external sequence.
Workflow Orchestration with Lakeflow Jobs covers how to wire pipelines into multi-task job workflows, repair runs, and event-driven triggering for production-grade orchestration.
Three Things That Break Lakeflow Pipelines in Production
Writing manual reads and writes inside a pipeline. Using spark.read.table("bronze.events") and df.write.mode("overwrite") inside a Lakeflow Pipeline bypasses the entire incremental processing engine. The pipeline has no idea what data has already been processed. Every run full-scans and full-overwrites. The pipeline works but delivers none of the cost or reliability benefits of the declarative model.
Not querying the event log for expectation monitoring. Teams that define expectations but never build monitoring against the event log are flying blind. An expectation that fails 0.01% of rows today and 8% of rows in three weeks looks the same in job status (both show green). Only the event log shows the trend.
Running pipelines in triggered mode when continuous mode is needed, or vice versa. Triggered pipelines run on a schedule or external trigger and stop when the update completes. Continuous pipelines run indefinitely, processing new data as it arrives. Teams that run a streaming Bronze ingestion pipeline in triggered mode leave data that arrives between runs unprocessed until the next update, so downstream tables lag by up to the trigger interval. Teams that run a Gold aggregation materialized view in continuous mode pay for always-on compute for a workload that a scheduled nightly trigger would handle at a fraction of the cost.
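The mode trade-off comes down to simple arithmetic. The sketch below is a rough back-of-the-envelope model: it assumes a constant run duration, ignores cluster startup time and pricing, and uses made-up schedules. The function name and the numbers are illustrative only.

```python
def triggered_profile(interval_minutes, run_minutes):
    """Estimate daily compute time and worst-case data delay for a schedule."""
    runs_per_day = (24 * 60) // interval_minutes
    return {
        "compute_hours_per_day": runs_per_day * run_minutes / 60,
        # Data landing just after a run starts waits for the next run to finish.
        "worst_case_delay_minutes": interval_minutes + run_minutes,
    }

CONTINUOUS_COMPUTE_HOURS_PER_DAY = 24  # always-on, for comparison

nightly = triggered_profile(interval_minutes=24 * 60, run_minutes=30)
hourly = triggered_profile(interval_minutes=60, run_minutes=5)

print(nightly)
print(hourly)
```

Under these assumptions a nightly Gold refresh uses half an hour of compute against 24 hours for continuous mode, while an hourly Bronze trigger leaves data up to 65 minutes stale. Which side of that trade is acceptable depends on the latency the downstream consumers actually need.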
What This Series Covers Next
- Designing Scalable ETL Pipelines on Databricks covers the implementation patterns for building transformation pipelines that handle real data volumes, schema evolution, and partition design using Delta Lake inside the Lakeflow framework.
- Incremental Loads, CDC, and Change Data Feed in Delta Lake goes deep on AUTO CDC, SCD patterns, and how Change Data Feed powers incremental Silver updates from Bronze source tables.
- Medallion Architecture in Databricks covers how streaming tables and materialized views map to Bronze, Silver, and Gold tier responsibilities and how expectations enforce quality at each layer boundary.
