Databricks for Data Engineering: Architecture, Components and Best Practices

Krunal Kanojiya | June 22, 2026 | 19 minute read
TL;DR

Databricks data engineering architecture has three layers that must work together: Delta Lake for storage reliability, Lakeflow for pipeline execution (ingestion, transformation, orchestration), and Unity Catalog for governance across all of it. The platform has shifted to serverless compute and declarative pipelines as the default in 2026. Teams still running hand-managed clusters and ad hoc Spark jobs are operating a version of Databricks the platform has largely moved past. Medallion Architecture (Bronze, Silver, Gold) is not optional design theory. It is the structural pattern that makes every other component in the stack behave predictably at scale. The most expensive Databricks mistakes are architectural, not syntactical: missing Unity Catalog from day one, building pipelines that reprocess full tables instead of incremental loads, and using personal credentials instead of service principals. 

Most Databricks documentation tells you what each component does. It does not tell you how the components break when they are wired together.

This article does both. It explains the full architecture from storage layer to orchestration, how the components connect in production, and where teams consistently go wrong when they skip steps that look optional but are not.

If you are new to the platform, What Is Databricks and Why Data Teams Use It covers the fundamentals before this article. This article assumes you already know what Databricks is and want to understand how to actually build on it.

The Architecture in One Model: Three Layers, One Direction of Data Flow

Databricks data engineering architecture is three layers stacked on each other. Data flows in one direction through them. As the Databricks Data Intelligence Platform scope documentation describes it, the platform is an open and unified foundation for ETL, ML/AI, and analytics workloads with Unity Catalog as the central governance solution across all of them.

  • Layer 1: Storage. Delta Lake on cloud object storage. Every table your pipelines read and write is a Delta table. Delta Lake provides the ACID transactions, schema enforcement, time travel, and Change Data Feed that make production pipelines possible.
  • Layer 2: Processing and pipelines. Lakeflow. This is the engine that moves and transforms data: Lakeflow Connect for ingestion, Lakeflow Spark Declarative Pipelines for transformation, Lakeflow Jobs for orchestration.
  • Layer 3: Governance. Unity Catalog. Every asset in the first two layers, every table, every pipeline, every file, is registered, secured, and tracked here.

Teams that treat these as three separate concerns end up with governance bolted on after the fact. Unity Catalog is not something you add later. It is the foundation that makes the storage and processing layers trustworthy from the start.

Layer 1: Storage Architecture with Delta Lake

Every table in a Databricks lakehouse is a Delta table. This is not a recommendation. It is the prerequisite for every other architectural feature the platform offers.

Delta Lake stores data as Parquet files in cloud object storage with a _delta_log directory alongside them. That transaction log is the mechanism behind ACID guarantees, time travel, schema enforcement, and Change Data Feed. Without it, you have a folder of files with no coordination layer. The Databricks platform cannot make reliability guarantees about data that lives outside Delta format.

Delta Lake Explained for Data Engineers covers the full technical detail of how the transaction log, ACID properties, and Change Data Feed work. What matters architecturally is one decision: use Unity Catalog managed tables, not external tables, for all new production work.

Managed vs External Tables: Get This Right First

Managed tables store data in Unity Catalog's managed storage location. Unity Catalog owns the lifecycle: when you drop a managed table, the underlying data files are deleted too, typically after a short retention window rather than instantly.

External tables point to data stored at a path you specify. Unity Catalog tracks the metadata but does not manage the underlying files. When you drop an external table, the data stays on disk.

Most teams default to external tables because it feels safer. It is actually the less safe choice for production pipelines. External tables accumulate orphaned files, require manual lifecycle management, and complicate the lineage tracking Unity Catalog provides automatically for managed tables.

The correct default: use managed tables unless you have a specific reason not to, such as sharing data with a non-Databricks system that must read files directly from a fixed path.
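The difference between the two table types shows up directly in the DDL. The following is a minimal sketch in Python that only builds the statements as strings; the table names, columns, and storage path are hypothetical, and on Databricks the strings would typically be passed to spark.sql.

```python
from typing import Optional


def create_table_sql(full_name: str, columns: str, location: Optional[str] = None) -> str:
    """Build a CREATE TABLE statement for a Unity Catalog table.

    Without a location the table is managed. With a LOCATION clause it is
    external, and Unity Catalog tracks only the metadata.
    """
    statement = f"CREATE TABLE IF NOT EXISTS {full_name} ({columns})"
    if location is not None:
        statement += f" LOCATION '{location}'"
    return statement


# Hypothetical names and path, for illustration only.
managed = create_table_sql("silver.sales.orders", "order_id BIGINT, amount DOUBLE")
external = create_table_sql(
    "raw.partner.orders_export",
    "order_id BIGINT, amount DOUBLE",
    location="s3://example-bucket/partner/orders_export",
)
print(managed)
print(external)
```

The only syntactic difference is the LOCATION clause, which is why the choice is easy to make by accident and worth making deliberately.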

Liquid Clustering Is the Default Partitioning Strategy in 2026

If your production Delta tables still use PARTITIONED BY (date), your performance strategy is based on a Databricks version from several years ago.

Liquid clustering, now the official Databricks recommendation per their best practices documentation, replaces manual partitioning and ZORDER for all new tables. It co-locates related data on the clustering columns you choose, without fixing a directory layout the way partitions do. When your query patterns change, you change the clustering columns with ALTER TABLE ... CLUSTER BY and run OPTIMIZE. Existing files are not rewritten up front; newly written and newly optimized data follows the new keys. No data migration.

The one scenario where manual partitioning still makes sense: very large tables with a low-cardinality date column, such as a month, where partition pruning eliminates entire months of data from a query scan. This is an either-or decision per table, because liquid clustering cannot be combined with PARTITIONED BY or ZORDER on the same table. If queries also filter on high-cardinality secondary columns, liquid clustering on the date plus those columns is usually the better choice.
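As a rough illustration of what adopting liquid clustering looks like, the sketch below builds the relevant SQL statements in Python. The helper names, table name, and columns are hypothetical; the CLUSTER BY and OPTIMIZE statements are the Databricks SQL forms, built here as strings rather than executed.

```python
from typing import List, Sequence


def _cluster_clause(columns: Sequence[str]) -> str:
    """Render a CLUSTER BY clause for one or more clustering columns."""
    if not columns:
        raise ValueError("at least one clustering column is required")
    return "CLUSTER BY (" + ", ".join(columns) + ")"


def create_clustered_table_sql(full_name: str, schema: str, columns: Sequence[str]) -> str:
    """CREATE TABLE with liquid clustering instead of PARTITIONED BY."""
    return f"CREATE TABLE IF NOT EXISTS {full_name} ({schema}) {_cluster_clause(columns)}"


def recluster_sql(full_name: str, new_columns: Sequence[str]) -> List[str]:
    """Statements to change clustering keys after query patterns shift."""
    return [
        f"ALTER TABLE {full_name} {_cluster_clause(new_columns)}",
        f"OPTIMIZE {full_name}",
    ]


# Hypothetical table and columns.
print(create_clustered_table_sql(
    "silver.web.events", "event_date DATE, user_id BIGINT, event_type STRING", ["event_date"]
))
for statement in recluster_sql("silver.web.events", ["event_date", "user_id"]):
    print(statement)
```

Changing keys is two statements against the existing table, which is the practical difference from repartitioning.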

Layer 2: The Lakeflow Pipeline Architecture

Lakeflow is the unified data engineering solution inside Databricks. It has three components that map directly to the three pipeline stages every data engineering team needs.

Lakeflow Connect: Ingestion Without Custom Code

Lakeflow Connect provides fully managed connectors for ingesting data from enterprise applications, operational databases, cloud storage, and streaming message buses.

What this means in practice: no custom ingestion code for the sources it covers. Salesforce, Workday, ServiceNow, SQL Server, PostgreSQL, MySQL, Oracle, S3, Azure Data Lake Storage, Apache Kafka, Amazon Kinesis. The connector handles authentication, incremental reads, schema detection, and error recovery. Ingested data lands directly in Unity Catalog managed Delta tables.

For cloud object storage specifically, Auto Loader is the Lakeflow Connect mechanism for file-based ingestion. It monitors a storage location, detects new files as they arrive, and processes them incrementally without scanning the full directory on every run. This matters for Bronze layer design: teams that use a batch spark.read against an entire S3 prefix instead of Auto Loader are reprocessing every file on every pipeline run.
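A minimal sketch of an Auto Loader read, assuming JSON files and an active SparkSession passed in as spark. The function names and paths are hypothetical; the cloudFiles format and the two options shown are standard Auto Loader settings. Nothing here runs against a cluster until read_bronze_stream is called on Databricks.

```python
from typing import Any, Dict


def autoloader_options(file_format: str, schema_location: str) -> Dict[str, str]:
    """Options for an Auto Loader stream.

    schema_location is where Auto Loader persists the inferred schema
    between runs.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def read_bronze_stream(spark: Any, source_path: str, schema_location: str) -> Any:
    """Return a streaming DataFrame that picks up only new files.

    `spark` is assumed to be an active SparkSession on Databricks.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options("json", schema_location))
        .load(source_path)
    )


# Hypothetical paths, shown only to illustrate the shape of the options.
print(autoloader_options("json", "/Volumes/raw/landing/_schemas/events"))
```

Contrast this with spark.read.json(source_path), which lists and reads every file under the path on each run.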

The connectors use serverless compute automatically. There is no cluster to configure. This is the single most common source of surprise for engineers migrating from classic Spark jobs: the cluster configuration UI simply does not appear because serverless handles it.

According to the Databricks blog on Lakeflow, teams using Lakeflow on Azure Databricks report building pipelines up to 25x faster and reducing ETL costs by up to 83% compared to custom connector code. Those numbers reflect the engineering hours that previously went into building and maintaining ingestion connectors that Lakeflow now provides out of the box.

Lakeflow Spark Declarative Pipelines: How Transformation Actually Works

Lakeflow Spark Declarative Pipelines is the transformation layer. The name replaced Delta Live Tables in 2025, but the underlying engine is the same: Apache Spark and Structured Streaming running a unified batch and streaming execution model.

The key concept that separates declarative pipelines from ad hoc Spark jobs: you define what each output table should look like, not how to compute it. The platform handles dependency resolution, incremental processing, retries, and schema evolution.
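To make the declarative idea concrete, here is a toy model in plain Python. It is not the Lakeflow API: the dataset decorator, the registry, and run_pipeline are hypothetical stand-ins. Each function declares what a table contains and which tables it depends on, and a small resolver works out the execution order.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Registry of declared datasets: name -> (upstream names, defining function).
_DATASETS: Dict[str, Tuple[List[str], Callable[..., list]]] = {}


def dataset(name: str, depends_on: Tuple[str, ...] = ()):
    """Register a function as the definition of a dataset (toy decorator)."""
    def register(fn):
        _DATASETS[name] = (list(depends_on), fn)
        return fn
    return register


# Declared out of order on purpose: the resolver, not the author, orders them.
@dataset("gold_revenue", depends_on=("silver_events",))
def gold_revenue(silver_events):
    return [{"total_amount": sum(row["amount"] for row in silver_events)}]


@dataset("silver_events", depends_on=("bronze_events",))
def silver_events(bronze_events):
    seen, cleaned = set(), []
    for row in bronze_events:
        if row["id"] not in seen:  # deduplicate on id
            seen.add(row["id"])
            cleaned.append({"id": row["id"], "amount": int(row["amount"])})
    return cleaned


@dataset("bronze_events")
def bronze_events():
    return [{"id": 1, "amount": "10"}, {"id": 1, "amount": "10"}, {"id": 2, "amount": "5"}]


def run_pipeline():
    """Resolve dependencies, then compute each dataset after its inputs."""
    graph = {name: set(deps) for name, (deps, _) in _DATASETS.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        deps, fn = _DATASETS[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return order, results


order, results = run_pipeline()
print(order)                    # ['bronze_events', 'silver_events', 'gold_revenue']
print(results["gold_revenue"])  # [{'total_amount': 15}]
```

The real engine adds incremental state, retries, and schema evolution on top of this ordering step, but the authoring model is the same: definitions plus dependencies, no hand-written run order.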

Engineers work with three dataset types inside a pipeline:

  • Streaming tables process each incoming row exactly once. Use these for Bronze layer ingestion and any Silver layer table that needs low-latency updates.
  • Materialized views pre-compute query results and refresh them incrementally. Use these for Silver and Gold layer transformations where correctness matters more than latency.
  • Temporary views hold intermediate transformation logic without writing anything to storage.

The practical result: a Bronze-to-Silver transformation written as a materialized view can process only new or changed rows since the last pipeline run, provided the query is eligible for incremental refresh. It does not have to full-scan the entire Bronze table on every execution. Teams that switch from scheduled Spark jobs to declarative pipelines typically see 60-80% compute cost reduction on their transformation layer simply because they stop reprocessing data that has not changed.

As jamesm.blog's 2026 Databricks engineering guide states directly: if your platform still depends on ad hoc Spark jobs for transformation, you are probably optimizing for an older Databricks era. The declarative approach is not just more convenient. It removes an entire category of engineering work around incremental state management that teams used to build and maintain themselves.

What to avoid: writing pipeline logic that uses spark.read.table("bronze.events") and then df.write.mode("overwrite") inside a Lakeflow pipeline. This defeats the entire incremental processing mechanism. The pipeline engine manages reads and writes automatically. Manually reading and writing bypasses the state tracking and produces a pipeline that full-scans on every run.
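The cost of full-scanning grows faster than most teams expect. A back-of-the-envelope sketch, assuming a hypothetical one million new rows per day and one run per day, compares total rows read over six months:

```python
def rows_scanned_full_overwrite(rows_per_day: int, days: int) -> int:
    """Total rows read when every daily run rescans all history so far."""
    return sum(rows_per_day * day for day in range(1, days + 1))


def rows_scanned_incremental(rows_per_day: int, days: int) -> int:
    """Total rows read when every daily run reads only that day's new rows."""
    return rows_per_day * days


ROWS_PER_DAY = 1_000_000  # hypothetical daily volume
DAYS = 180                # roughly six months of daily runs

full = rows_scanned_full_overwrite(ROWS_PER_DAY, DAYS)
incremental = rows_scanned_incremental(ROWS_PER_DAY, DAYS)
print(full)                # 16290000000
print(incremental)         # 180000000
print(full / incremental)  # 90.5
```

Under these assumptions the full-overwrite pipeline reads about 90 times more data over the period, and the gap widens every day the table grows.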

Lakeflow Jobs: Orchestrating the Full Stack

Lakeflow Jobs is the orchestration layer. It schedules and coordinates everything in the Databricks stack: pipeline runs, notebook jobs, Python scripts, dbt models, ML training jobs, SQL queries.

The architecture features that matter for production:

  • Task dependencies. A job is a directed acyclic graph of tasks. Each task can depend on one or more upstream tasks. A Silver layer pipeline only runs after its Bronze pipeline completes successfully. A Gold layer job only runs after both Silver pipelines finish. Dependency chains are visual and testable before deployment.
  • For-each task loops. A single task definition fans out across a list of inputs. If you process data for 50 merchant accounts with the same pipeline logic, a for-each loop runs that pipeline once per merchant without duplicating 50 task definitions.
  • Repair runs. When a multi-task job fails at task 8 of 20, a repair run reruns only task 8 and the tasks downstream of it. Teams that do not use repair runs restart the entire job, reprocessing seven tasks that already completed successfully. On large jobs, this is the difference between a 15-minute recovery and a 3-hour one.
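The repair-run behavior can be modeled in a few lines. This is a simplified simulation, not the Jobs API: the function name and the linear 20-task job are hypothetical, and the rule applied is simply "rerun everything that did not succeed, in dependency order".

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def tasks_to_repair(dependencies: Dict[str, Set[str]], succeeded: Set[str]) -> List[str]:
    """Return the tasks that did not succeed, in a valid execution order.

    dependencies maps each task to the set of tasks it depends on.
    """
    order = TopologicalSorter(dependencies).static_order()
    return [task for task in order if task not in succeeded]


# A linear 20-task job: task_2 depends on task_1, task_3 on task_2, and so on.
job = {f"task_{i}": ({f"task_{i - 1}"} if i > 1 else set()) for i in range(1, 21)}

# Tasks 1 through 7 completed before task 8 failed.
completed = {f"task_{i}" for i in range(1, 8)}

rerun = tasks_to_repair(job, completed)
print(len(rerun), rerun[0])  # 13 task_8
```

A full restart would run all 20 tasks; the repair path runs 13 and leaves the seven successful ones untouched.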

Orchestration decisions that belong in Lakeflow Jobs versus ones that belong inside a pipeline are a common point of confusion. The line is clear: pipeline-internal dependencies go inside Lakeflow Pipelines. Cross-pipeline and cross-system dependencies go in Lakeflow Jobs. If table A produces data that table B reads, that dependency belongs in a pipeline. If pipeline A completes and then a separate SQL query must run before pipeline B starts, that sequence belongs in a job.

Workflow Orchestration with Lakeflow Jobs covers the full production design for complex multi-task workflows, including error handling patterns, alerting configuration, and CI/CD deployment with Declarative Automation Bundles.

Layer 3: Unity Catalog Governance Architecture

Unity Catalog is where most Databricks architectures have the largest gap between what is configured and what should be.

The common pattern: a team builds a functioning data pipeline, everything works in development, and Unity Catalog is either not enabled or only partially configured. Then the platform goes to production. The security review finds uncontrolled data access. The compliance audit finds no lineage records. The onboarding of a second team creates conflicts because two engineers own tables that both teams need.

All of these problems have the same root: Unity Catalog was treated as optional configuration rather than foundational architecture.

The Three-Level Namespace

Unity Catalog organizes every asset in a three-level hierarchy: catalog > schema > table.

Every table reference in production code should use the full three-level path: catalog.schema.table.
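One way to enforce the rule is a small check in CI or in shared pipeline utilities. The sketch below is an assumption-laden example: the TableRef type and parse_table_ref function are hypothetical, and the identifier pattern is deliberately stricter than what Unity Catalog accepts (it ignores backtick-quoted names).

```python
import re
from typing import NamedTuple

# Simplified identifier rule (assumption): letters, digits, underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


class TableRef(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_ref(name: str) -> TableRef:
    """Parse 'catalog.schema.table' and reject one- or two-level names."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    for part in parts:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"invalid identifier {part!r} in {name!r}")
    return TableRef(*parts)


print(parse_table_ref("gold.sales.daily_revenue"))
```

A reference such as sales.daily_revenue raises ValueError, which catches code that silently depends on whatever catalog happens to be the session default.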

A standard production namespace design:

Catalog   | Purpose                      | Who Has Write Access
raw       | Bronze layer, landing zone   | Pipeline service principals only
silver    | Cleaned, validated data      | Pipeline service principals only
gold      | Business-ready aggregations  | Pipeline service principals only
sandbox   | Ad hoc engineering work      | Individual engineers
analytics | BI and reporting views       | Analytics team service principals

Analysts and data scientists get read access to silver and gold. They never write to those catalogs. Nothing in the data stack is more expensive to fix than a production Gold table overwritten by an ad hoc notebook.
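A sketch of what read-only access looks like as Unity Catalog grants, built as SQL strings in Python. The function name and the analysts group are hypothetical. Granting at the catalog level relies on privilege inheritance to cover the schemas and tables beneath it; confirm that this matches the scope you intend before using it.

```python
from typing import List


def read_only_grants(catalog: str, principal: str) -> List[str]:
    """GRANT statements giving a principal read access to a whole catalog.

    No MODIFY or CREATE privilege is included, so the principal cannot write.
    """
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON CATALOG {catalog} TO `{principal}`",
        f"GRANT SELECT ON CATALOG {catalog} TO `{principal}`",
    ]


# Hypothetical group name.
for catalog in ("silver", "gold"):
    for statement in read_only_grants(catalog, "analysts"):
        print(statement)
```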

Data Lineage Is Automatic, But Only If Unity Catalog Is Active

Every pipeline run in Databricks writes lineage records to Unity Catalog automatically: which source tables fed which output tables, through which pipeline, at what time. This is column-level lineage for managed tables.

The catch: lineage only records for assets registered in Unity Catalog. Data that flows through tables or file paths that Unity Catalog does not know about produces no lineage record. Teams that skip Unity Catalog for "convenience" in early stages discover during their first compliance audit that they have no traceable record of where any of their Gold layer data came from.

Data Governance with Unity Catalog in Databricks covers the full governance architecture, including row-level security, column masking for PII, audit log configuration, and the lineage explorer.

How the Medallion Architecture Connects the Layers

Medallion Architecture is the data organization pattern that makes the three-layer Databricks architecture function as a coherent system. Without it, the storage layer becomes a collection of unrelated tables with no clear flow of data quality or reliability.

Bronze layer holds data exactly as it arrived from the source. No transformations. No cleaning. Schema is whatever the source sent. The value of Bronze is that it is the system of record for raw data. When a downstream transformation produces wrong results, you debug from Bronze. When a source system changes its schema, Bronze absorbs the change and downstream pipelines handle it in Silver.

Silver layer is where data becomes trustworthy. Deduplication, schema validation, type casting, null handling, business key reconciliation. A Silver table has a defined schema that every downstream consumer can rely on. Expectations inside Lakeflow Pipelines enforce data quality rules at Silver write time: a row that fails an expectation is either kept and counted as a violation, dropped, or fails the pipeline update, depending on how the expectation is configured. Quarantining bad rows into a separate table is a pattern you build on top of these actions.
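The behavior of an expectation can be modeled in plain Python. This is a simulation of the idea, not the pipeline API: apply_expectation, ExpectationFailed, and the on_violation values are hypothetical names chosen to mirror the warn, drop, and fail behaviors.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised when a rule configured to fail the run is violated."""


def apply_expectation(
    rows: Iterable[Row],
    name: str,
    predicate: Callable[[Row], bool],
    on_violation: str = "warn",
) -> Tuple[List[Row], int]:
    """Apply one data quality rule; return (rows kept, violation count)."""
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action {on_violation!r}")
    kept: List[Row] = []
    violations = 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if on_violation == "fail":
            raise ExpectationFailed(f"{name}: violated by {row!r}")
        if on_violation == "warn":
            kept.append(row)  # keep the row, but count the violation
    return kept, violations


bronze = [{"id": 1, "amount": 10}, {"id": None, "amount": 5}]


def has_id(row: Row) -> bool:
    return row["id"] is not None


print(apply_expectation(bronze, "valid_id", has_id, "warn"))
print(apply_expectation(bronze, "valid_id", has_id, "drop"))
```

With warn both rows pass through and one violation is recorded; with drop only the valid row reaches Silver; with fail the run stops at the first bad row.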

Gold layer is where data becomes useful. Aggregations, business metrics, dimensional models built for specific reporting or ML use cases. Gold tables are optimized for the consumers that read them, not for the pipelines that produce them.

The mistake teams make at Gold is over-materializing. Every business unit requests a custom Gold table. Within six months there are 80 Gold tables, many covering similar logic with slightly different definitions, and no single source of truth for any metric. Gold tables should serve defined consumer use cases, not every possible aggregation a team might want. Databricks SQL views on top of Silver tables can handle exploratory querying without creating permanent Gold table proliferation.

Medallion Architecture in Databricks covers the full design for each tier, including how expectations enforce quality at Silver boundaries and how CDC feeds from Bronze keep Silver tables updated incrementally.

Serverless Compute: The Default Stack in 2026

Databricks serverless compute in 2026 is not a premium option. It is the recommended default for most new workloads. The platform has moved on from the model where engineers provision clusters, configure autoscaling policies, and manage Databricks Runtime versions.

Serverless compute allocates resources automatically when a job, pipeline, or SQL query runs. It terminates those resources when the workload completes. You pay only for the time your code is actually executing.

What this changes architecturally: the cluster configuration that used to be a significant engineering decision is now handled by the platform. Engineers who previously spent hours debugging cluster sizing, autoscaling events, and out-of-memory errors on the wrong instance type spend that time on pipeline logic instead.

When Classic Compute Still Makes Sense

Serverless is the right default. It is not the right choice for every workload.

As Unravel's compute comparison guide explains directly: serverless excels at high-concurrency, short-duration queries. Classic compute dominates for long-running jobs where you need resource control and predictable costs. Using the wrong one has a real cost impact: a two-hour nightly ETL job that costs $16 on classic compute can cost $60 on serverless.

Workload                           | Recommended Compute            | Why
Lakeflow Pipelines (declarative)   | Serverless                     | Auto-scaling, no cluster management, governed
SQL queries and BI workloads       | Serverless SQL Warehouse       | Instant start, scales with concurrency
Short scheduled jobs under 30 min  | Serverless                     | Cost-efficient for short durations
Long-running ETL jobs over 2 hours | Classic job clusters           | More cost-predictable at sustained duration
ML training with GPU requirements  | Classic clusters               | GPU instance types, custom configuration
Streaming with Real-Time Mode      | Classic (standard access mode) | RTM requires classic compute currently
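To see what the quoted nightly ETL figures add up to, here is the arithmetic as a short Python sketch. The $16 and $60 per-run costs come from the comparison cited above; the function name and the assumption of one run per night for a year are illustrative.

```python
from typing import Dict


def compare_run_costs(classic: float, serverless: float, runs_per_year: int = 365) -> Dict[str, float]:
    """Compare per-run costs and project the difference over a year."""
    extra_per_run = serverless - classic
    return {
        "cost_ratio": serverless / classic,
        "extra_per_run": extra_per_run,
        "extra_per_year": extra_per_run * runs_per_year,
    }


result = compare_run_costs(classic=16.0, serverless=60.0)
print(result["cost_ratio"])      # 3.75
print(result["extra_per_run"])   # 44.0
print(result["extra_per_year"])  # 16060.0
```

One misplaced nightly job is a five-figure annual difference under these numbers, which is why the compute choice is worth making per workload rather than platform-wide.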

One operational rule that prevents the most expensive serverless mistakes: set auto-termination to 1 minute for development SQL warehouses. The default is 10 to 20 minutes. At that setting, engineers who walk away from an active session are billing for idle compute. One minute of idle time before termination cuts that cost to almost nothing in dev environments where instant restart is irrelevant. Depending on the workspace, values below the UI minimum may have to be set through the SQL Warehouses API.
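A sketch of a warehouse definition with a short auto-stop, expressed as a request payload built in Python. The function name, warehouse name, and size are hypothetical, and the field names are intended to follow the SQL Warehouses REST API; check them against the current API reference before relying on them.

```python
import json
from typing import Any, Dict


def dev_warehouse_payload(name: str, auto_stop_mins: int = 1) -> Dict[str, Any]:
    """Request body for a small serverless SQL warehouse for development."""
    if auto_stop_mins < 1:
        raise ValueError("auto_stop_mins must be at least 1 for a dev warehouse")
    return {
        "name": name,
        "cluster_size": "2X-Small",          # assumed smallest size
        "warehouse_type": "PRO",
        "enable_serverless_compute": True,
        "auto_stop_mins": auto_stop_mins,    # idle minutes before shutdown
        "max_num_clusters": 1,
    }


print(json.dumps(dev_warehouse_payload("dev-adhoc"), indent=2))
```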

The Production Architecture Checklist

Teams that get Databricks production-ready follow these decisions in roughly this order. Teams that skip steps usually discover which one they missed when something breaks at an inconvenient time.

Governance foundation

  • Unity Catalog enabled with a three-level namespace designed before any tables are created
  • Service principals created for all pipeline execution, no personal credentials in job configurations
  • Column-level permissions and PII masking policies defined for all Silver and Gold tables containing sensitive data
  • Audit log retention configured before production data enters the platform

Storage design

  • Managed tables as the default for all production Delta tables
  • Liquid clustering enabled for new tables, PARTITIONED BY only where clearly justified
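  • delta.logRetentionDuration and delta.deletedFileRetentionDuration set to match the time travel window your team actually needs, since time travel requires both the log entries and the underlying data files to still exist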
  • Predictive Optimization enabled for Unity Catalog managed tables to handle OPTIMIZE and VACUUM automatically

Pipeline architecture

  • Auto Loader for all file-based Bronze ingestion, not full directory reads
  • Declarative Pipelines for all Silver and Gold transformations, not ad hoc Spark jobs
  • Data quality expectations defined at Bronze-to-Silver and Silver-to-Gold boundaries
  • CDC enabled via Change Data Feed on Silver tables that feed downstream incremental consumers

Orchestration

  • All production pipelines deployed via Declarative Automation Bundles with version control, not manually configured in the UI
  • Repair run documented as the standard recovery procedure for all multi-task jobs with more than three tasks, instead of full reruns
  • Alert policies set for job duration, not just job failure
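Duration alerting is the checklist item most often left out, so here is a hedged sketch of what it can look like in a job definition. The function name, threshold, and recipient address are hypothetical; the health rule and notification field names are intended to follow the Jobs API job settings and should be verified against the current reference.

```python
from typing import Any, Dict, Sequence


def duration_alert_settings(max_seconds: int, recipients: Sequence[str]) -> Dict[str, Any]:
    """Job settings fragment that alerts on slow runs as well as failed ones."""
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    return {
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": max_seconds,
                }
            ]
        },
        "email_notifications": {
            "on_failure": list(recipients),
            "on_duration_warning_threshold_exceeded": list(recipients),
        },
    }


# Hypothetical threshold: warn when a run exceeds one hour.
settings = duration_alert_settings(3600, ["data-oncall@example.com"])
print(settings["health"]["rules"][0])
```

A job that quietly grows from 20 minutes to three hours never fails, so failure-only alerting does not catch it.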

Three Architecture Decisions That Cost Teams the Most

These are not edge cases. They show up consistently in production Databricks environments that were built without a deliberate architecture review.

Running pipelines as individual user identities instead of service principals. The engineer who set up the pipeline leaves the company or changes roles. Their personal access token expires. Twenty pipelines fail simultaneously. As Filip Pastuszka's February 2026 guide on Databricks mistakes frames it directly: jobs they owned break or disappear, permissions vanish. Use OAuth service principals from the start. Migrating away from personal credentials in a production system with hundreds of jobs is a multi-week project.
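A quick audit makes this risk visible before someone leaves. The sketch below assumes job definitions have been exported as dictionaries whose run_as block carries either a user_name or a service_principal_name, as in Jobs API job settings; the function name and the sample jobs are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def jobs_not_run_as_service_principal(jobs: Iterable[Dict[str, Any]]) -> List[str]:
    """Names of jobs whose run_as identity is not a service principal.

    Jobs with no run_as block are flagged too, on the assumption that
    they fall back to a human owner's identity.
    """
    flagged = []
    for job in jobs:
        run_as = job.get("run_as") or {}
        if "service_principal_name" not in run_as:
            flagged.append(job.get("name", "<unnamed>"))
    return flagged


# Hypothetical exported job settings.
exported_jobs = [
    {"name": "bronze_ingest", "run_as": {"service_principal_name": "00000000-0000-0000-0000-000000000000"}},
    {"name": "silver_orders", "run_as": {"user_name": "engineer@example.com"}},
    {"name": "gold_revenue"},
]

print(jobs_not_run_as_service_principal(exported_jobs))  # ['silver_orders', 'gold_revenue']
```

Running a check like this on a schedule turns a future multi-week migration into a short list of jobs to fix now.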

Building the data platform without Unity Catalog, then trying to add it later. Unity Catalog cannot be retroactively applied to data already in place without migrating every table, updating every pipeline reference, and rewriting every access policy. Teams that skip Unity Catalog in their initial build because it feels like overhead spend three to six months migrating later, while operating with zero lineage, no column-level security, and untracked access to production data in the meantime.

Writing transformation pipelines that overwrite full tables instead of processing incrementally. A pipeline that reads the full Bronze events table on every run and overwrites the Silver table processes the same historical data repeatedly. On day one this takes five minutes. On month six it takes four hours. Declarative Pipelines with streaming tables and materialized views process only new or changed data. Switching a full-overwrite pipeline to incremental processing after it has been running in production requires careful state management and testing. Designing for incremental processing from the start is substantially easier.

What This Series Covers Next

This article is the architecture reference for the data engineering series. Every article that follows builds a specific component in detail.

  • Medallion Architecture in Databricks goes deep on Bronze, Silver, and Gold layer design: how to structure tables at each tier, how expectations enforce data quality at tier boundaries, and how CDC flows keep Silver tables incrementally updated.
  • Lakeflow Pipelines for Data Engineering covers the full implementation of Lakeflow Spark Declarative Pipelines: streaming tables vs materialized views, expectation design, schema evolution handling, and how to structure a production pipeline for reliability.
  • Data Governance with Unity Catalog in Databricks covers the full governance architecture: catalog and schema design, row-level security, column masking, lineage explorer, and audit log configuration for regulated environments.
  • Workflow Orchestration with Lakeflow Jobs covers production orchestration patterns: multi-task job design, for-each loops, repair runs, alerting, and CI/CD deployment using Declarative Automation Bundles.

New to the series? Modern Data Engineering: The Complete Guide is the starting point.

