Why does my Databricks ETL pipeline slow down as data grows?

The most common causes are full directory reads that reprocess all historical files instead of just new ones, Python UDFs that process rows one at a time instead of using vectorised execution, and data skew in join keys where one partition receives a disproportionate share of the data. Each of these is invisible at small data volumes and becomes the bottleneck at production scale.

What is the difference between Auto Loader and COPY INTO in Databricks?

Auto Loader tracks processed files via checkpoint state and handles continuous streaming ingestion at any file volume without metadata degradation. COPY INTO tracks files in a Delta transaction log and works well for batch loads with file counts under roughly 100,000. For any Bronze ingestion pipeline that accumulates files continuously or at high volume, Auto Loader is the correct choice. COPY INTO is better suited for controlled batch migrations and structured one-time loads.

How do you fix data skew in a Databricks ETL pipeline?

First confirm skew by checking the Spark UI Stages tab for tasks with data sizes significantly larger than the median. Then apply salting: add a random numeric suffix to the skewed join key, aggregate on the salted key, then strip the suffix and re-aggregate. For Delta tables, liquid clustering on the join column reduces skew at read time by physically co-locating related data in the same files.

Does schema evolution work automatically across all Databricks components?

No. Schema evolution must be configured per component and does not propagate automatically. Auto Loader supports it with cloudFiles.schemaEvolutionMode. Delta tables support it with mergeSchema = true. The from_json parser does not support schema evolution at all: new fields added by the source are silently dropped. Any pipeline that parses JSON with from_json and expects schema evolution to handle new fields will lose those fields without any error or warning.

When should you use serverless compute vs classic clusters for ETL pipelines?

Serverless is the right default for Lakeflow Declarative Pipelines, SQL warehouse queries, and jobs that complete in under two hours. Classic clusters are better for long-running ETL jobs over two hours where cost predictability matters more than elasticity, and for GPU-based ML training workloads. A three-hour nightly batch job that runs continuously at high memory utilisation costs less on a fixed classic cluster than on serverless DBU billing.

Scalable ETL Pipelines on Databricks: Design Guide 2026

TL;DR

Scalable ETL pipelines on Databricks require four design decisions to be correct from the start: use Auto Loader for file-based ingestion instead of full directory reads, write transformation logic in SQL or vectorised PySpark functions rather than row-level Python UDFs, configure schema evolution per component because Databricks does not apply it globally, and handle data skew actively since Spark's Adaptive Query Execution helps but does not eliminate skew caused by genuinely imbalanced join keys. Getting any of these wrong produces a pipeline that works in development and degrades quietly in production as data volume grows.

A Databricks ETL pipeline that runs in 12 minutes on 100GB can take 4 hours on 10TB. The pipeline code is identical. The difference is in a handful of design decisions made before the data ever lands in Bronze.

This article covers those decisions. Not the tooling overview you have already read. The specific choices that determine whether your pipeline scales or degrades: ingestion strategy, transformation code patterns, schema evolution handling, data skew, and compute configuration.

However, this article also, builds directly on Lakeflow Pipelines for Data Engineering, which covers the declarative pipeline framework, and Medallion Architecture in Databricks, which defines the Bronze-Silver-Gold structure these pipelines populate. For the foundational ETL vs ELT pattern decision, see ETL vs ELT in Modern Data Engineering. The series starts at Modern Data Engineering: The Complete Guide.

Why Pipelines That Work in Dev Break at Scale

The failure mode is always the same pattern. A pipeline is built and tested on a sample: one month of events, a few hundred million rows, a dev cluster sized appropriately for that volume. It works. It gets promoted to production. Three months later, someone notices it is taking six hours instead of forty minutes.

Nothing in the code changed. The data volume did.

Four things cause this specific failure:

Full directory reads instead of incremental ingestion. Reading an entire S3 prefix on every run means the job processes 3 months of accumulated files each time instead of just the new ones.
Row-level Python UDFs in transformation logic. A UDF that processes one row at a time does not vectorise. At 1 billion rows, that serialisation overhead becomes the bottleneck.
Unhandled data skew on join keys. A customer_id join where one customer generated 40% of events sends 40% of the data to one executor while the other executors wait.
Schema evolution applied at the wrong layer, silently dropping or nulling new fields that arrived from the source.

Each of these is a design decision, not a bug. They can be avoided before the first line of production data lands.

Auto Loader vs COPY INTO: Choosing the Right Ingestion Pattern

Both Auto Loader and COPY INTO handle incremental file ingestion from cloud object storage. They solve the same problem differently. Choosing the wrong one has direct performance consequences.

Dimension	Auto Loader	COPY INTO
State tracking	Tracks processed files via checkpointing in RocksDB or DynamoDB	Tracks processed files in a Delta table transaction log
Scale	Handles millions of files efficiently, no metadata degradation	Degrades when file count exceeds ~100K across tracked directories
Schema evolution	Supports automatic schema inference and evolution	Schema must be explicitly defined or specified
Streaming support	Native streaming source, processes files as they arrive	Batch only, not a streaming source
Best for	High-volume, continuously arriving Bronze ingestion	Smaller controlled loads, one-time migrations, structured batch loads
Complexity	Slightly more setup, requires stream configuration	Simpler syntax, SQL-native

The practical rule: for any Bronze ingestion pipeline that will accumulate more than 100,000 files over its lifetime or receives files continuously, use Auto Loader. As the Databricks schema evolution documentation confirms, Auto Loader supports schema evolution for connectors including cloud storage, while COPY INTO requires explicit schema definitions.

The moment your source starts adding fields without warning you, Auto Loader handles it; COPY INTO breaks.

The common mistake: starting with COPY INTO because the syntax is simpler, then migrating to Auto Loader twelve months later when the metadata tracking table has grown to a size that makes every run slow. Design for the file volume you will have in twelve months, not for the file volume you have today.

Schema Evolution: When It Works and When It Silently Corrupts Data

Schema evolution in Databricks does not apply globally. Each component in your pipeline handles schema changes independently, and this independence is what catches teams off guard.

As the Databricks schema evolution documentation defines it: schema evolution covers four component categories: connectors (Auto Loader, Kafka, Lakeflow connectors), format parsers (from_json, from_avro, from_xml), processing engines (Structured Streaming), and datasets (Delta tables, streaming tables, materialized views). Each must be configured separately.

The critical gotcha that almost no article covers: from_json does not support schema evolution. If your Bronze ingestion reads JSON strings and parses them with from_json, a new field added by the source will be silently dropped. The Bronze table will not fail. It will not warn you. It will simply not have the new column. The downstream Silver transformation will not have it either. Three weeks later, an analyst asks why a field is missing and the debug process starts.

The Four Schema Change Scenarios and What Each Does

Source Change	Component	Configured?	Result
New column added	Auto Loader	`cloudFiles.schemaEvolutionMode = addNewColumns`	Column added to Bronze, existing rows get null
New column added	`from_json` parser	Not configurable	Column silently dropped
Column type widened (INT to LONG)	Delta table	`mergeSchema = true`	Type widened, existing data cast
Column renamed	Any component	No automatic handling	Old column persists, new column added separately

The practical approach: configure schema evolution explicitly at each component rather than assuming it propagates. For Lakeflow Spark Declarative Pipelines, type widening (INT to LONG, FLOAT to DOUBLE) is now supported natively as of the 2026 SDP release notes, without requiring a full pipeline reset. For renamed columns, there is no automatic handler. You need a transformation rule or a source-side agreement.

Track schema changes in a log table or version-controlled schema file. As Gema Correa's field engineering guide on schema evolution learned the hard way: when a column gets added to a production Delta table, views do not update automatically. Add a spark.catalog.refreshTable() call as a downstream task after schema evolution events, or stale views will serve incorrect results for days.

Transformation Code Patterns That Scale vs Patterns That Degrade

The transformation layer is where most performance problems originate. The root cause is almost always the same: Python code doing work that SQL or native Spark functions could do faster.

SQL and Native Functions Beat Python UDFs at Scale

Python UDFs run row by row. Spark serialises each row into Python, processes it, and deserialises the result back to JVM. At 100 million rows this is slow. At 10 billion rows it is the bottleneck.

Native Spark SQL functions and built-in PySpark functions use vectorised columnar execution via the Photon engine. They process entire column batches in a single CPU operation.

As the Databricks community ETL optimisation thread states directly: SQL with Delta and the Photon execution engine is generally faster than row-by-row Python processing. Window functions, joins, and aggregations outperform custom Python logic. Use vectorised UDFs only when a transformation genuinely cannot be expressed in SQL.

The specific pattern to avoid:

# Slow: row-level Python UDF
@udf(returnType=StringType())
def clean_phone(phone):
    return re.sub(r'\D', '', phone) if phone else None

df = df.withColumn("phone_clean", clean_phone(df.phone))

The faster equivalent:

SELECT regexp_replace(phone, '[^0-9]', '') AS phone_clean FROM bronze.events

The SQL version runs through Photon's vectorised engine. The UDF version does not.

Broadcast Joins Eliminate Shuffle for Small Dimension Tables

A shuffle join between a 50-billion-row fact table and a 10,000-row dimension table sends every row of the fact table through a network shuffle to find its matching dimension row. Spark has to redistribute the fact table across the cluster to co-locate matching keys.

A broadcast join sends the small dimension table to every executor once. The fact table never moves. For joins where one table fits in memory (under 10MB by default, configurable), broadcasting it eliminates the most expensive operation in a join-heavy transformation pipeline.

from pyspark.sql.functions import broadcast

df_result = df_events.join(
    broadcast(df_customers),
    on="customer_id",
    how="left"
)

Spark's Adaptive Query Execution (AQE) will broadcast small tables automatically when it can detect their size at runtime. But AQE cannot broadcast a table it does not know is small until the query runs. Explicit broadcast hints eliminate the decision latency.

Data Skew: How to Detect It and What to Do

Data skew is the performance problem that looks like a hardware problem. A Spark job runs 47 tasks in 90 seconds, then stalls on the 48th task for 20 minutes. The cluster shows 49 executors idle and 1 at 100% CPU. The job eventually completes or runs out of memory.

This is skew. One partition received a disproportionate share of the data. Everything else finished and is waiting for that one partition.

Detecting Skew in the Spark UI

In the Spark UI, navigate to the Stages tab for the stalled job. Look at the partition size histogram under the Tasks section. If one or two tasks show data sizes 10x or 100x larger than the median, that is skew.

The source is almost always a join or group-by on a column where one value represents a large fraction of the dataset. A customer_id join where one customer is a test account with 200 million synthetic events. A date groupBy where all historical data gets routed to one partition because of a default null handling.

Fixing Skew with Salting

Salting adds a random suffix to the skewed key before grouping, splits the computation across multiple partitions, then aggregates the results.

import pyspark.sql.functions as F

# Salt the skewed key into 10 buckets
df_salted = df.withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"), (F.rand() * 10).cast("int"))
)

# Aggregate on the salted key
df_agg = df_salted.groupBy("customer_id_salted").agg(F.sum("revenue"))

# Strip the salt and re-aggregate
df_result = df_agg.withColumn(
    "customer_id",
    F.split(F.col("customer_id_salted"), "_")[0]
).groupBy("customer_id").agg(F.sum("sum(revenue)").alias("total_revenue"))

As Unravel's data skew guide explains: Spark processes data in parallel by dividing it into partitions. When skew concentrates large amounts of data in few partitions, those partitions become bottlenecks that throttle overall query performance. Salting distributes the skewed key artificially across partitions, trading one large partition for many equal ones.

For tables stored in Delta Lake, liquid clustering on the join key also reduces skew by physically co-locating related data at write time, so the join reads less data from fewer files.

Compute Configuration: Sizing Jobs Without Overpaying

Compute configuration is where teams either over-provision by default or under-provision until something breaks.

Three decisions matter most:

Serverless vs classic compute for your workload type. Serverless is the right default for Lakeflow Pipelines and SQL workloads that run in under two hours. For long-running batch jobs over two hours or jobs requiring GPU instances, classic job clusters are more cost-predictable. A nightly ETL job that runs for 3 hours at a sustained high-memory workload costs less on a fixed classic cluster than on serverless, which bills per DBU consumed continuously.

Adaptive Query Execution enabled by default. AQE is on by default in Databricks Runtime 7.0 and above. It handles partition coalescing (merging small post-shuffle partitions), skew join optimisation (splitting skewed partitions automatically), and broadcast join conversion at runtime. Teams that set spark.sql.adaptive.enabled = false in job configurations to "reduce overhead" are disabling the single most impactful automatic optimisation in Spark. Leave it on.

Shuffle partition count matched to data volume. The default shuffle partition count is 200. A 50GB transformation with 200 shuffle partitions produces 250MB partitions, which is reasonable. A 5TB transformation with 200 shuffle partitions produces 25GB partitions. Each task tries to process 25GB in memory and spills to disk. The job slows by 5x to 10x. Set spark.sql.shuffle.partitions to approximately 2x to 4x your cluster's total core count, then tune from there based on Spark UI task duration distribution.

The Pipeline Design Checklist Before Going to Production

Run through these before promoting any ETL pipeline from development to production.

Ingestion layer:

Auto Loader used for file-based Bronze ingestion with schema evolution configured
from_json calls replaced or documented with explicit schema definitions
Schema change handling tested with a source sample that includes an extra column

Transformation layer:

Python UDFs replaced with SQL equivalents or vectorised Pandas UDFs where possible
Broadcast joins used for all dimension tables under 100MB
AQE confirmed enabled (not explicitly disabled in job config)
Shuffle partition count set based on expected data volume, not left at default 200

Performance validation:

Pipeline tested on a data volume at least 10x larger than current production size
Spark UI reviewed for skewed tasks in any stage involving joins or group-bys
Job run duration benchmarked at 3x expected data growth for the next 12 months

Operational readiness:

Expectations defined in Lakeflow Pipeline for Silver layer key fields
Event log monitoring configured with pass rate trending, not just job success/failure
Pipeline deployed via Declarative Automation Bundle, not manually configured in UI

What This Series Covers Next

How to Build Production-Grade Data Pipelines on Databricks covers the operational layer: monitoring, alerting, error handling, recovery strategies, and what a production runbook looks like for a Databricks pipeline that runs 24/7.
Incremental Loads, CDC, and Change Data Feed in Delta Lake covers the incremental processing patterns that keep transformation pipelines efficient as data volumes grow, including AUTO CDC, SCD patterns, and how Change Data Feed eliminates full-table scans.
Data Quality and Reliability Patterns for Databricks Pipelines covers how to design quality rules that catch problems at tier boundaries before they reach Gold layer consumers.

Krunal Kanojiya

Technical Content Writer

Facing a Challenge? Let's Talk.

Whether it's AI, data engineering, or commerce tell us what's not working yet. Our team will respond within 1 business day.

Designing Scalable ETL Pipelines on Databricks