How to Optimize Databricks Clusters After Snowflake Migration

You’ve done it. The big migration is over. The terabytes of data that once lived in Snowflake are now accessible from Databricks, and the legacy SQL scripts have been painstakingly translated into Spark notebooks. The project is marked "complete," and the team celebrates a successful "lift and shift."

Then the first cloud bill arrives. And the first on-call rotation begins.

The jobs that were supposed to be faster are crawling. The clusters, configured with what seemed like ample resources, are throwing out-of-memory errors. The cost, which was meant to be a key driver for the move, is spiraling. This isn't the data and AI nirvana you were promised.

This is the moment the real work begins. The "lift and shift" was just the prologue. The true optimization of your Databricks environment after a Snowflake migration is not about tweaking a few settings; it's about fundamentally rewiring your team's thinking from a warehouse-centric model to a distributed-compute, Lakehouse-native approach.

The Great Paradigm Mismatch: Why Your Snowflake Instincts Fail You in Databricks

The root of post-migration pain is almost always a mismatch in core concepts. In the Snowflake world, you optimize by interacting with a managed, black-box service.

  • Problem: A query is slow.
  • Snowflake Solution: Scale up the virtual warehouse (from an X-Small to a Small, or a Medium to a Large). Maybe you add a clustering key. You are essentially paying for a bigger, faster black box.

In the Databricks world, you are in direct control of the engine itself. You are not renting a bigger box; you are designing the assembly line.

  • Problem: A job is slow.
  • Databricks Solution: Is the driver node a bottleneck? Are the worker nodes memory-constrained or CPU-constrained? Is there a data skew in a partition? Is the shuffle spilling to disk? Are you using an inefficient UDF? Are you reading too much data from the source?

Applying the Snowflake mindset to Databricks is like trying to fix a custom-built race car by asking for a "bigger engine" instead of looking at the fuel injectors, the gear ratios, and the aerodynamics. It leads to oversized, overpriced clusters that are brute-forcing a problem that requires finesse.

The journey to optimization starts by unlearning old habits and embracing the granular control Databricks provides. We'll break this down into three phases: the immediate quick wins, the essential re-architecture, and a real-world case study to tie it all together.

Phase 1: The Low-Hanging Fruit – Immediate Post-Migration Triage

Before you start rewriting entire pipelines, there are significant gains to be had by addressing the most common configuration mistakes made during a "lift and shift." These are the changes you can make this week to see an immediate impact on performance and cost.

Right-Sizing Your Clusters: Precision Over Power

The most common mistake is picking a large, general-purpose node type (like AWS's m5d.8xlarge or Azure's Standard_D32as_v5) for everything, assuming bigger is always better. This is a direct translation of the "scale up the warehouse" mentality.

Instead, you need to match the hardware to the workload. Databricks workloads generally fall into three categories, and your node selection should reflect that.

| Workload Type | Characteristics | Recommended Instance Families (AWS/Azure) | Why It Matters |
|---|---|---|---|
| Compute-Intensive | Heavy use of CPU-bound UDFs, complex string manipulations, ML model training, parsing raw text/JSON. | AWS: c5/c6 series; Azure: F-series | These instances have a higher ratio of CPU cores to memory. Paying for excess memory you don't use is pure waste. A compute-optimized instance can have the same number of cores as a general-purpose one but cost 15-20% less. |
| Memory-Intensive | Large, complex joins; shuffles with high cardinality keys; aggregations over wide windows; caching large tables. | AWS: r5/r6 series; Azure: E-series | These instances provide more RAM per vCPU. If your Spark UI shows significant data spillage to disk or long garbage collection (GC) pauses, it's a clear sign you're memory-starved. More memory allows Spark to keep intermediate data in RAM, avoiding costly disk I/O. |
| General Purpose | A balanced mix of CPU and memory usage. Standard ETL, simple transformations, and ad-hoc queries. | AWS: m5/m6 series; Azure: D-series | This is a safe starting point, but you should actively monitor your workloads to see if you can specialize. Sticking with general purpose everywhere is leaving money on the table. |

Practical Steps:

  1. Look at the Spark UI: Go to the "Executors" tab for a completed job. Look at the "Shuffle Read/Write" and "Spilled to Disk" columns. High disk spill means you need more memory. Look at the "Task Time (GC Time)" column. High GC time also points to memory pressure.
  2. Use Ganglia or Databricks Metrics: Check the CPU and memory utilization charts. Are your CPUs pegged at 100% while memory is sitting at 50%? Switch to a compute-optimized instance. Is memory constantly high while CPUs are idle? Go memory-optimized.
  3. Don't Forget the Driver: The driver node collects data from workers (e.g., using .collect()), holds broadcast variables, and orchestrates the job. If you have a join with a small-but-not-tiny table or you're collecting a large result set back to the notebook, the driver can become a bottleneck. It's often wise to give the driver a memory-optimized instance type, even if the workers are compute-optimized.
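The triage steps above can be condensed into a simple rule of thumb. This is a pure-Python sketch; the specific thresholds (85% CPU, 10% GC time, and so on) are illustrative assumptions, not official Databricks guidance, so tune them against your own workloads.

```python
# Illustrative triage heuristic based on the Spark UI / metrics signals above.
# All thresholds are assumptions for demonstration purposes.

def recommend_instance_family(cpu_util: float, mem_util: float,
                              disk_spill_gb: float, gc_time_pct: float) -> str:
    """Suggest an instance family from coarse utilization metrics (0.0 to 1.0)."""
    # Any disk spill, long GC pauses, or near-full memory => memory pressure
    memory_starved = disk_spill_gb > 0 or gc_time_pct > 0.10 or mem_util > 0.85
    # CPUs pegged while memory sits mostly idle => CPU-bound workload
    cpu_bound = cpu_util > 0.85 and mem_util < 0.60
    if memory_starved:
        return "memory-optimized (e.g. AWS r5/r6, Azure E-series)"
    if cpu_bound:
        return "compute-optimized (e.g. AWS c5/c6, Azure F-series)"
    return "general purpose (e.g. AWS m5/m6, Azure D-series)"

print(recommend_instance_family(0.95, 0.50, 0.0, 0.02))  # compute-optimized
```

The point is not the exact numbers but the decision order: rule out memory starvation first, because an out-of-memory job fails outright, while a CPU-bound job merely runs slower.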

Enabling Photon: The "Free Speed" Button

Photon is Databricks' native vectorized query engine, rewritten in C++ to take advantage of modern CPU architecture. For workloads that are primarily Spark SQL or DataFrame API calls, enabling Photon is often the single most impactful change you can make.

Think of it this way: The standard Spark engine is written in Scala and runs on the JVM. The JVM has overheads like garbage collection and just-in-time compilation. Photon bypasses much of this, operating closer to the metal.

How to Enable It: When you configure your cluster, there is a simple checkbox: Use Photon Acceleration.

The Impact: I've seen standard ETL jobs that were bottlenecked on wide transformations (joins and aggregations) get a 2x to 4x performance boost just by ticking this box. The cost for Photon-enabled clusters (DBUs) is slightly higher, but since the job finishes much faster, the total cost is almost always lower.

  • Before Photon: A 1-hour job on a 10-node cluster = 10 node-hours.
  • After Photon: The same job runs in 20 minutes = ~3.3 node-hours.
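Putting numbers on that trade-off: the sketch below uses a normalized node-hour rate of 1.0 and assumes a ~20% DBU premium for Photon (the premium is an illustrative assumption; check your actual DBU rates).

```python
# Back-of-envelope cost comparison for the scenario above.
# The 20% Photon DBU premium is an assumption for illustration only.

nodes = 10
baseline_hours = 1.0            # job runtime without Photon
photon_hours = 20 / 60          # same job with Photon enabled
cost_per_node_hour = 1.0        # normalized baseline rate
photon_premium = 1.2            # assumed ~20% higher DBU rate

baseline_cost = nodes * baseline_hours * cost_per_node_hour
photon_cost = nodes * photon_hours * cost_per_node_hour * photon_premium

print(f"baseline: {baseline_cost:.1f}, photon: {photon_cost:.1f}")
# baseline: 10.0, photon: 4.0  -> ~60% cheaper despite the higher rate
```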

Even with a higher DBU rate, the total cost is drastically reduced. There is virtually no good reason to have Photon disabled for new data engineering workloads.

Taming the Snowflake Connector: Stop Pulling the Entire Ocean

When your code runs spark.read.format("snowflake").option("dbtable", "HUGE_TRANSACTION_TABLE").load(), you are asking Databricks to pull the entire table over the network from Snowflake into the Spark cluster's memory. This is the cardinal sin of post-migration workloads.

The Snowflake connector is smart, but it's not a magician. It supports predicate pushdown, which means it can translate Spark .filter() or WHERE clauses into a WHERE clause on the Snowflake side. This is critical.

Bad Practice: Filtering in Spark

    # Pulls the ENTIRE sales table from Snowflake first, then filters in Spark.
    # This is incredibly slow and expensive.
    # Snowflake connection options (URL, credentials, etc.) omitted for brevity.

    df_sales = spark.read.format("snowflake") \
      .option("dbtable", "PROD.SALES") \
      .load()

    # Filter happens AFTER the data is in Databricks memory
    filtered_df = df_sales.filter("sale_date >= '2023-10-01'") 

    filtered_df.show()

Look at the Spark physical plan for this job using filtered_df.explain(). You will see a Filter operation above the SnowflakeScan operation. This means Spark is doing the work.

Good Practice: Pushing the Filter to Snowflake

    # The filter is applied BEFORE loading the data.
    # Snowflake only sends the required subset of data over the network.
    # Connection options are again omitted for brevity.

    df_sales = spark.read.format("snowflake") \
      .option("dbtable", "PROD.SALES") \
      .load() \
      .filter("sale_date >= '2023-10-01'") # Spark translates this into a pushdown query

    # --- OR, for more explicit control ---

    query = """
    SELECT * 
    FROM PROD.SALES 
    WHERE sale_date >= '2023-10-01'
    """

    df_sales_from_query = spark.read.format("snowflake") \
      .option("query", query) \
      .load()

    df_sales_from_query.show()

Now, the physical plan will show the filter predicate inside the SnowflakeScan's PushedFilters section. The data transfer is minimized, and the initial scan is orders of magnitude faster. Using the query option is often the most foolproof way to ensure you're only pulling exactly what you need.

Phase 2: Rearchitecting for the Lakehouse

The quick wins will stop the bleeding, but to truly thrive, you must stop treating Databricks as just a compute layer for Snowflake. You need to embrace a Lakehouse architecture, where the center of your data universe becomes Delta Lake, not your Snowflake warehouse.

This means changing the flow of data. Instead of Snowflake -> Databricks (process) -> Snowflake, the modern pattern is Source -> Databricks (ingest to Delta) -> Databricks (process on Delta) -> Serving Layer (which could be Snowflake, or a Databricks SQL Warehouse).

The Medallion Architecture: Your New Blueprint

The Medallion Architecture is a simple, powerful pattern for structuring data in your lakehouse. It organizes data into three distinct quality zones: Bronze, Silver, and Gold.

  • Bronze (Raw): This is your landing zone. Data is ingested from source systems (like your Snowflake tables) into Delta tables with minimal transformation. The structure mirrors the source, and data is often stored as-is (e.g., raw JSON strings, full table dumps). The goal here is fast, cheap ingestion and archival.

    • Migration Context: Your first step is to create a set of jobs that efficiently dump the necessary source tables from Snowflake into Bronze Delta tables on a regular schedule. You read from Snowflake once per day/hour, not every time a job runs.
  • Silver (Cleansed & Conformed): This is where the real work happens. Data from Bronze is cleaned, de-duplicated, joined, and modeled into a more queryable, enterprise view. You might join sales transactions with product and store dimensions, cleanse addresses, and cast data to proper types. These tables form the backbone of your data platform.

    • Migration Context: The logic that used to run against live Snowflake tables now runs against your Bronze Delta tables and produces Silver Delta tables. All heavy lifting—the complex joins and transformations—happens entirely within Databricks, reading from and writing to Delta Lake. This is a pure Spark-on-Delta operation, which is incredibly fast and scalable.
  • Gold (Aggregated & Business-Ready): These tables are purpose-built for your business intelligence and analytics use cases. They are typically highly aggregated, denormalized, and designed for speed. Think of weekly sales summaries by region, customer lifetime value models, or feature tables for machine learning.

    • Migration Context: Instead of writing your final aggregated output back to Snowflake in every job, you write it to a Gold Delta table. From here, you can serve it directly to BI tools via a Databricks SQL Warehouse, or you can have a single, final, efficient job that loads this clean, aggregated Gold data into Snowflake for consumption by legacy tools that haven't been migrated yet.

This shift dramatically reduces your dependency on the Snowflake connector and network transfer, and it places the compute right next to the data.

Delta Lake is Your New Storage Engine: Use Its Superpowers

By moving your processing to Delta Lake, you gain a set of features that replace (and often improve upon) what you were used to in a traditional data warehouse. Ignoring these is like driving a Tesla but only ever using the radio.

1. OPTIMIZE and Z-ORDER: The Key to Fast Queries

In Snowflake, micro-partitioning and clustering are largely automatic. In Delta Lake, you have explicit control via the OPTIMIZE command.

  • OPTIMIZE: This command compacts small files into larger, more optimally sized ones (defaulting to 1GB). Running OPTIMIZE table_name is crucial for tables that receive many small writes, as querying thousands of tiny files is extremely inefficient for Spark.
  • Z-ORDER: This is the secret weapon. Z-Ordering is a multi-dimensional clustering technique. When you Z-ORDER BY (colA, colB), Delta co-locates related data across those columns within the same set of files. When a query comes in with a WHERE clause on colA and/or colB, Databricks can use data skipping to ignore a massive number of files.

Example: Imagine a Silver sales table with billions of rows.

    -- Run this after your daily ETL write to the silver table
    OPTIMIZE delta.`/path/to/silver/sales`
    ZORDER BY (store_id, product_category);

Now, a query like this:

    SELECT sum(sale_amount) 
    FROM delta.`/path/to/silver/sales`
    WHERE store_id = 123 AND product_category = 'Electronics';

...will be lightning fast. Instead of scanning all the data, the query engine knows exactly which few files contain data for store 123 and 'Electronics' and can skip reading the rest. The Z-ORDER keys should be the columns you filter on most frequently.
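The data-skipping mechanism is easy to simulate. Delta keeps per-file min/max statistics; at query time, any file whose ranges cannot contain the predicate values is never read. Here is a toy model in pure Python (the file names and statistics are made up for illustration):

```python
# Toy model of Delta data skipping: each file carries min/max stats per column,
# and a point-lookup predicate prunes every file whose range can't contain it.

files = [
    {"path": "part-000", "store_id": (1, 100),   "category": ("Apparel", "Food")},
    {"path": "part-001", "store_id": (101, 200), "category": ("Electronics", "Garden")},
    {"path": "part-002", "store_id": (101, 200), "category": ("Apparel", "Books")},
    {"path": "part-003", "store_id": (201, 300), "category": ("Electronics", "Toys")},
]

def files_to_scan(store_id: int, category: str) -> list:
    """Keep only files whose min/max ranges could contain the predicate values."""
    return [
        f["path"] for f in files
        if f["store_id"][0] <= store_id <= f["store_id"][1]
        and f["category"][0] <= category <= f["category"][1]
    ]

print(files_to_scan(123, "Electronics"))  # ['part-001'] -- 3 of 4 files skipped
```

Z-Ordering is what makes these per-file ranges tight across multiple columns at once; without it, most files would have ranges wide enough to match almost any predicate, and nothing could be skipped.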

2. Time Travel: Instant Rollbacks and Audits

Snowflake has Time Travel, and so does Delta Lake. Every operation on a Delta table creates a new version. You can query or restore the table to any previous version or timestamp.

    -- Query the table as it was yesterday
    SELECT * FROM my_delta_table TIMESTAMP AS OF '2023-10-25 18:00:00';

    -- Or query by version number
    SELECT * FROM my_delta_table VERSION AS OF 12;

    -- Restore the table to a previous version
    RESTORE TABLE my_delta_table TO VERSION AS OF 12;

This is invaluable for debugging bad data writes or for creating auditable, point-in-time reports without needing to manage complex snapshot logic yourself.

Refactoring Your Code: From SQL Glue to Spark Native

Lifting and shifting SQL logic often means it ends up as a massive string inside a spark.sql() call. While this works, it misses many of Spark's most powerful optimization opportunities.

1. The UDF (User-Defined Function) Performance Trap

When you can't find a built-in function, the temptation is to write a UDF in Python.

    # A seemingly innocent Python UDF
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    def get_initials(name: str) -> str:
      if not name:
        return None
      parts = name.split()
      return parts[0][0] + parts[-1][0] if len(parts) > 1 else parts[0][0]

    initials_udf = udf(get_initials, StringType())

    # Using it in a DataFrame
    df_with_initials = df.withColumn("initials", initials_udf(col("full_name")))

This is a performance killer. Here's why:

  • Black Box: Spark's Catalyst Optimizer cannot "see" inside the Python UDF. It can't optimize or rearrange it.
  • Serialization Overhead: For every single row, data has to be serialized from Spark's internal format, sent to a Python interpreter, processed, and then serialized back. This is incredibly expensive.

The Solution: Always prefer built-in functions. The same logic can almost always be replicated.

    from pyspark.sql.functions import split, col, substring, when, size, concat, element_at

    # The Spark-native way. It's more verbose but orders of magnitude faster.
    # element_at is 1-based; element_at(col, -1) returns the last element.
    df_with_initials_native = df.withColumn("name_parts", split(col("full_name"), " ")) \
      .withColumn(
        "initials",
        when(size(col("name_parts")) > 1,
             concat(substring(element_at(col("name_parts"), 1), 1, 1),
                    substring(element_at(col("name_parts"), -1), 1, 1)))
        .otherwise(substring(element_at(col("name_parts"), 1), 1, 1))
      )

The difference is staggering. On a large dataset, the UDF version might take an hour; the native version could take two minutes. If you absolutely must use custom Python logic, use Pandas UDFs (Vectorized UDFs), which process data in Apache Arrow batches and dramatically reduce the serialization overhead.
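To see what "vectorized" means here, this is the same initials logic written against a pandas Series, which is essentially the body a Pandas UDF executes on each Arrow batch. It is a sketch assuming space-separated names; in Spark you would wrap it with pyspark's pandas_udf, as noted in the comment.

```python
import pandas as pd

def get_initials_batch(names: pd.Series) -> pd.Series:
    """Vectorized initials: operates on a whole batch of names at once."""
    parts = names.str.split()
    first = parts.str[0].str[0]    # first letter of the first word
    last = parts.str[-1].str[0]    # first letter of the last word
    # Two-letter initials only when there is more than one word
    return (first + last).where(parts.str.len() > 1, first)

# In Spark, this function would be registered as a Pandas UDF, e.g.:
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import StringType
#   initials_pudf = pandas_udf(get_initials_batch, StringType())

print(get_initials_batch(pd.Series(["Ada Lovelace", "Plato"])).tolist())  # ['AL', 'P']
```

One serialization round-trip per batch instead of one per row is where the speedup comes from.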

2. Understanding and Minimizing Shuffles

A "shuffle" is the most expensive operation in Spark. It is the process of redistributing data across the nodes in the cluster, and it happens during joins, aggregations (GROUP BY), and certain window functions. A shuffle involves writing data to disk, sending it over the network, and reading it back in.

Your goal as a Databricks optimizer is to become a "Shuffle Hunter."

  • Use the Spark UI: The "Stages" tab is your best friend. Look for stages with huge "Shuffle Read/Write" metrics. These are your bottlenecks.
  • Broadcast Joins: If you are joining a large table with a small table (e.g., a fact table with a dimension table), you want to avoid a shuffle. A broadcast join sends a copy of the entire small table to every node in the cluster. The join then happens locally on each node, completely avoiding a shuffle.
    • Spark tries to do this automatically for tables under a certain size (configurable by spark.sql.autoBroadcastJoinThreshold, default 10MB).
    • You can explicitly hint it: large_df.join(broadcast(small_df), "join_key").
    • Verify it's happening by looking at the query plan for "BroadcastHashJoin" instead of "SortMergeJoin".
  • Partitioning/Z-Ordering: As discussed, properly partitioning and Z-Ordering your Delta tables can pre-sort and pre-organize your data. This can sometimes allow Spark to avoid a shuffle entirely if the data is already co-located by the join or group-by key.
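A quick way to build intuition for the broadcast trade-off: a sort-merge join moves roughly both tables across the network (each row travels to the node that owns its join key), while a broadcast join ships one full copy of the small table to every worker. The estimator below is a deliberately simplified model for intuition only; real plans depend on partitioning, AQE, and compression.

```python
# Rough estimator of network bytes moved: sort-merge shuffle vs broadcast join.
# Simplified model for intuition; real behavior varies with partitioning and AQE.

def shuffle_bytes(large_gb: float, small_gb: float) -> float:
    # Both sides are repartitioned by the join key across the network.
    return large_gb + small_gb

def broadcast_bytes(small_gb: float, num_workers: int) -> float:
    # One full copy of the small table is sent to every worker node.
    return small_gb * num_workers

large, small, workers = 500.0, 0.05, 20   # 500 GB fact table, 50 MB dimension
print(shuffle_bytes(large, small))        # ~500 GB moved by the shuffle
print(broadcast_bytes(small, workers))    # ~1 GB moved -> broadcast wins easily
```

The crossover point is when `small_gb * num_workers` approaches `large_gb + small_gb`, which is why broadcasting only makes sense for genuinely small tables.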

A Real-World Case Study: "RetailCorp's" Post-Migration Nightmare

Let's make this concrete. "RetailCorp" migrated their daily sales reporting pipeline from a series of Snowflake stored procedures to a single, large Databricks notebook.

The "Before" State:

  • Pipeline: A Python notebook scheduled to run once daily.
  • Logic:
    1. Reads three large tables from Snowflake: SALES (500M rows/day), PRODUCTS (1M rows), STORES (5k rows).
    2. Uses spark.sql to join the three tables.
    3. Performs multiple aggregations to calculate sales by region, product category, and promotional campaign.
    4. Uses a Python UDF to parse a complex product_description field.
    5. Writes the final 100k-row summary table back to Snowflake.
  • Cluster: A 20-node "general purpose" (m5d.4xlarge) auto-scaling cluster.
  • The Problems:
    • Runtime: The job took 6 hours to complete, often missing the business SLA.
    • Cost: The cluster cost was over $300 per run.
    • Reliability: The job failed randomly with "ExecutorLostFailure" or "OutOfMemoryError" about twice a week.

The Optimization Journey:

We tackled this using the phased approach.

Phase 1: Triage and Quick Wins (First Week)

  1. Analysis:
    • Spark UI: We saw massive shuffle writes (~10 TB) during the join stage. The executors showed high memory usage and frequent, long GC pauses.
    • Query Plan: A df.explain() showed that the entire 500M row SALES table was being pulled from Snowflake. The PRODUCTS and STORES joins were SortMergeJoin, indicating a full shuffle. The UDF was a black box UDF node.
  2. Immediate Actions:
    • Enable Photon: Ticked the "Use Photon Acceleration" box on the cluster.
    • Pushdown Read: Changed the initial SALES read from option("dbtable", "SALES") to an option("query", ...) that only selected the columns needed and filtered for the previous day's sales on the Snowflake side.
    • Change Instance Type: Switched the worker nodes from general purpose (m5d) to memory-optimized (r5d) to handle the join and aggregation memory pressure.
    • Broadcast the STORES join: Added a broadcast() hint around the STORES DataFrame in the join.

Results of Phase 1:

  • Runtime: Down from 6 hours to 2.5 hours.
  • Cost: Down to ~$150 per run.
  • Reliability: Failures became much less frequent as the memory pressure was reduced.

This was a huge win, but 2.5 hours was still too long, and the architecture was still flawed.

Phase 2: Rearchitecting for the Lakehouse (Next Month)

  1. Analysis: The main bottleneck was still the huge join between the daily sales data and the product dimension, all happening in memory over and over again. The network transfer from Snowflake, while reduced, was still significant.
  2. Architectural Changes:
    • Implement Medallion:
      • Bronze: Created a new, separate Databricks job that runs once daily to copy the previous day's SALES data and any changes from PRODUCTS and STORES from Snowflake into raw Bronze Delta tables. This job was simple and fast.
      • Silver: The main reporting notebook was refactored. It no longer touches Snowflake at all for its inputs. It reads from the Bronze SALES, PRODUCTS, and STORES Delta tables. It performs all the joins and cleaning and writes the output to a clean, conformed Silver daily_sales_enriched table.
      • Gold: A final, smaller notebook reads from the Silver table to perform the business-level aggregations, creating a Gold regional_sales_summary table.
    • Delta Optimizations:
      • The Silver daily_sales_enriched table was partitioned by sale_date.
      • We ran a daily OPTIMIZE and ZORDER BY (region_id, product_category) on the Silver table after the write.
    • Code Refactoring:
      • The expensive Python UDF for parsing the product description was painstakingly rewritten using pyspark.sql.functions like split, regexp_extract, and when/otherwise.
    • Final Output: The final Gold table was served directly to the company's BI tool using a Databricks SQL Warehouse. A separate, small job was created to write this final summary back to Snowflake for legacy systems, ensuring the main pipeline was no longer dependent on Snowflake's availability.

The "After" State - Final Results:

| Metric | Before Optimization | After Phase 1 | After Phase 2 (Final) |
|---|---|---|---|
| Runtime | 6 hours | 2.5 hours | 25 minutes |
| Cost per Run | ~$300 | ~$150 | ~$20 |
| Reliability | Fails ~2x/week | Fails <1x/month | Highly stable |
| Architecture | Brittle, Monolithic | Improved, but still Snowflake-dependent | Lakehouse-native, Decoupled, Idempotent |

The results were transformative. The cost reduction was over 90%. The pipeline now finished hours ahead of the SLA. The new architecture was more modular, easier to debug, and allowed for new use cases (like ad-hoc queries on the Silver data) that were previously impossible.

Continuous Improvement: Governance and Monitoring

Optimization is not a one-time event. Your data and workloads will change. Building a culture of continuous improvement is key.

  • Use Cluster Policies: Define a set of T-shirt sized cluster policies (e.g., "Small ETL," "Large ML Training") that enforce best practices like enabling Photon, setting auto-termination, and restricting instance types. This prevents developers from spinning up monstrously expensive, un-optimized clusters.
  • Tag Everything: Tag clusters and jobs with the team name, project, and cost center. This makes it possible to track down which workloads are driving your cloud spend.
  • Databricks SQL Warehouses: For your BI and analytics users who were used to Snowflake's simplicity, guide them towards Databricks SQL Warehouses. These are fully managed, serverless compute resources specifically optimized for SQL performance, offering an experience very similar to a Snowflake warehouse but running directly on your Delta Lake data.
  • Unity Catalog: As your Lakehouse grows, managing data access and lineage becomes critical. Unity Catalog provides a centralized governance solution for all your data and AI assets across workspaces. It's the key to building a secure, well-governed Lakehouse that can scale.
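To make the cluster-policy point concrete, here is a sketch of what a "Small ETL" policy might look like. The key/type vocabulary (fixed, range, allowlist) follows the Databricks cluster-policy schema, but treat the specific attributes and limits as illustrative assumptions to adapt for your workspace.

```python
import json

# Hypothetical "Small ETL" cluster policy. Values are illustrative; the
# fixed/range/allowlist vocabulary follows the Databricks policy schema.
small_etl_policy = {
    # Force Photon on every cluster created from this policy
    "runtime_engine": {"type": "fixed", "value": "PHOTON"},
    # Require auto-termination, capped at one hour of idle time
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    # Restrict developers to a vetted set of instance types
    "node_type_id": {
        "type": "allowlist",
        "values": ["m5.xlarge", "m5.2xlarge", "r5.xlarge"],
    },
    # Keep small ETL clusters small
    "num_workers": {"type": "range", "maxValue": 8},
    # Enforce cost attribution via a mandatory tag
    "custom_tags.cost_center": {"type": "fixed", "value": "data-eng"},
}

print(json.dumps(small_etl_policy, indent=2))
```

A handful of policies like this, one per T-shirt size, is usually enough to stop the worst cost surprises without slowing developers down.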

Moving from Snowflake to Databricks is more than a technical migration; it's a cultural and architectural shift. By leaving behind the warehouse mindset and embracing the granular, code-first power of a true Lakehouse, you can unlock the performance, cost savings, and flexibility that motivated the move in the first place. The journey requires effort, but as RetailCorp's story shows, the rewards are well worth it.