The Hidden Pitfalls of Snowflake to Databricks Migrations

L+ Editorial
Feb 01, 2026


The call almost always starts the same way. A Director of Engineering or a VP of Data, energized from a series of vendor presentations, lays out the vision. "We're moving from Snowflake to Databricks. We want to unify our analytics and AI, leverage open source, and get our costs under control. It should be a straightforward migration—SQL is SQL, right?"

I've been on the other end of that call more times than I can count. And every time, I take a deep breath. Because I know what’s coming. I've led these migrations. I’ve lived in the trenches, wrestling with obscure error messages at 2 AM, explaining budget overruns to steering committees, and untangling years of business logic embedded in proprietary SQL functions.

The truth is, migrating from Snowflake to Databricks is not a simple "lift and shift." It is a fundamental paradigm shift. It’s moving from a highly managed, SQL-centric data warehouse-as-a-service to a flexible, powerful, code-first data lakehouse platform. It's like trading a luxury automatic sedan for a manual-transmission rally car. Both will get you there, but one assumes you know how to work the clutch, manage the engine, and navigate the terrain yourself. If you don't, you'll stall, or worse, end up in a ditch.

This isn’t a theoretical exercise. This is a collection of scars and lessons learned from real-world, production migrations. We're going to bypass the marketing slides and talk about what actually breaks, what gets underestimated, and how you can navigate the journey without derailing your data strategy.

The First and Biggest Misconception: "Lift and Shift" is a Lie

Before we dive into a single line of code, we have to address the strategic blunder that poisons so many of these projects from the start. The C-suite hears "cloud data platform migration" and thinks it's like moving from one apartment to another. The reality is that you're redesigning your entire house.

Snowflake is a masterpiece of abstraction. It's a SQL data warehouse that brilliantly hides the complexity of the underlying infrastructure. You pick a T-shirt size for your compute (your "virtual warehouse"), and it just works. It handles concurrency, caching, and storage optimization largely behind the scenes. Its primary interface is SQL.

Databricks, built on top of Apache Spark, is fundamentally different. It's a platform that gives you the tools to build your own high-performance data engine. Its native language is not just SQL, but code—primarily Python and Scala. You are given direct control over the compute clusters, the file formats (Delta Lake), the partitioning, and the optimization processes.

I often use this analogy with my clients:

  • Snowflake is a hydraulic press. You put something in, press a button (run a SQL query), and you get a predictable, powerful result. You don't need to know about the hydraulics, the pressure valves, or the electrical schematics.
  • Databricks is a Swiss Army knife with a power drill attachment. It can do what the hydraulic press does, but it can also screw, saw, open a can, and perform a thousand other tasks. But you need to select the right tool for the job, hold it correctly, and know when to change the battery.

The "lift and shift" lie assumes you can just move your SQL workloads over and everything will work the same, only cheaper or with added AI/ML capabilities. This is a catastrophic miscalculation that leads to blown budgets, missed deadlines, and frustrated teams. The successful path is not migration; it's modernization. You are re-architecting, not just relocating.


The SQL Translation Minefield: More Than Just Dialect Differences

This is where the rubber meets the road, and where most teams get their first big, unwelcome surprise. The belief that you can just find-and-replace a few function names is dangerously naive. We’re not talking about changing GETDATE() to CURRENT_TIMESTAMP(). The chasm is far deeper.

Stored Procedures: The Embedded Brain of Your Warehouse

On my first major Snowflake-to-Databricks project, we hit a wall that nearly sank us. The client had over 300 Snowflake stored procedures. These weren't simple multi-statement SQL scripts. They were complex orchestrators, written in Snowflake Scripting (Snowflake's procedural SQL dialect), that handled everything from incremental data loading and error handling to dynamic reporting and access control management. They used variables, cursors, loops, and EXECUTE IMMEDIATE to build and run SQL on the fly.

The Pitfall: Databricks has no direct, one-to-one equivalent of a Snowflake stored procedure. Recent runtimes have begun adding SQL scripting support, but there is still nothing that gives you the same level of procedural logic, state management, and dynamic execution within a single, atomic unit.

Simply running the SQL statements inside a procedure one-by-one won't work because you lose the control flow, the variable handling, and the transactional integrity.

The Solution: Re-architecting into Notebooks and Python

We had to fundamentally rethink how this logic was executed. The solution was to translate each stored procedure into a Databricks notebook, typically using Python.

A simple Snowflake procedure might look like this:

    -- Snowflake Stored Procedure Example
    CREATE OR REPLACE PROCEDURE P_LOAD_AGG_SALES(source_table STRING, target_table STRING)
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
        row_count INT;
    BEGIN
        -- Truncate and load the aggregate table
        TRUNCATE TABLE IDENTIFIER(:target_table);

        INSERT INTO IDENTIFIER(:target_table) (region, total_sales, sale_date)
        SELECT region, SUM(sales_amount), CURRENT_DATE()
        FROM IDENTIFIER(:source_table)
        GROUP BY region;

        -- Get the number of rows inserted
        SELECT COUNT(*) INTO :row_count FROM IDENTIFIER(:target_table);

        RETURN 'Successfully loaded ' || :row_count || ' rows.';
    END;
    $$;

    -- Calling the procedure
    CALL P_LOAD_AGG_SALES('RAW_SALES', 'AGG_SALES_DAILY');

The Databricks equivalent is not a SQL script. It’s a parameterized Python function within a notebook.

    # Databricks Notebook (Python) Equivalent
    def load_agg_sales(source_table, target_table):
      """
      This function mimics the Snowflake stored procedure by running
      the aggregation logic using Spark SQL.
      """
      print(f"Loading data from {source_table} to {target_table}")

      # Use Spark SQL to perform the transformation
      # Note the use of f-strings to parameterize table names
      spark.sql(f"""
        INSERT OVERWRITE TABLE {target_table}
        SELECT
          region,
          SUM(sales_amount) as total_sales,
          current_date() as sale_date
        FROM {source_table}
        GROUP BY region
      """)

      # Get the row count and provide feedback
      row_count = spark.table(target_table).count()
      print(f"Successfully loaded {row_count} rows.")
      return row_count

    # Calling the function (e.g., from another cell or a workflow job)
    load_agg_sales("raw_sales", "agg_sales_daily")

This looks simple, but imagine this for a 500-line procedure with nested loops, complex error handling (try/catch blocks), and dynamic SQL generation. It's a full-scale development effort, not a simple translation. We had to build a "procedure migration factory," dedicating developers to this task for weeks.
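To give a flavor of that effort, here is a minimal sketch of how a procedure's dynamic SQL and EXCEPTION handling map onto Python. The function names are hypothetical; the key idea is that Snowflake's string-built EXECUTE IMMEDIATE becomes ordinary string assembly, and the EXCEPTION block becomes a try/except.

```python
def build_agg_sql(source_table, target_table):
    """Assemble the aggregation statement as a string, mirroring how
    Snowflake procedures build SQL for EXECUTE IMMEDIATE."""
    return (
        f"INSERT OVERWRITE TABLE {target_table} "
        f"SELECT region, SUM(sales_amount) AS total_sales, current_date() AS sale_date "
        f"FROM {source_table} GROUP BY region"
    )

def load_agg_sales_safe(spark, source_table, target_table):
    """Run the dynamic SQL with error handling, standing in for a
    Snowflake BEGIN ... EXCEPTION ... END block."""
    try:
        spark.sql(build_agg_sql(source_table, target_table))
        return spark.table(target_table).count()
    except Exception as exc:
        # In Snowflake this would be the EXCEPTION branch; here we log and
        # re-raise so the surrounding Databricks Job marks the run as failed.
        print(f"Load of {target_table} failed: {exc}")
        raise
```

Separating the SQL-building function from the runner also makes the translated logic unit-testable, which the original stored procedure never was.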

UDFs (User-Defined Functions): The Silent Performance Killers

Both platforms support UDFs, but how they work under the hood is worlds apart. In Snowflake, JavaScript, Java, and Python UDFs are often used for string manipulation, complex parsing, or business logic that's awkward in pure SQL. They are generally performant.

The Pitfall: In Databricks, a standard Python or Scala UDF is often a performance anti-pattern. When you apply a UDF to a DataFrame column, Spark has to:
1. Serialize the data from its efficient, internal binary format into a format Python can understand.
2. Send this data from the JVM (where Spark runs) to a separate Python process.
3. Execute your Python function row by row.
4. Serialize the results back and send them to the JVM.

This serialization overhead is immense and kills the parallel processing power of Spark. A query that ran in 30 seconds in Snowflake can take 30 minutes in Databricks if it relies heavily on a poorly implemented UDF.

The Solution: Prioritize Native Spark Functions

Our mantra became: "If there's a built-in Spark function for it, use it." The Spark SQL function library is massive. We had to train our team to stop thinking in terms of writing custom functions and start thinking in terms of composing native ones.

For example, a client had a Snowflake JavaScript UDF to parse URL query strings to extract a specific parameter.

    // Snowflake JavaScript UDF
    create or replace function get_url_param(url string, param_name string)
      returns string
      language javascript
    as
    $$
      // JS UDF arguments are exposed as uppercase variables (URL, PARAM_NAME),
      // and the sandbox has no browser URL/URLSearchParams API, so parse with a regex.
      var match = URL.match(new RegExp('[?&]' + PARAM_NAME + '=([^&#]*)'));
      return match ? decodeURIComponent(match[1]) : null;
    $$;

    -- Usage in Snowflake
    SELECT get_url_param(customer_url, 'utm_source') FROM web_logs;

Migrating this naively to a Python UDF in Databricks would be slow. The correct, high-performance approach is to use native Spark functions.

    # Databricks native function equivalent
    from pyspark.sql.functions import expr

    # No UDF needed!
    df = spark.table("web_logs")
    result_df = df.withColumn(
        "utm_source",
        expr("parse_url(customer_url, 'QUERY', 'utm_source')")
    )
    result_df.display()

This requires a deeper knowledge of the Spark API but results in orders-of-magnitude better performance. We built a repository of common UDF patterns and their native Spark equivalents to accelerate this process.
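A stripped-down sketch of what such a repository can look like (the pattern names and expression templates are our own conventions, not a standard API):

```python
# A tiny pattern repository mapping common Snowflake UDF jobs to native
# Spark SQL expressions. Pattern names and templates are illustrative.
NATIVE_EQUIVALENTS = {
    "url_param":    "parse_url({col}, 'QUERY', '{param}')",
    "zero_if_null": "coalesce({col}, 0)",
    "json_field":   "get_json_object({col}, '$.{param}')",
}

def native_expr(pattern, **kwargs):
    """Render a native Spark SQL expression for a known UDF pattern."""
    return NATIVE_EQUIVALENTS[pattern].format(**kwargs)

# Usage in a DataFrame transform:
#   df.withColumn("utm_source",
#                 expr(native_expr("url_param", col="customer_url",
#                                  param="utm_source")))
```

Even a small lookup table like this keeps engineers from reflexively writing a Python UDF when a native expression already exists.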

The Long Tail of Proprietary Functions and Syntax

Beyond procedures and UDFs lies a minefield of Snowflake-specific SQL functions and syntax with no direct equivalent. Automated translation tools can catch some of these, but many require manually re-working the logic.

Here are some of the most common culprits I’ve battled:

QUALIFY
What it does: Filters the results of a window function in the same step. It's incredibly concise and powerful for tasks like "get the latest record per customer."
Challenge: Open-source Spark SQL has no QUALIFY (recent Databricks runtimes do support it, but translation tools and older clusters often don't). You must fall back to a subquery or a Common Table Expression (CTE).
Solution: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) AS rn FROM my_table) WHERE rn = 1. This adds verbosity and requires restructuring the query.

CONNECT BY / START WITH
What it does: Queries hierarchical data (e.g., employee-manager relationships, parts explosions).
Challenge: This is a classic Oracle feature that Snowflake adopted. Spark has no direct equivalent.
Solution: This requires a significant rewrite, often using recursive CTEs (WITH RECURSIVE ..., available only on newer Spark versions) or iterative self-joins. For very deep or complex hierarchies, we sometimes recommend pre-processing the hierarchy into a flattened structure or using GraphFrames.

FLATTEN
What it does: Un-nests semi-structured data (JSON, ARRAYs) into a relational format. It's a very intuitive table function.
Challenge: The syntax is completely different.
Solution: Use the explode() or inline() function in Databricks. For nested JSON, you often have to chain dot notation and explode() calls, e.g., SELECT explode(my_array) ... or SELECT my_struct.* .... It's functionally equivalent but syntactically a complete rewrite.

:: (Casting)
What it does: Snowflake's elegant shorthand for casting data types (e.g., my_col::INT).
Challenge: A simple syntax difference, but it's everywhere.
Solution: A good candidate for automated replacement. Replace ::TYPE with CAST(column AS TYPE). A global regex search-and-replace is your friend here.

GENERATOR(ROWCOUNT => N)
What it does: A table function that generates N rows of data, often used for scaffolding date dimensions or for testing.
Challenge: No direct equivalent table function.
Solution: Use spark.range(N) to create a DataFrame with a single column of sequential numbers. For more complex sequences like dates, you can use sequence(start, stop, step).

ZEROIFNULL()
What it does: A convenient shorthand for COALESCE(column, 0).
Challenge: Small but pervasive.
Solution: Easy to replace with the standard COALESCE() or IFNULL() functions. Another good target for automation.

Automated tools can handle about 60-70% of these syntax changes. The remaining 30-40%—the CONNECT BY and complex QUALIFY logic—are what consume 80% of your SQL migration effort.


The Operational Abyss: When "Managed Service" Magic Fades

Once you’ve wrestled your SQL into submission, you enter the next circle of migration hell: operations. This is where the cozy, predictable world of Snowflake gives way to the configurable, hands-on world of Databricks. Teams that are unprepared for this shift in responsibility will see costs spiral and performance plummet.

Compute Management: From Virtual Warehouses to Cluster Configurations

In Snowflake, managing compute is laughably easy. You create a Virtual Warehouse, pick a size (X-Small, Small, Large, etc.), and set an auto-suspend policy. That’s it. Snowflake handles the rest.

The Pitfall: In Databricks, you are the fleet commander. You define clusters. This means you have to make decisions about:
* Node Types: Do you need compute-optimized, memory-optimized, or storage-optimized VMs? What generation?
* Driver vs. Worker Nodes: How big should your driver be? How many workers do you need?
* Auto-Scaling: What's the minimum and maximum number of workers?
* Spot vs. On-Demand Instances: Do you want to save money with spot instances and risk interruptions, or pay a premium for reliability?
* Cluster Policies: How do you enforce standards and control costs across dozens of teams?
* Job Clusters vs. All-Purpose Clusters: This is a critical one. All-Purpose clusters are for interactive development in notebooks and are more expensive. Job clusters are for automated, scheduled workloads, are cheaper, and spin up/down for a specific run.

I saw one team migrate their ETL pipelines and run them all on a massive, 24/7 All-Purpose cluster. Their first month's bill was three times what they were paying in Snowflake. Why? Because they treated the Databricks cluster like a persistent Snowflake warehouse. The correct pattern was to schedule each pipeline as a separate Databricks Job, which would use a transient, appropriately-sized Job Cluster that terminated upon completion. This simple architectural mistake cost them tens of thousands of dollars before we caught it.

The Solution: Develop a Cluster Strategy and Use Policies

Before going live, you need a documented strategy for compute.
1. Define T-Shirt Sizes: Create your own "T-shirt sizes" using Databricks Cluster Policies. For example, a job-small policy might define a cluster with 2-4 workers of a certain instance type. This gives your users the simplicity of Snowflake's model while maintaining governance.
2. Mandate Job Clusters for Production: Enforce a rule that no production ETL runs on an All-Purpose cluster. All scheduled pipelines must be deployed as Databricks Jobs.
3. Monitor, Monitor, Monitor: Use the Databricks cost and performance monitoring tools (and your cloud provider's tools) religiously. Identify oversized clusters and underutilized resources.
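To make the first step concrete, here is a hedged sketch of what a "job-small" cluster policy could look like, expressed as the JSON payload you would register via the Cluster Policies API or UI. The instance types and worker counts are placeholder values, not recommendations.

```python
import json

# Illustrative "job-small" T-shirt size as a Databricks cluster policy.
# Each policy element constrains one cluster attribute.
job_small_policy = {
    # Restrict this policy to job clusters only.
    "cluster_type": {"type": "fixed", "value": "job"},
    # Allow a short list of approved instance types (placeholders).
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Pin the autoscaling floor and cap the ceiling.
    "autoscale.min_workers": {"type": "fixed", "value": 2},
    "autoscale.max_workers": {"type": "range", "maxValue": 4},
}

# The JSON definition to register:
print(json.dumps(job_small_policy, indent=2))
```

Users then pick "job-small" the way they once picked a Snowflake warehouse size, while the platform team keeps control of the cost envelope.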

Security & Governance: Unity Catalog is Powerful, But Not a Magic Wand

Snowflake's Role-Based Access Control (RBAC) is mature, integrated, and well-understood. You create roles, grant privileges on database objects to those roles, and assign roles to users. It's straightforward.

Databricks' answer to this is Unity Catalog (UC). UC is a fantastic and necessary evolution, providing a centralized governance layer for data and AI across your entire workspace. It offers fine-grained access control, data lineage, and auditing.

The Pitfall: Migrating teams often treat Unity Catalog as an afterthought. They start by migrating their data into the legacy Hive Metastore, which is scoped to a single workspace and has a clunky, less-secure permissions model. They think they’ll "add UC later." This is a huge mistake.

Once data and code are tied to the Hive Metastore, migrating to Unity Catalog later is a painful process. You have to run CREATE TABLE ... AS SELECT or other migration scripts, update all your notebooks and jobs to use the new three-level namespace (catalog.schema.table), and re-apply all your security grants. You essentially end up doing the migration twice.

The Solution: Governance First, Data Second

  1. Set Up Unity Catalog Before Day One: Before you migrate a single table, your Unity Catalog metastore must be configured, attached to your workspaces, and your governance model defined.
  2. Define Your Namespace: Decide on your catalog and schema structure. Will you have a prod and dev catalog? Will each business unit get its own catalog? Settle this before data lands.
  3. Automate Permissions: Use Databricks Terraform provider or SCIM provisioning to manage users, groups, and grants as code. Manually clicking through the UI to grant permissions is not scalable or auditable.

Data Ingestion: The Deceptive Simplicity of COPY INTO

In Snowflake, getting data from a cloud storage bucket into a table is incredibly easy. COPY INTO my_table FROM @my_stage; is a simple, powerful command. For continuous ingestion, Snowpipe automates this process.

The Pitfall: The direct equivalent in Databricks, COPY INTO, exists for Delta tables, but the more robust, scalable, and idiomatic solution for continuous ingestion is Auto Loader. Auto Loader is a Structured Streaming source that can incrementally and efficiently process new files as they land in cloud storage. However, it’s not a single SQL command. It's a configured data pipeline.

You need to write code (PySpark or Spark SQL) that defines the stream:
* The source location (e.g., s3://my-bucket/landing/)
* The data format (JSON, CSV, Parquet)
* Schema inference and evolution options
* A checkpoint location (to track which files have been processed)
* The target Delta table

A team that has hundreds of COPY jobs in Snowflake can be shocked by the effort required to re-implement them as resilient Auto Loader streams. They underestimate the development work needed to build, test, and deploy these pipelines.

The Solution: Build a Reusable Ingestion Framework

Don't write a unique Auto Loader stream for every single data source. We built a generic, metadata-driven ingestion framework.
* A configuration table or file defined the source path, target table, data format, and other options for each data feed.
* A single, parameterized Databricks notebook read this configuration and dynamically started the appropriate Auto Loader stream.

This "framework" approach turned a multi-week development effort for each new pipeline into a 15-minute configuration task.
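A condensed sketch of that framework's core, assuming one config entry per feed. The field names and helper functions are illustrative conventions of ours, not part of Auto Loader itself.

```python
# One row of the metadata table describing a feed (fields are illustrative).
example_feed = {
    "source_path": "s3://my-bucket/landing/orders/",
    "target_table": "bronze.orders",
    "format": "json",
    "checkpoint_path": "s3://my-bucket/_checkpoints/orders",
}

def build_autoloader_options(feed):
    """Translate one config entry into Auto Loader reader options."""
    return {
        "cloudFiles.format": feed["format"],
        "cloudFiles.schemaLocation": feed["checkpoint_path"] + "/_schema",
        "cloudFiles.inferColumnTypes": "true",
    }

def start_ingestion(spark, feed):
    """Start the Auto Loader stream described by one config entry."""
    (spark.readStream
        .format("cloudFiles")
        .options(**build_autoloader_options(feed))
        .load(feed["source_path"])
        .writeStream
        .option("checkpointLocation", feed["checkpoint_path"])
        .trigger(availableNow=True)  # process new files incrementally, then stop
        .toTable(feed["target_table"]))
```

The driver notebook simply loops over the config table and calls start_ingestion for each enabled feed, so onboarding a new source means adding a row, not writing a pipeline.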


The Performance Tuning Paradigm Shift: You Are Now the DBA

In Snowflake, performance tuning is largely about choosing the right warehouse size and, for very large tables, defining a CLUSTER BY key to co-locate similar data. The query optimizer and underlying micro-partition management do most of the heavy lifting.

In Databricks, you have more control, which means you have more responsibility. Migrating data and queries as-is and expecting them to be fast is a recipe for disappointment.

Goodbye CLUSTER BY, Hello OPTIMIZE and Z-ORDER

Delta Lake, the storage format at the heart of the Databricks Lakehouse, stores your data as Parquet files in your cloud storage. Over time, as you perform many small writes (e.g., from a streaming job or frequent MERGE operations), you can end up with thousands of small files. Querying a table with too many small files is slow, as Spark has to open and read metadata from all of them.

The Pitfall: Teams migrate their data and forget about file layout management. Their queries, which were fast on a well-managed Snowflake table, become progressively slower on Databricks.

The Solution: Implement a Data Compaction and Clustering Strategy

  1. OPTIMIZE: This command compacts small files into larger, more optimal ones. You should run OPTIMIZE regularly on tables that receive frequent writes. We typically schedule a weekly or daily job that runs OPTIMIZE on all key production tables.
    -- Run this regularly!
    OPTIMIZE my_delta_table;

  2. Z-ORDER: This is the closest equivalent to Snowflake's CLUSTER BY. Z-ORDER is a technique that co-locates related information in the same set of files. If you frequently filter your queries on a set of columns (e.g., WHERE date = '...' AND region = '...'), you should Z-Order by them. This allows the query engine to skip massive amounts of data, dramatically improving performance.
    -- Z-Ordering by common filter columns
    OPTIMIZE my_delta_table ZORDER BY (date, region);

Unlike Snowflake’s automatic clustering, OPTIMIZE and Z-ORDER are commands you must run yourself. It’s an active maintenance task that is non-negotiable for good performance in Databricks. We bake this into our production ETL orchestration from day one.
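We typically drive this maintenance from a small scheduled job; here is a sketch under the assumption of a hand-maintained plan of tables and their common filter columns (all names are illustrative).

```python
# Tables and their common filter columns (illustrative).
MAINTENANCE_PLAN = {
    "sales.fact_orders": ["order_date", "region"],
    "sales.dim_customer": [],  # compact only, no Z-Ordering
}

def optimize_statement(table, zorder_cols=None):
    """Build the OPTIMIZE (and optional ZORDER BY) statement for a table."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return stmt

def run_maintenance(spark):
    """Compact every table in the plan; schedule this as a nightly Databricks Job."""
    for table, cols in MAINTENANCE_PLAN.items():
        spark.sql(optimize_statement(table, cols))
```

Keeping the plan in a plain data structure (or a config table) means adding a new table to the maintenance rotation is a one-line change.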


Tactical Solutions and Automation: How We Actually Get This Done

Talking about pitfalls is easy. Navigating them requires a plan. Here are the battle-tested frameworks and automation scripts my teams use to make these migrations successful.

The Triage and Prioritization Framework

You cannot migrate everything at once. We use a simple 2x2 matrix to triage every single data pipeline, report, and stored procedure.

  • Quick Wins (low complexity, low business value): Migrate these first. Build momentum and prove the platform's value. Simple batch ETL, basic reporting tables.
  • Crown Jewels (low complexity, high business value): These are your most important, high-visibility workloads. Dedicate your best engineers here. The success of the project hinges on these.
  • Deprecate or Defer (high complexity, low business value): Question whether these are even needed. Can they be retired? If not, save them for last. These are the time sinks.
  • Strategic Modernization (high complexity, high business value): Complex but critical, e.g., intricate stored procedures and ML-heavy pipelines. Don't "lift and shift." This is your chance to re-architect them properly using modern Databricks patterns.

This framework transforms an overwhelming list of "things to migrate" into a phased, strategic roadmap.

Building Your Own SQL Translation Accelerator

Commercial tools for SQL translation exist, but they are never 100% accurate and can be expensive. We found that building a lightweight, targeted "translation accelerator" in-house gave us the best results.

Here’s the process:
1. Extract: Use Snowflake's QUERY_HISTORY view to dump all SQL queries run over the last 90 days. This gives you a real-world corpus of the code you need to convert.
2. Analyze & Pattern Match: Use Python and a library like sqlglot or even just powerful regex to parse the queries. We build scripts to specifically hunt for the pitfalls mentioned earlier: QUALIFY, CONNECT BY, ::, proprietary functions, etc.
3. Automate the Easy Stuff: The script performs automated replacements for simple syntax like casting (:: to CAST) and function names (ZEROIFNULL to COALESCE). It also rewrites QUALIFY clauses into the standard ROW_NUMBER() subquery pattern.
4. Flag for Manual Review: When the script encounters something it can't handle (like a stored procedure call or a CONNECT BY), it doesn't try to be clever. It flags the query and the pattern it found, and outputs it to a "manual review" file, complete with a link to the original query text.

This 80/20 approach—automating the repetitive syntax changes and flagging the complex logic for human experts—is far more effective than trying to build a perfect, all-encompassing translator.
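A minimal sketch of that accelerator's core loop. The rule set shown is a tiny subset, and unlike our production script this sketch simply flags QUALIFY for manual review rather than rewriting it.

```python
import re

# Mechanical rewrites the script applies automatically.
REWRITE_RULES = [
    # my_col::INT  ->  CAST(my_col AS INT)
    (re.compile(r"(\w+)::(\w+)"), r"CAST(\1 AS \2)"),
    # ZEROIFNULL(x)  ->  COALESCE(x, 0)
    (re.compile(r"ZEROIFNULL\(([^)]+)\)", re.IGNORECASE), r"COALESCE(\1, 0)"),
]

# Patterns the script refuses to auto-translate and flags for humans.
MANUAL_REVIEW = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bCONNECT\s+BY\b", r"\bCALL\s+\w+", r"\bQUALIFY\b")
]

def translate(sql):
    """Return (translated_sql, flags); flags name patterns needing manual review."""
    flags = [p.pattern for p in MANUAL_REVIEW if p.search(sql)]
    for pattern, replacement in REWRITE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql, flags
```

Fed with the QUERY_HISTORY dump, a script like this churns through thousands of statements in seconds and leaves your experts a short, prioritized review queue.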

The Non-Negotiable Data Validation Framework

How do you prove to the business that the data in Databricks is identical to the data in Snowflake? You can't just "eyeball it." You need an automated, rigorous validation framework.

Our process is simple but effective:
1. For every table migrated, run a validation query against both the source (Snowflake) and the target (Databricks).
2. The query computes a set of control metrics:
* Total row count (COUNT(*)).
* Checksums on key string/date columns (MD5(CONCAT(col1, col2))).
* Summations on all numeric columns (SUM(metric1), SUM(metric2)).
3. A Python script executes these queries on both platforms via their respective connectors, fetches the results, and compares them.
4. The script outputs a report: SUCCESS if all metrics match, or FAILURE with a diff of the mismatched values.
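The comparison at the heart of step 3 is simple dictionary diffing with a tolerance for floating-point drift; a minimal sketch (metric names are illustrative):

```python
def compare_metrics(snowflake_metrics, databricks_metrics, float_tolerance=1e-6):
    """Compare control metrics from both platforms; return (status, mismatches)."""
    mismatches = {}
    for name, expected in snowflake_metrics.items():
        actual = databricks_metrics.get(name)
        if isinstance(expected, float) and isinstance(actual, float):
            # Relative tolerance absorbs benign floating-point differences.
            ok = abs(expected - actual) <= float_tolerance * max(1.0, abs(expected))
        else:
            ok = expected == actual
        if not ok:
            mismatches[name] = (expected, actual)
    status = "SUCCESS" if not mismatches else "FAILURE"
    return status, mismatches
```

The explicit float tolerance matters in practice: SUM() over billions of rows rarely matches to the last bit across two engines, and without it the report drowns in false alarms.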

This automated validation is our safety net. It runs after every single data migration job. It builds trust with business stakeholders and allows us to catch data integrity issues, floating-point precision differences, or character encoding problems instantly.

Lessons from the Trenches: My Migration Best Practices

If I could distill all the painful lessons into a short list of guiding principles, it would be this:

  1. Don't Migrate, Modernize. Reject the "lift and shift" mentality from day one. This is an opportunity to fix technical debt and adopt better patterns, not just change the name on the invoice.
  2. Invest in a Meaningful Pilot. Choose a single, high-value data pipeline (a "Crown Jewel" from the matrix) and migrate it end-to-end. This will expose the real challenges and build your team's muscle memory before you commit to the full-scale program.
  3. Governance First. Set up Unity Catalog, define your security model, and establish your cluster policies before you migrate your first table. Retrofitting governance is 10x harder.
  4. Build a Center of Excellence (CoE). You need a small, dedicated team of experts who live and breathe Databricks. They set the best practices, build the reusable frameworks (ingestion, validation), and train the rest of the organization. Don't expect 100 data engineers to become Databricks experts overnight.
  5. Embrace a Code-First Culture. Your team needs to get comfortable with Python, notebooks, and Git. SQL is still critical, but the orchestration, automation, and advanced transformations will happen in code. Invest in training.
  6. Automate Everything You Can. Automate SQL translation, data validation, infrastructure deployment (Terraform), and permissions management. Manual, repetitive tasks are where errors and inconsistencies creep in.
  7. Plan for Performance Tuning. Assume your queries will be slow at first. Budget time and resources for post-migration performance tuning: implementing OPTIMIZE and Z-ORDER, refactoring UDFs, and right-sizing clusters. It's not a failure; it's a required part of the process.

The Final Word

A Snowflake to Databricks migration is a complex and challenging endeavor. It is fraught with technical, operational, and cultural pitfalls that can easily derail a project that looks simple on paper. The journey from a managed SQL warehouse to a code-first lakehouse requires more than just technical skill; it requires a shift in mindset.

But when done correctly—with a clear-eyed understanding of the challenges and a strategy rooted in modernization rather than a simple lift-and-shift—the results are transformative. You unlock a unified platform for data and AI, gain control over your stack with open formats, and position your organization to innovate in ways a traditional data warehouse simply cannot support. The path is difficult, but for those willing to navigate it with their eyes open, the destination is worth the journey.
