How to Automate the Migration from Snowflake to Databricks?

Automating Your Snowflake to Databricks Migration — The Real Story

The first time a client asked me to lead a large-scale migration from Snowflake to Databricks, the initial request felt deceptively simple. "We want to consolidate on the lakehouse for cost savings and to bring our BI and AI workloads together. Can you move our stuff over?" "Our stuff" turned out to be thousands of tables, hundreds of complex stored procedures, a tangled web of user roles, and a mission-critical ETL pipeline orchestrated with Snowflake Tasks and Streams.

My team and I quickly realized that a manual, object-by-object migration would be a multi-year nightmare doomed to fail. It would be slow, riddled with human error, and impossible to validate. We had no choice: we had to build a migration factory. We had to automate.

This isn't a theoretical guide. This is my playbook, forged across multiple production migrations. I’m going to walk you through how we built an automation engine to discover, convert, validate, and deploy a massive Snowflake estate to the Databricks Lakehouse Platform. I’ll share the code we wrote, the "gotchas" that bit us, and the hard-won lessons that now form the core of my migration methodology.

The Unassailable Case for Automation

Before we dive into code, let's be crystal clear about why you can't just wing this. A modern data warehouse isn't a simple database; it's a living ecosystem.

  • The Scale Is Deceptive: You might think you have 500 tables. But what about the 3,000 views, 200 stored procedures, 50 external stages, 80 resource-hungry tasks, and the undocumented Python scripts hitting the API? The sheer volume of objects is always 2-3x what stakeholders believe it is.
  • The Hidden Dependencies: The most dangerous part of any migration is the invisible web of dependencies. A view is built on another view, which joins five tables, one of which is populated by a stored procedure called by a Snowflake Task, which is triggered by a Stream on a raw landing table. A manual approach cannot safely untangle this.
  • The Need for Repeatability: You will not get this right on the first try. You will migrate a business unit, test, find issues, and need to re-migrate. Automation provides the consistency and speed to iterate. A manual process is a one-way ticket to chaos, with each iteration introducing new, unique errors.

Our automation strategy rests on four pillars:
1. Discovery: Creating a complete, data-driven inventory of every single object in Snowflake.
2. Analysis: Understanding usage patterns, data lineage, and object dependencies to prioritize and plan the migration waves.
3. Conversion: Automating the translation of code (SQL, Stored Procedures) and schema from Snowflake's dialect to Databricks' dialect.
4. Validation: Programmatically verifying that the data and outputs in Databricks match the source in Snowflake.

Let's build the engine.

Building Your Migration Automation Toolkit in Python

We chose Python for our automation toolkit due to its excellent data handling libraries (Pandas), database connectivity, and general-purpose scripting capabilities. Here are the core components we built.

Step 1: Automated Inventory and Discovery

You cannot migrate what you don't know exists. The first step is to create a comprehensive catalog of every object in your Snowflake environment. We did this by systematically querying Snowflake’s INFORMATION_SCHEMA.

Here’s a Python script that resembles the core of our discovery module. It connects to Snowflake and pulls metadata for various object types into Pandas DataFrames, which can then be saved to CSVs or a database for analysis.

    import snowflake.connector
    import pandas as pd
    from typing import Dict, List

    def get_snowflake_connection(user, password, account, warehouse, database):
        """Establishes a connection to Snowflake."""
        try:
            conn = snowflake.connector.connect(
                user=user,
                password=password,
                account=account,
                warehouse=warehouse,
                database=database
            )
            print("Successfully connected to Snowflake.")
            return conn
        except Exception as e:
            print(f"Error connecting to Snowflake: {e}")
            return None

    def extract_snowflake_inventory(conn) -> Dict[str, pd.DataFrame]:
        """
        Extracts a comprehensive inventory of Snowflake objects.
        Queries INFORMATION_SCHEMA for various object types.
        """
        if not conn:
            return {}

        inventory = {}
        cursor = conn.cursor()

        # List of objects to inventory and the queries to get them
        # In a real project, this list is much longer (stages, pipes, tasks, etc.)
        queries = {
            "databases": "SHOW DATABASES;",
            "schemas": "SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;",
            "tables": "SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE';",
            "views": "SELECT * FROM INFORMATION_SCHEMA.VIEWS;",
            "columns": "SELECT * FROM INFORMATION_SCHEMA.COLUMNS;",
            "procedures": "SELECT * FROM INFORMATION_SCHEMA.PROCEDURES;",
            "functions": "SELECT * FROM INFORMATION_SCHEMA.FUNCTIONS;",
            "pipes": "SELECT * FROM INFORMATION_SCHEMA.PIPES;",
            "tasks": "SELECT * FROM INFORMATION_SCHEMA.TASKS;",
            "stages": "SHOW STAGES IN ACCOUNT;",
            "file_formats": "SHOW FILE FORMATS IN ACCOUNT;"
        }

        for name, query in queries.items():
            print(f"Extracting {name}...")
            try:
                cursor.execute(query)
                # SHOW commands have a different way of fetching results
                if query.strip().upper().startswith("SHOW"):
                    df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
                else:
                    df = cursor.fetch_pandas_all()

                # Standardize column names to uppercase for consistency
                df.columns = [col.upper() for col in df.columns]
                inventory[name] = df
                print(f" -> Found {len(df)} {name}.")
            except Exception as e:
                print(f"Could not execute query for {name}. Error: {e}")

        cursor.close()
        return inventory

    # --- Example Usage ---
    # Replace with your actual credentials
    # For production, use environment variables or a secrets manager like Vault or AWS Secrets Manager
    SNOWFLAKE_USER = "your_user"
    SNOWFLAKE_PASSWORD = "your_password"
    SNOWFLAKE_ACCOUNT = "your_account_identifier"
    SNOWFLAKE_WAREHOUSE = "COMPUTE_WH"
    SNOWFLAKE_DATABASE = "your_db" # A default DB to connect to

    conn = get_snowflake_connection(
        SNOWFLAKE_USER, SNOWFLAKE_PASSWORD, SNOWFLAKE_ACCOUNT, SNOWFLAKE_WAREHOUSE, SNOWFLAKE_DATABASE
    )

    if conn:
        full_inventory = extract_snowflake_inventory(conn)
        conn.close()

        # Now you can work with the inventory
        if "tables" in full_inventory:
            print("\nSample of discovered tables:")
            print(full_inventory["tables"].head())

        # Save to CSV for analysis
        for name, df in full_inventory.items():
            df.to_csv(f"snowflake_inventory_{name}.csv", index=False)

This script gives us the raw materials: a complete list of every table, view, procedure, and more. This dataset is the single source of truth for our entire migration project.

Step 2: Usage and Dependency Analysis

An inventory is just a list. The real value comes from understanding how these objects interact. The most critical question is: Is this object even being used? Migrating obsolete objects is a complete waste of time and money.

Snowflake’s ACCOUNT_USAGE schema is a goldmine for this. Specifically, the QUERY_HISTORY and ACCESS_HISTORY views.

  • QUERY_HISTORY: Contains the text of every query executed. We can parse this to build a direct dependency graph (e.g., this query reads from Table_A and View_B and writes to Table_C).
  • ACCESS_HISTORY: Provides a pre-parsed log of which tables and columns were read from or written to by which queries. This is often easier and more reliable than parsing raw SQL text (a query sketch follows this list).
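
If you have access to the SNOWFLAKE.ACCOUNT_USAGE share, ACCESS_HISTORY can short-circuit a lot of regex work. Here is a hedged sketch of that extraction; the BASE_OBJECTS_ACCESSED and OBJECTS_MODIFIED columns come from Snowflake's documented view, but the flattening query and the extract_access_history helper are illustrative rather than our exact production code.

    import pandas as pd

    # Flatten ACCESS_HISTORY so each row is one (query, object, read/write) pair
    ACCESS_HISTORY_SQL = """
    SELECT
        ah.query_id,
        ah.query_start_time,
        obj.value:"objectName"::STRING AS object_name,
        'READ' AS access_type
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) obj
    WHERE ah.query_start_time >= DATEADD('day', -90, CURRENT_TIMESTAMP())
    UNION ALL
    SELECT
        ah.query_id,
        ah.query_start_time,
        obj.value:"objectName"::STRING AS object_name,
        'WRITE' AS access_type
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah,
         LATERAL FLATTEN(input => ah.objects_modified) obj
    WHERE ah.query_start_time >= DATEADD('day', -90, CURRENT_TIMESTAMP())
    """

    def extract_access_history(conn) -> pd.DataFrame:
        """Pulls a flattened 90-day read/write log into a DataFrame for dependency analysis."""
        cursor = conn.cursor()
        cursor.execute(ACCESS_HISTORY_SQL)
        df = cursor.fetch_pandas_all()
        cursor.close()
        return df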

Here’s a conceptual Python snippet showing how we approached parsing QUERY_HISTORY to find dependencies. This is a simplified example; a production version requires much more robust SQL parsing logic, often using libraries like sqlparse and extensive regular expressions.

    import pandas as pd
    import re
    from collections import defaultdict
    from typing import Dict

    def analyze_query_history(query_history_df: pd.DataFrame) -> Dict[str, dict]:
        """
        Analyzes a DataFrame of query history to build a dependency graph.
        Returns a dictionary mapping each object to its upstream and downstream dependencies.
        """
        # A simple regex to find tables after FROM or JOIN clauses.
        # In reality, this needs to handle CTEs, subqueries, aliases, etc.
        # This is a highly simplified illustration of the concept.
        source_pattern = re.compile(r'(?:FROM|JOIN)\s+([\w\."]+)', re.IGNORECASE)

        # Regex for target tables in INSERT, UPDATE, MERGE
        target_pattern = re.compile(r'(?:INTO|UPDATE|MERGE\s+INTO)\s+([\w\."]+)', re.IGNORECASE)

        # For each object: what it reads from (upstream) and what it feeds into (downstream)
        dependencies = defaultdict(lambda: {"reads_from": set(), "feeds_into": set()})
        lineage = defaultdict(lambda: {"sources": set(), "targets": set()})

        for index, row in query_history_df.iterrows():
            query_text = row['QUERY_TEXT']
            query_id = row['QUERY_ID']

            # Find all sources (read operations)
            sources = set(source_pattern.findall(query_text))

            # Find the primary target (write operation)
            target_match = target_pattern.search(query_text)
            targets = {target_match.group(1)} if target_match else set()

            if not sources and not targets:
                continue

            # Build the lineage graph for this query
            lineage[query_id]['sources'] = sources
            lineage[query_id]['targets'] = targets

            # Update the object-level dependency map
            for target_table in targets:
                for source_table in sources:
                    dependencies[target_table]['reads_from'].add(source_table)
                    dependencies[source_table]['feeds_into'].add(target_table)

        return {"object_dependencies": dependencies, "query_lineage": lineage}

    # --- Example Usage ---
    # Assume you've queried SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and loaded it into a DataFrame
    # query_history_df = pd.read_csv("query_history_export.csv")
    #
    # # Filter for relevant queries (e.g., in the last 90 days, successful, not run by you)
    # recent_queries = query_history_df[
    #     (query_history_df['EXECUTION_STATUS'] == 'SUCCESS') &
    #     (query_history_df['QUERY_TYPE'].isin(['INSERT', 'MERGE', 'UPDATE', 'CREATE_TABLE_AS_SELECT']))
    # ]
    #
    # dependency_analysis = analyze_query_history(recent_queries)
    #
    # # Now you can find all upstream dependencies for a critical report table
    # critical_table = "PROD_DB.SALES.F_SALES_DAILY"
    # if critical_table in dependency_analysis['object_dependencies']:
    #     upstream_tables = dependency_analysis['object_dependencies'][critical_table]['reads_from']
    #     print(f"Table {critical_table} is built from: {upstream_tables}")

This analysis allows us to group objects into logical "migration waves." We can visualize the DAG and identify self-contained units to migrate, test, and cut over, minimizing business disruption. It also identifies orphan objects that can be decommissioned.
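
To make the "migration waves" idea concrete, here is a simplified sketch of how the dependency map from analyze_query_history can be layered topologically. The propose_migration_waves helper is illustrative (our production version also folded in usage counts and business ownership) and assumes the dependencies dict produced by the previous snippet.

    def propose_migration_waves(dependencies: dict) -> list:
        """
        Groups objects into waves: wave 0 has no upstream dependencies, wave 1
        depends only on wave 0, and so on. Any cycle is dumped into a final
        wave for manual review.
        """
        # upstreams[obj] = set of objects obj reads from
        upstreams = {obj: set(info["reads_from"]) for obj, info in dependencies.items()}
        # Make sure every referenced object has an entry, even if it never appeared as a target
        for ups in list(upstreams.values()):
            for up in ups:
                upstreams.setdefault(up, set())

        waves = []
        remaining = dict(upstreams)
        while remaining:
            ready = [obj for obj, ups in remaining.items() if not (ups & remaining.keys())]
            if not ready:  # circular references (e.g. views referencing each other)
                waves.append(sorted(remaining))
                break
            waves.append(sorted(ready))
            for obj in ready:
                del remaining[obj]
        return waves

    # waves = propose_migration_waves(dependency_analysis["object_dependencies"])
    # for i, wave in enumerate(waves):
    #     print(f"Wave {i}: {len(wave)} objects")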

Step 3: Automated Code Conversion

This is where the magic—and the difficulty—lies. Snowflake SQL and Databricks SQL (based on Spark SQL) are similar, but the differences are numerous and subtle. Manually converting thousands of scripts is not feasible. We needed an automated transpiler.

We built a Python-based transpiler using a dictionary of regex-based replacement rules. It’s not a full-blown compiler, but it handles 70-80% of the common syntactic differences, flagging the rest for manual review.

| Feature/Syntax | Snowflake | Databricks (Spark SQL) | Conversion Strategy |
| --- | --- | --- | --- |
| Type casting | column::VARCHAR | CAST(column AS STRING) | Regex replacement |
| Analytic functions | QUALIFY ROW_NUMBER() OVER (...) = 1 | Subquery with ROW_NUMBER() | Wrap in a subquery: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM ...) WHERE rn = 1 |
| Set operator | MINUS | EXCEPT | Simple string replacement |
| Date truncation | DATE_TRUNC('day', ts) | DATE_TRUNC('DAY', ts) | Simple string replacement (or regex for case-insensitivity) |
| Semi-structured types | VARIANT, OBJECT, ARRAY | STRUCT, MAP, ARRAY | More complex; often requires schema inference and from_json |
| Stored procedures | JavaScript, Snowflake Scripting | SQL, Python, Scala | High manual effort: automate the boilerplate, flag logic for rewrite |
| Null handling | ZEROIFNULL(col) | COALESCE(col, 0) | Regex replacement |
| Hierarchical queries | CONNECT BY | Not supported | High manual effort: re-implement using recursive CTEs |

Here's a simplified version of our SQL conversion function:

    import re
    from typing import List, Tuple

    def convert_snowflake_to_databricks_sql(snowflake_sql: str) -> Tuple[str, List[str]]:
        """
        Performs a rule-based conversion of Snowflake SQL to Databricks SQL.
        Returns the converted SQL and a list of warnings for manual review.
        """
        converted_sql = snowflake_sql
        warnings = []

        # Rule-based replacements using regex
        # The order of rules can matter!
        conversion_rules = {
            # Casting ::
            r'(\b[\w\."]+\b)::([\w\(\),]+)': r'CAST(\1 AS \2)',
            # MINUS to EXCEPT
            r'\bMINUS\b': r'EXCEPT',
            # ZEROIFNULL to COALESCE
            r'ZEROIFNULL\((.*?)\)': r'COALESCE(\1, 0)',
            # IFF to IF
            r'IFF\((.*?),\s*(.*?),\s*(.*?)\)': r'IF(\1, \2, \3)',
        }

        # DataType mapping
        datatype_mappings = {
            r'\bVARCHAR\b': 'STRING',
            r'\bNUMBER\b': 'DECIMAL(38, 0)', # Be more specific if possible
            r'\bFLOAT\b': 'DOUBLE',
            r'\bTIMESTAMP_NTZ\b': 'TIMESTAMP',
        }

        # Combine mappings
        all_rules = {**conversion_rules, **datatype_mappings}

        for pattern, replacement in all_rules.items():
            converted_sql = re.sub(pattern, replacement, converted_sql, flags=re.IGNORECASE)

        # Handling more complex patterns like QUALIFY
        qualify_pattern = re.compile(r'\bQUALIFY\b(.*?)(?:\bORDER BY\b|\bLIMIT\b|$)', re.IGNORECASE | re.DOTALL)
        if qualify_pattern.search(converted_sql):
            warnings.append("QUALIFY clause found. Manual conversion to a subquery or window function is required.")
            # Placeholder for a more advanced transformation
            # For a real project, we'd parse the window function inside QUALIFY and build the subquery
            converted_sql = converted_sql.replace("QUALIFY", "-- MANUAL REVIEW REQUIRED: QUALIFY clause needs to be rewritten as a subquery with a window function.")

        # Flagging unsupported functions/features
        unsupported_keywords = ['CONNECT BY', 'MATCH_RECOGNIZE']
        for keyword in unsupported_keywords:
            if keyword in converted_sql.upper():
                warnings.append(f"Unsupported keyword '{keyword}' found. Manual rewrite is necessary.")

        return converted_sql, warnings

    # --- Example Usage ---
    snowflake_view_ddl = """
    CREATE OR REPLACE VIEW MY_DB.SALES.V_CUSTOMER_FIRST_ORDER AS
    SELECT
        CUSTOMER_ID,
        ORDER_ID,
        ORDER_DATE::DATE AS ORDER_DAY,
        ZEROIFNULL(REVENUE_USD) AS REVENUE
    FROM
        MY_DB.SALES.ORDERS
    QUALIFY ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE ASC) = 1;
    """

    converted_ddl, conversion_warnings = convert_snowflake_to_databricks_sql(snowflake_view_ddl)

    print("--- Converted DDL ---\n")
    print(converted_ddl)
    print("\n--- Conversion Warnings ---")
    for warning in conversion_warnings:
        print(f"- {warning}")

This script automates the tedious work and, more importantly, systematically identifies complex logic (QUALIFY, CONNECT BY) that requires a senior developer's attention.
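
For context, here is roughly the hand rewrite a reviewer would produce for the example view above. This is a sketch of the target output, not something the transpiler emits; the column names come straight from the example DDL.

    databricks_rewrite = """
    CREATE OR REPLACE VIEW MY_DB.SALES.V_CUSTOMER_FIRST_ORDER AS
    SELECT CUSTOMER_ID, ORDER_ID, ORDER_DAY, REVENUE
    FROM (
        SELECT
            CUSTOMER_ID,
            ORDER_ID,
            CAST(ORDER_DATE AS DATE) AS ORDER_DAY,
            COALESCE(REVENUE_USD, 0) AS REVENUE,
            ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE ASC) AS rn
        FROM MY_DB.SALES.ORDERS
    )
    WHERE rn = 1
    """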

Step 4: Generating a Migration Readiness Report

The output of these scripts is a mountain of data. To make it actionable for project managers and stakeholders, we synthesized it into a readiness report. This was a simple dashboard (we used a combination of CSVs and a BI tool) that answered key questions:

  • Object Summary: How many tables, views, procedures, etc. are in scope?
  • Conversion Status: What percentage of SQL code was converted automatically vs. requires manual review?
  • Complexity Score: Which objects are the most complex to migrate (e.g., those using CONNECT BY, JavaScript UDFs, or having deep dependency chains)?
  • Usage Metrics: Which objects are "hot" (frequently accessed) and "cold" (rarely or never accessed)?
  • Migration Waves: Recommended groupings of objects for phased migration.

A Python function to generate summary stats for such a report might look like this:

    import pandas as pd
    from typing import Dict

    def generate_readiness_summary(inventory: Dict[str, pd.DataFrame], conversion_results: dict):
        """
        Generates summary statistics for a migration readiness report.
        """
        summary = {}

        # Object counts
        for name, df in inventory.items():
            summary[f"count_{name}"] = len(df)

        # Conversion analysis (assuming conversion_results is a dict mapping object name to warnings)
        total_sql_objects = len(conversion_results)
        manual_review_count = sum(1 for warnings in conversion_results.values() if warnings)
        auto_converted_count = total_sql_objects - manual_review_count

        summary["sql_objects_total"] = total_sql_objects
        summary["sql_objects_manual_review"] = manual_review_count
        summary["sql_objects_auto_converted"] = auto_converted_count
        if total_sql_objects > 0:
            summary["auto_conversion_rate"] = (auto_converted_count / total_sql_objects) * 100
        else:
            summary["auto_conversion_rate"] = 100

        # You would also join this with usage data from query history analysis
        # to prioritize objects.

        return summary

    # --- Example Usage ---
    # Assuming you have run the previous scripts
    # conversion_results = {}
    # views_df = full_inventory.get('views', pd.DataFrame())
    # for index, row in views_df.iterrows():
    #     view_name = f"{row['TABLE_CATALOG']}.{row['TABLE_SCHEMA']}.{row['TABLE_NAME']}"
    #     view_ddl = row['VIEW_DEFINITION']
    #     _, warnings = convert_snowflake_to_databricks_sql(view_ddl)
    #     conversion_results[view_name] = warnings
    #
    # readiness_stats = generate_readiness_summary(full_inventory, conversion_results)
    #
    # print("\n--- Migration Readiness Report ---")
    # for key, value in readiness_stats.items():
    #     print(f"{key}: {value}")


This report became our compass, guiding our planning, resource allocation, and communication with the business.

The Gotchas: Challenges We Hit and How We Fixed Them

Automation gets you 80% of the way there. The final 20% is where migrations live or die. It involves solving gnarly, platform-specific problems. Here are the most common ones we faced, categorized for clarity.

Connectivity & Drivers

  • The Problem: The client's entire ecosystem of BI tools (Tableau, Power BI), custom applications, and ETL tools (like Informatica or Talend) were configured with Snowflake’s JDBC/ODBC drivers and connection strings. After migrating a dataset, nothing could connect to it on Databricks.
  • How We Diagnosed It: Simple "Connection refused" or "Driver not found" errors in application logs. More subtly, we saw authentication errors because developers tried to reuse their Snowflake key-pair auth method, which has no direct equivalent in Databricks.
  • Remediation and Best Practices:
    1. Centralize Connection Information: We created a wiki page that was the single source of truth for the new Databricks SQL Warehouse JDBC URL, PAT (Personal Access Token) generation instructions, and Service Principal setup guides (a Python connection sketch follows this list).
    2. Driver Inventory: We used our automation discovery to find all unique CLIENT_DRIVER types in QUERY_HISTORY to build a checklist of every tool we needed to reconfigure.
    3. Use Service Principals: We immediately stopped the use of individual user PATs for production applications. We created Azure AD Service Principals (or AWS IAM Roles), granted them specific permissions in Unity Catalog, and used their credentials for all automated access. This decouples access from individual employees.
    4. Phased Tool Migration: We didn't try to switch all tools at once. We worked with one team (e.g., the Sales BI team), got their Tableau dashboards connecting to a Databricks SQL Warehouse, validated performance, and then used them as a success story to guide other teams.
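
As a minimal sketch of what went on that wiki page for Python clients, here is the connection pattern using the databricks-sql-connector package. The hostname, HTTP path, and token values are placeholders you would pull from your SQL Warehouse's connection details and a secrets manager.

    from databricks import sql  # pip install databricks-sql-connector

    conn = sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
        http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder SQL Warehouse HTTP path
        access_token="<service-principal-or-pat-token>",               # prefer a service principal token in production
    )

    cursor = conn.cursor()
    cursor.execute("SELECT current_catalog(), current_schema()")
    print(cursor.fetchone())
    cursor.close()
    conn.close()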

Performance & Scalability

  • The Problem: We migrated a large, multi-terabyte fact table and its associated dimension tables. In Snowflake, queries against it were fast, thanks to well-defined clustering keys. In Databricks, the same queries, even after SQL translation, were painfully slow. Joins were taking hours instead of minutes.
  • How We Diagnosed It:
    1. Spark UI: This was our best friend. We launched the query from a Databricks Notebook and immediately opened the Spark UI. We saw two things: massive data shuffling (a huge "Exchange" stage) and data skew, where a few tasks were processing gigabytes of data while others did nothing.
    2. EXPLAIN Plan: We ran EXPLAIN on the query in a notebook. The plan showed a Broadcast Hash Join was not being used where it should have been, and the predicates were not being pushed down effectively.
  • Remediation and Best Practices:
    1. Rethink Physical Data Layout: Snowflake's micro-partitions and automatic clustering are fundamentally different from Databricks' Delta Lake files and partitioning/Z-Ordering. We used our QUERY_HISTORY analysis to identify the most common join and filter columns for the slow table.
    2. Strategic Partitioning: The table was being filtered by transaction_date 90% of the time. We rewrote the data loading job to partition the Delta table by this column (PARTITIONED BY (transaction_date)). This immediately enabled partition pruning, drastically reducing the amount of data scanned.
    3. Apply Z-Ordering: The most common join key was customer_id. After partitioning, we applied Z-Ordering (OPTIMIZE table_name ZORDER BY (customer_id)). This co-locates related data within the files of each partition, making joins incredibly fast. It's the closest equivalent to Snowflake’s clustering keys.
    4. Enable Photon: For all our SQL Warehouses and job clusters, we made sure the Photon engine was enabled. It’s a native C++ vectorized engine that provides a massive speedup on standard SQL workloads, often without any code changes.
    5. Analyze Tables: We made it a standard practice to run ANALYZE TABLE table_name COMPUTE STATISTICS FOR ALL COLUMNS after major data loads. This gives the cost-based optimizer the information it needs to create efficient query plans. A consolidated sketch of these layout changes follows this list.
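
Pulled together in a notebook, the layout changes from points 2, 3, and 5 look roughly like this. The table and column names echo the earlier dependency example and are illustrative; substitute your own hot tables.

    # Recreate the fact table partitioned by the dominant filter column
    spark.sql("""
        CREATE TABLE IF NOT EXISTS prod_db.sales.f_sales_daily (
            transaction_date DATE,
            customer_id BIGINT,
            revenue_usd DECIMAL(18, 2)
        )
        USING DELTA
        PARTITIONED BY (transaction_date)
    """)

    # Co-locate data for the most common join key within each partition
    spark.sql("OPTIMIZE prod_db.sales.f_sales_daily ZORDER BY (customer_id)")

    # Refresh statistics so the cost-based optimizer can plan joins well
    spark.sql("ANALYZE TABLE prod_db.sales.f_sales_daily COMPUTE STATISTICS FOR ALL COLUMNS")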

Schema & Data Type Issues

  • The Problem: One of our first data validation failures was on a table containing semi-structured data. Snowflake stored it in a VARIANT column. Our initial automated migration defined the corresponding Databricks column as a simple STRING. Downstream processes that expected to query into the JSON (json_col:path.to.key) all failed. We also saw precision loss with NUMBER(38,0) columns being incorrectly mapped.
  • How We Diagnosed It:
    1. Data Loading Failures: Our Spark jobs would fail with parsing errors when trying to write to the Delta table.
    2. SELECT COUNT(*) Mismatches: We ran reconciliation scripts to compare row counts and checksums between Snowflake and Databricks. They didn't match.
    3. Query Errors: CAST and JSON parsing errors appeared in the logs of migrated queries.
  • Remediation and Best Practices:
    1. Explicit Schema Definition: Never rely on schema inference for production pipelines. For VARIANT columns, we used Snowflake's typeof and sampling to infer a schema, then explicitly defined it in Databricks using STRUCT types.
    2. Use from_json: For loading semi-structured data, we would land it as a string and then use the from_json function with a defined schema to parse it into a STRUCT column. This gives you the best of both worlds: robust loading and queryable nested data.
    -- Example of parsing a JSON string column into a struct
    SELECT
      id,
      from_json(raw_json_string_col, 'STRUCT<user: STRING, event_type: STRING, properties: MAP<STRING, STRING>>') AS event_data
    FROM raw_landing_table;

    3. Create a Data Type Mapping Matrix: We created and enforced a strict mapping table. NUMBER(p,s) in Snowflake becomes DECIMAL(p,s) in Databricks. TIMESTAMP_NTZ becomes TIMESTAMP. BINARY stays BINARY. This mapping was embedded into our automation scripts.
    4. Validate with Checksums: For validation, simple row counts are not enough. We wrote scripts to generate checksums/hashes on a per-column and per-row basis for a sample of data (xxhash64 is great for this in Spark) to catch subtle data type or transformation logic errors. A sketch follows this list.
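
Here is a simplified sketch of that checksum approach in PySpark. The table name is a placeholder, and a production version also samples large tables and compares per-column aggregates.

    from pyspark.sql import functions as F

    def table_fingerprint(table_name: str):
        """Returns (row_count, hash_sum) as an order-independent fingerprint of a table."""
        df = spark.table(table_name)
        hashed = df.withColumn(
            "row_hash",
            # Cast everything to string and replace NULLs so the hash is deterministic
            F.xxhash64(*[F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in df.columns]),
        )
        return hashed.agg(
            F.count(F.lit(1)).alias("row_count"),
            # Cast to DECIMAL before summing to avoid long overflow on large tables
            F.sum(F.col("row_hash").cast("decimal(38,0)")).alias("hash_sum"),
        ).collect()[0]

    # Compare against the same fingerprint computed on the Snowflake side
    # (e.g. via the Spark Snowflake connector or an exported copy).
    print(table_fingerprint("prod_db.sales.f_sales_daily"))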

Security & Compliance

  • The Problem: The client had a complex security model built on a hierarchy of inherited Snowflake system and custom roles. They also used row-level access policies (via CREATE ROW ACCESS POLICY) to ensure certain users could only see data for their own region. A direct migration of the data left everything either wide open or inaccessible in Databricks.
  • How We Diagnosed It: User acceptance testing (UAT) failed spectacularly. A sales manager from the US could suddenly see European sales data. An analyst who had access before now got Permission Denied errors. Our audit logs showed users were being blocked by Unity Catalog.
  • Remediation and Best Practices:
    1. Adopt Unity Catalog from Day One: This is non-negotiable for any serious Databricks implementation. We decided immediately that we would not use legacy table ACLs. All security would be managed through Unity Catalog (UC).
    2. Map Snowflake Roles to UC Groups: We used our inventory scripts to extract all Snowflake roles and their user assignments. We then mapped these business functions to Groups in Databricks (e.g., the SNOWFLAKE_SALES_ANALYST_ROLE became the databricks-sales-analysts group in our identity provider, which was synced to Databricks).
    3. Re-implement Row-Level Security with Secure Views: Unity Catalog doesn't have a direct equivalent of Snowflake's row access policies (as of this writing). We re-implemented this logic by creating a secure VIEW on top of the base table. The view's logic uses the is_member() or current_user() functions to filter the data.
    -- RLS View in Databricks
    CREATE VIEW secure_catalog.sales.regional_sales AS
    SELECT *
    FROM secure_catalog.sales.all_sales
    WHERE
      CASE
        WHEN is_member('group-us-sales') THEN region = 'US'
        WHEN is_member('group-eu-sales') THEN region = 'EU'
        ELSE FALSE
      END;

    We then granted users SELECT access on the view, not the underlying table.
    4. Automate Grants: We wrote scripts using the Databricks REST API or Terraform to apply GRANT statements programmatically based on our role-to-group mapping document. This ensured consistency and auditability. A grant-automation sketch follows this list.
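
A minimal sketch of that grant automation, assuming the mapping lives in a simple dict and that we issue Unity Catalog GRANT statements through SQL from a notebook (the REST API and Terraform variants follow the same shape). The group and object names are placeholders.

    # Role-to-group mapping document, distilled to: group -> list of (privilege, securable type, object)
    grant_mapping = {
        "databricks-sales-analysts": [
            ("SELECT", "TABLE", "secure_catalog.sales.regional_sales"),  # the secure view, granted via the TABLE securable
        ],
        "databricks-data-engineers": [
            ("ALL PRIVILEGES", "SCHEMA", "secure_catalog.sales"),
        ],
    }

    for group, grants in grant_mapping.items():
        for privilege, securable_type, object_name in grants:
            spark.sql(f"GRANT {privilege} ON {securable_type} {object_name} TO `{group}`")
            print(f"Granted {privilege} on {object_name} to {group}")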

Orchestration & Scheduling

  • The Problem: A core ETL process was built using Snowflake Streams and Tasks. A STREAM on a raw table captured new inserts. A TASK ran every 5 minutes, checked the stream, and ran a MERGE statement into a dimension table. Another task was dependent on the first, running a downstream aggregation. This DAG of dependencies was lost in the migration.
  • How We Diagnosed It: Data freshness dashboards turned red. The dimension table wasn't getting updated. Manually running the migrated MERGE script worked, but the automation was broken. We had no dependency management.
  • Remediation and Best Practices:
    1. Pattern 1: Migrating to Databricks Jobs: For a direct "lift-and-shift" of task-based workflows, we used Databricks Jobs. A Snowflake TASK DAG was converted into a multi-task Databricks Workflow. Each MERGE script or stored procedure call became a separate task in the workflow, with dependencies explicitly defined in the UI or via the API (see the Jobs API sketch at the end of this section). This is a solid, direct translation.
    2. Pattern 2: Modernizing with Delta Live Tables (DLT): This was our preferred approach for new and critical pipelines. DLT is a declarative framework that completely abstracts away the orchestration. We rewrote the Stream/Task logic into a DLT pipeline.
    # Simplified DLT pipeline in a Python notebook
    import dlt
    from pyspark.sql.functions import *

    @dlt.table(
      comment="Raw bronze data from cloud storage."
    )
    def bronze_raw_orders():
      return (
        spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load("/path/to/raw/orders")
      )

    @dlt.table(
      comment="Cleaned and prepared silver dimension table."
    )
    @dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
    def silver_customers():
      return (
        dlt.read_stream("bronze_raw_orders")
          .select("customer_id", "customer_details", "update_timestamp")
      )

    # DLT handles the MERGE logic automatically with Change Data Capture (CDC)
    dlt.create_streaming_live_table("gold_customer_dim")

    dlt.apply_changes(
      target = "gold_customer_dim",
      source = "silver_customers",
      keys = ["customer_id"],
      sequence_by = col("update_timestamp"),
      stored_as_scd_type = 1 # or 2
    )

    With DLT, we simply declare the sources, targets, and transformations. DLT automatically handles dependency management, infrastructure, data quality checks, and efficient micro-batch or continuous processing. It was a massive leap forward from manually managing Streams and Tasks.
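
For completeness, here is a hedged sketch of Pattern 1: the two dependent Snowflake tasks re-expressed as a multi-task job via the Jobs API 2.1. The notebook paths, schedule, and auth values are placeholders, and the compute configuration (job clusters or serverless) is omitted for brevity.

    import requests

    job_spec = {
        "name": "customer_dim_pipeline",
        "tasks": [
            {
                "task_key": "merge_customer_dim",
                "notebook_task": {"notebook_path": "/Repos/etl/merge_customer_dim"},
            },
            {
                "task_key": "aggregate_customer_metrics",
                # Replaces the Snowflake task's AFTER dependency
                "depends_on": [{"task_key": "merge_customer_dim"}],
                "notebook_task": {"notebook_path": "/Repos/etl/aggregate_customer_metrics"},
            },
        ],
        # Every 5 minutes, mirroring the original task schedule
        "schedule": {"quartz_cron_expression": "0 0/5 * * * ?", "timezone_id": "UTC"},
    }

    resp = requests.post(
        "https://<workspace-url>/api/2.1/jobs/create",
        headers={"Authorization": "Bearer <token>"},
        json=job_spec,
    )
    print(resp.json())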

Cost & Resource Optimization

  • The Problem: In the first month after migrating a business unit, the client’s cloud bill for Databricks was higher than they expected. We had replaced a predictable Snowflake credit consumption model with a more complex world of VMs, DBUs (Databricks Units), and storage costs.
  • How We Diagnosed It:
    1. Databricks Cost Analysis Tools: We used the built-in cost and usage dashboards in Databricks to break down spend by SKU, workspace, and tags.
    2. Cluster Monitoring: We saw that several "All-Purpose Clusters" used by data scientists were running 24/7, even overnight, accumulating costs while being completely idle.
    3. Job Cluster Configuration: Some nightly ETL jobs were using massive, fixed-size clusters that were only fully utilized for 10% of the job's duration.
  • Remediation and Best Practices:
    1. Job Clusters are Your Default: We established a rule: if it’s an automated, scheduled workload, it runs on a Job Cluster. Job clusters are cheaper per DBU and exist only for the duration of the job. All-Purpose clusters are for interactive development only.
    2. Enforce Auto-Termination: We set a strict, non-negotiable auto-termination policy (e.g., 30 minutes of inactivity) on all All-Purpose clusters.
    3. Embrace Auto-Scaling: For both job and all-purpose clusters, we enabled auto-scaling. We'd set a minimum number of workers (e.g., 2) and a maximum (e.g., 20). Databricks automatically scales the cluster up to handle load and, more importantly, scales it down to save money.
    4. Use Cluster Policies: To prevent developers from spinning up monster GPU clusters for a simple SQL query, we implemented Cluster Policies. Policies are JSON-defined rules that restrict the available instance types, set mandatory tags (for chargeback), and enforce auto-termination settings. A policy sketch follows this list.
    5. Leverage Spot Instances: For non-time-critical workloads, we configured our clusters to use spot instances for a significant percentage of the workers, which can provide savings of 70-90% over on-demand prices.
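
As an illustration of point 4, a cluster policy definition is just a JSON document of constraints. This sketch pins auto-termination, restricts node types, and forces a chargeback tag; the instance types, tag name, and policy name are assumptions rather than our exact policy.

    import json
    import requests

    policy_definition = {
        # Hide and hard-code auto-termination so nobody can disable it
        "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
        # Only allow modest general-purpose instance types
        "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
        # Mandatory tag for cost attribution
        "custom_tags.cost_center": {"type": "fixed", "value": "sales-analytics"},
    }

    resp = requests.post(
        "https://<workspace-url>/api/2.0/policies/clusters/create",
        headers={"Authorization": "Bearer <token>"},
        json={"name": "standard-etl-policy", "definition": json.dumps(policy_definition)},
    )
    print(resp.json())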

My Battle-Tested Best Practices: Lessons from the Field

If you only take away a few things, let them be these hard-won lessons.

  1. Don't Boil the Ocean. Start with a single, well-understood business domain or data mart. Use it as your pilot to build your automation scripts, refine your methodology, and train your team. Your first migration wave is your template for all future waves.
  2. QUERY_HISTORY and ACCESS_HISTORY are Pure Gold. Mine them relentlessly. They are your ground truth for what matters. Usage data should drive every decision you make about prioritization, performance tuning, and decommissioning.
  3. Treat Migration as a Software Development Project. Your automation scripts, SQL code, and infrastructure definitions should live in Git. Use pull requests for code reviews. Set up CI/CD pipelines to deploy changes. This discipline prevents the chaos of a manual approach.
  4. Unity Catalog Is Not Optional. Adopt it from day one, even if it feels like more upfront work. It is the foundation for governance, security, and data discovery on the lakehouse. Retrofitting it later is 10x harder.
  5. Validate, Validate, Then Validate Again. Data is the asset. You must prove, with data, that the migration was successful. Automate row counts, checksums, and even full data reconciliations on key tables. Your business users will not trust the new platform until you can prove its fidelity.
  6. Invest in Your People. The technical migration is only half the battle. The other half is the human element. Train your analysts, engineers, and data scientists on the new tools and concepts (Spark UI, notebooks, Delta Lake, DLT). The shift from a managed SQL warehouse to the more versatile but complex lakehouse requires a change in mindset.

Conclusion: It's a Modernization Journey, Not Just a Migration

Automating the migration from Snowflake to Databricks is an intense, complex, but entirely achievable engineering challenge. A manual approach is a recipe for failure. A systematic, automation-driven "migration factory" approach is the only path to success.

By building a toolkit to discover your assets, analyze their dependencies, convert their logic, and validate the results, you transform an overwhelming task into a manageable, repeatable process. The challenges you face—from performance tuning to security modeling—are not signs of a failed strategy but are simply the predictable hurdles of any major platform shift. By anticipating them and having a plan, you can overcome them systematically.

Ultimately, the goal isn't just to move data from point A to point B. It's to unlock the full potential of the Databricks Lakehouse Platform—to unify your data warehousing and AI workloads, to build more efficient and reliable data pipelines, and to empower your organization with a single, governed source of truth for all data and analytics. With the right automation and methodology, you're not just migrating; you're modernizing. And that makes all the difference.