Top 5 Things to Consider When Migrating from Snowflake to Databricks

If you're thinking about moving from Snowflake to Databricks, you're in good company. It's a journey many teams are considering, often driven by the desire to unify data warehousing with AI/ML workloads, gain more control over the underlying storage, or optimize costs for specific usage patterns. Having navigated this migration path with multiple teams, I can tell you it's much more than a simple platform swap. It’s a paradigm shift.

Forget the marketing slides for a moment. The core difference is this: Snowflake is a managed SQL data warehouse-as-a-service that brilliantly separates storage and compute. Databricks is a unified data and AI platform built on top of open-source technologies like Apache Spark, Delta Lake, and MLflow, running on your cloud storage. This distinction is the source of both its power and its complexity.

Successfully migrating requires you to unlearn some habits and embrace a new way of thinking about your data's lifecycle. Here are the top five practical, in-the-weeds considerations that will make or break your migration.


1. RETHINKING YOUR DATA ARCHITECTURE: IT'S NOT JUST A LIFT-AND-SHIFT

The single biggest mistake I see is teams trying to replicate their Snowflake environment one-for-one in Databricks. This "lift-and-shift" approach is a recipe for high costs and poor performance. In Snowflake, you likely have databases, schemas, and tables. You think in terms of virtual warehouses (T-shirt sizes like X-Small, Large) that you spin up to run SQL queries. It's clean, abstracted, and very SQL-centric.

Databricks requires you to think closer to the metal. You own the storage layer (e.g., AWS S3, Azure ADLS Gen2), and the architecture should reflect that.

From Warehouse-on-Cloud to the Lakehouse Paradigm

Your Snowflake mental model is one of structured data sitting in a proprietary, highly optimized format, queried by a managed compute layer. Databricks flips this. Your data lives in an open format (Delta Lake) on your own cloud object storage. This is the "Lakehouse" concept in practice: data warehouse-like performance and reliability directly on your data lake.

This has profound implications.

  • Storage is Your Responsibility: In Snowflake, you load data and it's managed. In Databricks, you are creating files (Parquet files, by default, wrapped in a Delta transaction log) in your own S3 bucket or ADLS container. This means you need to think about file layouts, partitioning, and managing small files—things Snowflake completely abstracts away.
  • Compute is More Granular: Instead of a "Large" warehouse, you define a Databricks cluster. You choose the VM instance types (e.g., memory-optimized, compute-optimized), the number of workers, and whether it auto-scales. This is powerful but can be daunting.

Here’s a practical breakdown of the conceptual mapping:

| Snowflake Concept | Databricks Equivalent(s) | Key Consideration |
|---|---|---|
| Account | Databricks Workspace/Account | A workspace is the environment where you work. An account can manage multiple workspaces. |
| Database / Schema | Unity Catalog: Catalog / Schema | Unity Catalog provides a three-level namespace (`catalog.schema.table`) that feels very familiar to Snowflake users. This is the recommended approach. |
| Virtual Warehouse | Databricks SQL Warehouse (Serverless or Pro) or All-Purpose/Jobs Cluster | This is the most critical distinction. Use SQL Warehouses for BI/SQL analytics; use All-Purpose/Jobs clusters for data engineering (Spark jobs). Don't use an expensive All-Purpose cluster for your Tableau dashboard! |
| Table | Delta Table (managed or external) | A Delta table is a collection of Parquet files with a transaction log. This is the heart of the Lakehouse. |
| VARIANT Data Type | STRUCT, MAP, ARRAY, or a JSON string | Databricks encourages parsing semi-structured data into strongly typed nested structures at ingestion for better query performance, unlike Snowflake's optimized VARIANT type. |
| Snowpipe | Auto Loader | Auto Loader is Databricks's powerful, scalable service for incremental data ingestion from cloud storage. It's more configurable than Snowpipe. |
| Streams & Tasks | Delta Live Tables (DLT) or Structured Streaming + Databricks Jobs | DLT is a declarative framework for building reliable data pipelines. It's a higher-level abstraction than Snowflake's Streams/Tasks. |
| Clustering Key | Z-ORDER BY | Z-Ordering colocates related information in the same set of files. Similar in spirit to Snowflake's clustering keys, but multi-dimensional and applied to the files in your cloud storage. |

The Medallion Architecture in Practice

Because you manage the data lifecycle, a structured architectural pattern is essential. The most successful migrations I've been a part of have rigorously adopted the Medallion Architecture. Don't treat this as optional marketing fluff; it's a foundational pattern for success on Databricks.

  • Bronze Layer (Raw): This is where your raw data lands, as-is, from source systems. Think of it as your immutable staging area. The goal is to capture the data quickly and cheaply. Data is often stored in its original format or as Delta tables with the raw payload (e.g., a single column of JSON or CSV). You never want to have to re-ingest from a source system. Your Bronze layer is your source of truth.
  • Silver Layer (Cleansed, Conformed): Here, you take the raw data from Bronze and apply basic quality rules, cleansing, and transformations. Data is parsed from JSON into columns, joined with reference data, and conformed into an enterprise view. Your Silver tables should mirror the structure of your source systems but be clean and queryable. This is often where business analysts and data scientists start their exploration.
  • Gold Layer (Aggregated, Curated): This layer contains your highly refined, business-level aggregates. These are your final reporting tables, feature stores for ML, or data marts. Gold tables are optimized for specific business use cases, often denormalized and aggregated to provide answers to key business questions quickly. Your BI tools should almost exclusively point to Gold tables.
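
To make the three layers concrete, here is a minimal, plain-Python sketch of the Bronze-to-Silver-to-Gold flow. This is illustrative only: in practice each step would be a PySpark job writing Delta tables, and the field names here are hypothetical.

```python
import json

# Bronze: raw payloads land as-is (here, JSON strings); nothing is dropped
bronze = [
    '{"order_id": 1, "customer_id": "c1", "order_total": "150.00"}',
    '{"order_id": 2, "customer_id": "c1", "order_total": "900.00"}',
    '{"order_id": 3, "customer_id": "c2", "order_total": "not_a_number"}',
]

# Silver: parse, enforce types, and quarantine bad rows instead of failing
silver, quarantined = [], []
for raw in bronze:
    row = json.loads(raw)
    try:
        row["order_total"] = float(row["order_total"])
        silver.append(row)
    except ValueError:
        quarantined.append(row)

# Gold: business-level aggregate (total spend per customer)
gold = {}
for row in silver:
    gold[row["customer_id"]] = gold.get(row["customer_id"], 0.0) + row["order_total"]

print(gold)  # {'c1': 1050.0}
```

Notice that the bad row never poisons the Gold aggregate, and it is still preserved for inspection, which is exactly the property you want from a quarantine path between Bronze and Silver.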

Real-World Scenario: A team I worked with tried a "lift-and-shift" from their PROD schema in Snowflake. They dumped all 300 tables into a single schema in Databricks and pointed their jobs at it. It was chaos. Jobs failed due to data quality issues that were previously hidden, query performance was terrible because nothing was optimized, and lineage was impossible to track.

The fix was to re-architect. We mapped their source ingestion into Bronze tables. We then built Silver tables that represented their core business entities (customers, orders, products). Finally, we rebuilt their key BI dashboards on top of new Gold tables, like daily_sales_summary and customer_360_view. The upfront effort was significant, but it made the system reliable, performant, and understandable.


2. TRANSLATING THE CODE: BEYOND SQL DIALECTS

This is where the rubber meets the road. A significant portion of your migration effort will be in translating your Snowflake logic into a Databricks-compatible format. While both platforms support ANSI SQL, the devil is in the details—the proprietary functions, procedural logic, and underlying execution models are worlds apart.

SQL Functions and Dialect Differences

You will spend a non-trivial amount of time hunting down Snowflake-specific functions and finding Databricks equivalents. Simple ones are easy (GETDATE() -> current_timestamp()), but others are more complex.

Here's a list of common "gotchas":

  • Semi-structured Data Handling: Snowflake's VARIANT and dot-notation/bracket-notation for parsing JSON is incredibly convenient.

    • Snowflake SQL: SELECT raw_payload:customer.name FROM my_table;
    • Databricks SQL: SELECT raw_payload.customer.name FROM my_table; (if raw_payload is a STRUCT) or SELECT get_json_object(raw_payload_string, '$.customer.name') FROM my_table;
    • Practitioner's Tip: The best practice in Databricks is to define a schema and parse your JSON into a STRUCT type during ingestion (e.g., in your Bronze-to-Silver pipeline). Querying a STRUCT is dramatically faster than parsing a JSON string on the fly with get_json_object.
  • QUALIFY ROW_NUMBER(): Snowflake’s QUALIFY clause is a powerful and elegant way to filter window functions. Databricks SQL now supports QUALIFY as well (Databricks Runtime 10.0+), so on recent runtimes these queries often run unchanged; on older runtimes or plain Spark SQL, you need to fall back to a subquery or a Common Table Expression (CTE).

    • Snowflake SQL:
    SELECT * FROM events
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) = 1;

    • Databricks SQL (Equivalent):
    SELECT * FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) as rn
      FROM events
    )
    WHERE rn = 1;

  • Sequence Generators (AUTOINCREMENT / SEQUENCE): Snowflake makes generating unique surrogate keys simple. Databricks, being a distributed system operating on files, doesn't have a native, scalable equivalent for gap-free sequences.
    • Snowflake: CREATE TABLE t (id INT AUTOINCREMENT, ...)
    • Databricks: Delta Lake now supports IDENTITY columns, which work similarly but come with caveats: the generated values are not guaranteed to be gap-free, and declaring an identity column restricts concurrent writes to the table.
    CREATE TABLE t (id BIGINT GENERATED BY DEFAULT AS IDENTITY, ...)

    • Alternative Pattern: Often, the better pattern in a distributed world is to use a composite natural key or generate a UUID using a function like `uuid()`. Relying on a globally ordered sequence can become a bottleneck.
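
One way to get stable surrogate keys without any central sequence is a deterministic UUID derived from the natural key. A sketch in plain Python (the namespace constant and key format here are my own choices, not a Databricks convention):

```python
import uuid

# A fixed namespace makes the natural-key -> surrogate-key mapping
# reproducible across runs and across workers (no coordination needed).
KEY_NAMESPACE = uuid.NAMESPACE_DNS

def surrogate_key(*natural_key_parts: str) -> str:
    """Deterministic key from a composite natural key."""
    return str(uuid.uuid5(KEY_NAMESPACE, "|".join(natural_key_parts)))

# The same input always yields the same key, in any order, on any node
k1 = surrogate_key("cust_42", "2024-01-15")
k2 = surrogate_key("cust_42", "2024-01-15")
print(k1 == k2)  # True
```

Because the key is a pure function of the natural key, re-running a pipeline produces identical keys, which makes MERGE-based upserts idempotent.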

Stored Procedures: The Spark Mindset Shift

This is a major architectural pivot. In Snowflake, you write Stored Procedures in JavaScript, SQL (Scripting), or even bring in Java/Python code. They are executed transactionally on the warehouse. They are a common way to encapsulate complex business logic.

Trying to replicate a 500-line imperative stored procedure in Databricks SQL is a path of pain. Databricks's native procedural language is Apache Spark, most commonly used via the Python (PySpark) or Scala APIs.

Your stored procedures should not be translated to Databricks SQL procedures. They should be re-implemented as Spark jobs, organized within Databricks Notebooks or Python scripts.

Example: A Complex Customer Update Procedure

Imagine a Snowflake stored procedure that does the following:
1. Takes a customer_id.
2. Looks up recent orders.
3. Calls a UDF to calculate a new loyalty score.
4. Fetches the latest address from another table.
5. Updates the master customers table in a transaction.

The Snowflake Way (Conceptual JS Stored Proc):

    CREATE OR REPLACE PROCEDURE update_customer_loyalty(CUSTOMER_ID FLOAT)
    RETURNS STRING
    LANGUAGE JAVASCRIPT
    AS $$
      // Imperative logic: SELECT, call UDF, UPDATE
      var sql_get_orders = `SELECT ... WHERE customer_id = ${CUSTOMER_ID}`;
      // ... run statements, process results, build UPDATE statement ...
      snowflake.execute({sqlText: "UPDATE customers SET ..."});
      return "SUCCESS";
    $$;

The Databricks Way (PySpark Notebook/Job):
This is a declarative, functional approach. You define data transformations, not imperative steps.

    # A Databricks Notebook that can be parameterized with widgets
    dbutils.widgets.text("customer_id", "0")
    customer_id = dbutils.widgets.get("customer_id")

    # 1. Read the necessary data into DataFrames
    customers_df = spark.table("silver.customers")
    orders_df = spark.table("silver.orders")

    # 2. Define the transformations (this could include UDFs, but native functions are better)
    from pyspark.sql.functions import col, sum, when

    # Filter for the relevant customer and calculate new score
    customer_orders = orders_df.filter(col("customer_id") == customer_id)
    new_loyalty_score_df = customer_orders.groupBy("customer_id").agg(
        sum("order_total").alias("total_spend")
    ).withColumn("loyalty_score",
        when(col("total_spend") > 1000, "Gold")
        .when(col("total_spend") > 500, "Silver")
        .otherwise("Bronze")
    )

    # 3. Use Delta Lake's MERGE to update the target table atomically
    # This is the idiomatic way to perform an "upsert" or "update" in Databricks.
    # Note: MERGE is driven through a DeltaTable handle, not a plain DataFrame.
    from delta.tables import DeltaTable

    customers_table = DeltaTable.forName(spark, "silver.customers")
    (customers_table.alias("c")
      .merge(
        new_loyalty_score_df.alias("s"),
        "c.customer_id = s.customer_id"
      )
      .whenMatchedUpdate(set = {
        "loyalty_score": "s.loyalty_score",
        "last_updated_ts": "current_timestamp()"
      })
      .execute()
    )

The mindset shift is from row-by-row, imperative processing to set-based, declarative transformations on entire DataFrames. The MERGE statement is your best friend for migrating complex UPDATE logic.

User-Defined Functions (UDFs): A Performance Minefield

Snowflake makes it easy to write UDFs in various languages. Databricks also supports UDFs (Python and Scala), but they come with a huge performance warning. When you use a Python UDF in a Spark job, the data has to be serialized from the Spark JVM, sent to a Python process, computed, and then serialized back. This overhead can kill performance.

The rule of thumb: Avoid UDFs whenever possible. 95% of what you do in a UDF can be accomplished with Spark's built-in functions in the pyspark.sql.functions module. These functions operate directly on the JVM data and are orders of magnitude faster.

If you absolutely must use a UDF for complex logic that can't be expressed otherwise, consider using a Pandas UDF (Vectorized UDF). These operate on Apache Arrow data structures and process data in batches (as Pandas Series or DataFrames), significantly reducing the serialization overhead.
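
The batching idea behind Pandas UDFs can be sketched without Spark. In the scalar model, every row costs one Python call (and, in Spark, one serialization round-trip); in the vectorized model, a whole batch crosses the boundary at once. This is a simplified mental model only; the real mechanism hands over Apache Arrow batches as pandas Series.

```python
# Scalar model: one Python call per row
def scalar_udf(x: float) -> float:
    return x * 1.1  # e.g., apply a 10% uplift

# Vectorized model: one call per batch of rows, far fewer boundary crossings
def batched_udf(batch: list) -> list:
    return [x * 1.1 for x in batch]

rows = [100.0, 200.0, 300.0, 400.0]

per_row = [scalar_udf(x) for x in rows]   # 4 calls
per_batch = batched_udf(rows)             # 1 call

print(per_row == per_batch)  # True: same result, fewer crossings
```

The results are identical; only the number of Python invocations (and hence serialization overhead) differs, which is why Pandas UDFs can be dramatically faster than row-at-a-time Python UDFs.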


3. ORCHESTRATION AND INGESTION: THE NEW DATA FLOW

How data gets into your platform and how your transformation jobs are triggered is fundamentally different. In Snowflake, you likely rely on a combination of Snowpipe for ingestion and Tasks for scheduling, possibly orchestrated by an external tool like Airflow.

Databricks offers a more integrated but also more fragmented ecosystem. You have to make conscious choices about which tools to use for which job.

Ingestion: From Snowpipe to Auto Loader

Snowpipe is excellent for its simplicity: point it at an S3/ADLS stage, and it loads new files automatically.

Auto Loader is Databricks's answer, and it's more powerful and configurable. It can run in two modes:
* Directory Listing Mode: Good for smaller numbers of files. It lists the source directory to find new files.
* File Notification Mode: The recommended, most scalable mode. It subscribes to file event notifications from your cloud provider (e.g., S3 Event Notifications -> SQS queue). This is more efficient than constantly listing directories with millions of files.

A typical Auto Loader pattern in PySpark looks like this:

    # Set up the stream read
    streaming_df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")  # The format of the source files
      .option("cloudFiles.schemaLocation", "/path/to/schema/location") # Auto Loader tracks state here
      .load("/path/to/raw/data/in/s3/")
    )

    # Apply transformations
    from pyspark.sql.functions import current_timestamp
    processed_df = streaming_df.withColumn("ingest_time", current_timestamp())

    # Write the stream to a Delta table
    (processed_df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/path/to/checkpoint/location") # Critical for fault-tolerance
      .trigger(availableNow=True) # Run as a batch job, process all available data, then stop
      .toTable("bronze.my_raw_table")
    )

Key Differences & Advantages of Auto Loader:
* Schema Inference and Evolution: Auto Loader can automatically infer your schema and, more importantly, handle schema drift gracefully. It can notify you of changes or rescue unexpected data into a separate column.
* Exactly-Once Processing: By using checkpoint locations, Auto Loader guarantees that each file is processed exactly once, even if the job fails and restarts.
* Cost-Effective: You run Auto Loader on a Databricks Job cluster, which you only pay for when it's running. You can schedule it to run every 5 minutes or every hour, controlling your cost-latency tradeoff, unlike Snowpipe's more opaque credit consumption model.
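
Conceptually, the checkpoint-based exactly-once guarantee works like this toy model in plain Python. (In reality Auto Loader persists its state under the `checkpointLocation`/`schemaLocation` paths, not in an in-memory set.)

```python
# Toy model of checkpointed ingestion: each file is processed at most once,
# even if the "job" is run repeatedly (e.g., after a failure and restart).
processed_files = set()  # stands in for the on-disk checkpoint

def run_ingestion(available_files):
    """Process only files not recorded in the checkpoint, then record them."""
    newly_processed = []
    for f in available_files:
        if f not in processed_files:
            newly_processed.append(f)  # ...load the file into the Bronze table...
            processed_files.add(f)     # commit the file to the checkpoint
    return newly_processed

first = run_ingestion(["a.json", "b.json"])             # both are new
second = run_ingestion(["a.json", "b.json", "c.json"])  # only c.json is new
print(first, second)  # ['a.json', 'b.json'] ['c.json']
```

Re-running with the same source listing is a no-op, which is why an Auto Loader job triggered on a schedule (or restarted after a crash) never double-loads a file.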

Orchestration: Delta Live Tables vs. Databricks Jobs vs. External Tools

Snowflake's Tasks provide a basic, cron-like scheduling mechanism within the database. Databricks offers more sophisticated options.

  • Databricks Jobs: This is the core orchestrator. You can schedule a notebook, a Python script, or a dbt Core project to run on a schedule. It handles cluster creation/termination, parameter passing, and alerting. You will use this for the majority of your batch pipelines. A typical pattern is to have a multi-task job: Task 1 ingests data with Auto Loader, Task 2 runs Silver transformations, Task 3 builds Gold aggregates.

  • Delta Live Tables (DLT): This is a game-changer for new projects and a strong consideration for your migration. DLT is a declarative pipeline framework. Instead of defining a DAG of tasks, you define your tables (as SQL or Python functions) and specify their dependencies using LIVE references. DLT handles the orchestration, data quality checks, error handling, and auto-scaling for you.

    Conceptual DLT Pipeline (SQL):

    -- Bronze table definition (ingesting from cloud storage)
    CREATE OR REFRESH STREAMING LIVE TABLE raw_orders
    COMMENT "Raw orders from the cloud"
    AS SELECT * FROM cloud_files('/path/to/orders', 'json');

    -- Silver table definition with quality constraints
    CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
      CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
    )
    COMMENT "Cleansed and conformed orders"
    AS SELECT
      order_id,
      customer_id,
      ...
    FROM STREAM(LIVE.raw_orders);

    -- Gold table definition (a batch aggregation)
    CREATE LIVE TABLE daily_customer_summary
    COMMENT "Aggregated daily sales by customer"
    AS SELECT
      customer_id,
      CAST(order_timestamp AS DATE) as order_date,
      SUM(order_total) as total_sales
    FROM LIVE.clean_orders
    GROUP BY customer_id, order_date;

DLT automatically builds the dependency graph, manages micro-batch or continuous streaming, and provides a rich UI for monitoring pipeline health and data lineage. For complex, streaming-first pipelines, migrating to DLT over a traditional Jobs/Notebooks approach can save immense development and maintenance effort.
  • External Orchestrators (e.g., Airflow): If your organization already has a heavy investment in a tool like Airflow, you don't have to abandon it. Databricks has a robust REST API and an official Airflow provider. You can trigger Databricks jobs from your Airflow DAGs. The common pattern is to keep high-level orchestration in Airflow (e.g., s3_sensor >> databricks_job >> slack_notification) but let Databricks handle the internal multi-task orchestration for better performance and integration.
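
Triggering a Databricks job from an external orchestrator ultimately boils down to one REST call, `POST /api/2.1/jobs/run-now`. A minimal sketch of building that request with the standard library (the hostname, job ID, and parameter names are placeholders; in Airflow you would normally use the official provider's operators instead of hand-rolled HTTP):

```python
import json

def build_run_now_request(host: str, job_id: int, params: dict):
    """Build the URL and JSON body for the Jobs 2.1 run-now endpoint."""
    url = f"https://{host}/api/2.1/jobs/run-now"
    body = json.dumps({"job_id": job_id, "notebook_params": params}).encode()
    return url, body

url, body = build_run_now_request(
    "my-workspace.cloud.databricks.com",  # placeholder workspace hostname
    12345,                                # placeholder job ID
    {"run_date": "2024-01-15"},
)
print(url)
# To actually trigger the run, POST `body` to `url` with an
# "Authorization: Bearer <token>" header (e.g., via urllib.request).
```

Keeping the payload construction separate from the HTTP call like this also makes the integration easy to unit-test without touching the workspace.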

4. PERFORMANCE AND COST: TAMING THE SPARK BEAST

This is often the reason for migrating to Databricks, but it can also be a source of massive budget overruns if not managed carefully. In Snowflake, cost is predictable: you pay for your warehouse size and how long it runs, plus storage. In Databricks, you pay for the cloud VMs in your cluster for every second they are active, plus a "Databricks Unit" (DBU) charge on top, which varies by the VM type and Databricks service tier.
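
The cost equation can be written down directly. Here is a sketch with hypothetical rates; real VM and DBU prices vary by cloud, region, instance type, and Databricks tier, so plug in numbers from your own price list.

```python
def job_cost(runtime_hours: float, num_nodes: int,
             vm_rate_per_hour: float, dbu_per_node_hour: float,
             dbu_price: float) -> float:
    """Total cost = cloud VM cost + DBU cost, both accrued per node-hour."""
    node_hours = runtime_hours * num_nodes
    vm_cost = node_hours * vm_rate_per_hour
    dbu_cost = node_hours * dbu_per_node_hour * dbu_price
    return vm_cost + dbu_cost

# Hypothetical: 8-node cluster, 2-hour job, $0.50/hr VMs, 1.5 DBU/node-hr @ $0.15
cost = job_cost(2.0, 8, 0.50, 1.5, 0.15)
print(f"${cost:.2f}")

# The Photon tradeoff described below: higher DBU burn, much shorter runtime
photon_cost = job_cost(0.8, 8, 0.50, 1.5 * 1.2, 0.15)  # ~2.5x faster, 20% more DBUs
print(f"${photon_cost:.2f}")  # cheaper overall despite the higher DBU rate
```

The point of the exercise: because you pay per second of cluster uptime, anything that shortens runtime (Photon, better code, right-sized instances) usually wins even if its hourly rate is higher.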

Understanding how to optimize this equation is non-negotiable.

Cluster Configuration is Everything

Choosing the right cluster is an art and a science. A poorly configured cluster can be 10x more expensive and/or 10x slower than an optimized one.

  • SQL Warehouses vs. Jobs/All-Purpose Clusters:

    • SQL Warehouses: Use these for all your SQL analytics and BI tools (Tableau, Power BI, etc.). They are highly optimized for low-latency SQL queries and use the Photon engine by default. The Serverless option is fantastic because it has instant start-up times and you only pay for the query execution time, much like Snowflake. For any interactive/BI workload, Serverless SQL Warehouses are the goal.
    • Jobs Clusters: Use these for your scheduled, automated data pipelines (e.g., your Bronze-to-Silver ETL). These clusters are created just for the job run and terminate afterward, making them very cost-effective.
    • All-Purpose Clusters: Use these for interactive development in notebooks. They are more expensive than Jobs clusters because they are designed to stay on for long periods. A common mistake is leaving a large All-Purpose cluster running overnight. Always set an aggressive auto-termination (e.g., 30-60 minutes of inactivity).
  • Instance Types: Don't just pick the biggest machine.

    • Memory-Optimized: Good for shuffles, joins, and aggregations. Often a good default for ETL.
    • Compute-Optimized: Good for CPU-intensive tasks like complex string manipulations or ML model training.
    • Storage-Optimized: Useful if your job involves a lot of caching and spilling to disk (though you should try to avoid this by using more memory).
  • Photon Engine: Always use Photon. Photon is Databricks's native vectorized execution engine, rewritten in C++. It's a replacement for parts of Spark's core engine and provides massive speedups (I've seen 2-5x on average) for SQL and DataFrame operations. It's enabled by default on SQL Warehouses. Make sure you choose a Photon-enabled runtime for your Jobs and All-Purpose clusters. The DBU cost is slightly higher, but the performance gain almost always leads to a lower total cost because your cluster runs for a much shorter time.

Optimizing Your Delta Tables

Just like in a traditional database, how your data is physically organized matters.

  • Partitioning: Partition your large tables (>1TB) by a low-cardinality column that is frequently used in WHERE clauses (e.g., date, country). This allows Spark to perform "partition pruning," skipping massive amounts of data. Be careful not to over-partition (e.g., by user_id on a table with millions of users), as this creates millions of small files and cripples performance.
  • Z-Ordering: For high-cardinality columns used in filters, use Z-ORDER. This is a technique that co-locates related data within files. It's a multi-dimensional alternative to a single-column sort. A common pattern is to partition by date and then Z-Order by a high-cardinality ID.
    -- Run this periodically on your large tables
    OPTIMIZE my_gold_table
    ZORDER BY (customer_id, product_id);

  • OPTIMIZE and VACUUM: Because ETL jobs can create many small files, you need to periodically run the OPTIMIZE command to compact them into larger, more optimally sized files (around 1GB is a good target). VACUUM permanently removes old, unreferenced files from your storage, saving on storage costs and keeping your table history clean. These should be scheduled as part of your regular maintenance jobs.
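
Under the hood, Z-ordering maps multi-column values onto a one-dimensional space-filling (Morton) curve by interleaving the bits of each column, so rows that are close in several dimensions land in the same files. A toy illustration of the bit interleaving (this is the underlying idea, not Databricks's actual implementation, which also handles range-partitioning and file compaction):

```python
def morton_interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions <- x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions  <- y
    return z

# Points near each other in (x, y) get nearby Z-values, so sorting files
# by Z-value clusters them together, enabling better data skipping.
for x, y in [(0, 0), (1, 1), (8, 8), (9, 9)]:
    print((x, y), morton_interleave(x, y))
```

Note how (0, 0) and (1, 1) get adjacent Z-values, as do (8, 8) and (9, 9): a range filter on either column can skip whole files that contain only distant Z-values.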

The Cost Monitoring Feedback Loop

You must actively monitor your costs. Use the Databricks cost and usage dashboards. Look for:
* Expensive jobs: Which jobs are consuming the most DBUs? Can their clusters be downsized? Can the code be optimized?
* Idle All-Purpose clusters: Who is leaving clusters on? Enforce termination policies.
* Inefficient queries: Use the Query Profile in the SQL editor to find bottlenecks in your queries. Is a join causing a massive shuffle? Is there a missing filter that could leverage partitioning?

This feedback loop—run, monitor, optimize, repeat—is the key to managing costs effectively on Databricks.


5. GOVERNANCE AND SECURITY: EMBRACING THE LAKEHOUSE

Snowflake has a mature and straightforward Role-Based Access Control (RBAC) model. You create roles, grant them privileges on database objects (SELECT on a table, USAGE on a schema), and assign roles to users.

Databricks, historically, had a more fractured model. You had workspace permissions, cluster permissions, and table access control lists (TACLs). It was complicated and didn't provide a unified view.

Unity Catalog (UC) changes everything. If you are migrating to Databricks today, you should be migrating to a Unity Catalog-enabled workspace. It is not optional for a serious enterprise deployment.

Unity Catalog: The Unified Governance Layer

Unity Catalog provides a centralized governance solution for all your data and AI assets across all workspaces. It solves the historical governance challenges and should be the cornerstone of your security model.

  • Three-Level Namespace (catalog.schema.table): This provides a familiar structure for anyone coming from a traditional database world. It lets you organize your data logically (e.g., a dev catalog, a prod catalog).
  • ANSI SQL GRANT/REVOKE: The security model is standard SQL.
    -- Granting access to a catalog and schema (UC privileges: USE CATALOG / USE SCHEMA)
    GRANT USE CATALOG ON CATALOG prod_catalog TO `data_analysts`;
    GRANT USE SCHEMA ON SCHEMA prod_catalog.gold TO `data_analysts`;

    -- Granting select on a specific table
    GRANT SELECT ON TABLE prod_catalog.gold.daily_sales_summary TO `data_analysts`;

  • Centralized Auditing: All actions performed against data registered in Unity Catalog are logged, giving you a complete audit trail.
  • Built-in Lineage: UC automatically captures column-level lineage for all queries and jobs run on Databricks. You can visualize how a column in a Gold table was derived, all the way back to the Bronze layer. This is incredibly powerful for impact analysis and debugging.
  • Data Sharing (Delta Sharing): UC has a built-in, open protocol for sharing live data with external partners, regardless of what platform they use. This is Databricks's answer to Snowflake's Secure Data Sharing and is a major enterprise feature.

Practical Implementation Strategy

  1. Set up your Metastore: Create one Unity Catalog metastore per region and link it to all workspaces in that region.
  2. Define your Catalogs: A good starting point is to create catalogs based on your environment or business unit (e.g., dev, staging, prod or sales_catalog, finance_catalog).
  3. Map Snowflake Roles to Databricks Groups: Re-create your persona-based roles (e.g., data_analysts, data_engineers, ml_scientists) as Groups in Databricks. It's much easier to manage permissions for groups than for individual users.
  4. Use External Locations and Storage Credentials: To allow UC to manage data in your cloud storage, you register Storage Credentials (which encapsulate an IAM Role or Managed Identity) and External Locations (a pointer to an S3/ADLS path). This is the secure way to grant Databricks access to your data lake without scattering access keys everywhere.
  5. Migrate your GRANT statements: Systematically go through your Snowflake GRANT statements and replicate the permissions using the UC model. This is often a good time to clean up and simplify your permission model.
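
Replicating hundreds of Snowflake grants by hand is error-prone, so it's worth generating the Unity Catalog statements from a role-to-privilege mapping (for example, exported from Snowflake's SHOW GRANTS output). A hypothetical sketch; the group names, catalogs, and mapping format are invented for illustration:

```python
# group -> list of (privilege, securable) pairs; all names are hypothetical
role_grants = {
    "data_analysts": [
        ("USE CATALOG", "CATALOG prod_catalog"),
        ("USE SCHEMA", "SCHEMA prod_catalog.gold"),
        ("SELECT", "TABLE prod_catalog.gold.daily_sales_summary"),
    ],
    "data_engineers": [
        ("ALL PRIVILEGES", "CATALOG prod_catalog"),
    ],
}

def generate_grants(mapping: dict) -> list:
    """Emit one Unity Catalog GRANT statement per (privilege, securable) pair."""
    return [
        f"GRANT {priv} ON {securable} TO `{group}`;"
        for group, grants in mapping.items()
        for priv, securable in grants
    ]

for stmt in generate_grants(role_grants):
    print(stmt)
```

Generating the statements from one source of truth also gives you a reviewable artifact: the mapping file can live in version control and be diffed on every permission change.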

Don't try to manage security without Unity Catalog. The older, table-ACL model is considered legacy and will not provide the enterprise-grade features you need.


Case Study: FinOptiq's Migration Journey

To put this all together, let's look at a real (but anonymized) case study of a fintech company, "FinOptiq."

The Starting Point (Snowflake):
* Workload: Primarily BI and analytics on transaction and market data.
* Architecture: A large PROD database with hundreds of tables in a few schemas. A mix of raw JSON data in VARIANT columns and structured tables.
* Pain Points:
1. Cost: Snowflake credit consumption was rising unpredictably, especially from data science teams running heavy, ad-hoc queries.
2. ML/AI Silo: The data science team was pulling data out of Snowflake into local Python environments to build fraud detection models, creating a data governance nightmare.
3. Data Sharing: Sharing curated datasets with partners was cumbersome, involving secure views or data extracts.

The Migration to Databricks: Applying the 5 Considerations

  1. Architecture Rethink: FinOptiq didn't "lift-and-shift." They committed to the Medallion Architecture in Unity Catalog.

    • bronze catalog: All source data landed here via Auto Loader jobs. Raw transaction feeds were kept as Delta tables with the original JSON.
    • silver catalog: PySpark jobs parsed the bronze JSON, enforced data types, joined with reference data, and created clean entity tables like accounts, transactions, and customers.
    • gold catalog: Business-level tables were created for BI (e.g., monthly_account_summary) and for the ML team (e.g., a transaction_features table).
  2. Code Translation: The biggest effort was rewriting dozens of complex Snowflake stored procedures used for nightly settlement and reporting.

    • These were re-implemented as multi-task Databricks Jobs running parameterized PySpark notebooks.
    • The MERGE statement was used extensively to update their silver and gold tables.
    • They identified several slow Python UDFs and replaced them with native Spark SQL functions, gaining a 10x performance improvement on those specific steps.
  3. Ingestion & Orchestration:

    • All Snowpipe ingestion was replaced with streaming Auto Loader jobs, scheduled to run every 15 minutes using the availableNow trigger. This gave them near real-time data with controllable costs.
    • They chose Databricks Jobs to orchestrate their Medallion pipelines, finding DLT was a bit too much of a paradigm shift initially. They created a standard "Bronze-Silver-Gold" multi-task job for each data source.
  4. Performance & Cost Management: Initially, their costs were higher than expected.

    • The Fix: They instituted a strict "right tool for the job" policy. BI dashboards were moved from All-Purpose clusters to a Serverless SQL Warehouse, which dramatically cut idle compute costs. Data engineering pipelines were moved to Jobs clusters.
    • They ran OPTIMIZE and Z-ORDER on their largest Silver tables (transactions) weekly. This cut query times for their fraud analytics team from minutes to seconds.
    • They used the cluster and cost monitoring dashboards to identify and downsize over-provisioned clusters used by the data science team.
  5. Governance & Security: They went all-in on Unity Catalog.

    • They mapped their Snowflake roles to Databricks groups (e.g., BI_TEAM, FRAUD_ANALYTICS_DS, DATA_ENGINEERS).
    • GRANT statements were used to give the BI_TEAM SELECT access only on the gold catalog. The FRAUD_ANALYTICS_DS team got SELECT on specific silver and gold tables. DATA_ENGINEERS had write access to all catalogs.
    • The biggest win was using Delta Sharing. They could now securely share a live partner_settlement_view table from their Gold catalog directly with a partner's Databricks workspace (or even a Power BI user with an open-source connector), eliminating all ETL and data extracts.

The Result: After a challenging but structured six-month migration, FinOptiq achieved its goals. Their total cost of ownership was reduced by about 20% (after initial optimization efforts). More importantly, their fraud model deployment time went from weeks to days because the ML team could now build, train, and deploy directly on the same platform where the data lived, with full governance and lineage provided by Unity Catalog and MLflow.

The journey from Snowflake to Databricks is complex, but by focusing on these five key areas, you can avoid the common pitfalls and unlock the true potential of a unified data and AI platform. It’s less about swapping one technology for another and more about evolving your team's approach to the entire data lifecycle.