So you're thinking about moving from Snowflake to Databricks. It's a journey many teams are considering, often driven by the desire to unify data warehousing with AI/ML workloads, gain more control over the underlying storage, or optimize costs for specific usage patterns. Having navigated this migration path with multiple teams, I can tell you it's much more than a simple platform swap. It's a paradigm shift.
Forget the marketing slides for a moment. The core difference is this: Snowflake is a managed SQL data warehouse-as-a-service that brilliantly separates storage and compute. Databricks is a unified data and AI platform built on top of open-source technologies like Apache Spark, Delta Lake, and MLflow, running on your cloud storage. This distinction is the source of both its power and its complexity.
Successfully migrating requires you to unlearn some habits and embrace a new way of thinking about your data's lifecycle. Here are the top five practical, in-the-weeds considerations that will make or break your migration.
1. RETHINKING YOUR DATA ARCHITECTURE: IT'S NOT JUST A LIFT-AND-SHIFT
The single biggest mistake I see is teams trying to replicate their Snowflake environment one-for-one in Databricks. This "lift-and-shift" approach is a recipe for high costs and poor performance. In Snowflake, you likely have databases, schemas, and tables. You think in terms of virtual warehouses (T-shirt sizes like X-Small, Large) that you spin up to run SQL queries. It's clean, abstracted, and very SQL-centric.
Databricks requires you to think closer to the metal. You own the storage layer (e.g., AWS S3, Azure ADLS Gen2), and the architecture should reflect that.
From Warehouse-on-Cloud to the Lakehouse Paradigm
Your Snowflake mental model is one of structured data sitting in a proprietary, highly optimized format, queried by a managed compute layer. Databricks flips this. Your data lives in an open format (Delta Lake) on your own cloud object storage. This is the "Lakehouse" concept in practice: data warehouse-like performance and reliability directly on your data lake.
This has profound implications.
- Storage is Your Responsibility: In Snowflake, you load data and it's managed. In Databricks, you are creating files (Parquet files, by default, wrapped in a Delta transaction log) in your own S3 bucket or ADLS container. This means you need to think about file layouts, partitioning, and managing small files—things Snowflake completely abstracts away.
- Compute is More Granular: Instead of a "Large" warehouse, you define a Databricks cluster. You choose the VM instance types (e.g., memory-optimized, compute-optimized), the number of workers, and whether it auto-scales. This is powerful but can be daunting.
Here’s a practical breakdown of the conceptual mapping:
| Snowflake Concept | Databricks Equivalent(s) | Key Consideration |
|---|---|---|
| Account | Databricks Workspace/Account | A workspace is the environment where you work. An account can manage multiple workspaces. |
| Database / Schema | Unity Catalog: Catalog / Schema | Unity Catalog provides a three-level namespace (catalog.schema.table) that feels very familiar to Snowflake users. This is the recommended approach. |
| Virtual Warehouse | Databricks SQL Warehouse (Serverless or Pro) or All-Purpose/Jobs Cluster | This is the most critical distinction. Use SQL Warehouses for BI/SQL analytics. Use All-Purpose/Jobs clusters for data engineering (Spark jobs). Don't use an expensive All-Purpose cluster for your Tableau dashboard! |
| Table | Delta Table (managed or external) | A Delta table is a collection of Parquet files with a transaction log. This is the heart of the Lakehouse. |
| VARIANT Data Type | STRUCT, MAP, ARRAY, or a JSON string | Databricks encourages parsing semi-structured data into strongly-typed nested structures during ingestion for better performance, unlike Snowflake's optimized VARIANT type. |
| Snowpipe | Auto Loader | Auto Loader is Databricks's powerful, scalable service for incremental data ingestion from cloud storage. It's more configurable than Snowpipe. |
| Streams & Tasks | Delta Live Tables (DLT) or Structured Streaming + Databricks Jobs | DLT is a declarative framework for building reliable data pipelines. It's a higher-level abstraction than Snowflake's Streams/Tasks. |
| Clustering Key | Z-ORDER BY | Z-Ordering colocates related information in the same set of files. It's similar in concept to Snowflake's clustering keys, but operates multi-dimensionally on the files in your cloud storage. |
The Medallion Architecture in Practice
Because you manage the data lifecycle, a structured architectural pattern is essential. The most successful migrations I've been a part of have rigorously adopted the Medallion Architecture. Don't treat this as optional marketing fluff; it's a foundational pattern for success on Databricks.
- Bronze Layer (Raw): This is where your raw data lands, as-is, from source systems. Think of it as your immutable staging area. The goal is to capture the data quickly and cheaply. Data is often stored in its original format or as Delta tables with the raw payload (e.g., a single column of JSON or CSV). You never want to have to re-ingest from a source system. Your Bronze layer is your source of truth.
- Silver Layer (Cleansed, Conformed): Here, you take the raw data from Bronze and apply basic quality rules, cleansing, and transformations. Data is parsed from JSON into columns, joined with reference data, and conformed into an enterprise view. Your Silver tables should mirror the structure of your source systems but be clean and queryable. This is often where business analysts and data scientists start their exploration.
- Gold Layer (Aggregated, Curated): This layer contains your highly refined, business-level aggregates. These are your final reporting tables, feature stores for ML, or data marts. Gold tables are optimized for specific business use cases, often denormalized and aggregated to provide answers to key business questions quickly. Your BI tools should almost exclusively point to Gold tables.
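To make the three layers concrete, here is a deliberately tiny, plain-Python sketch of the Bronze-to-Silver-to-Gold flow. The payloads and field names are invented for illustration; in practice each step would be a Spark/Delta pipeline, not in-memory Python.

```python
import json

# Bronze: raw payloads captured as-is (here, JSON strings)
bronze = [
    '{"order_id": 1, "customer": "a", "total": "700"}',
    '{"order_id": 2, "customer": "a", "total": "400"}',
    '{"order_id": 3, "customer": "b", "total": null}',
]

# Silver: parsed, typed, and cleansed (rows failing quality rules are dropped)
silver = []
for raw in bronze:
    rec = json.loads(raw)
    if rec["total"] is not None:            # basic quality rule
        rec["total"] = float(rec["total"])  # enforce types
        silver.append(rec)

# Gold: business-level aggregate (total spend per customer)
gold = {}
for rec in silver:
    gold[rec["customer"]] = gold.get(rec["customer"], 0.0) + rec["total"]

print(gold)  # {'a': 1100.0}
```

Note how the raw null-total record survives in Bronze (you never re-ingest from source) but is filtered before Silver, and how Gold holds only the aggregate your dashboard actually needs.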
Real-World Scenario: A team I worked with tried a "lift-and-shift" from their PROD schema in Snowflake. They dumped all 300 tables into a single schema in Databricks and pointed their jobs at it. It was chaos. Jobs failed due to data quality issues that were previously hidden, query performance was terrible because nothing was optimized, and lineage was impossible to track.
The fix was to re-architect. We mapped their source ingestion into Bronze tables. We then built Silver tables that represented their core business entities (customers, orders, products). Finally, we rebuilt their key BI dashboards on top of new Gold tables, like daily_sales_summary and customer_360_view. The upfront effort was significant, but it made the system reliable, performant, and understandable.
2. TRANSLATING THE CODE: BEYOND SQL DIALECTS
This is where the rubber meets the road. A significant portion of your migration effort will be in translating your Snowflake logic into a Databricks-compatible format. While both platforms support ANSI SQL, the devil is in the details—the proprietary functions, procedural logic, and underlying execution models are worlds apart.
SQL Functions and Dialect Differences
You will spend a non-trivial amount of time hunting down Snowflake-specific functions and finding Databricks equivalents. Simple ones are easy (GETDATE() -> current_timestamp()), but others are more complex.
Here's a list of common "gotchas":
- Semi-structured Data Handling: Snowflake's `VARIANT` type with dot-notation/bracket-notation for parsing JSON is incredibly convenient.
  - Snowflake SQL: `SELECT raw_payload:customer.name FROM my_table;`
  - Databricks SQL: `SELECT raw_payload.customer.name FROM my_table;` (if `raw_payload` is a `STRUCT`) or `SELECT get_json_object(raw_payload_string, '$.customer.name') FROM my_table;`
  - Practitioner's Tip: The best practice in Databricks is to define a schema and parse your JSON into a `STRUCT` type during ingestion (e.g., in your Bronze-to-Silver pipeline). Querying a `STRUCT` is dramatically faster than parsing a JSON string on the fly with `get_json_object`.
- `QUALIFY ROW_NUMBER()`: Snowflake's `QUALIFY` clause is a powerful and elegant way to filter on window functions. Databricks SQL added `QUALIFY` support in recent runtimes (Databricks Runtime 10.0+), so modern workspaces can often run these queries as-is; on older runtimes, or in open-source Spark SQL, you need to fall back to a subquery or a Common Table Expression (CTE).
  - Snowflake SQL:
    SELECT * FROM events
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) = 1;
  - Databricks SQL (portable equivalent):
    SELECT * FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) AS rn
      FROM events
    )
    WHERE rn = 1;
- Sequence Generators (`AUTOINCREMENT` / `SEQUENCE`): Snowflake makes generating unique surrogate keys simple. Databricks, being a distributed system operating on files, doesn't have a native, scalable equivalent for gap-free sequences.
  - Snowflake: `CREATE TABLE t (id INT AUTOINCREMENT, ...)`
  - Databricks: Delta Lake now supports `IDENTITY` columns, which work similarly but come with caveats about performance in very high-volume writes: `CREATE TABLE t (id BIGINT GENERATED BY DEFAULT AS IDENTITY, ...)`
  - Alternative Pattern: Often, the better pattern in a distributed world is to use a composite natural key or generate a UUID using a function like `uuid()`. Relying on a globally ordered sequence can become a bottleneck.
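As a sketch of those two alternatives, here are the same ideas in plain Python: random UUIDs (no coordination between writers) versus a deterministic key hashed from a composite natural key (idempotent on re-processing). The key parts and truncation length are illustrative choices, not a Databricks API.

```python
import uuid
import hashlib

# Option 1: random UUIDs — each writer generates keys independently,
# so no coordination (and no global sequence bottleneck) is needed
row_id = str(uuid.uuid4())
assert len(row_id) == 36

# Option 2: deterministic key derived from a composite natural key,
# so re-processing the same source row yields the same surrogate key
def surrogate_key(*natural_key_parts: str) -> str:
    joined = "||".join(natural_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

k1 = surrogate_key("cust_42", "2024-01-01", "order_9")
k2 = surrogate_key("cust_42", "2024-01-01", "order_9")
assert k1 == k2  # stable across reruns, unlike a consumed sequence value
```

The deterministic option is particularly useful in Bronze-to-Silver pipelines that may replay files: the same input always maps to the same key, so `MERGE` upserts stay idempotent.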
Stored Procedures: The Spark Mindset Shift
This is a major architectural pivot. In Snowflake, you write Stored Procedures in JavaScript, SQL (Scripting), or even bring in Java/Python code. They are executed transactionally on the warehouse. They are a common way to encapsulate complex business logic.
Trying to replicate a 500-line imperative stored procedure in Databricks SQL is a path of pain. Databricks's native procedural language is Apache Spark, most commonly used via the Python (PySpark) or Scala APIs.
Your stored procedures should not be translated to Databricks SQL procedures. They should be re-implemented as Spark jobs, organized within Databricks Notebooks or Python scripts.
Example: A Complex Customer Update Procedure
Imagine a Snowflake stored procedure that does the following:
1. Takes a customer_id.
2. Looks up recent orders.
3. Calls a UDF to calculate a new loyalty score.
4. Fetches the latest address from another table.
5. Updates the master customers table in a transaction.
The Snowflake Way (Conceptual JS Stored Proc):
CREATE OR REPLACE PROCEDURE update_customer_loyalty(CUSTOMER_ID FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS $$
// Imperative logic: SELECT, call UDF, UPDATE
var sql_get_orders = `SELECT ... WHERE customer_id = ${CUSTOMER_ID}`;
// ... run statements, process results, build UPDATE statement ...
snowflake.execute({sqlText: "UPDATE customers SET ..."});
return "SUCCESS";
$$;
The Databricks Way (PySpark Notebook/Job):
This is a declarative, functional approach. You define data transformations, not imperative steps.
# A Databricks Notebook that can be parameterized with widgets
dbutils.widgets.text("customer_id", "0")
customer_id = dbutils.widgets.get("customer_id")
# 1. Read the necessary data into DataFrames
customers_df = spark.table("silver.customers")
orders_df = spark.table("silver.orders")
# 2. Define the transformations (this could include UDFs, but native functions are better)
from pyspark.sql.functions import col, sum, when
# Filter for the relevant customer and calculate new score
customer_orders = orders_df.filter(col("customer_id") == customer_id)
new_loyalty_score_df = customer_orders.groupBy("customer_id").agg(
sum("order_total").alias("total_spend")
).withColumn("loyalty_score",
when(col("total_spend") > 1000, "Gold")
.when(col("total_spend") > 500, "Silver")
.otherwise("Bronze")
)
# 3. Use Delta Lake's MERGE to update the target table atomically
# This is the idiomatic way to perform an "upsert" or "update" in Databricks.
# Note: MERGE is performed through a DeltaTable handle, not a plain DataFrame.
from delta.tables import DeltaTable

customers_table = DeltaTable.forName(spark, "silver.customers")
(customers_table.alias("c")
    .merge(
        new_loyalty_score_df.alias("s"),
        "c.customer_id = s.customer_id"
    )
    .whenMatchedUpdate(set = {
        "loyalty_score": "s.loyalty_score",
        "last_updated_ts": "current_timestamp()"
    })
    .execute()
)
The mindset shift is from row-by-row, imperative processing to set-based, declarative transformations on entire DataFrames. The MERGE statement is your best friend for migrating complex UPDATE logic.
User-Defined Functions (UDFs): A Performance Minefield
Snowflake makes it easy to write UDFs in various languages. Databricks also supports UDFs (Python and Scala), but they come with a huge performance warning. When you use a Python UDF in a Spark job, the data has to be serialized from the Spark JVM, sent to a Python process, computed, and then serialized back. This overhead can kill performance.
The rule of thumb: Avoid UDFs whenever possible. 95% of what you do in a UDF can be accomplished with Spark's built-in functions in the pyspark.sql.functions module. These functions operate directly on the JVM data and are orders of magnitude faster.
If you absolutely must use a UDF for complex logic that can't be expressed otherwise, consider using a Pandas UDF (Vectorized UDF). These operate on Apache Arrow data structures and process data in batches (as Pandas Series or DataFrames), significantly reducing the serialization overhead.
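The serialization argument can be made concrete with a toy plain-Python sketch: the same logic applied one row at a time (each call standing in for a JVM-to-Python round-trip, as with an ordinary Python UDF) versus once per batch (the Pandas/vectorized UDF model). Everything here is illustrative; no Spark is involved.

```python
def per_row_udf(x):
    # imagine each call paying a serialization round-trip across the
    # JVM <-> Python boundary
    return x * 2

def batch_udf(batch):
    # one "round-trip" for the whole batch, as with a Pandas UDF
    # operating on an Arrow-backed Series
    return [x * 2 for x in batch]

data = list(range(10))

row_result = [per_row_udf(x) for x in data]  # 10 boundary crossings
batch_result = batch_udf(data)               # 1 boundary crossing

assert row_result == batch_result == [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Same answer, vastly different overhead at scale: with millions of rows, the per-row model pays the crossing cost millions of times, while the batched model pays it once per partition batch.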
3. ORCHESTRATION AND INGESTION: THE NEW DATA FLOW
How data gets into your platform and how your transformation jobs are triggered is fundamentally different. In Snowflake, you likely rely on a combination of Snowpipe for ingestion and Tasks for scheduling, possibly orchestrated by an external tool like Airflow.
Databricks offers a more integrated but also more fragmented ecosystem. You have to make conscious choices about which tools to use for which job.
Ingestion: From Snowpipe to Auto Loader
Snowpipe is excellent for its simplicity: point it at an S3/ADLS stage, and it loads new files automatically.
Auto Loader is Databricks's answer, and it's more powerful and configurable. It can run in two modes:
* Directory Listing Mode: Good for smaller numbers of files. It lists the source directory to find new files.
* File Notification Mode: The recommended, most scalable mode. It subscribes to file event notifications from your cloud provider (e.g., S3 Event Notifications -> SQS queue). This is more efficient than constantly listing directories with millions of files.
A typical Auto Loader pattern in PySpark looks like this:
# Set up the stream read
streaming_df = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json") # The format of the source files
.option("cloudFiles.schemaLocation", "/path/to/schema/location") # Auto Loader tracks state here
.load("/path/to/raw/data/in/s3/")
)
# Apply transformations
from pyspark.sql.functions import current_timestamp

processed_df = streaming_df.withColumn("ingest_time", current_timestamp())
# Write the stream to a Delta table
(processed_df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/path/to/checkpoint/location") # Critical for fault-tolerance
.trigger(availableNow=True) # Run as a batch job, process all available data, then stop
.toTable("bronze.my_raw_table")
)
Key Differences & Advantages of Auto Loader:
* Schema Inference and Evolution: Auto Loader can automatically infer your schema and, more importantly, handle schema drift gracefully. It can notify you of changes or rescue unexpected data into a separate column.
* Exactly-Once Processing: By using checkpoint locations, Auto Loader guarantees that each file is processed exactly once, even if the job fails and restarts.
* Cost-Effective: You run Auto Loader on a Databricks Job cluster, which you only pay for when it's running. You can schedule it to run every 5 minutes or every hour, controlling your cost-latency tradeoff, unlike Snowpipe's more opaque credit consumption model.
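The exactly-once guarantee boils down to durable bookkeeping: files already recorded in the checkpoint are never reprocessed, even after a crash and restart. A minimal plain-Python sketch of that idea (Auto Loader persists this state in the `checkpointLocation`; here it is just an in-memory set, and the file names are invented):

```python
processed = set()  # stands in for the durable checkpoint state

def run_batch(available_files):
    """Process only files not yet recorded in the checkpoint."""
    newly_done = []
    for f in sorted(available_files):
        if f not in processed:
            # ... load the file into the Bronze table here ...
            processed.add(f)
            newly_done.append(f)
    return newly_done

# First run sees two files and processes both
assert run_batch({"a.json", "b.json"}) == ["a.json", "b.json"]

# After a restart the source re-lists the same files plus one new one:
# already-checkpointed files are skipped, so each file is loaded once
assert run_batch({"a.json", "b.json", "c.json"}) == ["c.json"]
```

This is also why deleting a checkpoint location is dangerous: the stream forgets what it has seen and will happily reprocess everything.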
Orchestration: Delta Live Tables vs. Databricks Jobs vs. External Tools
Snowflake's Tasks provide a basic, cron-like scheduling mechanism within the database. Databricks offers more sophisticated options.
- Databricks Jobs: This is the core orchestrator. You can schedule a notebook, a Python script, or a dbt Core project to run on a schedule. It handles cluster creation/termination, parameter passing, and alerting. You will use this for the majority of your batch pipelines. A typical pattern is to have a multi-task job: Task 1 ingests data with Auto Loader, Task 2 runs Silver transformations, Task 3 builds Gold aggregates.
- Delta Live Tables (DLT): This is a game-changer for new projects and a strong consideration for your migration. DLT is a declarative pipeline framework. Instead of defining a DAG of tasks, you define your tables (as SQL or Python functions) and specify their dependencies using `LIVE` references. DLT handles the orchestration, data quality checks, error handling, and auto-scaling for you.
  Conceptual DLT Pipeline (SQL):
-- Bronze table definition (ingesting from cloud storage)
CREATE OR REFRESH STREAMING LIVE TABLE raw_orders
COMMENT "Raw orders from the cloud"
AS SELECT * FROM cloud_files('/path/to/orders', 'json');
-- Silver table definition with quality constraints
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Cleansed and conformed orders"
AS SELECT
order_id,
customer_id,
...
FROM STREAM(LIVE.raw_orders);
-- Gold table definition (a batch aggregation)
CREATE LIVE TABLE daily_customer_summary
COMMENT "Aggregated daily sales by customer"
AS SELECT
customer_id,
CAST(order_timestamp AS DATE) as order_date,
SUM(order_total) as total_sales
FROM LIVE.clean_orders
GROUP BY customer_id, order_date;
DLT automatically builds the dependency graph, manages micro-batch or continuous streaming, and provides a rich UI for monitoring pipeline health and data lineage. For complex, streaming-first pipelines, migrating to DLT over a traditional Jobs/Notebooks approach can save immense development and maintenance effort.
- External Orchestrators (e.g., Airflow): If your organization already has a heavy investment in a tool like Airflow, you don't have to abandon it. Databricks has a robust REST API and an official Airflow provider. You can trigger Databricks jobs from your Airflow DAGs. The common pattern is to keep high-level orchestration in Airflow (e.g., `s3_sensor >> databricks_job >> slack_notification`) but let Databricks handle the internal multi-task orchestration for better performance and integration.
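The dependency resolution DLT performs on those `LIVE` references can be sketched with a topological sort: each table declares its inputs, and a valid execution order falls out automatically. This mirrors the conceptual DLT pipeline above; the graph itself is illustrative, not a DLT API.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of LIVE tables it reads from
deps = {
    "raw_orders": set(),                        # ingests from cloud storage
    "clean_orders": {"raw_orders"},             # FROM STREAM(LIVE.raw_orders)
    "daily_customer_summary": {"clean_orders"}, # FROM LIVE.clean_orders
}

# A topological ordering is a valid refresh order for the pipeline
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['raw_orders', 'clean_orders', 'daily_customer_summary']
```

With Databricks Jobs you wire this ordering up by hand as task dependencies; DLT derives it from the table definitions, which is exactly the "declarative" win.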
4. PERFORMANCE AND COST: TAMING THE SPARK BEAST
This is often the reason for migrating to Databricks, but it can also be a source of massive budget overruns if not managed carefully. In Snowflake, cost is predictable: you pay for your warehouse size and how long it runs, plus storage. In Databricks, you pay for the cloud VMs in your cluster for every second they are active, plus a "Databricks Unit" (DBU) charge on top, which varies by the VM type and Databricks service tier.
Understanding how to optimize this equation is non-negotiable.
Cluster Configuration is Everything
Choosing the right cluster is an art and a science. A poorly configured cluster can be 10x more expensive and/or 10x slower than an optimized one.
- SQL Warehouses vs. Jobs/All-Purpose Clusters:
- SQL Warehouses: Use these for all your SQL analytics and BI tools (Tableau, Power BI, etc.). They are highly optimized for low-latency SQL queries and use the Photon engine by default. The Serverless option is fantastic because it has instant start-up times and you only pay for the query execution time, much like Snowflake. For any interactive/BI workload, Serverless SQL Warehouses are the goal.
- Jobs Clusters: Use these for your scheduled, automated data pipelines (e.g., your Bronze-to-Silver ETL). These clusters are created just for the job run and terminate afterward, making them very cost-effective.
- All-Purpose Clusters: Use these for interactive development in notebooks. They are more expensive than Jobs clusters because they are designed to stay on for long periods. A common mistake is leaving a large All-Purpose cluster running overnight. Always set an aggressive auto-termination (e.g., 30-60 minutes of inactivity).
- Instance Types: Don't just pick the biggest machine.
- Memory-Optimized: Good for shuffles, joins, and aggregations. Often a good default for ETL.
- Compute-Optimized: Good for CPU-intensive tasks like complex string manipulations or ML model training.
- Storage-Optimized: Useful if your job involves a lot of caching and spilling to disk (though you should try to avoid this by using more memory).
- Photon Engine: Always use Photon. Photon is Databricks's native vectorized execution engine, rewritten in C++. It replaces parts of Spark's core engine and provides massive speedups (I've seen 2-5x on average) for SQL and DataFrame operations. It's enabled by default on SQL Warehouses. Make sure you choose a Photon-enabled runtime for your Jobs and All-Purpose clusters. The DBU cost is slightly higher, but the performance gain almost always leads to a lower total cost because your cluster runs for a much shorter time.
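That "higher rate, shorter run" tradeoff is simple arithmetic. A back-of-the-envelope sketch with made-up numbers (these are NOT real Databricks or cloud prices, and the 2x speedup is an assumption):

```python
def job_cost(runtime_hours, vm_per_hour, dbu_rate, dbu_per_hour):
    """Total run cost = cloud VM cost + Databricks DBU cost."""
    return runtime_hours * (vm_per_hour + dbu_rate * dbu_per_hour)

# Same job, assuming a 2x Photon speedup at a ~20% higher DBU rate
without_photon = job_cost(runtime_hours=4.0, vm_per_hour=2.0,
                          dbu_rate=0.15, dbu_per_hour=10)
with_photon = job_cost(runtime_hours=2.0, vm_per_hour=2.0,
                       dbu_rate=0.18, dbu_per_hour=10)

print(without_photon, with_photon)  # 14.0 7.6
assert with_photon < without_photon
```

Plug in your own measured runtimes and rates: as long as the speedup outweighs the rate premium, Photon lowers total cost, because you stop paying for the VMs sooner too.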
Optimizing Your Delta Tables
Just like in a traditional database, how your data is physically organized matters.
- Partitioning: Partition your large tables (>1TB) by a low-cardinality column that is frequently used in `WHERE` clauses (e.g., `date`, `country`). This allows Spark to perform "partition pruning," skipping massive amounts of data. Be careful not to over-partition (e.g., by `user_id` on a table with millions of users), as this creates millions of small files and cripples performance.
- Z-Ordering: For high-cardinality columns used in filters, use `Z-ORDER`. This is a technique that co-locates related data within files. It's a multi-dimensional alternative to a single-column sort. A common pattern is to partition by date and then Z-Order by a high-cardinality ID.
-- Run this periodically on your large tables
OPTIMIZE my_gold_table
ZORDER BY (customer_id, product_id);
- OPTIMIZE and VACUUM: Because ETL jobs can create many small files, you need to periodically run the `OPTIMIZE` command to compact them into larger, more optimally sized files (around 1GB is a good target). `VACUUM` permanently removes old, unreferenced files from your storage, saving on storage costs and keeping your table history clean. Both should be scheduled as part of your regular maintenance jobs.
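Conceptually, the compaction side of `OPTIMIZE` is a bin-packing problem: group many small files into outputs near the target size. A toy plain-Python sketch of that planning step (real `OPTIMIZE` also rewrites the data and can apply Z-Ordering; sizes here are in MB and purely illustrative):

```python
TARGET_MB = 1024  # roughly the 1GB target file size mentioned above

def plan_compaction(file_sizes_mb):
    """Greedily group small files into bins of about TARGET_MB each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > TARGET_MB:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Twenty 64MB files, the typical debris of streaming micro-batches
plan = plan_compaction([64] * 20)
print([sum(b) for b in plan])  # [1024, 256]
```

Twenty reads become two: that reduction in per-file open/list overhead is where most of the query speedup from compaction comes from.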
The Cost Monitoring Feedback Loop
You must actively monitor your costs. Use the Databricks cost and usage dashboards. Look for:
* Expensive jobs: Which jobs are consuming the most DBUs? Can their clusters be downsized? Can the code be optimized?
* Idle All-Purpose clusters: Who is leaving clusters on? Enforce termination policies.
* Inefficient queries: Use the Query Profile in the SQL editor to find bottlenecks in your queries. Is a join causing a massive shuffle? Is there a missing filter that could leverage partitioning?
This feedback loop—run, monitor, optimize, repeat—is the key to managing costs effectively on Databricks.
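The "monitor" step of that loop is mostly aggregation: roll up DBU usage per job and rank the consumers. A minimal sketch with invented field names and numbers (in practice this data would come from the Databricks cost/usage dashboards or billing exports):

```python
from collections import defaultdict

# Per-run usage records — illustrative data, not a real export format
runs = [
    {"job": "bronze_ingest", "dbus": 12.0},
    {"job": "silver_transform", "dbus": 45.5},
    {"job": "bronze_ingest", "dbus": 11.0},
    {"job": "gold_aggregates", "dbus": 8.25},
]

totals = defaultdict(float)
for run in runs:
    totals[run["job"]] += run["dbus"]

# Rank jobs by total DBU consumption, most expensive first
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])  # ('silver_transform', 45.5) — first candidate to optimize
```

Running a ranking like this weekly, then downsizing or optimizing the top entry, is the whole feedback loop in miniature.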
5. GOVERNANCE AND SECURITY: EMBRACING THE LAKEHOUSE
Snowflake has a mature and straightforward Role-Based Access Control (RBAC) model. You create roles, grant them privileges on database objects (SELECT on a table, USAGE on a schema), and assign roles to users.
Databricks, historically, had a more fractured model. You had workspace permissions, cluster permissions, and table access control lists (TACLs). It was complicated and didn't provide a unified view.
Unity Catalog (UC) changes everything. If you are migrating to Databricks today, you should be migrating to a Unity Catalog-enabled workspace. It is not optional for a serious enterprise deployment.
Unity Catalog: The Unified Governance Layer
Unity Catalog provides a centralized governance solution for all your data and AI assets across all workspaces. It solves the historical governance challenges and should be the cornerstone of your security model.
- Three-Level Namespace (`catalog.schema.table`): This provides a familiar structure for anyone coming from a traditional database world. It lets you organize your data logically (e.g., a `dev` catalog and a `prod` catalog).
- ANSI SQL `GRANT`/`REVOKE`: The security model is standard SQL.
-- Granting usage on a catalog and schema
GRANT USAGE ON CATALOG prod_catalog TO `data_analysts`;
GRANT USAGE ON SCHEMA prod_catalog.gold TO `data_analysts`;
-- Granting select on a specific table
GRANT SELECT ON TABLE prod_catalog.gold.daily_sales_summary TO `data_analysts`;
- Centralized Auditing: All actions performed against data registered in Unity Catalog are logged, giving you a complete audit trail.
- Built-in Lineage: UC automatically captures column-level lineage for all queries and jobs run on Databricks. You can visualize how a column in a Gold table was derived, all the way back to the Bronze layer. This is incredibly powerful for impact analysis and debugging.
- Data Sharing (Delta Sharing): UC has a built-in, open protocol for sharing live data with external partners, regardless of what platform they use. This is Databricks's answer to Snowflake's Secure Data Sharing and is a major enterprise feature.
Practical Implementation Strategy
- Set up your Metastore: Create one Unity Catalog metastore per region and link it to all workspaces in that region.
- Define your Catalogs: A good starting point is to create catalogs based on your environment or business unit (e.g., `dev`, `staging`, `prod` or `sales_catalog`, `finance_catalog`).
- Map Snowflake Roles to Databricks Groups: Re-create your persona-based roles (e.g., `data_analysts`, `data_engineers`, `ml_scientists`) as Groups in Databricks. It's much easier to manage permissions for groups than for individual users.
- Use External Locations and Storage Credentials: To allow UC to manage data in your cloud storage, you register Storage Credentials (which encapsulate an IAM Role or Managed Identity) and External Locations (pointers to S3/ADLS paths). This is the secure way to grant Databricks access to your data lake without scattering access keys everywhere.
- Migrate your `GRANT` statements: Systematically go through your Snowflake `GRANT` statements and replicate the permissions using the UC model. This is often a good time to clean up and simplify your permission model.
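Much of that last step can be mechanized: map each Snowflake role to its Databricks group and emit the corresponding Unity Catalog `GRANT` statement. A small sketch with illustrative role, group, and object names (the privilege names follow UC's SQL grammar):

```python
# Hypothetical role -> group mapping produced during the migration
role_to_group = {
    "ANALYST_ROLE": "data_analysts",
    "ENGINEER_ROLE": "data_engineers",
}

# Exported Snowflake grants: (role, privilege, object type, object name)
snowflake_grants = [
    ("ANALYST_ROLE", "SELECT", "TABLE", "prod_catalog.gold.daily_sales_summary"),
    ("ENGINEER_ROLE", "MODIFY", "SCHEMA", "prod_catalog.silver"),
]

def to_uc_grant(role, privilege, obj_type, obj_name):
    """Render one Snowflake grant as a Unity Catalog GRANT statement."""
    group = role_to_group[role]
    return f"GRANT {privilege} ON {obj_type} {obj_name} TO `{group}`;"

statements = [to_uc_grant(*g) for g in snowflake_grants]
for s in statements:
    print(s)
```

Generating the statements this way also gives you a reviewable artifact: diff the output against what's currently granted before running anything, and prune roles that map to nothing.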
Don't try to manage security without Unity Catalog. The older, table-ACL model is considered legacy and will not provide the enterprise-grade features you need.
Case Study: FinOptiq's Migration Journey
To put this all together, let's look at a real (but anonymized) case study of a fintech company, "FinOptiq."
The Starting Point (Snowflake):
* Workload: Primarily BI and analytics on transaction and market data.
* Architecture: A large PROD database with hundreds of tables in a few schemas. A mix of raw JSON data in VARIANT columns and structured tables.
* Pain Points:
1. Cost: Snowflake credit consumption was rising unpredictably, especially from data science teams running heavy, ad-hoc queries.
2. ML/AI Silo: The data science team was pulling data out of Snowflake into local Python environments to build fraud detection models, creating a data governance nightmare.
3. Data Sharing: Sharing curated datasets with partners was cumbersome, involving secure views or data extracts.
The Migration to Databricks: Applying the 5 Considerations
- Architecture Rethink: FinOptiq didn't "lift-and-shift." They committed to the Medallion Architecture in Unity Catalog.
  - `bronze` catalog: All source data landed here via Auto Loader jobs. Raw transaction feeds were kept as Delta tables with the original JSON.
  - `silver` catalog: PySpark jobs parsed the bronze JSON, enforced data types, joined with reference data, and created clean entity tables like `accounts`, `transactions`, and `customers`.
  - `gold` catalog: Business-level tables were created for BI (e.g., `monthly_account_summary`) and for the ML team (e.g., a `transaction_features` table).
- Code Translation: The biggest effort was rewriting dozens of complex Snowflake stored procedures used for nightly settlement and reporting.
  - These were re-implemented as multi-task Databricks Jobs running parameterized PySpark notebooks.
  - The `MERGE` statement was used extensively to update their `silver` and `gold` tables.
  - They identified several slow Python UDFs and replaced them with native Spark SQL functions, gaining a 10x performance improvement on those specific steps.
- Ingestion & Orchestration:
  - All Snowpipe ingestion was replaced with streaming Auto Loader jobs, scheduled to run every 15 minutes using the `availableNow` trigger. This gave them near real-time data with controllable costs.
  - They chose Databricks Jobs to orchestrate their Medallion pipelines, finding DLT a bit too much of a paradigm shift initially. They created a standard "Bronze-Silver-Gold" multi-task job for each data source.
- Performance & Cost Management: Initially, their costs were higher than expected.
  - The Fix: They instituted a strict "right tool for the job" policy. BI dashboards were moved from All-Purpose clusters to a Serverless SQL Warehouse, which dramatically cut idle compute costs. Data engineering pipelines were moved to Jobs clusters.
  - They ran `OPTIMIZE` and `Z-ORDER` on their largest Silver tables (`transactions`) weekly. This cut query times for their fraud analytics team from minutes to seconds.
  - They used the cluster and cost monitoring dashboards to identify and downsize over-provisioned clusters used by the data science team.
- Governance & Security: They went all-in on Unity Catalog.
  - They mapped their Snowflake roles to Databricks groups (e.g., `BI_TEAM`, `FRAUD_ANALYTICS_DS`, `DATA_ENGINEERS`).
  - `GRANT` statements gave the `BI_TEAM` `SELECT` access only on the `gold` catalog. The `FRAUD_ANALYTICS_DS` team got `SELECT` on specific `silver` and `gold` tables. `DATA_ENGINEERS` had write access to all catalogs.
  - The biggest win was Delta Sharing. They could now securely share a live `partner_settlement_view` table from their Gold catalog directly with a partner's Databricks workspace (or even a Power BI user with an open-source connector), eliminating all ETL and data extracts.
The Result: After a challenging but structured six-month migration, FinOptiq achieved its goals. Their total cost of ownership was reduced by about 20% (after initial optimization efforts). More importantly, their fraud model deployment time went from weeks to days because the ML team could now build, train, and deploy directly on the same platform where the data lived, with full governance and lineage provided by Unity Catalog and MLflow.
The journey from Snowflake to Databricks is complex, but by focusing on these five key areas, you can avoid the common pitfalls and unlock the true potential of a unified data and AI platform. It’s less about swapping one technology for another and more about evolving your team's approach to the entire data lifecycle.