1. Define Your "North Star": Business-Driven Migration Goals
A migration from Snowflake to Databricks is not a simple "like-for-like" swap. It's a fundamental architectural shift. The projects that succeed are those that start by asking "Why?" long before they ask "How?". Simply wanting to reduce costs is a weak driver; costs can be managed on any platform. A strong "North Star" is a business or technical capability you can't achieve effectively in your current state.
Successful drivers we’ve seen in the field include:
- Unifying Data Science and Data Engineering: A company’s data science team works in Python notebooks on sampled data, while the data engineering team builds ELT pipelines in Snowflake using SQL. The "North Star" is to create a single platform where both teams can collaborate on the same full-fidelity data, using the language of their choice (Python, Scala, R, or SQL), dramatically reducing model deployment friction.
- Enabling Real-Time Analytics: A logistics company relies on batch-based reporting in Snowflake, with data latency of several hours. They can't track fleet movements or optimize routes in real-time. The "North Star" is to build streaming applications using Delta Live Tables (DLT) to achieve sub-minute latency on critical dashboards and alerts.
- Controlling a Fragmented Data Stack: An organization uses Snowflake for warehousing, a separate tool for data ingestion (like Fivetran), another for orchestration (like Airflow), and yet another for machine learning (like SageMaker). The "North Star" is to consolidate these functions onto the Databricks Lakehouse Platform to simplify governance, reduce vendor sprawl, and lower the total cost of ownership (TCO) through architectural simplification.
Actionable Insight:
Hold a kickoff workshop with business leaders, not just IT. Frame the discussion around capabilities. Don't ask, "Do you want to move to Databricks?" Ask, "What if you could run predictive models directly against your transactional data as it arrives?" or "What if your product analytics team could build and test new features using the exact same data and logic as your BI reports?" The answers to these questions will define your true migration goals and build the executive support needed to weather the inevitable challenges.
2. Conduct a Deep Architectural Autopsy
Before you can plan the journey, you need a detailed map of your current estate. A superficial inventory is a recipe for scope creep and unforeseen roadblocks. You need to go beyond just listing tables and views; you must dissect the living, breathing organism that is your Snowflake environment.
The Inventory Checklist
Your team should create a comprehensive catalog covering:
- Core Data Objects:
- Databases, Schemas, Tables: Note table types (transient, temporary, permanent), sizes, and clustering keys.
- Views: Differentiate between standard and materialized views. Snowflake materialized views have no one-to-one counterpart; they typically map to Databricks SQL materialized views or are re-architected as aggregated tables in a Medallion architecture.
- Stages (Internal/External): Document their locations and what processes use them for loading/unloading data.
- Data Ingestion & Egress:
- Snowpipe: Catalog all pipes, their source locations (S3, Azure Blob), and target tables.
- COPY INTO: Identify all scripts and processes that use bulk loading.
- Connectors: List all BI tools (Tableau, Power BI), ETL tools (Fivetran, Matillion), and custom applications connecting via JDBC/ODBC.
- Data Sharing: Document all secure data shares, both inbound and outbound. This is a critical governance area.
- Code and Logic:
- Stored Procedures (JavaScript/Snowflake Scripting): These are often the most complex part of a migration. They require 100% manual refactoring into Python/Scala notebooks or Databricks SQL UDFs. You must analyze what they do: is it procedural ETL, complex business logic, or administrative tasks?
- User-Defined Functions (UDFs/UDTFs): Catalog their language (SQL, Java, JavaScript, Python) and usage frequency. Simple SQL UDFs are often portable; others are a full rewrite.
- Tasks: These are Snowflake's native scheduler. You need to map each task and its dependency chain to a Databricks Job.
- Streams: This is Snowflake’s Change Data Capture (CDC) mechanism. Map each stream to its source table and the downstream processes that consume it. This will likely be re-implemented using Delta Lake's `MERGE` capabilities or Delta Live Tables.
- Security & Governance:
- Roles & Grants: Export and analyze the entire role hierarchy. Who has access to what, and how?
- Row-level Security & Column-level Masking: Identify every policy and the tables they are applied to. This logic must be replicated in Databricks, likely using Unity Catalog.
- Performance & Cost:
- Warehouses: Document every virtual warehouse, its size (XS, S, M, etc.), auto-scaling configuration, and which user groups or processes use it. Query `QUERY_HISTORY` to map workloads to specific warehouses. This is crucial for sizing your initial Databricks clusters.
Tooling for the Autopsy:
While you can use Snowflake's information_schema and account_usage views, this is laborious. Consider using automated discovery tools. Some partners and consultancies have internal scripts, and third-party tools are emerging that can parse query logs and object metadata to build a dependency graph. This investment pays for itself by preventing "gotchas" deep into the project.
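If you export query history to a file, even a small script can turn it into a warehouse-to-workload map. A minimal pure-Python sketch, with column names mirroring Snowflake's `ACCOUNT_USAGE.QUERY_HISTORY` view and invented sample rows:

```python
from collections import defaultdict

def summarize_by_warehouse(rows):
    """Aggregate query count, total elapsed time, and users per warehouse."""
    summary = defaultdict(lambda: {"queries": 0, "total_elapsed_ms": 0, "users": set()})
    for r in rows:
        s = summary[r["WAREHOUSE_NAME"]]
        s["queries"] += 1
        s["total_elapsed_ms"] += r["TOTAL_ELAPSED_TIME"]
        s["users"].add(r["USER_NAME"])
    return dict(summary)

# Illustrative rows from an exported QUERY_HISTORY extract
sample = [
    {"WAREHOUSE_NAME": "BI_WH", "USER_NAME": "tableau_svc", "TOTAL_ELAPSED_TIME": 3200},
    {"WAREHOUSE_NAME": "BI_WH", "USER_NAME": "analyst_1", "TOTAL_ELAPSED_TIME": 800},
    {"WAREHOUSE_NAME": "ETL_WH", "USER_NAME": "airflow_svc", "TOTAL_ELAPSED_TIME": 540000},
]
report = summarize_by_warehouse(sample)
```

A report like this quickly shows which warehouses serve interactive BI versus heavy ETL, which feeds directly into initial cluster sizing.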
3. Embrace the Lakehouse Paradigm, Don't Just "Lift and Shift" SQL
The most common failure pattern is treating Databricks as just a cheaper, faster Snowflake. A team that simply moves all their Snowflake SQL scripts and runs them on a Databricks SQL Warehouse is leaving 80% of the platform's value on the table and will likely be disappointed with the outcome.
Snowflake is a SQL-first, proprietary data warehouse. Databricks is a code-first, open-format Lakehouse. Understanding this distinction is key.
The Mental Model Shift
| Feature/Concept | Snowflake (SQL Warehouse) | Databricks (Lakehouse) | Practitioner's Takeaway |
|---|---|---|---|
| Data Storage | Proprietary, managed internal format | Open Delta Lake format (Parquet, JSON logs) on your own cloud storage | You gain control and ownership of your data in an open format. This eliminates lock-in but adds responsibility for managing the storage layer (e.g., lifecycle policies). |
| Primary Interface | SQL (via UI, SnowSQL, connectors) | Notebooks (Python, Scala, R, SQL) and SQL Editor | Your team's skillset must expand. SQL analysts are still first-class citizens with Databricks SQL, but engineers must become proficient in notebook-based development and software engineering principles (Git, CI/CD). |
| Compute | Virtual Warehouses (T-shirt sizes) | Clusters (node types, sizes, auto-scaling) | You have more granular control over compute, which is powerful for optimization but also introduces more complexity. You're not just picking a "Medium" warehouse; you're selecting VM instance types. |
| Transformation Logic | SQL, Stored Procedures, Tasks | Delta Live Tables, Structured Streaming, Notebook Jobs (Python/Spark), dbt | Move away from monolithic stored procedures. Embrace declarative pipelines with DLT for ETL/ELT, or programmatic Spark jobs for complex transformations that are difficult to express in SQL. |
| Governance | RBAC, Column Masking, Row Access Policies | Unity Catalog (centralized governance for files, tables, ML models, dashboards) | Rebuild your governance model in Unity Catalog from the start. It's the core of the Lakehouse and unifies security across all assets, not just tables. |
Practical Example: Re-architecting a Multi-step Stored Procedure
Imagine a complex Snowflake stored procedure that runs nightly. It takes 2000 lines of Snowflake Scripting to:
1. Ingest new JSON data from a stage.
2. Run a series of MERGE statements to update a dimension table.
3. Execute multiple CREATE TABLE AS SELECT (CTAS) statements to build intermediate aggregates.
4. Finally, update a final summary fact table.
5. Perform cleanup of temporary tables.
The "Lift and Shift" Mistake: Attempting to rewrite this entire procedural flow into a single, massive Databricks SQL script or a Python notebook that just mimics the old logic. It will be brittle, hard to debug, and won't leverage the platform's strengths.
The Lakehouse Approach (using Delta Live Tables):
You would break this down into a declarative, multi-stage pipeline.
# DLT Pipeline: nightly_customer_summary.py
import dlt
from pyspark.sql.functions import count, current_timestamp, window

# BRONZE: Ingest raw data with schema inference and quality checks
@dlt.table(
    name="raw_events_bronze",
    comment="Raw JSON events from cloud storage."
)
@dlt.expect_or_drop("valid_timestamp", "event_timestamp IS NOT NULL")
def raw_events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/to/raw/json/data")
    )

# SILVER: Cleaned and enriched data, ready for analytics
@dlt.table(
    name="customer_updates_silver",
    comment="Cleaned customer updates, ready for merging."
)
def customer_updates_silver():
    return (
        dlt.read_stream("raw_events_bronze")
        .withColumn("processed_at", current_timestamp())
        # ... more cleaning and transformation logic ...
    )

# GOLD: Business-level aggregate table
@dlt.table(
    name="daily_customer_summary_gold",
    comment="Daily aggregated summary of customer activity."
)
def daily_customer_summary_gold():
    # This could join streams with static dimension tables
    return (
        dlt.read_stream("customer_updates_silver")
        .withWatermark("event_timestamp", "1 day")  # streaming window aggregations need a watermark
        .groupBy("customer_id", window("event_timestamp", "1 day"))
        .agg(count("*").alias("activity_count"))
    )
This DLT pipeline is more resilient, observable, and easier to maintain than the monolithic stored procedure. It automatically handles dependency management, retries, and data quality monitoring. This is the paradigm shift in action.
4. Design a Phased and Resilient Data Ingress Strategy
You cannot afford downtime or data loss during the migration. Moving terabytes or petabytes of data from Snowflake's proprietary storage to Delta Lake on your cloud storage is a significant engineering challenge.
Phase 1: The Historical Backfill (Full Load)
This is the one-time operation to move all your existing data. Don't underestimate the time and network egress costs.
- Method: The most reliable method is to use Snowflake's `COPY INTO @...` command to unload tables into a cloud storage stage (S3, ADLS Gen2, GCS) in Parquet format. Then, in Databricks, you can read these Parquet files and write them out as Delta tables using `CREATE OR REPLACE TABLE ... AS SELECT` or `df.write.format("delta").save(...)`.
- Performance Tuning:
  - Unload large tables in parallel using `MAX_FILE_SIZE` to create multiple smaller files. This allows you to leverage Spark's parallelism when reading on the Databricks side.
  - Use a sufficiently large Snowflake warehouse for the `COPY` command to maximize unload speed.
  - On the Databricks side, use a large, optimized cluster for the initial conversion to Delta. You can scale it down after the backfill is complete.
  - Partition your data on write in Databricks. For a 10TB `orders` table, writing it partitioned by `order_date` will be critical for future query performance.
Example: Unloading from Snowflake and Loading into Databricks
Step A: In Snowflake SQL
COPY INTO @my_s3_stage/orders_unload/
FROM my_db.my_schema.orders
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 1073741824; -- 1GB chunks
Step B: In a Databricks Notebook
# Path where Snowflake unloaded the data
source_path = "s3://my-bucket/orders_unload/"

# Target path for the Delta table
delta_path = "s3://my-bucket/delta/orders"

# Read the Parquet files and write as a partitioned Delta table
(spark.read.format("parquet").load(source_path)
    .write
    .format("delta")
    .mode("overwrite")  # overwriteSchema only takes effect in overwrite mode
    .partitionBy("order_date")
    .option("overwriteSchema", "true")
    .save(delta_path)
)

# Register the table in the metastore (Unity Catalog)
spark.sql(f"CREATE TABLE IF NOT EXISTS main.bronze.orders USING DELTA LOCATION '{delta_path}'")
Phase 2: The Incremental Sync (Dual-Write Period)
After the backfill, your Snowflake system is still live and receiving new data. You need a way to keep your new Databricks tables in sync until the final cutover.
- Method 1 (Simple, for low-volume): Snowflake Streams + Tasks. Create a Snowflake Stream on each source table to capture changes (inserts, updates, deletes). A scheduled Snowflake Task then runs periodically (e.g., every 15 minutes) to unload only the data in the stream to a dedicated cloud storage location. A Databricks Job then picks up these micro-batches and merges them into the target Delta table.
- Method 2 (Robust, for high-volume): CDC with a Third-Party Tool. For critical, high-throughput tables, relying on Snowflake Streams can be fragile. A dedicated CDC tool (like Fivetran, Qlik Replicate, Arcion) can stream changes directly from Snowflake's underlying change tracking mechanism or logs to your cloud storage or directly into Databricks, providing a more reliable and lower-latency sync.
- Method 3 (Application-level Dual Write): In some cases, you can modify the upstream applications that write to Snowflake to also write to a streaming source that Databricks can ingest (e.g., Kafka, Kinesis). This is the cleanest approach but requires control over the source applications.
The dual-write period is critical for testing. It allows you to run queries against both systems and validate that the data is identical, which builds confidence for the final cutover.
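The merge semantics behind Method 1 can be sketched without any infrastructure. The toy below applies a micro-batch of change rows (with a hypothetical `action` flag) to an in-memory table keyed by `id`, standing in for the Delta `MERGE` a Databricks job would run per batch:

```python
def apply_micro_batch(table, changes):
    """Apply insert/update/delete change rows to an in-memory 'table' (dict keyed by id)."""
    for change in changes:
        key = change["id"]
        if change["action"] == "DELETE":
            table.pop(key, None)          # WHEN MATCHED AND action = 'DELETE' THEN DELETE
        else:
            # INSERT and UPDATE both upsert the latest row image,
            # like WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
            table[key] = {k: v for k, v in change.items() if k != "action"}
    return table

target = {1: {"id": 1, "status": "new"}}
batch = [
    {"action": "UPDATE", "id": 1, "status": "shipped"},
    {"action": "INSERT", "id": 2, "status": "new"},
    {"action": "DELETE", "id": 1},
]
result = apply_micro_batch(target, batch)
```

Applying changes strictly in order matters: the same key can be updated and then deleted within one batch, and the outcome must match what Snowflake saw.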
5. Automate Code Conversion, But Budget for Manual Refactoring
The bulk of the engineering effort in a migration lies in converting years of accumulated proprietary SQL, stored procedures, and scripts. While automation is a powerful accelerator, believing any tool will provide 100% automated conversion is a dangerous fantasy.
The 70/30 Rule
A realistic expectation is that automated tools can handle 60-80% of your standard SQL (DML, DDL). The remaining 20-40%—the most complex, business-critical, and non-standard logic—will require manual refactoring by skilled engineers. This is where your budget and timeline can be broken.
Automated Conversion Tooling
There are several tools available, from open-source parsers to commercial enterprise solutions. They typically work by parsing the Snowflake SQL dialect and transpiling it to Databricks-compatible Spark SQL. They are excellent at handling:
* Syntax differences (e.g., rewriting QUALIFY as a ROW_NUMBER() OVER (...) subquery with WHERE rn = 1; note that recent Databricks runtimes also accept QUALIFY natively).
* Function mapping (e.g., IFF to IF, DATE_TRUNC to TRUNC).
* Basic DDL and DML statements.
Example: Automated Translation
A tool would convert this Snowflake SQL:
-- Snowflake: Find the latest action for each user
SELECT
user_id,
action,
timestamp
FROM user_actions
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
Into this Databricks-compatible SQL:
-- Databricks: Equivalent using a CTE or subquery
SELECT
user_id,
action,
timestamp
FROM (
SELECT
user_id,
action,
timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) as rn
FROM user_actions
)
WHERE rn = 1;
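The mechanics of the simplest renames are easy to sketch. This toy transpiler handles only the spelling-level mappings mentioned above (`IFF` to `IF`, `DATE_TRUNC` to `TRUNC`); real conversion tools parse the SQL into an AST rather than using regexes:

```python
import re

# Illustrative subset of a dialect function map (from the examples above)
FUNCTION_MAP = {"IFF": "IF", "DATE_TRUNC": "TRUNC"}

def rename_functions(sql):
    """Rewrite function names that differ only in spelling between dialects."""
    pattern = re.compile(r"\b(" + "|".join(FUNCTION_MAP) + r")\s*\(", re.IGNORECASE)
    return pattern.sub(lambda m: FUNCTION_MAP[m.group(1).upper()] + "(", sql)

converted = rename_functions("SELECT IFF(amount > 0, 'credit', 'debit') FROM ledger")
```

This is exactly the class of change that automates well; anything touching control flow or semi-structured data does not.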
The Manual Refactoring Bucket: Where a Senior Engineer is Worth Their Weight in Gold
This is what the tools can't handle and where you need your best people.
- Stored Procedures: As mentioned, these are a 100% manual rewrite. The logic needs to be re-thought in a Spark-native way (Python/Scala DataFrames) or as a DLT pipeline. You're not translating; you're re-architecting.
- Snowflake-Specific Functions: Functions without direct equivalents require creative workarounds.
  - `FLATTEN` for semi-structured data: This needs to be reimplemented using Spark's `explode()` function. The logic is similar, but the syntax and approach are different.
  - `MATCH_RECOGNIZE` for sessionization: This powerful pattern-matching function has no direct Spark SQL equivalent. You would need to rewrite this logic using complex window functions or a programmatic Spark approach, which is a significant effort.
- Recursive CTEs: Spark SQL has historically lacked native recursive CTE support, so complex hierarchical queries may need to be rewritten using a programmatic, iterative approach in a notebook (e.g., a loop that joins a DataFrame to itself until a condition is met).
- Task/Stream Logic: The entire workflow of CDC streams being consumed by tasks needs to be manually redesigned, likely using Delta Live Tables or a combination of Databricks Jobs and `MERGE` statements.
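The iterative replacement for a recursive CTE can be illustrated with plain Python: expand a frontier against the parent-child edge list until a fixed point, exactly as you would loop a DataFrame self-join in a notebook. Names and data here are invented:

```python
def descendants(edges, root):
    """Return all descendants of `root`, given (parent, child) edge tuples."""
    found, frontier = set(), {root}
    while frontier:                       # loop until no new rows appear,
        next_frontier = set()             # like a DataFrame self-join in a while loop
        for parent, child in edges:
            if parent in frontier and child not in found:
                next_frontier.add(child)
        found |= next_frontier
        frontier = next_frontier
    return found

# Hypothetical org hierarchy
edges = [("corp", "emea"), ("corp", "amer"), ("emea", "uk"), ("uk", "london")]
tree = descendants(edges, "corp")
```

Each pass of the loop corresponds to one level of the recursion, so the number of joins equals the depth of the hierarchy.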
Project Management Insight:
Your project plan must have separate workstreams and estimates for "Automated Conversion" and "Manual Refactoring." Triage your codebase early. Tag every object with its expected conversion method (automated, manual, or decommission). Assign your most experienced engineers to the manual refactoring tasks first, as they carry the most risk.
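A triage pass can start as something as simple as the sketch below; the object types, thresholds, and categories are illustrative, not a standard:

```python
def triage(obj):
    """Tag an inventoried object with its expected conversion path (illustrative heuristics)."""
    if obj.get("days_since_last_use", 0) > 180:
        return "decommission"            # unused: don't migrate it at all
    if obj["type"] in ("stored_procedure", "task", "stream"):
        return "manual"                  # re-architect as DLT, Jobs, or MERGE logic
    if obj["type"] in ("table", "view", "sql_udf"):
        return "automated"               # transpiler-friendly
    return "manual"                      # default to the safe bucket

inventory = [
    {"name": "SP_NIGHTLY_LOAD", "type": "stored_procedure", "days_since_last_use": 1},
    {"name": "DIM_CUSTOMER", "type": "table", "days_since_last_use": 1},
    {"name": "OLD_REPORT_VW", "type": "view", "days_since_last_use": 400},
]
plan = {obj["name"]: triage(obj) for obj in inventory}
```

Even a crude first pass like this lets you size the manual-refactoring workstream before committing to a timeline.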
6. Rethink Performance from First Principles: Partitioning, Clustering, and Caching
You cannot apply Snowflake performance tuning concepts directly to Databricks. Snowflake's magic lies in its automatic optimization, micro-partitioning, and results caching, abstracted away behind a simple T-shirt-sized warehouse. Databricks gives you the raw controls, which is powerful but requires a deeper understanding of distributed systems.
Key Performance Levers in Databricks
- File Size Optimization (`OPTIMIZE`): Delta Lake tables are composed of Parquet files. Too many small files ("the small file problem") kills read performance because of the overhead of opening each file. Regularly run the `OPTIMIZE` command to compact small files into larger, more optimal sizes (target ~1GB). This is the single most important maintenance task for a Delta table.
OPTIMIZE main.bronze.orders;
- Data Skipping via Z-Ordering (`OPTIMIZE ZORDER BY`): This is the closest analog to Snowflake's clustering, but you have to be explicit. Z-Ordering co-locates related information in the same set of files. If your queries frequently filter on `customer_id` and `order_date`, Z-Ordering by those columns will allow the query engine to skip reading files that don't contain relevant data.
-- If you often query by customer and date range
OPTIMIZE main.bronze.orders ZORDER BY (customer_id, order_date);
**Practitioner's Tip:** Don't Z-Order by high-cardinality columns (like `transaction_id`) or columns you don't filter on. It's an expensive operation, so choose your Z-Order columns wisely based on your most common and expensive query patterns. A good rule of thumb is 1-4 columns.
- Partitioning: This is a physical data layout strategy. Partitioning your `orders` table by `order_date` creates a directory structure like `/orders/order_date=2023-10-26/`. If your query includes `WHERE order_date = '2023-10-26'`, Spark will only read the data in that specific directory, providing a massive performance boost.
  **Warning:** Do not over-partition. Partitioning by a high-cardinality column (e.g., `customer_id`) will create millions of small directories and lead to the "small file problem" you were trying to avoid. Use partitioning only for low-cardinality columns that are used in >95% of your queries on that table (e.g., date, region).
- Caching: Databricks provides a powerful, on-disk cache (Delta Caching) that is automatically used on clusters with Delta Cache-enabled worker types (e.g., AWS `i3` series, Azure `L` series). When you read a file, it's cached on the worker's local SSD. Subsequent reads of that same file are orders of magnitude faster. Unlike Snowflake's result set cache, this caches the data itself, so different queries that touch the same data can all benefit. For interactive dashboards and exploratory analysis, cache-accelerated clusters are a game-changer.
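The partitioning warning can be encoded as a quick screening heuristic. The thresholds below are rules of thumb for illustration, not official Databricks guidance:

```python
def good_partition_column(distinct_values, filter_frequency, table_size_gb):
    """Return True if a column looks like a safe physical partition key (illustrative thresholds)."""
    if table_size_gb < 100:
        return False  # small tables rarely benefit from partitioning
    if distinct_values > 10_000:
        return False  # high cardinality recreates the small-file problem
    return filter_frequency >= 0.95  # used in >95% of queries on the table

# order_date on a 10TB orders table: ~3,650 distinct values, filtered in nearly every query
order_date_ok = good_partition_column(3_650, 0.98, 10_240)
# customer_id: millions of values, filtered only occasionally
customer_id_ok = good_partition_column(5_000_000, 0.40, 10_240)
```

Running a check like this over your most expensive tables keeps partitioning decisions grounded in actual query patterns instead of habit.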
From Warehouse Sizing to Cluster Configuration
Instead of picking a LARGE warehouse, you'll configure a Databricks cluster. You'll choose:
* Worker/Driver Types: What kind of VMs? General Purpose, Memory Optimized, or Storage Optimized (for caching)?
* Number of Workers: How many nodes in the cluster?
* Auto-scaling: Set a min and max number of workers. The cluster will scale up under load and down when idle.
* Spot vs. On-Demand: Use spot instances for non-critical workloads to save significant cost, with the fallback to on-demand if spot instances are unavailable.
This requires a shift in mindset from "buy a bigger warehouse" to "configure the right cluster for the workload." You might have a small, memory-optimized cluster for a Python ML training job and a large, storage-optimized cluster for your main BI workload.
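As a sketch of that mindset, here is a hypothetical helper mapping workload classes to a cluster spec shaped like the Databricks Jobs API `new_cluster` payload; the instance types, Spark version, and field choices are examples to check against the API reference, not a recommendation:

```python
def build_cluster_spec(workload):
    """Pick a hypothetical cluster shape per workload class (illustrative values)."""
    if workload == "bi":
        node, min_w, max_w = "i3.2xlarge", 2, 8      # storage-optimized: Delta caching
    elif workload == "ml":
        node, min_w, max_w = "r5.4xlarge", 1, 4      # memory-optimized: model training
    else:
        node, min_w, max_w = "m5.xlarge", 2, 10      # general-purpose ETL
    return {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": node,
        "autoscale": {"min_workers": min_w, "max_workers": max_w},
        "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
    }

bi_spec = build_cluster_spec("bi")
```

Encoding these choices in code (and ultimately in Terraform) is what keeps per-workload tuning from degenerating into ad hoc UI tweaks.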
7. Rebuild Governance from the Ground Up with Unity Catalog
Do not try to replicate your Snowflake role-based access control (RBAC) model piecemeal. A Snowflake-to-Databricks migration is the perfect opportunity to adopt Unity Catalog (UC) from day one. UC is the cornerstone of modern Databricks governance and provides a unified control plane across tables, files, models, and dashboards that Snowflake cannot match.
What Unity Catalog Centralizes
| Asset Type | Governance in Snowflake | Governance in Databricks with Unity Catalog |
|---|---|---|
| Tables & Views | Standard SQL GRANT/REVOKE on schemas and tables. | Standard SQL GRANT/REVOKE on catalogs, schemas, and tables. |
| Files in Storage | No native governance. Access is controlled by IAM roles on the cloud storage bucket. | Governed as Volumes. You can grant READ VOLUME/WRITE VOLUME on directories of files to specific users/groups. |
| ML Models | Not a concept. Models live outside the warehouse. | First-class citizens in UC. You can GRANT EXECUTE on a registered model. |
| Dashboards | Governed within the BI tool (Tableau, etc.). | Governed within Databricks. |
| Lineage | Relies on third-party tools or parsing query history. | Automated, column-level lineage out-of-the-box. UC automatically tracks which table, job, or notebook produced which data. |
| Data Sharing | Secure Data Sharing. | Delta Sharing, an open protocol for sharing live data with any consumer on any platform. |
The Implementation Roadmap
- Deploy a UC Metastore: This is the top-level container for all your assets. You provision one per region and link it to your Databricks workspaces.
- Define Your Namespace: A best practice is the three-level namespace: `catalog.schema.table`.
  - Catalog: The highest level. Map these to environments (`dev`, `staging`, `prod`) or business units (`sales`, `marketing`).
  - Schema: The equivalent of a database schema, used to group related tables.
  - Table/View/Volume/Model: The final asset.
- Migrate Security Principals: Re-create your user groups (e.g., `analysts_finance`, `engineers_data_platform`) in Databricks and grant them access at the appropriate level. Granting permissions to groups, not individual users, is essential for manageability.
- Re-implement Fine-Grained Access Control:
  - Row-Level Security: Re-create Snowflake's row access policies by using `CREATE FUNCTION` to define a filter rule and applying it to a table as a row filter. The function returns `TRUE` or `FALSE` based on the user or their group membership.
  - Column-Level Masking: Use `ALTER TABLE ... ALTER COLUMN ... SET MASK` to apply a masking function (e.g., one that hashes the value) to sensitive columns for non-privileged users.
The Power of Automated Lineage
This is a killer feature you get "for free" with UC. Imagine a manager questions a number on a Power BI dashboard.
* In Snowflake: An engineer would have to manually trace the query from Power BI, then look at the query that created the summary table, and the one that created the table before that, piecing the lineage together like a detective.
* In Databricks (with UC): You can click "Lineage" in the Data Explorer UI. You will see a visual graph showing the entire flow: from the raw source files, through the Bronze and Silver notebooks/tables, to the Gold aggregate table, and finally to the dashboard that consumes it. This reduces debugging time from hours to minutes.
8. Institute a "Code-First" DevOps Culture
Snowflake development is often SQL-centric and UI-driven. A developer might write a script in the Snowflake UI, test it, and then copy-paste it into a scheduler. This approach is a liability in Databricks, which is fundamentally a platform for running code.
Treating your Databricks assets—notebooks, SQL queries, DLT pipelines, job definitions—as code is non-negotiable for a stable, scalable, and maintainable environment.
Core Components of Databricks DevOps
- Git Integration with Databricks Repos: This is the foundation. Every project should live in a Git repository (GitHub, GitLab, Azure DevOps). The Databricks Repos feature allows you to clone a remote Git repo directly into your workspace. Developers work on branches, create pull requests for code review, and merge to main. This brings software engineering best practices to your data pipelines.
- Infrastructure as Code (IaC) with Terraform: Do not click-to-create clusters, jobs, or permissions in the UI. Use the Databricks Terraform Provider to define all your infrastructure declaratively.
  - Benefits: Your entire environment is version-controlled, auditable, and reproducible. Setting up a new `dev` environment is as simple as running `terraform apply` with a different variable file.
  - Example Terraform for a Databricks Job:
resource "databricks_job" "nightly_etl" {
  name = "Nightly Sales ETL"

  task {
    task_key = "run_dlt_pipeline"
    pipeline_task {
      // DLT pipeline tasks run on the compute defined in the pipeline itself,
      // so no cluster needs to be configured on this job.
      pipeline_id = databricks_pipeline.sales_dlt.id
    }
  }

  schedule {
    quartz_cron_expression = "0 0 2 * * ?" // Run at 2 AM daily
    timezone_id            = "UTC"
  }
}
-
CI/CD (Continuous Integration/Continuous Deployment): Use CI/CD tools like GitHub Actions or Azure DevOps Pipelines to automate testing and deployment. A typical workflow for a pull request might be:
- On PR creation: Automatically run linters and basic syntax checks.
  - On PR merge to the `staging` branch:
    - The CI/CD runner uses the Databricks CLI or REST API to deploy the updated notebook/code to the staging workspace.
- It then uses Terraform to apply any infrastructure changes (e.g., a new job).
- It triggers the job to run.
- It runs integration tests (e.g., asserting that a count in a target table is correct).
  - On successful staging run: A manual approval promotes the changes to the `main` branch, which triggers a deployment to production.
-
  Embracing dbt (Data Build Tool): If your team is primarily analytics engineers who "think in SQL," `dbt` is an essential part of the modern Databricks stack. `dbt` on Databricks brings software engineering rigor to your SQL-based transformations. It handles dependency management, testing, documentation, and materialization (as views or tables) directly on top of Delta Lake. Migrating Snowflake SQL transformations to `dbt` models running on a Databricks SQL Warehouse is a very common and successful pattern.
This cultural shift is often the hardest part. It requires training analysts to use Git and engineers to think about infrastructure as code. But the payoff in stability and development velocity is immense.
9. Implement a Multi-Layered Validation Framework
"Does the data in Databricks match Snowflake?" This is the question that will determine the success or failure of your cutover. You need a rigorous, automated, and multi-faceted approach to prove data integrity and performance parity.
Layer 1: Cell-by-Cell Data Reconciliation
For critical tables, you can't just rely on row counts. You need to prove the data is identical.
- Technique: Use a data diff tool or write your own validation scripts. A common approach is to:
  - For a given table, `UNPIVOT` all columns into a key-value structure in both Snowflake and Databricks.
  - Calculate a hash (e.g., `MD5`) of each row's unpivoted representation.
  - `MINUS` or `EXCEPT` the set of hashes from one system against the other. The result should be zero rows.
- Automation: Build a framework that takes a list of tables and runs this reconciliation automatically, generating a daily report. This will be your primary tool during the dual-write period to catch any discrepancies in the incremental sync logic.
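The hash-and-diff idea can be prototyped in a few lines of pure Python before committing it to SQL in both engines. Here a set symmetric difference plays the role of running `EXCEPT` in both directions; the sample rows are invented:

```python
import hashlib

def row_hashes(rows):
    """Hash each row's canonical key=value representation (column-order independent)."""
    hashes = set()
    for row in rows:
        canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))  # unpivot-style form
        hashes.add(hashlib.md5(canonical.encode()).hexdigest())
    return hashes

snowflake_rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
databricks_rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.25}]

# An empty symmetric difference means the tables match cell-for-cell
mismatches = row_hashes(snowflake_rows) ^ row_hashes(databricks_rows)
```

In production the hashing runs inside each engine so only compact hash sets cross the wire; watch out for cross-engine differences in float and timestamp formatting, which will show up as false mismatches.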
Layer 2: Business Logic & KPI Validation
This layer ensures that the re-architected business logic produces the same results.
- Technique: Identify your top 20-50 most critical business reports or dashboards (e.g., "Monthly Active Users," "QTD Revenue by Region"). Re-create these reports in your BI tool (Tableau, Power BI) pointing to the new Databricks source.
- Execution: For a set period (e.g., one month), run the old report against Snowflake and the new report against Databricks every single day. Export the results and compare them. Any deviation points to a bug in your transformation logic. This is the ultimate UAT (User Acceptance Testing) and builds immense trust with the business.
Layer 3: Performance & Cost Benchmarking
You need to prove that the new system is not only correct but also performs adequately and within budget.
- Performance:
  - Identify a set of representative queries from your Snowflake `query_history`: some short, interactive dashboard queries; some medium-sized analytical queries; and some heavy, nightly ETL queries.
  - Run this same set of queries against both systems (on appropriately sized compute) and log the execution times.
  - The goal isn't always to be faster on every single query. The goal is to meet the SLAs. An interactive query that took 3 seconds in Snowflake should still be in the low single digits in Databricks. A nightly job that took 1 hour can take 1.5 hours if it's more cost-effective.
- Cost:
- During the benchmarking phase, use Databricks tags to tag all clusters and SQL Warehouses related to the migration project.
- Use the cost analysis tools in your cloud provider (e.g., AWS Cost Explorer, Azure Cost Management) to get a precise cost for the Databricks portion of the workload.
- Compare this to the Snowflake credit consumption for the equivalent workload over the same period. This provides a data-driven TCO comparison, not a theoretical one.
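The SLA-oriented comparison above can be captured in a tiny report generator; the query names, timings, and SLAs below are illustrative:

```python
def benchmark_report(runs):
    """Grade each benchmarked query against its SLA rather than against Snowflake's raw time."""
    return {
        r["name"]: ("PASS" if r["databricks_s"] <= r["sla_s"] else "FAIL")
        for r in runs
    }

# Hypothetical benchmark results: Databricks execution time vs. agreed SLA
runs = [
    {"name": "dashboard_top_sellers", "databricks_s": 2.8, "sla_s": 5},
    {"name": "monthly_rollup", "databricks_s": 95, "sla_s": 120},
    {"name": "nightly_fact_build", "databricks_s": 5400, "sla_s": 4500},
]
report = benchmark_report(runs)
```

Framing results as PASS/FAIL against SLAs keeps the conversation on business requirements instead of a query-by-query drag race between platforms.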
This validation framework is your safety net. It moves the conversation from "I think it's working" to "I can prove it's working."
10. Invest in People: The Overlooked Pillar of Migration Success
You can have the best technology and the most perfect migration plan, but if your people don't know how to use the new platform, or worse, are resistant to it, the project will fail. This is a change management exercise as much as it is a technology project.
Tailored Training for Different Personas
A one-size-fits-all training program is ineffective. You need to segment your users and provide targeted enablement.
- For the BI Analyst / SQL Analyst:
- Focus: Databricks SQL. Show them that the SQL editor is a familiar, powerful environment.
- Key Topics: Connecting their BI tools to Databricks SQL Warehouses, understanding the three-level namespace in Data Explorer, using the query history, and building basic dashboards within Databricks.
- Goal: Make them feel comfortable and productive on day one. Assure them that their SQL skills are still highly valuable.
- For the Data Engineer:
- Focus: The "Code-First" paradigm.
- Key Topics: Notebook-based development in Python/Scala, Spark fundamentals (DataFrames, lazy evaluation), Delta Live Tables for declarative ETL, job orchestration, and CI/CD with Databricks Repos and Terraform.
- Goal: Evolve them from SQL scriptwriters to data software engineers.
- For the Data Scientist:
- Focus: The unification of ML and data.
- Key Topics: Using MLflow to track experiments and register models in Unity Catalog, leveraging Spark for distributed model training, feature engineering at scale, and serving models with Databricks Model Serving.
- Goal: Show them how they can now work on full datasets without sampling and dramatically shorten the path from experiment to production.
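For data engineers coming from SQL scripting, lazy evaluation is usually the biggest conceptual shift in the Spark fundamentals above. The behavior can be mimicked with plain Python generators, purely as an analogy (this is not Spark itself): transformations only describe work, and nothing executes until an action asks for results.

```python
# Mimic Spark's transformation/action split with Python generators.
# Transformations build a lazy plan; only the action forces execution.

events = [{"user": "a", "ms": 120}, {"user": "b", "ms": 4500}, {"user": "a", "ms": 300}]

# "Transformations": generator expressions, nothing is evaluated yet.
slow = (e for e in events if e["ms"] > 200)      # like df.filter(...)
labeled = ({**e, "slow": True} for e in slow)    # like df.withColumn(...)

# "Action": iterating finally runs the whole chain in one pass.
result = list(labeled)                           # like df.collect()
print(result)
```

The training payoff is the mental model: chaining ten "transformations" costs nothing until an action runs, which is why Spark can optimize the whole plan at once.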
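The experiment-tracking workflow that data scientists learn follows a simple pattern: log parameters and metrics per run, then promote the best run to the registry. A stdlib stand-in makes the idea concrete; this is not the real MLflow API (that would be `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric`), just the shape of the workflow.

```python
# Minimal stand-in for an MLflow-style experiment tracker (illustrative only).
class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Record one training run's parameters and resulting metrics."""
        self.runs.append({"params": params, "metrics": metrics})

    def best_run(self, metric, maximize=True):
        """Return the run with the best value for the given metric."""
        pick = max if maximize else min
        return pick(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"max_depth": 4}, {"auc": 0.81})
tracker.log_run({"max_depth": 8}, {"auc": 0.86})
tracker.log_run({"max_depth": 12}, {"auc": 0.84})

best = tracker.best_run("auc")
print(best["params"])  # the candidate to register in the model registry
```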
Communication and Evangelism
- Weekly Demos: Hold open sessions every week to showcase progress. Show a migrated dashboard, a working DLT pipeline, or the new lineage graph. Seeing is believing.
- Office Hours: Set up dedicated time slots where users can bring their problems and get help from the core migration team. This prevents frustration and builds goodwill.
- Identify Champions: Find enthusiastic users in each group and make them your champions. Give them early access, extra training, and a direct line to the project team. Their positive testimonials to their peers are more powerful than any official communication.
- Acknowledge the Learning Curve: Be open that there will be a transition period. Acknowledge the power and simplicity of Snowflake's abstractions and explain the "why" behind the new complexities in Databricks (more control, open formats, unified platform).
A migration isn't "done" at technical cutover. It's done when the last user has been successfully onboarded and the Snowflake account can be decommissioned with confidence. Investing in your people ensures that the new platform won't just work, but will flourish.
Case Study: "Streamify Media" Migration from Snowflake to Databricks
Company Profile & "Before" State:
Streamify Media is a mid-sized digital content provider. Their data platform was centered around Snowflake, which they used for all their BI reporting, user engagement analytics, and ad revenue tracking.
- Architecture: Fivetran piped production database data and ad network APIs into Snowflake. A team of 5 data engineers maintained a complex web of over 300 Snowflake stored procedures and tasks to run their nightly ETL.
- Pain Points:
- Cost Overruns: A separate team of 3 data scientists was trying to build content recommendation models. They would extract data from Snowflake, load it into SageMaker, and run training jobs. The massive data egress costs and the cost of the large Snowflake warehouse needed to perform the extractions were becoming prohibitive.
- DevOps Friction: The data science and data engineering teams were completely siloed. Models were built on sampled data, and "productionizing" a model involved a painful, multi-week process of engineers trying to translate Python logic into Snowflake SQL.
- Lack of Real-time Insight: All reporting was T+1, meaning yesterday's data at best. The product team wanted to run A/B tests and see user behavior impact within minutes, not days.
The "North Star" and Migration Approach:
Streamify's leadership defined their "North Star": "To create a single, unified platform for analytics and machine learning, enabling real-time experimentation and reducing model deployment time from weeks to days."
They adopted a phased approach, applying several of the best practices:
- Architectural Autopsy & Triage (Practice #2): They used an automated tool to parse their entire Snowflake codebase. They discovered that ~70% of their logic was standard SQL transformations, but 30% was contained in complex, multi-step JavaScript stored procedures for sessionization and user state management. This 30% was flagged as high-risk and high-effort.
- Embracing the Lakehouse Paradigm (Practice #3): Instead of rewriting the complex sessionization stored procedures, they re-architected them as a Delta Live Tables pipeline. A streaming source of raw clicks (Bronze) was declaratively transformed into a sessionized table (Silver), dramatically simplifying the logic and enabling near real-time updates.
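The core sessionization rule is conceptually simple even though the original stored procedures were not: group a user's clicks into sessions separated by an inactivity gap. A pure-Python sketch of that rule (the 30-minute timeout is a hypothetical parameter; in the real pipeline this would be a stateful streaming transformation inside DLT, not batch code):

```python
# Assign session IDs to a user's click timestamps: a new session starts
# whenever the gap since the previous click exceeds the inactivity timeout.
SESSION_TIMEOUT_S = 30 * 60  # hypothetical 30-minute inactivity window

def sessionize(timestamps):
    """Map each (sorted) click timestamp to a 0-based session index."""
    sessions, current = [], 0
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] > SESSION_TIMEOUT_S:
            current += 1
        sessions.append(current)
    return sessions

clicks = [0, 60, 300, 4000, 4100, 9000]  # seconds since first click
print(sessionize(clicks))                 # [0, 0, 0, 1, 1, 2]
```

Expressing this declaratively over a stream is what collapsed the multi-step procedural logic into a single maintainable pipeline stage.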
- Automated Conversion with Manual Refactoring (Practice #5): The 70% of standard SQL transformations were migrated to dbt models. The analytics engineering team quickly got up to speed with dbt on Databricks SQL. This left the senior data engineers free to focus on the high-effort DLT pipeline for sessionization.
- Rebuilding Governance in Unity Catalog (Practice #7): They started with UC from day one. They created dev, staging, and prod catalogs. PII columns in the customers table were automatically masked for the product_analytics group but remained visible to the compliance group. This replaced a clunky, hard-to-manage set of masking policies in Snowflake.
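The decision rule behind that masking setup can be sketched in plain Python: redact PII unless the reader belongs to an exempt group. In Unity Catalog this logic lives in a SQL mask function bound to the column, not in application code; the group names below follow the case study and the email format is hypothetical.

```python
# Illustrative column mask: PII is redacted unless the reader is in an
# exempt group. Mirrors the decision rule only, not the UC implementation.
EXEMPT_GROUPS = {"compliance"}

def mask_email(value, reader_groups):
    """Redact the local part of an email for non-exempt readers."""
    if set(reader_groups) & EXEMPT_GROUPS:
        return value
    _local, _, domain = value.partition("@")
    return f"***@{domain}"

print(mask_email("jane@streamify.example", {"product_analytics"}))  # ***@streamify.example
print(mask_email("jane@streamify.example", {"compliance"}))         # jane@streamify.example
```

Centralizing the rule on the column means every consumer, from BI tools to notebooks, sees the same masked view without per-tool policy duplication.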
- Investing in People (Practice #10): The migration team ran targeted workshops.
  - BI analysts were trained on connecting Tableau to their new prod catalog via Databricks SQL.
  - Data scientists were ecstatic to learn they could use databricks-connect to write PySpark code in their local VS Code IDE that executed against the full cluster dataset, and then use MLflow to track everything within the same platform.
The "After" State and Results:
After a four-month dual-write period and a successful cutover, Streamify Media's new platform on Databricks looked radically different and delivered significant business value:
- Unified Platform: Data scientists and engineers now collaborate in the same environment on the same data. The wall between the two teams was broken down.
- Accelerated ML Lifecycle: The time to deploy a new recommendation model went from 4 weeks to under 3 days. The automated column-level lineage from UC made debugging and validating feature logic trivial.
- Real-time Experimentation: Using the DLT pipeline and Databricks SQL, the product team could launch an A/B test on the website and see its impact on user engagement on a dashboard with only a 5-minute delay.
- Reduced TCO: While their cloud compute bill remained similar, they eliminated their Fivetran and SageMaker costs. The architectural simplification and removal of data egress fees resulted in a 25% reduction in total data platform TCO within the first year.
Streamify's success was not because they simply moved data from A to B. It was because they used the migration as a catalyst to fundamentally rethink their architecture, processes, and culture around a modern, unified data and AI platform.