The Real-World Guide to Migrating from AWS to Databricks: More Than Just a Lift-and-Shift
I’ve sat in too many kickoff meetings where a VP of Engineering, with stars in their eyes, declares, "We're moving to the Lakehouse! We're doing an AWS to Databricks migration to unify our data and AI." Everyone nods. The project manager opens a new Gantt chart. And I, the architect who has to actually make it work, feel a familiar knot in my stomach.
Why? Because I know what's coming. The "lift-and-shift" fantasy, the belief that we can just copy our AWS Glue scripts, point them at a new cluster, and call it a day. The shock when the first month's Databricks bill is 3x the old AWS bill. The panic when a converted Redshift stored procedure returns different numbers than the original.
Let's be clear: migrating your data stack from a collection of AWS services (Glue, EMR, Redshift, Step Functions) to Databricks is one of the most powerful moves you can make for your data strategy. But it's not a simple swap. It’s an architectural and philosophical shift. If you treat it like you're just changing the brand of your ETL engine, you are setting yourself up for failure.
I’ve been in the trenches on these projects. I’ve seen them succeed spectacularly and fail miserably. This isn't a sales pitch for Databricks or any specific tool. This is a field guide from someone who has the scars to prove it. We’re going to talk about what a real data migration from AWS to the Databricks Lakehouse looks like, warts and all. We'll cover the analysis, the code, the orchestration, the tools, and the inevitable "why is this so slow?!" moments.
Part 1: The Sobering First Step - Job Analysis & Metadata Extraction
Before you write a single line of Spark code, you have to understand the beast you're trying to tame. Your existing AWS data platform isn't a single entity. It's a sprawling ecosystem of:
- AWS Glue ETL Jobs: Dozens, maybe hundreds, of Python or Scala scripts, each with its own libraries, IAM roles, and job parameters.
- Redshift Stored Procedures: The hidden logic bombs. Years of business rules are often buried in procedural SQL that runs on a schedule.
- EMR Clusters: Perhaps you have some legacy MapReduce jobs or custom Spark applications running on EMR.
- AWS Step Functions & Lambda: The glue that holds the glue together. Complex orchestration logic that triggers jobs based on S3 events, schedules, or the completion of other jobs.
- Airflow/Control-M: External schedulers adding another layer of complexity.
The first, and most critical, phase is a deep job analysis. You cannot migrate what you don't understand. Manually clicking through the AWS console and reading every script is a recipe for missing critical dependencies.
Here’s our approach:
- Automated Metadata Extraction: You need to programmatically pull everything. Use the AWS SDK (Boto3 for Python is your friend) to list all Glue jobs, triggers, crawlers, Step Function definitions, and Lambda function configurations. For Redshift, query system tables like `SVV_TABLE_INFO`, `PG_PROC`, and `SVL_QUERY_SUMMARY` to find table dependencies and procedure logic.
- Build a Dependency Graph: The output of step one is just a pile of JSON and SQL. The real value comes from connecting the dots. We build a graph (using a library like `networkx` in Python or a graph database) where the nodes are your jobs, tables, and S3 paths, and the edges are the dependencies.
  - “This Glue job reads from `s3://raw-data/` and writes to `s3://processed-data/`.”
  - “This Step Function triggers that Glue job after a Lambda function successfully completes.”
  - “This Redshift procedure reads from `processed.customer_table` and updates `analytics.monthly_summary`.”
- Identify Migration Waves: Now you can look at the graph and make intelligent decisions. Don't try to boil the ocean. A dependency graphing exercise reveals the "leaf nodes" of your data ecosystem—the pipelines that have few downstream dependencies. These are your ideal candidates for a pilot migration. You can group pipelines into logical "waves" based on business domain (e.g., "Sales Analytics," "Product Usage Funnel") and technical dependencies.
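The wave-identification step needs nothing more than the Python standard library. Assuming you have already extracted a dependency map (pipeline → its upstream pipelines) from your Glue and Step Function metadata, `graphlib` can group pipelines into waves where each wave depends only on earlier ones. A minimal sketch; the pipeline names here are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical inventory extracted from Glue/Step Function metadata:
# each pipeline mapped to the set of pipelines it depends on.
dependencies = {
    "raw_sales_ingest": set(),
    "raw_events_ingest": set(),
    "sales_daily_agg": {"raw_sales_ingest"},
    "finance_quarterly": {"sales_daily_agg"},
    "funnel_report": {"raw_events_ingest", "sales_daily_agg"},
}

def migration_waves(deps):
    """Group pipelines into waves; wave N depends only on waves 0..N-1."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # every node whose upstreams are done
        waves.append(ready)
        ts.done(*ready)
    return waves

for i, wave in enumerate(migration_waves(dependencies)):
    print(f"Wave {i}: {wave}")
# Wave 0 holds the ingestion jobs with no upstream dependencies
```

In practice you would also want the reverse traversal (which pipelines have few downstream consumers) to pick low-risk pilot candidates, but the grouping logic is the same graph walked in the other direction.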
For the CXOs and PMs: This initial analysis is the single most important factor in de-risking your migration. It prevents the "oops, we forgot that quarterly finance report" disaster. It allows you to create a realistic timeline and budget based on a complete inventory, not guesswork. Skipping this step is like trying to build a house without a blueprint. The initial cost of this analysis phase will pay for itself tenfold by preventing rework and production outages later.
Part 2: The Translation Guide - Detailed AWS → Databricks Mapping
Once you know what you have, you need to know where it's going. This isn't a one-to-one mapping. You're upgrading capabilities, not just changing names. Thinking in terms of the underlying function is key to a successful ETL modernization.
Here’s the cheat sheet I use on every project:
| AWS Component | Databricks Equivalent | Key Considerations & "Gotchas" |
|---|---|---|
| Data Storage (S3) | S3 with Delta Lake | This is the most fundamental shift. You’re not just writing Parquet files anymore. Delta Lake wraps your data in S3 with a transaction log, giving you ACID compliance, time travel (for rollbacks and auditing), and massive performance gains with features like Z-ORDERING. Your code changes from df.write.parquet(...) to df.write.format("delta").save(...). |
| Data Catalog (AWS Glue Catalog) | Unity Catalog | This is a game-changer for data governance. The Glue Catalog is just a technical metadata store. Unity Catalog is a unified governance layer for files, tables, ML models, and dashboards. It provides fine-grained access control (row/column level), centralized auditing, and data lineage across your entire lakehouse. You migrate your Glue Catalog to Unity Catalog, but you also need to re-think your entire security model. |
| ETL Engine (AWS Glue ETL) | Databricks Jobs (on Clusters) | Glue’s proprietary DynamicFrames are gone. You'll be using standard, open-source PySpark or Spark SQL. This is a huge win for portability and testing. A Databricks Job is simply a notebook, a Python script, or a JAR file executed on a cluster. The cluster configuration (node types, size) is now much more flexible. |
| SQL Warehouse (Amazon Redshift) | Databricks SQL (Serverless) | Redshift is a traditional, provisioned data warehouse. Databricks SQL provides a serverless, highly concurrent SQL endpoint on top of your Delta Lake data. The migration involves converting Redshift's procedural SQL to Databricks' ANSI-standard Spark SQL. Performance is dramatically different; Databricks SQL excels at scanning massive datasets, while Redshift can be faster for certain star-schema joins on smaller, well-tuned datasets. This requires careful performance tuning. |
| Orchestration (AWS Step Functions) | Databricks Workflows | Instead of a visual web of Lambda functions and service integrations, Databricks Workflows allows you to define a Directed Acyclic Graph (DAG) of tasks. A single workflow can run a notebook, then a SQL script, then a dbt model, then a Python script. It's fully integrated, so you get unified logging, monitoring, and alerting without wiring together three different AWS services. This is a massive simplification of your job scheduling. |
| Serverless Compute (AWS Lambda) | Databricks Serverless Functions / Jobs | For small, event-driven tasks (e.g., kicking off a job when a file lands), Lambda is common. Databricks now offers Serverless compute for jobs and SQL, which spins up resources instantly without you managing a cluster. For data ingestion, Autoloader on a Serverless cluster is often a more robust and scalable pattern than a file-triggered Lambda function. |
| Security & Identity (AWS IAM) | Unity Catalog + Service Principals | In AWS, you might have complex IAM roles for each Glue job. In Databricks, the model is simpler. You grant permissions on data objects (catalogs, schemas, tables) within Unity Catalog to users or groups. For automated jobs, you use Service Principals (the equivalent of an IAM role for applications) and grant them only the permissions they need. This centralizes access control around the data itself. |
The Big Picture: The goal of this AWS to Databricks mapping is simplification. You are moving from a collection of loosely coupled services to a single, unified platform. This reduces operational overhead, simplifies data governance, and eliminates the data silos between your data warehouse (Redshift) and your data lake (S3). This is the core value proposition of the data lakehouse.
Part 3: In the Trenches - Refactoring Complex AWS Logic
This is where the rubber meets the road. You can't just find-and-replace aws_glue with pyspark. You have to re-think the logic. The most common mistake is to write Spark code that mimics the old, procedural style of a legacy system. This leads to slow, unscalable jobs.
Let's look at a few real-world examples of code conversion.
Example 1: The Looping Redshift Stored Procedure
A classic pattern in Redshift is to use a cursor to loop through a set of rows and perform some complex, row-by-row update. This is anathema to a distributed processing engine like Spark.
The Old Way (Conceptual Redshift SQL):
```sql
CREATE PROCEDURE update_customer_segments() AS $$
DECLARE
    rec RECORD;
    cur CURSOR FOR SELECT customer_id, lifetime_value FROM customers;
BEGIN
    OPEN cur;
    LOOP
        FETCH cur INTO rec;
        EXIT WHEN NOT FOUND;
        IF rec.lifetime_value > 10000 THEN
            UPDATE customer_segments SET segment = 'Platinum' WHERE customer_id = rec.customer_id;
        ELSIF rec.lifetime_value > 5000 THEN
            UPDATE customer_segments SET segment = 'Gold' WHERE customer_id = rec.customer_id;
        ELSE
            UPDATE customer_segments SET segment = 'Silver' WHERE customer_id = rec.customer_id;
        END IF;
    END LOOP;
    CLOSE cur;
END;
$$ LANGUAGE plpgsql;
```

This is slow, procedural, and doesn't scale.
The Databricks Way (Spark SQL with MERGE):
Spark thinks in terms of whole datasets, not individual rows. The modern, idiomatic way to do this in Databricks is with a MERGE INTO statement, which is an atomic and highly parallel operation on a Delta table.
```sql
-- First, create a staging view with the new segment logic
CREATE OR REPLACE TEMP VIEW new_segments AS
SELECT
    customer_id,
    CASE
        WHEN lifetime_value > 10000 THEN 'Platinum'
        WHEN lifetime_value > 5000 THEN 'Gold'
        ELSE 'Silver'
    END AS new_segment
FROM customers;

-- Now, merge the changes into the target Delta table
MERGE INTO customer_segments t  -- target
USING new_segments s            -- source
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.segment <> s.new_segment THEN
    UPDATE SET t.segment = s.new_segment
WHEN NOT MATCHED THEN
    INSERT (customer_id, segment) VALUES (s.customer_id, s.new_segment);
```

This is a declarative, set-based operation that Spark can parallelize across hundreds of nodes. It's faster, more readable, and leverages the power of Delta Lake to handle the updates atomically.
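One practical habit when converting procedural logic like this: extract the tier thresholds into a pure function and unit-test the boundaries before trusting the converted query. A cursor loop and a `CASE` expression can silently disagree at exactly 5000 or 10000. A minimal sketch (the `segment_for` helper is mine, not part of the original procedure):

```python
def segment_for(lifetime_value):
    """Pure-Python mirror of the CASE expression feeding the MERGE."""
    if lifetime_value > 10000:
        return "Platinum"
    if lifetime_value > 5000:
        return "Gold"
    return "Silver"

# Boundary checks: both the original cursor loop and the CASE use strict '>',
# so 10000 and 5000 themselves fall into the lower tier.
assert segment_for(10001) == "Platinum"
assert segment_for(10000) == "Gold"
assert segment_for(5001) == "Gold"
assert segment_for(5000) == "Silver"
print("segment boundaries verified")
```

Ten lines of test code like this is what catches the "converted Redshift procedure returns different numbers" panic before it reaches production.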
Example 2: The AWS Glue DynamicFrame Script
AWS Glue introduced its own DynamicFrame object, which was meant to handle schema evolution. In practice, it often led to code that was locked into the Glue ecosystem and harder to debug.
The Old Way (Conceptual Glue Python Script):
```python
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

# ... boilerplate setup ...
glue_context = GlueContext(SparkContext.getOrCreate())

datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="raw_events"
)

# Use a Glue-specific 'ResolveChoice' transform to handle mixed types
resolved_frame = datasource0.resolveChoice(specs=[('user_id', 'cast:long')])

# Convert to Spark DataFrame for some operations, then back
spark_df = resolved_frame.toDF()
spark_df_filtered = spark_df.filter("event_type = 'purchase'")
result_dynamic_frame = DynamicFrame.fromDF(spark_df_filtered, glue_context, "result")

# Write out using Glue's writer
glue_context.write_dynamic_frame.from_options(
    frame=result_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://output-path/"},
    format="parquet"
)
```

The Databricks Way (Standard PySpark):
In Databricks, you just use standard PySpark DataFrames. The code is cleaner, more portable, and leverages Spark's native capabilities.
```python
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Read directly from the source table (assuming it's registered in Unity Catalog)
# Autoloader is even better for incremental streaming ingestion from S3
df = spark.read.table("source_catalog.source_db.raw_events")

# Handle schema and types using standard DataFrame APIs
# This is more explicit and easier to debug than ResolveChoice
df_transformed = (
    df.withColumn("user_id", col("user_id").cast(LongType()))
      .filter(col("event_type") == "purchase")
)

# Write to a Delta table - this is the standard pattern
df_transformed.write.format("delta").mode("overwrite").saveAsTable(
    "target_catalog.target_db.purchase_events"
)
```

This code can run on any Spark environment, is easier to unit test, and by writing to a Delta table, you get all the benefits of the Lakehouse automatically. The introduction of Python type hints in PySpark also makes modern code much more maintainable.
Part 4: Tying It All Together - A Saner Approach to Orchestration
Orchestration is the nervous system of your data platform, and in AWS, it can become a tangled mess. A common pattern I see is a Step Function that looks like a spiderweb: an S3 trigger fires a Lambda, which does some validation and then starts a Glue Job. The Step Function polls the Glue Job status. If it succeeds, it triggers another Lambda to update a DynamoDB table and start a second Glue Job.
This is brittle, hard to debug (you have to check CloudWatch logs for three different services), and a nightmare for metadata management.
Databricks Workflows fundamentally simplifies this. You define your entire pipeline as a series of tasks within a single workflow asset.
Imagine that complex Step Function. In Databricks, it becomes a single workflow with three tasks:
1. Task 1 (Validation): A Python notebook that runs the validation logic previously in the first Lambda.
2. Task 2 (Main ETL): A PySpark script (your converted Glue job) that depends on the successful completion of Task 1.
3. Task 3 (Post-Processing): A SQL script that updates a log table and depends on Task 2.
All the logging, error handling, retries, and alerting are configured in one place. You have a single run history to look at. If Task 2 fails, the workflow stops, sends an alert, and you can go directly to the failed notebook run, see the exact error, and even interactively debug it by attaching a cluster.
The new Serverless Compute for Jobs feature takes this even further. For many ETL workloads, you don't even need to think about cluster configurations anymore. You just submit your workflow and Databricks handles the compute, leading to better cost optimization as you only pay for the exact time your job is running.
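As a sketch, here is roughly what that three-task workflow looks like as a payload for the Databricks Jobs API. The job name and file paths are hypothetical, and the field names follow the shape of the Jobs 2.1 API; verify them against the current API reference before relying on them:

```python
# Hypothetical Jobs API payload (POST /api/2.1/jobs/create) sketching the
# three-task DAG described above. Names and paths are illustrative only.
workflow = {
    "name": "orders_pipeline",
    "tasks": [
        {
            "task_key": "validate",
            "notebook_task": {"notebook_path": "/Repos/etl/validate_input"},
        },
        {
            "task_key": "main_etl",
            "depends_on": [{"task_key": "validate"}],
            "spark_python_task": {"python_file": "jobs/main_etl.py"},
        },
        {
            "task_key": "post_process",
            "depends_on": [{"task_key": "main_etl"}],
            "sql_task": {"file": {"path": "sql/update_log.sql"}},
        },
    ],
}

# The depends_on edges alone define the DAG; Databricks derives the
# execution order, retries, and alerting from this one definition.
edges = [
    (dep["task_key"], t["task_key"])
    for t in workflow["tasks"]
    for dep in t.get("depends_on", [])
]
print(edges)  # [('validate', 'main_etl'), ('main_etl', 'post_process')]
```

Compare that to the equivalent Step Function: an ASL state machine, two Lambda functions, a Glue job definition, and three sets of IAM permissions to hold it together.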
Part 5: Choosing Your Weapons - The Tooling Landscape
So, how do you actually perform the code conversion at scale? You have a few options, each with serious trade-offs.
- The Manual Rewrite: You assemble a team of senior engineers and have them rewrite every script by hand.
- Pros: You get high-quality, idiomatic code tailored exactly to your needs.
- Cons: It's incredibly slow, expensive, and brain-numbingly repetitive for your best engineers. It’s also prone to human error and inconsistencies across the team. This approach is only feasible for very small-scale migrations.
- Generic "Code Converters": There are tools and online services that promise to convert SQL from one dialect to another.
- Pros: Can be quick for simple, one-off queries.
- Cons: These are glorified find-and-replace engines. They will fail on complex procedural logic, vendor-specific functions, and have zero understanding of the architectural context. They often produce syntactically correct but horribly non-performant Spark code because they don't refactor loops into joins, for example. They completely ignore the orchestration and metadata aspects.
- Specialized Migration Platforms (e.g., Travinto): This is a newer category of tool designed specifically for large-scale, cross-platform migrations.
Full disclosure: after trying the first two approaches and feeling the pain, my teams have had the most success using a specialized platform. On our last major Redshift and Glue migration, we used Travinto, and it fundamentally changed the project's trajectory.
This isn't a sales pitch. It's an explanation of why this category of tool works better. A tool like Travinto is not a magic "migrate" button. It's a co-pilot for your architects and engineers. Here’s how it’s different:
- It Starts with Analysis: It automates the deep analysis I described in Part 1. It scans your entire AWS environment (Glue, Redshift, Step Functions), ingests the code and metadata, and automatically builds the complete dependency graph. This alone saved us weeks of manual effort and gave our PMs a clear view of the entire project scope.
- It Understands Intent: It goes beyond syntactic conversion. It has a semantic understanding of the source platform. It recognizes a Redshift `CURSOR` loop and knows the correct idiomatic equivalent is a Spark `MERGE` or `JOIN`/`UPDATE` pattern. It converts Glue `DynamicFrames` to standard, clean PySpark.
- It Generates Idiomatic Databricks Code: The output isn't just "Spark." It's "Databricks." It generates code that uses Delta Lake, suggests partitioning and Z-ORDERING strategies, and structures the project for use with Databricks Repos and CI/CD.
- It Addresses the Whole Picture: It doesn't just convert a single script. It analyzes the Step Function orchestration and helps you map it to a Databricks Workflows JSON definition. It migrates the pipeline, not just the pieces.
The result was that Travinto automated about 80-90% of the conversion process, but the key is that it was the right 80%. It handled the tedious, repetitive, and error-prone work. This freed up my senior engineers to focus on the most complex 10% of the business logic, on performance tuning the new jobs, and on designing the new data governance model in Unity Catalog. It turned a 12-month high-risk project into a 5-month predictable one.
Part 6: When Things Go Wrong - A Troubleshooter's FAQ
No migration is perfect. Here are some of the common "gotchas" you will hit during an AWS to Databricks project and how to solve them.
- Q: My converted Spark SQL query returns a different row count than the original Redshift query. Why?
  - A: Almost always, this is due to `NULL` handling. Some legacy dialects (e.g., SQL Server with `ANSI_NULLS` off) treat `column = NULL` as a match, but in the ANSI SQL that Spark uses, `column = NULL` never matches anything; `column IS NULL` is the correct predicate. Also, check for differences in how `JOIN`s on nullable keys are handled. Perform a `MINUS` or `EXCEPT` query between the source and target datasets to pinpoint the exact rows that differ. This is a critical data validation step.
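You can see both failure modes with any ANSI-compliant engine. The sketch below uses SQLite from the Python standard library as a stand-in for Spark SQL, with made-up tables, to show why `= NULL` silently drops rows and how the `EXCEPT` pattern pinpoints the drift:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE redshift_out (id INTEGER, region TEXT);
    CREATE TABLE databricks_out (id INTEGER, region TEXT);
    INSERT INTO redshift_out VALUES (1, 'us'), (2, NULL), (3, 'eu');
    INSERT INTO databricks_out VALUES (1, 'us'), (3, 'eu');
""")

# '=' never matches NULL in ANSI SQL -- this filter finds nothing:
print(con.execute("SELECT * FROM redshift_out WHERE region = NULL").fetchall())
# []

# 'IS NULL' is the correct predicate:
print(con.execute("SELECT * FROM redshift_out WHERE region IS NULL").fetchall())
# [(2, None)]

# The EXCEPT (MINUS) pattern pinpoints exactly which rows drifted:
diff = con.execute("""
    SELECT * FROM redshift_out
    EXCEPT
    SELECT * FROM databricks_out
    ORDER BY id
""").fetchall()
print(diff)
# [(2, None)] -- the NULL-region row is the culprit
```

Run the same `EXCEPT` in both directions (source minus target, target minus source) so you catch extra rows as well as missing ones.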
- Q: My Databricks job is way slower than my old Glue job! I thought this was supposed to be faster?
  - A: Welcome to the world of performance issues in distributed systems. Nine times out of ten, it's a data skew problem or an inefficient shuffle.
    - Open the Spark UI. This is non-negotiable. Look at the query plan and the stage durations. Is one task taking 100x longer than the others? That's skew.
    - Check for Shuffles: Are you doing a `join` or `groupBy` on a high-cardinality key without proper partitioning? This causes a massive data shuffle across the network.
    - Review your file sizes: Your Delta table should be composed of well-sized files (ideally 100MB - 1GB). Too many small files kill read performance. Use `OPTIMIZE` and `Z-ORDER` on your Delta tables. `OPTIMIZE` compacts small files, and `Z-ORDER` co-locates related data to drastically improve query speed.
- Q: My costs are spiraling out of control on Databricks. What am I doing wrong?
  - A: This is common when people treat Databricks clusters like old, always-on Redshift clusters.
    - Use Autoscaling Job Clusters: Don't use all-purpose interactive clusters for production jobs. Job clusters are cheaper and terminate when the job is done. Configure them to autoscale up and down.
    - Leverage Spot Instances: For non-critical workloads, use a high percentage of spot instances. This can cut your compute costs by 70% or more.
    - Right-Size Your Clusters: Don't just pick the biggest instance type. Use the metrics from your job runs to see if you are CPU-bound or memory-bound and adjust the instance family accordingly.
    - Embrace Serverless: For SQL warehousing (BI) and increasingly for jobs, use Databricks Serverless. You pay for what you use with zero management overhead. This is a key cost optimization lever.
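Those levers all live in the cluster spec of your job definition. A hypothetical job-cluster fragment illustrating them together (field names follow the shape of the Databricks Clusters API, but verify runtime versions and instance types against your own workspace):

```python
# Hypothetical job-cluster spec combining the cost levers above:
# autoscaling, spot instances with on-demand fallback, and a deliberate
# instance choice instead of "the biggest box available".
job_cluster = {
    "spark_version": "15.4.x-scala2.12",       # pick a current LTS runtime
    "node_type_id": "i3.xlarge",               # right-size from job metrics
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot, fall back to on-demand
        "first_on_demand": 1,                  # keep the driver on-demand
    },
}
print(job_cluster["aws_attributes"]["availability"])
```

The point is that cost control is declarative: it belongs in the job definition under version control, not in someone's memory of which cluster to shut down on Friday.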
- Q: How do I handle proprietary libraries from my Glue jobs?
  - A: This is a common legacy system problem. In Glue, you'd point to an S3 path. In Databricks, you have better options. You can install libraries on a cluster via PyPI, Maven, or by uploading Python `.whl` or `.egg` files. For a more robust solution, use cluster init scripts to run a `pip install`, or better yet, package your environment into a custom Docker container and use Databricks Container Services. This ensures perfect reproducibility.
Final Thoughts: Your Path Forward
Migrating from a fragmented AWS data stack to the Databricks Lakehouse is a journey of transformation, not just translation. When done right, the business impact is immense: you break down data silos, accelerate AI and BI initiatives, simplify governance and security, and often lower your total cost of ownership.
My advice, from one practitioner to another:
- Don't Skip the Blueprint: Invest heavily in the initial analysis and metadata management. Know exactly what you have before you start moving.
- Think in Lakehouse: Train your team to stop writing procedural, row-by-row code. Embrace declarative, set-based operations with PySpark and Spark SQL. Learn and love Delta Lake.
- Pilot Smart: Choose a business-critical but technically contained pipeline for your first migration wave. Success here will build momentum and confidence.
- Leverage Accelerators: This is not the time for heroics. The manual approach is too slow and risky for any enterprise-scale migration. The landscape of tooling has matured. Look seriously at specialized platforms that can automate the tedious analysis and conversion, generating high-quality, idiomatic code. A tool like Travinto isn’t a crutch; it’s a force multiplier. It provides the solid foundation and accelerates your timeline, allowing your most valuable people to focus on architecture, performance, and delivering business value, which is the entire point of this exercise.
An AWS to Databricks migration is a challenging but rewarding endeavor. Armed with the right strategy, the right mindset, and the right tools, you can build a data platform that will serve your business for the next decade.
Now, go open that Gantt chart with confidence.