The Real-World Guide to Migrating from AWS to Databricks: More Than Just a Lift-and-Shift
I’ve sat in too many kickoff meetings where a VP of Engineering, with stars in their eyes, declares, "We're moving to the Lakehouse! We're doing an AWS to Databricks migration to unify our data and AI." Everyone nods. The project manager opens a new Gantt chart. And I, the architect who has to actually make it work, feel a familiar knot in my stomach.
Why? Because I know what's coming. The "lift-and-shift" fantasy, the belief that we can just copy our AWS Glue scripts, point them at a new cluster, and call it a day. The shock when the first month's Databricks bill is 3x the old AWS bill. The panic when a converted Redshift stored procedure returns different numbers than the original.
Let's be clear: migrating your data stack from a collection of AWS services (Glue, EMR, Redshift, Step Functions) to Databricks is one of the most powerful moves you can make for your data strategy. But it's not a simple swap. It’s an architectural and philosophical shift. If you treat it like you're just changing the brand of your ETL engine, you are setting yourself up for failure.
I’ve been in the trenches on these projects. I’ve seen them succeed spectacularly and fail miserably. This isn't a sales pitch for Databricks or any specific tool. This is a field guide from someone who has the scars to prove it. We’re going to talk about what a real data migration from AWS to the Databricks Lakehouse looks like, warts and all. We'll cover the analysis, the code, the orchestration, the tools, and the inevitable "why is this so slow?!" moments.
Part 1: The Sobering First Step - Job Analysis & Metadata Extraction
Before you write a single line of Spark code, you have to understand the beast you're trying to tame. Your existing AWS data platform isn't a single entity. It's a sprawling ecosystem of:
- AWS Glue ETL Jobs: Dozens, maybe hundreds, of Python or Scala scripts, each with its own libraries, IAM roles, and job parameters.
- Redshift Stored Procedures: The hidden logic bombs. Years of business rules are often buried in procedural SQL that runs on a schedule.
- EMR Clusters: Perhaps you have some legacy MapReduce jobs or custom Spark applications running on EMR.
- AWS Step Functions & Lambda: The glue that holds the glue together. Complex orchestration logic that triggers jobs based on S3 events, schedules, or the completion of other jobs.
- Airflow/Control-M: External schedulers adding another layer of complexity.
The first, and most critical, phase is a deep job analysis. You cannot migrate what you don't understand. Manually clicking through the AWS console and reading every script is a recipe for missing critical dependencies.
Here’s our approach:
- Automated Metadata Extraction: You need to programmatically pull everything. Use the AWS SDK (Boto3 for Python is your friend) to list all Glue jobs, triggers, crawlers, Step Function definitions, and Lambda function configurations. For Redshift, query system tables like `SVV_TABLE_INFO`, `PG_PROC`, and `SVL_QUERY_SUMMARY` to find table dependencies and procedure logic.
- Build a Dependency Graph: The output of step one is just a pile of JSON and SQL. The real value comes from connecting the dots. We build a graph (using a library like `networkx` in Python or a graph database) where the nodes are your jobs, tables, and S3 paths, and the edges are the dependencies.
  - “This Glue job reads from `s3://raw-data/` and writes to `s3://processed-data/`.”
  - “This Step Function triggers that Glue job after a Lambda function successfully completes.”
  - “This Redshift procedure reads from `processed.customer_table` and updates `analytics.monthly_summary`.”
- Identify Migration Waves: Now you can look at the graph and make intelligent decisions. Don't try to boil the ocean. A dependency graphing exercise reveals the "leaf nodes" of your data ecosystem—the pipelines that have few downstream dependencies. These are your ideal candidates for a pilot migration. You can group pipelines into logical "waves" based on business domain (e.g., "Sales Analytics," "Product Usage Funnel") and technical dependencies.
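The wave-identification step needs nothing more than the Python standard library. Assuming you have already extracted a dependency map (pipeline → its upstream pipelines) from your Glue and Step Function metadata, `graphlib` can group pipelines into waves where each wave depends only on earlier ones. A minimal sketch; the pipeline names here are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical inventory extracted from Glue/Step Function metadata:
# each pipeline mapped to the set of pipelines it depends on.
dependencies = {
    "raw_sales_ingest": set(),
    "raw_events_ingest": set(),
    "sales_daily_agg": {"raw_sales_ingest"},
    "finance_quarterly": {"sales_daily_agg"},
    "funnel_report": {"raw_events_ingest", "sales_daily_agg"},
}

def migration_waves(deps):
    """Group pipelines into waves; wave N depends only on waves 0..N-1."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # every node whose upstreams are done
        waves.append(ready)
        ts.done(*ready)
    return waves

for i, wave in enumerate(migration_waves(dependencies)):
    print(f"Wave {i}: {wave}")
# Wave 0 holds the ingestion jobs with no upstream dependencies
```

In practice you would also want the reverse traversal (which pipelines have few downstream consumers) to pick low-risk pilot candidates, but the grouping logic is the same graph walked in the other direction.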
For the CXOs and PMs: This initial analysis is the single most important factor in de-risking your migration. It prevents the "oops, we forgot that quarterly finance report" disaster. It allows you to create a realistic timeline and budget based on a complete inventory, not guesswork. Skipping this step is like trying to build a house without a blueprint. The initial cost of this analysis phase will pay for itself tenfold by preventing rework and production outages later.
Part 2: The Translation Guide - Detailed AWS → Databricks Mapping
Once you know what you have, you need to know where it's going. This isn't a one-to-one mapping. You're upgrading capabilities, not just changing names. Thinking in terms of the underlying function is key to a successful ETL modernization.
Here’s the cheat sheet I use on every project:
| AWS Component | Databricks Equivalent | Key Considerations & "Gotchas" |
|---|---|---|
| Data Storage (S3) | S3 with Delta Lake | This is the most fundamental shift. You’re not just writing Parquet files anymore. Delta Lake wraps your data in S3 with a transaction log, giving you ACID compliance, time travel (for rollbacks and auditing), and massive performance gains with features like Z-ORDERING. Your code changes from df.write.parquet(...) to df.write.format("delta").save(...). |
| Data Catalog (AWS Glue Catalog) | Unity Catalog | This is a game-changer for data governance. The Glue Catalog is just a technical metadata store. Unity Catalog is a unified governance layer for files, tables, ML models, and dashboards. It provides fine-grained access control (row/column level), centralized auditing, and data lineage across your entire lakehouse. You migrate your Glue Catalog to Unity Catalog, but you also need to re-think your entire security model. |
| ETL Engine (AWS Glue ETL) | Databricks Jobs (on Clusters) | Glue’s proprietary DynamicFrames are gone. You'll be using standard, open-source PySpark or Spark SQL. This is a huge win for portability and testing. A Databricks Job is simply a notebook, a Python script, or a JAR file executed on a cluster. The cluster configuration (node types, size) is now much more flexible. |
| SQL Warehouse (Amazon Redshift) | Databricks SQL (Serverless) | Redshift is a traditional, provisioned data warehouse. Databricks SQL provides a serverless, highly concurrent SQL endpoint on top of your Delta Lake data. The migration involves converting Redshift's procedural SQL to Databricks' ANSI-standard Spark SQL. Performance is dramatically different; Databricks SQL excels at scanning massive datasets, while Redshift can be faster for certain star-schema joins on smaller, well-tuned datasets. This requires careful performance tuning. |
| Orchestration (AWS Step Functions) | Databricks Workflows | Instead of a visual web of Lambda functions and service integrations, Databricks Workflows allows you to define a Directed Acyclic Graph (DAG) of tasks. A single workflow can run a notebook, then a SQL script, then a dbt model, then a Python script. It's fully integrated, so you get unified logging, monitoring, and alerting without wiring together three different AWS services. This is a massive simplification of your job scheduling. |
| Serverless Compute (AWS Lambda) | Databricks Serverless Functions / Jobs | For small, event-driven tasks (e.g., kicking off a job when a file lands), Lambda is common. Databricks now offers Serverless compute for jobs and SQL, which spins up resources instantly without you managing a cluster. For data ingestion, Autoloader on a Serverless cluster is often a more robust and scalable pattern than a file-triggered Lambda function. |
| Security & Identity (AWS IAM) | Unity Catalog + Service Principals | In AWS, you might have complex IAM roles for each Glue job. In Databricks, the model is simpler. You grant permissions on data objects (catalogs, schemas, tables) within Unity Catalog to users or groups. For automated jobs, you use Service Principals (the equivalent of an IAM role for applications) and grant them only the permissions they need. This centralizes access control around the data itself. |
The Big Picture: The goal of this AWS to Databricks mapping is simplification. You are moving from a collection of loosely coupled services to a single, unified platform. This reduces operational overhead, simplifies data governance, and eliminates the data silos between your data warehouse (Redshift) and your data lake (S3). This is the core value proposition of the data lakehouse.
Part 3: In the Trenches - Refactoring Complex AWS Logic
This is where the rubber meets the road. You can't just find-and-replace aws_glue with pyspark. You have to re-think the logic. The most common mistake is to write Spark code that mimics the old, procedural style of a legacy system. This leads to slow, unscalable jobs.
Let's look at a few real-world examples of code conversion.
Example 1: The Looping Redshift Stored Procedure
A classic pattern in Redshift is to use a cursor to loop through a set of rows and perform some complex, row-by-row update. This is anathema to a distributed processing engine like Spark.
The Old Way (Conceptual Redshift SQL):
```sql
CREATE PROCEDURE update_customer_segments() AS $$
DECLARE
    rec RECORD;
    cur CURSOR FOR SELECT customer_id, lifetime_value FROM customers;
BEGIN
    OPEN cur;
    LOOP
        FETCH cur INTO rec;
        EXIT WHEN NOT FOUND;
        IF rec.lifetime_value > 10000 THEN
            UPDATE customer_segments SET segment = 'Platinum' WHERE customer_id = rec.customer_id;
        ELSIF rec.lifetime_value > 5000 THEN
            UPDATE customer_segments SET segment = 'Gold' WHERE customer_id = rec.customer_id;
        ELSE
            UPDATE customer_segments SET segment = 'Silver' WHERE customer_id = rec.customer_id;
        END IF;
    END LOOP;
    CLOSE cur;
END;
$$ LANGUAGE plpgsql;
```

This is slow, procedural, and doesn't scale.
The Databricks Way (Spark SQL with MERGE):
Spark thinks in terms of whole datasets, not individual rows. The modern, idiomatic way to do this in Databricks is with a MERGE INTO statement, which is an atomic and highly parallel operation on a Delta table.
```sql
-- First, create a staging view with the new segment logic
CREATE OR REPLACE TEMP VIEW new_segments AS
SELECT
    customer_id,
    CASE
        WHEN lifetime_value > 10000 THEN 'Platinum'
        WHEN lifetime_value > 5000 THEN 'Gold'
        ELSE 'Silver'
    END AS new_segment
FROM customers;

-- Now, merge the changes into the target Delta table
MERGE INTO customer_segments t  -- target
USING new_segments s            -- source
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.segment <> s.new_segment THEN
    UPDATE SET t.segment = s.new_segment
WHEN NOT MATCHED THEN
    INSERT (customer_id, segment) VALUES (s.customer_id, s.new_segment);
```

This is a declarative, set-based operation that Spark can parallelize across hundreds of nodes. It's faster, more readable, and leverages the power of Delta Lake to handle the updates atomically.
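One practical habit when converting procedural logic like this: extract the tier thresholds into a pure function and unit-test the boundaries before trusting the converted query. A cursor loop and a `CASE` expression can silently disagree at exactly 5000 or 10000. A minimal sketch (the `segment_for` helper is mine, not part of the original procedure):

```python
def segment_for(lifetime_value):
    """Pure-Python mirror of the CASE expression feeding the MERGE."""
    if lifetime_value > 10000:
        return "Platinum"
    if lifetime_value > 5000:
        return "Gold"
    return "Silver"

# Boundary checks: both the original cursor loop and the CASE use strict '>',
# so 10000 and 5000 themselves fall into the lower tier.
assert segment_for(10001) == "Platinum"
assert segment_for(10000) == "Gold"
assert segment_for(5001) == "Gold"
assert segment_for(5000) == "Silver"
print("segment boundaries verified")
```

Ten lines of test code like this is what catches the "converted Redshift procedure returns different numbers" panic before it reaches production.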
Example 2: The AWS Glue DynamicFrame Script
AWS Glue introduced its own DynamicFrame object, which was meant to handle schema evolution. In practice, it often led to code that was locked into the Glue ecosystem and harder to debug.
The Old Way (Conceptual Glue Python Script):
```python
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

# ... boilerplate setup ...
glue_context = GlueContext(SparkContext.getOrCreate())

datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="raw_events"
)

# Use a Glue-specific 'ResolveChoice' transform to handle mixed types
resolved_frame = datasource0.resolveChoice(specs=[('user_id', 'cast:long')])

# Convert to Spark DataFrame for some operations, then back
spark_df = resolved_frame.toDF()
spark_df_filtered = spark_df.filter("event_type = 'purchase'")
result_dynamic_frame = DynamicFrame.fromDF(spark_df_filtered, glue_context, "result")

# Write out using Glue's writer
glue_context.write_dynamic_frame.from_options(
    frame=result_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://output-path/"},
    format="parquet"
)
```

The Databricks Way (Standard PySpark):
In Databricks, you just use standard PySpark DataFrames. The code is cleaner, more portable, and leverages Spark's native capabilities.
```python
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Read directly from the source table (assuming it's registered in Unity Catalog)
# Autoloader is even better for incremental streaming ingestion from S3
df = spark.read.table("source_catalog.source_db.raw_events")

# Handle schema and types using standard DataFrame APIs
# This is more explicit and easier to debug than ResolveChoice
df_transformed = (
    df.withColumn("user_id", col("user_id").cast(LongType()))
      .filter(col("event_type") == "purchase")
)

# Write to a Delta table - this is the standard pattern
df_transformed.write.format("delta").mode("overwrite").saveAsTable(
    "target_catalog.target_db.purchase_events"
)
```

This code can run on any Spark environment, is easier to unit test, and by writing to a Delta table, you get all the benefits of the Lakehouse automatically. The introduction of Python type hints in PySpark also makes modern code much more maintainable.
Part 4: Tying It All Together - A Saner Approach to Orchestration
Orchestration is the nervous system of your data platform, and in AWS, it can become a tangled mess. A common pattern I see is a Step Function that looks like a spiderweb: an S3 trigger fires a Lambda, which does some validation and then starts a Glue Job. The Step Function polls the Glue Job status. If it succeeds, it triggers another Lambda to update a DynamoDB table and start a second Glue Job.
This is brittle, hard to debug (you have to check CloudWatch logs for three different services), and a nightmare for metadata management.
Databricks Workflows fundamentally simplifies this. You define your entire pipeline as a series of tasks within a single workflow asset.
Imagine that complex Step Function. In Databricks, it becomes a single workflow with three tasks:
1. Task 1 (Validation): A Python notebook that runs the validation logic previously in the first Lambda.
2. Task 2 (Main ETL): A PySpark script (your converted Glue job) that depends on the successful completion of Task 1.
3. Task 3 (Post-Processing): A SQL script that updates a log table and depends on Task 2.
All the logging, error handling, retries, and alerting are configured in one place. You have a single run history to look at. If Task 2 fails, the workflow stops, sends an alert, and you can go directly to the failed notebook run, see the exact error, and even interactively debug it by attaching a cluster.
The new Serverless Compute for Jobs feature takes this even further. For many ETL workloads, you don't even need to think about cluster configurations anymore. You just submit your workflow and Databricks handles the compute, leading to better cost optimization as you only pay for the exact time your job is running.
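As a sketch, here is roughly what that three-task workflow looks like as a payload for the Databricks Jobs API. The job name and file paths are hypothetical, and the field names follow the shape of the Jobs 2.1 API; verify them against the current API reference before relying on them:

```python
# Hypothetical Jobs API payload (POST /api/2.1/jobs/create) sketching the
# three-task DAG described above. Names and paths are illustrative only.
workflow = {
    "name": "orders_pipeline",
    "tasks": [
        {
            "task_key": "validate",
            "notebook_task": {"notebook_path": "/Repos/etl/validate_input"},
        },
        {
            "task_key": "main_etl",
            "depends_on": [{"task_key": "validate"}],
            "spark_python_task": {"python_file": "jobs/main_etl.py"},
        },
        {
            "task_key": "post_process",
            "depends_on": [{"task_key": "main_etl"}],
            "sql_task": {"file": {"path": "sql/update_log.sql"}},
        },
    ],
}

# The depends_on edges alone define the DAG; Databricks derives the
# execution order, retries, and alerting from this one definition.
edges = [
    (dep["task_key"], t["task_key"])
    for t in workflow["tasks"]
    for dep in t.get("depends_on", [])
]
print(edges)  # [('validate', 'main_etl'), ('main_etl', 'post_process')]
```

Compare that to the equivalent Step Function: an ASL state machine, two Lambda functions, a Glue job definition, and three sets of IAM permissions to hold it together.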
Part 5: Choosing Your Weapons - The Tooling Landscape
So, how do you actually perform the code conversion at scale? You have a few options, each with serious trade-offs.
- The Manual Rewrite: You assemble a team of senior engineers and have them rewrite every script by hand.
- Pros: You get high-quality, idiomatic code tailored exactly to your needs.
- Cons: It's incredibly slow, expensive, and brain-numbingly repetitive for your best engineers. It’s also prone to human error and inconsistencies across the team. This approach is only feasible for very small-scale migrations.
- Generic "Code Converters": There are tools and online services that promise to convert SQL from one dialect to another.
- Pros: Can be quick for simple, one-off queries.
- Cons: These are glorified find-and-replace engines. They will fail on complex procedural logic, vendor-specific functions, and have zero understanding of the architectural context. They often produce syntactically correct but horribly non-performant Spark code because they don't refactor loops into joins, for example. They completely ignore the orchestration and metadata aspects.
- Specialized Migration Platforms (e.g., Travinto): This is a newer category of tool designed specifically for large-scale, cross-platform migrations.
Full disclosure: after trying the first two approaches and feeling the pain, my teams have had the most success using a specialized platform. On our last major Redshift and Glue migration, we used Travinto, and it fundamentally changed the project's trajectory.
This isn't a sales pitch. It's an explanation of why this category of tool works better. A tool like Travinto is not a magic "migrate" button. It's a co-pilot for your architects and engineers. Here’s how it’s different:
- It Starts with Analysis: It automates the deep analysis I described in Part 1. It scans your entire AWS environment (Glue, Redshift, Step Functions), ingests the code and metadata, and automatically builds the complete dependency graph. This alone saved us weeks of manual effort and gave our PMs a clear view of the entire project scope.
- It Understands Intent: It goes beyond syntactic conversion. It has a semantic understanding of the source platform. It recognizes a Redshift `CURSOR` loop and knows the correct idiomatic equivalent is a Spark `MERGE` or `JOIN`/`UPDATE` pattern. It converts Glue `DynamicFrames` to standard, clean PySpark.
- It Generates Idiomatic Databricks Code: The output isn't just "Spark." It's "Databricks." It generates code that uses Delta Lake, suggests partitioning and Z-ORDERING strategies, and structures the project for use with Databricks Repos and CI/CD.
- It Addresses the Whole Picture: It doesn't just convert a single script. It analyzes the Step Function orchestration and helps you map it to a Databricks Workflows JSON definition. It migrates the pipeline, not just the pieces.
The result was that Travinto automated about 80-90% of the conversion process, but the key is that it was the right 80%. It handled the tedious, repetitive, and error-prone work. This freed up my senior engineers to focus on the most complex 10% of the business logic, on performance tuning the new jobs, and on designing the new data governance model in Unity Catalog. It turned a 12-month high-risk project into a 5-month predictable one.
Part 6: When Things Go Wrong - A Troubleshooter's FAQ
No migration is perfect. Here are some of the common "gotchas" you will hit during an AWS to Databricks project and how to solve them.
- Q: My converted Spark SQL query returns a different row count than the original Redshift query. Why?
  - A: Almost always, this is due to `NULL` handling. Some legacy dialects (e.g., SQL Server with `ANSI_NULLS` off) treat `column = NULL` as a match, but in the ANSI SQL that Spark uses, `column = NULL` never matches anything; `column IS NULL` is the correct predicate. Also, check for differences in how `JOIN`s on nullable keys are handled. Perform a `MINUS` or `EXCEPT` query between the source and target datasets to pinpoint the exact rows that differ. This is a critical data validation step.
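You can see both failure modes with any ANSI-compliant engine. The sketch below uses SQLite from the Python standard library as a stand-in for Spark SQL, with made-up tables, to show why `= NULL` silently drops rows and how the `EXCEPT` pattern pinpoints the drift:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE redshift_out (id INTEGER, region TEXT);
    CREATE TABLE databricks_out (id INTEGER, region TEXT);
    INSERT INTO redshift_out VALUES (1, 'us'), (2, NULL), (3, 'eu');
    INSERT INTO databricks_out VALUES (1, 'us'), (3, 'eu');
""")

# '=' never matches NULL in ANSI SQL -- this filter finds nothing:
print(con.execute("SELECT * FROM redshift_out WHERE region = NULL").fetchall())
# []

# 'IS NULL' is the correct predicate:
print(con.execute("SELECT * FROM redshift_out WHERE region IS NULL").fetchall())
# [(2, None)]

# The EXCEPT (MINUS) pattern pinpoints exactly which rows drifted:
diff = con.execute("""
    SELECT * FROM redshift_out
    EXCEPT
    SELECT * FROM databricks_out
    ORDER BY id
""").fetchall()
print(diff)
# [(2, None)] -- the NULL-region row is the culprit
```

Run the same `EXCEPT` in both directions (source minus target, target minus source) so you catch extra rows as well as missing ones.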
- Q: My Databricks job is way slower than my old Glue job! I thought this was supposed to be faster?
  - A: Welcome to the world of performance issues in distributed systems. Nine times out of ten, it's a data skew problem or an inefficient shuffle.
    - Open the Spark UI. This is non-negotiable. Look at the query plan and the stage durations. Is one task taking 100x longer than the others? That's skew.
    - Check for Shuffles: Are you doing a `join` or `groupBy` on a high-cardinality key without proper partitioning? This causes a massive data shuffle across the network.
    - Review your file sizes: Your Delta table should be composed of well-sized files (ideally 100MB - 1GB). Too many small files kill read performance. Use `OPTIMIZE` and `Z-ORDER` on your Delta tables. `OPTIMIZE` compacts small files, and `Z-ORDER` co-locates related data to drastically improve query speed.
- Q: My costs are spiraling out of control on Databricks. What am I doing wrong?
  - A: This is common when people treat Databricks clusters like old, always-on Redshift clusters.
    - Use Autoscaling Job Clusters: Don't use all-purpose interactive clusters for production jobs. Job clusters are cheaper and terminate when the job is done. Configure them to autoscale up and down.
    - Leverage Spot Instances: For non-critical workloads, use a high percentage of spot instances. This can cut your compute costs by 70% or more.
    - Right-Size Your Clusters: Don't just pick the biggest instance type. Use the metrics from your job runs to see if you are CPU-bound or memory-bound and adjust the instance family accordingly.
    - Embrace Serverless: For SQL warehousing (BI) and increasingly for jobs, use Databricks Serverless. You pay for what you use with zero management overhead. This is a key cost optimization lever.
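Those levers all live in the cluster spec of your job definition. A hypothetical job-cluster fragment illustrating them together (field names follow the shape of the Databricks Clusters API, but verify runtime versions and instance types against your own workspace):

```python
# Hypothetical job-cluster spec combining the cost levers above:
# autoscaling, spot instances with on-demand fallback, and a deliberate
# instance choice instead of "the biggest box available".
job_cluster = {
    "spark_version": "15.4.x-scala2.12",       # pick a current LTS runtime
    "node_type_id": "i3.xlarge",               # right-size from job metrics
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot, fall back to on-demand
        "first_on_demand": 1,                  # keep the driver on-demand
    },
}
print(job_cluster["aws_attributes"]["availability"])
```

The point is that cost control is declarative: it belongs in the job definition under version control, not in someone's memory of which cluster to shut down on Friday.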
- Q: How do I handle proprietary libraries from my Glue jobs?
  - A: This is a common legacy system problem. In Glue, you'd point to an S3 path. In Databricks, you have better options. You can install libraries on a cluster via PyPI, Maven, or by uploading Python `.whl` or `.egg` files. For a more robust solution, use cluster init scripts to run a `pip install`, or better yet, package your environment into a custom Docker container and use Databricks Container Services. This ensures perfect reproducibility.
Final Thoughts: Your Path Forward
Migrating from a fragmented AWS data stack to the Databricks Lakehouse is a journey of transformation, not just translation. When done right, the business impact is immense: you break down data silos, accelerate AI and BI initiatives, simplify governance and security, and often lower your total cost of ownership.
My advice, from one practitioner to another:
- Don't Skip the Blueprint: Invest heavily in the initial analysis and metadata management. Know exactly what you have before you start moving.
- Think in Lakehouse: Train your team to stop writing procedural, row-by-row code. Embrace declarative, set-based operations with PySpark and Spark SQL. Learn and love Delta Lake.
- Pilot Smart: Choose a business-critical but technically contained pipeline for your first migration wave. Success here will build momentum and confidence.
- Leverage Accelerators: This is not the time for heroics. The manual approach is too slow and risky for any enterprise-scale migration. The landscape of tooling has matured. Look seriously at specialized platforms that can automate the tedious analysis and conversion, generating high-quality, idiomatic code. A tool like Travinto isn’t a crutch; it’s a force multiplier. It provides the solid foundation and accelerates your timeline, allowing your most valuable people to focus on architecture, performance, and delivering business value, which is the entire point of this exercise.
An AWS to Databricks migration is a challenging but rewarding endeavor. Armed with the right strategy, the right mindset, and the right tools, you can build a data platform that will serve your business for the next decade.
Now, go open that Gantt chart with confidence.