Redshift to Databricks Migration: A Practical Guide for Data Teams
The world of data is in constant motion. The tools and platforms that were revolutionary a few years ago are now making way for more flexible, scalable, and unified systems. If you're running your analytics on AWS Redshift, you've likely experienced its power as a petabyte-scale data warehouse. But you might also be feeling its limitations, especially when it comes to handling unstructured data, integrating machine learning workflows, and managing rising costs.
Enter Databricks and the Lakehouse architecture.
The shift from a traditional data warehouse like Redshift to a lakehouse platform is more than just a technology swap; it's a paradigm shift. It’s about breaking down the silos between your data lake, data warehouse, and machine learning platforms. It's about building a single, open source-based home for all your data—structured, semi-structured, and unstructured—and all your analytics and AI workloads.
Migrating a mission-critical system like your data warehouse can feel daunting. Where do you start? What are the hidden "gotchas"? How do you ensure business continuity? This guide is here to answer those questions. We'll walk you through a phased migration strategy, provide automated scripts for assessment, and dive deep into the common errors you might encounter and how to solve them.
Why Even Consider Migrating from Redshift to Databricks?
Before diving into the "how," let's solidify the "why." Understanding the fundamental differences and benefits will anchor your migration project and help you communicate its value to stakeholders.
| Feature | AWS Redshift (Traditional Data Warehouse) | Databricks (Lakehouse Platform) |
|---|---|---|
| Architecture | Tightly coupled compute and storage (MPP) | Decoupled compute (Spark clusters) and storage (your cloud object storage, e.g., S3) |
| Data Formats | Proprietary internal format | Open source formats (Delta Lake, Parquet, ORC, etc.) |
| Data Types | Primarily structured data; semi-structured support via the SUPER data type. | Handles structured, semi-structured, and unstructured data (text, images, video) natively. |
| Workloads | Optimized for BI and SQL analytics. | Unified platform for SQL analytics, data engineering (ETL), and data science/machine learning. |
| Cost Model | Pay for provisioned clusters (compute + storage), even when idle. | Pay for compute only when clusters are running. Storage costs are standard object storage fees. |
| Scalability | Scale by adding nodes to a cluster, which can involve downtime or data redistribution. | Compute clusters can be spun up, resized, or shut down in minutes with no data movement. |
The key drivers for making the move often boil down to these core advantages of the Databricks Lakehouse:
- Unify Data and AI: Stop moving data between systems. Analyze your data and train ML models on the exact same data, in the same platform, reducing complexity and cost.
- Cost-Efficiency: The decoupled architecture means you can scale your storage to petabytes at low cost on S3, and only pay for the compute you need, when you need it. You can even use different-sized clusters for different workloads on the same data.
- Future-Proofing with Open Formats: By building on Delta Lake, which is built on the open Apache Parquet format, you avoid vendor lock-in. Your data is always accessible to a wide ecosystem of tools and engines.
- Ultimate Flexibility: Natively process streaming data, run complex transformations in Python or Scala, and serve interactive SQL dashboards, all from one copy of your data.
The Migration Game Plan: A Phased Approach
A successful migration is a well-planned one. Rushing in without a clear strategy is a recipe for budget overruns, missed deadlines, and frustrated users. We recommend breaking the project into five distinct phases.
Phase 1: Discovery and Assessment
You can't map a journey without knowing your starting point. This initial phase is about creating a detailed inventory of your entire Redshift ecosystem. You need to understand what you have, how it's used, and who uses it.
Key questions to answer:
- Data & Schema: How many databases, schemas, and tables are there? What are their sizes, data types, and constraints? Are there any Redshift-specific features in use (e.g., `DISTKEY`, `SORTKEY`, the `SUPER` data type)?
- Workloads & Queries: What are the most frequently executed queries? Which tables are queried most often? Are there complex stored procedures or user-defined functions (UDFs)? What is the typical query concurrency and performance?
- Dependencies: What tools and processes connect to Redshift? This includes ETL/ELT pipelines (Airflow, AWS Glue, custom scripts), business intelligence tools (Tableau, Power BI, Looker), and any custom applications.
- Users & Security: Who are the users and what are their roles and permissions? How is access control managed?
Automating Your Inventory with a Python Script
Manually gathering this information is tedious and error-prone. You can automate a large part of this discovery process by querying Redshift's system tables. The following Python script connects to your Redshift cluster and generates a high-level usage and inventory report.
Prerequisites:
- Python 3 installed.
- The `psycopg2-binary` and `pandas` libraries. Install them with pip: `pip install psycopg2-binary pandas`.
- Read-only access to the Redshift cluster for a dedicated user. This user needs permission to view system tables and views such as `svv_table_info` and `stl_scan`.
```python
import psycopg2
import pandas as pd
from datetime import datetime, timedelta

# --- Configuration ---
# Replace with your Redshift connection details
REDSHIFT_HOST = "your-redshift-cluster.random-chars.region.redshift.amazonaws.com"
REDSHIFT_PORT = 5439
REDSHIFT_DB = "your_db_name"
REDSHIFT_USER = "your_readonly_user"
REDSHIFT_PASSWORD = "your_password"

# --- Functions to Query Redshift Metadata ---

def get_redshift_connection():
    """Establishes and returns a connection to the Redshift cluster."""
    try:
        conn = psycopg2.connect(
            host=REDSHIFT_HOST,
            port=REDSHIFT_PORT,
            dbname=REDSHIFT_DB,
            user=REDSHIFT_USER,
            password=REDSHIFT_PASSWORD,
        )
        print("Successfully connected to Redshift.")
        return conn
    except Exception as e:
        print(f"Error: Could not connect to Redshift. {e}")
        return None

def get_table_inventory(conn):
    """
    Queries Redshift for a list of all tables, their schemas, sizes, and row counts.
    Ignores system tables.
    """
    print("\nFetching table inventory...")
    query = """
        SELECT
            "schema" AS table_schema,
            "table" AS table_name,
            size AS size_mb,
            tbl_rows AS row_count
        FROM svv_table_info
        WHERE "schema" NOT IN ('pg_catalog', 'information_schema')
        ORDER BY size_mb DESC;
    """
    try:
        return pd.read_sql_query(query, conn)
    except Exception as e:
        print(f"Error fetching table inventory: {e}")
        return pd.DataFrame()

def get_query_usage_report(conn, days=30):
    """
    Analyzes scan history to find the most-queried tables and the users querying them.
    Scans the last 'days' of history from stl_scan.
    """
    print(f"\nAnalyzing query usage for the last {days} days...")
    start_time = (datetime.utcnow() - timedelta(days=days)).strftime('%Y-%m-%d %H:%M:%S')
    query = f"""
        WITH query_scans AS (
            SELECT
                userid,
                query,
                tbl AS table_id
            FROM stl_scan
            WHERE starttime >= '{start_time}'
              AND tbl > 0  -- Exclude system and temporary tables
        )
        SELECT
            u.usename AS user_name,
            t."schema" AS schema_name,
            t."table" AS table_name,
            COUNT(DISTINCT qs.query) AS query_count
        FROM query_scans qs
        JOIN svv_table_info t ON qs.table_id = t.table_id
        JOIN pg_user u ON qs.userid = u.usesysid
        WHERE t."schema" NOT IN ('pg_catalog', 'information_schema')
        GROUP BY 1, 2, 3
        ORDER BY query_count DESC;
    """
    try:
        return pd.read_sql_query(query, conn)
    except Exception as e:
        print(f"Error analyzing query usage: {e}")
        return pd.DataFrame()

# --- Main Execution ---
if __name__ == "__main__":
    connection = get_redshift_connection()
    if connection:
        # Get and display table inventory
        table_inventory_df = get_table_inventory(connection)
        if not table_inventory_df.empty:
            print("\n--- Table Inventory (Top 20 by Size) ---")
            print(table_inventory_df.head(20).to_string())
            # You can save this to CSV for further analysis
            # table_inventory_df.to_csv("redshift_table_inventory.csv", index=False)

        # Get and display query usage report
        query_usage_df = get_query_usage_report(connection, days=30)
        if not query_usage_df.empty:
            print("\n--- Top 20 Most Queried Tables (Last 30 Days) ---")
            print(query_usage_df.head(20).to_string())
            # query_usage_df.to_csv("redshift_query_usage.csv", index=False)

        connection.close()
        print("\nConnection to Redshift closed.")
```
How to Use This Script:
- Fill in your Redshift connection details in the "Configuration" section.
- Run the script from your terminal: `python your_script_name.py`.
- The output will be printed to the console, showing you the largest tables and the most frequently accessed tables. You can uncomment the `.to_csv()` lines to save the full reports for deeper analysis in a spreadsheet.
This automated report gives you a data-driven foundation for your migration plan, helping you prioritize which tables to move first and identify potential complexities.
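To turn those two reports into an ordered backlog, you can combine size and query frequency into a simple priority score. The helper below is a plain-Python sketch with illustrative weights (`size_weight`, `usage_weight` are assumptions to tune for your environment), not a prescribed formula:

```python
def prioritize_tables(inventory, usage, size_weight=1.0, usage_weight=2.0):
    """Rank tables for migration by combining size and query frequency.

    inventory: dict mapping (schema, table) -> size_mb
    usage:     dict mapping (schema, table) -> query_count
    The weights are illustrative; heavily-queried tables score higher,
    large but rarely-used tables score lower.
    """
    scores = {}
    for key, size_mb in inventory.items():
        query_count = usage.get(key, 0)
        # Reward usage, penalize bulk (size converted to GB).
        scores[key] = usage_weight * query_count - size_weight * (size_mb / 1024)
    return sorted(scores, key=scores.get, reverse=True)

inventory = {("public", "sales"): 50_000, ("public", "audit_log"): 900_000}
usage = {("public", "sales"): 1200, ("public", "audit_log"): 3}
print(prioritize_tables(inventory, usage))
```

Feeding it the CSV outputs of the assessment script gives you a defensible "migrate first" list to review with stakeholders.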
Phase 2: Planning and Design
With your assessment complete, you can now design your target state in Databricks. A common mistake is to "lift-and-shift"—simply recreating your Redshift structure in Databricks. This approach fails to leverage the full power of the Lakehouse.
Design for the Lakehouse:
- Embrace the Medallion Architecture: This is a best practice for structuring data in a lakehouse. It organizes data into three quality tiers:
- Bronze (Raw): Ingest raw data from your source systems here. This is your single source of truth, an append-only, immutable copy of the data as it arrived. For the migration, this would be the raw data unloaded from Redshift.
- Silver (Cleansed, Conformed): Data from the Bronze layer is cleaned, de-duplicated, and joined to create a validated, queryable data model. This layer is where you might merge historical Redshift data with new incoming data.
- Gold (Aggregated, Curated): The Silver layer is further refined and aggregated to create business-level tables optimized for specific BI and analytics use cases. This is what your BI tools will primarily connect to, analogous to your final reporting tables in Redshift.
- Choose Your Tools:
- Data Ingestion: Will you use Databricks Auto Loader for incremental files from S3? Or a tool from Partner Connect like Fivetran or Arcion for Change Data Capture (CDC)?
- Transformation: Will you rewrite your Redshift stored procedures and ETL scripts in PySpark or Spark SQL? Or will you use a tool like dbt (Data Build Tool), which works beautifully on Databricks, to manage your SQL-based transformations?
- Orchestration: How will you schedule your jobs? Databricks Workflows provides a native, powerful orchestrator. If you're already using Airflow, you can simply repoint your tasks to run on Databricks.
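The Bronze/Silver/Gold flow is easiest to grasp with a toy example. The sketch below uses plain Python over in-memory records purely for illustration; in Databricks each layer would be a Delta table populated by PySpark or SQL jobs:

```python
# Bronze: raw, append-only records exactly as they arrived (duplicates and all).
bronze_sales = [
    {"order_id": 1, "region": "EU", "amount": 100.0},
    {"order_id": 1, "region": "EU", "amount": 100.0},  # duplicate ingest
    {"order_id": 2, "region": "US", "amount": 250.0},
]

# Silver: cleansed and de-duplicated into a validated, queryable model.
silver_sales = list({row["order_id"]: row for row in bronze_sales}.values())

# Gold: business-level aggregate, ready for BI dashboards.
gold_revenue_by_region = {}
for row in silver_sales:
    gold_revenue_by_region[row["region"]] = (
        gold_revenue_by_region.get(row["region"], 0.0) + row["amount"]
    )

print(gold_revenue_by_region)  # {'EU': 100.0, 'US': 250.0}
```

The point is the one-way flow: each tier is derived from the one below it, so any layer can be rebuilt from Bronze if a transformation needs to change.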
Phase 3: The Migration Execution
This is where the rubber meets the road. We can break this phase down into four parallel streams: schema, data, code (ETL), and BI tools.
- Schema and Code Migration:
  - Translate DDL: Convert your Redshift `CREATE TABLE` statements to Databricks Spark SQL. Pay close attention to data type mappings (more on this in the errors section).
  - Replace Redshift-Specifics:
    - `DISTKEY`/`DISTSTYLE`: These control data distribution in Redshift's MPP architecture. The concept doesn't exist in Databricks; the Spark engine handles data distribution automatically.
    - `SORTKEY`: Used for query performance in Redshift. The equivalent in Databricks Delta Lake is `ZORDER`. You'll want to apply `ZORDER` on columns frequently used in `WHERE` clauses on your Gold tables.
  - Translate SQL and UDFs: Most standard ANSI SQL will work with minor changes. However, you'll need to translate Redshift-specific functions (`LISTAGG`, `GETDATE()`) and rewrite complex UDFs in Python, Scala, or Spark SQL.
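A first pass at DDL translation can be mechanized. The sketch below is a deliberately naive, regex-based mapper; the type mappings and stripped clauses are the common cases only, and a real migration should use a proper SQL parser or an automated conversion tool rather than this illustration:

```python
import re

# Illustrative mapping of common Redshift types to Spark SQL types.
TYPE_MAP = [
    (re.compile(r"\bVARCHAR\s*\(\s*\d+\s*\)", re.IGNORECASE), "STRING"),
    (re.compile(r"\bTEXT\b", re.IGNORECASE), "STRING"),
    (re.compile(r"\bINT8\b", re.IGNORECASE), "BIGINT"),
    (re.compile(r"\bFLOAT8\b", re.IGNORECASE), "DOUBLE"),
]

# Redshift-only physical-layout clauses with no Spark SQL counterpart.
STRIP_CLAUSES = re.compile(
    r"\s*(DISTSTYLE\s+\w+|DISTKEY\s*\([^)]*\)|SORTKEY\s*\([^)]*\))",
    re.IGNORECASE,
)

def translate_ddl(redshift_ddl: str) -> str:
    """Rewrite a Redshift CREATE TABLE statement into Spark SQL (best effort)."""
    ddl = STRIP_CLAUSES.sub("", redshift_ddl)
    for pattern, spark_type in TYPE_MAP:
        ddl = pattern.sub(spark_type, ddl)
    return ddl

ddl = "CREATE TABLE sales (id INT8, note VARCHAR(256)) DISTSTYLE KEY DISTKEY(id) SORTKEY(id)"
print(translate_ddl(ddl))  # CREATE TABLE sales (id BIGINT, note STRING)
```

Note that `SORTKEY` columns are stripped, not dropped on the floor: record them separately so you know which columns to `ZORDER BY` after loading.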
- Data Migration: The most efficient way to move large volumes of data from Redshift to Databricks is via S3.
  - Step 1: Unload from Redshift to S3: Use the Redshift `UNLOAD` command. This is a highly parallelized and performant operation. Best Practice: Unload your data into an open, splittable file format like Parquet. This will make ingestion into Databricks much faster.
```sql
-- Example: Unloading a 'sales' table to Parquet in S3
UNLOAD ('SELECT * FROM public.sales')
TO 's3://your-migration-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftS3Role'
FORMAT AS PARQUET
CLEANPATH;
```
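Writing one `UNLOAD` statement per table by hand gets tedious at scale. A small generator can build them from your Phase 1 inventory; in this sketch the bucket name and IAM role ARN are placeholders, and the S3 prefix layout (`schema/table/`) is just one reasonable convention:

```python
def build_unload_sql(schema: str, table: str, bucket: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that exports one table to Parquet in S3."""
    return (
        f"UNLOAD ('SELECT * FROM {schema}.{table}') "
        f"TO 's3://{bucket}/{schema}/{table}/' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET CLEANPATH;"
    )

# Generate a statement per table from the assessment inventory.
tables = [("public", "sales"), ("public", "customers")]
for schema, table in tables:
    print(build_unload_sql(schema, table, "your-migration-bucket",
                           "arn:aws:iam::123456789012:role/YourRedshiftS3Role"))
```

Run the generated statements in batches and keep the list; the same `(schema, table)` pairs drive the matching `COPY INTO` commands on the Databricks side.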
  - Step 2: Ingest from S3 into Databricks Delta Tables: Once the data is in S3, use the Databricks `COPY INTO` command. This is an idempotent, efficient way to bulk-load data.
```sql
-- Example: Creating and loading a Bronze Delta table in Databricks
CREATE TABLE IF NOT EXISTS bronze.sales (
  -- define schema here
);

COPY INTO bronze.sales
FROM 's3://your-migration-bucket/sales_data/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```
  - For Incremental Data (CDC): For ongoing changes after the initial load, use tools like AWS DMS, Fivetran, Qlik Replicate, or Databricks Auto Loader to process new and updated records.
- ETL/ELT Pipeline Migration: This is often the most time-consuming part of the migration. Your Redshift-based ETL, which might be a mix of SQL scripts, stored procedures, and external tools, needs to be re-platformed.
  - Rewrite Stored Procedures: Logic from stored procedures must be translated into Databricks notebooks (using PySpark or Spark SQL) or dbt models.
  - Update Orchestration: Modify your Airflow DAGs or other schedulers to execute Databricks jobs instead of Redshift queries.
- BI and Analytics Tool Re-platforming:
- Update the connection settings in your BI tools (Tableau, Power BI, etc.) to point to a Databricks SQL Warehouse instead of your Redshift cluster.
- Databricks SQL provides a serverless, low-latency, and high-concurrency engine specifically for BI workloads, offering a familiar SQL interface.
- Test and validate all existing dashboards to ensure they function correctly and perform well.
Phase 4: Validation and Cutover
Never decommission the old system until you are 100% confident in the new one.
- Data Validation: Perform rigorous checks to ensure data integrity.
  - Row Counts: Simple but effective. Do the row counts match between source Redshift tables and target Delta tables?
  - Checksums/Aggregations: Run `SUM()`, `AVG()`, `MIN()`, `MAX()` on key numeric columns in both systems. Do the results match?
  - Spot Checks: Query for specific records in both Redshift and Databricks to ensure perfect fidelity.
- Performance Testing: Re-run key reports and ETL jobs on Databricks. Ensure they meet or exceed the performance SLAs established on Redshift. This is where you'll fine-tune your Databricks cluster sizes, partitioning, and Z-ordering.
- User Acceptance Testing (UAT): Get your business users to test their dashboards and reports on the new platform. Their sign-off is critical for a successful cutover.
- Cutover Strategy:
- Big Bang: A single, coordinated event where all workloads are switched to Databricks. Risky, but can be simpler if the system is small.
- Phased (Recommended): Migrate workloads in waves, starting with less critical ones. For example, move one department's dashboards first. This allows you to learn and adapt.
- Parallel Run: Run both Redshift and Databricks in parallel for a period (e.g., a week or a month). All new data flows to both systems. This provides the safest rollback path but incurs higher costs.
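The data-validation checks described above are easy to script once you've pulled the metrics from both systems. This is a minimal comparison sketch in plain Python; in practice the input dicts would be populated by row-count or aggregate queries against Redshift and a Databricks SQL Warehouse:

```python
def compare_metrics(redshift: dict, databricks: dict, tolerance: float = 0.0):
    """Compare per-table validation metrics and return the mismatches.

    Each input maps a table name to a numeric metric (row count, SUM, etc.).
    A table missing from either side is reported as a mismatch.
    """
    mismatches = []
    for table in sorted(set(redshift) | set(databricks)):
        src, dst = redshift.get(table), databricks.get(table)
        if src is None or dst is None or abs(src - dst) > tolerance:
            mismatches.append((table, src, dst))
    return mismatches

redshift_counts = {"sales": 1_000_000, "customers": 52_340}
databricks_counts = {"sales": 1_000_000, "customers": 52_339}
print(compare_metrics(redshift_counts, databricks_counts))
# Only 'customers' differs, so it needs investigation before cutover.
```

The `tolerance` parameter is useful for floating-point aggregates like `SUM()` on `DOUBLE` columns, where tiny rounding differences are expected; keep it at zero for row counts.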
Phase 5: Decommissioning and Optimization
Once the cutover is complete and the new system is stable, it's time to reap the final rewards.
- Decommission Redshift: After a safe period (e.g., 30-60 days) and with final stakeholder approval, you can shut down your Redshift cluster to stop incurring costs. Remember to back up everything one last time before doing so.
- Post-Migration Optimization: The journey doesn't end at migration. Continuously monitor your Databricks environment. Run `OPTIMIZE` (with `ZORDER BY` where appropriate) on your Delta tables to maintain performance, and review cluster usage to ensure you are right-sizing your compute resources.
Navigating the Trenches: Common Migration Errors and How to Fix Them
No migration is perfectly smooth. Anticipating common problems is the best way to mitigate them. Here are the errors we see most often, grouped by category.
1. Connectivity and Security Errors
These often happen right at the start of your migration, when you're trying to connect tools or access data.
Error: Connection timed out or FATAL: no pg_hba.conf entry for host "..." when connecting to Redshift.
- Why it happens: This is a classic networking issue. The machine running your script (e.g., a local laptop or an EC2 instance) is not on the Redshift cluster's "allow list." Redshift clusters are secured by VPC Security Groups and cluster firewall rules.
- How to Fix:
  - Check Security Group: Go to the Redshift cluster settings in the AWS Console and find the associated VPC Security Group.
  - Add Inbound Rule: Add a new inbound rule that allows traffic on port `5439` (the default Redshift port) from the IP address or IP range of your client machine. Be specific; avoid opening it to `0.0.0.0/0` in production.
  - Publicly Accessible Setting: Ensure your Redshift cluster is set to "Publicly accessible" if you are connecting from outside the VPC. If connecting from within the same VPC, this is not necessary.
Error: Access Denied when running UNLOAD from Redshift to S3 or COPY INTO from S3 to Databricks.
- Why it happens: This is an IAM permissions failure. The entity performing the action (either the Redshift cluster or the Databricks cluster) does not have the necessary rights to read from or write to the S3 bucket.
- How to Fix:
  - For Redshift `UNLOAD`: The IAM role associated with your Redshift cluster needs a policy that grants `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` (if using `CLEANPATH`) permissions on the target S3 bucket.
  - For Databricks `COPY INTO`: The IAM role attached to your Databricks cluster's EC2 instances needs a policy that grants `s3:GetObject` and `s3:ListBucket` permissions on the source S3 bucket.
  - Bucket Policies: Double-check that there are no conflicting bucket policies on the S3 bucket itself that are denying access.
2. Schema and Data Type Mismatches
These are subtle but critical errors that can lead to data corruption or failed loads.
Error: Cannot cast [source_type] to [target_type] during data loading, or data appears truncated/incorrect in Databricks.
- Why it happens: Redshift and Spark SQL (used by Databricks) have similar, but not identical, data types. A direct, unthinking translation of a Redshift DDL can cause problems.
- Common Mismatches and How to Fix:
  - `VARCHAR(n)` vs. `STRING`: Redshift's `VARCHAR(256)` enforces a declared maximum length; trying to load a 257-character string fails. Spark's `STRING` type is dynamic and has no practical length limit. Best Practice: Map Redshift `VARCHAR` and `TEXT` to `STRING` in Databricks. This provides flexibility and avoids truncation issues.
  - `TIMESTAMP` vs. `TIMESTAMPTZ`: Redshift has `TIMESTAMP` (no time zone) and `TIMESTAMPTZ` (with time zone). Spark has a single `TimestampType`. Be very careful during the `UNLOAD` and `COPY INTO` process to preserve time zone information if it's important, and ensure your Spark session has the correct time zone configuration (`spark.sql.session.timeZone`).
  - `DECIMAL(p, s)`: Precision and scale are generally compatible, but it's crucial that the Databricks DDL matches or exceeds the precision of the Redshift source to prevent rounding errors or overflow.
  - Redshift `SUPER`: Redshift's `SUPER` type for semi-structured data can be mapped to Databricks' native `STRUCT`, `ARRAY`, or `MAP` types for much more powerful and performant querying of nested data. You'll need to parse the JSON within the `SUPER` field during transformation.
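For `SUPER` columns, the usual pattern is to unload the semi-structured value as a JSON string and parse it into typed columns during the Bronze-to-Silver transformation. Here is a plain-Python illustration of that flattening step with a made-up payload shape; in Databricks you would do the same with `from_json` and an explicit schema:

```python
import json

# A SUPER value unloaded from Redshift typically arrives as a JSON string.
raw_super = '{"device": {"os": "ios", "version": 17}, "tags": ["beta", "eu"]}'

def flatten_event(raw: str) -> dict:
    """Parse a JSON payload and project it into flat, typed columns."""
    payload = json.loads(raw)
    return {
        "device_os": payload["device"]["os"],
        "device_version": payload["device"]["version"],
        "tag_count": len(payload["tags"]),
    }

print(flatten_event(raw_super))
# {'device_os': 'ios', 'device_version': 17, 'tag_count': 2}
```

Defining the target schema explicitly (rather than inferring it) is what protects the Silver layer from silently absorbing malformed payloads.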
3. Performance Bottlenecks
Your migration works, but it's painfully slow, or post-migration queries are not as fast as you expected.
Error: UNLOAD from Redshift or COPY INTO to Databricks is taking many hours or days.
- Why it happens: Inefficient data transfer is the usual culprit.
  - Unloading to a single file: The `UNLOAD` command is parallel by default, but forcing it to write a single file (`PARALLEL OFF`) creates a massive bottleneck.
  - Unloading to CSV: CSV is text-based, slow to write, and slow for Spark to parse. It also doesn't retain schema information.
  - The "small file problem": Unloading millions of tiny files to S3 is just as bad; reading them becomes very inefficient due to the overhead of opening each file.
- How to Fix:
  - Use Parquet: Always `UNLOAD` with `FORMAT AS PARQUET`. It's a columnar, compressed, and splittable format that Spark is highly optimized to read.
  - Allow Parallelism: Let the `UNLOAD` command run in parallel (the default behavior). It will create multiple files in S3, which is exactly what you want for parallel ingestion into Databricks.
  - Control File Size: If you end up with too many small files, use Databricks' `OPTIMIZE` command to compact them into larger, more optimally sized files after loading.
Error: Queries on Delta tables in Databricks are slower than on Redshift.
- Why it happens: You haven't optimized your Delta tables for your query patterns. A "lift-and-shift" of the schema is not enough.
- How to Fix:
  - Partitioning: Partition your large Delta tables on low-cardinality columns that are frequently used as filters (e.g., `date`, `country`). This allows the Spark engine to skip reading massive amounts of data ("data skipping"). A 10TB table partitioned by date might only require reading 100GB of data for a query filtering on a specific week.
  - Z-Ordering: For high-cardinality columns that are often used in `WHERE` clauses (e.g., `user_id`, `product_id`), use `ZORDER`. This co-locates related data within the data files, dramatically improving data skipping. A good strategy is to `PARTITION BY` a date column and `ZORDER BY` an ID column.
```sql
-- Run this periodically on your Silver/Gold tables
OPTIMIZE gold.fact_sales
ZORDER BY (productId, customerId);
```
  - Right-Sized Clusters: Ensure your Databricks SQL Warehouse or job cluster is appropriately sized for the workload. A query that's slow on a Small warehouse might be instant on a Medium or Large one. Monitor query history to identify slow queries and resize clusters as needed.
4. SQL and Function Incompatibility
Your tried-and-true Redshift SQL scripts are failing in the Databricks environment.
Error: Function [function_name] not found or general syntax errors.
- Why it happens: While both use SQL, the dialects have differences. Many utility functions are proprietary to Redshift.
- Common Translations and How to Fix:
| Redshift Function/Syntax | Databricks (Spark SQL) Equivalent | Notes |
|---|---|---|
| `GETDATE()` | `current_timestamp()` or `now()` | Standard SQL function. |
| `LISTAGG(column, ',')` | `collect_list()` or `collect_set()` | `collect_list` returns an array of values. You may need `concat_ws(',', collect_list(column))` to get a comma-separated string. |
| `DATEDIFF(part, start, end)` | `datediff(end, start)` (returns days) | For other parts (month, year), use functions like `months_between()` or extract parts and subtract. |
| `NVL(col1, col2)` or `COALESCE` | `COALESCE(col1, col2)` | `COALESCE` is standard and preferred. |
| Identity columns (`IDENTITY(seed, step)`) | `GENERATED ALWAYS AS IDENTITY` | The syntax is slightly different but achieves the same auto-incrementing result. |
| Stored procedures (`CREATE PROCEDURE`) | Databricks notebooks or dbt models | There is no direct translation for procedural logic. It must be rewritten in PySpark/Scala/SQL within a notebook or refactored into modular SQL transformations in dbt. |
- Best Practice: Create a small "translation cheat sheet" for your team. For complex transformations and UDFs, consider using a third-party automated code conversion tool (e.g., from Partner Connect) to handle the bulk of the translation, followed by manual review.
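A cheat sheet like that can be partially mechanized for a first pass. The sketch below applies naive regex rewrites for the mechanical cases; it's an illustration, not a parser, and anything it can't safely rewrite (such as `LISTAGG` or stored procedures) still needs human review:

```python
import re

# First-pass rewrites for Redshift-specific functions. LISTAGG and
# procedural code are intentionally excluded: they need manual restructuring.
REWRITES = [
    (re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE), "current_timestamp()"),
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
    # Redshift: DATEDIFF(day, start, end) -> Spark: datediff(end, start)
    (re.compile(r"\bDATEDIFF\s*\(\s*day\s*,\s*([^,]+),\s*([^)]+)\)", re.IGNORECASE),
     r"datediff(\2, \1)"),
]

def translate_sql(sql: str) -> str:
    """Apply mechanical Redshift-to-Spark rewrites to a SQL string."""
    for pattern, replacement in REWRITES:
        sql = pattern.sub(replacement, sql)
    return sql

print(translate_sql(
    "SELECT NVL(a, b), GETDATE(), DATEDIFF(day, start_dt, end_dt) FROM t"))
```

Running a translator like this over your SQL corpus also doubles as an inventory: every statement it leaves unchanged but that still fails in Spark is a candidate for the manual-rework queue.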
5. Orchestration and Dependency Failures
You've migrated the data and the code, but the end-to-end pipelines are failing.
Error: An Airflow DAG fails on a task that runs a Databricks job, or a downstream job starts before an upstream Databricks job is truly finished.
- Why it happens:
- Hardcoded Connections: The orchestration tool might still have hardcoded connection details pointing to Redshift.
- Incorrect Dependency Mapping: The complex web of dependencies that existed in the Redshift world was not fully mapped and replicated for the Databricks workflows.
- Asynchronous Execution: Some tools might kick off a Databricks job and immediately return "success," even though the job is still running. The next task then starts prematurely with incomplete data.
- How to Fix:
  - Parameterize Connections: In your orchestrator (Airflow, Prefect, etc.), use variables and connection stores to manage your Databricks host and token. This makes it easy to switch between dev and prod environments.
  - Use Synchronous Operators: In Airflow, use operators like `DatabricksRunNowOperator` in synchronous mode (the default) so the Airflow task only succeeds when the Databricks job completes successfully.
  - Visualize and Map DAGs: Use the DAG visualization tools in your orchestrator to meticulously map the new dependencies. If `job_B` depends on `job_A`, ensure that dependency is explicitly defined.
  - Leverage Databricks Workflows: For new pipelines, consider using Databricks Workflows. It allows you to define multi-task jobs with dependencies directly within the Databricks UI, simplifying orchestration significantly.
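When re-mapping dependencies, a quick sanity check catches ordering mistakes before they reach production. This toy validator (plain Python; your orchestrator's own DAG tooling remains the real source of truth) confirms that a planned run order respects every declared upstream/downstream edge:

```python
def validate_run_order(run_order, dependencies):
    """Return the dependency edges violated by run_order.

    dependencies: list of (upstream, downstream) pairs, meaning the
    downstream job must not start before its upstream job.
    """
    position = {job: i for i, job in enumerate(run_order)}
    violations = []
    for upstream, downstream in dependencies:
        if position[upstream] >= position[downstream]:
            violations.append((upstream, downstream))
    return violations

deps = [("ingest_sales", "build_silver"), ("build_silver", "refresh_gold")]
print(validate_run_order(["ingest_sales", "build_silver", "refresh_gold"], deps))  # []
print(validate_run_order(["build_silver", "ingest_sales", "refresh_gold"], deps))
# [('ingest_sales', 'build_silver')] -- build_silver would run before its input exists
```

Checks like this are cheap to run in CI against the dependency list you extracted during Phase 1, so drift between the old Redshift schedule and the new Databricks workflow is caught automatically.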
Conclusion: Your Path to the Lakehouse
Migrating from AWS Redshift to Databricks is a strategic project that positions your organization for the future of data and AI. It's a move away from siloed, proprietary systems toward an open, unified, and highly scalable platform. By embracing the lakehouse architecture, you unlock the ability to run every data workload—from SQL analytics and BI to data engineering and machine learning—on a single, consistent copy of your data.
The path requires careful planning, a phased execution, and a proactive approach to problem-solving. But the challenges are well-understood and the solutions are mature. By starting with a thorough automated assessment, designing for the lakehouse paradigm, and anticipating the common pitfalls in connectivity, performance, and code translation, you can de-risk your migration and accelerate your time to value.
Migration Best Practices: A Quick Summary
To wrap up, here are the key takeaways for a successful migration:
- Assess, Don't Guess: Use automated scripts to get a data-driven inventory of your Redshift environment before you start.
- Design for the Lakehouse: Don't just "lift-and-shift." Embrace the Medallion Architecture (Bronze, Silver, Gold) to build a scalable and high-quality data foundation.
- Use S3 as Your Bridge: The most efficient data transfer path is Redshift `UNLOAD` (to Parquet) -> S3 -> Databricks `COPY INTO`.
- Prioritize Open Formats: Unload to Parquet, not CSV. Build on Delta Lake. This ensures performance and avoids future lock-in.
- Translate, Don't Just Copy: Pay close attention to SQL dialects, data types, and Redshift-specific features like `SORTKEY`, which map to concepts like `ZORDER` in Databricks.
- Validate Rigorously: Run parallel systems and perform extensive data and performance validation before decommissioning your legacy system.
- Optimize Post-Migration: The job isn't done at cutover. Continuously optimize your Delta tables (`OPTIMIZE`, `ZORDER`) and right-size your clusters to manage performance and cost.
- Engage Your Users: Involve BI users and data consumers early and often (especially in UAT) to ensure the new system meets their needs and to champion adoption of the new platform.