How to migrate BigQuery to Databricks?

My Unfiltered Playbook for Migrating BigQuery to Databricks: Lessons from the Trenches

I've sat across the table from dozens of teams facing the same crossroads. They built their data world on Google BigQuery—it was fast, it was serverless, and for a time, it was perfect. But then things changed. Their ML teams felt constrained. The bill started to look like a phone number. The desire for a more open, multi-cloud future became a strategic imperative. And the conversation always landed on the same question: "We're thinking about moving to Databricks. How hard is it, really?"

The short answer is: it's more than a data-moving exercise. It’s a paradigm shift. I've led these migrations, from multi-terabyte departmental data marts to petabyte-scale enterprise warehouses. I’ve seen the smooth wins and the late-night firefighting. This isn't a theoretical guide or a vendor's marketing fluff. This is my playbook, forged in the fires of real-world production migrations. It's the unfiltered truth about what to expect, where you'll get stuck, and how to get unstuck.

We're moving from a fully-managed, serverless SQL data warehouse to a flexible, Spark-based lakehouse platform. That means you're trading the simplicity of "just run a query" for unparalleled power, control, and openness. That trade-off is where the complexity—and the opportunity—lies.

Before You Write a Single Line of Code: The Discovery and Scoping Blueprint

The biggest mistake I see is teams jumping straight into moving data. They think, "We'll just export to GCS and load into Delta Lake." That's not a strategy; it's a recipe for scope creep, budget overruns, and a painful cutover. A successful migration is 70% planning and 30% execution.

Building a Comprehensive Inventory: You Can't Migrate What You Can't See

Your first job is to become an archaeologist. You need to unearth every single asset in your BigQuery environment. Don't just look at the main production datasets. You need to find the forgotten UDFs, the scheduled queries that run a critical-but-undocumented report, and the views that power a dozen BI dashboards.

On one project, we missed a set of JavaScript UDFs used for complex string parsing. It wasn't until user acceptance testing (UAT) that reports started failing. We had to scramble to rewrite that logic in Python for Spark, delaying the go-live by a week. Never again.

I always start with an automated inventory script. It connects to the BigQuery API and dumps a structured catalog of every object.

Practical Python Snippet: BigQuery Inventory Discovery

    # bq_inventory_discovery.py
    from google.cloud import bigquery
    from google.api_core.exceptions import NotFound
    import json

    # --- Configuration ---
    # Ensure you've authenticated via `gcloud auth application-default login`
    PROJECT_ID = "your-gcp-project-id" 
    OUTPUT_FILE = "bigquery_inventory.json"
    # ---------------------

    def get_bq_inventory(project_id):
        """
        Connects to BigQuery and builds a comprehensive inventory of all assets.
        """
        client = bigquery.Client(project=project_id)
        inventory = {"project_id": project_id, "datasets": []}

        print(f"Starting inventory for project: {project_id}...")

        try:
            datasets = list(client.list_datasets())
            if not datasets:
                print("No datasets found in the project.")
                return inventory

            for dataset_ref in datasets:
                dataset_id = dataset_ref.dataset_id
                dataset_obj = client.get_dataset(dataset_id)
                dataset_info = {
                    "id": dataset_id,
                    "location": dataset_obj.location,
                    "tables": [],
                    "views": [],
                    "routines": []
                }
                print(f"  Processing Dataset: {dataset_id}")

                # 1. Get Tables and Views
                tables_and_views = list(client.list_tables(dataset_id))
                for item in tables_and_views:
                    full_table_id = f"{project_id}.{dataset_id}.{item.table_id}"
                    try:
                        table_details = client.get_table(full_table_id)
                        table_meta = {
                            "id": item.table_id,
                            "type": table_details.table_type,
                            "num_rows": table_details.num_rows,
                            "num_bytes": table_details.num_bytes,
                            "partitioning_field": table_details.time_partitioning.field if table_details.time_partitioning else None,
                            "clustering_fields": table_details.clustering_fields,
                            "schema": [f"{field.name}:{field.field_type}" for field in table_details.schema]
                        }
                        if table_details.table_type == "TABLE":
                            dataset_info["tables"].append(table_meta)
                        elif table_details.table_type == "VIEW":
                            dataset_info["views"].append(table_meta)
                    except NotFound:
                        print(f"    - Warning: Could not fetch details for {full_table_id}. It might have been deleted.")


                # 2. Get Routines (Stored Procedures and UDFs)
                routines = list(client.list_routines(dataset_id))
                for routine in routines:
                    full_routine_id = f"{project_id}.{dataset_id}.{routine.routine_id}"
                    routine_details = client.get_routine(full_routine_id)
                    dataset_info["routines"].append({
                        "id": routine.routine_id,
                        "type": routine_details.routine_type,
                        "language": routine_details.language,
                        "definition": routine_details.body
                    })

                inventory["datasets"].append(dataset_info)

        except Exception as e:
            print(f"An error occurred: {e}")

        print("Inventory collection complete.")
        return inventory

    if __name__ == "__main__":
        inventory_data = get_bq_inventory(PROJECT_ID)
        with open(OUTPUT_FILE, 'w') as f:
            json.dump(inventory_data, f, indent=4)
        print(f"Inventory saved to {OUTPUT_FILE}")


This script gives you the "what." The next step is to figure out the "how" and "how often."

Beyond the Inventory: Usage and Dependency Analysis

An inventory list is static. You need to understand the dynamics of your warehouse. Which tables are the center of your data universe? Which ones are legacy junk that haven't been touched in a year?

BigQuery's INFORMATION_SCHEMA.JOBS view is your goldmine here. It contains a detailed audit log of every query, load job, and export job run in your project. By parsing this, you can build a dependency graph.

I use a set of SQL queries against this log data to answer critical questions:
* Query Frequency: Which tables are queried most often?
* Data Freshness: How frequently is each table updated?
* Upstream/Downstream Dependencies: When table_A is updated, which other tables (table_B, view_C) are subsequently queried or modified?
* User/Service Account Activity: Who or what is querying this data?

Here’s a conceptual query to get you started. You’d run this in the BigQuery console to analyze your own usage patterns.

    -- This query helps identify the most frequently read tables
    SELECT
        ref_table.project_id,
        ref_table.dataset_id,
        ref_table.table_id,
        COUNT(*) AS query_count,
        COUNT(DISTINCT user_email) AS distinct_users
    FROM
        `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, -- Adjust region as needed
        UNNEST(referenced_tables) AS ref_table
    WHERE
        creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
        AND job_type = 'QUERY'
        AND statement_type != 'SCRIPT'
    GROUP BY 1, 2, 3
    ORDER BY query_count DESC;
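
To go from raw job history to the dependency graph I mentioned, I extend the same idea: every job that reads referenced_tables and writes a destination_table is an edge. Here's a minimal sketch, reusing the same client setup as the inventory script; the project ID placeholder and the underscore filter for BigQuery's hidden result-cache datasets are assumptions you should adjust for your environment.

    # bq_dependency_edges.py
    from collections import Counter
    from google.cloud import bigquery

    PROJECT_ID = "your-gcp-project-id"  # placeholder, same as the inventory script

    EDGE_QUERY = """
    SELECT
        CONCAT(ref.project_id, '.', ref.dataset_id, '.', ref.table_id) AS upstream_table,
        CONCAT(destination_table.project_id, '.', destination_table.dataset_id, '.', destination_table.table_id) AS downstream_table,
        COUNT(*) AS job_count
    FROM
        `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,  -- Adjust region as needed
        UNNEST(referenced_tables) AS ref
    WHERE
        creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
        AND destination_table IS NOT NULL
        -- Heuristic: skip BigQuery's hidden result-cache datasets (names start with '_')
        AND NOT STARTS_WITH(destination_table.dataset_id, '_')
    GROUP BY 1, 2
    """

    def build_dependency_edges(project_id):
        """Returns {(upstream, downstream): job_count} edges to inform migration batching."""
        client = bigquery.Client(project=project_id)
        edges = Counter()
        for row in client.query(EDGE_QUERY).result():
            edges[(row.upstream_table, row.downstream_table)] += row.job_count
        return edges

    if __name__ == "__main__":
        ranked = sorted(build_dependency_edges(PROJECT_ID).items(), key=lambda kv: kv[1], reverse=True)
        for (upstream, downstream), count in ranked[:50]:
            print(f"{upstream} -> {downstream} ({count} jobs)")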

The output of this analysis is what informs your migration batching. You group tightly-coupled pipelines and tables together. You de-prioritize (or retire) unused assets. This is how you turn a monolithic migration into a series of manageable, low-risk phases.

The Migration Readiness Report: Your Go/No-Go Dashboard

With the inventory and usage analysis complete, I consolidate everything into a single "Readiness Report." This is our source of truth for the entire project. It's a simple spreadsheet or dashboard, but it's invaluable for planning and tracking.

Asset Name | Asset Type | Source Complexity | Target Databricks Pattern | Migration Complexity Score (1-5) | Notes / Owner
--- | --- | --- | --- | --- | ---
project.sales.daily_transactions | Table | Partitioned (DAY) | Delta Table (Partitioned) | 1 | Straightforward data move.
project.marketing.customer_segment | View | Complex QUALIFY | Spark SQL View or materialized table | 3 | Needs SQL translation for QUALIFY.
project.utils.fn_parse_url | UDF (JS) | JavaScript Logic | Python UDF (Spark) | 4 | Requires complete rewrite and testing.
project.finance.quarterly_report_sp | Stored Proc | Multi-statement SQL | Databricks Workflow (Notebooks) | 5 | Complex logic; needs to be broken into multiple Spark jobs.
dbt_project_prod | dbt Models | Standard SQL | dbt on Databricks | 2 | High compatibility, but requires new profile and adapter setup.

This report instantly highlights your biggest risks (like the JavaScript UDF and the complex Stored Proc) and allows you to allocate resources effectively.


The Six Technical Battlegrounds

Once planning is solid, the real technical work begins. This is where the subtle (and not-so-subtle) differences between BigQuery and Databricks will challenge you. I group these challenges into six key areas.

1. SQL Dialect Translation and Logic Conversion

This is, without a doubt, the most time-consuming part of the migration. BigQuery SQL is ANSI-compliant but has its own powerful extensions. Databricks uses Spark SQL, which also has its own flavor. Simply copying and pasting SQL code will fail.

The Problem I Faced: On an early migration, we had hundreds of views and scheduled queries that used the QUALIFY clause for complex window function filtering. Open-source Spark SQL has no QUALIFY keyword (recent Databricks runtimes have since added support, but the runtime we were on had not), and manually rewriting each one was turning into a nightmare. We also hit functions that look identical but behave differently (BigQuery's SPLIT treats its delimiter as a literal string, while Spark's split() treats it as a regex), different type casting syntax, and subtle differences in JSON handling.

How We Diagnosed It: The process was painful but simple: run the BigQuery SQL in a Databricks notebook. The Spark AnalysisException or ParseException would pinpoint the exact line that was incompatible. The hard part was the sheer volume.

The Solution & Prevention:

  1. Automated Translation is Key: For the bulk of the work, we now use automated SQL translation tools. My go-to is the open-source sqlglot library. It's incredibly powerful for programmatic transpilation. It's not a silver bullet—it won't handle 100% of cases, especially complex UDF logic—but it gets you 80-90% of the way there.

    Practical Python Snippet: Automated SQL Transpilation

    # sql_transpiler.py
    import sqlglot

    def translate_bq_to_spark(bq_sql_query):
        """
        Translates a BigQuery SQL query to Spark SQL dialect.
        """
        try:
            # The core translation step
            spark_sql = sqlglot.transpile(bq_sql_query, read="bigquery", write="spark", pretty=True)[0]
            print("--- Original BigQuery SQL ---")
            print(bq_sql_query)
            print("\n--- Transpiled Spark SQL ---")
            print(spark_sql)
            return spark_sql
        except Exception as e:
            print(f"Error during translation: {e}")
            return f"-- FAILED TO TRANSLATE: {e}\n{bq_sql_query}"

    if __name__ == '__main__':
        # Example using QUALIFY, which Spark doesn't support
        # sqlglot will rewrite this using a subquery or CTE.
        bq_view_ddl = """
        CREATE VIEW my_dataset.top_performers AS
        SELECT
            employee_id,
            sale_date,
            sale_amount,
            ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY sale_amount DESC) as rn
        FROM
            my_dataset.sales
        QUALIFY rn <= 5;
        """
        translate_bq_to_spark(bq_view_ddl)

Running this script shows how `sqlglot` cleverly rewrites the `QUALIFY` clause into a derived table, a pattern that Spark understands perfectly.
  2. Create a "Translation Pattern" Document: For logic that can't be auto-translated (like procedural Stored Procedures or JavaScript UDFs), we maintain a living document. It maps the BigQuery pattern to the "Databricks Way."
BigQuery Pattern | Target Databricks Pattern | Notes
--- | --- | ---
Multi-statement Stored Proc | Databricks Workflow with multiple Notebook tasks | Each statement becomes a cell or task. State passed via temp views or Delta tables.
JavaScript UDF | Python UDF or Pandas UDF | Rewrite logic in Python. Pandas UDFs are best for vectorized performance (see the sketch below).
bq CLI script | databricks CLI or REST API calls | Wrap in a shell script or Python for automation.
dbt with BigQuery adapter | dbt with Databricks adapter (dbt-databricks) | Update profiles.yml. Most model code is compatible if it's standard SQL.

This approach industrializes the conversion process, preventing every developer from having to re-solve the same problems.
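
To make the "JavaScript UDF to Pandas UDF" row concrete, here's a hedged sketch of what a rewrite of something like the fn_parse_url function from the readiness report could look like. The table and column names are hypothetical; the point is the vectorized pandas_udf pattern.

    # A hedged sketch: rewriting a JavaScript string-parsing UDF as a vectorized Pandas UDF.
    # The source table (bronze.web_events) and column (raw_url) are hypothetical.
    from urllib.parse import urlparse

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf(StringType())
    def parse_url_host(urls: pd.Series) -> pd.Series:
        """Extracts the hostname from a URL column; returns None for unparseable values."""
        def _host(value):
            if not value:
                return None
            try:
                return urlparse(value).netloc or None
            except ValueError:
                return None
        return urls.map(_host)

    # Usage: apply it to a column like any built-in function.
    events = spark.table("bronze.web_events")
    events = events.withColumn("host", parse_url_host("raw_url"))

Before writing any UDF at all, check whether a built-in Spark function (parse_url, regexp_extract, and friends) already covers the logic; built-ins will always beat a UDF on performance.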

2. Schema and Data Type Discrepancies

This is a subtle but venomous issue. At first glance, the types look similar (STRING, INTEGER, TIMESTAMP), but the devil is in the details, especially with nested and semi-structured data.

The Problem I Faced: A client had extensive usage of BigQuery's STRUCT and ARRAY<STRUCT<...>> types to store nested JSON-like data within their tables. When we exported this data as JSON and tried to load it into Databricks, the schema inference was a mess. Timestamps were being read as strings, nested fields were being flattened incorrectly, and jobs failed with AnalysisException: cannot resolve '...'.

How We Diagnosed It: We used spark.read.json(...).printSchema() on the raw files. This immediately showed us that Spark's inferred schema didn't match the source BigQuery schema. The nesting was wrong, and data types were mismatched.

The Solution & Prevention:

  1. Define Your Schema Explicitly: Never, ever rely on schema inference for production data loads. Extract the schema from BigQuery (our inventory script does this) and convert it into a Databricks StructType. This gives you full control.

  2. Handle Nested Data with Care: Databricks is phenomenal with nested data, but you need to be explicit. Use explode to unpack arrays and dot notation (col.field.subfield) to access nested struct fields.

  3. Create a Type Mapping Cheat Sheet:

BigQuery Type | Databricks (Spark) Type | Notes
--- | --- | ---
GEOGRAPHY | STRING (WKT/WKB format), or use a geospatial library such as GeoMesa or Apache Sedona | Databricks has no native GEOGRAPHY type. Convert to a standard format first.
TIMESTAMP | TimestampType | Be very careful with timezones. BigQuery stores TIMESTAMP values in UTC. Ensure Spark sessions are configured for UTC (spark.conf.set("spark.sql.session.timeZone", "UTC")) to avoid silent data corruption.
NUMERIC, BIGNUMERIC | DecimalType(38, 9), DecimalType(p, s) | You must define the precision and scale in Databricks. BIGNUMERIC's 76-digit precision exceeds Spark's 38-digit maximum, so oversized values need reduced precision or a string fallback. A common source of failed loads.
STRUCT | StructType | Direct mapping, but must be explicitly defined.
ARRAY | ArrayType | Direct mapping.

Proactively handling these types, especially in your data loading framework, saves countless hours of debugging. The sketch below shows one way to turn the schemas captured during discovery into explicit Spark schemas.
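
Here's a minimal sketch of what that looks like in practice, assuming you feed it the "name:TYPE" strings captured by the inventory script: build an explicit StructType from a small mapping table, pin the session to UTC, and only then read the exported files. The mapping is an illustrative subset, and nested RECORD/STRUCT fields would need recursive handling that I've left out.

    # A hedged sketch: turn the inventory's "name:TYPE" strings into an explicit Spark schema
    # instead of trusting schema inference. The mapping is an illustrative subset.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (BooleanType, DateType, DecimalType, DoubleType,
                                   LongType, StringType, StructField, StructType,
                                   TimestampType)

    spark = SparkSession.builder.getOrCreate()
    # Keep everything in UTC to match BigQuery's TIMESTAMP semantics.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    BQ_TO_SPARK = {
        "STRING": StringType(),
        "INTEGER": LongType(),      # the BigQuery API reports INT64 as INTEGER
        "FLOAT": DoubleType(),
        "BOOLEAN": BooleanType(),
        "TIMESTAMP": TimestampType(),
        "DATE": DateType(),
        "NUMERIC": DecimalType(38, 9),
        "GEOGRAPHY": StringType(),  # exported as WKT; no native Spark type
    }

    def schema_from_inventory(fields):
        """fields: list of 'name:TYPE' strings as emitted by bq_inventory_discovery.py."""
        struct_fields = []
        for entry in fields:
            name, bq_type = entry.split(":", 1)
            spark_type = BQ_TO_SPARK.get(bq_type.upper(), StringType())  # conservative fallback
            struct_fields.append(StructField(name, spark_type, nullable=True))
        return StructType(struct_fields)

    # Usage against the exported files (path and columns are hypothetical):
    sales_schema = schema_from_inventory(["transaction_id:STRING", "amount:NUMERIC", "sold_at:TIMESTAMP"])
    sales_df = spark.read.schema(sales_schema).parquet("s3://your-landing-bucket/sales_data/")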

3. Performance and Scalability: Thinking in Clusters, Not Slots

This is the biggest mindset shift. In BigQuery, you write SQL, and Google handles the compute. It's a black box that scales automatically (for a price). In Databricks, you are in control of the compute. This is both a great power and a great responsibility.

The Problem I Faced: A team migrated a large ETL pipeline that took 30 minutes in BigQuery. Their first run on Databricks took over 3 hours on a generously sized cluster. They were furious, asking, "Why is this 'better' platform so much slower?"

How We Diagnosed It: We dove into the Spark UI. This is your best friend and primary diagnostic tool in Databricks. We immediately saw two things:
1. Massive Skew: One or two tasks in a stage were taking hours, while the other 198 finished in minutes. The data wasn't being distributed evenly across the cluster.
2. A Huge Shuffle: A multi-terabyte shuffle was occurring because of a poorly structured join between two massive tables.

The Solution & Prevention:

  1. Master Partitioning and Z-Ordering: In BigQuery, you have limited control over physical data layout (time partitioning and clustering). In Databricks, this is your superpower.

    • Partitioning: Just like in BigQuery, partitioning your Delta tables by a low-cardinality column (like date or country_code) is the first and most important step. This allows for "partition pruning," where Spark doesn't even have to read data from irrelevant partitions.
    • Z-Ordering: This is a Databricks-specific optimization. It collocates related data within files. If you frequently filter by col_A and col_B, you should OPTIMIZE my_table ZORDER BY (col_A, col_B). This drastically speeds up queries by reducing the amount of data that needs to be scanned (see the sketch after this list). We re-ran the slow job after Z-Ordering on the join keys, and the runtime dropped from 3 hours to 20 minutes.
  2. Choose the Right Cluster for the Job: Don't use one massive cluster for everything.

    • Small ETL Jobs: Use a small, all-purpose cluster that can autoscale.
    • Massive Data Transformation: Use a larger, compute-optimized cluster with spot instances to save cost.
    • Interactive BI Queries: Use a Databricks SQL Warehouse, which is optimized for low-latency queries, similar to BigQuery's BI Engine.
  3. Embrace the Photon Engine: For SQL-heavy workloads (which is everything coming from BigQuery), enabling the Photon engine is a no-brainer. It’s a C++ vectorized execution engine that sits under Spark. On the slow job mentioned above, just toggling on Photon gave us another 30% performance boost on top of the Z-Ordering improvements.
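
Here's a minimal sketch of the layout work from point 1 above. The table and column names are illustrative, and Photon (point 3) is a cluster or SQL warehouse setting rather than something you enable in code.

    # A hedged sketch of the physical-layout tuning from point 1. Table and column
    # names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Partition by a low-cardinality column so Spark can prune whole partitions at read time.
    (spark.table("bronze.sales_data")
        .withColumn("sale_date", F.to_date("sold_at"))
        .write
        .format("delta")
        .partitionBy("sale_date")
        .mode("overwrite")
        .saveAsTable("silver.sales_data"))

    # Z-Order on the columns you join and filter by most often, so related rows land
    # in the same files and data skipping can do its job.
    spark.sql("OPTIMIZE silver.sales_data ZORDER BY (customer_id, store_id)")

    # Queries that filter on the partition and Z-Order columns now scan a fraction of the data.
    recent = spark.table("silver.sales_data").where("sale_date >= '2024-01-01' AND customer_id = 42")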

4. Orchestration and Scheduling

Your logic doesn't run itself. In GCP, you might be using a mix of Scheduled Queries, Cloud Composer (Airflow), or even Cloud Functions triggered by cron jobs. This all needs a new home.

The Problem I Faced: A client had a complex web of dependencies managed through Airflow, with operators specifically designed to call BigQuery jobs and check for completion. A "lift and shift" of the DAGs failed because all the BigQueryOperator instances were obviously no longer valid. They also had dozens of "simpler" pipelines running as BigQuery Scheduled Queries, and they weren't sure how to replicate that "fire and forget" model.

How We Diagnosed It: This was easy to spot. The Airflow DAGs would fail immediately on the BigQuery tasks. The business users complained that their daily reports, powered by the scheduled queries, were no longer updating.

The Solution & Prevention:

  1. Databricks Workflows are Your New Best Friend: For 90% of use cases, Databricks Workflows are the native, preferred solution. They are excellent for chaining together notebooks, SQL scripts, or dbt jobs. The "multi-task job" feature allows you to build complex DAGs directly within the Databricks UI or via the CLI/API. For the "Scheduled Query" use case, a single-task Databricks job pointing to a SQL notebook is a perfect one-to-one replacement.

  2. Adapt, Don't Abandon, Your Existing Orchestrator: If you have a massive investment in Airflow, you don't need to throw it away. Instead, you replace the BigQueryOperator with the DatabricksSubmitRunOperator or DatabricksRunNowOperator. Your DAG structure remains the same; you're just pointing it at a different execution engine. This is a much lower-friction path (see the DAG sketch after this list).

  3. Standardize on a Pattern: Decide on your primary orchestration tool early. My recommendation:

    • For pipelines entirely within Databricks: Use Databricks Workflows. It’s simpler, more integrated, and easier to monitor.
    • For complex, cross-system orchestration (e.g., triggering an API, then a Databricks job, then a Salesforce update): Stick with your enterprise scheduler like Airflow, but use the Databricks provider/operators.
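
For the Airflow path in point 2, the swap is mostly mechanical. A hedged sketch, assuming a recent Airflow 2.x with the apache-airflow-providers-databricks package installed; the DAG ID, connection ID, and job ID are hypothetical.

    # A hedged sketch: the DAG structure stays the same, only the operator changes.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

    with DAG(
        dag_id="daily_sales_pipeline",      # hypothetical
        start_date=datetime(2024, 1, 1),
        schedule="0 6 * * *",
        catchup=False,
    ) as dag:

        # Previously a BigQueryOperator task; now it triggers an existing Databricks
        # Workflow (job) and waits for it to finish.
        refresh_sales = DatabricksRunNowOperator(
            task_id="refresh_sales",
            databricks_conn_id="databricks_default",
            job_id=123456,                  # the Databricks Workflow for this pipeline
        )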

5. Security and Compliance: From IAM to Unity Catalog

Security models are fundamentally different. BigQuery's security is a classic GCP IAM model: permissions are granted on projects, datasets, and tables to users, groups, or service accounts. Databricks, especially with Unity Catalog, introduces a more granular, ANSI SQL-like GRANT/REVOKE model.

The Problem I Faced: On a project pre-dating Unity Catalog, we had a nightmare of managing access. We were creating different clusters with different instance profiles to control which S3 buckets (acting as our data lake) different teams could access. It was brittle and hard to audit. On another project, a client wanted to replicate BigQuery's row-level security policies, and there was no straightforward way to do it.

How We Diagnosed It: Security audits were painful. We had to manually check instance profile permissions, cluster ACLs, and S3 bucket policies to answer a simple question like, "Who can see customer PII?" It was a mess.

The Solution & Prevention:

  1. Adopt Unity Catalog from Day One: If I can give you one piece of advice, it's this. Do not start a new Databricks project in 2023+ without Unity Catalog. It is the single most important feature for governance and security. It solves these problems by providing:

    • Centralized Governance: One place to define access controls for all your data and AI assets.
    • Fine-Grained Access Control: Standard SQL GRANT/REVOKE commands for catalogs, schemas (databases), tables, and views.
    • Row-Level Security and Column Masking: You can now create functions and apply them to tables to filter rows or mask column values based on the user or their group membership, directly replicating and even improving upon the BigQuery functionality.
    • Centralized Auditing: Every access request is logged, giving you a complete lineage and audit trail.
  2. Map IAM Roles to Unity Catalog Groups: Your migration plan must include a security mapping.

    • A GCP project_viewer role might map to a Unity Catalog group with SELECT privileges on a specific catalog.
    • A BigQuery data_owner role might map to a group with SELECT, MODIFY, and CREATE TABLE on a schema.
    • Automate this by scripting against the Databricks CLI or REST API to create groups and apply grants. Don't do it manually.
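
Here's a hedged sketch of that scripted mapping, issuing standard Unity Catalog GRANTs from a notebook. The catalog, schema, group names, and the row-filter example at the end are all hypothetical and assume the groups and the main.security schema already exist.

    # A hedged sketch: apply an "IAM role -> Unity Catalog group" mapping as SQL grants.
    # Assumes the groups already exist (account console or SCIM sync) and that this runs
    # on a Unity Catalog-enabled cluster or SQL warehouse.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    CATALOG = "main"
    SCHEMA = "sales"
    ROLE_TO_GRANTS = {
        # old GCP role -> (UC group, privileges on the schema)
        "roles/bigquery.dataViewer": ("analysts_readonly", ["USE SCHEMA", "SELECT"]),
        "roles/bigquery.dataOwner":  ("sales_data_owners", ["USE SCHEMA", "SELECT", "MODIFY", "CREATE TABLE"]),
    }

    for gcp_role, (uc_group, privileges) in ROLE_TO_GRANTS.items():
        spark.sql(f"GRANT USE CATALOG ON CATALOG {CATALOG} TO `{uc_group}`")
        for privilege in privileges:
            spark.sql(f"GRANT {privilege} ON SCHEMA {CATALOG}.{SCHEMA} TO `{uc_group}`")
        print(f"Mapped {gcp_role} -> {uc_group} on {CATALOG}.{SCHEMA}")

    # Row-level security: a boolean SQL function applied as a row filter (hypothetical
    # function and table; the main.security schema must already exist).
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.security.us_only(region STRING)
        RETURN is_account_group_member('sales_data_owners') OR region = 'US'
    """)
    spark.sql("ALTER TABLE main.sales.transactions SET ROW FILTER main.security.us_only ON (region)")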

6. Cost and Resource Optimization: The Double-Edged Sword of Control

This is the one that trips up everyone. People leave BigQuery because of unpredictable, high costs for analytical queries. They come to Databricks for the promise of lower TCO. But if you're not careful, you can easily spend more.

The Problem I Faced: A team migrated their workloads and let developers create their own "General Purpose" clusters. I saw one developer who had spun up a 20-node, i3.8xlarge cluster to run a 100MB data transformation. The cluster sat idle for 8 hours overnight, racking up thousands of dollars in costs. Their first monthly bill from AWS was a shock.

How We Diagnosed It: Databricks provides good cost visibility through its usage reports and system tables. We could see exactly which cluster was consuming the most DBUs (Databricks Units) and who created it. The cloud provider bill (AWS Cost Explorer in this case) confirmed the underlying EC2 costs.

The Solution & Prevention:

  1. Cluster Policies are Non-Negotiable: This is the most critical tool for cost governance. Cluster policies allow you to define "T-shirt sizes" for clusters and restrict what users can provision. You can set rules like the following (a policy sketch appears after this list):

    • Max number of nodes.
    • Enforce autoscaling.
    • Enforce a mandatory "auto-termination" timeout (e.g., 30 minutes of inactivity).
    • Restrict instance types to a pre-approved list.
    • Enforce mandatory tagging for chargeback.
    On that project, we immediately implemented policies. The "freestyle" developer could now only choose from "Small," "Medium," or "Large" clusters, all of which had aggressive auto-termination and used cost-effective spot instances.
  2. Leverage Spot Instances: For most ETL workloads, there is no reason not to use spot instances. Databricks has a robust spot instance handler that will acquire new nodes if one is reclaimed. You can save up to 70-80% on compute costs. Your cluster policy can enforce a "spot-first" strategy.

  3. Right-Size Your SQL Warehouses: For BI and SQL users, use Databricks SQL Warehouses. Start with a small t-shirt size and monitor the query performance and load. The warehouse will autoscale, but you want the baseline to be cost-effective.

  4. Educate Your Team: The mindset shift from serverless to cluster management is key. Host training sessions on how to choose the right cluster, the importance of auto-termination, and how to read the Spark UI to debug their own jobs. A more efficient job is a cheaper job.
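
Here's a hedged sketch of the "T-shirt size" policy idea from point 1. The node types, limits, and tag value are illustrative; the resulting JSON is what you register through the cluster policies UI, CLI, or API.

    # A hedged sketch of a "Small ETL" cluster policy definition (AWS example).
    # Node types, limits, and the tag value are illustrative.
    import json

    small_etl_policy = {
        # Cap cluster size and force autoscaling.
        "autoscale.min_workers": {"type": "range", "maxValue": 2, "defaultValue": 1},
        "autoscale.max_workers": {"type": "range", "maxValue": 8, "defaultValue": 4},
        # Aggressive auto-termination so idle clusters don't burn money overnight.
        "autotermination_minutes": {"type": "fixed", "value": 30},
        # Only pre-approved, cost-effective instance types.
        "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
        # Spot-first with fallback to on-demand if capacity disappears.
        "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
        # Mandatory tag for chargeback reporting.
        "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
    }

    # Register this JSON as a policy via the cluster policies UI, CLI, or REST API.
    print(json.dumps(small_etl_policy, indent=2))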


Executing the Data Transfer: The Right Tool for the Job

Once you've sorted out the logic and schema, you still need to move the petabytes.

My preferred, battle-tested method for large-scale transfers is a GCS-to-S3/ADLS path.

  1. Export from BigQuery to GCS: Use the EXPORT DATA statement in BigQuery to unload your tables into a GCS bucket. I strongly recommend exporting to Parquet (or Avro), not CSV. Both preserve data types, and Parquet's columnar layout is far more efficient for Spark to read.
    EXPORT DATA
      OPTIONS(
        uri='gs://your-bq-export-bucket/sales_data/*.parquet',
        format='PARQUET',
        overwrite=true
      ) AS
    SELECT * FROM my_dataset.sales_data;

  2. Transfer from GCS to Your Cloud Storage: Use your cloud provider's native transfer service (e.g., AWS DataSync, or Azure Data Factory's copy activity) to efficiently move the data from the GCS bucket to your target S3 or ADLS Gen2 bucket. These services are parallelized and much faster than gsutil or azcopy for large volumes.

  3. Ingest into Delta Lake with Auto Loader: This is the magic step. Databricks Auto Loader can incrementally and efficiently process new files as they land in your cloud storage. You set up a streaming job one time, and it handles the rest. It's idempotent, scalable, and includes schema evolution capabilities.

    # Example of using Auto Loader to create a Delta table
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "/path/to/schema/location") # For schema tracking
        .load("s3://your-landing-bucket/sales_data/")
        .writeStream
        .option("checkpointLocation", "/path/to/checkpoint/location")
        .trigger(availableNow=True) # Runs one batch to process all available files
        .toTable("bronze.sales_data")
    )

For smaller tables or tactical moves, pulling data over JDBC is fine, but it will be a bottleneck at scale: without explicit partitioning options, a JDBC read collapses into a single task.

Closing the Loop: Validation, Cutover, and Post-Migration Sanity

The final mile is often the hardest.

  • Data Validation is Non-Negotiable: Don't just check row counts. Run aggregate queries (SUM, AVG, COUNT DISTINCT) on key columns in both BigQuery and Databricks. For critical tables, use a tool like Great Expectations or write custom PySpark jobs to do checksums on a per-row basis (a minimal comparison sketch follows this list). Your business users must trust the new platform implicitly.

  • Dual-Write and Phased Cutover: For critical pipelines, I never do a "big bang" cutover. We implement a dual-write strategy for a short period. The orchestrator writes data to both BigQuery and the new Databricks Delta table. BI tools can then be pointed to the new source, and we can run them in parallel for a few days to ensure the numbers match perfectly. Once everyone signs off, we cut the cord to BigQuery.

  • Decommissioning BigQuery: The final step is to turn it off. But before you do, monitor the BigQuery audit logs for a few weeks after cutover. You're looking for any stray queries from forgotten service accounts or hardcoded scripts. Once the logs are silent, you can confidently decommission your datasets and celebrate your win.
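
Here's a minimal sketch of the aggregate-level comparison mentioned above, assuming it runs in a Databricks notebook with the google-cloud-bigquery client installed and authenticated. The table names, column, and tolerance are illustrative.

    # A hedged sketch: compare row counts and a numeric checksum between the BigQuery
    # source and the new Delta table. Table names, column, and tolerance are illustrative.
    from google.cloud import bigquery
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    BQ_TABLE = "your-gcp-project-id.sales.daily_transactions"
    DELTA_TABLE = "silver.daily_transactions"
    AMOUNT_COL = "sale_amount"

    CHECK_SQL = f"SELECT COUNT(*) AS row_count, ROUND(SUM({AMOUNT_COL}), 2) AS amount_sum FROM "

    bq_row = list(bigquery.Client().query(CHECK_SQL + f"`{BQ_TABLE}`").result())[0]
    dbx_row = spark.sql(CHECK_SQL + DELTA_TABLE).collect()[0]

    assert bq_row.row_count == dbx_row.row_count, "Row counts do not match"
    assert abs(float(bq_row.amount_sum) - float(dbx_row.amount_sum)) < 0.01, "Checksums diverge"
    print(f"{DELTA_TABLE}: {dbx_row.row_count:,} rows validated against BigQuery")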

My Battle-Hardened Best Practices: The TL;DR

If you remember nothing else, remember these principles:

  1. Automate Discovery: Script your inventory and dependency analysis. Don't try to do it by hand.
  2. Don't Lift-and-Shift Logic Blindly: Rewrite and refactor your SQL and procedural logic to be "Spark-native." Use automated tools like sqlglot to do the heavy lifting.
  3. Master Unity Catalog Early: Make it the foundation of your security and governance model from day one.
  4. Think in Clusters, Not Slots: Embrace cluster management. Use policies, spot instances, and auto-scaling to control costs and optimize performance.
  5. Partition and Z-Order Your Data: Physical data layout is your primary performance tuning lever in Databricks.
  6. Validate, Validate, Validate: Trust is earned. Use robust data validation techniques to prove the new system is correct.
  7. It's a Team Sport: This is a mindset change for everyone. Train your data engineers, analysts, and scientists on the "Databricks Way" of working.

The Final Word: It's a Journey to a Better Architecture

Migrating from BigQuery to Databricks is a significant undertaking, but it's not just a technical task. It's a strategic move away from a closed, proprietary system to an open, flexible lakehouse architecture. You're gaining control, unlocking advanced ML/AI capabilities, and setting yourself up for a multi-cloud future.

The path is littered with challenges—SQL syntax, performance tuning, cost governance—but every single one is solvable with the right plan and the right expertise. The playbook I've outlined isn't theoretical; it's the direct result of navigating those challenges on real projects with real deadlines. With a robust discovery phase, a pragmatic approach to translation, and a disciplined focus on governance, you can not only survive the migration but unlock the transformative power of the Databricks Lakehouse.