How do I assess my AWS environment before migrating to Databricks?

L+ Editorial
Jan 28, 2026


Your organization is ready to harness the power of the Databricks Lakehouse Platform on AWS. You're looking forward to unifying your data, analytics, and AI workloads, breaking down silos between data engineering and data science, and accelerating innovation.

But before you dive headfirst into the migration, a crucial question arises: "Where do I even begin?"

Migrating complex data ecosystems is not a simple lift-and-shift operation. It's a strategic move that, without proper planning, can lead to budget overruns, missed deadlines, and frustrating performance issues. The key to a smooth, cost-effective, and successful transition lies in a thorough pre-migration assessment of your existing AWS environment.

This comprehensive guide will walk you through the essential steps to audit your current AWS setup. We'll cover how to inventory your assets, analyze your workloads, and anticipate common pitfalls. We'll even provide an automation script to kickstart your inventory process, saving you hours of manual effort. Think of this as your blueprint for a migration that's on time, on budget, and delivers immediate value.

Why a Pre-Migration Assessment is Your Most Important First Step

Jumping into a migration without a map is a recipe for getting lost. A detailed assessment of your AWS environment serves as that map, providing critical insights that inform your entire migration strategy.

  • De-risks Your Project: By identifying potential roadblocks early—be it complex dependencies, security gaps, or incompatible data formats—you can proactively address them instead of being blindsided mid-migration.
  • Provides Accurate Cost & Timeline Estimates: Understanding the scale of your data, the complexity of your pipelines, and the compute resources you currently use allows you to build a realistic budget and timeline. This is crucial for getting stakeholder buy-in and managing expectations.
  • Optimizes for Performance: This isn't just about moving workloads; it's about improving them. An assessment helps you identify inefficient jobs, poor data layouts, and performance bottlenecks that can be re-architected for superior performance on the Databricks platform.
  • Ensures a Secure & Governed Transition: You can map your existing security posture (IAM roles, network rules, encryption) to Databricks' security model, particularly with Unity Catalog, ensuring no gaps in governance from day one.

Phase 1: Building Your AWS Inventory - What Do You Have?

The first step is to create a comprehensive catalog of every data asset, compute resource, and processing job that is a candidate for migration. You can't migrate what you don't know exists. Your focus should be on these core AWS services:

  • Data Storage:

    • Amazon S3: The heart of your data lake. You need to know all your buckets, their sizes, storage classes (Standard, Intelligent-Tiering, Glacier), data formats (Parquet, ORC, JSON, CSV), and partitioning schemes.
    • Amazon RDS / Aurora: Identify all relational databases serving as sources or sinks for your data pipelines. Note their engine types (PostgreSQL, MySQL), instance sizes, and usage patterns.
    • Amazon Redshift: Catalog your data warehouse clusters, schemas, tables, and stored procedures. These are often prime candidates for modernization to a Databricks SQL warehouse.
  • Data Processing & Compute:

    • AWS Glue: Document all ETL jobs (Spark, Python Shell), crawlers, triggers, and the Data Catalog. These are often the most straightforward to migrate, as they are typically Spark-based.
    • Amazon EMR: List all active clusters, their instance types, configurations, and the steps/jobs they run. Pay close attention to custom bootstrap actions and installed libraries.
    • AWS Lambda: Find functions used for event-driven data processing (e.g., S3 triggers) or light transformations.
    • EC2 Instances: Identify any self-managed data processing nodes running custom scripts (Python, Bash) or open-source tools like Airflow.
  • Orchestration & Scheduling:

    • AWS Step Functions: Map out your state machines that coordinate Lambda functions, Glue jobs, and other AWS services.
    • Managed Workflows for Apache Airflow (MWAA) / Self-hosted Airflow: Document all DAGs, their schedules, and their dependencies.
    • Cron Jobs: Don't forget the classic! Find any cron jobs on EC2 instances that are kicking off critical scripts.
  • Security & Networking:

    • IAM: Roles, policies, and users associated with your data services.
    • VPC & Networking: VPCs, subnets, security groups, and VPC endpoints that govern connectivity between your services.
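Before scripting the collection, it helps to settle on a common record shape so every service's findings land in the same table. A minimal sketch using a dataclass (the field names are illustrative, not an AWS or Databricks convention):

```python
from dataclasses import dataclass, asdict

@dataclass
class InventoryRecord:
    """One row in the migration inventory, shared by all services."""
    region: str          # AWS region, or "global" for S3/IAM
    service: str         # e.g. "s3", "glue", "emr"
    resource_type: str   # e.g. "bucket", "job", "cluster"
    name: str            # resource name or identifier
    detail: str = ""     # free-form notes: sizes, engines, schedules

# Example: recording an EMR cluster found during the audit
record = InventoryRecord(
    region="us-east-1", service="emr", resource_type="cluster",
    name="nightly-etl",
    detail="emr-6.15.0, 1x m5.xlarge master, 4x r5.2xlarge core",
)
print(asdict(record)["name"])  # -> nightly-etl
```

A flat, uniform schema like this makes it trivial to concatenate per-service results into one spreadsheet for the analysis phase.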

Automating Your AWS Inventory with Python and Boto3

Manually clicking through the AWS console to gather this information is tedious and error-prone. You can automate a significant portion of this inventory process using Python's boto3 library. The script below provides a starting point for gathering information about key services.

Prerequisites:

  1. Install Python: Ensure you have Python 3.6+ installed.
  2. Install Boto3: Run pip install boto3 pandas.
  3. Configure AWS Credentials: Configure your AWS CLI with credentials that have sufficient read-only permissions (e.g., ReadOnlyAccess IAM policy) for the services you want to inventory. You can do this by running aws configure.
    import boto3
    import pandas as pd
    from botocore.exceptions import ClientError
    import logging

    # Configure logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    def get_aws_regions():
        """Gets a list of all available AWS regions."""
        ec2 = boto3.client('ec2', region_name='us-east-1')
        try:
            regions = [region['RegionName'] for region in ec2.describe_regions()['Regions']]
            logging.info(f"Found {len(regions)} AWS regions.")
            return regions
        except ClientError as e:
            logging.error(f"Could not list regions: {e}")
            return []

    def get_s3_inventory():
        """Generates an inventory of all S3 buckets."""
        s3 = boto3.client('s3')
        inventory = []
        try:
            response = s3.list_buckets()
            logging.info(f"Found {len(response['Buckets'])} S3 buckets.")
            for bucket in response['Buckets']:
                bucket_name = bucket['Name']
                try:
                    # Get bucket location
                    location_response = s3.get_bucket_location(Bucket=bucket_name)
                    location = location_response.get('LocationConstraint', 'us-east-1') or 'us-east-1'
                    inventory.append({
                        "BucketName": bucket_name,
                        "CreationDate": bucket['CreationDate'].strftime("%Y-%m-%d"),
                        "Region": location
                    })
                except ClientError as e:
                    logging.warning(f"Could not access bucket {bucket_name} to get location. Skipping. Error: {e}")
        except ClientError as e:
            logging.error(f"Could not list S3 buckets: {e}")
        return inventory

    def get_glue_inventory(region):
        """Generates an inventory of AWS Glue jobs and crawlers in a given region."""
        glue = boto3.client('glue', region_name=region)
        inventory = []
        paginator_jobs = glue.get_paginator('get_jobs')
        paginator_crawlers = glue.get_paginator('get_crawlers')

        try:
            for page in paginator_jobs.paginate():
                for job in page['Jobs']:
                    inventory.append({
                        "Region": region,
                        "ResourceType": "Glue Job",
                        "Name": job['Name'],
                        "Type": job.get('Command', {}).get('Name'),
                        "WorkerType": job.get('WorkerType', 'N/A'),
                        "NumberOfWorkers": job.get('NumberOfWorkers', 'N/A'),
                        "LastModified": job.get('LastModifiedOn', 'N/A').strftime("%Y-%m-%d")
                    })

            for page in paginator_crawlers.paginate():
                for crawler in page['Crawlers']:
                    inventory.append({
                        "Region": region,
                        "ResourceType": "Glue Crawler",
                        "Name": crawler['Name'],
                        "State": crawler.get('State'),
                        "DatabaseName": crawler.get('DatabaseName', 'N/A'),
                        "LastUpdated": crawler.get('LastUpdated', 'N/A').strftime("%Y-%m-%d") if crawler.get('LastUpdated') else 'N/A'
                    })
        except ClientError as e:
            logging.warning(f"Could not access Glue resources in {region}. Error: {e}")

        return inventory

    def get_emr_inventory(region):
        """Generates an inventory of active and recently terminated EMR clusters."""
        emr = boto3.client('emr', region_name=region)
        inventory = []
        paginator = emr.get_paginator('list_clusters')
        cluster_states = ['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING', 'TERMINATING', 'TERMINATED', 'TERMINATED_WITH_ERRORS']

        try:
            for page in paginator.paginate(ClusterStates=cluster_states):
                for cluster_summary in page['Clusters']:
                    cluster_id = cluster_summary['Id']
                    cluster_details = emr.describe_cluster(ClusterId=cluster_id)['Cluster']
                    # Instance groups are not part of the describe_cluster response;
                    # fetch them separately. (This covers uniform instance-group
                    # clusters; use list_instance_fleets for instance-fleet clusters.)
                    groups = emr.list_instance_groups(ClusterId=cluster_id)['InstanceGroups']
                    master = next((g for g in groups if g['InstanceGroupType'] == 'MASTER'), {})
                    core = next((g for g in groups if g['InstanceGroupType'] == 'CORE'), {})
                    inventory.append({
                        "Region": region,
                        "ResourceType": "EMR Cluster",
                        "Id": cluster_id,
                        "Name": cluster_summary['Name'],
                        "State": cluster_summary['Status']['State'],
                        "ReleaseLabel": cluster_details.get('ReleaseLabel', 'N/A'),
                        "MasterInstanceType": master.get('InstanceType', 'N/A'),
                        "CoreInstanceCount": core.get('RequestedInstanceCount', 0),
                        "CoreInstanceType": core.get('InstanceType', 'N/A')
                    })
        except ClientError as e:
            logging.warning(f"Could not access EMR resources in {region}. Error: {e}")

        return inventory

    def get_redshift_inventory(region):
        """Generates an inventory of Redshift clusters in a given region."""
        redshift = boto3.client('redshift', region_name=region)
        inventory = []
        paginator = redshift.get_paginator('describe_clusters')

        try:
            for page in paginator.paginate():
                for cluster in page['Clusters']:
                    inventory.append({
                        "Region": region,
                        "ResourceType": "Redshift Cluster",
                        "ClusterIdentifier": cluster['ClusterIdentifier'],
                        "NodeType": cluster['NodeType'],
                        "NumberOfNodes": cluster['NumberOfNodes'],
                        "DBName": cluster['DBName'],
                        "VpcId": cluster.get('VpcId', 'N/A')
                    })
        except ClientError as e:
            logging.warning(f"Could not access Redshift resources in {region}. Error: {e}")

        return inventory

    if __name__ == '__main__':
        all_regions = get_aws_regions()
        if not all_regions:
            logging.error("No regions found. Exiting.")
            exit()

        logging.info("Starting AWS Inventory Collection...")

        # S3 is global, so we run it once
        s3_inventory = get_s3_inventory()
        pd.DataFrame(s3_inventory).to_csv('aws_s3_inventory.csv', index=False)
        logging.info("S3 inventory saved to aws_s3_inventory.csv")

        # For regional services
        regional_inventory = []
        for region in all_regions:
            logging.info(f"--- Processing Region: {region} ---")
            regional_inventory.extend(get_glue_inventory(region))
            regional_inventory.extend(get_emr_inventory(region))
            regional_inventory.extend(get_redshift_inventory(region))

        if regional_inventory:
            pd.DataFrame(regional_inventory).to_csv('aws_regional_inventory.csv', index=False)
            logging.info("Regional inventory (Glue, EMR, Redshift) saved to aws_regional_inventory.csv")
        else:
            logging.info("No regional resources found or all regions were inaccessible.")

        logging.info("Inventory collection complete.")


This script will produce CSV files (aws_s3_inventory.csv and aws_regional_inventory.csv) that give you a solid foundation for your assessment. You can easily extend it to include other services like RDS, Lambda, and Step Functions.
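One low-friction way to extend the script is to register each regional collector in a list and iterate over it, so adding RDS or Lambda support is a one-line change to the registry. A sketch of the pattern (the stub collectors here are placeholders, not real AWS calls):

```python
# Registry pattern: each collector is a function taking a region and
# returning a list of inventory dicts, matching the functions above.
def get_rds_inventory_stub(region):
    # Placeholder; a real version would paginate over describe_db_instances.
    return [{"Region": region, "ResourceType": "RDS Instance", "Name": "example-db"}]

def get_lambda_inventory_stub(region):
    # Placeholder; a real version would paginate over list_functions.
    return [{"Region": region, "ResourceType": "Lambda Function", "Name": "example-fn"}]

REGIONAL_COLLECTORS = [get_rds_inventory_stub, get_lambda_inventory_stub]

def collect_region(region):
    """Run every registered collector against one region."""
    rows = []
    for collector in REGIONAL_COLLECTORS:
        rows.extend(collector(region))
    return rows

print(len(collect_region("us-east-1")))  # 2
```

With this shape, the main loop never changes; each new service is just another entry in `REGIONAL_COLLECTORS`.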

Phase 2: Analyzing Usage, Dependencies, and Complexity

With your inventory in hand, the real analysis begins. This phase is about understanding how your assets are used, who uses them, and how they connect to each other.

  1. Workload Profiling:

    • Categorize Jobs: Classify each job (Glue, EMR, etc.) by its purpose: simple ETL, complex transformations, machine learning model training, interactive analytics, reporting.
    • Analyze Runtimes and Frequency: Use AWS CloudWatch metrics to determine how long each job runs and how often. A job that runs for 5 minutes every hour has different requirements than one that runs for 8 hours once a month.
    • Identify Code Complexity: Review the code for your most critical and long-running jobs. Are they using standard Spark SQL, or do they have complex UDFs (User-Defined Functions), heavy use of RDDs, or custom libraries? This will directly influence the refactoring effort.
  2. Data Usage Patterns:

    • Hot vs. Cold Data: Use S3 Storage Class Analysis and S3 Access Logs to determine which data is frequently accessed ("hot") and which is archival ("cold"). This helps in planning data layout and caching strategies in Databricks.
    • Data Lineage and Dependencies: This is one of the most critical and challenging steps. You need to map the flow of data. What S3 paths does a Glue job read from? Where does it write its output? Which downstream job consumes that output? Tools like AWS CloudTrail (for API calls) and parsing your job scripts can help, but this often requires manual investigation and interviews with your data teams. Create a Directed Acyclic Graph (DAG) diagram to visualize these dependencies.
  3. Cost Analysis:

    • Analyze your AWS Bill: Use AWS Cost Explorer to break down costs by service (EMR, Glue, S3, etc.).
    • Identify High-Cost Workloads: Pinpoint the specific EMR clusters or Glue jobs that contribute most to your bill. These are your top priorities for optimization during the migration. Is a cluster oversized and idle most of the time? Is a job inefficiently shuffling terabytes of data?
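The lineage-mapping step above can be partially scripted: once you know which S3 paths each job reads and writes, you can derive job-to-job edges and a safe migration order with the standard library's `graphlib`. A minimal sketch (job names and paths are invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical lineage gathered from job scripts and CloudTrail:
# each job lists the S3 prefixes it reads and the one it writes.
jobs = {
    "ingest_orders": {"reads": [], "writes": "s3://lake/bronze/orders"},
    "clean_orders":  {"reads": ["s3://lake/bronze/orders"], "writes": "s3://lake/silver/orders"},
    "daily_report":  {"reads": ["s3://lake/silver/orders"], "writes": "s3://lake/gold/report"},
}

# Derive job-to-job edges: B depends on A if B reads what A writes.
writers = {info["writes"]: name for name, info in jobs.items()}
graph = {
    name: {writers[p] for p in info["reads"] if p in writers}
    for name, info in jobs.items()
}

# A topological order doubles as a sensible migration order:
# upstream producers move before their consumers.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ingest_orders first, daily_report last
```

The same adjacency structure can feed a graphing tool to produce the visual DAG recommended above.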

Phase 3: Anticipating Common Migration Pitfalls

As you move from assessment to planning, you'll encounter common challenges. Here's a breakdown of frequent issues, why they occur, and how to solve them.

Category 1: Connectivity and Networking Errors

Connecting your new Databricks workspace to your existing AWS resources securely is a common first hurdle.

Pitfall: Databricks clusters cannot access S3 buckets or RDS databases.

  • Why it happens: This is almost always a networking or permissions issue. The Databricks control plane and data plane (your clusters) live in a VPC. If this VPC cannot communicate with your data sources, the connection will fail. Common causes include:

    1. Missing VPC Endpoints: Your data in S3 or RDS might not be publicly accessible. Without a VPC endpoint for S3 or for your database service, traffic from your Databricks VPC has no route to reach the service over the AWS private network.
    2. Incorrect Security Group Rules: The security group attached to your Databricks cluster nodes may not have an outbound rule allowing traffic to the RDS database's security group on the required port (e.g., 5432 for PostgreSQL). Conversely, the RDS security group may not have an inbound rule allowing traffic from the cluster's security group.
    3. Network ACLs (NACLs): These stateless firewalls at the subnet level can block traffic even if security groups allow it. They are a less common but possible cause.
  • How to fix and best practices:

    • Deploy Databricks in Your Own VPC: The most robust solution is to deploy the Databricks data plane in a VPC that you control. This gives you full authority over routing, security groups, and endpoints.
    • Use VPC Endpoints: For any AWS service you need to access (S3, STS, Kinesis, etc.), create a Gateway or Interface Endpoint in your VPC. This keeps traffic within the AWS network, which is more secure and often lower latency.
    • Configure Security Groups Correctly: Think of security groups as "allow lists." Your Databricks cluster security group needs an outbound rule to your data source. Your data source's security group needs an inbound rule from your cluster's security group. Be specific with ports.
    • Use Instance Profiles for S3 Access: Attach an IAM role as an instance profile to your Databricks clusters. This role should have a policy granting it the necessary s3:GetObject, s3:PutObject, and s3:ListBucket permissions on the required buckets. This is the most secure way to grant S3 access without hardcoding keys.
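The security-group check in particular lends itself to a quick sanity script: given the cluster's outbound rules and the database's inbound rules, verify there is a path on the database port. A pure-Python sketch (the rule shapes loosely mirror what `describe_security_groups` returns, but are simplified and ignore NACLs and routing):

```python
def can_reach(cluster_sg_egress, db_sg_ingress, db_port, cluster_sg_id):
    """Return True if traffic from the cluster SG can reach the DB port.

    Each rule is a dict with 'from_port', 'to_port', and optionally a
    'source_sg' (an SG reference). Simplified: NACLs and route tables
    also need checking in a real audit.
    """
    egress_ok = any(
        r["from_port"] <= db_port <= r["to_port"] for r in cluster_sg_egress
    )
    ingress_ok = any(
        r["from_port"] <= db_port <= r["to_port"]
        and r.get("source_sg") == cluster_sg_id
        for r in db_sg_ingress
    )
    return egress_ok and ingress_ok

# Cluster allows all outbound; RDS allows 5432 only from the cluster SG.
egress = [{"from_port": 0, "to_port": 65535}]
ingress = [{"from_port": 5432, "to_port": 5432, "source_sg": "sg-cluster"}]
print(can_reach(egress, ingress, 5432, "sg-cluster"))  # True
print(can_reach(egress, ingress, 5432, "sg-other"))    # False
```

Running a check like this against every cluster-to-source pair catches most "cannot connect" surprises before the first job runs.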

Category 2: Performance and Cost Bottlenecks

A common expectation is that Databricks will automatically be faster and cheaper. While the platform is highly optimized, a "lift and shift" of inefficient workloads will likely lead to disappointing results.

Pitfall: My migrated Spark job runs slower and/or costs more on Databricks than on EMR/Glue.

  • Why it happens:

    1. Incorrect Instance Type Mapping: You might choose Databricks worker types that don't match the workload profile. For example, moving a memory-intensive EMR job from r5.4xlarge instances to compute-optimized c5.4xlarge workers in Databricks can lead to memory pressure and disk spilling, slowing down the job.
    2. Ignoring the Photon Engine: Running a vanilla Spark job without enabling Databricks' high-performance Photon engine means you're leaving a massive performance boost on the table. Photon is a C++ vectorized execution engine that dramatically accelerates Spark SQL and DataFrame operations.
    3. Inefficient Data Layout: Migrating raw, unoptimized Parquet or JSON files can lead to massive data scanning and shuffling. Your EMR job might have been "fast enough," but it was likely wasting compute cycles.
    4. Poor Cluster Configuration: Using fixed-size clusters for spiky workloads leads to wasted cost during idle times or insufficient resources during peaks.
  • How to fix and best practices:

    • Right-Size Your Clusters: Use the workload profiles from your assessment. For I/O-heavy jobs, choose storage-optimized instances (i-series). For memory-heavy jobs, use memory-optimized instances (r-series). Start with a comparable instance family to your EMR setup and then optimize.
    • Enable Photon Everywhere: For any workload that uses Spark SQL or DataFrames, enable Photon. It's a simple checkbox in the cluster configuration and often provides a 2-5x performance improvement with no code changes.
    • Convert to Delta Lake: This is the single most impactful optimization. Convert your Parquet tables to Delta Lake. This gives you ACID transactions, but more importantly for performance, it unlocks features like:
      • Z-Ordering: A multi-dimensional clustering technique that co-locates related data, dramatically reducing the amount of data scanned.
      • Data Skipping: Delta Lake collects statistics on your data, allowing queries to skip entire files that don't contain relevant information.
    • Leverage Autoscaling: Configure your Databricks clusters to autoscale. Set a minimum number of workers to handle the baseline load and a maximum to handle peaks. Databricks has optimized autoscaling that adds and removes nodes much more gracefully than traditional EMR autoscaling, saving significant costs.
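The autoscaling argument is easy to quantify from your CloudWatch usage profile. A back-of-the-envelope sketch (the hourly rate and the workload shape are invented for illustration; real numbers come from your assessment):

```python
def fixed_cost(workers, hours, rate_per_worker_hour):
    """Cost of a fixed-size cluster running for the whole window."""
    return workers * hours * rate_per_worker_hour

def autoscaled_cost(hourly_demand, min_workers, max_workers, rate):
    """Cost when the cluster tracks demand between min and max workers."""
    return sum(
        min(max(d, min_workers), max_workers) * rate for d in hourly_demand
    )

# Hypothetical day: 20 quiet hours needing 2 workers, 4 peak hours needing 16.
demand = [2] * 20 + [16] * 4
rate = 1.0  # illustrative $/worker-hour

print(fixed_cost(16, 24, rate))              # 384.0 -- sized for the peak
print(autoscaled_cost(demand, 2, 16, rate))  # 104.0 -- tracks the load
```

Even this crude model shows why spiky workloads on fixed-size clusters dominate many AWS bills, and why the assessment's per-job runtime profile matters.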

Category 3: Schema and Data Format Challenges

Data is rarely clean. Migrating pipelines often exposes hidden issues with data quality and schema consistency.

Pitfall: My pipeline breaks when a source system adds a new column or changes a data type.

  • Why it happens:

    1. Rigid Schema Definition: Traditional ETL tools, including AWS Glue crawlers, often define a rigid schema. When new data arrives that doesn't fit (e.g., a new column appears), the job fails because it can't parse the file. This is known as schema drift.
    2. Inconsistent File Formats: You might have a mix of gzipped CSVs, snappy-compressed Parquet, and nested JSON files in the same S3 location. A single job trying to read all of these without proper configuration is bound to fail.
  • How to fix and best practices:

    • Embrace Schema Evolution with Delta Lake: When you use Delta Lake as the sink for your ingestion pipelines, you can enable schema evolution. Simply add the option .option("mergeSchema", "true") to your Spark write command. Delta will automatically and safely add the new column to the table's schema without breaking existing queries.
    • Use Auto Loader for Ingestion: Databricks Auto Loader is designed specifically for ingesting files from cloud storage like S3. It can automatically infer schema, handle schema drift, and efficiently process new files as they arrive. It's a massive improvement over setting up S3 events, SQS queues, and Lambda functions manually. It's the robust and scalable replacement for Glue crawlers and S3-triggered jobs.
    • Standardize on an Open Format: For your core data lake tables, standardize on Delta Lake. For raw ingestion zones, try to enforce a consistent format like Parquet or JSONL (newline-delimited JSON) from your source systems where possible.
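Before relying on `mergeSchema`, it's worth knowing what drift you actually have. A small pure-Python sketch that diffs incoming records against an expected schema (the column names are invented):

```python
def detect_drift(expected_schema, records):
    """Compare incoming record keys against an expected schema.

    expected_schema: dict of column -> type name.
    Returns (new_columns, missing_columns) across all records.
    """
    seen = set()
    for rec in records:
        seen.update(rec.keys())
    new_cols = sorted(seen - expected_schema.keys())
    missing = sorted(expected_schema.keys() - seen)
    return new_cols, missing

expected = {"order_id": "string", "amount": "double", "ts": "timestamp"}
batch = [
    {"order_id": "a1", "amount": 9.5, "ts": "2026-01-01", "coupon": "NEW10"},
    {"order_id": "a2", "amount": 3.0, "ts": "2026-01-01"},
]
print(detect_drift(expected, batch))  # (['coupon'], [])
```

Running a check like this over a sample of each source during the assessment tells you which pipelines need schema evolution enabled and which can stay strict.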

Category 4: Security and Governance Missteps

Migrating data and compute is only half the battle. You must also migrate your security and governance model.

Pitfall: Users have either too much or too little access to data after migration.

  • Why it happens:

    1. Incorrect IAM Role Translation: A direct "lift and shift" of IAM policies designed for EMR/Glue doesn't always map cleanly to the Databricks model. You might grant a cluster overly broad permissions or fail to provide access to necessary resources.
    2. Lack of Fine-Grained Access Control: Relying solely on IAM roles for S3 bucket-level access is too coarse. You need to control access to specific tables, rows, and columns, especially in a multi-tenant environment.
  • How to fix and best practices:

    • Implement Unity Catalog: Unity Catalog is Databricks' unified governance solution for data and AI. It should be the cornerstone of your security strategy. It provides a central place to manage access control for all your data assets, regardless of where they are stored.
    • Map IAM to Unity Catalog:
      • Use IAM roles for infrastructure access (the instance profile on your cluster to access S3).
      • Use Unity Catalog for data access: GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO data_analysts_group;
    • Centralize Governance: Instead of managing permissions across dozens of IAM policies and S3 bucket policies, you manage them in one place with standard SQL GRANT/REVOKE commands in Unity Catalog. This also provides a full audit log of who accessed what data and when.
    • Embrace Attribute-Based Access Control (ABAC): For advanced use cases, use Unity Catalog to define policies based on user attributes (tags). For example, a user with the 'PII_access' tag can see columns marked with the 'PII' tag.
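Mapping existing groups onto Unity Catalog grants can be scripted once you've decided the mapping. A sketch that emits the GRANT statements from a simple group-to-table mapping (the group and table names are placeholders):

```python
def build_grants(mapping):
    """Turn {group: [(privilege, table), ...]} into UC GRANT statements."""
    stmts = []
    for group, entries in mapping.items():
        for privilege, table in entries:
            stmts.append(f"GRANT {privilege} ON TABLE {table} TO `{group}`;")
    return stmts

mapping = {
    "data_analysts": [("SELECT", "main.sales.orders")],
    "data_engineers": [("SELECT", "main.sales.orders"),
                       ("MODIFY", "main.sales.orders")],
}
for stmt in build_grants(mapping):
    print(stmt)
```

Generating grants from a reviewed mapping file, rather than hand-typing them, keeps the translation from IAM groups auditable and repeatable.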

Category 5: Orchestration and Scheduling Conflicts

Your data pipelines don't run in a vacuum. They are part of a larger, scheduled workflow.

Pitfall: My migrated Airflow/Step Functions workflow fails to trigger Databricks jobs correctly.

  • Why it happens:

    1. Simple Refactoring: You might replace a GlueStartJobRunOperator in Airflow with a DatabricksRunNowOperator but fail to account for differences in how parameters are passed, how job status is reported, or how failures are handled.
    2. Mixing Orchestration Models: Trying to manage parts of a workflow in Databricks Workflows and other parts in an external orchestrator without a clear strategy can lead to race conditions and complex dependency management.
  • How to fix and best practices:

    • Option 1: Consolidate on Databricks Workflows: For new pipelines or pipelines that are being significantly re-architected, consider rebuilding them entirely using Databricks Workflows. This is the most native and tightly integrated solution, offering features like task value passing and easy multi-task job creation.
    • Option 2: Use an External Orchestrator (The Right Way): If you have a significant investment in Airflow or Step Functions, you don't have to abandon it.
      • For Airflow: Use the official Databricks provider. The DatabricksSubmitRunOperator is ideal, as it defines the entire job (cluster spec, libraries, parameters) within your DAG code, making your workflow self-contained and version-controlled.
      • For Step Functions: Use the Databricks REST API. You can invoke the Jobs API from a Lambda function within your Step Function state machine to start a job run and poll for its completion status.
    • Keep Orchestration Logic Centralized: As a general rule, avoid having one orchestrator call another. Decide on a primary orchestrator for each end-to-end business process. Let Airflow manage the business-level flow, and have it trigger self-contained, multi-task jobs within Databricks.
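For the Step Functions route, the Lambda only needs two calls against the Databricks Jobs API: one to start the run and one to poll its life-cycle state. A standard-library sketch (the workspace host and token are placeholders; the paths follow the Jobs API 2.1):

```python
import json
import urllib.request

# Life-cycle states after which a run will not change again.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def is_terminal(life_cycle_state):
    """True once a Databricks job run has finished (successfully or not)."""
    return life_cycle_state in TERMINAL_STATES

def _call(host, token, path, payload=None):
    # Minimal JSON helper; a real Lambda would add timeouts and retries.
    req = urllib.request.Request(
        f"{host}{path}",
        data=json.dumps(payload).encode() if payload else None,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_run(host, token, job_id):
    """Trigger a run of an existing job via POST /api/2.1/jobs/run-now."""
    return _call(host, token, "/api/2.1/jobs/run-now", {"job_id": job_id})["run_id"]

def get_state(host, token, run_id):
    """Fetch a run's life-cycle state via GET /api/2.1/jobs/runs/get."""
    body = _call(host, token, f"/api/2.1/jobs/runs/get?run_id={run_id}")
    return body["state"]["life_cycle_state"]

print(is_terminal("RUNNING"), is_terminal("TERMINATED"))  # False True
```

In a state machine, one Lambda invokes `start_run`, then a Wait-and-poll loop calls `get_state` until `is_terminal` returns True, at which point the run's result state decides success or failure.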

Your Pre-Migration Assessment Checklist

Use this checklist to guide your assessment process.

Phase 1: Inventory
* [ ] Run an automated script to inventory S3, Glue, EMR, and Redshift.
* [ ] Manually supplement the inventory with RDS, Lambda, Step Functions, and Airflow DAGs.
* [ ] For each S3 bucket, document its purpose, data format, and partitioning.
* [ ] For each job (Glue/EMR), document its code location, schedule, and libraries.
* [ ] For each database (Redshift/RDS), document its size, schemas, and primary users.

Phase 2: Analysis
* [ ] Categorize all processing jobs by workload type (ETL, ML, etc.).
* [ ] Analyze CloudWatch metrics to find the longest-running and most frequent jobs.
* [ ] Analyze AWS Cost Explorer to identify the most expensive workloads.
* [ ] Map data lineage for your top 5-10 most critical pipelines. Create a visual DAG.
* [ ] Review the code for complex jobs, noting custom UDFs, RDD usage, or non-Spark dependencies.

Phase 3: Planning & Pitfall Avoidance
* [ ] Design your target network architecture (e.g., Databricks in a new VPC with peering/endpoints).
* [ ] Plan your IAM and Unity Catalog security model. How will you map existing user groups?
* [ ] For each major workload, draft a migration plan:
* Lift-and-Shift: (e.g., Simple Spark SQL Glue job) -> Convert to a Databricks job.
* Refactor/Optimize: (e.g., Inefficient EMR job on Parquet) -> Convert to a Databricks job on Delta Lake with Z-Ordering.
* Re-architect: (e.g., Complex Step Functions/Lambda flow) -> Rebuild as a multi-task Databricks Workflow using Auto Loader.
* [ ] Identify your pilot workload—a business-critical but non-disruptive pipeline to migrate first.

Conclusion

Migrating your data workloads from a diverse AWS stack to the Databricks Lakehouse Platform is a transformative step. It promises a future of simplified architecture, accelerated performance, and unified governance. But this future is built on a foundation of diligent, upfront planning.

By treating the pre-migration assessment not as a chore, but as the first and most critical phase of your migration project, you turn the journey from a leap of faith into a well-executed strategy. You move from hoping for a good outcome to engineering one. Take the time to inventory your assets, analyze your workloads, and plan for the common pitfalls. Your future self—enjoying a faster, more efficient, and unified data platform—will thank you for it.


Migration Best Practices: A Quick Summary

  • Automate Your Inventory: Don't waste time in the AWS console. Use scripts to get a fast, accurate picture of your environment.
  • Analyze, Don't Just Catalog: Go beyond "what" you have to "how" it's used. Focus on dependencies, costs, and performance bottlenecks.
  • Prioritize with a Pilot: Don't try to boil the ocean. Select a meaningful but low-risk workload for your first migration to learn the process and build momentum.
  • Embrace Delta Lake and Photon: These are not just features; they are core to the value of Databricks. Plan to convert your data to Delta Lake and enable Photon on all eligible clusters.
  • Govern with Unity Catalog: Make Unity Catalog the heart of your security model from day one for fine-grained, centralized, and auditable data governance.
  • Choose the Right Orchestration Strategy: Decide whether to consolidate on Databricks Workflows or integrate with your existing orchestrator correctly using official providers and APIs.
  • Think Optimization, Not Just Migration: Use this as an opportunity to fix old problems. Re-architect inefficient pipelines instead of just moving them.