Top 100 Errors in DataStage to Databricks Migration (And How to Fix Them)

L+ Editorial
Jan 25, 2026

Migrating from a seasoned, on-premises ETL workhorse like IBM DataStage to a modern, cloud-native platform like Databricks is more than just a technical upgrade—it's a paradigm shift. It’s moving from a visual, server-based world to a code-first, distributed computing universe powered by Apache Spark. This journey promises incredible gains in scalability, flexibility, and the ability to unify your data and AI workloads.

But let's be honest: the path is rarely a straight line. It's a complex process filled with potential pitfalls that can derail timelines, inflate budgets, and frustrate even the most experienced data teams. The fundamental differences in architecture, execution models, and development practices mean that a simple "lift and shift" is a recipe for failure.

Think of your DataStage environment as a highly specialized, well-organized factory with fixed assembly lines. Databricks, on the other hand, is a state-of-the-art, modular workshop where you can build, reconfigure, and scale production lines on the fly. To succeed, you can't just move the old machinery; you have to learn how to use the new tools.

This guide is your field manual. We've compiled a comprehensive list of the top 100 errors encountered during a DataStage to Databricks migration, based on real-world projects and hard-won experience. We’ve grouped them into logical categories to help you anticipate challenges, debug issues, and ultimately, execute a smoother, more successful data platform modernization.


Category 1: Planning and Scoping Errors

The most critical errors often happen before a single job is migrated. A flawed strategy will lead to technical debt and organizational friction down the line.

  1. The "Boil the Ocean" Strategy: Attempting to migrate all DataStage projects at once.

    • Why it Occurs: Over-ambition and a failure to recognize the complexity involved.
    • How to Fix: Start with a pilot project. Select a business-critical but manageable data pipeline. Use it as a learning experience to develop patterns, best practices, and accurate estimates for the rest of the migration.
  2. Underestimating the "Paradigm Shift": Believing Databricks is just a faster version of DataStage.

    • Why it Occurs: Lack of understanding of distributed computing (Spark) versus parallel processing (DataStage).
    • How to Fix: Invest in training. Your team needs to think in terms of DataFrames, transformations, and lazy evaluation, not visual stages and links.
  3. Ignoring Business Stakeholder Buy-in: Treating migration as a purely technical IT project.

    • Why it Occurs: Failing to communicate the "why" behind the migration—cost savings, faster insights, AI/ML capabilities.
    • How to Fix: Develop a clear business case. Involve business users early to manage expectations about potential downtime, changes in reports, and new capabilities.
  4. No Clear Definition of "Done": Lacking concrete success criteria for a migrated job.

    • Why it Occurs: Vague project goals.
    • How to Fix: Define success for each pipeline: data must be 100% reconciled, performance must be within X% of the original (or better), and the job must run reliably within the new orchestration framework.
  5. Inaccurate Inventory and Complexity Analysis: Misjudging the number of jobs, their dependencies, and their inherent complexity.

    • Why it Occurs: Relying on outdated documentation or manual counts.
    • How to Fix: Use automated discovery tools or scripts to parse DataStage project files (.dsx). Analyze job types (parallel vs. server), stage usage (especially custom stages), and sequencer logic to build a data-driven migration plan.
  6. Choosing the Wrong Migration Approach (Manual vs. Automated): Committing to a full manual rewrite or over-relying on a "magic bullet" automation tool.

    • Why it Occurs: A binary view of the problem.
    • How to Fix: Adopt a hybrid strategy. Use automation tools to convert the 60-80% of "vanilla" ETL logic, freeing up your senior engineers to focus on the complex 20% that requires manual redesign and optimization.
  7. Neglecting a Data Governance Strategy: Assuming the old governance model will map to the new platform.

    • Why it Occurs: Focusing solely on the ETL code.
    • How to Fix: Plan for Databricks Unity Catalog from day one. Define how you will manage data access, lineage, and quality in the new ecosystem.
  8. Forgetting About the "Long Tail" of Scripts: Ignoring the shell scripts, Perl scripts, and other auxiliary processes that surround DataStage jobs.

    • Why it Occurs: A narrow focus on the .dsx files.
    • How to Fix: During the inventory phase, catalog all wrapper scripts and pre/post-processing commands. Plan to rewrite them in Python or integrate them into Databricks Workflows.
  9. Lack of a Decommissioning Plan: Having no strategy for turning off DataStage after the migration.

    • Why it Occurs: The migration is seen as "finished" once the jobs are running in Databricks.
    • How to Fix: Create a phased decommissioning plan. Run both systems in parallel for a defined period (e.g., one business cycle), validate the results, and then formally switch off and archive the old environment to realize cost savings.
  10. Not Defining Naming Conventions and Standards: Allowing a free-for-all in naming notebooks, jobs, clusters, and tables.

    • Why it Occurs: Rushing into development without setting ground rules.
    • How to Fix: Establish and enforce clear naming conventions (e.g., proj_layer_table_name for tables, job_pipeline_task_name for jobs). This is crucial for maintainability and automation.

Category 2: Connectivity and Environment Errors

You can't move data if you can't connect to it. These errors are the foundational roadblocks that often appear at the very beginning of the technical work.

  1. Firewall and Network Peering Issues: Databricks clusters cannot reach on-premises data sources (like DB2, Oracle, or mainframes).

    • Why it Occurs: Cloud VPCs and on-prem data centers are on separate, isolated networks.
    • How to Fix: Work with your network team to establish secure connectivity via VPN, AWS Direct Connect, or Azure ExpressRoute. Configure VPC peering and ensure security groups/NSGs allow traffic on the required database ports.
  2. Incorrect JDBC/ODBC Driver Configuration: Failing to install or configure the correct drivers for legacy sources.

    • Why it Occurs: The Databricks runtime doesn't include proprietary drivers out-of-the-box.
    • How to Fix: Use init scripts to install the necessary JDBC/ODBC driver JARs or libraries onto your Databricks clusters at startup. Store the driver files in a location like DBFS or a cloud storage bucket.
  3. Mishandling of Credentials: Hardcoding usernames and passwords in notebooks.

    • Why it Occurs: A quick-and-dirty approach to get a connection working. This is a major security risk.
    • How to Fix: Use Databricks Secrets, backed by Azure Key Vault or AWS Secrets Manager. Store credentials securely and reference them programmatically in your code using dbutils.secrets.get().
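As a hedged sketch of the pattern (the scope, key names, and JDBC URL here are hypothetical, and `dbutils`/`spark` are provided by the Databricks runtime):

```python
# Hypothetical secret scope and key names -- replace with your own.
jdbc_user = dbutils.secrets.get(scope="etl-secrets", key="db2-username")
jdbc_pass = dbutils.secrets.get(scope="etl-secrets", key="db2-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/SAMPLE")  # hypothetical URL
      .option("dbtable", "SCHEMA.CUSTOMERS")
      .option("user", jdbc_user)
      .option("password", jdbc_pass)
      .load())
```

As a bonus, Databricks redacts secret values if they are accidentally printed in notebook output.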
  4. Issues with Mainframe Connectivity: Struggling to read EBCDIC data formats or connect to mainframe datasets.

    • Why it Occurs: Mainframe data is often in proprietary, non-standard formats.
    • How to Fix: Use specialized third-party connectors or libraries designed for mainframe data integration. Alternatively, establish a staging area where mainframe data is first extracted and converted to a standard format (like UTF-8 encoded CSV or Parquet) before being ingested by Databricks.
  5. Slow Metadata Operations on File Systems: LIST operations on object storage (S3/ADLS) are slow when dealing with millions of small files.

    • Why it Occurs: Object stores are not traditional filesystems and can be slow for directory listings. DataStage might have worked with a faster, mounted NFS.
    • How to Fix: Use Delta Lake. It includes a transaction log that tracks file metadata, turning slow LIST operations into fast queries against the log. Also, avoid creating millions of tiny files; aim for larger, well-sized files (e.g., 1GB).
  6. Throttling from Source/Target APIs: Hitting API rate limits when pulling data from SaaS platforms (like Salesforce).

    • Why it Occurs: Spark's parallelism can overwhelm source system APIs that are designed for single-threaded access.
    • How to Fix: Reduce the parallelism of your Spark read (.repartition()), implement exponential backoff and retry logic in your code, or use a connector that handles throttling gracefully.
  7. Incorrect Cross-Account/Cross-Subscription IAM Roles: Databricks compute cannot access data in a different cloud account or subscription.

    • Why it Occurs: Improperly configured trust relationships and permissions for IAM roles or Service Principals.
    • How to Fix: Carefully configure cross-account IAM roles (AWS) or App Registrations with appropriate permissions on the target storage accounts (Azure). Attach these roles to your Databricks clusters.
  8. Environment Inconsistency (Dev/Test/Prod): A job works in development but fails in production.

    • Why it Occurs: Differences in cluster configurations, library versions, network access, or environment variables between environments.
    • How to Fix: Use infrastructure-as-code (Terraform, Pulumi) and Databricks Asset Bundles to define and deploy your environments consistently. Pin library and runtime versions.
  9. Forgetting to Mount Storage: Writing code that references local file paths (/tmp/) instead of distributed storage paths (/dbfs/mnt/).

    • Why it Occurs: A classic mistake when moving from a single-server mindset. Files written to a driver's local disk are not accessible to executors.
    • How to Fix: Always read from and write to distributed storage (S3, ADLS, DBFS mounts). Use mounts for convenience but prefer direct paths (e.g., s3a://...) for better performance and security.
  10. SSL/TLS Certificate Errors: Connection failures due to untrusted or misconfigured certificates.

    • Why it Occurs: On-premises systems may use internal Certificate Authorities (CAs) not trusted by the default Java trust store on Databricks clusters.
    • How to Fix: Add your organization's custom CA certificates to the trust store on the cluster using an init script.

Category 3: Schema and Data Type Mismatches

Data types are the grammar of your data language. When translations are wrong, the meaning is lost, and your pipelines break.

  1. Precision Loss with Decimal/Numeric Types: DataStage Decimal(38,18) becomes a Double in Spark, causing rounding errors.

    • Why it Occurs: Default type inference in Spark can choose floating-point types for high-precision numbers.
    • How to Fix: Explicitly define the schema when reading data. Use Spark's DecimalType(precision, scale) to match the source system's precision exactly.
  2. Incorrect Timestamp Handling: Time zone issues or incorrect parsing of Timestamp formats.

    • Why it Occurs: DataStage has flexible but sometimes implicit timestamp handling. Spark is strict and assumes UTC by default.
    • How to Fix: Explicitly set the session time zone (spark.conf.set("spark.sql.session.timeZone", "...")). When parsing strings, provide the exact format string (e.g., to_timestamp(col, 'yyyy-MM-dd HH:mm:ss.SSS')).
  3. Misinterpretation of NULL Values: Empty strings ('') from a source file are loaded as NULL, or vice-versa, breaking downstream logic.

    • Why it Occurs: Different systems have different conventions. DataStage might treat '' and NULL differently in its logic.
    • How to Fix: When reading data (especially from CSVs), use reader options like nullValue and emptyValue to define the behavior explicitly. Cleanse data with when(col("c") == "", None).otherwise(col("c")).
  4. Ignoring Schema Drift: The source system adds, removes, or renames a column, causing the Spark job to fail.

    • Why it Occurs: Brittle, hardcoded schemas in the migration code.
    • How to Fix: Use Delta Lake with Schema Evolution (.option("mergeSchema", "true")). This allows your target table to gracefully adapt to new columns in the source data without failing the job. Also, implement monitoring to alert on schema changes.
  5. Character Encoding Mismatches (e.g., UTF-8 vs. Latin-1): Text data appears garbled or corrupted after ingestion.

    • Why it Occurs: Source files are saved with a different encoding than what Spark's reader assumes (default is usually UTF-8).
    • How to Fix: Specify the encoding during the read operation, for example: spark.read.option("encoding", "ISO-8859-1").csv(...).
  6. DataStage VarChar vs. Spark String: Overlooking that VarChar has a length limit while Spark String does not, potentially hiding truncation issues that happened in the source.

    • Why it Occurs: A subtle difference in type systems. The data might already have been truncated in DataStage.
    • How to Fix: This is more of a validation issue. During testing, profile column lengths in the source and target to ensure no unexpected data loss occurred in the original DataStage job.
  7. Handling of Complex/Nested Structures: Flattening JSON or XML from a source instead of using Spark's native support for complex types.

    • Why it Occurs: A relational-only mindset.
    • How to Fix: Embrace structs and arrays. Read nested data directly into DataFrames and use dot notation (df.select("address.city")) or functions like explode() to work with the nested structure. This is often more efficient than premature flattening.
  8. Implicit Type Casting Failures: A Spark job fails because a column that was sometimes a number and sometimes a string can no longer be implicitly cast.

    • Why it Occurs: Spark's type system is less lenient than some legacy databases or DataStage's internal logic.
    • How to Fix: Explicitly cast columns using cast(). Add data quality checks to identify and clean mixed-type data at the source or in a staging layer.
  9. Case Sensitivity Mismatches: Queries fail because a column is named customerID in one layer but referenced as customerid in another.

    • Why it Occurs: Spark's column resolution is case-insensitive by default (spark.sql.caseSensitive is false), but column names keep their original casing, and some source databases (like SQL Server) and downstream tools handle case differently.
    • How to Fix: Enforce consistent casing standards (best practice). If you genuinely need case-sensitive resolution, set spark.sql.caseSensitive to true, and be aware that changing this setting can have unintended side effects.
  10. Boolean/Integer Representation: Confusing 1/0, T/F, and true/false for boolean values.

    • Why it Occurs: Different systems store booleans in different ways.
    • How to Fix: Standardize during ingestion. Use a CASE statement or when() clause to explicitly map all variants to a true Spark BooleanType.

Category 4: ETL Logic and Code Conversion Errors

This is the core of the migration—translating the visual, proprietary logic of DataStage into declarative, distributed Spark code.

  1. Thinking in Rows, Not Columns (Procedural vs. Declarative): Rewriting DataStage logic line-by-line in a Python for loop.

    • Why it Occurs: Trying to replicate the row-by-row processing of a DataStage Transformer stage. This is the #1 anti-pattern in Spark.
    • How to Fix: Think in terms of whole-DataFrame transformations. Use columnar functions like withColumn, select, filter, and Spark SQL. A for loop on a DataFrame is almost always wrong.
  2. Incorrectly Migrating the Transformer Stage: Using complex and inefficient User-Defined Functions (UDFs) for logic that can be done with native Spark functions.

    • Why it Occurs: UDFs seem like an easy way to port complex business logic from DataStage functions.
    • How to Fix: Avoid Python/Scala UDFs whenever possible. They are "black boxes" to Spark's optimizer. Spend time learning the vast library of built-in Spark SQL functions—they are far more performant. Use UDFs only as a last resort.
  3. Misunderstanding Lookup vs. Join: Implementing a DataStage Lookup stage as a full Spark join, causing performance issues.

    • Why it Occurs: The concepts are similar but have different performance implications.
    • How to Fix: For small lookup tables (the "reference" link in a DataStage Lookup), use a broadcast join. This sends a copy of the small table to every executor, avoiding a costly shuffle: pyspark.sql.functions.broadcast(small_df).
  4. Replicating Aggregator Stage Inefficiently: Performing aggregations in a way that causes massive data shuffles.

    • Why it Occurs: Not understanding how Spark's groupBy works.
    • How to Fix: Be selective in your groupBy keys. Filter data before the aggregation to reduce the amount of data being shuffled. If possible, use functions that can be partially aggregated on each executor before the final shuffle (e.g., sum, count, max).
  5. Not Using Window Functions for Complex Ranking/Lag Logic: Using self-joins to replicate DataStage functions that look at previous or next rows.

    • Why it Occurs: Unfamiliarity with modern SQL capabilities.
    • How to Fix: Use Spark's powerful window functions (row_number, lag, lead, rank) over a WindowSpec. This is vastly more efficient and readable than a self-join.
  6. Mishandling NULLs in Joins and Logic: Getting unexpected results because NULL does not equal NULL.

    • Why it Occurs: NULL = NULL evaluates to NULL, not true, so a join on a.key = b.key drops every row whose key is NULL on either side. DataStage might have different default behavior.
    • How to Fix: Use the NULL-safe equality operator (<=>) for join keys if you need to match NULLs. Explicitly handle NULLs in your filter and CASE logic (col IS NULL instead of col = NULL).
  7. Converting DataStage Basic Functions Incorrectly: Struggling to find equivalents for functions like Ereplace, Iconv, Oconv.

    • Why it Occurs: These are highly specific, non-standard functions.
    • How to Fix: There is no 1:1 mapping. You must understand what the function does and reimplement that logic using standard Spark SQL functions (e.g., regexp_replace, date_format, cast). This often requires manual analysis.
  8. Ignoring the Role of the Funnel Stage: Simply using union for everything.

    • Why it Occurs: A Funnel stage in DataStage has multiple modes (sequence, sort, continuous).
    • How to Fix: A simple Funnel can be a unionByName. A sorted funnel requires a union followed by an orderBy. A continuous funnel is more complex and may require a custom implementation depending on the specific need.
  9. Incorrectly Handling Surrogate Keys: Generating keys in a non-distributed-friendly way.

    • Why it Occurs: Porting a database sequence-based approach that doesn't work across a distributed cluster.
    • How to Fix: Use monotonically_increasing_id() for a unique but non-consecutive ID. For deterministic keys based on business columns, create a composite key and hash it (sha2 or md5). For true sequential keys, this is tricky; it often requires a single-node process post-load or using a Delta MERGE with a dedicated key generation table.
  10. Translating Job Control Language (JCL) or Sequencers Poorly: Creating a monolithic notebook for a complex multi-step DataStage sequence.

    • Why it Occurs: A naive translation of a sequence of jobs.
    • How to Fix: Use Databricks Workflows. Break the sequence into modular tasks (notebooks or Python scripts). Use task dependencies to orchestrate the flow, and pass parameters between tasks to replicate sequencer logic.
  11. Creating God-Notebooks: A single, 2000-line notebook that implements an entire complex DataStage job.

    • Why it Occurs: Easier to develop initially, but impossible to debug, test, or maintain.
    • How to Fix: Break down logic into modular functions and, if necessary, multiple notebooks. A common pattern is a "bronze" notebook for ingestion, a "silver" for cleansing/transformation, and a "gold" for aggregation/serving.
  12. Losing Job Lineage: Not being able to trace data from the final table back to its source.

    • Why it Occurs: Code-based pipelines can be harder to visualize than a DataStage diagram.
    • How to Fix: Leverage Databricks Unity Catalog, which automatically captures column-level lineage for Spark workloads. Adhere to good coding practices (modular functions, clear variable names) to make lineage easier to understand.
  13. Replicating Peek Stage for Debugging: Trying to view intermediate data in a way that disrupts the Spark execution plan.

    • Why it Occurs: In DataStage, Peek is a common debugging tool.
    • How to Fix: Use .display() or .show() in an interactive notebook for debugging. Be aware this triggers a Spark action and can be slow on large datasets. For production code, remove these and rely on logging and data validation tests.
  14. Overusing .collect(): Pulling large amounts of data from the Spark executors to the driver node.

    • Why it Occurs: Trying to process data in a traditional Python loop or inspect results.
    • How to Fix: Never use .collect() on a large DataFrame. It will cause an OutOfMemoryError on the driver. If you need to bring data back, use .take(n) or .tail(n) for a sample, or write the results to storage and read them from there.
  15. Misunderstanding Partitioning Logic: Translating DataStage's hash/round-robin partitioning without understanding Spark's partitioning.

    • Why it Occurs: DataStage partitioning is about distributing work across parallel processes on a server. Spark partitioning is about data locality and avoiding shuffles.
    • How to Fix: In Spark, you partition data on write using .partitionBy("date", "country"). This creates a directory structure in your storage that allows for efficient filtering (partition pruning). Use repartition() or coalesce() within a job only when you have a specific reason (like skewed data or controlling output file size).
  16. Incorrectly Implementing Slowly Changing Dimensions (SCDs): Writing complex, inefficient code to handle Type 1 and Type 2 updates.

    • Why it Occurs: SCD logic is complex to write from scratch.
    • How to Fix: Use the Delta Lake MERGE INTO statement. It is specifically designed for this purpose and provides an efficient, atomic, and easy-to-read syntax for handling inserts, updates, and deletes (or flagging old records) in a target table.
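A hedged sketch of the MERGE pattern for Type 2 (table and column names are hypothetical). This closes out the current version of changed rows and inserts brand-new keys; a complete Type 2 flow typically also stages the new version of each changed row (often by unioning "changed" rows into the source) before the MERGE:

```python
# Hypothetical Delta tables: dim_customer (target), customer_updates (staged changes).
spark.sql("""
    MERGE INTO dim_customer AS tgt
    USING customer_updates AS src
      ON tgt.customer_id = src.customer_id AND tgt.is_current = true
    WHEN MATCHED AND tgt.address <> src.address THEN
      UPDATE SET tgt.is_current = false, tgt.end_date = current_date()
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, start_date, end_date)
      VALUES (src.customer_id, src.address, true, current_date(), null)
""")
```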
  17. Forgetting about Checkpointing in Stateful Streaming: Migrating a near-real-time job without configuring checkpoints.

    • Why it Occurs: DataStage's real-time capabilities have a different state management model.
    • How to Fix: In Spark Structured Streaming, you must specify a checkpoint location. This is where Spark saves the state of your stream (e.g., current offsets, aggregations) so it can resume correctly after a failure.
  18. Not Leveraging Autoloader: Manually listing and processing new files in a landing zone.

    • Why it Occurs: Replicating an old pattern of "list files, loop, process".
    • How to Fix: Use Databricks Autoloader (spark.readStream.format("cloudFiles")). It automatically and efficiently discovers and processes new files as they arrive in cloud storage, managing state and file discovery for you. It's far more scalable than manual listing.
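A sketch of the pattern (paths and table name are hypothetical; the cloudFiles source is Databricks-only):

```python
# Hypothetical bucket paths and target table -- replace with your own.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "s3a://my-bucket/_schemas/orders")
      .load("s3a://my-bucket/landing/orders/")
      .writeStream
      .option("checkpointLocation", "s3a://my-bucket/_checkpoints/orders")
      .trigger(availableNow=True)  # drain the backlog, then stop
      .toTable("bronze.orders"))
```

The availableNow trigger processes everything outstanding and then stops, which suits batch-style schedules migrated from DataStage.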
  19. Incorrectly Handling Stage Variables: Using global variables or other non-Spark patterns to pass state between transformations.

    • Why it Occurs: Trying to replicate the behavior of stage variables in a DataStage Transformer.
    • How to Fix: The Spark equivalent is simply adding a new column to a DataFrame with withColumn. The result of one withColumn is available to the next one in the chain. This is the idiomatic way to carry state through a series of transformations.
  20. Failure to Parameterize Code: Hardcoding file paths, table names, and filter conditions in notebooks.

    • Why it Occurs: Quick development for a single use case.
    • How to Fix: Use Databricks Widgets or pass parameters via a workflow definition. This makes your code reusable across different environments (dev/test/prod) and for different data slices (e.g., processing a different date).

Category 5: Performance and Optimization Pitfalls

Your job runs, but it's 10x slower and 20x more expensive than in DataStage. Performance in a distributed system is a completely different ballgame.

  1. Ignoring Data Skew: One Spark task takes hours while the others finish in minutes.

    • Why it Occurs: One partition key has a disproportionately large amount of data (e.g., a NULL or default key). This overloads a single executor.
    • How to Fix: Analyze the data distribution. Filter out or handle skewed keys separately. Another technique is "salting": add a random key to the skewed data to distribute it across more partitions, then join it back.
  2. Causing Unnecessary Shuffles: Performing joins or aggregations on non-partitioned or poorly chosen keys.

    • Why it Occurs: A shuffle is when Spark has to move data between executors across the network. It's the most expensive operation.
    • How to Fix: Filter data early. Use broadcast joins for small tables. When writing data, partition it on the columns you will use for joining or filtering in downstream jobs.
  3. Using Inefficient File Formats: Sticking with CSV or JSON instead of columnar formats.

    • Why it Occurs: It's the easiest format to output from many source systems.
    • How to Fix: Standardize on Delta Lake (which uses Parquet underneath). Columnar formats allow Spark to read only the columns needed for a query (projection pushdown), dramatically improving performance.
  4. Not Compacting Small Files: A Delta table has tens of thousands of tiny files, slowing down reads.

    • Why it Occurs: High-frequency streaming or many small batch updates can create a small file problem.
    • How to Fix: Run the OPTIMIZE command on your Delta tables periodically. This compacts small files into larger, more optimal ones. Enable Optimized Write on your clusters to have Databricks manage file sizes automatically.
  5. Incorrect Cluster Sizing (Too Big or Too Small): Using a massive cluster for a small job (wasting money) or a tiny cluster for a huge job (causing it to fail or run forever).

    • Why it Occurs: Guesswork.
    • How to Fix: Start with a reasonable default and monitor performance using the Spark UI and Ganglia metrics. Use Databricks autoscaling clusters that can add/remove workers based on load. Right-size your instance types (compute-optimized vs. memory-optimized).
  6. Fetching Too Much Data: A SELECT * from a massive source table when you only need three columns.

    • Why it Occurs: Lazy coding and not understanding predicate/projection pushdown.
    • How to Fix: Always be specific. Use .select("col1", "col2") and .filter(...) early in your code. For many JDBC sources, Spark will "push down" these operations so the source database does the filtering, sending much less data over the network.
  7. Misusing Caching: Caching the wrong DataFrame or forgetting to un-cache.

    • Why it Occurs: Caching seems like a free performance boost.
    • How to Fix: Use caching (.cache() or .persist()) strategically when you will reuse a specific, expensive-to-compute DataFrame multiple times in the same job. Be sure to .unpersist() it when you are done to free up memory. Often, just letting Spark's query optimizer work is better than manual caching.
  8. Not Using Z-Ordering (Multi-dimensional Clustering): Queries on non-partitioned columns are slow.

    • Why it Occurs: Not taking advantage of Delta Lake's advanced optimization features.
    • How to Fix: If you frequently filter by a high-cardinality column that is not a partition key (e.g., user_id), use OPTIMIZE table_name ZORDER BY (user_id). This co-locates related data in the same set of files, making reads much faster.
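As a sketch (the table name is hypothetical; OPTIMIZE and ZORDER BY are Delta Lake commands):

```python
# Compact small files AND co-locate rows with nearby user_id values.
spark.sql("OPTIMIZE sales.events ZORDER BY (user_id)")

# Partition pruning, by contrast, comes from the physical layout chosen on write:
# df.write.partitionBy("event_date").format("delta").saveAsTable("sales.events")
```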
  9. Ignoring the Spark UI: Trying to debug performance issues by just looking at the code.

    • Why it Occurs: Lack of familiarity with Spark's powerful diagnostic tools.
    • How to Fix: Learn to read the Spark UI. It is essential for understanding your execution plan (DAG), identifying bottlenecks, seeing data skew, and diagnosing shuffle issues.
  10. Driver Node Bottlenecks: The driver runs out of memory, but the executors are idle.

    • Why it Occurs: Using .collect(), performing a join where one side is too large to be broadcast but small enough to be pulled to the driver, or using certain UDFs that force data through the driver.
    • How to Fix: Refactor your code to keep all data and processing on the distributed executors. Increase driver memory only as a last resort; the problem is usually in the code.
  11. Failing to Configure Spark Properties: Running with default shuffle partitions, memory settings, etc.

    • Why it Occurs: The defaults are not tuned for your specific workload.
    • How to Fix: Tune properties like spark.sql.shuffle.partitions. For jobs with large joins or aggregations, increasing this value can improve performance by creating more parallel tasks.
  12. Photon Engine Not Enabled: Missing out on a substantial performance boost.

    • Why it Occurs: Not selecting a Photon-enabled runtime for the cluster.
    • How to Fix: Use a Photon-enabled runtime whenever possible. Photon is Databricks' native C++ execution engine that can dramatically speed up Spark SQL and DataFrame operations with no code changes.
  13. Not Using Broadcast Hashes for Filters: Using a semi-join when an in clause with a broadcasted list would be faster.

    • Why it Occurs: A less-known optimization pattern.
    • How to Fix: For filtering a large DataFrame based on a small list of keys, collect the keys to the driver (only if the list is genuinely small!) and use an isin filter. For larger key sets, a broadcast join is the way to go.
  14. Choosing the Wrong Instance Types: Using general-purpose VMs when a memory-optimized or storage-optimized instance is needed.

    • Why it Occurs: Not analyzing the workload's resource profile.
    • How to Fix: Use Ganglia or cloud monitoring tools to see if your jobs are CPU-bound or memory-bound. Choose instance types that match the need. For example, jobs with large shuffles and caches benefit from memory-optimized instances.
  15. Spilling to Disk: Spark runs out of memory and starts spilling data to disk, killing performance.

    • Why it Occurs: Under-provisioned memory or a highly inefficient operation (like a cross-join).
    • How to Fix: Find the cause in the Spark UI (look for "spill" metrics). Increase executor memory, refactor the code to be more memory-efficient, or provision more executors to distribute the memory pressure.

Category 6: Orchestration and Scheduling Issues

A pipeline isn't just one job; it's a collection of dependent tasks that must run reliably, be monitored, and handle failures.

  1. Trying to Replicate DataStage Sequencer Logic in a Single Notebook: Creating a tangled mess of conditional logic inside one large script.

    • Why it Occurs: A direct translation of the sequencer concept.
    • How to Fix: Use Databricks Workflows. Define each job as a separate task and use the UI or API to set up dependencies, creating a clear, maintainable Directed Acyclic Graph (DAG).
  2. Poor Error Handling and Retry Logic: A transient network blip causes the entire multi-hour pipeline to fail and start from scratch.

    • Why it Occurs: Not configuring task-level retries.
    • How to Fix: In Databricks Workflows, configure retry policies for tasks that might be prone to transient failures (e.g., connecting to an external API). Make your jobs idempotent so they can be safely re-run.
  3. Ignoring Job Concurrency Limits: Kicking off hundreds of jobs at once, overwhelming the source systems or the Databricks workspace.

    • Why it Occurs: Not understanding the "Max concurrent runs" setting.
    • How to Fix: Set appropriate concurrency limits on your jobs to prevent them from "thundering herd" issues. Use a centralized scheduler like Airflow if you have complex cross-workspace or cross-platform dependencies.
  4. Lack of Centralized Monitoring and Alerting: Finding out a critical daily job failed six hours later from an angry business user.

    • Why it Occurs: Relying on manually checking job run statuses.
    • How to Fix: Configure email or webhook notifications in Databricks Workflows for job failures, successes, or even long run times. Integrate with tools like PagerDuty or Slack for real-time alerting.
  5. Passing Data Between Tasks Inefficiently: Writing a large intermediate DataFrame to storage only to have the next task immediately read it back.

    • Why it Occurs: Treating each task as a completely isolated black box.
    • How to Fix: For small pieces of information (like a row count or a max date), use Task Values to pass data between workflow tasks. For large datasets, writing to storage is correct, but ensure it's in an efficient format (Delta).
  6. Hardcoding Cluster Definitions: Each job definition includes a full, manually configured cluster spec.

    • Why it Occurs: It's the most straightforward way in the UI.
    • How to Fix: Use Cluster Policies to define a set of standardized T-shirt sizes (e.g., "small-etl," "large-ml"). This enforces standards, controls costs, and simplifies job configuration. A job just needs to request a policy, not a full spec.
  7. Not Using a CI/CD Process: Developers manually copy-pasting notebook code from dev to prod.

    • Why it Occurs: Treating notebooks as one-off scripts instead of production code.
    • How to Fix: Integrate your Databricks workspace with Git (GitHub, Azure DevOps, etc.). Use Databricks Asset Bundles or dbx to define your project structure and deployment process. Automate the deployment of notebooks, jobs, and libraries through a CI/CD pipeline (e.g., GitHub Actions).
  8. Inability to Handle Conditional Execution: Not knowing how to run Task B only if Task A succeeds with a specific condition.

    • Why it Occurs: Workflows UI seems to only support simple success/failure dependencies.
    • How to Fix: Use Task Values. Task A can output a status (e.g., "NO_FILES_TO_PROCESS," "SUCCESS"). The workflow can then use an If/else condition task to check this value and decide which downstream task to run next.
  9. Mixing Job Clusters and All-Purpose Clusters: Using an expensive, always-on All-Purpose cluster for a scheduled batch job.

    • Why it Occurs: Convenience. The developer was already using that cluster for interactive work.
    • How to Fix: Always use Job Clusters for automated workloads. They are created just for the life of the job and are significantly cheaper. All-Purpose clusters are for interactive analysis and development.
  10. Scheduler Time Zone Confusion: A job scheduled for midnight runs at the wrong time.

    • Why it Occurs: The scheduler's time zone setting doesn't match the developer's expectation.
    • How to Fix: Be explicit about the time zone in the job schedule definition. Use a standard like UTC to avoid ambiguity.
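Several of the fixes above (explicit dependencies, task-level retries, job clusters, concurrency limits, failure notifications, and an explicit time zone) come together in a single Jobs API 2.1 job definition. The sketch below builds one in Python; the notebook paths, cluster sizing, and email address are illustrative, not prescriptive.

```python
# Sketch of a Databricks Jobs API 2.1 job definition. Paths, node types,
# and the notification address are illustrative placeholders.
job_spec = {
    "name": "daily_sales_etl",
    "max_concurrent_runs": 1,  # avoid overlapping runs hammering source systems
    "tasks": [
        {
            "task_key": "extract",
            "notebook_task": {"notebook_path": "/Repos/etl/extract"},
            "job_cluster_key": "etl_cluster",
            "max_retries": 2,                    # absorb transient failures
            "min_retry_interval_millis": 60_000, # wait a minute between attempts
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "extract"}],  # explicit DAG edge
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "job_cluster_key": "etl_cluster",
        },
    ],
    "job_clusters": [  # ephemeral job cluster, cheaper than all-purpose
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 0 * * ?",  # daily at midnight...
        "timezone_id": "UTC",                     # ...explicitly in UTC
        "pause_status": "UNPAUSED",
    },
    "email_notifications": {"on_failure": ["data-team@example.com"]},
}
# POST this as JSON to /api/2.1/jobs/create, or manage it declaratively
# with Databricks Asset Bundles in your CI/CD pipeline.
```

Keeping the definition as data like this (rather than clicking it together in the UI) is what makes the CI/CD practices in this section possible.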

Category 7: Data Validation and Quality Control Gaps

The new pipeline runs, but is the data correct? Without rigorous validation, you've just built a fast way to produce wrong answers.

  1. "Eyeballing" Data Instead of Automating Validation: Assuming the data is correct because the job didn't fail.

    • Why it Occurs: Manual checks seem faster for a one-off migration.
    • How to Fix: Write automated data validation tests. This is non-negotiable. Compare row counts, sum of key numeric fields, min/max dates, and NULL counts between the DataStage target and the Databricks target.
  2. Not Having a Data Reconciliation Strategy: Failing to prove bit-for-bit that the output of the new pipeline matches the old one.

    • Why it Occurs: Believing that if the logic is translated correctly, the data must be correct.
    • How to Fix: Create a separate validation process. Extract data from both the old and new target tables into a neutral format (like Parquet). Load them into two DataFrames in Databricks and perform an exceptAll or a full outer join to find the exact rows and columns that differ.
  3. Ignoring Data Quality Rules: Migrating the ETL but not the implicit or explicit data quality checks within the DataStage job.

    • Why it Occurs: A narrow focus on just moving the data.
    • How to Fix: Implement data quality rules directly in your pipeline using filter or CASE statements to flag/reject bad data. For more robust needs, use Delta Lake table constraints or integrate a library like Great Expectations.
  4. No Plan for Handling Bad Records: A single malformed record in a billion-row file causes the entire job to fail.

    • Why it Occurs: Brittle parsing logic.
    • How to Fix: Use permissive parsing modes. For example, when reading CSV or JSON, use the PERMISSIVE mode. This will put malformed records into a separate "corrupt records" column, allowing the rest of the job to succeed. You can then inspect and handle the bad records separately.
  5. Forgetting Edge Case Testing: The migration works for the happy path but fails on the last day of the month, year-end, or when a source file is empty.

    • Why it Occurs: Testing only with "typical" data.
    • How to Fix: Be deliberate about creating test data that covers edge cases: empty files, files with only a header, data with special characters, leap years, etc.
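The aggregate checks above (row counts, sums of key numeric fields, NULL counts) can be automated with a small comparison helper. The `reconcile` function below is illustrative: in practice each metric value would come from an aggregate query against the legacy and migrated targets, and any rows behind a mismatch would then be isolated with `exceptAll` or a full outer join.

```python
# Illustrative reconciliation helper: compare aggregate metrics collected
# from the legacy (DataStage) target and the migrated (Databricks) target.
def reconcile(old_metrics, new_metrics, tolerance=0.0):
    """Return {metric: (old, new)} for every metric that is missing on one
    side or differs by more than the tolerance."""
    diffs = {}
    for name in sorted(old_metrics.keys() | new_metrics.keys()):
        old_v, new_v = old_metrics.get(name), new_metrics.get(name)
        if old_v is None or new_v is None or abs(old_v - new_v) > tolerance:
            diffs[name] = (old_v, new_v)
    return diffs

# Each value below would come from an aggregate query against each platform:
# a row count, a SUM over a key numeric column, a count of NULL keys, etc.
legacy = {"row_count": 1_000_000, "sum_amount": 123456.78, "null_keys": 0}
migrated = {"row_count": 1_000_000, "sum_amount": 123456.78, "null_keys": 3}
print(reconcile(legacy, migrated))  # → {'null_keys': (0, 3)}
```

Running this in an automated test after every migrated job turns "eyeballing" into a repeatable gate.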

Category 8: Security and Governance Oversights

Migrating to the cloud means adopting a new security model. Neglecting this can lead to data breaches and compliance failures.

  1. Using Overly Permissive IAM Roles: Granting s3:* or Storage Blob Data Contributor to a cluster that only needs to read from one bucket.

    • Why it Occurs: It's the fastest way to resolve a permissions error.
    • How to Fix: Follow the principle of least privilege. Create fine-grained IAM roles or Service Principal permissions that grant only the necessary access (e.g., s3:GetObject on my-bucket/raw/*).
  2. No Column- or Row-Level Security: Giving users access to an entire table when they should only see certain columns or rows.

    • Why it Occurs: The legacy security model was at the database or table level.
    • How to Fix: Use Databricks Unity Catalog to define column-level masks and row-level filters using standard SQL. This provides fine-grained, policy-based access control.
  3. Exposing the Databricks Workspace to the Public Internet: Not deploying the workspace within a secure virtual network.

    • Why it Occurs: A default, quick-start deployment.
    • How to Fix: Deploy your Databricks workspace with VNet injection (Azure) or into a customer-managed VPC (AWS). This places the compute plane, where your clusters run, inside your private network; pair it with secure cluster connectivity so cluster nodes have no public IPs. (The control plane remains in Databricks' account.)
  4. Lack of an Auditing Trail: Not being able to answer "who accessed this table and when?"

    • Why it Occurs: Assuming logging is enabled and sufficient by default.
    • How to Fix: Ensure audit logs are enabled and configured to be delivered to a secure, long-term storage location. Unity Catalog provides detailed auditing of data access out-of-the-box.
  5. Forgetting to Manage Personal Access Tokens (PATs): Users creating long-lived, powerful tokens and storing them insecurely.

    • Why it Occurs: PATs are a convenient way to authenticate for APIs.
    • How to Fix: Use service principal or managed identity authentication for automated processes. Limit the use of PATs for users, enforce lifetime limits, and monitor their usage.
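To make the column- and row-level security fix above concrete: Unity Catalog row filters and column masks are ordinary SQL functions attached to a table. The sketch below prepares the statements as strings in Python; the catalog, schema, table, and group names are hypothetical, and the statements assume a Unity Catalog-enabled workspace where they would be executed via `spark.sql()`.

```python
# Unity Catalog row filter and column mask, expressed as SQL strings.
# Catalog/schema/table/group names are hypothetical examples.
row_filter_ddl = """
CREATE OR REPLACE FUNCTION main.governance.us_only(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('global_analysts'), TRUE, region = 'US')
"""
apply_row_filter = (
    "ALTER TABLE main.sales.orders "
    "SET ROW FILTER main.governance.us_only ON (region)"
)

mask_ddl = """
CREATE OR REPLACE FUNCTION main.governance.mask_ssn(ssn STRING)
RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('hr') THEN ssn ELSE '***-**-****' END
"""
apply_mask = (
    "ALTER TABLE main.hr.employees "
    "ALTER COLUMN ssn SET MASK main.governance.mask_ssn"
)

statements = [row_filter_ddl, apply_row_filter, mask_ddl, apply_mask]
# On a Unity Catalog-enabled cluster:
# for stmt in statements:
#     spark.sql(stmt)
```

Because the policy lives with the table rather than in each consuming pipeline, every query path (notebooks, SQL warehouses, BI tools) gets the same enforcement.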

Category 9: Cost Management and TCO Blunders

The cloud's pay-as-you-go model is a double-edged sword. Without proper management, costs can quickly spiral out of control.

  1. Runaway Cluster Costs: Developers leave large, all-purpose clusters running 24/7.

    • Why it Occurs: Forgetting to terminate clusters after interactive sessions.
    • How to Fix: Enforce mandatory auto-termination on all all-purpose clusters. Use cluster policies to limit the maximum size and uptime.
  2. Using On-Demand Instances for Everything: Not taking advantage of cheaper spot/preemptible instances.

    • Why it Occurs: On-demand is the default and perceived as more reliable.
    • How to Fix: For stateless, fault-tolerant batch workloads, use spot instances. Databricks can be configured to use a mix of on-demand and spot instances, automatically falling back to on-demand if spot capacity is unavailable. This can cut compute costs by 70% or more.
  3. Ignoring Storage Costs: Creating dozens of copies of large tables in different storage accounts.

    • Why it Occurs: Debugging, ad-hoc analysis, and lack of a central data catalog.
    • How to Fix: Use Delta Lake's shallow clone (zero-copy clone) feature. It creates a new table that references the original's data files instead of duplicating them, saving significant storage costs. Implement lifecycle policies on your cloud storage to automatically delete or archive old, unused data.
  4. No Cost Monitoring or Chargeback: The monthly cloud bill is a surprise, and you have no idea which team or project is responsible.

    • Why it Occurs: Lack of tagging and monitoring.
    • How to Fix: Enforce a tagging strategy. Tag clusters and jobs with a project name, team, and cost center. Use the Databricks usage reports or cloud provider cost management tools to analyze costs and implement chargeback.
  5. Over-provisioning IOPS on Cloud Storage: Paying for premium, high-performance disks when they aren't needed.

    • Why it Occurs: A "more is better" mindset.
    • How to Fix: Most Spark workloads are network or CPU-bound, not disk I/O-bound, especially when reading from object storage. Start with standard storage tiers and only upgrade if you identify a clear I/O bottleneck.
  6. Inefficient API Calls: Making millions of LIST or GET requests to cloud storage, incurring significant API charges.

    • Why it Occurs: A "small files" problem or an inefficient file discovery process.
    • How to Fix: Use Delta Lake and Auto Loader. Both are designed to minimize API calls to the underlying storage, reducing costs and improving performance.
  7. Not Utilizing Reserved Instances or Savings Plans: Paying on-demand prices for steady-state, predictable workloads.

    • Why it Occurs: Lack of long-term capacity planning.
    • How to Fix: Analyze your baseline, 24/7 compute usage. Purchase reserved capacity (AWS Reserved Instances, Azure Reservations) or Savings Plans for that baseline to get a significant discount over on-demand pricing.
  8. Forgetting about Network Egress Costs: Moving large amounts of data between cloud regions or out to the internet.

    • Why it Occurs: Not considering data locality in the architecture.
    • How to Fix: Keep your compute (Databricks) and storage in the same cloud region. Be mindful of any jobs that pull large datasets from other regions or send data to external on-premises systems.
  9. Using the Wrong Databricks Pricing Tier: Using the Premium tier when the Standard tier would suffice for a given workspace.

    • Why it Occurs: Not aligning features with needs.
    • How to Fix: Understand the differences. If you don't need role-based access control, audit logs, or credential passthrough in a specific dev/sandbox workspace, the Standard tier may be more cost-effective. Use the Premium/Enterprise tier for production and governance-heavy environments.
  10. Not Cleaning Up Failed Job Runs: Failed jobs can leave behind orphaned clusters or resources.

    • Why it Occurs: Incomplete cleanup scripts or workflow logic.
    • How to Fix: Ensure your orchestration and deployment scripts are robust enough to terminate all associated resources, even in the event of a failure.
  11. Ignoring Model Serving Costs: Deploying an ML model on a GPU-enabled, always-on endpoint that gets little traffic.

    • Why it Occurs: A focus on model performance, not cost-efficiency.
    • How to Fix: For models with intermittent traffic, use serverless model serving that can scale to zero. This avoids paying for idle compute.
  12. Underestimating the Cost of "SELECT *": A user runs a SELECT * on a petabyte-scale table in an interactive notebook, scanning terabytes of data.

    • Why it Occurs: Lack of user education and guardrails.
    • How to Fix: Educate users on cost implications. Encourage the use of LIMIT clauses and filtering. Use the query history and Unity Catalog system tables to identify costly queries.
  13. Data Duplication Across Bronze/Silver/Gold Layers: Physically copying all data at each stage of the Medallion architecture.

    • Why it Occurs: A logical separation is implemented as a physical separation.
    • How to Fix: Use Delta Lake. You can often perform transformations in-memory and write directly to the next layer. For large, mostly-static tables, consider using views or shallow (zero-copy) clones to create silver/gold versions without duplicating petabytes of storage.
  14. Not Cleaning Up Old Delta Files: Unreferenced data files and _delta_log entries accumulate indefinitely, increasing storage costs and slowing down queries.

    • Why it Occurs: Not running routine maintenance.
    • How to Fix: Run the VACUUM command on your Delta tables to remove old, unreferenced data files that fall outside the retention period (default 7 days). Old _delta_log entries are cleaned up automatically based on the table's delta.logRetentionDuration setting (default 30 days).
  15. Ignoring the Human Cost: Burning out your best engineers on a frustrating, poorly planned manual migration.

    • Why it Occurs: Viewing migration as a brute-force coding exercise instead of a strategic initiative.
    • How to Fix: Invest in the right tools (automation, discovery), training, and partnerships. A well-planned, tool-assisted migration with clear goals keeps your team engaged, productive, and focused on delivering value, which is the most important ROI of all.
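Several of the cost controls in this category (mandatory auto-termination, spot instances with on-demand fallback, chargeback tags) are simply fields on the cluster specification, which cluster policies can then enforce. A minimal sketch for an AWS workspace follows; the runtime version, node type, worker count, and tag values are illustrative.

```python
# Sketch of a cost-conscious Databricks cluster spec (AWS). Values are
# illustrative; a Cluster Policy would enforce these fields fleet-wide.
cluster_spec = {
    "cluster_name": "nightly-batch",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "autotermination_minutes": 30,  # never pay for an idle cluster
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on on-demand
        "availability": "SPOT_WITH_FALLBACK",  # spot workers, fall back if
                                               # spot capacity is unavailable
    },
    "custom_tags": {  # enables per-team cost reporting and chargeback
        "team": "data-eng",
        "project": "ds-migration",
        "cost_center": "cc-1234",
    },
}
```

Baking these defaults into a Cluster Policy, rather than trusting each job author to set them, is what turns the checklist above into an enforced standard.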

Conclusion: From a List of Errors to a Blueprint for Success

This extensive list might seem daunting, but it's not meant to discourage. It's a roadmap—a collection of signposts pointing to where the path gets rough. Every error listed is an opportunity for learning and improvement.

A successful DataStage to Databricks migration is not about a 1:1 translation. It’s about re-imagining your data pipelines to take full advantage of the power and elasticity of the cloud and Apache Spark. It requires a shift in mindset, a commitment to new best practices, and a healthy respect for the differences between the old world and the new.

By anticipating these challenges, you can move from a reactive, problem-solving mode to a proactive, strategic one. You can build pipelines in Databricks that are not only faster and more scalable but also more robust, maintainable, and cost-effective than their predecessors.

Migration Best Practices: A Summary

If you take nothing else away, remember these key principles for a successful migration:

  • Plan, Pilot, and Phase: Don't boil the ocean. Start with a representative pilot project to establish patterns and build momentum.
  • Think in Spark, Not DataStage: Embrace declarative, DataFrame-based transformations. Avoid UDFs and row-by-row processing.
  • Automate Where Possible, Manually Optimize Where Necessary: Use code conversion tools to handle the bulk of the work, but dedicate your expert engineers to redesigning complex logic and tuning for performance.
  • Embrace the Lakehouse: Use Delta Lake from day one. Its features—ACID transactions, schema evolution, time travel, and optimization commands (OPTIMIZE, ZORDER, VACUUM)—will solve dozens of the problems on this list.
  • Govern from the Start: Plan for security and governance with Unity Catalog. Don't treat it as an afterthought.
  • Test, Reconcile, and Validate: Trust but verify. Automated data validation is the only way to guarantee a successful migration.
  • CI/CD is Non-Negotiable: Treat your ETL code like any other production software. Use Git, automated testing, and deployment pipelines.
  • Monitor Everything: Keep a close eye on performance with the Spark UI and on costs with cloud and Databricks monitoring tools. Tag everything.

The journey from DataStage to Databricks is challenging, but the destination—a modern, scalable, and unified platform for all your data and AI needs—is well worth the effort. Good luck, and happy migrating!
