101 Common Mistakes to Avoid During DataStage to Databricks Migration

Published on: January 05, 2026 10:52 AM

I’ve been in the data trenches for over two decades. I spent the first half of my career building, debugging, and scaling massive DataStage environments. I knew its quirks, its power, and its limitations like the back of my hand. For the last ten years, I've been leading the charge to migrate these same behemoth systems to Databricks.

I’ve seen the PowerPoints. I’ve heard the promises of "one-click conversion" and "seamless lift-and-shift." And I’ve been the one getting the 2 a.m. call when those promises shatter against the reality of a production failure.

This isn't a theoretical exercise. This is a collection of 101 scars, lessons learned from production outages, budget overruns, and frustrated teams. If you’re a developer, an architect, a program manager, or a CXO about to embark on this journey, my hope is that this list saves you from some of the pain we went through. Read this, print it, and tape it to the wall of your war room.


Section 1: CXO & Leadership Missteps

The biggest fires often start with small sparks of misunderstanding in the boardroom. Get this level wrong, and the project is handicapped from day one.

1. The "Lift-and-Shift" Fantasy
* The Mistake: Believing you can simply move DataStage jobs to Databricks like moving files between folders.
* Why It Happens: It’s a seductive, simple narrative for non-technical stakeholders. It implies speed and low cost.
* Impact: Massive underestimation of effort, leading to blown budgets and timelines. The resulting code is inefficient, unmaintainable, and fails to leverage any of Databricks' strengths.
* Avoidance: Frame the project as "Modernization and Re-platforming," not "Lift-and-Shift." Emphasize that the goal is to reimagine pipelines using cloud-native patterns, not just emulate an on-prem tool in the cloud.

2. Treating It as a Purely IT Project
* The Mistake: Running the migration without deep, continuous involvement from the business data owners.
* Why It Happens: It's seen as a "backend" technology swap. "The business reports will look the same, right?"
* Impact: Missed requirements, incorrect logic translation, and a total loss of business trust when a "validated" pipeline produces wrong numbers for a critical KPI.
* Avoidance: Embed business analysts and data stewards in the migration pods. Mandate their sign-off on data validation results for every single pipeline.

3. Falling for the "100% Automated Conversion" Pitch
* The Mistake: Buying into a tool or service that promises to automatically convert all DataStage jobs to production-ready Databricks code.
* Why It Happens: The desire for a silver bullet is powerful. Vendors know this and exploit it.
* Impact: You get syntactically correct but semantically flawed, inefficient, and often unreadable "Spark-flavored COBOL." The cost of refactoring this generated code often exceeds the cost of a thoughtful, manual rewrite.
* Avoidance: Use conversion tools as discovery accelerators, not as code generators. Their value is in parsing complex jobs to create a "bill of materials" for the migration, not in writing your production code. Budget for manual, expert-led development.

4. Underestimating the "People" Problem
* The Mistake: Assuming your expert DataStage developers will magically become expert Databricks/Spark engineers in a two-week training course.
* Why It Happens: A desire to keep the existing team intact without appreciating the fundamental paradigm shift from GUI-based ETL to code-first software engineering.
* Impact: Low morale, slow development velocity, poor quality code, and attrition of key personnel who feel left behind or overwhelmed.
* Avoidance: Create a realistic upskilling plan that includes structured learning, pair programming with experienced Spark developers, rigorous code reviews, and time for experimentation. Acknowledge that not everyone will make the leap.

5. No Clear Definition of "Done"
* The Mistake: Starting the migration without a crystal-clear, business-approved definition of success for each migrated pipeline.
* Why It Happens: Teams rush into the "doing" without defining the "what."
* Impact: Endless arguments about whether a job is "good enough." The project becomes a zombie, never truly finishing as the goalposts constantly shift.
* Avoidance: For each pipeline, "Done" must mean: code is in Git, CI/CD is working, data is fully reconciled with the source job, performance meets SLA, costs are within budget, security and governance are applied, and the business has signed off.

6. Focusing on Cost Reduction as the Only Driver
* The Mistake: Justifying the migration solely on the promise of decommissioning DataStage and saving on licensing.
* Why It Happens: It’s the easiest business case to sell.
* Impact: Missed opportunities. You fail to leverage Databricks for new capabilities like AI/ML, real-time analytics, and data sharing that could generate immense business value far exceeding the infrastructure savings.
* Avoidance: Build a business case based on value creation and agility first, with cost savings as a secondary benefit. Highlight what new things the business can do with a modern data platform.


Section 2: Discovery & Assessment Mistakes

You can't migrate what you don't understand. A shallow discovery phase is the most common predictor of project failure.

7. Just Counting Jobs
* The Mistake: Estimating effort based on the number of DataStage jobs. "We have 2,000 jobs, so at X hours per job..."
* Why It Happens: It's a simple but dangerously misleading metric.
* Impact: A "simple" job with 50 stages and complex Transformers is a hundred times more complex than a 2-stage "copy" job. This leads to completely inaccurate planning and resource allocation.
* Avoidance: Classify jobs by complexity patterns: simple pass-through, medium-complexity lookups/aggregates, high-complexity transformers/routines, and "monster" jobs that need a full rewrite. Use stage count, type of stages, and custom code as complexity drivers.

8. Ignoring Shared Containers and Routines
* The Mistake: Focusing only on the *.dsx job files and missing the shared components they rely on.
* Why It Happens: These are often stored in different projects or libraries and are not immediately visible when looking at a single job.
* Impact: You migrate a job, and it fails because a critical piece of shared business logic is missing. This causes constant rework and delays.
* Avoidance: Perform a full dependency analysis before migration begins. Identify all shared containers, routines (BASIC, C++), and function libraries. Plan to migrate these as shared, modular Python/Scala libraries in Databricks.

9. Forgetting About Parameter Sets
* The Mistake: Analyzing the default parameters of a job without considering the dozens of Parameter Sets it might be invoked with.
* Why It Happens: Parameter Sets are runtime configurations, not design-time artifacts.
* Impact: You build a pipeline that only works for one environment or one data source. It breaks in production when a different parameter set is used to point it at a different database or file path.
* Avoidance: Extract all Parameter Sets and analyze their usage in the job sequences. This defines the required configuration variables for your Databricks workflow.

10. Not Analyzing the Orchestration Layer (Sequences)
* The Mistake: Migrating individual jobs without understanding how they are chained together in DataStage Sequences.
* Why It Happens: It's easier to look at one job at a time. Analyzing complex sequences with loops, conditions, and error handling is hard.
* Impact: You lose the master orchestration logic. Jobs run in the wrong order, dependencies are missed, and the end-to-end business process fails.
* Avoidance: Treat the DataStage Sequence as the primary unit of migration, not the individual job. Map the entire sequence to a single Databricks Workflow, preserving the dependencies, conditional paths, and error handling.

11. Missing External Dependencies
* The Mistake: Only looking at what's inside DataStage and ignoring the ecosystem around it.
* Why It Happens: Teams are focused on the ETL tool, not the full system.
* Impact: A job fails because a pre-processing shell script wasn't run, an expected trigger file never arrived, or a downstream application couldn't find its output file.
* Avoidance: Analyze before/after-job subroutines, ExecSH stages, and the scripts that invoke dsjob commands. Catalog every external script, FTP/SFTP dependency, trigger file, and external scheduler (like Control-M or Autosys).

12. Relying on Outdated Documentation
* The Mistake: Trusting the Confluence page from 2008 that claims to describe how a job works.
* Why It Happens: Wishful thinking and pressure to move fast.
* Impact: The documentation is wrong. The code is the only truth. You waste weeks building to an incorrect spec.
* Avoidance: Declare a "documentation amnesty." The code and the sequence are the source of truth. Use discovery tools to parse the actual code, not what someone wrote about it a decade ago.

13. Ignoring Data Characteristics
* The Mistake: Assessing job complexity without assessing the data it processes.
* Why It Happens: Access to production data can be difficult. It's easier to just look at the code.
* Impact: A simple "sort and aggregate" job becomes a performance nightmare in Spark because the source data has extreme skew, or the data volume is 1000x larger than in the dev environment.
* Avoidance: Profile the production data. Get metrics on row counts, data volumes, key cardinality, and data skew for your most critical and largest jobs. This information is more important for performance tuning than the code itself.


Section 3: DataStage Job & Stage Misinterpretation

This is where the junior developers burn the project budget. DataStage has decades of specific behaviors that are easy to misinterpret.

14. The Transformer Is Not Just a "SELECT"
* The Mistake: Looking at a Transformer stage and assuming it's a simple set of column derivations that can be mapped 1-to-1 to a SELECT statement in Spark SQL.
* Why It Happens: A surface-level glance makes it look that way.
* Impact: You miss stage variables (which act as procedural code), complex IF-THEN-ELSE nesting, constraints, and loop-over-group logic. The resulting SQL is incorrect and produces the wrong data.
* Avoidance: Treat every non-trivial Transformer as a mini-application. Deconstruct its components: Stage Variables, Constraints, Derivations, and Loop Variables. Map them methodically, often requiring a mix of Spark SQL and procedural DataFrame operations.

15. Misunderstanding Stage Variables
* The Mistake: Ignoring the order of execution and stateful nature of stage variables within a Transformer.
* Why It Happens: It’s a procedural concept in a declarative-looking UI.
* Impact: Calculations are performed in the wrong order, leading to subtly incorrect results that are incredibly difficult to debug.
* Avoidance: Document the exact order of stage variable calculations. In Spark, this often translates to a series of withColumn transformations, where the order is critical. For more complex stateful logic (e.g., comparing to a previous row), you must use Window functions (lag, lead).
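Here's a minimal sketch of that translation, assuming a hypothetical Transformer whose stage variables build on each other and compare each row to the previous one per customer (column names are illustrative):

```python
from pyspark.sql import functions as F, Window

# Order matters: each withColumn mirrors one stage variable derivation, in sequence.
w = Window.partitionBy("customer_id").orderBy("txn_ts")

df_out = (
    df_in
    # svAmountClean: the first stage variable feeds the next one
    .withColumn("amount_clean", F.coalesce(F.col("amount"), F.lit(0)))
    # svIsLarge: depends on amount_clean, so it must be derived after it
    .withColumn("is_large", F.col("amount_clean") > 10000)
    # svPrevAmount: "previous row" state becomes a Window lag()
    .withColumn("prev_amount", F.lag("amount_clean").over(w))
    .withColumn("amount_delta",
                F.col("amount_clean") - F.coalesce(F.col("prev_amount"), F.lit(0)))
)
```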

16. Incorrectly Handling Nulls
* The Mistake: Assuming DataStage's null handling and Spark's null handling are identical.
* Why It Happens: null seems like a simple concept. It's not.
* Impact: Joins that worked in DataStage (where null might equal null in some contexts) suddenly drop records in Spark (where null != null). Function calls fail, and aggregations produce different results.
* Avoidance: Explicitly test and define the behavior for nulls. Use null-safe operators in Spark (<=>), and functions like coalesce, isnull, isnotnull to explicitly manage nulls instead of relying on default behavior. This is a huge "gotcha."
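Here's a short illustration of the gotcha, with hypothetical keys; the point is to make null behavior explicit rather than inherited:

```python
from pyspark.sql import functions as F

# Plain equality: null never equals null, so rows with null keys fall out of an inner join.
joined_strict = orders.join(customers, orders.cust_key == customers.cust_key, "inner")

# Null-safe equality (<=>): null == null is a match, closer to how some DataStage lookups behaved.
joined_null_safe = orders.join(
    customers, orders.cust_key.eqNullSafe(customers.cust_key), "inner"
)

# Make defaults explicit in derivations instead of hoping the engine agrees with DataStage.
orders_clean = orders.withColumn("discount_pct", F.coalesce(F.col("discount_pct"), F.lit(0.0)))
```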

17. Ignoring "Reject" Links
* The Mistake: Only migrating the main data flow and ignoring the reject links coming out of stages like Lookups, Joins, or Database connectors.
* Why It Happens: They are often seen as "error handling" and deprioritized.
* Impact: In DataStage, reject links are often part of the business logic (e.g., "unmatched records go here for separate processing"). By ignoring them, you lose a critical part of the output, causing data loss and silent failures.
* Avoidance: Treat every reject link as a required output stream. In Spark, this means implementing an anti-join or filtering for records that failed a lookup/join condition and writing them to a separate Delta table or path.
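A minimal sketch of the pattern, assuming a lookup against a product reference table on a hypothetical product_id key:

```python
# Main flow: records that matched the reference lookup.
matched = sales.join(products, on="product_id", how="inner")

# The "reject link": records that failed the lookup, kept as their own output stream.
rejected = sales.join(products, on="product_id", how="left_anti")

matched.write.format("delta").mode("append").saveAsTable("silver.sales_enriched")
rejected.write.format("delta").mode("append").saveAsTable("silver.sales_lookup_rejects")
```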

18. Mishandling Before/After-Job Subroutines
* The Mistake: Seeing these as simple logging or cleanup tasks and ignoring the business logic they might contain.
* Why It Happens: They are text-based scripts, not visual stages.
* Impact: A job runs successfully, but the entire process fails because a critical downstream trigger wasn't created, or a control table wasn't updated by the "after-job" routine.
* Avoidance: Manually review every single before/after subroutine. They often contain critical orchestration logic, control table updates, or environment setup that must be replicated in your Databricks Workflow or a dedicated task.

19. Forgetting DataStage's Implicit Type Casting
* The Mistake: Assuming data types will behave the same way.
* Why It Happens: DataStage is very lenient with type casting (e.g., treating ' 123 ' as a number). Spark is much stricter.
* Impact: Jobs fail with CastException errors in Spark that never occurred in DataStage.
* Avoidance: Be explicit. Add explicit casting and trimming (trim(), cast('integer')) in your Spark code. Do not rely on implicit conversion. This is a common issue when reading from flat files.
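A small example of doing explicitly what DataStage did implicitly, with hypothetical columns from a flat-file load:

```python
from pyspark.sql import functions as F

raw = spark.read.option("header", True).csv("/landing/customers/")  # everything arrives as string

clean = (
    raw
    # DataStage would quietly treat ' 123 ' as a number; Spark will not (it errors or
    # nulls out, depending on ANSI settings).
    .withColumn("customer_id", F.trim(F.col("customer_id")).cast("int"))
    .withColumn("credit_limit", F.trim(F.col("credit_limit")).cast("decimal(18,2)"))
    .withColumn("signup_date", F.to_date(F.trim(F.col("signup_date")), "yyyyMMdd"))
)
```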

20. Misinterpreting the Aggregator Stage
* The Mistake: Doing a simple GROUP BY in Spark SQL to replicate an Aggregator stage.
* Why It Happens: It seems like a direct equivalent.
* Impact: You miss the subtle but critical "Method" property of the Aggregator, which could be "Hash" or "Sort." More importantly, you might miss complex aggregation logic embedded in BASIC routines.
* Avoidance: Analyze the Aggregator's properties. A "Sort" method implies pre-sorted data, which might be an optimization to carry over. For complex aggregations, you may need to implement a UDAF (User-Defined Aggregate Function) in Spark, though this should be a last resort.


Section 4: Incorrect Mapping of DataStage Stages to Databricks Patterns

This is the architectural core of the migration. Bad patterns created here will haunt you for years with poor performance, high costs, and technical debt.

21. The "One-Job-to-One-Notebook" Fallacy
* The Mistake: Insisting that every DataStage job must become a single Databricks notebook or script.
* Why It Happens: It seems logical and easy to track.
* Impact: You create monstrous, unmaintainable notebooks that replicate the "spaghetti" nature of complex DataStage jobs. You lose the opportunity to consolidate, simplify, and modularize logic.
* Avoidance: Think in terms of data transformations, not jobs. A series of 5 DataStage jobs that perform staging might become a single, elegant DLT (Delta Live Tables) pipeline. A complex job might be broken into several smaller, dependent tasks in a Databricks Workflow for clarity and easier reruns.

22. Replicating Stage-by-Stage Execution
* The Mistake: Writing Spark code that materializes the result of every single DataStage stage to storage before reading it back in for the next step. df1.write.parquet(...), df2 = spark.read.parquet(...).
* Why It Happens: It's an attempt to emulate DataStage's visual flow and provides debug points.
* Impact: Catastrophic performance degradation and high storage I/O costs. This completely defeats the purpose of Spark's in-memory processing and lazy evaluation.
* Avoidance: Chain DataFrame transformations together. Let Spark's Catalyst optimizer figure out the most efficient physical execution plan. Only write data to storage when you need to create a persistent layer (e.g., bronze/silver tables) or as a strategic checkpoint for very long, complex jobs.
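Here's the contrast in a sketch, with placeholder table names; the transformations stay lazy and only the layer boundary is written out:

```python
from pyspark.sql import functions as F

# Anti-pattern: write/read between every "stage", e.g.
#   df1.write.parquet("/tmp/stage1"); df1 = spark.read.parquet("/tmp/stage1"); ...

# Better: chain the transformations and let Catalyst plan the whole flow at once.
result = (
    spark.read.table("bronze.orders")
    .filter("order_status = 'COMPLETE'")
    .join(spark.read.table("silver.customers"), "customer_id")
    .groupBy("region", "order_month")
    .agg(F.sum("order_total").alias("total_sales"))
)

# Persist only at a meaningful layer boundary.
result.write.format("delta").mode("overwrite").saveAsTable("gold.sales_by_region")
```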

23. Overusing PySpark UDFs
* The Mistake: When faced with any complex column-level logic from a Transformer, the default reflex is to write a Python UDF (User-Defined Function).
* Why It Happens: It feels like a direct translation of a DataStage function or routine. It's easy for Python developers to write.
* Impact: Massive performance penalty. Standard UDFs break Spark's ability to optimize. The data has to be serialized from the JVM to a Python interpreter and back for every single row, destroying performance.
* Avoidance: A UDF is your last resort. 99% of what you need can be accomplished with built-in Spark SQL functions or by combining them. For more complex scenarios, investigate higher-order functions (transform, filter), Pandas UDFs, or refactoring the logic to operate on the DataFrame as a whole.
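An illustrative before-and-after, assuming a hypothetical routine that strips non-digits from phone numbers; the built-in version keeps the work inside the engine instead of round-tripping every row through Python:

```python
from pyspark.sql import functions as F

# Reflex translation: a row-at-a-time Python UDF (blocks Catalyst/Photon, serializes every row).
@F.udf("string")
def clean_phone_udf(p):
    return None if p is None else "".join(ch for ch in p if ch.isdigit())

df_slow = df.withColumn("phone_clean", clean_phone_udf(F.col("phone")))

# Preferred: the same cleanup with a built-in function, fully optimizable.
df_fast = df.withColumn("phone_clean", F.regexp_replace(F.col("phone"), "[^0-9]", ""))
```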

24. Not Using the Medallion Architecture
* The Mistake: Migrating jobs and just dumping the final output into a target directory, replicating the old point-to-point ETL chaos.
* Why It Happens: Pressure to "just get the job working" without thinking about the platform's structure.
* Impact: You build a data swamp, not a data lakehouse. There's no single source of truth, no data lineage, and no reusability.
* Avoidance: Embrace the Medallion Architecture from Day 1. Every pipeline should read from a source layer (Bronze, raw), apply cleansing and business rules to create an integrated layer (Silver), and then aggregate for consumption (Gold). This structure is non-negotiable for a successful Databricks implementation.

25. Ignoring Delta Live Tables (DLT)
* The Mistake: Manually building all the boilerplate for data quality checks, dependency management, and incremental processing that DLT provides for free.
* Why It Happens: Teams are used to the old way of building everything from scratch. DLT can feel like "magic" they don't control.
* Impact: You spend 80% of your time on plumbing (orchestration, error handling, schema evolution) and 20% on business logic. DLT flips that ratio.
* Avoidance: For all new and migrated pipelines, evaluate DLT first. Its declarative nature, built-in data quality expectations, and automated orchestration are exactly what's needed to move faster and build more reliable pipelines. It's a direct answer to the brittleness of complex DataStage sequences.

26. Rebuilding Shared Containers as Monolithic Functions
* The Mistake: Taking a giant DataStage shared container and converting it into one massive Python function with hundreds of lines of code.
* Why It Happens: It's a direct, literal translation.
* Impact: The code is untestable, unreadable, and impossible to debug or reuse in different contexts.
* Avoidance: Decompose shared containers into a library of smaller, pure, testable functions. Package these as a Python wheel and attach it to your clusters. This promotes software engineering best practices and dramatically improves maintainability.

27. Choosing the Wrong Table Format
* The Mistake: Defaulting to writing data as Parquet or CSV files instead of using Delta Lake.
* Why It Happens: Old habits from the Hadoop world or a misunderstanding of what Delta provides.
* Impact: No ACID transactions, no time travel, no MERGE (upsert) capability, poor performance from small files with no OPTIMIZE compaction, and no schema evolution. You've built a data lake, not a lakehouse.
* Avoidance: Use Delta Lake for everything. There is no good reason to use raw Parquet for a managed table in Databricks today. Period.

28. Not Designing for Idempotency
* The Mistake: Building jobs that, if run twice with the same input, produce duplicate data or incorrect results.
* Why It Happens: DataStage jobs were often run in tightly controlled batches where reruns were a manual, painful process. In the cloud, automated retries are common.
* Impact: A simple network blip causes a workflow to retry, leading to duplicated or corrupted data in your gold tables that poisons business reports.
* Avoidance: Design every job to be idempotent. Use MERGE INTO statements for your writes. Use overwriteSchema and partition overwrites. Structure your jobs so you can run them 100 times and the final state of the target table will be correct.
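A minimal sketch of an idempotent write using the Delta Lake Python API, assuming a hypothetical orders feed keyed by order_id; run it twice with the same input and the target ends up in the same state:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.orders")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```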


Section 5: Parallelism & Performance Assumptions

This is where you either save millions or burn through your cloud budget with nothing to show for it. Spark's parallelism isn't magic; it's a science.

29. Assuming Databricks Auto-Magically Handles Everything
* The Mistake: Believing the default Spark configuration is optimal for every workload.
* Why It Happens: Marketing materials often highlight the "auto-pilot" features.
* Impact: Out-of-memory errors, terrible performance, and sky-high costs due to inefficient resource usage.
* Avoidance: Use the Spark UI to understand your execution plans. Learn what a shuffle is, why it's bad, and how to avoid it. Understand concepts like partitioning, caching, and broadcast joins.

30. Ignoring Data Skew
* The Mistake: Performing a join or group-by on a key with a non-uniform distribution (e.g., a customer_id where '-1' represents "guest" and accounts for 90% of the data).
* Why It Happens: It works fine on small test data.
* Impact: 99 tasks in a Spark stage finish in seconds, while one task runs for hours, holding up the entire job. This is the single most common cause of "unexplained" performance issues.
* Avoidance: Profile your join/grouping keys. If skew is present, use techniques like salting (adding a random key to distribute the skewed value) or breaking the job into two parts (one for the skewed key, one for the rest) and unioning the results.
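Here's a rough sketch of salting, assuming the skewed key is customer_id and an illustrative salt factor of 16: the large side gets a random salt, and the small side is replicated across every salt value so each salted key still finds its match.

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16

# Large, skewed side: add a random salt per row.
facts_salted = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Small side: replicate once per salt value.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
dim_salted = dim.crossJoin(salts)

# The skewed key is now spread across SALT_BUCKETS partitions instead of one hot one.
joined = facts_salted.join(dim_salted, ["customer_id", "salt"]).drop("salt")
```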

31. Lifting and Shifting DataStage Partitioning Schemes
* The Mistake: Replicating the exact Hash/Round-Robin partitioning from a DataStage job in Spark.
* Why It Happens: A literal translation of the old design.
* Impact: DataStage partitioning is designed for a fixed number of on-prem nodes. Spark's partitioning is dynamic. You create unnecessary shuffles and prevent the Spark optimizer from doing its job.
* Avoidance: Let Spark handle most of the partitioning. The only time you should manually repartition is if you know something the optimizer doesn't (e.g., to align data for a series of subsequent joins on the same key). Use partitionBy on your Delta tables for efficient data pruning on reads.

32. Using .collect() in Production Code
* The Mistake: Calling .collect() on a large DataFrame to bring all the data from the distributed cluster nodes back to the driver node.
* Why It Happens: It's an easy way to see the data during development or to pass it to a Python library that doesn't understand Spark DataFrames.
* Impact: OutOfMemoryError on the driver. This is the cardinal sin of Spark programming. It breaks the entire distributed computing model.
* Avoidance: Never, ever use .collect() unless you are 100% certain the data fits comfortably in the driver's memory (e.g., a small lookup table). If you need to work with the data, either use a distributed method (Pandas UDF) or process it in its distributed form.

33. Choosing the Wrong Cluster Type
* The Mistake: Using All-Purpose clusters for scheduled production workflows.
* Why It Happens: It's what developers use for interactive development, so it's familiar.
* Impact: Dramatically higher costs. All-Purpose clusters are billed as long as they are on, and at a higher DBU rate.
* Avoidance: Use Job clusters for all automated workflows. They spin up for the job, run it, and terminate, meaning you only pay for what you use. Use cluster policies to enforce this.

34. Not Using Photon
* The Mistake: Running standard Spark and not enabling the Photon execution engine.
* Why It Happens: It's a checkbox that people forget to tick or don't understand.
* Impact: You are leaving a massive amount of free performance (and therefore cost savings) on the table. Photon is a native C++ vectorized engine that dramatically speeds up most Spark SQL and DataFrame operations.
* Avoidance: Enable Photon by default on all your clusters. It's rare that a workload runs slower on Photon, and the upside is huge.

35. Over-provisioning Clusters "Just in Case"
* The Mistake: Configuring a job cluster with 50 nodes because "it's a big job."
* Why It Happens: Fear of failure and a lack of understanding of how to right-size a cluster.
* Impact: Wasted cloud spend. You might be paying for 50 nodes when the job could run just as fast (or even faster, due to less shuffle overhead) on 10.
* Avoidance: Start small. Run the job, observe the Spark UI and Ganglia metrics. Is CPU maxed out? Are you spilling to disk? Is the network saturated? Scale up methodically based on evidence, not fear. Use Databricks' auto-scaling feature, but set reasonable min/max boundaries.

36. Ignoring Shuffle Partitions
* The Mistake: Leaving spark.sql.shuffle.partitions at its default value of 200 for every job.
* Why It Happens: It's a deep-level configuration that most people don't touch.
* Impact: For small jobs, you create 200 tiny partitions, adding unnecessary scheduling overhead. For massive jobs, 200 partitions may be too few, causing each partition to be too large and spill to disk, leading to OOM errors.
* Avoidance: Adjust spark.sql.shuffle.partitions based on your data size. A good rule of thumb is to aim for partition sizes of around 128-200MB. For a 1 TB shuffle at ~200 MB per partition, that's roughly 1,048,576 MB / 200 MB ≈ 5,000 partitions, not 200. This can be set per job, as shown below.
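A tiny example of setting it per job; the numbers are illustrative, so derive yours from the actual shuffle size in the Spark UI:

```python
# Large job: ~1 TB shuffle at ~200 MB per partition ≈ 5,000 partitions.
spark.conf.set("spark.sql.shuffle.partitions", 5000)

# Small job: far fewer partitions avoids pointless scheduling overhead.
# spark.conf.set("spark.sql.shuffle.partitions", 64)

# On recent runtimes, Adaptive Query Execution can coalesce small shuffle partitions
# automatically, but a sensible starting value still matters for big shuffles.
```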

37. Not Caching Intelligently
* The Mistake: Either never using .cache() or using it on every single DataFrame.
* Why It Happens: A misunderstanding of when Spark's lazy evaluation triggers re-computation.
* Impact: No caching means a DataFrame that is used multiple times (e.g., for a join and then an aggregation) is re-computed from scratch each time, wasting resources. Over-caching fills up memory with data that's only used once, starving more important operations.
* Avoidance: Use .cache() or .persist() strategically. The prime candidate is a DataFrame that is used in more than one subsequent action. Look at the DAG in the Spark UI; if you see a stage being the parent of multiple child stages, its result is a good candidate for caching.


Section 6: SQL, Stored Procedure, and Complex Transformation Pitfalls

Many DataStage jobs are wrappers around massive SQL scripts or stored procedures. Migrating this logic is a minefield.

38. Blindly Translating Vendor-Specific SQL
* The Mistake: Copy-pasting Oracle PL/SQL or Teradata BTEQ into a Spark SQL query and expecting it to work.
* Why It Happens: The path of least resistance.
* Impact: Syntax errors, performance nightmares. Vendor-specific functions (DECODE, NVL2), query hints, and procedural extensions don't exist in ANSI SQL-compliant Spark.
* Avoidance: Treat stored procedure migration as a re-engineering effort. Translate the intent of the procedure, not the literal code. Use modern Spark patterns like CTEs (Common Table Expressions) and Window functions to replace archaic procedural logic. Use a reference guide to map proprietary functions to Spark equivalents (e.g., DECODE -> CASE WHEN).

39. Mishandling Procedural Logic (Loops, Cursors)
* The Mistake: Trying to replicate a FOR loop or a CURSOR from a stored procedure by iterating over a collected DataFrame in Python.
* Why It Happens: A direct translation of a procedural mindset.
* Impact: You destroy all parallelism. The logic runs on a single driver node, is incredibly slow, and will fail with OOM errors on any significant data volume.
* Avoidance: Think in sets, not rows. Almost every cursor or loop can be rewritten as a set-based operation in Spark. A loop that updates rows based on a condition is a MERGE or UPDATE statement. A cursor that aggregates data is a GROUP BY.
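A sketch of the mindset shift, assuming a hypothetical procedure that looped over a cursor to flag overdue invoices; the entire loop collapses into one distributed statement:

```python
# Cursor-style anti-pattern: runs row by row on the driver.
# for row in invoices_df.collect():
#     if row["due_date"] < today:
#         update_invoice(row["invoice_id"], "OVERDUE")

# Set-based equivalent: one statement, executed in parallel against the Delta table.
spark.sql("""
    UPDATE silver.invoices
    SET status = 'OVERDUE'
    WHERE status = 'OPEN' AND due_date < current_date()
""")
```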

40. Using String Concatenation to Build SQL Queries
* The Mistake: Building Spark SQL queries like this: spark.sql("SELECT * FROM table WHERE city = '" + user_variable + "'").
* Why It Happens: It's a common, but terrible, pattern seen in many languages.
* Impact: Massive SQL injection security vulnerability. It also makes the code unreadable and hard to maintain.
* Avoidance: Use parameterized queries. In PySpark, use f-strings with caution and proper sanitization or, better yet, use the DataFrame API which is immune to SQL injection. df.filter(df.city == user_variable).
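Two safer alternatives with a hypothetical city filter; note the named-parameter form assumes a recent runtime (Spark 3.4+ parameter markers):

```python
from pyspark.sql import functions as F

# DataFrame API: there is no SQL string to inject into.
result = spark.table("silver.customers").filter(F.col("city") == user_variable)

# Parameterized SQL with named markers (Spark 3.4+ / recent Databricks runtimes).
result_sql = spark.sql(
    "SELECT * FROM silver.customers WHERE city = :city",
    args={"city": user_variable},
)
```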

41. Ignoring Transaction Control
* The Mistake: Assuming Spark has the same multi-statement transaction control (BEGIN TRANSACTION, COMMIT, ROLLBACK) as a traditional database.
* Why It Happens: A misunderstanding of the distributed, file-based nature of a Lakehouse.
* Impact: A multi-stage update process fails halfway through, leaving the target table in a corrupt, inconsistent state. There is no simple ROLLBACK.
* Avoidance: Design atomic operations. Delta Lake provides ACID transactions at the level of a single statement (MERGE, UPDATE, DELETE, INSERT). A "transaction" that needs to update three tables must be broken into three separate, idempotent steps, with robust error handling and recovery logic in the orchestration layer.

42. Not Understanding MERGE INTO Performance
* The Mistake: Using MERGE to update a massive Delta table with a tiny number of changes.
* Why It Happens: MERGE is the correct logical tool for upserts.
* Impact: The job is incredibly slow and expensive. A MERGE on a non-partitioned table might require rewriting the entire table, even for a single-row update.
* Avoidance: For large tables, always use MERGE with partition pruning. Ensure the ON condition of your merge allows Spark to prune the files it needs to scan. For high-volume updates, consider patterns like Change Data Capture (CDC) with DLT's APPLY CHANGES INTO API, which is highly optimized for this workload.
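A sketch of a pruned merge, assuming the target is partitioned by transaction_date and the incoming batch only touches recent dates (table and column names are illustrative):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "silver.transactions")

# Derive the affected date range from the batch so the ON clause can prune partitions.
min_date = updates_df.agg(F.min("transaction_date")).first()[0]

(
    target.alias("t")
    .merge(
        updates_df.alias("s"),
        f"t.transaction_date >= '{min_date}' "
        "AND t.transaction_date = s.transaction_date "
        "AND t.transaction_id = s.transaction_id",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```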

43. Replicating Database-Specific "Magic" Functions
* The Mistake: Trying to find a direct Spark equivalent for every obscure, vendor-specific SQL function.
* Why It Happens: Developers want a 1-to-1 mapping.
* Impact: Wasted time searching for a function that doesn't exist. The alternative is often writing a complex and slow UDF.
* Avoidance: Step back and understand what the function does. Often, its logic can be replicated by combining several standard Spark functions. This leads to more portable and often more performant code.

44. Ignoring Floating Point Precision Issues
* The Mistake: Using standard Float or Double types for financial calculations.
* Why It Happens: It's the default for non-integer numbers.
* Impact: Tiny rounding errors accumulate in large aggregations, leading to reconciliation failures that are off by a few cents. This completely erodes business trust.
* Avoidance: Use Decimal(precision, scale) type for any monetary or other high-precision values. Match the precision and scale exactly to the source system. This is non-negotiable.
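A one-liner that avoids the whole class of problems, assuming an amount column defined as DECIMAL(38,10) in the source:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# Match the source precision and scale exactly; never let money drift into doubles.
df_money = df.withColumn("amount", F.col("amount").cast(DecimalType(38, 10)))
```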


Section 7: Orchestration Mistakes (Sequences vs. Workflows)

You can have perfect jobs, but if you chain them together incorrectly, the whole system collapses.

45. Recreating Sequence Loops as Giant, Fragile Workflows
* The Mistake: Translating a DataStage sequence loop (e.g., "process files for each day of the month") into 30 copied-and-pasted tasks in a Databricks Workflow.
* Why It Happens: A lack of knowledge about dynamic workflows.
* Impact: An unmanageable, brittle workflow that is impossible to update. Adding a new parameter means editing 30 tasks.
* Avoidance: Use programmatic orchestration. Loop over parameters from a controller notebook with dbutils.notebook.run(), use a "For each" task in Databricks Workflows, or use an external orchestrator like Airflow to generate tasks dynamically based on parameters.

46. Ignoring Conditional Logic and Error Handling Paths
* The Mistake: Migrating the "happy path" of a sequence and ignoring the conditional branches (IF...THEN) and exception handlers.
* Why It Happens: The happy path feels like 80% of the work, and teams want to show progress.
* Impact: The first time a condition is met or a job fails, the entire workflow grinds to a halt with no recovery or alternative path, unlike the robust DataStage sequence.
* Avoidance: Use the "If/else condition" task in Databricks Workflows to replicate conditional logic. Use the Run if settings (e.g., All failed, At least one failed) to build robust error-handling paths, such as sending a notification or running a cleanup job.

47. Not Using Task Values to Pass State
* The Mistake: Having one task write a small value to a file in DBFS, only to have the next task read that file to get the value.
* Why It Happens: It mimics how old scripting environments worked.
* Impact: It's slow, clumsy, and adds unnecessary I/O and potential failure points.
* Avoidance: Use dbutils.jobs.taskValues.set() and dbutils.jobs.taskValues.get() to pass small amounts of state (like row counts, status flags, or calculated dates) between tasks within a single workflow run.
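A minimal sketch of passing a row count between two tasks in the same workflow run; the task key and value key are illustrative:

```python
# Upstream task (task key "load_orders"):
row_count = orders_df.count()
dbutils.jobs.taskValues.set(key="orders_row_count", value=row_count)

# Downstream task in the same job run:
row_count = dbutils.jobs.taskValues.get(
    taskKey="load_orders", key="orders_row_count", default=0, debugValue=0
)
```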

48. Hardcoding Environment Details in Notebooks
* The Mistake: Putting database hostnames, file paths, or credentials directly into notebook code.
* Why It Happens: It's fast during development.
* Impact: The code is not portable between Dev, UAT, and Prod. It's a security risk. Promoting code requires manual editing and is error-prone.
* Avoidance: Externalize all configuration. Use Widgets for parameters, Databricks Secrets for credentials, and environment-specific configuration files that are read at runtime. Use a "bootstrap" notebook to set configurations for the entire workflow.

49. Failing to Design for Retries
* The Mistake: Assuming jobs will always succeed on the first try.
* Why It Happens: Optimism.
* Impact: Transient issues (e.g., a temporary network blip to a source system) cause an entire multi-hour workflow to fail completely, requiring manual intervention.
* Avoidance: Configure retry policies on your workflow tasks. For non-idempotent tasks, ensure the retry policy is 0. For idempotent, transient-prone tasks (like reading from an external API), a policy of 2-3 retries with a delay can dramatically improve reliability.

50. Mixing Business Logic and Orchestration Logic
* The Mistake: Having a single notebook that contains both complex business transformations and the logic to call other notebooks.
* Why It Happens: It's easy to add dbutils.notebook.run() calls anywhere.
* Impact: The code is untestable and not reusable. You can't test the business logic without triggering the orchestration, and you can't reuse the orchestration pattern with different logic.
* Avoidance: Keep them separate. Business logic belongs in notebooks/libraries designed to be called with parameters. Orchestration logic belongs in a master "conductor" notebook or, better yet, in the Databricks Workflows UI itself.


Section 8: Data Quality, Reconciliation, and Validation Failures

"The job ran successfully" is not the same as "The data is correct." This is where you lose the trust of the business.

51. "We'll Validate the Data Later"
* The Mistake: The single most deadly phrase in a migration project. Pushing data validation to the end of the project.
* Why It Happens: Teams are under pressure to migrate code and show progress in terms of "jobs moved."
* Impact: You get to the end of a year-long project, and none of the numbers match. You have no idea where the error was introduced. The entire project is at risk.
* Avoidance: Validate as you go. For every single pipeline migrated, perform a full data reconciliation against the legacy job's output before moving to the next one. This is non-negotiable.

52. Only Doing Row Counts
* The Mistake: Believing that if source_rows == target_rows, the data is correct.
* Why It Happens: It's the easiest check to perform.
* Impact: This is a vanity metric. It won't catch a single incorrect calculation, a bad join, a null mishandling, or a precision error. It provides a false sense of security.
* Avoidance: Implement a proper reconciliation framework. At a minimum, perform SUMs and AVGs on all numeric columns and MIN/MAX on all key/date columns. For critical fields, perform a full MINUS query (EXCEPT ALL in Spark SQL) to find the exact mismatched rows.
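Here's the shape of a check that goes beyond row counts, assuming the legacy output has been landed in a comparison table (table and column names are placeholders):

```python
from pyspark.sql import functions as F

legacy = spark.table("recon.legacy_daily_sales")
new = spark.table("gold.daily_sales")

# Aggregate fingerprint: count plus sums and min/max on key numeric and date columns.
def fingerprint(df):
    return df.agg(
        F.count("*").alias("row_count"),
        F.sum("sales_amount").alias("sum_sales"),
        F.min("sale_date").alias("min_date"),
        F.max("sale_date").alias("max_date"),
    )

fingerprint(legacy).show()
fingerprint(new).show()

# Exact row-level differences, in both directions (the MINUS / EXCEPT ALL check).
only_in_legacy = legacy.exceptAll(new)
only_in_new = new.exceptAll(legacy)
```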

53. Not Building a Reusable Reconciliation Framework
* The Mistake: Writing custom, one-off validation queries for every single job.
* Why It Happens: It seems faster in the short term.
* Impact: Inconsistent validation, massive amounts of boilerplate code, and no auditable, centralized record of validation results.
* Avoidance: Build a generic, configuration-driven reconciliation utility. It should take a source query/table, a target query/table, key columns, and columns to aggregate as parameters, and then execute a standard suite of checks (counts, sums, mins, maxes, standard deviations) and output a clean report.

54. Ignoring Data Type Precision
* The Mistake: Migrating a Decimal(38, 10) from a database into a Spark Double type.
* Why It Happens: Laziness or ignorance of the consequences.
* Impact: Reconciliation on this column will always fail due to floating-point arithmetic. The business will see tiny discrepancies and lose all faith in the new platform.
* Avoidance: Be a stickler for types. Schema must match exactly. Use DecimalType(p, s) in Spark to match the source system's precision and scale.

55. Forgetting About Character Encoding
* The Mistake: Reading a file created on a Windows machine (e.g., latin-1) with the default UTF-8 reader in Spark.
* Why It Happens: It's an invisible property of the data.
* Impact: Special characters (like é or ü) become garbled (? or ©), causing data corruption and join failures.
* Avoidance: Identify the source encoding for every single flat file. Explicitly set it in your Spark read options: spark.read.option("encoding", "ISO-8859-1").csv(...).

56. Not Using a Data Quality Tool
* The Mistake: Manually writing hundreds of IF statements or filters to check for data quality rules.
* Why It Happens: Not wanting to learn a new tool.
* Impact: Brittle, hard-to-manage code. There's no central repository of data quality rules or history of their execution.
* Avoidance: Use a framework like Great Expectations or the built-in EXPECTATIONS feature of Delta Live Tables. This allows you to declaratively define your quality rules (e.g., expect "user_id" to not be null, expect "order_total" to be between 0 and 100000). You can choose to fail the job, alert, or quarantine bad records.
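A minimal DLT sketch of declarative expectations; the table, rule names, and thresholds are illustrative:

```python
import dlt

@dlt.table(name="silver_orders")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")              # drop offending rows
@dlt.expect_or_fail("sane_total", "order_total BETWEEN 0 AND 100000") # stop the pipeline
@dlt.expect("has_channel", "channel IS NOT NULL")                     # log the violation, keep the row
def silver_orders():
    return spark.readStream.table("bronze.orders")
```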

57. Trusting That "If the Job Runs, the Data is Correct"
* The Mistake: The developer's mantra when they are tired of debugging.
* Why It Happens: Exhaustion and pressure to close tickets.
* Impact: Silent data corruption. This is how you end up reporting negative sales to the CEO.
* Avoidance: Foster a culture of skepticism. The job isn't done when the light turns green. The job is done when the reconciliation report is clean and signed off by the data steward.


Section 9: Security, Governance, and Compliance Oversights

This is the stuff that doesn't just cause a production issue, but gets your company on the front page of the news for the wrong reasons.

58. Procrastinating on Unity Catalog
* The Mistake: Starting your migration on a legacy workspace without Unity Catalog enabled, planning to "add it later."
* Why It Happens: It seems faster to get started without it. The setup requires some initial thought.
* Impact: You end up with a security mess of instance profiles, table ACLs, and no central governance. The cost and effort to retrofit UC onto a running system are 10x that of starting with it. You lose out on lineage, data sharing, and fine-grained access control.
* Avoidance: Start with Unity Catalog. It is the foundation of modern Databricks security and governance. Do not start a new project without it.

59. Using All-Powerful Service Principals
* The Mistake: Creating a single Service Principal, giving it workspace admin rights and storage account admin rights, and using it to run every single job.
* Why It Happens: It's easy. It makes all access problems "go away."
* Impact: A catastrophic security hole. If that principal's secret is compromised, the attacker owns your entire data platform. It also makes auditing impossible.
* Avoidance: Apply the principle of least privilege. Create granular service principals for specific functional areas. Grant them access only to the specific catalogs, schemas, and tables they need to do their job via Unity Catalog.

60. Storing Secrets in Notebooks or Git
* The Mistake: Hardcoding passwords, API keys, or other secrets directly in notebook code or configuration files checked into Git.
* Why It Happens: Carelessness during development. "I'll fix it later."
* Impact: A massive, compliance-violating security breach waiting to happen. Anyone with read access to the notebook or repo now has your production credentials.
* Avoidance: Use Databricks Secrets, backed by Azure Key Vault or AWS Secrets Manager. Teach your developers how to use dbutils.secrets.get(scope="...", key="...") from day one. Run pre-commit hooks to scan for secrets before they ever get into your codebase.

61. Ignoring Network Security
* The Mistake: Running everything over the public internet.
* Why It Happens: It's the default and easiest setup.
* Impact: Data is exfiltrated or intercepted. You fail to meet compliance requirements like HIPAA or PCI-DSS.
* Avoidance: Deploy your Databricks workspace in your own VNet/VPC (VNet Injection). Use Private Endpoints to connect to your data sources (like ADLS Gen2, S3, or databases) over the cloud provider's backbone network, never touching the public internet.

62. Failing to Map Old Security Roles
* The Mistake: Migrating the data but not migrating the access control model that governed it.
* Why It Happens: Security is often an afterthought.
* Impact: Users either have no access to the data they need, or worse, they suddenly have access to sensitive PII or financial data they should never see.
* Avoidance: Before decommissioning the old system, catalog the existing user roles and their permissions. Methodically recreate this permission model using Unity Catalog GRANT statements on catalogs, schemas, and tables.

63. No PII/Sensitive Data Handling Strategy
* The Mistake: Treating PII data the same as any other data during the migration.
* Why It Happens: It requires extra effort to identify and handle it.
* Impact: You could be in breach of GDPR, CCPA, and other regulations. Developers in a lower environment might see production PII data.
* Avoidance: Use data discovery tools to scan for and tag PII columns. Implement a strategy for handling them, such as data masking (for lower environments), tokenization, or column-level encryption. Use Unity Catalog's column-level access control to restrict who can see these tagged columns.

64. Forgetting About Audit Logs
* The Mistake: Not having a plan to capture and analyze who is accessing what data and when.
* Why It Happens: It's not a feature that directly contributes to making a pipeline run.
* Impact: When a data issue occurs, you have no way to trace who or what caused it. You cannot satisfy auditor requests.
* Avoidance: Enable and configure diagnostic logging for your Databricks workspace. Ship these audit logs to a central, searchable location like Azure Log Analytics or Splunk. Set up alerts for suspicious activity, such as a user trying to access a table they don't have permission for.

65. Not Using Cluster Policies
* The Mistake: Letting developers define and configure their own clusters freely.
* Why It Happens: It seems to give developers flexibility.
* Impact: Cost overruns from oversized clusters, security issues from bad configurations (e.g., no secrets passthrough), and a "wild west" environment.
* Avoidance: Implement Cluster Policies. These act as guardrails, limiting the instance types, sizes, and configurations developers can choose from. You can create policies like "Small T-Shirt Size," "Large T-Shirt Size," and "PII-Compliant Cluster" to enforce standards and control costs.


Section 10: FinOps & Cost Management Mistakes

The cloud is a powerful tool, but it's also a blank check if you're not careful. I've seen migration "successes" get shut down because they cost 10x the system they replaced.

66. Thinking Databricks is "Cheaper" By Default
* The Mistake: Assuming a move to the cloud will automatically be cheaper than on-prem DataStage licenses and hardware.
* Why It Happens: It's a key part of the sales pitch for cloud in general.
* Impact: Sticker shock. A poorly implemented Databricks solution, with inefficient code and oversized clusters running 24/7, can be vastly more expensive than a depreciated on-prem system.
* Avoidance: Cost is an engineering metric, just like performance. You have to design for it. This means right-sizing clusters, using job clusters, writing efficient code, and actively managing storage.

67. No Showback/Chargeback Model
* The Mistake: Having all Databricks costs roll up into one giant, opaque bill for the entire organization.
* Why It Happens: It's the easiest way to set up billing.
* Impact: No one feels accountable for the costs they generate. The "Data Science" team can run a massive GPU cluster for a week with no visibility into the impact.
* Avoidance: Tag everything. Use cluster tags, job tags, and workspace tags to associate costs with specific projects, teams, or cost centers. Use the system tables or build dashboards to show each team their consumption. Nothing drives cost optimization like accountability.

68. Letting Developers Pick Any Instance Type
* The Mistake: Giving developers a free-for-all on the entire catalog of available VM instance types.
* Why It Happens: Simplicity of setup.
* Impact: Developers pick the newest, biggest, most expensive memory-optimized or GPU-enabled instances for simple CSV parsing jobs, leading to outrageous costs.
* Avoidance: Use Cluster Policies to restrict the available instance types to a curated, cost-effective list. For most ETL, general-purpose or storage-optimized instances are fine. Reserve the expensive ones for workloads that can actually use them.

69. Ignoring Storage Costs
* The Mistake: Focusing only on DBU (compute) costs and ignoring the underlying cloud storage costs (ADLS, S3).
* Why It Happens: Compute cost is more visible in the Databricks UI.
* Impact: Costs creep up. Delta Lake's time travel is great, but it keeps old versions of data, which consumes storage. Unmanaged intermediate data and failed job outputs accumulate.
* Avoidance: Set a VACUUM policy on your production tables to remove old, unneeded data files. Clean up temporary and intermediate data paths. Implement storage lifecycle policies (e.g., move old logs to archive-tier storage).

70. Not Setting Up Budget Alerts
* The Mistake: Waiting for the end-of-month bill to find out you've had a massive cost overrun.
* Why It Happens: It's an extra setup step that's easy to skip.
* Impact: A runaway job or a developer mistake can cost tens of thousands of dollars in a single weekend. By the time you find out, it's too late.
* Avoidance: Use your cloud provider's budgeting tools (Azure Cost Management, AWS Budgets). Set budgets at the subscription, resource group, or tag level. Configure alerts to notify key people when you hit 50%, 75%, and 90% of your monthly budget.

71. Failing to Use Spot Instances
* The Mistake: Using on-demand instances for all workloads, including development, testing, and non-critical production jobs.
* Why It Happens: Fear of preemption. On-demand instances feel "safer."
* Impact: You're leaving massive savings on the table. Spot instances can be 70-90% cheaper than on-demand.
* Avoidance: Use spot instances with a fallback to on-demand for most workloads. Databricks job clusters handle this gracefully. For your absolute tier-1, can't-be-late production jobs, on-demand is fine. For everything else, the cost savings from spot are too significant to ignore.

72. Manually Optimizing Code and Missing FinOps Opportunities
* The Mistake: Relying solely on developers to manually review Spark UIs, rewrite code for optimization, validate outputs, and manage costs, which is a slow, error-prone, and expensive process.
* Why It Happens: Teams underestimate the continuous effort required for performance tuning and cost governance in a usage-based platform like Databricks. The skills are rare and expensive.
* Impact: A huge amount of developer time is spent on low-level tuning instead of delivering business value. Inconsistent optimization leads to unpredictable performance and costs. Manual data validation is often skipped, leading to data quality issues. This results in an overall TCO that is much higher than anticipated, eroding the business case for the migration.
* Avoidance: Leverage specialized tooling to automate the code conversion, validation, optimization, and FinOps cycle. For example, a tool like Travinto can analyze both the legacy DataStage jobs and the new Spark code to identify anti-patterns, suggest specific optimizations (like replacing UDFs or improving join strategies), automate the data reconciliation process to ensure correctness, and provide continuous FinOps dashboards to track costs against performance. This automates the tedious work, saving an estimated 30% of total migration and operational costs, and frees up your expert engineers to focus on building new capabilities rather than just tuning old ones.

73. Not Rightsizing Storage Tiers
* The Mistake: Keeping all data, including old logs and archived raw files, on hot, premium storage.
* Why It Happens: It's the default and requires no management.
* Impact: You pay a premium to store data that is accessed once a year, if ever.
* Avoidance: Use cloud storage lifecycle policies. Automatically move data from Hot/Standard tiers to Cool/Infrequent-Access tiers after 30-60 days, and then to Archive tiers after 180 days. This can reduce storage costs by over 80%.

74. Leaving Interactive Clusters Running Overnight
* The Mistake: A developer finishes their work at 5 PM, leaves their All-Purpose cluster running, and goes home for the weekend.
* Why It Happens: Human error, pure and simple.
* Impact: A totally idle cluster racks up thousands of dollars in charges for no reason. This is one of the biggest and most easily preventable sources of waste.
* Avoidance: Set a non-negotiable auto-termination timeout on all All-Purpose (interactive) clusters. 60-120 minutes is a reasonable starting point. This is a mandatory setting, enforced by a Cluster Policy.


Section 11: Testing, Cutover, and Post-Go-Live Support Mistakes

The final mile is the hardest and most visible. A failure here can erase all the goodwill you've built.

75. No Parallel Run Period
* The Mistake: Decommissioning the DataStage job the moment the Databricks workflow is deployed to production.
* Why It Happens: Pressure to declare victory and start saving on license costs.
* Impact: The first time a business-critical edge case occurs that wasn't in your test data (e.g., end-of-year processing), the new job fails or produces wrong data, and you have no fallback. You are in a crisis.
* Avoidance: Plan for a parallel run period. For critical pipelines, run both the old DataStage job and the new Databricks workflow in production for at least one full business cycle (e.g., a month for monthly reporting). Continuously reconcile the outputs. This is your ultimate safety net.

76. "Big Bang" Cutover
* The Mistake: Trying to migrate and switch over an entire, massive, interconnected system of hundreds of jobs all at once over a single weekend.
* Why It Happens: It seems conceptually simpler than a phased approach.
* Impact: It never works. The complexity is too high. The cutover fails, you have a massive rollback effort, and the project's credibility is shot.
* Avoidance: Migrate in logical, vertical slices. Pick a single business process or data subject area, migrate it end-to-end (from source to gold table), validate it, run it in parallel, and then cut it over. Then move to the next slice. This de-risks the program and allows you to deliver value incrementally.

77. Inadequate Performance Testing
* The Mistake: Testing with 10,000 rows of data and assuming the job will scale linearly to 10 billion rows.
* Why It Happens: Getting access to and storing production-scale data in test environments is hard.
* Impact: A job that runs in 5 minutes in UAT runs for 12 hours in production, missing its SLA and causing a cascade of downstream failures.
* Avoidance: You must performance test with production-like data volumes. If you can't use a full copy of production, use tools to generate realistic, scaled-up data that preserves the skew and cardinality of the real data.

78. Forgetting to Test Failure and Recovery Scenarios
* The Mistake: Only testing the "happy path" where everything works perfectly.
* Why It Happens: It's more pleasant than thinking about failure.
* Impact: The first time a source file arrives late or a database is down, the workflow enters a state no one has ever seen or planned for, requiring frantic manual intervention.
* Avoidance: Practice "chaos engineering." Deliberately delete a source file, shut down a database, submit a malformed record. Test your error handling, your retry logic, and your alerting. A robust system is one that fails gracefully and recoverably.

79. The "Throw it Over the Wall" Handoff
* The Mistake: Having a dedicated "migration factory" team that builds everything and then hands it over to a separate "operations" or "support" team at go-live.
* Why It Happens: It's a common organizational model.
* Impact: The ops team has no idea how the new system works, why decisions were made, or how to debug it. The migration team, having "delivered," has moved on. The result is a prolonged period of instability and finger-pointing.
* Avoidance: Embed your future support staff into the migration team from the beginning. They should be part of the design reviews, code reviews, and testing. They are your first customers. This ensures a smooth transition and builds ownership.

80. No Rollback Plan
* The Mistake: Assuming the cutover will be flawless and not having a documented, tested plan to revert to the old system.
* Why It Happens: Overconfidence.
* Impact: When a critical failure is found post-cutover, panic ensues. There's a scramble to figure out how to restart the old jobs, repoint dependencies, and clean up bad data.
* Avoidance: Have a detailed rollback plan. This includes steps for data cleanup, restarting DataStage sequences, and redirecting downstream consumers. And most importantly: test it. A rollback plan that hasn't been tested is not a plan; it's a prayer.

81. Underestimating "Hypercare" Support
* The Mistake: Reassigning the entire development team to new projects the day after go-live.
* Why It Happens: Project-based thinking and pressure to start the next initiative.
* Impact: The support team is immediately overwhelmed by the flood of minor issues, questions, and "unexpected behaviors" that always accompany a new system.
* Avoidance: Plan for a "hypercare" period of 2-4 weeks post-go-live. During this time, the core development team remains on-call and dedicated to stabilizing the new system, fixing bugs, and optimizing performance.

82. Not Testing Downstream Consumer Dependencies
* The Mistake: Validating that the final Gold table is correct, but not checking if the Tableau dashboard, Power BI report, or downstream application that reads from it still works.
* Why It Happens: The migration team's scope ends at the table.
* Impact: You declare success, but the Head of Sales calls the CIO because their daily sales dashboard is broken. You may have subtly changed a column name (e.g., userName to user_name), a data type, or the timestamp format.
* Avoidance: Identify and test all downstream consumers as part of your test plan. Engage with the owners of those systems and have them run their processes against the new tables in a UAT environment before you cut over.

83. Go/No-Go Meeting is a Rubber Stamp
* The Mistake: Holding the final Go/No-Go meeting as a formality where everyone is expected to say "Go."
* Why It Happens: Political pressure to meet a deadline. No one wants to be the one to stop the train.
* Impact: The project goes live with known critical defects or unacceptable risks, leading to an almost guaranteed production incident.
* Avoidance: Create a formal, data-driven Go/No-Go checklist. This must include items like: "All P1 test cases passed," "Data reconciliation signed off by business for all critical pipelines," "Performance test results within SLA," "Rollback plan tested." The decision should be based on the checklist, not on feelings.


Section 12: People, Process & Culture Pitfalls

Technology is only half the battle. The people and process challenges are often harder.

84. Creating a "Databricks Team" Silo
* The Mistake: Creating a new, elite "Databricks Team" and separating them from the existing "legacy" ETL teams.
* Why It Happens: It seems like a way to focus new skills and move fast.
* Impact: It creates a toxic "us vs. them" culture. The legacy teams feel devalued and become a source of resistance. Knowledge transfer is poor.
* Avoidance: Create integrated, cross-functional teams. Pair your experienced DataStage developers (who have immense business logic knowledge) with experienced Spark developers. The goal is to upskill everyone, not to create a new priesthood.

85. Ignoring the Cultural Shift to a Software Engineering Mindset
* The Mistake: Thinking you can build enterprise-grade Databricks solutions using the same ad-hoc development practices that were common in the GUI-based ETL world.
* Why It Happens: The path of least resistance.
* Impact: You end up with a mess of untitled notebooks, no version control, no testing, and no deployment automation. The system is brittle and unmaintainable.
* Avoidance: This is a software engineering project. Mandate the use of Git for all code. Implement CI/CD pipelines for testing and deployment (e.g., using Databricks Asset Bundles; the older dbx tool has since been superseded by Bundles). Enforce mandatory, rigorous code reviews.

86. Lack of a Platform Engineering Team
* The Mistake: Expecting every single data engineering team to figure out their own CI/CD, security, monitoring, and library management.
* Why It Happens: In a rush to deliver, no one builds the foundation.
* Impact: Massive duplication of effort. Inconsistent, and often incorrect, implementations of core infrastructure. Each team reinvents the wheel, poorly.
* Avoidance: Invest in a small, dedicated Platform Engineering team. Their job is not to build pipelines, but to build the "paved road" for other teams. They provide standardized templates, CI/CD pipelines, security modules, and best practices that accelerate all other teams.

87. Rewarding "Hero Developers"
* The Mistake: Praising the developer who stays up all night to write a 3,000-line, incredibly complex notebook that solves a hard problem.
* Why It Happens: They got the job done.
* Impact: You encourage the creation of unmaintainable, "clever" code that only the hero understands. When they leave, the system becomes a black box of technical debt.
* Avoidance: Reward simplicity, clarity, and collaboration. Praise the developer who writes clean, modular, well-tested code that everyone on the team can understand. Celebrate great code reviews and good documentation.

88. No CI/CD or Testing Strategy
* The Mistake: Allowing developers to write code in the workspace and promote it to production by exporting and importing notebooks manually.
* Why It Happens: It's the "notebook" way of thinking.
* Impact: Error-prone deployments, no automated testing, no audit trail of what was deployed and when. This is not acceptable for an enterprise system.
* Avoidance: Implement a Git-based workflow with CI/CD. A pull request should trigger automated unit tests (on pure functions) and integration tests (running the notebook with test data). A merge to the main branch should trigger an automated deployment to your production environment.
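As a sketch of what "unit tests on pure functions" means in practice: keep business rules as plain Python functions and test them without a cluster, so every pull request gets feedback in seconds. The file names and the rule itself are illustrative, not prescriptive.

```python
# A minimal sketch of a pure-function unit test that runs in CI with no cluster.
# Names and the SLA rule are illustrative examples, not from any real project.

# src/rules.py
from datetime import date

def late_shipment_flag(order_date: str, ship_date: str, sla_days: int = 3) -> bool:
    """Pure business rule: True if shipping breached the SLA window."""
    return (date.fromisoformat(ship_date) - date.fromisoformat(order_date)).days > sla_days

# tests/test_rules.py -- executed automatically on every pull request
def test_late_shipment_flag():
    assert late_shipment_flag("2024-01-01", "2024-01-10") is True
    assert late_shipment_flag("2024-01-01", "2024-01-03") is False
```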

89. Resistance to Code Reviews
* The Mistake: Treating code reviews as an optional, time-consuming chore.
* Why It Happens: Developers who are new to a code-first world may feel their work is being personally criticized.
* Impact: Poor quality code, anti-patterns, and bugs make it into production. Knowledge remains siloed with the original author.
* Avoidance: Make code reviews a mandatory, blameless part of the process. Frame them as a tool for knowledge sharing and collective code ownership. The goal is to make the code better, not to criticize the person.

90. No Standard Project Structure
* The Mistake: Letting each developer organize their notebooks, libraries, and tests in whatever way they see fit.
* Why It Happens: Lack of standards.
* Impact: It's impossible for a new person to navigate a project. CI/CD automation is difficult because it has to account for dozens of different layouts.
* Avoidance: Define and enforce a standard project template. For example: a src directory for pipeline code, a tests directory for tests, a conf directory for configurations, and a notebooks directory for experimentation.
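For reference, a layout along these lines has served us well; treat it as a starting point rather than gospel (the databricks.yml entry and module names assume you are using Databricks Asset Bundles and are purely illustrative):

```
project-root/
├── databricks.yml        # asset bundle definition (if you use Databricks Asset Bundles)
├── src/                  # pipeline code as importable Python modules
│   └── my_pipeline/
│       └── transforms.py
├── tests/                # unit and integration tests run by CI
│   └── test_transforms.py
├── conf/                 # environment-specific configuration (dev/test/prod)
└── notebooks/            # exploration and prototyping only
```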

91. Treating Notebooks as Production Scripts
* The Mistake: Using .ipynb notebooks, with their mix of code, output, and markdown, as the final production artifact.
* Why It Happens: It's the primary development interface.
* Impact: Notebooks are JSON files containing both code and output, which makes code reviews and diffing in Git very difficult. They encourage a non-modular style.
* Avoidance: Use notebooks for exploration and development. For production, refactor the core logic into Python (.py) modules that can be properly tested and version controlled, and have your orchestrator call those modules, or at minimum rely on Git integration that stores notebooks in the repo as plain source files. Databricks Asset Bundles encourage this best practice.
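Here is roughly what that refactor looks like in miniature; the module, function, and table names are illustrative:

```python
# A minimal sketch of the refactor -- module and table names are illustrative.

# src/my_pipeline/transforms.py: testable, version-controlled logic, no I/O
from pyspark.sql import DataFrame, functions as F

def enrich_orders(orders: DataFrame, customers: DataFrame) -> DataFrame:
    """Pure transformation, unit-testable with tiny in-memory DataFrames."""
    return (orders
            .join(customers, "customer_id", "left")
            .withColumn("order_year", F.year("order_date")))

# job entry point (a task script or a two-line notebook): all I/O lives here
def main(spark):
    result = enrich_orders(spark.table("bronze.orders"), spark.table("bronze.customers"))
    result.write.mode("overwrite").saveAsTable("silver.orders_enriched")
```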


Section 13: Final Words - More Mistakes to Fill the Gaps

I've thrown a lot at you. Let's round out the list with some final, rapid-fire lessons from the field.

92. Not Migrating Lineage: You lose all visibility into how data flows. Plan to use Unity Catalog for lineage from day one.

93. Misunderstanding the Shared Responsibility Model: Assuming Databricks or the cloud provider will handle all your security. You are responsible for what's in your VNet and your code.

94. "Forgetting" the Decommissioning Step: The migration is "done," but DataStage is never actually turned off "just in case." The cost savings are never realized. Be brave and pull the plug.

95. Not Setting Up a Center of Excellence (CoE): No central group to define best practices, provide guidance, and evaluate new features. Leads to chaos.

96. Ignoring the Learning Curve of Scala: If you choose Scala, you must budget for a significantly steeper learning curve for most developers compared to Python/SQL.

97. Bad Library Management: Scattering ad-hoc, unpinned %pip install commands across notebooks leads to non-reproducible environments. Pin versions in a requirements file under version control and install them as cluster-scoped libraries or declare them in your job definitions.

98. Not Using Databricks Repos (Git Integration): Developers downloading/uploading notebooks manually. A recipe for disaster.

99. Ignoring Time Zone Differences: DataStage server time vs. Spark cluster time vs. source data timestamps. A classic cause of off-by-one-day errors. Be explicit with time zones everywhere (see the sketch after this list).

100. Believing the Job is Ever "Done": A data platform is a living product. The migration is just the beginning. You need a long-term roadmap for optimization, new features, and maintenance.

101. Reading a List Like This and Doing Nothing About It: The biggest mistake of all is to see the warnings, acknowledge them, and then let institutional inertia and project pressure push you down the same old path.
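And because mistake #99 bites so many teams, here is a minimal sketch of being explicit about time zones in Spark. I'm assuming the source column arrived as a naive timestamp recorded in the DataStage server's local zone (illustratively America/New_York); adjust to your reality.

```python
# Mistake #99 in practice: be explicit about time zones. The table, column,
# and source zone below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Pin the session time zone so parsing and display never depend on wherever
# the cluster happens to run.
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Convert the server-local timestamp to UTC once, at ingestion, and carry
# only UTC downstream.
events_utc = spark.table("bronze.events").withColumn(
    "event_ts_utc", F.to_utc_timestamp(F.col("event_ts"), "America/New_York"))
```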


If you've made it this far, you're already ahead of 90% of the teams who attempt this journey. This migration is not just a technical challenge; it's an organizational, cultural, and financial one. It's hard. But by avoiding these common pitfalls, you can navigate the complexities and build a data platform that is not just a replacement for DataStage, but a true engine for business innovation for the next decade. Good luck. You'll need it.