Case Study: Migrating 1,000+ DataStage Jobs to Databricks

Published on: December 11, 2025 03:47 PM

By: A Principal Data Engineer & Migration Lead

1. Introduction

I'm writing this as someone who has spent the last decade in the trenches of large-scale ETL migrations. Before the cloud-native world, I spent 15 years building and managing enterprise data warehouses on the IBM stack. This case study details a multi-year program I led at a large financial services institution to migrate over 1,000 critical IBM DataStage jobs to the Databricks Lakehouse Platform.

The Legacy Environment:
Our data ecosystem was a classic of its time: a powerful, on-premises IBM Netezza appliance served as the core data warehouse, fed by an extensive network of IBM DataStage (v9.1) jobs. This environment was the backbone of our regulatory reporting, risk management, and business intelligence for over a decade. It was stable and well-understood, but it was also a fortress—expensive, inflexible, and rapidly approaching its technical and financial breaking point.

The "Why":
The migration wasn't driven by a desire to chase a shiny new toy. It was a strategic imperative fueled by several factors:
* End-of-Life Hardware: Our Netezza appliances were nearing their end-of-support date, forcing a multi-million dollar hardware refresh decision.
* Crippling Costs: The annual maintenance and licensing costs for the IBM stack were staggering. A consumption-based cloud model presented a compelling financial alternative.
* Lack of Agility: Onboarding new data sources or modifying existing pipelines was a months-long process. We couldn't support the business's growing demand for advanced analytics, machine learning, and faster data access.
* Talent Scarcity: Finding and retaining skilled DataStage developers was becoming increasingly difficult and expensive.

The Scope:
This was not a small undertaking. The program scope included:
* ETL Jobs: ~1,200 DataStage Parallel and Sequencer jobs.
* Data Volume: ~5 PB of raw and processed data in Netezza.
* Applications: 100+ downstream applications relying on the data warehouse, from Tableau dashboards to critical regulatory filing systems.
* Complexity: Jobs ranged from simple file-to-table loads to 200-stage behemoths with embedded C++ routines and complex business logic.


2. Migration Planning

Success in a migration of this scale is determined long before the first line of code is converted. Our planning phase lasted nearly six months and was the single most critical factor in our success.

Discovery & Inventory:
You can't migrate what you don't know you have. Our first step was a deep, forensic analysis of the DataStage environment. We couldn't rely on documentation, which was outdated or non-existent.

  • Action: We wrote a suite of scripts to parse the DataStage metadata repository (IMAM) and the underlying .dsx export files. This allowed us to programmatically extract every job, its stages, parameters, source/target tables, and dependencies (a parsing sketch follows this list).
  • Output: A comprehensive inventory in a relational database, which became our single source of truth for the migration.
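
To make the discovery step concrete, here is a minimal sketch of the kind of parsing script involved, not our production code. It assumes the classic .dsx export layout (BEGIN DSJOB blocks carrying Identifier and StageType properties); your export version may differ, and the directory path and output shape are purely illustrative.

```python
import re
from pathlib import Path

# Assumed markers from the classic .dsx export layout: each job is wrapped in a
# BEGIN DSJOB ... END DSJOB block and carries an Identifier "JobName" property.
JOB_BLOCK = re.compile(r'BEGIN DSJOB(.*?)END DSJOB', re.DOTALL)
JOB_NAME = re.compile(r'Identifier\s+"([^"]+)"')
STAGE_TYPE = re.compile(r'StageType\s+"([^"]+)"')

def inventory_dsx(export_dir: str) -> list[dict]:
    """Scan a directory of .dsx exports and return one inventory row per job."""
    rows = []
    for dsx_file in Path(export_dir).glob("*.dsx"):
        text = dsx_file.read_text(encoding="utf-8", errors="replace")  # tolerate stray bytes
        for block in JOB_BLOCK.finditer(text):
            body = block.group(1)
            name_match = JOB_NAME.search(body)
            stages = STAGE_TYPE.findall(body)
            rows.append({
                "source_file": dsx_file.name,
                "job_name": name_match.group(1) if name_match else "UNKNOWN",
                "stage_count": len(stages),
                "stage_types": sorted(set(stages)),
            })
    return rows

# Example usage: rows = inventory_dsx("/exports/datastage"), then load the rows
# into the relational inventory database that becomes the single source of truth.
```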

Complexity Analysis & Risk Assessment:
Not all jobs are created equal. We developed a quantitative complexity model to score each job. This prevented us from making subjective prioritization decisions.

| Complexity | Criteria | Count (Approx.) |
| --- | --- | --- |
| Low | < 20 stages, simple lookups, no custom code, standard connectors | ~400 |
| Medium | 20-70 stages, complex joins/aggregations, some custom SQL | ~550 |
| High | > 70 stages, use of Transformer stage variables, pivot stages, multiple data sources | ~200 |
| Very High | Custom C++/BASIC routines, non-standard connectors (e.g., MQ), complex Sequencer logic | ~50 |
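
The scoring model was more involved than the table suggests, but a simplified sketch conveys the idea. The stage-type names and thresholds below are illustrative assumptions keyed to the bands above, not our actual rule set; the input is one row from the inventory produced by the discovery scripts.

```python
def score_job(job: dict) -> str:
    """Classify a job into a complexity band (simplified version of our model)."""
    stage_types = set(job.get("stage_types", []))
    stage_count = job.get("stage_count", 0)

    # Assumed stage-type names; substitute the ones that appear in your exports.
    has_custom_code = bool(stage_types & {"CustomStage", "BASICTransformer"})
    has_nonstandard_connector = bool(stage_types & {"MQSeries", "WebSphereMQConnector"})

    if has_custom_code or has_nonstandard_connector:
        return "Very High"
    if stage_count > 70:
        return "High"
    if stage_count >= 20:
        return "Medium"
    return "Low"
```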

Prioritization and Phasing Strategy:
We adopted a "wave" approach, grouping jobs by a combination of business value and technical complexity.

  • Wave 1 (The "Quick Wins"): We targeted ~100 Low-to-Medium complexity jobs that fed high-visibility, but non-critical, BI dashboards. The goal was to build momentum, test our target architecture, and prove the value of the program to stakeholders.
  • Wave 2 (The "Factory"): This was the bulk of the migration (~800 jobs). We focused on entire subject areas (e.g., all marketing data marts) to deliver cohesive business value. This phase was all about scaling our process.
  • Wave 3 (The "Monsters"): We saved the 50 "Very High" complexity jobs and other high-risk pipelines for last. By this point, the team was deeply experienced, our frameworks were mature, and we could dedicate senior talent to these challenging migrations.

Tool Selection Rationale:
We evaluated several automated conversion tools, including LeapLogic and BladeBridge, alongside a "build-our-own" approach.

  • Decision: We chose an automated conversion tool (LeapLogic in our case) as an accelerator, not a "magic bullet."
  • Rationale: The tool's ability to parse DSX files, understand complex Transformer stage logic, and generate initial PySpark code was invaluable. It wasn't perfect, but it reduced the manual "translation" effort by an estimated 60-70%. I cannot stress this enough: there is no such thing as 100% automated conversion. The value is in creating a consistent, readable, and analyzable PySpark starting point. It did the heavy lifting of code parsing, allowing our engineers to focus on validation, optimization, and handling the complex edge cases.

3. Architecture & Target Design

We didn't just want to "lift and shift" our old architecture to the cloud. We wanted to build a future-proof platform.

  • Target Data Architecture: We implemented the Databricks Medallion Architecture on Delta Lake (a minimal pipeline sketch follows this list).
    • Bronze: Raw, immutable data ingested from source systems (often 1:1 with the source). This replaced our persistent staging areas.
    • Silver: Filtered, cleaned, and conformed data. This is where we applied the core business rules and data quality checks previously found in our DataStage integration layers. SCD Type 2 history was handled here using Delta Lake's MERGE capability.
    • Gold: Aggregated, business-level tables for reporting and analytics. These were the direct replacements for our Netezza data marts.
  • Orchestration Strategy: DataStage Sequencer jobs were migrated to Databricks Workflows. We found that the Jobs API provided robust dependency management, parameterization, and error handling. For complex, cross-application dependencies, we used Apache Airflow to orchestrate multiple Databricks Workflows.
  • Metadata, Governance, and CI/CD:
    • CI/CD: We mandated a Git-based workflow using Databricks Repos integrated with Azure DevOps. All code (Notebooks, Python libraries) went through pull requests, automated linting, and unit testing before being deployed to our dev, test, and prod Databricks workspaces via Terraform and custom deployment scripts. This was a massive cultural shift from the GUI-based, manual deployment model of DataStage.
    • Governance: We leveraged Unity Catalog (which became generally available midway through our project) for fine-grained access control, data lineage, and discovery. This was a game-changer compared to the siloed permissions model we had before.
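
Below is a minimal illustration of the Medallion flow described above. It assumes a Databricks notebook where `spark` is in scope; the paths, table names, and columns are hypothetical, and it sketches the shape of the flow rather than our actual pipelines.

```python
from pyspark.sql import functions as F

# Bronze: land the raw extract as-is (immutable, roughly 1:1 with the source).
(spark.read.format("csv").option("header", True).load("/mnt/landing/transactions/")
    .withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta").mode("append").saveAsTable("bronze.transactions"))

# Silver: cleanse, conform types, and apply core data quality rules.
silver_df = (spark.table("bronze.transactions")
    .filter(F.col("transaction_id").isNotNull())
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .dropDuplicates(["transaction_id"]))
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.transactions")

# Gold: business-level aggregate that replaces a former Netezza data mart table.
(spark.table("silver.transactions")
    .groupBy("account_id", F.to_date("transaction_ts").alias("business_date"))
    .agg(F.sum("amount").alias("daily_amount"),
         F.count(F.lit(1)).alias("txn_count"))
    .write.format("delta").mode("overwrite").saveAsTable("gold.daily_account_summary"))
```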

4. Execution Strategy

This is where the rubber met the road. We established a "migration factory"—a repeatable process for converting, validating, and deploying jobs at scale.

Automated vs. Manual Migration:
Our approach was pragmatic: automate the tedious, refactor the complex.
* The conversion tool generated the initial PySpark code. This was our baseline.
* A "Refactoring & Optimization" step was mandatory for every job. A developer reviewed the generated code, applied our internal framework standards, optimized Spark-inefficient patterns (e.g., row-by-row processing), and documented the logic.
* For the "Very High" complexity jobs with embedded C++ logic, we performed a full manual rewrite. We treated it as a greenfield development effort, focusing on the business logic rather than trying to replicate the legacy code line-by-line.

Handling Complex Transformations:
* DataStage Transformer Stage: The heart of most jobs. The tool converted this logic into a series of PySpark withColumn and selectExpr transformations. The biggest challenge was untangling nested stage variables, which often required manual refactoring into intermediate DataFrames for clarity.
* Lookups: DataStage lookups were converted to Spark broadcast joins for small tables and standard joins for larger ones. We had to pay close attention to handling NULL key mismatches and lookup failures, which DataStage often handles implicitly.
* SCDs: Slowly Changing Dimensions (Type 1, 2, 3) were a common pattern. We replaced the complex lookup/insert/update logic in DataStage with the elegant and highly efficient MERGE INTO command in Delta Lake. This was one of the biggest single simplification wins.
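
The sketch below pulls the three patterns above into one simplified job: Transformer-style derivations as withColumn expressions, a lookup as a broadcast join with an explicit default, and SCD Type 2 via Delta's MERGE using the standard staged merge-key pattern. Table and column names are hypothetical, it assumes the source already carries a row_hash and effective_date (and a string customer_id), and it is a sketch of the approach, not our production framework.

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

src = spark.table("silver.customer_updates")       # illustrative table names
ref = spark.table("silver.country_codes")
dim = DeltaTable.forName(spark, "gold.dim_customer")

# 1. Transformer-stage logic: nested stage variables become named columns.
derived = (src
    .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
    .withColumn("segment",
        F.when(F.col("lifetime_value") > 100000, "PLATINUM").otherwise("STANDARD")))

# 2. Lookup stage: broadcast join plus an explicit default for lookup misses,
#    which DataStage handled implicitly.
enriched = (derived
    .join(F.broadcast(ref), "country_code", "left")
    .withColumn("country_name", F.coalesce("country_name", F.lit("UNKNOWN"))))

# 3. SCD Type 2 via MERGE. Changed customers are staged twice: once keyed
#    (to expire the current row) and once with a NULL key (to insert the new
#    version). customer_id is assumed to be a string in this sketch.
changed = (enriched.alias("s")
    .join(dim.toDF().filter("is_current = true").alias("t"), "customer_id")
    .where("s.row_hash <> t.row_hash")
    .select("s.*"))
staged = (enriched.withColumn("merge_key", F.col("customer_id"))
    .unionByName(changed.withColumn("merge_key", F.lit(None).cast("string"))))

(dim.alias("t")
    .merge(staged.alias("s"), "t.customer_id = s.merge_key AND t.is_current = true")
    .whenMatchedUpdate(condition="t.row_hash <> s.row_hash",
                       set={"is_current": "false", "end_date": "s.effective_date"})
    .whenNotMatchedInsert(values={          # end_date is left NULL for the new current row
        "customer_id": "s.customer_id", "full_name": "s.full_name",
        "segment": "s.segment", "country_name": "s.country_name",
        "row_hash": "s.row_hash", "effective_date": "s.effective_date",
        "is_current": "true"})
    .execute())
```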

Parallel Run Approach for QA:
This was our non-negotiable quality gate. For every single job, we ran the legacy DataStage job and the new Databricks job in parallel against the exact same source data for a given business day.
* Source Data Freezing: We took snapshots of all source tables/files before either job ran.
* Execution: Both pipelines ran independently.
* Reconciliation: The outputs (target tables in Netezza and Databricks) were then compared using our custom reconciliation framework. The job was not promoted until the reconciliation passed with zero discrepancies.


5. Tools & Frameworks Used

You can't manage a migration this size with manual effort alone. We invested heavily in building and integrating tooling.

  • Discovery: Custom Python scripts parsing DataStage IMAM XML exports.
  • Conversion: LeapLogic. It provided the best balance of readable code generation and metadata analysis. It wasn't perfect, but it was a massive force multiplier.
  • Orchestration: Databricks Workflows, supplemented by Airflow for enterprise-level scheduling.
  • CI/CD: Azure DevOps, Git, Terraform.
  • Validation: Our custom Python Reconciliation Framework. This was our most valuable internal tool. It connected to both Netezza (via jaydebeapi) and Databricks (via Spark Connector), and performed three levels of validation:
    1. Metadata Check: Compared column names, data types, and nullability.
    2. Aggregate Check: Ran COUNT(*), SUM(), AVG() on all numeric columns, and COUNT(DISTINCT) on key columns. This gave us a quick pass/fail.
    3. Row-Level Checksum: If aggregates passed, it generated a checksum (MD5 hash) of concatenated, consistently-formatted column values for each row in both tables. It then performed a MINUS operation on the checksums to identify specific mismatched rows. This was computationally expensive but essential for rooting out subtle data type or transformation logic bugs.
  • Monitoring: Databricks Jobs UI, Spark UI, and custom dashboards in Datadog built on top of Databricks metrics.

The reconciliation framework was the hero of this project. It turned a subjective, error-prone manual validation process into an objective, automated, and auditable one.
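
A stripped-down sketch of the framework's core idea follows. In production the legacy side was read from Netezza over JDBC and the aggregate tier also covered SUM/AVG and distinct counts; here both inputs are plain DataFrames and only the essentials are shown, so treat the names and signatures as illustrative.

```python
from pyspark.sql import DataFrame, functions as F

def row_checksums(df: DataFrame) -> DataFrame:
    """MD5 over all columns, consistently formatted, with an explicit NULL token."""
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in sorted(df.columns)]
    return df.select(F.md5(F.concat_ws("||", *cols)).alias("row_md5"))

def reconcile(legacy: DataFrame, migrated: DataFrame) -> dict:
    # Level 1: metadata check (names only here; the real framework also compared
    # data types and nullability).
    meta_ok = sorted(legacy.columns) == sorted(migrated.columns)

    # Level 2: quick aggregate check on row counts as a fast pass/fail.
    counts_ok = legacy.count() == migrated.count()

    # Level 3: row-level checksum diff in both directions (the MINUS step).
    only_in_legacy = row_checksums(legacy).exceptAll(row_checksums(migrated)).count()
    only_in_migrated = row_checksums(migrated).exceptAll(row_checksums(legacy)).count()

    return {
        "metadata_ok": meta_ok,
        "counts_ok": counts_ok,
        "rows_only_in_legacy": only_in_legacy,
        "rows_only_in_migrated": only_in_migrated,
        "passed": meta_ok and counts_ok and only_in_legacy == 0 and only_in_migrated == 0,
    }

# Illustrative usage; in practice the legacy side came from Netezza via JDBC:
# result = reconcile(spark.read.jdbc(netezza_url, "PROD.CUSTOMER", properties=jdbc_props),
#                    spark.table("gold.customer"))
```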


6. Challenges & Roadblocks

It wasn't all smooth sailing. We hit numerous technical and organizational hurdles.

  • The Devil in the Details: Subtle differences caused countless headaches:
    • Data Types: DataStage's Decimal type vs. Spark's DecimalType(precision, scale) required meticulous validation. We found cases where silent truncation was occurring.
    • NULL Handling: Spark follows standard SQL semantics, so NULL keys never match in joins and NULLs propagate through functions, whereas DataStage stages often applied their own stage-specific default handling. This required explicit .isNull() checks and null-safe comparisons in our Spark code (see the sketch at the end of this section).
    • Character Encoding: We spent weeks debugging issues that turned out to be UTF-8 vs. latin-1 mismatches between source files and how Spark was reading them.
  • Debugging in a Distributed World: DataStage developers were accustomed to the visual debugger, where they could see data flowing through a job row-by-row. Debugging a failing Spark job on a 100-node cluster is a different beast. It required a paradigm shift: learning to read Spark driver/executor logs, understand the Spark UI, and use interactive notebooks for isolating problematic data and logic. The learning curve was steep.
  • Organizational & Scheduling Inertia:
    • The "Parallel Run Window": Finding a time window to freeze sources and run two multi-hour pipelines was a logistical nightmare. It required intense coordination with source system owners, business users, and operations teams.
    • "But it works in DataStage": We constantly battled the tendency for developers to try and replicate DataStage's anti-patterns in Spark (e.g., using UDFs for everything) instead of adopting a set-based, declarative DataFrame approach. This required continuous training and rigorous code reviews.

7. Validation & Quality Assurance

Our mantra was "Trust, but verify with code."

  • Data Reconciliation Strategy: The three-tiered (Metadata, Aggregate, Row-level Checksum) automated framework was our primary strategy. We ran it on 100% of the output data for most tables. For exceptionally large tables (>10 billion rows), we ran the checksum on a statistically significant, randomly selected sample.
  • Testing Approach:
    1. Unit Testing: We enforced Pytest for our custom Python libraries and functions.
    2. Integration Testing: Executing a single converted job in the dev environment with sample data.
    3. Parallel Run Testing: The full-scale comparison in our QA environment, as described above. This was our formal UAT.
  • Exception Handling: Pipelines were designed for failure. We wrote detailed exception logs to a dedicated Delta table. Failed pipelines triggered alerts in PagerDuty and did not commit any partial data to the final gold tables, ensuring atomicity.
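
A sketch of the failure-logging pattern is below. The wrapper and exception table name are illustrative (the PagerDuty alert itself was wired through our monitoring stack, not this function), and it assumes `spark` is in scope.

```python
import traceback
from datetime import datetime, timezone

def run_with_exception_log(job_name: str, run_fn):
    """Run a pipeline step; on failure, append a record to the exception Delta
    table and re-raise so the Workflow task (and its alerting) still fails."""
    try:
        run_fn()
    except Exception as exc:
        error_row = [(job_name,
                      datetime.now(timezone.utc).isoformat(),
                      type(exc).__name__,
                      str(exc),
                      traceback.format_exc())]
        (spark.createDataFrame(
             error_row,
             "job_name string, failed_at string, error_type string, error_msg string, stack_trace string")
         .write.format("delta").mode("append")
         .saveAsTable("ops.pipeline_exceptions"))   # illustrative table name
        raise  # no partial data is committed; the Gold write only runs on success
```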

8. Performance & Cost Optimization

Once we proved correctness, we focused on efficiency.

  • Cluster Sizing: We moved from a one-size-fits-all cluster model to job-specific clusters defined in our Databricks Workflow configurations. We started with larger clusters to ensure stability during initial parallel runs and then used the Spark UI and monitoring data to right-size them, often reducing core counts by 50-75% for steady-state production runs.
  • Workload Tuning:
    • Partitioning & Z-Ordering: We aggressively used Delta Lake partitioning on date and business keys, and Z-Ordering on commonly filtered columns. This dramatically improved query performance in the Gold layer.
    • Code Optimization: We ran a "Top 10 Most Expensive Jobs" analysis each month and assigned senior engineers to refactor them, often replacing inefficient PySpark UDFs with native Spark SQL functions, or optimizing join strategies.
  • Cost Savings: The results were immediate and significant. By decommissioning the on-prem Netezza and DataStage environments, we eliminated millions in annual licensing and maintenance. Even with Databricks DBU consumption, our TCO for data processing was reduced by approximately 45%.
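
For the partitioning and Z-Ordering point above, a short sketch of the pattern: partition the Gold table at write time, then run OPTIMIZE/ZORDER and VACUUM on a schedule. Table and column names are illustrative.

```python
# Partition the Gold table on the business date at write time.
(spark.table("silver.transactions")
    .write.format("delta")
    .partitionBy("business_date")
    .mode("overwrite")
    .saveAsTable("gold.transactions"))

# Co-locate data for the most common filter columns, then clean up old files.
spark.sql("OPTIMIZE gold.transactions ZORDER BY (account_id, product_code)")
spark.sql("VACUUM gold.transactions RETAIN 168 HOURS")
```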

9. Results & Outcomes

The program was officially completed after 2.5 years. The results exceeded expectations.

  • Jobs Migrated: 1,185 of the ~1,200 in-scope jobs were successfully migrated and running in production on Databricks. The remaining ~15 were decommissioned as they were found to be obsolete.
  • Data Accuracy: Our automated reconciliation framework consistently measured >99.999% data accuracy between the legacy and migrated pipelines before cutover.
  • SLA Achievement: Average batch processing windows were reduced by 30%. Some critical overnight jobs that previously finished at 7 AM were now completing by 4 AM.
  • Cost Savings: Realized TCO reduction of ~45% year-over-year.
  • Agility: The time to onboard a new data source and make it available for analytics dropped from an average of 2 months to under 2 weeks.

10. Lessons Learned & Recommendations

  1. Invest in a Reconciliation Framework First. Before you convert a single job, build or buy a tool to automatically and definitively prove that your target output matches your source. This tool will be your shield and your sword throughout the entire project. It builds trust with the business and removes emotion from the validation process.
  2. Automated Conversion Tools are Accelerators, Not Oracles. Set the expectation that these tools will handle 70% of the code but 30% of the effort. The remaining 30% of the code (the complex parts) will take 70% of the human effort. They are a means to an end, not the end itself.
  3. Don't Lift and Shift Anti-Patterns. The goal is modernization, not replication. Actively train your team to think in Spark/DataFrames, not in DataStage stages. Forbid direct translation of inefficient patterns. Use the migration as an opportunity to refactor and improve.
  4. Prioritize by Subject Area, Not Just Complexity. Migrating all the jobs for a single data mart or business unit (e.g., "Customer 360") delivers tangible value and allows you to decommission parts of the legacy world sooner. A purely technical prioritization can leave you with a Frankenstein's monster of a hybrid environment for too long.
  5. Your People are the Hardest Part of the Migration. The technology shift is easier than the cultural shift. DataStage developers need to become software engineers—learning Git, CI/CD, Python, and distributed computing concepts. Invest heavily in training, pair programming, and creating a psychologically safe environment for them to learn and, inevitably, make mistakes.

11. Executive Summary / CXO Takeaways

For fellow leaders evaluating a similar modernization program, here is my candid advice:

This migration was far more than a technical upgrade; it was a fundamental modernization of our company's data capabilities. By moving from a legacy IBM stack to the Databricks Lakehouse, we achieved three primary strategic objectives:

  1. Significant Cost Reduction: We eliminated millions in fixed annual licensing and support costs, moving to a flexible, consumption-based model that directly aligns with business usage. The program delivered a 45% reduction in TCO for our core data processing.
  2. Accelerated Business Agility: The program was a direct enabler for our advanced analytics and AI/ML initiatives. Business teams can now access, model, and deploy insights in weeks instead of months. We are no longer constrained by the technology of the past.
  3. Reduced Operational & Technical Risk: We successfully retired aging, unsupported hardware and software, mitigating a significant business continuity risk. Our new platform is built on modern, cloud-native principles, improving reliability, scalability, and security.

My Advice for Leadership:

  • This is a Business Program, Not an IT Project: Secure executive sponsorship from outside of IT. The biggest challenges will be organizational change, business process re-engineering, and user acceptance.
  • Fund a Dedicated, Empowered Team: Do not attempt this as a "side project" for your existing operations team. Create a dedicated "migration factory" team composed of your best legacy and modern data engineers. Protect them from daily operational fire-fighting.
  • Embrace Pragmatism over Perfection: A 100% like-for-like migration is a fool's errand. Empower your team to make pragmatic decisions to refactor, simplify, and in some cases, decommission old logic. Trust the automated validation process to ensure correctness.

The journey was challenging, but the outcome was transformative. We didn't just move our data; we changed how our entire organization uses it to compete and win.