How One Enterprise Migrated from DataStage in 90 Days

Published on: November 15, 2025 05:10 AM


Author: A Principal Data Engineer & Migration Lead

I’ve spent the better part of two decades neck-deep in IBM DataStage, building and managing enterprise ETL. For the last ten, I've been on the other side of the fence, leading migrations to modern data platforms like Databricks. When people hear we migrated a critical business unit for a major retail bank from DataStage to Databricks in 90 days, their first reaction is usually disbelief. Their second is, "How?"

This isn't a marketing document. This is a candid, technical breakdown of how we did it. It was a forced march, filled with trade-offs, roadblocks, and hard-won lessons. Here’s the story from the trenches.

1. Introduction: The Burning Platform

Our client was a top-10 retail bank, running its core Customer 360 and Risk Analytics on an aging IBM Information Server stack. The environment was classic legacy:

  • Platform: IBM DataStage 8.7 running on-prem on AIX servers.
  • Database: A mix of Oracle, DB2, and a massive Teradata warehouse for analytics.
  • Scale: Approximately 3,200 DataStage jobs processing around 5-7 TB of new data daily.

The migration wasn't a choice; it was a necessity. The drivers were a perfect storm:
1. Legacy Risk: The AIX hardware was past its end-of-life. A hardware failure was no longer a risk, but an inevitability.
2. Crippling Cost: The annual renewal for IBM and Teradata was north of $2M, with a significant price hike on the horizon.
3. Business Stagnation: The business wanted to deploy ML models for fraud detection and customer churn. DataStage and the on-prem infrastructure were simply not built for the iterative, large-scale processing that modern machine learning requires.

The 90-day timeline was imposed by the hardware EOL contract. We had no room to negotiate. The scope was defined as the entire Customer 360 and Risk Analytics portfolio—the jobs that fed the bank's most critical marketing and regulatory reporting.

2. Migration Planning & Strategy: The Blueprint for Speed

You don't survive a 90-day migration without a ruthless, front-loaded planning phase and good automation. We had about 5-7 days to build a bulletproof plan.

Discovery and Inventory:
We couldn't afford to manually read 3,200 jobs. We used an automated discovery and assessment tool (in this case, Travinto X2X Analyzer) to parse the DataStage project export files (.dsx). This gave us an immediate, albeit imperfect, inventory:
  • Job names, types (Server, Parallel), and parameters.
  • Stage-level lineage (e.g., Oracle Connector -> Transformer -> Teradata Connector).
  • Dependencies and job sequencing.
  • Identification of complex patterns (e.g., extensive use of Transformer stage variables, BASIC routines, C++ functions).
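
We leaned entirely on the tool for this step, but for readers who want to sanity-check an export themselves, a rough first-pass sketch in Python might look like the following. It assumes the plain-text BEGIN DSJOB / Identifier / OLEType markers commonly seen in .dsx exports; the file name is hypothetical, and this is nowhere near what a real assessment tool does.

```python
import re
from collections import Counter
from pathlib import Path

# Very rough first-cut inventory of a DataStage .dsx export.
# Assumes the common plain-text markers (BEGIN DSJOB, Identifier, OLEType);
# a real assessment tool parses far more (links, parameters, sequences).
IDENT_RE = re.compile(r'^\s*Identifier "([^"]+)"', re.MULTILINE)
STAGE_RE = re.compile(r'^\s*OLEType "([^"]+)"', re.MULTILINE)

def inventory(dsx_path: str) -> dict:
    text = Path(dsx_path).read_text(errors="ignore")
    job_chunks = text.split("BEGIN DSJOB")[1:]        # one chunk per job definition
    result = {}
    for chunk in job_chunks:
        idents = IDENT_RE.findall(chunk)              # first Identifier is the job name
        stages = Counter(STAGE_RE.findall(chunk))     # record/stage type counts
        if idents:
            result[idents[0]] = dict(stages)
    return result

if __name__ == "__main__":
    for job, stages in inventory("CustomerProject.dsx").items():  # hypothetical file
        print(job, stages)
```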

Complexity Analysis and Prioritization:
With the inventory, we created a simple but effective quadrant analysis:

  • Quick Wins (low complexity, high business impact). Tool: Travinto X2XConverter. Strategy: Wave 1. Automate conversion, test, and get them ready first. Builds momentum.
  • The Monsters (high complexity, high business impact). Tool: Travinto X2XConverter. Strategy: Wave 2. Assign senior engineers and accept that manual refactoring is needed; the Travinto tool still got us to roughly 80% accuracy on these.
  • Challenge (high complexity, low business impact). Tool: Travinto Analyzer. Strategy: Push back. We went to the business and asked, "Do you really need this convoluted report that three people read?" We decommissioned ~50 jobs this way.
  • Backlog (low complexity, low business impact). Strategy: De-prioritize. Park these. If we have time, we'll do them. If not, they get decommissioned or handled post-migration.
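
The quadrant call was ultimately a human judgment per job, but we scripted the first pass over the inventory. A hypothetical helper along these lines captures the logic; the scoring field and thresholds are illustrative, not the tool's actual output.

```python
# First-pass quadrant assignment from the discovery inventory.
# complexity_score and business_impact are illustrative inputs we derived from
# stage counts and stakeholder interviews; the threshold is an arbitrary example.
def quadrant(complexity_score: int, business_impact: str) -> str:
    complex_job = complexity_score >= 50          # e.g. many stage variables / routines
    high_impact = business_impact == "high"
    if not complex_job and high_impact:
        return "Quick Win (Wave 1)"
    if complex_job and high_impact:
        return "Monster (Wave 2)"
    if complex_job and not high_impact:
        return "Challenge (push back / decommission)"
    return "Backlog (de-prioritize)"

print(quadrant(12, "high"))   # -> Quick Win (Wave 1)
print(quadrant(80, "low"))    # -> Challenge (push back / decommission)
```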

The 90-Day Phased Approach:
Our entire plan fit on a single slide:
  • Days 1-15: Foundation & Deep Dive. VPC/VNet setup, IAM roles, Unity Catalog setup, CI/CD pipeline skeleton (GitHub Actions), and validation of the automated discovery.
  • Days 15-45: Wave 1 - The Velocity Phase. Automated conversion of ~80% of jobs (the "Quick Wins"). Our junior-to-mid-level engineers focused here, validating the auto-generated PySpark code and running initial tests.
  • Days 45-75: Wave 2 - The Heavy Lift. Tackling "The Monsters." My senior-most engineers and I lived in this phase, manually refactoring the complex Transformer logic and custom routines that the tool could not convert, turning them into clean, maintainable PySpark. This is where parallel runs began in earnest.
  • Days 75-90: Validation, Freeze, & Cutover. Code freeze. Intensive parallel-run reconciliation. Performance tuning. Cutover planning and execution.

Tool Selection Rationale:
We evaluated a few options. Some were just "code spitters" that translated DataStage stages into raw, un-idiomatic PySpark. This was a non-starter. We needed an accelerator that understood context. We chose Travinto X2XConverter because it provided an end-to-end solution:
1. Discovery: Gave us the inventory and dependency map.
2. Conversion: Translated not just jobs but also sequences/workflows into Databricks Workflows, and .dsx job parameters into Databricks widgets. The code quality was surprisingly high for 70-80% of the patterns.
3. Validation: The tool provided a starting point for reconciliation by generating queries to check row counts and checksums, which we then integrated into our own QA framework.
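
To give a flavour of the parameter mapping in point 2: a DataStage job parameter surfaces as a notebook widget. A minimal, hypothetical example follows; the parameter and table names are made up, and dbutils and spark are provided by the Databricks notebook runtime.

```python
# Inside a Databricks notebook: DataStage job parameters become widgets.
# Parameter and table names below are hypothetical examples.
dbutils.widgets.text("prmRunDate", "", "Run date (yyyy-MM-dd)")
dbutils.widgets.text("prmSourceSchema", "CUST_STG", "Source schema")

run_date = dbutils.widgets.get("prmRunDate")
source_schema = dbutils.widgets.get("prmSourceSchema")

# The converted job reads its source using the widget values, much as the
# original job resolved its #prm...# parameters at runtime.
df = spark.table(f"{source_schema}.customer_daily").where(f"load_date = '{run_date}'")
```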

For a 90-day project, an integrated tool that handles discovery-to-validation is a force multiplier. Doing it all manually would have been impossible.

3. Target Architecture & Design: Keep It Simple

Speed demands simplicity. We adopted a standard, no-frills Databricks architecture.

  • Medallion Architecture:
    • Bronze: Raw data landed as-is from source systems (Oracle, DB2, files) into Delta tables. Structure was inferred or lightly defined.
    • Silver: Cleansed, de-duplicated, conformed data. This is where most of our PySpark transformations ran. Data types were strictly enforced.
    • Gold: Business-level aggregates, feature stores, and reporting tables. Ready for consumption by BI tools (Power BI) and data science teams.
  • Orchestration: We used Databricks Workflows exclusively. I made the call to avoid introducing Airflow or another external orchestrator. Why? It would add another point of failure and a learning curve we couldn't afford. Databricks Workflows were good enough for our dependency chains.
  • Governance: We implemented Unity Catalog from day one. It was non-negotiable. It gave us a central metastore, lineage out-of-the-box for notebooks, and a clear security model (ACLs). Trying to bolt on governance later is a recipe for disaster.
  • CI/CD: A simple setup using Databricks Repos and GitHub Actions. A push to the main branch would trigger a workflow to deploy the notebooks and job JSON definitions to the Databricks workspace.
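
As a simplified illustration of the Bronze-to-Gold flow above, a typical Silver/Gold job looked roughly like the sketch below. Catalog, schema, table, and column names are hypothetical; Unity Catalog three-part naming is assumed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # already provided by the Databricks runtime

# Bronze: raw customer records landed as-is from the source system.
bronze = spark.table("main.bronze.customer_raw")            # hypothetical table names

# Silver: cleanse, de-duplicate, and enforce types.
silver = (
    bronze
    .withColumn("customer_id", F.col("customer_id").cast("bigint"))
    .withColumn("open_date", F.to_date("open_date", "yyyy-MM-dd"))
    .filter(F.col("customer_id").isNotNull())
    .dropDuplicates(["customer_id"])
)
silver.write.mode("overwrite").saveAsTable("main.silver.customer")

# Gold: business-level aggregate ready for BI consumption.
gold = silver.groupBy("branch_code").agg(F.countDistinct("customer_id").alias("customer_count"))
gold.write.mode("overwrite").saveAsTable("main.gold.customer_by_branch")
```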

4. Execution Approach: The Factory Model

We treated the migration like an assembly line.

  • Automated vs. Manual: The conversion tool handled about 80% of the transformation logic. The remaining 20%—the complex, messy, undocumented business logic buried in Transformer stages—took 80% of our manual effort. My most senior people focused exclusively on this 20%.
  • Handling Complexity: A common "monster" was a DataStage job with a Transformer stage containing dozens of stage variables that acted as a state machine, processing rows in a specific order. This imperative, row-by-row logic is a complete anti-pattern in Spark's declarative, distributed model.
    • Our Solution: We had to re-architect this. We used Spark's Window functions (lag, lead, row_number) to create the necessary context across rows (a simplified sketch follows at the end of this section). In a few rare, painful cases, we had to collect() a small dataset to the driver, process it in a Python loop, and then parallelize the result. We avoided UDFs like the plague for performance reasons, except for trivial string manipulations.
  • Parallel Runs: This was our safety net and the core of our QA strategy. Starting on Day 45, we set up a parallel pipeline.
    1. Daily source data was duplicated. One copy went to the legacy DataStage environment.
    2. The second copy was ingested into our Databricks Bronze layer.
    3. Both the DataStage jobs and the new Databricks Workflows ran against the same source data.
    4. Our automated reconciliation framework (more below) compared the final target tables in Teradata (from DataStage) and our Gold Delta tables (from Databricks) every single night.
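
To make the Transformer-to-window-function rewrite described under "Handling Complexity" concrete, here is a simplified sketch of the pattern. The table and column names are hypothetical; the point is that lag and row_number replace the row-by-row state that the stage variables used to carry.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("main.silver.account_txn")                 # hypothetical table

# DataStage pattern: stage variables comparing the current row to the previous
# row per account (ordered by transaction date) to flag balance changes.
w = Window.partitionBy("account_id").orderBy("txn_date")

result = (
    df
    .withColumn("prev_balance", F.lag("balance").over(w))    # previous row's value
    .withColumn("row_seq", F.row_number().over(w))            # ordering within each account
    .withColumn(
        "balance_changed",
        F.when(F.col("prev_balance").isNull(), F.lit(False))
         .otherwise(F.col("balance") != F.col("prev_balance")),
    )
)
```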

5. Tools & Frameworks We Used

  • Conversion Accelerator: Travinto X2X Toolset. Most valuable contribution: Saved us thousands of hours of manual translation and provided the initial dependency graph.
  • Orchestration: Databricks Workflows.
  • QA & Reconciliation Framework: This was our secret weapon. We built a custom PySpark framework that used great_expectations for assertions. Every night, it ran a series of checks:
    1. Row Count Match: Simple, but catches major errors.
    2. Column Checksum: SHA2 hash on a concatenation of all columns (or key columns for wide tables) to check for data differences.
    3. Numeric Measure Aggregation: SUM(), AVG(), MIN(), MAX() on all numeric columns. We allowed for a tiny tolerance (e.g., 0.001%) on floating-point numbers.
      The framework generated an HTML report with a clear Red/Green status for every table, which was posted to a shared Teams channel. This gave everyone, including business stakeholders, immediate visibility.
  • Monitoring: Default Databricks monitoring (Ganglia, Spark UI) coupled with custom alerts on job failures/duration thresholds pushed to PagerDuty.
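
Stripped of the great_expectations wrapper and the HTML reporting layer, the core nightly checks in that framework boiled down to something like the following sketch. Table names and the tolerance are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

legacy = spark.table("recon.teradata_customer_gold")   # legacy output staged for comparison
new = spark.table("main.gold.customer")                # hypothetical Databricks Gold table

# 1. Row count match: simple, but catches major errors.
assert legacy.count() == new.count(), "Row count mismatch"

# 2. Column checksum: per-row SHA2 over all columns, compared as multisets.
def with_row_hash(df):
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("<null>")) for c in sorted(df.columns)]
    return df.select(F.sha2(F.concat_ws("||", *cols), 256).alias("row_hash"))

diff = with_row_hash(legacy).exceptAll(with_row_hash(new)).count()
assert diff == 0, f"{diff} rows differ by checksum"

# 3. Numeric measure aggregation with a small floating-point tolerance.
legacy_sum = legacy.agg(F.sum("balance")).first()[0]
new_sum = new.agg(F.sum("balance")).first()[0]
assert abs(legacy_sum - new_sum) <= abs(legacy_sum) * 1e-5, "Aggregate drift on balance"
```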

6. Challenges & Roadblocks: The Real Story

It was not a smooth ride. Here are a few "war stories":

  • Technical Challenge: The Dreaded DECIMAL(38,x): DataStage and Teradata handled high-precision decimals flawlessly. Spark, at the time, had some inconsistencies in how DecimalType was handled, especially with potential for precision loss during transformations.
    • Mitigation: We had to be absolutely religious about our schemas. We defined schemas explicitly, double-checking the precision and scale at every step. We wrote unit tests specifically to pass a high-precision decimal through a workflow and ensure it came out identical.
  • Operational Challenge: Team Friction & Skepticism: The existing DataStage support team felt their jobs were at risk. They were skeptical and, at times, uncooperative. They knew the undocumented logic that we desperately needed.
    • Mitigation: I made it my personal mission to win them over. I didn't position it as "your tool is old and bad." I positioned it as, "You are the business logic experts. We are the Spark experts. We can't succeed unless you teach us what this job is supposed to do." We embedded two of their most senior members directly into our migration team. It was slow, but it built trust.
  • Performance Challenge: "Why is Spark Slower?" Early on, stakeholders would point to a Databricks job that took 60 minutes while the "equivalent" DataStage job took 30.
    • Mitigation: Education and tuning. I had to explain that a "lift-and-shift" conversion is not optimized. We held tuning sessions where we'd use the Spark UI to show a terrible shuffle on a 2TB table, then fix it by implementing a broadcast join or re-partitioning, cutting the runtime by 75%. Proving the platform's capability with tangible results was key.
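
The fixes coming out of those tuning sessions were usually mundane. Here is a simplified before/after of the broadcast-join and re-partitioning changes we made; table names and sizes are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

txns = spark.table("main.silver.transactions")      # ~2 TB fact table (hypothetical)
branches = spark.table("main.silver.branch_dim")    # a few thousand rows

# Before: a plain join let Spark shuffle both sides across the cluster.
slow = txns.join(branches, "branch_code")

# After: broadcasting the small dimension avoids shuffling the large fact table.
fast = txns.join(F.broadcast(branches), "branch_code")

# Re-partitioning before a heavy transformation increased parallelism when the
# source produced too few, oversized partitions.
evenly_spread = txns.repartition(800)
```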

7. Validation & Quality Assurance

Our reconciliation framework was the bedrock of trust. A job wasn't "done" when the code ran. It was done when it passed three consecutive daily parallel runs with a "Green" status.

  • Unit Tests: For complex transformation logic, we wrote notebook-based unit tests using sample input DataFrames and asserting the output.
  • Regression Tests: The nightly parallel run was our regression suite. If a code change in one job broke a downstream table, our reconciliation report would catch it within hours.
  • Handling Exceptions: We didn't aim for 100% reconciliation on day one. We categorized discrepancies:
    • Blockers: Mismatched row counts or checksums on critical ID columns. All hands on deck to fix.
    • Warnings: Minor floating-point differences in averages. We documented, got business sign-off, and moved on.
    • Known Gaps: Differences due to logic we intentionally changed (e.g., replacing an old, buggy currency conversion with a new, correct one). These were documented and approved.
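
As an example of the notebook-based unit tests mentioned at the top of this section, here is a simplified version of the DECIMAL(38,x) check from section 6. The schema and the transformation under test are illustrative.

```python
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: never let Spark infer precision on money columns.
schema = StructType([StructField("exposure_amt", DecimalType(38, 10), nullable=False)])

sample = spark.createDataFrame(
    [(Decimal("12345678901234567890123456.1234567890"),)], schema
)

# The transformation under test (identity here; in practice the Silver logic).
result = sample.selectExpr("exposure_amt")

out = result.first()["exposure_amt"]
assert result.schema["exposure_amt"].dataType == DecimalType(38, 10), "Precision/scale changed"
assert out == Decimal("12345678901234567890123456.1234567890"), "Value drifted"
```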

8. Performance & Cost Optimization

Initially, our costs were high. Our first full parallel run was alarmingly expensive.

  • Cluster Sizing: We quickly moved from a few large, all-purpose clusters to right-sized, ephemeral job clusters for each workflow.
  • Workload Tuning: We identified the top 10 most expensive jobs. For each one, a senior engineer was tasked with optimizing it. Common wins included:
    • Switching from large i3.2xlarge instances to smaller, more numerous r5d.xlarge instances to increase parallelism.
    • Aggressive caching (.cache()) of dataframes that were used multiple times.
    • Rewriting Python UDFs into native Spark SQL functions.
  • Cost Reduction: Through this focused effort, we reduced our projected daily run cost by over 60% from the initial, un-optimized state, bringing it well under the cost of running the legacy stack.
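
The UDF rewrites in particular were cheap wins. A simplified example of the kind of change we made; column and table names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.table("main.silver.customer")              # hypothetical table

# Before: a Python UDF, which forces row-by-row serialization between the JVM and Python.
@F.udf(returnType=StringType())
def clean_name(name):
    return name.strip().upper() if name else None

slow = df.withColumn("name_clean", clean_name("full_name"))

# After: the same logic in native Spark functions, which stays in the JVM
# and benefits from Catalyst optimization and code generation.
fast = df.withColumn("name_clean", F.upper(F.trim(F.col("full_name"))))
```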

9. Results & Outcomes

On Day 88, we executed the final cutover.
  • Jobs Migrated: 3,050 jobs were successfully migrated and running in production. 50 jobs were decommissioned.
  • Business Continuity: The cutover occurred over a weekend. On Monday morning, all business reports were delivered on time, meeting all SLAs. There was zero business disruption.
  • Data Accuracy: Our reconciliation reports showed >99.9% data consistency on all critical financial and customer metrics.
  • Strategic Benefits: Within 45 days of the migration, the data science team had deployed their first customer churn prediction model on the new platform—a project that had been on the roadmap for two years. The bank retired its AIX servers and decommissioned its Teradata appliance six months later, realizing over $2M in annual savings.

10. Lessons Learned & My Recommendations

  • Ruthless Prioritization is Everything. Don't try to migrate everything. Challenge, simplify, and decommission. Your biggest enemy in a tight timeline is scope creep.
  • Invest in Automated Validation. Don't trust manual checks. An automated reconciliation framework that runs daily is not a nice-to-have; it's the only way to move fast without breaking things. It builds trust with stakeholders and allows your team to focus on fixing, not finding, problems.
  • Don't Just Translate; Modernize. A "lift-and-shift" of DataStage logic into PySpark will create a slow, expensive, and unmaintainable mess. Use the migration as an opportunity to re-architect for the distributed paradigm. Replace imperative logic with declarative window functions.
  • A Blended Team is Crucial. You need your DataStage veterans who understand the "why" behind the business logic, sitting right next to your Spark experts who know "how" to implement it efficiently on the new platform. A siloed approach will fail.
  • One Decider. A project this fast needs a single point of authority (in this case, me) who can make rapid decisions on technical trade-offs, priority calls, and resource allocation without waiting for a committee.

11. Executive Summary / CXO Takeaways

For the senior leaders evaluating a similar project, here's the bottom line:

Migrating a core ETL function from DataStage to Databricks in 90 days is not only possible, but it can be a massive strategic win. Our project retired $2M+ in annual legacy costs, eliminated critical hardware risks, and unlocked new AI/ML capabilities that directly impact revenue and customer retention.

Success wasn't based on magic, but on a pragmatic, disciplined approach:
1. De-risk with Parallel Runs: We proved the new system worked before switching it on by running it side-by-side with the old one, ensuring zero business disruption.
2. Invest in Acceleration: Using an end-to-end migration tool and building a custom validation framework were investments, not costs. They were the primary drivers of our speed and quality.
3. Empower a Focused Team: This was a special-forces operation, not a committee-led program. It required executive air cover, a clear mandate, and an empowered lead to make the tough calls.

A rapid migration is a high-intensity endeavor, but the payoff is the immediate realization of the benefits of a modern data platform, leapfrogging years of slower, incremental change.