From Legacy ETL to Modern Lakehouse: My DataStage to Databricks Journey
Published on: November 22, 2025 02:40 PM
The hum of the server room used to be the soundtrack of my career. For over a decade, I was a DataStage specialist. I knew its quirks, its power, and its limitations like the back of my hand. IBM DataStage was the undisputed king of enterprise ETL, a reliable workhorse that processed terabytes for the world’s biggest companies. We built our entire data warehousing practice on it.
But the hum started to sound more like a groan. Our nightly batch windows, once a comfortable 8 hours, were shrinking. Business users, now accustomed to the instant gratification of the web, couldn't understand why a new report took six weeks to build. Our licensing and hardware maintenance costs were astronomical. DataStage, our fortress of data processing, had become a cage. It was powerful but rigid, stable but slow to evolve.
The business wanted agility, self-service, and a path to machine learning. We, the data team, wanted to break free from proprietary tools, embrace open standards, and scale on demand. The answer was a modern cloud data platform. After a thorough evaluation, we chose the Databricks Lakehouse. This is the story of that journey—the good, the bad, and the brutally honest lessons we learned migrating over 1,200 critical DataStage jobs to a new world.
1. The Sobering Reality: Planning & Discovery
You can't migrate what you don't understand. Our first and most critical phase was a deep, honest discovery. Our DataStage environment was a 15-year-old city with gleaming skyscrapers, forgotten alleyways, and a lot of undocumented plumbing.
Inventory and Dependency Mapping:
We started with the naive assumption that we could just get a list of jobs from the DataStage Director. We were wrong. We had jobs triggered by file drops, others by enterprise schedulers like Control-M, and some that were run manually by "that one person in finance" once a quarter.
This is where automated discovery tools became non-negotiable. We brought in LeapLogic to scan our entire DataStage project. It parsed the XML exports of our jobs, scheduler logs, and scripts to build a comprehensive inventory and, most importantly, a dependency graph. The output was both illuminating and terrifying. We discovered jobs we thought were long-decommissioned were still feeding critical, albeit obscure, downstream reports.
Complexity and Risk Analysis:
With a full inventory, we categorized every job on two axes: Complexity and Business Criticality.
| | Low Business Criticality | High Business Criticality |
|---|---|---|
| Low Complexity | Phase 1: Quick Wins (Simple file-to-table loads) | Phase 2: Core Migration (Standard transformations, lookups) |
| High Complexity | Phase 3: The Tough Nuts (Custom routines, complex pivots, slowly changing dimensions) | Phase 4: The Crown Jewels (High-risk, high-complexity financial reporting) |
This simple quadrant became our migration bible. It allowed us to show incremental progress with Phase 1, build confidence and new patterns with Phase 2, and save our "A-team" for the really hairy stuff in Phases 3 and 4.
The Phasing Strategy: Avoid the "Big Bang"
Leadership initially wanted a "lift and shift" over a long weekend. I fought this, hard. A "big bang" migration for a system this complex is a recipe for catastrophic failure. Instead, we adopted a "strangler fig" pattern, migrating one business domain at a time (e.g., Sales Analytics, then Supply Chain). This let us run the old and new systems in parallel for a specific domain, get business sign-off, and then "strangle" the old DataStage jobs for that domain, freeing up resources and reducing risk.
2. The Blueprint: Our Databricks Lakehouse Architecture
We didn't just want to replicate our old warehouse in the cloud. We wanted to build something fundamentally better. We embraced the Medallion Architecture.
- Bronze Layer: The landing zone. Raw data, ingested as-is from sources, stored in Delta Lake format. This gave us schema evolution and time travel right out of the box—a huge improvement over our old file-based staging areas.
- Silver Layer: The validation and cleansing layer. Here, we applied data quality rules, conformed data types, and joined key datasets. These tables were our new source of truth for departmental analytics.
- Gold Layer: The destination. Highly aggregated, project-specific data marts ready for BI and reporting. These tables directly fed our Power BI dashboards and executive reports.
Orchestration & Metadata:
We replaced our complex web of Control-M and DataStage sequences with Databricks Workflows. The ability to define a DAG (Directed Acyclic Graph) of notebooks, Python scripts, and dbt models in a single workflow was a game-changer for clarity and maintenance.
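To make that concrete, here is a hedged sketch of what such a DAG looks like as a Databricks Jobs API 2.1 payload, built as a plain Python dict. The job name, task keys, and notebook paths are illustrative placeholders, not our actual pipeline:

```python
# Sketch of a Databricks Jobs API 2.1 payload for a medallion-style DAG.
# Job name, task keys, and notebook paths are illustrative placeholders.

def make_task(key, notebook_path, depends_on=None):
    """Build one task entry; depends_on wires up the DAG edges."""
    task = {"task_key": key, "notebook_task": {"notebook_path": notebook_path}}
    if depends_on:
        task["depends_on"] = [{"task_key": d} for d in depends_on]
    return task

job_payload = {
    "name": "sales_analytics_daily",
    "tasks": [
        make_task("bronze_ingest", "/pipelines/sales/bronze_ingest"),
        make_task("silver_conform", "/pipelines/sales/silver_conform",
                  depends_on=["bronze_ingest"]),
        make_task("gold_aggregate", "/pipelines/sales/gold_aggregate",
                  depends_on=["silver_conform"]),
    ],
}
```

A payload like this can be submitted to the Jobs API create endpoint, or (better) checked into Git and deployed through CI/CD so the DAG itself is versioned.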
For governance, we went all-in on Unity Catalog. Early in my Databricks journey, we managed permissions with workspace-level controls, and it was chaos. Unity Catalog gave us a centralized place for access control, audit logging, and—most importantly—data lineage. Seeing a visual graph from a Gold table all the way back to a Bronze ingestion notebook saved us countless hours in debugging.
CI/CD:
This was a cultural shift. DataStage development was a GUI-driven, point-and-click affair. We moved to a code-first world, storing all our notebooks and PySpark code in GitHub. We used GitHub Actions and the Databricks CLI to automate the deployment of our code from dev to UAT to production workspaces. It was a steep learning curve for the team, but it made our process repeatable and far less error-prone.
3. The Messy Middle: Migration Execution
This is where the rubber meets the road. Our strategy was "automate what you can, perfect what you must."
Automated Conversion:
We used LeapLogic to do the first pass of conversion. It took our DataStage jobs (DSX files) and translated them into PySpark notebooks. My rule of thumb, based on several of these projects, is that automation gets you 70-80% of the way there for 80% of your jobs.
It was fantastic at converting basic source-to-target mappings, simple filters, and standard joins. But it struggled with:
* Complex Transformer Stages: DataStage's Transformer stage allows for intricate, nested if-then-else logic and proprietary functions. These often translated to messy, hard-to-read PySpark that needed significant manual refactoring.
* Custom Routines: We had a library of C++ and Basic routines we'd built over the years. These had to be rewritten from scratch in Python or Scala.
* Lookup Stages: A simple lookup was fine, but lookups with complex conditions or "return multiple rows" logic required careful re-implementation using Spark broadcasts or different join strategies.
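To make the "return multiple rows" pitfall concrete, here is a minimal plain-Python model of the semantics (the column names are invented for illustration). In Spark this becomes a left join, with a broadcast hint applied when the lookup side is small:

```python
# Plain-Python model of DataStage's "return multiple rows" lookup semantics.
# In PySpark this maps to a (broadcast) left join, which fans one input row
# out into N output rows when the lookup key matches N entries.
from collections import defaultdict

def multi_row_lookup(rows, lookup_rows, key):
    """For each input row, emit one merged row per matching lookup row."""
    index = defaultdict(list)
    for lk in lookup_rows:
        index[lk[key]].append(lk)
    out = []
    for row in rows:
        matches = index.get(row[key], [None])  # [None] keeps unmatched rows (left join)
        for m in matches:
            merged = dict(row)
            if m:
                merged.update(m)
            out.append(merged)
    return out

orders = [{"cust_id": 1, "amount": 100}, {"cust_id": 2, "amount": 50}]
contacts = [{"cust_id": 1, "email": "a@x.com"}, {"cust_id": 1, "email": "b@x.com"}]
result = multi_row_lookup(orders, contacts, "cust_id")
# cust_id 1 fans out into two rows; cust_id 2 passes through unmatched
```

The fan-out is exactly what a naive one-hit-per-key re-implementation silently drops, which is why these jobs needed careful review rather than mechanical conversion.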
Parallel Runs & Validation:
This was the most nerve-wracking phase. For each domain, we ran the new Databricks pipeline in parallel with the old DataStage job, loading into a separate set of validation tables.
Our secret weapon here wasn't a fancy tool; it was a custom PySpark reconciliation framework we built. It would:
1. Connect to both the legacy database (fed by DataStage) and the new Delta table.
2. Perform row count comparisons.
3. Perform checksums/hashes on key columns for a sample of rows.
4. Run a MINUS query (or the DataFrame equivalent) to find exact row-level differences.
5. Generate a detailed report of any discrepancies.
This automated validation was the only way we could get business sign-off with confidence.
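The core of that framework can be sketched in a few lines of plain Python. In production these were PySpark DataFrame operations against JDBC and Delta sources; the table shapes and column names here are invented for illustration:

```python
# Simplified sketch of the reconciliation checks; production code ran these
# as PySpark DataFrame operations against the legacy database and Delta tables.
import hashlib

def row_hash(row, key_columns):
    """Stable checksum over the key columns of one row."""
    payload = "|".join(str(row[c]) for c in key_columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def reconcile(legacy_rows, new_rows, key_columns):
    """Compare row counts and row-level hashes; return a discrepancy report."""
    legacy_hashes = {row_hash(r, key_columns) for r in legacy_rows}
    new_hashes = {row_hash(r, key_columns) for r in new_rows}
    return {
        "count_match": len(legacy_rows) == len(new_rows),
        "only_in_legacy": len(legacy_hashes - new_hashes),  # the MINUS query
        "only_in_new": len(new_hashes - legacy_hashes),
    }

legacy = [{"id": 1, "amt": "10.50"}, {"id": 2, "amt": "7.00"}]
new = [{"id": 1, "amt": "10.50"}, {"id": 2, "amt": "7.01"}]  # drifted value
report = reconcile(legacy, new, ["id", "amt"])
# counts match, but one row lands on each side of the MINUS
```

The set differences are the DataFrame equivalent of the MINUS query in step 4; anything nonzero went straight into the discrepancy report.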
4. Our Toolkit: Frameworks and Tools That Delivered
| Purpose | Tool(s) Used | My Takeaway |
|---|---|---|
| Discovery & Inventory | LeapLogic | Essential. Don't even attempt a large-scale migration without one. The dependency graph alone is worth the investment. |
| Code Conversion | LeapLogic | A massive accelerator, but set expectations. It's a starting point, not a magic button. Budget for manual refactoring. |
| Data Validation | Custom PySpark Framework | The unsung hero. Building this framework was the single best technical decision we made. It provided irrefutable proof of correctness. |
| Orchestration | Databricks Workflows, Airflow | We started with Workflows for simplicity. For hyper-complex, cross-system dependencies, Airflow still has an edge, but Workflows cover 95% of use cases. |
| Data Quality | Great Expectations (GE) | We embedded GE checks directly into our Silver pipelines. It helped us catch data issues from sources before they polluted our Gold tables. |
| CI/CD | GitHub Actions, Databricks CLI | The foundation of our new MLOps/DataOps culture. Non-negotiable for any serious data engineering team. |
5. Scars and Wisdom: Challenges & Lessons Learned
No migration is smooth. Here’s where we stumbled, and what we learned.
- The Devil is in the Data Types: The classic `Decimal(38,10)` in Oracle vs. Spark's `DecimalType`. DataStage was very forgiving with implicit type casting. Spark is not. We spent weeks chasing down tiny rounding differences in financial data caused by mismatched precision and scale. Lesson: Create a "type mapping" standard early and enforce it ruthlessly.
- Debugging is a Different Beast: In DataStage, you could set a breakpoint, watch data flow through a link, and visually inspect rows. In Databricks, you're debugging a distributed system. You have to learn to love reading the Spark UI, digging into driver and executor logs, and writing defensive code. Lesson: Train your team on Spark debugging before the migration starts. It's a completely different skill set.
- The "Simple" Jobs Were the Hardest: One of our biggest surprises was that the "simple" file-to-table jobs caused the most headaches. Why? They were often the oldest, poorly documented, and carried implicit business logic nobody remembered. One job was failing because it expected a file with a specific `CRLF` line ending from a mainframe, a detail lost to time. Lesson: Assume nothing is simple. The least-documented parts of your system are the most dangerous.
- Process & People are Harder than Tech: The biggest challenge wasn't the code; it was coordinating between teams. Getting the DataStage team, the new Databricks team, QA, and business stakeholders all aligned for parallel runs and cutovers was a project management feat. Lesson: Have a dedicated migration lead whose primary job is communication and coordination.
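A hedged sketch of what such a type-mapping standard can look like (the mappings and the helper shown are illustrative, not our full production table):

```python
# Illustrative type-mapping standard: Oracle/DataStage source types to Spark
# types, enforced at Bronze -> Silver so precision/scale drift is caught early.
from decimal import Decimal, ROUND_HALF_UP

TYPE_MAP = {
    "NUMBER(38,10)": "DecimalType(38, 10)",
    "VARCHAR2": "StringType()",
    "DATE": "TimestampType()",  # Oracle DATE carries a time component
}

def enforce_decimal(value, scale=10):
    """Quantize to the agreed scale so legacy and new pipelines round alike."""
    q = Decimal(1).scaleb(-scale)  # e.g. 1E-10 for scale 10
    return Decimal(str(value)).quantize(q, rounding=ROUND_HALF_UP)
```

Pinning the rounding mode in one place matters as much as the type mapping itself; the weeks we lost to rounding differences came from each pipeline quietly choosing its own.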
6. Life After Migration: Optimization
Getting the jobs to run correctly was victory. Getting them to run efficiently and cheaply was the war.
- Performance Tuning: We learned to love the Spark UI. We identified skewed joins, reduced shuffling by partitioning our Delta tables on the right keys, and enabled Photon, Databricks' vectorized C++ engine, across our compute. A job that took 2 hours was tuned down to 15 minutes.
- Cost Optimization: We moved 90% of our workloads to job clusters that terminate upon completion. We aggressively used spot instances for non-critical workloads, saving up to 70% on compute costs. Auto-scaling policies were tuned to scale up quickly but, more importantly, to scale down aggressively when idle.
- Data Quality Monitoring: We set up dashboards in Databricks SQL to monitor the output of our Great Expectations checks. If a source system suddenly started sending 50% `NULL`s in a key column, an alert was fired before the business user saw a broken dashboard.
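The check behind that kind of alert is simple at its core. A minimal sketch (thresholds and column sample are illustrative; our production version read Great Expectations results rather than raw lists):

```python
# Minimal null-rate check of the kind behind those alerts; the production
# version consumed Great Expectations results, but the core math is this.

def null_rate(values):
    """Fraction of None values in a column sample."""
    if not values:
        return 0.0
    return sum(1 for v in values if v is None) / len(values)

def should_alert(values, threshold=0.5):
    """Fire an alert when the null rate meets or exceeds the threshold."""
    return null_rate(values) >= threshold

customer_ids = [101, None, 103, None]  # 50% nulls -> alert fires
```

The important design choice is firing on the rate, not the absolute count, so the check behaves the same on a 10,000-row delta and a 10-million-row backfill.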
7. The Payoff: Outcomes & Impact
After 18 months of focused effort, the results were transformative.
- Successfully Migrated: Over 98% of the target jobs (around 1,180) were migrated and running in production on Databricks.
- SLA Adherence: Our main financial reporting pipeline's SLA was reduced from a 6-hour batch window to a 45-minute, event-driven process.
- Cost Savings: We decommissioned multiple DataStage servers and retired seven-figure annual licensing and support contracts. Even with our Databricks consumption, the TCO was significantly lower.
- Strategic Advantage: This wasn't just a cost-saving exercise. Within six months of a stable Lakehouse, the data science team built and deployed a customer churn prediction model directly on the same Silver-layer data our BI team was using. That simply would not have been possible before. We went from being the "data janitors" to being enablers of innovation.
8. My Advice to You: Key Takeaways
If you're about to embark on this journey, here’s my advice, distilled from experience.
For Engineers & Architects:
1. Don't Trust, Verify: Don't trust the automated conversion tools blindly. Use them as an accelerator, but every single line of complex logic needs to be reviewed and understood.
2. Master Your New Tools: Become an expert in the Spark UI. Learn how distributed systems fail. This is not your old world.
3. Build a Reconciliation Framework First: Before you migrate a single job, build the tool that proves your migration is correct. This will be your safety net and your source of truth.
For Leadership & CXOs:
1. This is a Business Transformation: Frame this as an investment in agility and future capabilities, not just an IT cost-cutting project. The real ROI comes from what you can do after the migration.
2. Invest in Your People: Your DataStage experts are your most valuable asset. They hold the business logic in their heads. Invest in training them on Databricks, Python, and Spark. You're not replacing them; you're upskilling them.
3. Be Realistic: This will take longer and be more complex than you think. There will be setbacks. An iterative, phased approach that celebrates small wins is the only way to maintain momentum and morale.
Leaving DataStage felt like leaving a city I'd helped build. But the view from the Databricks Lakehouse is expansive. We traded walled gardens for open plains, rigidity for agility, and limitations for possibilities. The journey was tough, but I'd do it again in a heartbeat. The hum of the server room has been replaced by the quiet, scalable power of the cloud—and that’s a much better sound.