Published on: January 02, 2026 05:33 PM

Metadata-Driven Migration from DataStage to Databricks: A Principal Engineer's Playbook

Introduction

I've been in the DataStage world for over 15 years and have spent the last decade leading enterprise-scale migrations to modern platforms like Databricks. The single biggest mistake I see organizations make is treating a migration as a massive, manual rewrite effort. They throw teams of engineers at the problem, who then spend months or years playing a game of "whack-a-mole" with thousands of individual ETL jobs. This approach is slow, expensive, prone to error, and nearly impossible to audit.

There is a better way.

The blueprint for your new Databricks platform is already written; it’s just locked away inside your DataStage environment. It's the metadata—the definitions of your jobs, your stages, your data flows, and your dependencies. By systematically extracting, analyzing, and leveraging this metadata, we can build an automation framework that drives the entire migration lifecycle. This isn't about a magic "converter" button; it's about engineering a repeatable, auditable, and scalable migration factory.

This guide is my playbook, forged from experience on multiple large-scale projects. It’s for the technical leaders who need to execute this migration and want to do it right.

1. What is Metadata-Driven Migration?

At its core, a metadata-driven migration is an engineering approach where we treat the definitions of the legacy ETL jobs as our primary source code. Instead of manually inspecting each DataStage job and rewriting it by hand, we programmatically parse the DataStage metadata and use it to automate discovery, code generation, orchestration, and validation.

Key Benefits:

  • Reduces Risk: Automation eliminates human error in translation. By applying a consistent set of rules, you ensure that similar DataStage patterns are always converted to the same, standardized Databricks pattern. This consistency is a massive risk reducer.
  • Improves Accuracy & Consistency: Manual rewrites introduce developer bias and inconsistencies. A metadata-driven approach enforces a single, approved way of implementing a specific pattern (e.g., a lookup, an aggregation) in Databricks.
  • Accelerates Migration: While the initial setup of the automation framework requires an upfront investment, the payback is enormous. Once the framework is built, you can convert hundreds or thousands of jobs in a fraction of the time it would take manually. We're talking about converting the bulk of your inventory in weeks, not years.
  • Creates an Audit Trail: Every step is programmatic. We can log which job was converted, what rules were applied, the generated code, and its validation status. This creates a fully auditable path from the original DataStage job to the new Databricks workflow.

Limitations and Considerations:

Let's be realistic: this is not a 100% automation solution. The goal is to apply the 80/20 rule. We aim to automatically migrate the 80% of jobs that follow common, repeatable patterns. The remaining 20% (highly complex, one-off "snowflake" jobs with convoluted business logic, custom routines, or poor design) will require manual intervention. Our metadata analysis will help us identify and isolate these jobs early.

2. Metadata Extraction

The entire process hinges on getting clean, structured metadata out of DataStage. The most reliable and universally available source is the DataStage Export file (.dsx).

  1. Bulk Export: The first step is to script the bulk export of all your DataStage jobs using the command-line export utilities (dscmdexport for .dsx files, or istool export for .isx archives). Organize these exports logically, for example, by project and folder.
  2. Parsing the .dsx Files: A .dsx file is a structured text file in IBM's proprietary block format (jobs can also be exported as XML). You need to build a robust parser; Python is my weapon of choice here, using xml.etree.ElementTree for XML exports and regex/custom parsing for the native .dsx format. A minimal parsing sketch appears after the JSON example below.
  3. Core Metadata Entities to Capture:
    Your parser must be able to identify and extract the following into a structured format (like a series of related JSON objects or database tables):
    • Jobs: Job Name, Job Type (Parallel, Sequence), Description.
    • Stages: Stage Name, Stage Type (e.g., Db2Connector, Transformer, Aggregator, Join), Stage-specific properties (SQL queries, file paths, join keys, aggregation logic).
    • Links: Source Stage, Target Stage, Column Mappings between stages. This is critical for building lineage.
    • Parameters & Parameter Sets: Parameter Name, Prompt, Default Value. We need this to parameterize our Databricks jobs.
    • Custom Transformations: The expression logic inside Transformer stages. This is often the most complex piece. We need to capture the exact derivation expressions (IF...THEN...ELSE, string manipulations, arithmetic).
    • Dependencies: For Sequence jobs, extract the execution order, triggers, and conditional logic (e.g., "Run Job B only if Job A succeeds").

Here's a simplified JSON representation of what we aim to extract for a single job:

    {
      "job_name": "JB_Load_Sales_Fact",
      "job_type": "Parallel",
      "parameters": [
        { "name": "p_RUN_DATE", "default": "2023-10-27" }
      ],
      "stages": [
        {
          "name": "Read_Staging_Sales",
          "type": "OracleConnector",
          "properties": { "query": "SELECT * FROM STG_SALES WHERE TXN_DATE = '#p_RUN_DATE#'" }
        },
        {
          "name": "Lookup_Product_Dim",
          "type": "Lookup",
          "properties": { "lookup_keys": ["PRODUCT_ID"], "reference_data": "V_DIM_PRODUCT" }
        },
        {
          "name": "XFM_Sales_Logic",
          "type": "Transformer",
          "properties": {
            "derivations": [
              { "column": "TOTAL_AMOUNT", "expression": "Input.QUANTITY * Input.UNIT_PRICE" }
            ]
          }
        },
        {
          "name": "Load_Sales_Fact",
          "type": "ODBCConnector",
          "properties": { "table": "F_SALES", "write_mode": "Append" }
        }
      ],
      "links": [
        { "source": "Read_Staging_Sales", "target": "Lookup_Product_Dim" },
        { "source": "Lookup_Product_Dim", "target": "XFM_Sales_Logic" },
        { "source": "XFM_Sales_Logic", "target": "Load_Sales_Fact" }
      ]
    }
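
To give a feel for the parsing step, here is a minimal sketch of a regex-based DSX parser skeleton. The BEGIN DSJOB / END DSJOB markers and the Identifier field are typical of .dsx exports, but verify them against your own files; the stage, link, and parameter rules are left as stubs you would flesh out iteratively.

    import json
    import re
    from pathlib import Path

    # Job blocks in a .dsx export are delimited by BEGIN DSJOB ... END DSJOB markers.
    # Treat these patterns as a starting point and refine them against your own exports.
    JOB_BLOCK = re.compile(r"BEGIN DSJOB(.*?)END DSJOB", re.DOTALL)
    IDENTIFIER = re.compile(r'Identifier "([^"]+)"')

    def parse_dsx(path: Path) -> list:
        """Return one metadata dict per job found in a .dsx export file."""
        text = path.read_text(errors="replace")
        jobs = []
        for block in JOB_BLOCK.findall(text):
            name = IDENTIFIER.search(block)
            jobs.append({
                "job_name": name.group(1) if name else "UNKNOWN",
                "stages": [],      # to be populated by stage-level parsing rules
                "links": [],       # to be populated by link-level parsing rules
                "parameters": [],  # to be populated by parameter parsing rules
            })
        return jobs

    if __name__ == "__main__":
        Path("./metadata").mkdir(exist_ok=True)
        for dsx_file in Path("./exports").rglob("*.dsx"):
            for job in parse_dsx(dsx_file):
                out = Path("./metadata") / f"{job['job_name']}.json"
                out.write_text(json.dumps(job, indent=2))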

This structured metadata is now the foundation for everything that follows.

3. Metadata Analysis & Classification

With the metadata extracted into a queryable format (e.g., a database, or a folder of JSON files queryable with Spark), we can now perform the crucial triage step. The goal is to avoid treating all jobs as equal.

1. Job Complexity Scoring:
We develop a scoring algorithm to objectively measure job complexity. This helps us prioritize and forecast effort.

Metric | Weight | Example Scoring Logic
Number of Stages | 1 | 1 point per stage
Complex Stage Types | 5 | 5 points for each Transformer, Join, Lookup
Custom Routine/Function Calls | 20 | 20 points per unique custom function (high risk!)
Number of Transformer Derivations | 2 | 2 points per derived column in a Transformer
Sequence Job Complexity | 10 | 10 points for each loop or complex condition
Non-Standard Connectors | 15 | 15 points for custom connectors, mainframe sources

A simple job might score 10-20, while a monster job could score 200+. This score immediately tells us where the migration dragons lie.
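
As a starting point, here is a minimal scoring sketch that works off the JSON structure from section 2; the custom-routine, sequence, and connector weights would be added the same way once the parser captures those attributes.

    import json

    # Weights follow the scoring table above; extend as your parser captures more attributes.
    COMPLEX_STAGE_TYPES = {"Transformer", "Join", "Lookup"}

    def score_job(job_meta: dict) -> int:
        score = 0
        for stage in job_meta.get("stages", []):
            score += 1                                   # 1 point per stage
            if stage["type"] in COMPLEX_STAGE_TYPES:
                score += 5                               # complex stage types
            if stage["type"] == "Transformer":
                derivations = stage["properties"].get("derivations", [])
                score += 2 * len(derivations)            # per derived column
        return score

    with open("metadata/JB_Load_Sales_Fact.json") as f:
        job_meta = json.load(f)
    print(score_job(job_meta))  # the section 2 example scores 16: a simple job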

2. Identifying Reusable Patterns:
We query our metadata repository to find common job structures. For example:
    SELECT COUNT(*) FROM job_metadata
    WHERE stage_signature = 'OracleConnector -> Transformer -> Dataset -> Join -> TeradataConnector'

(stage_signature here is simply the job's stage types concatenated in execution order.)

This query identifies all jobs that follow a specific ingestion and integration pattern. By identifying these, we can build a single, highly reusable Databricks template that can serve dozens or hundreds of jobs. This is where we get our economies of scale.
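
Computing that stage_signature from the extracted JSON takes only a few lines; a short sketch, assuming jobs is a list of the per-job metadata dicts we produced earlier:

    from collections import Counter

    def stage_signature(job_meta: dict) -> str:
        # Simplification: assumes the stages list is already in execution order
        # (in practice, derive the order by walking the links from sources to targets).
        return " -> ".join(stage["type"] for stage in job_meta["stages"])

    # jobs = [json.load(open(p)) for p in ...]  # the per-job metadata dicts
    signature_counts = Counter(stage_signature(j) for j in jobs)
    for signature, count in signature_counts.most_common(10):
        print(f"{count:4d}  {signature}")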

3. Prioritizing for Migration:
We use a simple 2x2 matrix to classify jobs and decide on the migration strategy:

  • High Business Value / Low Complexity -> Automate First (Quick Wins): Ideal candidates for our automation framework. Prove the value of the approach here.
  • High Business Value / High Complexity -> Strategic Manual Rewrite: Dedicate senior engineers. These may require re-architecture, not just conversion.
  • Low Business Value / Low Complexity -> Automate Later (Backlog): Run these through the generator when time permits. Low risk, low priority.
  • Low Business Value / High Complexity -> Challenge/Decommission: Do we even need this job? The migration is the perfect time to challenge its existence.

This analysis phase is non-negotiable. It turns a chaotic inventory into a structured, prioritized backlog.

4. Mapping Metadata to Databricks Constructs

This is the intellectual core of the conversion framework. We create a translation dictionary that maps DataStage concepts to Databricks/Spark concepts.

DataStage Construct | Databricks/PySpark Equivalent | Notes
Source/Target Connectors | spark.read.format(...) / df.write.format(...) | JDBC, Delta, Parquet, CSV. The connector properties in the metadata directly map to options (.option("dbtable", ...)).
Transformer Stage | df.withColumn(), df.selectExpr(), Spark SQL CASE statements | Simple derivations are a 1:1 mapping. Complex IF/THEN/ELSE chains become when().otherwise() or SQL CASE statements. This is where a good expression parser is vital.
Filter Stage | df.filter() or df.where() | Straightforward mapping of the filter expression.
Aggregator Stage | df.groupBy().agg() | The grouping keys map to groupBy(). The aggregations (SUM, COUNT, MAX) map to functions in pyspark.sql.functions.
Join/Merge/Lookup Stages | df1.join(df2, on=..., how=...) | The join keys, join type (inner, left_outer), and link ordering from the metadata directly inform the parameters of the Spark join.
Sort Stage | df.orderBy() or df.sort() | Map the sort keys and direction (ASC/DESC).
Sequence Job | Databricks Workflow | Each Job Activity in the sequence becomes a Task in the workflow. The execution links (OK, Failed) become depends_on relationships.
Job Parameters | Databricks Job Parameters / Widgets | Parameters extracted from the .dsx are used to define widgets in a notebook or parameters in a job definition, allowing for dynamic execution.
Loops (Start/End Loop) | Paradigm Shift: Avoid loops. Re-engineer to process sets. If absolutely necessary, use foreachBatch or dynamic task generation via the Databricks API, but this is an anti-pattern. | This is a key re-architecture opportunity. DataStage encourages procedural looping; Spark excels at parallel set-based processing. Don't lift-and-shift loops.
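
To make the Transformer row concrete, here is a minimal PySpark sketch of how two derivations might translate; the column names are illustrative, and the second expression comes from the JSON example in section 2.

    from pyspark.sql import functions as F

    # DataStage: If Input.QUANTITY > 100 Then "BULK" Else "STANDARD" (illustrative)
    df = df.withColumn(
        "ORDER_CLASS",
        F.when(F.col("QUANTITY") > 100, F.lit("BULK")).otherwise(F.lit("STANDARD")),
    )

    # DataStage: Input.QUANTITY * Input.UNIT_PRICE (from the section 2 example)
    df = df.withColumn("TOTAL_AMOUNT", F.col("QUANTITY") * F.col("UNIT_PRICE"))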

5. Automated Code Generation Using Metadata

With our parsed metadata and mapping logic, we can now generate code. The most effective approach I've found is template-based generation.

  1. Choose a Templating Engine: Jinja2 for Python is the industry standard and my strong recommendation. It's powerful, flexible, and separates the code-generation logic (the template) from the metadata itself.

  2. Create Code Templates: We build a library of Jinja2 templates for different components:

    • notebook_header.py.j2: Sets up SparkSession, imports, defines widgets for parameters.
    • read_jdbc.py.j2: A template for reading from a JDBC source.
    • with_column.py.j2: A template for a Transformer derivation.
    • group_by.py.j2: A template for an Aggregator stage.
    • write_delta.py.j2: A template for writing to a Delta table.
  3. The Generation Process:
    The core of the generator is a script that:
    a. Takes the structured job metadata (our JSON from step 2) as input.
    b. Initializes a notebook with the header template.
    c. Iterates through the stages of the job in their logical execution order (derived from the links).
    d. For each stage, it selects the appropriate Jinja2 template based on the stage_type.
    e. It renders the template, passing in the specific properties from the metadata (e.g., for a read_jdbc template, it passes the table name, credentials secret, and filter condition).
    f. It appends the rendered code to the notebook being built.
    g. The final output is a complete, runnable PySpark notebook (.py or .ipynb) file.

Simplified Generator Snippet (Conceptual):

    import json
    from jinja2 import Environment, FileSystemLoader

    # Load metadata for a job (the JSON we extracted in step 2)
    with open("JB_Load_Sales_Fact.json") as f:
        job_meta = json.load(f)

    # Map DataStage stage types to their code templates
    # This is a simplification; a real factory would be more robust
    STAGE_TEMPLATES = {
        "OracleConnector": "read_jdbc.py.j2",
        "Transformer": "with_column.py.j2",
        # ... and so on for other stage types
    }

    # Set up the Jinja2 environment
    env = Environment(loader=FileSystemLoader("./templates/"))

    # Start building the notebook: header with parameters (imports, widgets)
    header_template = env.get_template("notebook_header.py.j2")
    notebook_code = header_template.render(parameters=job_meta["parameters"])

    # Process stages in order
    for stage in job_meta["stages"]:
        template_name = STAGE_TEMPLATES.get(stage["type"])
        if template_name is None:
            # Unknown stage type: flag it for manual conversion instead of failing silently
            notebook_code += f"\n# TODO: manual conversion required for stage {stage['name']} ({stage['type']})\n"
            continue

        # Render the stage-specific code and append it to the notebook
        template = env.get_template(template_name)
        notebook_code += template.render(props=stage["properties"])

    # Save the generated notebook
    with open(f"{job_meta['job_name']}.py", "w") as f:
        f.write(notebook_code)
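
For reference, here is a sketch of what a read_jdbc.py.j2 template might contain; the props keys and the secret scope name are assumptions about our own metadata schema, not a fixed convention.

    # --- Stage: {{ props.stage_name }} (generated from read_jdbc.py.j2) ---
    # Credentials are resolved from a Databricks secret scope rather than hard-coded.
    {{ props.df_name }} = (
        spark.read.format("jdbc")
        .option("url", dbutils.secrets.get("migration", "{{ props.connection }}_url"))
        .option("user", dbutils.secrets.get("migration", "{{ props.connection }}_user"))
        .option("password", dbutils.secrets.get("migration", "{{ props.connection }}_password"))
        {%- if props.query %}
        .option("query", """{{ props.query }}""")
        {%- else %}
        .option("dbtable", "{{ props.table }}")
        {%- endif %}
        .load()
    )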

This generated code is a scaffold. It should be 80-90% correct for patterned jobs, but it's crucial that an engineer reviews, refines, and tests it.

6. Orchestration Driven by Metadata

Jobs don't run in isolation. We use the metadata extracted from DataStage Sequence jobs to automatically build Databricks Workflows.

  1. Parse Sequence Logic: Our metadata extractor must capture the flow of a Sequence job: which job runs first, what triggers the next job (e.g., success, failure, warning), and any conditional branches.
  2. Map to Workflow JSON: We write a script that translates this sequence logic into the JSON definition for a Databricks Workflow.
    • Each Job Activity stage becomes a task in the tasks array.
    • The task_key is a sanitized version of the job name.
    • The notebook to be run is the one we just generated (notebook_path).
    • The sequence links define the depends_on array for each task.
    • Parameters from the sequence are passed into the parameters block of the task.

Example Mapping:

A DataStage sequence JobA -> JobB (on success) becomes:

    {
      "name": "SEQ_My_Workflow",
      "tasks": [
        {
          "task_key": "JobA",
          "notebook_task": { "notebook_path": "/generated/JobA" }
        },
        {
          "task_key": "JobB",
          "depends_on": [{ "task_key": "JobA" }],
          "notebook_task": { "notebook_path": "/generated/JobB" }
        }
      ]
    }

This JSON can be directly fed into the Databricks Jobs API (POST /api/2.1/jobs/create) to programmatically create the entire workflow.
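
A short sketch of that call, assuming the workflow definition has been written to a file and the workspace URL and token live in environment variables (the variable names are illustrative):

    import json
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]

    # The workflow definition we generated from the Sequence metadata (like the example above).
    # In practice each task also needs a cluster spec (existing_cluster_id or job_clusters).
    with open("SEQ_My_Workflow.json") as f:
        workflow_json = json.load(f)

    resp = requests.post(
        f"{host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=workflow_json,
        timeout=60,
    )
    resp.raise_for_status()
    print("Created job_id:", resp.json()["job_id"])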

7. Validation & Testing Using Metadata

How do we trust the generated code? We use the same metadata to generate our tests. This is a concept I call "Metadata-Driven Validation."

Our metadata tells us the exact source and target for every job. We can leverage this to automate reconciliation.

  1. Automated Test Case Generation: For a given job, our framework automatically generates a validation notebook that performs checks like:
    • Row Count Reconciliation: SELECT COUNT(1) FROM <source_table> vs. SELECT COUNT(1) FROM <target_delta_table>.
    • Schema Validation: Compare the schema of the source query result with the schema of the target table. Check column names, data types, and nullability.
    • Data Reconciliation (Checksum): For key business columns, perform an aggregate checksum on both source and target and compare the results. This is far more robust than just a row count.
    -- Run on source and target, then compare results
    SELECT SUM(CAST(COLUMN_A AS DECIMAL(38,4))), COUNT(DISTINCT COLUMN_B)
    FROM my_table WHERE <filter_conditions>;

  2. Integrating Validation into Workflows: We can add this validation notebook as a final task in our automatically generated Databricks Workflow. If the validation task fails, the entire workflow run is marked as failed.

  3. Tracking Results: The output of every validation run (e.g., source count, target count, checksum values, pass/fail status) is logged to a central audit table. This gives project managers and auditors a dashboard view of migration quality.
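
As an illustration, here is a minimal sketch of one generated row-count check; the source query, target table, and audit table schema are stand-ins that the framework would fill in from metadata:

    from datetime import datetime, timezone

    # Source/target for one migrated job; in the framework these come straight from metadata.
    jdbc_url = dbutils.secrets.get("migration", "oracle_sales_url")  # illustrative
    source_count = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("query", "SELECT COUNT(1) AS CNT FROM STG_SALES WHERE TXN_DATE = '2023-10-27'")
        .load()
        .collect()[0]["CNT"]
    )
    target_count = spark.table("main.sales.f_sales").where("TXN_DATE = '2023-10-27'").count()

    status = "PASS" if source_count == target_count else "FAIL"

    # Log the result to the central audit table (schema assumed for this sketch).
    spark.createDataFrame(
        [("JB_Load_Sales_Fact", "row_count", int(source_count), int(target_count), status,
          datetime.now(timezone.utc).isoformat())],
        "job_name STRING, check_type STRING, source_value BIGINT, target_value BIGINT, "
        "status STRING, checked_at STRING",
    ).write.mode("append").saveAsTable("migration_control.validation_results")

    # Fail the workflow task (and therefore the run) if the check does not pass.
    assert status == "PASS", f"Row count mismatch: source={source_count}, target={target_count}"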

8. Logging, Monitoring & Auditability

A core tenet of this approach is end-to-end auditability. Our migration framework is not just a converter; it's a logging and tracking system.

We maintain a central Migration Control Database with tables like:

  • Job_Inventory: One row for every DataStage job, including its complexity score and classification.
  • Migration_Log: Tracks the status of each job's migration (e.g., Pending, Generated, Manual_Review, Validated, Deployed).
  • Generated_Code_Artifacts: Stores a link to the generated notebook and workflow JSON for each job.
  • Validation_Results: The detailed log from our automated testing, as described above.

This database becomes the single source of truth for the migration program. It answers questions like: "Show me all high-complexity jobs that failed validation" or "What percentage of the finance jobs have been fully deployed to production?"
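
For example, the first of those questions becomes a simple query over the control tables (the schema name and complexity threshold here are illustrative):

    # "Show me all high-complexity jobs that failed validation"
    failed_complex = spark.sql("""
        SELECT i.job_name, i.complexity_score, v.status AS validation_status
        FROM migration_control.job_inventory i
        JOIN migration_control.validation_results v ON v.job_name = i.job_name
        WHERE i.complexity_score >= 100
          AND v.status = 'FAIL'
        ORDER BY i.complexity_score DESC
    """)
    failed_complex.show(truncate=False)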

9. Real-World Examples & Lessons Learned

Example: A Complex Job

Consider a DataStage job that:
1. Reads from two Oracle tables.
2. Joins them on a customer ID.
3. Looks up a product category from a reference file (CSV).
4. Has a Transformer with 15 derivations, including a complex If-Then-Else to calculate customer status.
5. Aggregates sales by region.
6. Writes the result to a Teradata table.

Metadata-Driven Approach:

  1. Extraction: Our parser extracts all 6 stages, the join keys, lookup file path, every single derivation expression, and the aggregation logic.
  2. Analysis: The complexity score is high (~80), flagging it for review.
  3. Generation:
    • The generator creates two spark.read.jdbc blocks.
    • It creates a spark.read.csv for the lookup file.
    • It generates a df1.join(df2, ...) and a subsequent joined_df.join(lookup_df, ...).
    • It translates the 15 derivations into 15 withColumn calls. The If-Then-Else becomes a when().otherwise() expression.
    • It generates a groupBy("REGION").agg(...) block.
    • Finally, it generates a df.write.jdbc(...) to Teradata.
  4. Result: We get a 50-line PySpark notebook that is functionally equivalent and fully readable. An engineer spends 1-2 hours reviewing and testing it, versus 1-2 days trying to rewrite it from scratch and likely missing subtle logic.
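
For a feel of the output, here is a compressed sketch of the generated notebook's shape; the connection handling and column names are illustrative stand-ins, not the generator's literal output:

    from pyspark.sql import functions as F

    # Widgets generated from the job parameters
    dbutils.widgets.text("p_RUN_DATE", "2023-10-27")
    run_date = dbutils.widgets.get("p_RUN_DATE")

    # Two JDBC reads from Oracle plus the CSV reference lookup
    oracle_url = dbutils.secrets.get("migration", "oracle_url")      # illustrative
    teradata_url = dbutils.secrets.get("migration", "teradata_url")  # illustrative
    orders_df = (spark.read.format("jdbc").option("url", oracle_url)
                 .option("dbtable", "ORDERS").load()
                 .where(F.col("ORDER_DATE") == run_date))
    customers_df = (spark.read.format("jdbc").option("url", oracle_url)
                    .option("dbtable", "CUSTOMERS").load())
    product_cat_df = spark.read.option("header", True).csv("/mnt/reference/product_category.csv")

    # Join on customer ID, then the product-category lookup
    enriched_df = (
        orders_df.join(customers_df, on="CUSTOMER_ID", how="inner")
                 .join(product_cat_df, on="PRODUCT_ID", how="left")
    )

    # One of the 15 derivations: the If-Then-Else chain for customer status
    enriched_df = enriched_df.withColumn(
        "CUSTOMER_STATUS",
        F.when(F.col("LIFETIME_SPEND") > 100000, "PLATINUM")
         .when(F.col("LIFETIME_SPEND") > 10000, "GOLD")
         .otherwise("STANDARD"),
    )

    # Aggregate sales by region and write to Teradata
    result_df = enriched_df.groupBy("REGION").agg(F.sum("TOTAL_AMOUNT").alias("TOTAL_SALES"))
    (result_df.write.format("jdbc").option("url", teradata_url)
     .option("dbtable", "SALES_BY_REGION").mode("append").save())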

Pitfalls and Lessons Learned:

  • The "Transformer from Hell": Some Transformer stages contain hundreds of derivations and spaghetti-like business logic. Our expression parser must be robust, but for the absolute worst offenders, it's often better to flag them for a manual, strategic rewrite. The goal is simplification, not just conversion.
  • Implicit Data Type Casting: DataStage is very lenient with data types. Spark is not. A common failure is assuming a VARCHAR can be treated as a DECIMAL. Our validation framework's schema checks are critical for catching this.
  • Environment Variables: DataStage jobs are often littered with environment variables ($PROJDEF, $ENV_VAR). Our extraction process must resolve these, or map them to a consistent configuration mechanism in Databricks (like secrets or cluster environment variables).
  • Custom Routines: C++ or BASIC routines called from Transformers are the highest risk. These cannot be automatically converted. They must be identified, and the logic re-implemented in Python/Scala as a UDF. Our complexity score heavily weights these for this reason.

10. Best Practices

  1. Establish a Metadata Single Source of Truth: Your parsed, structured metadata repository is gold. Protect it, version it, and make it the definitive source for all automation.
  2. Build Modular, Reusable Mapping Logic: Don't write a monolithic conversion script. Build your code generator with modular templates and rules. When you encounter a new pattern, you add a new rule/template, you don't rewrite the whole engine.
  3. Embrace the "Human in the Loop": Automation is a force multiplier for your engineers, not a replacement. Generated code must always be reviewed, tested, and approved by a qualified engineer.
  4. Implement a Continuous Improvement Loop: When an engineer manually fixes or improves a piece of generated code, ask "Why was this fix necessary?" and feed that learning back into the generation templates. This makes your automation framework smarter with every job you migrate.
  5. Start Small, Prove Value: Don't try to build the perfect, all-encompassing framework from day one. Start with a common, medium-complexity pattern. Automate it, show the dramatic speed and quality improvement, and earn the buy-in to expand the framework's capabilities.

By adopting this metadata-driven mindset, you transform migration from a risky, manual art into a structured, repeatable, and auditable science. You stop converting jobs one by one and start operating a migration factory, delivering consistent, high-quality results at scale.