How to Migrate Oracle Data Integrator to Databricks?

L+ Editorial
Jan 30, 2026

Migrating from Oracle Data Integrator (ODI) to Databricks

I remember the exact moment on a multi-million dollar migration project when we hit the wall. We were three months in, moving a massive Oracle Data Integrator (ODI) estate to Databricks. The project plan looked great, progress bars were green, but the data reconciliation reports were a sea of red. Tiny, infuriating discrepancies—off by a few cents here, a rounding error there—were showing up in finance reports. The business was losing trust, and my team was burning out chasing ghosts in the data.

That was the moment we learned that migrating from ODI to Databricks isn't a "lift and shift." It's a brain transplant. You're not just moving code; you're moving from a rigid, on-premises, SQL-centric world to a flexible, cloud-native, code-first paradigm.

I’ve since led several of these large-scale production migrations. I’ve seen what works, what spectacularly fails, and what they don’t tell you in the sales pitches. The promise of the Databricks Lakehouse—unifying data, analytics, and AI on a single platform—is real. But getting there from a legacy ETL tool like ODI requires a playbook forged in the trenches. This is that playbook. It’s the unfiltered, hands-on guide I wish I had when I started.

Before You Write a Single Line of Spark: The Brutal Honesty of Inventory

Your first instinct might be to grab an ODI mapping, look at the SQL it generates, and start rewriting it in a Databricks notebook. This is a trap. You'll quickly get bogged down in a seemingly endless swamp of undocumented logic, dead code, and hidden dependencies. On one project, we discovered that nearly 40% of the ODI jobs the client wanted to migrate hadn't successfully run in over a year.

You cannot migrate what you do not understand. A comprehensive, automated inventory isn't a "nice-to-have"; it's the foundation of your entire migration strategy. You need to connect directly to the ODI repository and rip out its secrets.

The ODI repository is just a database (usually Oracle). We can query its metadata tables to build a complete picture of the entire environment. These tables, with their SNP_ prefixes, are the source of truth.

Python for an ODI Repository Deep Dive

Here’s a Python script, inspired by ones my teams have used, to connect to the ODI work repository and extract a complete inventory. It uses pandas and a database connector like cx_Oracle to pull the metadata into a structured format for analysis.

    import pandas as pd
    import cx_Oracle # Or your database driver of choice (pyodbc for SQL Server)
    import getpass

    # --- Configuration ---
    # In a real project, this would come from a config file or secrets manager
    DB_USER = "ODI_REPO_USER"
    DB_HOST = "your-odi-repo-host.com"
    DB_PORT = "1521"
    DB_SERVICE = "ORCL"

    # It's better to use a wallet or other secure method, but for this example:
    DB_PASSWORD = getpass.getpass(f"Enter password for {DB_USER}: ")

    # --- Connection ---
    try:
        dsn = cx_Oracle.makedsn(DB_HOST, DB_PORT, service_name=DB_SERVICE)
        connection = cx_Oracle.connect(user=DB_USER, password=DB_PASSWORD, dsn=dsn)
        print("Successfully connected to ODI Repository.")
    except cx_Oracle.Error as e:
        print(f"Error connecting to Oracle: {e}")
        exit()

    # --- Inventory Queries ---
    # These queries join various SNP tables to get a comprehensive view.

    # 1. Projects and Folders
    projects_query = """
    SELECT
        p.PROJECT_NAME,
        f.FOLDER_NAME,
        p.I_PROJECT
    FROM SNP_PROJECT p
    LEFT JOIN SNP_FOLDER f ON p.I_PROJECT = f.I_PROJECT
    ORDER BY p.PROJECT_NAME, f.FOLDER_NAME
    """

    # 2. Mappings (Interfaces in older ODI versions) and their components
    mappings_query = """
    SELECT
        proj.PROJECT_NAME,
        fld.FOLDER_NAME,
        m.NAME AS MAPPING_NAME,
        m.I_MAP,
        comp.NAME AS COMPONENT_NAME,
        comp.TYPE_NAME AS COMPONENT_TYPE,
        prop.NAME AS PROPERTY_NAME,
        prop.VALUE AS PROPERTY_VALUE
    FROM SNP_MAP m
    JOIN SNP_FOLDER fld ON m.I_FOLDER = fld.I_FOLDER
    JOIN SNP_PROJECT proj ON fld.I_PROJECT = proj.I_PROJECT
    LEFT JOIN SNP_MAP_COMP comp ON m.I_MAP = comp.I_OWNER_MAP
    LEFT JOIN SNP_MAP_PROP prop ON comp.I_MAP_COMP = prop.I_OWNER_MAP_COMP
    WHERE prop.NAME IN ('SQL_QUERY', 'EXPRESSION_TRT', 'FILTER_TRT', 'JOIN_TRT') -- Extracting the core logic
    ORDER BY proj.PROJECT_NAME, fld.FOLDER_NAME, m.NAME
    """

    # 3. Packages and Steps
    packages_query = """
    SELECT
        p.PROJECT_NAME,
        pkg.PACK_NAME,
        step.STEP_NAME,
        step.STEP_TYPE,
        step.I_SCEN_TASK -- Links to Scenarios/Logs
    FROM SNP_PACKAGE pkg
    JOIN SNP_FOLDER f ON pkg.I_FOLDER = f.I_FOLDER
    JOIN SNP_PROJECT p ON f.I_PROJECT = p.I_PROJECT
    JOIN SNP_STEP step ON pkg.I_PACKAGE = step.I_PACKAGE
    ORDER BY p.PROJECT_NAME, pkg.PACK_NAME, step.NNO
    """

    # 4. Scenarios (the runnable objects)
    scenarios_query = """
    SELECT
        SCEN_NAME,
        SCEN_VERSION,
        I_SCEN,
        (SELECT PROJECT_NAME FROM SNP_PROJECT WHERE I_PROJECT = s.I_PROJECT) AS PROJECT_NAME
    FROM SNP_SCEN s
    """

    # --- Execution & Data Loading ---
    try:
        df_projects = pd.read_sql(projects_query, connection)
        df_mappings = pd.read_sql(mappings_query, connection)
        df_packages = pd.read_sql(packages_query, connection)
        df_scenarios = pd.read_sql(scenarios_query, connection)

        print(f"\nDiscovered {len(df_projects['PROJECT_NAME'].unique())} projects.")
        print(f"Discovered {len(df_mappings['MAPPING_NAME'].unique())} mappings.")
        print(f"Discovered {len(df_packages['PACK_NAME'].unique())} packages.")
        print(f"Discovered {len(df_scenarios)} scenarios.")

        # Save to CSV for further analysis
        df_projects.to_csv("odi_inventory_projects.csv", index=False)
        df_mappings.to_csv("odi_inventory_mappings.csv", index=False)
        df_packages.to_csv("odi_inventory_packages.csv", index=False)
        df_scenarios.to_csv("odi_inventory_scenarios.csv", index=False)

        print("\nInventory saved to CSV files.")

    finally:
        connection.close()
        print("Connection closed.")

This script gives you the raw material. It tells you what exists. The next, more important question is: what matters?

Usage, Dependency, and Readiness Analysis

You need to cross-reference your inventory with execution logs. The SNP_SESS_TASK_LOG table is gold. It tells you which scenarios actually ran, when they ran, and for how long. By joining this back to your inventory, you can ruthlessly prioritize.
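
To make that concrete, here is a minimal extraction sketch that reuses the repository connection from the inventory script above. It assumes an SNP_SESSION table with SESS_NAME, SESS_DUR, and SESS_END columns; log table and column names differ between ODI versions, so verify them against your own work repository before trusting the numbers.

    # Usage statistics for the last 90 days of executions.
    # Assumption: SNP_SESSION has one row per session with SESS_NAME,
    # SESS_DUR (seconds) and SESS_END -- verify against your ODI version.
    usage_query = """
    SELECT
        s.SESS_NAME                    AS SCEN_NAME,
        COUNT(*)                       AS RUN_COUNT_LAST_90D,
        ROUND(AVG(s.SESS_DUR) / 60, 1) AS AVG_DURATION_MIN,
        MAX(s.SESS_END)                AS LAST_RUN_TS
    FROM SNP_SESSION s
    WHERE s.SESS_END >= SYSDATE - 90
    GROUP BY s.SESS_NAME
    ORDER BY RUN_COUNT_LAST_90D DESC
    """

    # 'connection' is the cx_Oracle connection opened by the inventory script
    df_usage = pd.read_sql(usage_query, connection)
    df_usage.to_csv("odi_usage_last_90d.csv", index=False)
    print(f"Captured usage statistics for {len(df_usage)} scenarios.")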

Here’s how we would extend the analysis to create a "Migration Readiness" report. This is where you move from a simple list to an actionable plan.

    import pandas as pd
    import networkx as nx  # only needed for the dependency-graph sketch at the end

    # Assume the CSVs from the previous script have been loaded into DataFrames
    # df_inventory = pd.read_csv(...)
    # df_scenarios = pd.read_csv(...)
    # For this example, let's create dummy dataframes
    # In a real scenario, you'd load the data from the ODI repo extraction
    data_inventory = {
        'OBJECT_NAME': ['Mapping_A', 'Mapping_B', 'Package_C', 'Scenario_A', 'Scenario_B', 'Scenario_C'],
        'OBJECT_TYPE': ['MAPPING', 'MAPPING', 'PACKAGE', 'SCENARIO', 'SCENARIO', 'SCENARIO'],
        'COMPLEXITY_SCORE': [3, 8, 5, 3, 8, 5], # 1-10 scale
        'SOURCE_SYSTEM': ['Oracle', 'SQLServer', 'Oracle', 'Oracle', 'SQLServer', 'Oracle'],
        'TARGET_SYSTEM': ['OracleDW', 'OracleDW', 'OracleDW', 'OracleDW', 'OracleDW', 'OracleDW']
    }
    df_inventory = pd.DataFrame(data_inventory)

    data_usage = {
        'SCEN_NAME': ['Scenario_A', 'Scenario_B', 'Scenario_B'],
        'RUN_COUNT_LAST_90D': [180, 90, 90], # Scenario B ran twice
        'AVG_DURATION_MIN': [5, 45, 45]
    }
    df_usage = pd.DataFrame(data_usage).groupby('SCEN_NAME').agg(
        RUN_COUNT_LAST_90D=('RUN_COUNT_LAST_90D', 'sum'),
        AVG_DURATION_MIN=('AVG_DURATION_MIN', 'mean')
    ).reset_index()

    # 1. Merge inventory with usage stats
    df_report = pd.merge(df_inventory, df_usage, left_on='OBJECT_NAME', right_on='SCEN_NAME', how='left')
    df_report['RUN_COUNT_LAST_90D'] = df_report['RUN_COUNT_LAST_90D'].fillna(0)
    df_report['AVG_DURATION_MIN'] = df_report['AVG_DURATION_MIN'].fillna(0)

    # 2. Define readiness rules (this is where your architectural expertise comes in)
    def assess_readiness(row):
        # Rule 1: Not used = Deprecate
        if row['RUN_COUNT_LAST_90D'] == 0:
            return "DEPRECATE"

        # Rule 2: High complexity = Manual Review
        if row['COMPLEXITY_SCORE'] > 7:
            return "MANUAL_REVIEW_REQUIRED"

        # Rule 3: Standard connectors = Good candidate for automation
        if row['SOURCE_SYSTEM'] in ['Oracle', 'SQLServer', 'FlatFile']:
            return "AUTOMATION_CANDIDATE"

        # Rule 4: Obscure source = Requires custom connector investigation
        if row['SOURCE_SYSTEM'] in ['AS400', 'MainframeCopybook']:
            return "CUSTOM_CONNECTOR_NEEDED"

        return "STANDARD_MIGRATION"

    df_report['MIGRATION_CATEGORY'] = df_report.apply(assess_readiness, axis=1)

    # 3. Calculate a Priority Score
    df_report['PRIORITY_SCORE'] = (df_report['RUN_COUNT_LAST_90D'] / 10) * (10 - df_report['COMPLEXITY_SCORE'])
    df_report = df_report.sort_values(by='PRIORITY_SCORE', ascending=False)


    print("--- Migration Readiness & Prioritization Report ---")
    print(df_report[['OBJECT_NAME', 'OBJECT_TYPE', 'COMPLEXITY_SCORE', 'RUN_COUNT_LAST_90D', 'MIGRATION_CATEGORY', 'PRIORITY_SCORE']])

    # --- Dependency Graph (Conceptual) ---
    # To do this for real, you need to parse the package steps to find dependencies
    # e.g., Step 1 runs Mapping_A, Step 2 runs Mapping_B
    # G = nx.DiGraph()
    # G.add_edge("Mapping_A", "Package_C")
    # G.add_edge("Mapping_B", "Package_C")
    # print("\nDependencies for Package_C:", list(G.predecessors("Package_C")))

This data-driven approach changes the conversation from "We need to migrate everything" to "We will migrate these 250 high-value pipelines first, deprecate these 150 unused ones, and schedule these 50 complex ones for manual re-architecture." This is how you build a realistic roadmap.

The Rosetta Stone: Translating ODI Concepts to the Databricks Lakehouse

Once you know what to migrate, you need a translation guide. Don't think of it as a one-to-one mapping; think of it as finding the most "Databricks-native" way to achieve the same outcome. Trying to make Databricks behave like ODI is a recipe for an expensive, slow, and unmaintainable system.

Here's my personal translation table, built from experience:

  • Work Repository → Unity Catalog Metastore (alternative: Hive Metastore, legacy). Unity Catalog is non-negotiable for any new project. It provides the centralized governance, lineage, and security that a repository offers, but for the entire Lakehouse.
  • Source/Target Datastore → Delta Table (alternative: external table on S3/ADLS). Always default to Delta Tables. The ACID transactions, time travel, and performance optimizations are game-changers. Use external tables only for ingesting raw, unstructured data before converting to Delta.
  • Mapping / Interface → Databricks Notebook (PySpark) (alternative: Delta Live Tables, DLT). For direct translation of existing logic, a parameterized notebook is the simplest path. For new, streaming, or quality-focused pipelines, DLT is superior with its declarative approach and built-in data quality checks.
  • Knowledge Module (KM) → Shared Python library/module (alternative: reusable notebook via %run). KMs are all about reusable code patterns (loading, integrating, checking). The modern equivalent is a proper Python library (.py files) packaged into a wheel and attached to a cluster. This is far more robust and testable than using %run.
  • Procedure (Jython/SQL) → Notebook cell / Python function (alternative: Spark SQL in a notebook). The logic within ODI procedures almost always translates to a series of PySpark DataFrame transformations or direct Spark SQL queries executed within a notebook.
  • Package → Databricks Workflow (alternative: a single complex notebook). An ODI Package orchestrates multiple steps. A Databricks Workflow does the exact same thing, orchestrating multiple notebooks, SQL tasks, or DLT pipelines. Chaining notebooks with %run is brittle and should be avoided.
  • Scenario → Databricks Job (alternative: a deployed DLT pipeline). A Scenario is a compiled, runnable version of a package or mapping. A Databricks Job is the runnable, schedulable entity that executes a Workflow or a single notebook on a specific cluster configuration.
  • Load Plan → Databricks Workflow with nested tasks (alternative: an external orchestrator such as Airflow). Load Plans are for complex, multi-stage orchestration. Databricks Workflows can handle significant complexity. For enterprise-wide, cross-system orchestration, integrating with a tool like Airflow is the standard pattern.

The Gauntlet: Navigating Real-World Migration Challenges

This is where theory meets reality. Every migration project faces a series of predictable—but painful—hurdles. Here are the big ones I’ve encountered on every single project, why they happen, and how we beat them.

Connectivity & Drivers: The First Wall You'll Hit

Why it happens: ODI often runs on a server sitting inside a corporate network, with decades of accumulated JDBC drivers and ODBC connections to every system imaginable, from AS/400s and mainframes to ancient versions of Sybase. Databricks runs in the cloud. The drivers aren't there, and the network paths are blocked.

How we diagnosed it: The first notebook you run to connect to a legacy source will fail with a java.lang.ClassNotFoundException: com.some.obscure.Driver or just hang indefinitely, eventually timing out. This is your baptism by fire. We once spent a week trying to figure out why a connection to a DB2 instance was failing, only to realize the corporate firewall was silently dropping packets to the required port.

Remediation and Best Practices:

  1. Driver Management is Key: Do NOT manually upload JAR files to the Databricks File System (DBFS). This is a management nightmare.

    • Best Practice: Use cluster-scoped init scripts. Create a script that downloads the required JDBC JARs from a central, secure location (like an S3 bucket or Artifactory) to the driver and executor nodes when the cluster starts. This is repeatable and version-controlled.
    • Even Better: For complex dependencies, build a custom Docker image for your clusters using the Databricks Container Services. You bake all your drivers and libraries directly into the image. This provides maximum isolation and reproducibility.
  2. Network, Network, Network:

    • Diagnosis: Use nc (netcat) or telnet from a notebook (e.g., %sh telnet your-db-host 1521) to test basic network connectivity from the cluster workers to your source system.
    • Solution: Work with your network security team from day one. You will likely need VNet injection (Azure) or VPC peering (AWS) to place the Databricks cluster on the same private network as your on-premises data sources. For cloud sources, use Private Endpoints. Don't try to open database ports to the public internet.
  3. When JDBC Isn't Enough: JDBC is a lowest-common-denominator protocol, and Spark's generic JDBC reader pulls everything over a single connection unless you explicitly configure partitioned reads (see the sketch after this list). For large data transfers, it can be a huge bottleneck.

    • Solution: Explore specialized Spark connectors. Many vendors (e.g., CData, Qlik) provide high-performance connectors that can push down predicates and parallelize reads far more effectively than the generic JDBC driver. They cost money, but the performance gains can justify it.
    • Interim Staging: If a direct, high-performance connection isn't feasible, use a secondary tool to extract the data from the legacy source and land it as files (e.g., Parquet, Avro) in your cloud storage (ADLS/S3). Then, have Spark pick it up from there. This adds a step but decouples the systems and often improves reliability.
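
To illustrate the partitioned-read point in item 3, here is a minimal PySpark sketch you might run from a notebook once connectivity is in place. The host, table, bounds, and the "legacy-sources" secret scope are placeholders; the important part is that partitionColumn, lowerBound, upperBound, and numPartitions turn a single serial JDBC read into parallel range queries.

    # Sketch: parallel JDBC extraction from an Oracle source into a Delta table.
    # Host, table, bounds and the "legacy-sources" secret scope are placeholders.
    jdbc_url = "jdbc:oracle:thin:@//your-db-host:1521/ORCL"

    df_orders = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "SALES.ORDERS")
        .option("user", dbutils.secrets.get("legacy-sources", "oracle-user"))
        .option("password", dbutils.secrets.get("legacy-sources", "oracle-password"))
        .option("driver", "oracle.jdbc.OracleDriver")
        # These four options split the read into parallel range queries
        # instead of funnelling everything through one connection.
        .option("partitionColumn", "ORDER_ID")
        .option("lowerBound", "1")
        .option("upperBound", "50000000")
        .option("numPartitions", "16")
        .option("fetchsize", "10000")  # larger fetch size reduces round trips
        .load()
    )

    df_orders.write.format("delta").mode("overwrite").saveAsTable("bronze.orders_raw")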

Performance & Scalability: When "It Ran in ODI" Means Nothing

Why it happens: ODI, especially with its EL-T (Extract, Load, Transform) approach, relies heavily on the source and target databases to do the heavy lifting. Its transformations often resolve to a single, massive SQL statement executed on a powerful Oracle server. Spark, on the other hand, is a distributed processing engine. A direct translation of row-by-row logic or a poorly structured query will cripple a Spark cluster.

How we diagnosed it: We had a job that took 20 minutes in ODI but was running for over 3 hours in Databricks before failing with OutOfMemoryError or Shuffle Fetch Failed errors. We opened the Spark UI and saw a horror show: one massive stage with thousands of tasks, a huge amount of data being shuffled across the network, and task skew where a few tasks were taking hours while others finished in seconds.

Remediation and Best Practices:

  1. Think in DataFrames, Not Rows: The single most important mental shift for your team. Logic like FOR EACH ROW... IF... THEN... ELSE... needs to be re-written using DataFrame operations like withColumn, when, and join (see the sketch after this list). Avoid User-Defined Functions (UDFs) in Python or Scala as much as possible. They are black boxes to Spark's optimizer and often kill performance.
  2. Master the Spark UI: The Spark UI is not optional. It’s your MRI for diagnosing performance problems. Look at the DAG visualization. Find the stages with the most shuffling. Click into those stages and look at the task summary statistics. If the max task time is 100x the median, you have data skew.
  3. Control Your Partitions: A common mistake is reading a huge table and then joining or aggregating it without proper partitioning. This leads to a massive shuffle.
    • On Write: When writing your core Delta tables, always use .partitionBy() on low-cardinality columns that are frequently used in filters (e.g., date, region).
    • On Read: If you can't rely on partitioning, use repartition() or coalesce() strategically, and repartition on the join key before a heavy join. Note that bucketBy() is a write-time option that is not supported for Delta tables, so prefer repartitioning or Z-Ordering for high-cardinality join keys.
    • Z-Ordering: For Delta tables, use ZORDER BY on high-cardinality columns you filter on (e.g., user_id, product_id). This co-locates related data and dramatically speeds up queries by enabling data skipping.
  4. Right-Size Your Clusters: Don't just throw a massive cluster at the problem. Start with a reasonable size and analyze the performance. Use Ganglia or the cluster metrics UI to see if you are CPU-bound, memory-bound, or I/O-bound. Use autoscaling clusters for variable workloads, but be aware that downscaling can be slow and impact short-running jobs. For production jobs, a correctly-sized fixed cluster is often more cost-effective and predictable.
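
Here is the sketch promised in item 1: a row-by-row IF/THEN/ELSE rule rewritten as one set-based expression, with the result written as a Delta table partitioned on a low-cardinality column. Table and column names are illustrative.

    from pyspark.sql import functions as F

    # Illustrative input; in practice this is the table you ingested earlier.
    df = spark.table("bronze.orders_raw")

    # "FOR EACH ROW: IF amount > 1000 THEN 'LARGE' ELSIF amount > 100 THEN 'MEDIUM'
    # ELSE 'SMALL'" becomes a single declarative, set-based expression:
    df_classified = df.withColumn(
        "order_size",
        F.when(F.col("AMOUNT") > 1000, "LARGE")
         .when(F.col("AMOUNT") > 100, "MEDIUM")
         .otherwise("SMALL"),
    )

    # Partition the write on a low-cardinality column that queries filter on.
    (
        df_classified.write.format("delta")
        .mode("overwrite")
        .partitionBy("ORDER_DATE")
        .saveAsTable("silver.orders_classified")
    )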

Schema & Data Type Traps: The Silent Killers

Why it happens: This is the problem that caused the reconciliation nightmare I mentioned at the beginning. Databases like Oracle are very forgiving with data types. You can put a string into a number column, and it will often implicitly cast it. NUMBER in Oracle can be a floating-point or a fixed-point decimal. Spark is strict and precise. IntegerType is an integer. DecimalType(10, 2) has exactly two decimal places. Mismatches here don't always cause loud failures; they cause silent data corruption.

How we diagnosed it: The job would run successfully, but our post-load validation reports (a non-negotiable step!) would show mismatches. We had to perform MINUS queries (in SQL) or anti-joins (in Spark) between the source and target tables, which finally highlighted the specific rows and columns that were different. For a NUMBER vs. DecimalType issue, we found financial calculations were off by fractions of a cent, which accumulated over millions of rows.

Remediation and Best Practices:

  1. Create a Canonical Type Mapping Dictionary: Before migrating a single table, create a definitive mapping from your source system types to Spark/Delta types (a combined sketch of this mapping, explicit casts, and a reconciliation check follows this list).

    • VARCHAR2(n) -> StringType
    • NUMBER(p, s) -> DecimalType(p, s)
    • NUMBER (no precision) -> This is the dangerous one. Profile the data. If it's all integers, use LongType. If it has decimals, you MUST define a DecimalType with sufficient precision and scale (e.g., DecimalType(38, 10)). Assuming DoubleType can lead to floating-point precision errors.
    • DATE -> DateType
    • TIMESTAMP -> TimestampType
  2. Be Explicit with Casting and Formatting: Don't rely on Spark to guess.

    • When reading from JDBC, use options to control type inference. For example, with the Oracle driver, you can set the property oracle.jdbc.mapDateToTimestamp to "false" to correctly read DATE columns as dates, not timestamps.
    • In your PySpark code, explicitly cast columns using .withColumn("col_name", col("col_name").cast(DecimalType(p, s))).
    • For dates and timestamps, be hyper-aware of formats and timezones. Use to_timestamp() and to_date() with explicit format strings. A huge source of errors is Spark's default parser failing on an unusual date format from a source system.
  3. Automate Data Validation: You cannot manually check a billion-row table.

    • Simple Checks: After a load, run a job that calculates count(*), sum() on key numeric columns, and hash() of key string columns on both the source and target. The numbers should match exactly.
    • Advanced Frameworks: Use a library like Great Expectations. You can define a suite of expectations for your data (expect_column_values_to_not_be_null, expect_column_mean_to_be_between). You run this after your ETL job, and it produces a detailed validation report. This turns "the data looks wrong" into "column 'balance' failed the non-negative expectation on 1,324 rows."
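
Here is the combined sketch referenced in item 1: an explicit Oracle-to-Spark type map, explicit casts on load, and a cheap count-and-sum reconciliation. The table names and the DecimalType(38, 10) default for unqualified NUMBER are assumptions to adjust after profiling.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, DecimalType, DateType, TimestampType

    # 1. Canonical type map (illustrative); generated load code looks up
    #    the .cast() target for each column here.
    ORACLE_TO_SPARK = {
        "VARCHAR2": StringType(),
        "NUMBER": DecimalType(38, 10),  # unqualified NUMBER: profile before trusting this
        "DATE": DateType(),
        "TIMESTAMP": TimestampType(),
    }

    # 2. Explicit casts -- never let Spark guess on financial columns.
    df_target = (
        spark.table("bronze.invoices_raw")
        .withColumn("INVOICE_AMOUNT", F.col("INVOICE_AMOUNT").cast(DecimalType(18, 2)))
        .withColumn("INVOICE_DATE", F.to_date(F.col("INVOICE_DATE"), "yyyy-MM-dd"))
    )
    df_target.write.format("delta").mode("overwrite").saveAsTable("silver.invoices")

    # 3. Cheap reconciliation: row count and a key column sum must match exactly.
    src = spark.table("bronze.invoices_raw").agg(
        F.count("*").alias("cnt"),
        F.sum(F.col("INVOICE_AMOUNT").cast(DecimalType(18, 2))).alias("amt"),
    ).collect()[0]
    tgt = spark.table("silver.invoices").agg(
        F.count("*").alias("cnt"),
        F.sum("INVOICE_AMOUNT").alias("amt"),
    ).collect()[0]

    assert (src["cnt"], src["amt"]) == (tgt["cnt"], tgt["amt"]), \
        f"Reconciliation failed: source={src}, target={tgt}"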

Security & Compliance: More Than Just Passwords

Why it happens: In ODI, security is often managed within the database (schema grants) and the ODI tool itself (user roles). In the cloud, the security model is completely different and involves multiple layers: cloud provider IAM, storage access policies, and Databricks workspace security. A misstep can either block all access or, worse, expose sensitive data.

How we diagnosed it: Jobs failing with 403 Access Denied errors when trying to read from or write to S3/ADLS. Security teams running automated scanners and flagging that our storage accounts allowed public access or that we were embedding secrets in our notebooks.

Remediation and Best Practices:

  1. Embrace Unity Catalog (UC): If you are starting a new Databricks project, using Unity Catalog is the single best decision you can make. It solves data governance. It provides a standard, SQL-based GRANT/REVOKE model for tables, schemas, and catalogs that feels familiar to database professionals. It also provides built-in data lineage.
  2. Never Hardcode Secrets: This is rule number one. No passwords, tokens, or keys in notebooks.
    • Solution: Use Databricks Secrets. Secret scopes can be Databricks-managed or backed by Azure Key Vault; on AWS you can additionally fetch credentials from AWS Secrets Manager in code. You create a secret scope and then retrieve secrets in your code using dbutils.secrets.get(scope="your_scope", key="your_key"). This redacts the secret from the notebook output and logs (see the sketch after this list).
  3. Use Service Principals and Instance Profiles: Your jobs should not run as "you."
    • In Azure: Create a Service Principal in Azure Active Directory, grant it appropriate access to the ADLS Gen2 storage account (e.g., "Storage Blob Data Contributor"), and configure Databricks to use it for accessing data.
    • In AWS: Use IAM Roles and Instance Profiles. Create a role with a policy that allows access to your S3 buckets, and attach it to your Databricks cluster. The cluster then automatically and securely assumes that role to access data without needing any explicit keys.
  4. Network Isolation: As mentioned before, use VNet/VPC integration to isolate your cluster from the public internet. Use Network Security Groups (NSGs) or Security Groups to control inbound and outbound traffic, locking it down to only what is necessary.
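
Here is the sketch referenced in rule 2: credentials pulled from a secret scope instead of pasted into the notebook. The "migration-secrets" scope and key names are hypothetical; create them up front via the Databricks CLI or an Azure Key Vault-backed scope.

    # Hypothetical secret scope and keys, created beforehand by an admin.
    jdbc_user = dbutils.secrets.get(scope="migration-secrets", key="oracle-user")
    jdbc_password = dbutils.secrets.get(scope="migration-secrets", key="oracle-password")

    # The retrieved values are redacted in notebook output and driver logs.
    df_gl = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//your-db-host:1521/ORCL")
        .option("dbtable", "FINANCE.GL_BALANCES")
        .option("user", jdbc_user)
        .option("password", jdbc_password)
        .load()
    )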

Orchestration & Scheduling: Rebuilding the Command Center

Why it happens: ODI Load Plans and Packages can have incredibly complex orchestration logic: conditional branching (IF...THEN), parallel execution, complex failure handling, and dependencies on external events. Replicating this requires moving from a graphical design paradigm to a configuration-as-code approach.

How we diagnosed it: In an early migration, we tried to chain notebooks together using %run. It was a disaster. If a downstream notebook failed, the error handling was non-existent. There was no easy way to see the status of the entire "plan." SLAs were missed because dependent jobs didn't trigger correctly.

Remediation and Best Practices:

  1. Databricks Workflows are Your New Packages: Workflows are the native solution for orchestrating multi-task pipelines.

    • Task Dependencies: You can easily define a DAG of tasks, where Task B runs only after Task A succeeds.
    • Parameter Passing: Use Task Values (dbutils.jobs.taskValues.set() / .get()) to pass information between tasks, like a record count or a status flag.
    • Conditional Logic: Workflows now support branching through "Run if" dependencies and an "If/else condition" task type. For anything more elaborate, a task can publish a status via task values and downstream tasks can branch on it. It's a bit more work than an ODI package diagram, but achievable.
  2. Standardize Your Notebooks: A notebook that is part of a workflow should be designed like a function (a skeleton is sketched after this list).

    • It should be parameterized using widgets (dbutils.widgets.get()).
    • It should have a clear entry and exit point.
    • It should have robust try...except...finally blocks to handle errors gracefully, log them, and exit with a non-zero status to fail the workflow task correctly.
  3. Know When to Use an External Orchestrator: Databricks Workflows are great for data-centric pipelines within the Databricks ecosystem. If your process involves dependencies outside of Databricks (e.g., waiting for a file on an FTP server, calling a REST API, running a mainframe job), an enterprise orchestrator is a better fit.

    • Airflow: This is the de-facto standard for code-first orchestration. The Airflow Databricks provider makes it trivial to trigger and monitor Databricks jobs. You get the power of Python for complex logic and a rich ecosystem of providers for other systems.
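
Here is the notebook skeleton referenced in item 2: parameters come in through widgets, failures are re-raised so the workflow task fails visibly, and a result is published for downstream tasks. The widget and task-value names are illustrative.

    # Workflow-friendly notebook skeleton (illustrative parameter names).
    dbutils.widgets.text("source_table", "bronze.orders_raw")
    dbutils.widgets.text("target_table", "silver.orders")

    source_table = dbutils.widgets.get("source_table")
    target_table = dbutils.widgets.get("target_table")

    try:
        df = spark.table(source_table)
        df.write.format("delta").mode("overwrite").saveAsTable(target_table)

        # Publish a value that downstream tasks in the same job run can read.
        row_count = spark.table(target_table).count()
        dbutils.jobs.taskValues.set(key="row_count", value=row_count)
    except Exception as exc:
        print(f"Load failed for {source_table}: {exc}")
        raise  # re-raising marks the workflow task as failed
    finally:
        print(f"Finished attempt: {source_table} -> {target_table}")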

Cost & Resource Optimization: The Unseen Iceberg

Why it happens: In the on-premises world, the ODI server is a fixed, sunk cost. In the cloud, every second of compute time has a price. A naive migration can lead to shocking cloud bills. I’ve seen companies migrate jobs that ran for 1 hour on-prem to a 24/7 "all-purpose" cluster in Databricks, resulting in a 24x cost for no reason.

How we diagnosed it: The finance department calling about the monthly Azure/AWS bill. We then used the Databricks usage reports and cluster tags to drill down. We found that a handful of inefficiently-written jobs and poorly configured clusters were responsible for over 60% of our total Databricks cost.

Remediation and Best Practices:

  1. Job Clusters are Your Default: An all-purpose cluster is for interactive development. A Job Cluster is provisioned just-in-time for a scheduled job run and terminates when the job is complete. You only pay for what you use. 95% of your production workloads should run on Job Clusters.
  2. Embrace Autoscaling and Spot Instances: Configure your Job Clusters to autoscale. Set a minimum and maximum number of workers. For non-critical workloads, use Spot Instances (AWS) or Spot VMs (Azure). They can provide savings of up to 70-90% but come with the risk that the instances can be preempted. Databricks has features to handle this gracefully for Spark workloads, making it a powerful cost-saving lever.
  3. Optimize Your Code and Data Layout: Cost optimization is a direct result of performance optimization.
    • An efficient PySpark job that runs in 10 minutes costs less than a sloppy one that runs for an hour.
    • A well-partitioned and Z-Ordered Delta table that allows for data skipping uses far less compute for queries than a giant, unorganized table that requires a full scan every time.
    • OPTIMIZE and VACUUM your Delta tables regularly (see the sketch below). This compacts small files and cleans up data files that are no longer referenced by the table, which improves query performance and reduces storage costs.
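
Here is the maintenance sketch referenced above, typically scheduled as its own small job. The table name and ZORDER column are placeholders, and the VACUUM retention window should match your own time-travel and audit requirements.

    # Routine Delta maintenance, scheduled as its own lightweight job.
    table_name = "silver.orders_classified"  # placeholder

    # Compact small files and co-locate rows on a frequently filtered column.
    spark.sql(f"OPTIMIZE {table_name} ZORDER BY (CUSTOMER_ID)")

    # Remove data files no longer referenced by the table; 168 hours keeps
    # 7 days of time travel -- align with your recovery and audit needs.
    spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")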

Automating the Translation: Code Conversion Strategies

You can't manually rewrite 5,000 ODI mappings. While 100% automated conversion is a myth sold by vendors, you can build tools to automate the conversion of the most common patterns, handling 70-80% of the repetitive work and freeing up your developers to focus on the complex remainder.

The key is to parse the SQL and expressions you extracted from the ODI repository XML files or database tables and translate them into their PySpark DataFrame API equivalents.

Python for SQL to PySpark Translation (Conceptual)

Here’s a conceptual Python function that uses sqlglot to transpile full Oracle queries into Spark SQL, with a small lookup table for common expression functions. This isn't a complete solution, but it shows the methodology.

    # You may need to install this library: pip install sqlglot
    import sqlglot

    def convert_oracle_sql_to_pyspark(sql_expression: str, source_table_name: str) -> str:
        """
        Translates a single Oracle SQL expression or simple query to a PySpark equivalent.
        This is a simplified example of a much more complex translation engine.

        Args:
            sql_expression: The SQL snippet from an ODI mapping (e.g., from a filter or expression).
            source_table_name: The name of the DataFrame to apply the transformation on.

        Returns:
            A string containing the generated PySpark code.
        """
        try:
            # Full SELECT statements can be transpiled directly; bare expressions
            # (filters, column expressions) need a lighter-weight function mapping.
            if not sql_expression.strip().lower().startswith("select"):
                # This is a transformation expression like NVL(COL, 0).
                # We can build a small translation dictionary for common functions.
                func_map = {
                    "NVL": "F.coalesce",
                    "DECODE": "F.when(...).otherwise(...)",  # requires more complex parsing
                    "TO_DATE": "F.to_date",
                    "TRUNC": "F.trunc"
                }
                # This is a very basic replacement; a real implementation needs a parser
                # to rewrite the arguments as well.
                for ora_func, spark_func in func_map.items():
                    if ora_func in sql_expression:
                        return (f"{source_table_name}.withColumn('new_col', {spark_func}(...)) "
                                f"# TODO: Manually complete arguments for: {sql_expression}")

                # If no known function is found, flag the expression for manual translation.
                return f"# Manual translation needed for expression: {sql_expression}"

            # For a full SELECT statement, transpile Oracle SQL to Spark SQL with sqlglot.
            else:
                spark_sql = sqlglot.transpile(sql_expression, read="oracle", write="spark")[0]
                return f'spark.sql("""{spark_sql}""")'

        except Exception as e:
            return f"# ERROR: Failed to translate SQL. Manual review required.\n# SQL: {sql_expression}\n# Error: {e}"

    # --- Example Usage ---

    # Example 1: A filter expression from an ODI mapping component
    odi_filter_sql = "TRUNC(SALE_DATE) > TO_DATE('2023-01-01', 'YYYY-MM-DD') AND STATUS = 'COMPLETED'"
    # A real translator would parse this and build a .filter() call
    print("--- Translating Filter Expression ---")
    print(f"# ODI SQL: {odi_filter_sql}")
    # This would be translated to:
    print("df.filter((F.trunc('SALE_DATE') > F.to_date(F.lit('2023-01-01'), 'yyyy-MM-dd')) & (F.col('STATUS') == 'COMPLETED'))")


    # Example 2: A transformation expression
    odi_expr_sql = "NVL(COMMISSION, 0) * SALE_AMOUNT"
    print("\n--- Translating Transformation Expression ---")
    print(f"# ODI SQL: {odi_expr_sql}")
    # This would be translated to:
    print("df.withColumn('calculated_value', F.coalesce(F.col('COMMISSION'), F.lit(0)) * F.col('SALE_AMOUNT'))")


    # Example 3: A full query from a source qualifier
    odi_full_query = "SELECT CUST_ID, CUST_NAME, UPDATE_TS FROM STG.CUSTOMERS WHERE ACTIVE_FLAG = 'Y'"
    print("\n--- Translating Full Query using a library ---")
    pyspark_translation = convert_oracle_sql_to_pyspark(odi_full_query, "df_customers")
    print(pyspark_translation)

Building a robust version of this requires significant effort, involving a proper SQL parsing library (sqlglot is excellent for this) to walk the abstract syntax tree and map Oracle-specific functions, hints, and syntax to their Databricks equivalents. But even a partial tool that handles the 5-10 most common functions can save thousands of hours.
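
As a concrete starting point, here is a small sketch of that AST walk using sqlglot: it parses an Oracle statement, inventories the tables and function types it touches, and emits the Spark SQL transpilation. The complexity heuristic at the end is a made-up convention for prioritization, not part of sqlglot.

    import sqlglot
    from sqlglot import exp

    odi_sql = "SELECT CUST_ID, NVL(CREDIT_LIMIT, 0) AS CREDIT_LIMIT FROM STG.CUSTOMERS WHERE ACTIVE_FLAG = 'Y'"

    # Parse the Oracle statement into an abstract syntax tree.
    tree = sqlglot.parse_one(odi_sql, read="oracle")

    # Walk the tree to inventory what the statement touches.
    tables = {t.name for t in tree.find_all(exp.Table)}
    functions = {type(f).__name__ for f in tree.find_all(exp.Func)}
    joins = len(list(tree.find_all(exp.Join)))

    # A crude complexity heuristic to feed the prioritization report.
    complexity_score = len(tables) + len(functions) + joins

    print(f"Tables referenced: {tables}")
    print(f"Functions used:    {functions}")
    print(f"Complexity score:  {complexity_score}")

    # Transpile the same statement into Spark SQL.
    print(sqlglot.transpile(odi_sql, read="oracle", write="spark")[0])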

My Battle-Tested Best Practices: What I'd Tell Myself Before Starting

If I could go back to the start of my first ODI migration, here is the advice I would give myself.

  1. Embrace the Paradigm Shift, Don't Fight It. You are not building ODI in the cloud. You are building a modern data platform. Teach your team PySpark, DataFrame thinking, and Delta Lake principles. The goal is not to replicate the old system but to build a better one.
  2. Govern From Day One. Do not say, "We'll add Unity Catalog and Git later." Start with it. Enforce a Git workflow (e.g., dev/staging/prod branches) for your notebooks and libraries from the very first pipeline. Set up your Unity Catalog metastore and define your naming conventions before you write a single CREATE TABLE statement. Retrofitting governance is 10x harder.
  3. Data Validation is Not Optional. Your migration is worthless if the business doesn't trust the data. Automate reconciliation and quality checks as part of every single ETL pipeline. Run them every time. This builds trust and catches errors before they hit a VP's dashboard.
  4. Prioritize by Business Value, Not Technical Ease. Use the inventory and usage analysis to identify the most critical data pipelines. Tackle those. Migrating a hundred small, unused jobs looks like progress, but it delivers zero value. Migrating one complex, critical pipeline that feeds the company's main financial report is a true win.
  5. Invest in Your People. Your biggest asset is your team of data engineers who understand your business logic. They are not ODI developers; they are data professionals. Give them the time, training (Databricks Academy is great), and psychological safety to learn the new stack. A skilled team will solve any technical problem; an unskilled team will create them.
  6. Find a "Vertical Slice" and Migrate It End-to-End. Don't try to migrate all the extraction jobs, then all the transformation jobs. Pick one data product (e.g., "Sales Reporting"). Migrate its entire pipeline from source ingestion to the final reporting tables. This forces you to solve every type of problem (connectivity, performance, orchestration, security) on a small scale. The lessons you learn from this first slice will be invaluable as you scale the migration.

It's More Than a Migration, It's a Modernization

Moving from Oracle Data Integrator to Databricks is a formidable task. It’s fraught with technical challenges, requires a fundamental shift in thinking, and demands a level of rigor that many organizations underestimate.

But after leading several of these migrations, I can tell you the payoff is enormous. You're not just swapping one ETL tool for another. You're dismantling a rigid, siloed, and expensive legacy process and replacing it with a flexible, scalable, and unified platform. The conversations on my teams have shifted from "Can we finish the nightly batch on time?" to "How can we enrich this data with a machine learning model?" or "Can we turn this batch pipeline into a real-time stream?"

That’s the real prize. It's not about just moving the data. It's about unlocking its potential. And with a pragmatic, experience-driven approach, it's a prize that is well within your reach.
