From Pentaho to Databricks: An Architect's Playbook for (Mostly) Automated Migration
"So, is there a button we can press? An automated tool that just... converts all our Pentaho jobs to Databricks?"
I've heard this question, or a variation of it, from nearly every client I've helped move off legacy ETL platforms. They're usually sitting on a mountain of thousands of Pentaho Data Integration (PDI) jobs (.kjb files) and transformations (.ktr files) built up over a decade. The idea of manually rewriting everything is paralyzing.
My answer is always the same. It's not the simple "yes" they hope for, but it's the truth, backed by years of leading these projects in the trenches: there is no magic one-click conversion tool. But you absolutely can, and should, automate significant portions of the migration.
The dream of 100% automation is a siren song that leads to broken projects. The reality is that a successful migration is about intelligent automation—using scripts to handle the repetitive, predictable 80% so your talented engineers can focus on the nuanced, complex 20% that requires human expertise. It's about building a migration factory, not a magic box.
I’ve led these migrations. I've seen what works, what fails spectacularly, and the costly mistakes made along the way. This is the playbook I've built from that experience. We're going to go deep on how you can use targeted automation not just to migrate, but to modernize and build a far more robust, scalable, and manageable data platform on Databricks.
The First Step is Always a Reckoning: Building a Complete Inventory
Before you can even think about converting code, you have to know what you have. A manual inventory is a non-starter. I was on one early project where we tried it with spreadsheets. Three months in, we were still finding "hidden" jobs, nested deep in directories, that were critical to month-end reporting. The project stalled, confidence plummeted, and we had to reboot the entire discovery phase.
Never again.
Your first automation task is to build a script that programmatically dissects your entire Pentaho environment. Pentaho jobs and transformations are just XML files. This is a huge advantage. We can parse them.
Building Your Pentaho Inventory: The Python-Powered Approach
We need to create a comprehensive catalog of every single component. This script's job is to walk through your Pentaho file repositories, parse every .ktr and .kjb, and extract the metadata into a structured format like a CSV or a set of JSON files.
Here’s a production-style Python script that does exactly that. It uses standard libraries and is designed to be robust.
# inventory_generator.py
import csv
import logging
import xml.etree.ElementTree as ET
from pathlib import Path

# --- Configuration ---
PENTAHO_REPO_PATH = "/path/to/your/pentaho/repository"
OUTPUT_CSV_PATH = "./pentaho_inventory.csv"
LOG_FILE = "inventory_generation.log"

# --- Setup Logging ---
logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')


def parse_connection(conn_node):
    """Parses a <connection> node from a KTR/KJB file."""
    return {
        'conn_name': conn_node.findtext('name'),
        'server': conn_node.findtext('server'),
        'type': conn_node.findtext('type'),
        'database': conn_node.findtext('database'),
        'username': conn_node.findtext('username'),
        'port': conn_node.findtext('port')
    }


def parse_ktr(file_path):
    """Parses a .ktr (Transformation) file."""
    inventory = []
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        transform_name = root.findtext('info/name')

        # Get all connections defined in the transformation
        connections = {c['conn_name']: c for c in
                       [parse_connection(conn) for conn in root.findall('connection')]}

        for step in root.findall('step'):
            step_name = step.findtext('name')
            step_type = step.findtext('type')
            step_details = {
                'file_path': str(file_path),
                'file_type': 'ktr',
                'parent_name': transform_name,
                'component_type': 'step',
                'component_name': step_name,
                'component_subtype': step_type,
                'details': '',
                'dependencies': ''
            }

            # Extract specific details for common step types
            if step_type == 'TableInput':
                sql = step.findtext('sql')
                conn_name = step.findtext('connection')
                step_details['details'] = f"SQL Query: {sql[:200]}..." if sql else "No SQL"
                step_details['dependencies'] = f"Connection: {conn_name}"
            elif step_type == 'ExecSQL':
                sql = step.findtext('sql')
                conn_name = step.findtext('connection')
                step_details['details'] = f"SQL Script: {sql[:200]}..." if sql else "No SQL"
                step_details['dependencies'] = f"Connection: {conn_name}"
            elif step_type == 'MergeJoin':
                key1 = [k.text for k in step.findall('keys1/key')]
                key2 = [k.text for k in step.findall('keys2/key')]
                step_details['details'] = f"Join Keys: {key1} vs {key2}"
            elif step_type == 'ScriptValueMod':
                js_script_node = step.find('js_scripts/js_script')
                if js_script_node is not None:
                    script_name = js_script_node.findtext('js_script_name')
                    # Guard against a missing script body so len() never blows up
                    script_body = js_script_node.findtext('js_script') or ''
                    step_details['details'] = f"JavaScript Name: {script_name}, Length: {len(script_body)}"

            inventory.append(step_details)

        # Add connections to inventory
        for conn_name, conn_data in connections.items():
            inventory.append({
                'file_path': str(file_path),
                'file_type': 'ktr',
                'parent_name': transform_name,
                'component_type': 'connection',
                'component_name': conn_name,
                'component_subtype': conn_data['type'],
                'details': f"Server: {conn_data['server']}, DB: {conn_data['database']}",
                'dependencies': ''
            })

    except ET.ParseError as e:
        logging.error(f"Failed to parse XML in {file_path}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred while parsing {file_path}: {e}")

    return inventory


def parse_kjb(file_path):
    """Parses a .kjb (Job) file."""
    inventory = []
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        job_name = root.findtext('name')

        # Get all connections
        connections = {c['conn_name']: c for c in
                       [parse_connection(conn) for conn in root.findall('connection')]}

        for entry in root.findall('entries/entry'):
            entry_name = entry.findtext('name')
            entry_type = entry.findtext('type')
            entry_details = {
                'file_path': str(file_path),
                'file_type': 'kjb',
                'parent_name': job_name,
                'component_type': 'entry',
                'component_name': entry_name,
                'component_subtype': entry_type,
                'details': '',
                'dependencies': ''
            }

            if entry_type == 'TRANS':
                filename = entry.findtext('filename')
                entry_details['dependencies'] = f"Transformation File: {filename}"
            elif entry_type == 'JOB':
                filename = entry.findtext('filename')
                entry_details['dependencies'] = f"Job File: {filename}"
            elif entry_type == 'SQL':
                sql = entry.findtext('sql')
                conn_name = entry.findtext('connection')
                entry_details['details'] = f"SQL Script: {sql[:200]}..." if sql else "No SQL"
                entry_details['dependencies'] = f"Connection: {conn_name}"

            inventory.append(entry_details)

    except ET.ParseError as e:
        logging.error(f"Failed to parse XML in {file_path}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred while parsing {file_path}: {e}")

    return inventory


def main():
    """Main function to generate the inventory."""
    repo_path = Path(PENTAHO_REPO_PATH)
    if not repo_path.is_dir():
        logging.error(f"Provided path is not a directory: {PENTAHO_REPO_PATH}")
        return

    all_components = []
    logging.info("Starting inventory generation...")

    for pentaho_file in repo_path.rglob('*.*'):
        if pentaho_file.suffix == '.ktr':
            logging.info(f"Parsing KTR: {pentaho_file}")
            all_components.extend(parse_ktr(pentaho_file))
        elif pentaho_file.suffix == '.kjb':
            logging.info(f"Parsing KJB: {pentaho_file}")
            all_components.extend(parse_kjb(pentaho_file))

    if not all_components:
        logging.warning("No components found. Check repository path and file extensions.")
        return

    # Write to CSV
    fieldnames = [
        'file_path', 'file_type', 'parent_name', 'component_type',
        'component_name', 'component_subtype', 'details', 'dependencies'
    ]
    try:
        with open(OUTPUT_CSV_PATH, 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(all_components)
        logging.info(f"Inventory successfully written to {OUTPUT_CSV_PATH}")
    except IOError as e:
        logging.error(f"Could not write to output file {OUTPUT_CSV_PATH}: {e}")


if __name__ == "__main__":
    main()
When you run this, you get a single, beautiful CSV file. This file is the foundation of your entire migration project. It's your single source of truth.
Analyzing the Inventory: From Data to Decisions
That CSV is more than just a list. It’s a map of your data lineage, dependencies, and complexity. Now, we use this data to inform our strategy.
- Complexity Scoring: We can write a simple script using Pandas to analyze this CSV. A transformation with 50 steps, including a JavaScript step and three merge joins, is "Complex." A transformation with a TableInput, Select values, and TableOutput is "Simple." (A scoring sketch follows the analysis script below.)
- Dependency Graph: We can build a graph of how jobs call other jobs and transformations. This is critical for identifying execution chains and finding orphaned files that aren't called by anything (candidates for retirement!).
- Readiness Reporting: We can identify patterns. How many jobs use the JavaScript step (a major migration headache)? How many distinct database connections are there? How many jobs read from FTP sites?
Here’s how you might start analyzing this with Python and Pandas to build a dependency view and a readiness report.
# analysis.py
import pandas as pd
import networkx as nx
from pathlib import Path

INVENTORY_CSV = "./pentaho_inventory.csv"


def analyze_inventory():
    df = pd.read_csv(INVENTORY_CSV)

    # --- 1. Readiness Report: Identify Problematic Components ---
    print("--- Migration Readiness Report ---")

    # Count JavaScript steps - these require manual rewrite
    js_steps = df[df['component_subtype'] == 'ScriptValueMod']
    print(f"Total JavaScript steps found: {len(js_steps)}")
    if not js_steps.empty:
        print("Found in files:")
        print(js_steps['file_path'].unique())
    print("-" * 30)

    # List all unique connection types
    conn_types = df[df['component_type'] == 'connection']['component_subtype'].unique()
    print(f"Unique connection types: {list(conn_types)}")
    print("-" * 30)

    # Identify jobs/transforms with a high number of steps
    step_counts = (df[df['component_type'] == 'step']
                   .groupby('file_path')['component_name']
                   .count()
                   .sort_values(ascending=False))
    print("Top 10 most complex transformations (by step count):")
    print(step_counts.head(10))
    print("-" * 30)

    # --- 2. Build a Dependency Graph ---
    G = nx.DiGraph()

    # Add nodes (jobs and transforms)
    for path in df['file_path'].unique():
        G.add_node(Path(path).name)

    # Add edges based on dependencies
    job_entries = df[df['file_type'] == 'kjb']
    for _, row in job_entries.iterrows():
        if row['component_subtype'] in ['TRANS', 'JOB'] and pd.notna(row['dependencies']):
            source_job = Path(row['file_path']).name
            # The dependency is stored like "Transformation File: my_transform.ktr"
            target_file = row['dependencies'].split(': ')[-1]
            if target_file:
                G.add_edge(source_job, target_file)

    print("\n--- Dependency Analysis ---")
    print(f"Total nodes (Jobs/Transforms): {G.number_of_nodes()}")
    print(f"Total dependencies (Edges): {G.number_of_edges()}")

    # Find entry points (nodes with no incoming edges)
    entry_points = [n for n, d in G.in_degree() if d == 0]
    print(f"\nPotential entry point jobs (no other jobs call them): {len(entry_points)}")
    print(entry_points)

    # Find orphans (isolated nodes)
    orphans = list(nx.isolates(G))
    print(f"\nPotential orphan files (not called and call nothing): {len(orphans)}")
    print(orphans)

    # Optional: Visualize the graph (for smaller repos)
    # import matplotlib.pyplot as plt
    # plt.figure(figsize=(12, 12))
    # nx.draw(G, with_labels=True, node_size=500, font_size=8)
    # plt.show()


if __name__ == "__main__":
    analyze_inventory()
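For the complexity scoring mentioned above, the same CSV is enough. Here is a rough sketch of one way to do it; the weights, thresholds, and bucket names are my own illustrative assumptions, not a standard, so tune them to your team's estimates.

# complexity_scoring.py -- a rough heuristic over the inventory CSV; weights are assumptions
import pandas as pd

INVENTORY_CSV = "./pentaho_inventory.csv"

# Assumed per-step weights: JavaScript and joins cost far more to migrate than simple I/O steps
STEP_WEIGHTS = {'ScriptValueMod': 10, 'MergeJoin': 4, 'ExecSQL': 3, 'TableInput': 1, 'TableOutput': 1}


def score_transformations(csv_path=INVENTORY_CSV):
    df = pd.read_csv(csv_path)
    steps = df[df['component_type'] == 'step'].copy()
    # Default weight of 2 for any step type we haven't explicitly classified
    steps['weight'] = steps['component_subtype'].map(STEP_WEIGHTS).fillna(2)
    scores = steps.groupby('file_path')['weight'].sum().sort_values(ascending=False)
    # Bucket into the triage categories used later in the playbook (thresholds are arbitrary)
    buckets = pd.cut(scores, bins=[0, 15, 40, float('inf')], labels=['Simple', 'Medium', 'Complex'])
    return pd.DataFrame({'complexity_score': scores, 'bucket': buckets})


if __name__ == "__main__":
    print(score_transformations().head(20))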
This analysis transforms your raw inventory into an actionable migration plan. You now know which jobs are your high-risk, high-effort candidates and which are your low-hanging fruit.
The Core Challenge: Translating Pentaho Steps to Databricks Constructs
With a solid inventory, we can tackle the main event: code conversion. This is where you map the logic of a Pentaho step to its equivalent in the Databricks world, which is typically PySpark or Spark SQL.
The Translation Matrix: Your Rosetta Stone
The first thing my team does is build a "Translation Matrix." It's a living document that maps every Pentaho step we encounter to a Databricks/Spark pattern. It becomes our guide for both automated scaffolding and manual rewrites.
Here's a condensed version of what that looks like:
| Pentaho Step | Databricks/PySpark Equivalent | Migration Notes |
|---|---|---|
| Table Input | spark.read.format("jdbc").option("url", ...).option("dbtable", ...).load() | Straightforward. Connection details move to Databricks Secrets. Query can often be used directly. |
| CSV File Input | spark.read.format("csv").option("header", "true").schema(my_schema).load("/path/to/file") | Easy. Strongly recommend defining an explicit schema (StructType) instead of relying on inferSchema. |
| Filter Rows | df.filter("column > 100") or df.where(col("column") > 100) | Direct translation. Conditions are easily converted. |
| Select Values | df.select("col1", "col2"), df.withColumnRenamed("old", "new"), df.withColumn("col", col("col").cast("integer")) | Often a combination of select, withColumnRenamed, and cast. Very automatable. |
| Merge Join | df1.join(df2, df1.key == df2.key, "inner") | Direct translation, but performance tuning is key. Consider broadcast joins for small tables. |
| Group By | df.groupBy("key_col").agg(sum("value_col").alias("total_value")) | Direct translation. Pentaho's aggregations map cleanly to Spark's aggregate functions. |
| Execute SQL Script | spark.sql("INSERT INTO ...") or jdbc_df.write.format("jdbc")... | If the SQL acts on a source system, use a JDBC write. If it acts on tables in Databricks, use spark.sql(). |
| JavaScript Step | Manual rewrite in Python or Scala | HIGH RISK. This is the #1 candidate for manual effort. Logic must be understood and rewritten using Spark functions or Python libraries. |
| Set Variables | Databricks widgets (dbutils.widgets.get("param")), function parameters | Pentaho variables map to parameters passed into a Databricks notebook or job. |
| Call Job/Transformation | dbutils.notebook.run("/path/to/notebook", 60, {"param": "value"}) or multi-task Databricks Job | Pentaho's job orchestration maps to Databricks Workflows. More on this later. |
| Formula Step | df.withColumn("new_col", expr("col_a + col_b * 2")) | Spark's expr() function is incredibly powerful and can handle most formulas. |
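To make the matrix concrete, here is what a trivial Pentaho chain (Table Input -> Filter Rows -> Select Values -> Table Output) might look like once translated. This is a sketch that assumes a Databricks notebook context (spark and dbutils available); the table names, columns, and secret scope/key names are placeholders, not from any real job.

from pyspark.sql.functions import col

# Table Input: read from the source database via JDBC (credentials pulled from a hypothetical secret scope)
jdbc_url = dbutils.secrets.get(scope="migration-secrets", key="sales_db_url")
orders_raw = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", "(SELECT * FROM sales.orders) AS src")
              .option("user", dbutils.secrets.get(scope="migration-secrets", key="sales_db_user"))
              .option("password", dbutils.secrets.get(scope="migration-secrets", key="sales_db_password"))
              .load())

# Filter Rows: keep only completed orders
orders_filtered = orders_raw.filter(col("status") == "COMPLETE")

# Select Values: rename and cast, mirroring the Pentaho field mapping
orders_clean = (orders_filtered
                .withColumnRenamed("ord_dt", "order_date")
                .withColumn("amount", col("amount").cast("decimal(18,2)")))

# Table Output: write to a Delta table
orders_clean.write.format("delta").mode("append").saveAsTable("curated.orders")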
Automating the Code Conversion: Building a Scaffolding Generator
You're not building a perfect, bug-free transpiler. You're building a tool that generates a scaffold. It creates a Python script that is 80% of the way there, with clear TODO comments for the parts that require a human touch. This massively accelerates development.
The script below takes our parsed XML structure for a KTR and attempts to generate a PySpark notebook.
# code_generator.py
import json

# A simplified mapping for demonstration
STEP_TO_PYSPARK_TEMPLATE = {
    'TableInput': """
# Step: {step_name}
# Reading from JDBC connection: {connection}
{df_name} = (spark.read
    .format("jdbc")
    .option("url", f"jdbc:your_driver://{{dbutils.secrets.get(scope='{secret_scope}', key='{conn_key}_host')}}:{{dbutils.secrets.get(scope='{secret_scope}', key='{conn_key}_port')}}")
    .option("user", dbutils.secrets.get(scope="{secret_scope}", key="{conn_key}_user"))
    .option("password", dbutils.secrets.get(scope="{secret_scope}", key="{conn_key}_password"))
    .option("dbtable", "({sql}) as subquery")
    .load()
)
""",
    'SelectValues': """
# Step: {step_name} - Renaming/casting fields
# TODO: Verify all column renames, casts, and selections are correct.
{df_name} = {input_df_name}.selectExpr(
    {select_expressions}
)
""",
    'FilterRows': """
# Step: {step_name} - Filtering rows
# TODO: Validate the filter condition.
{df_name} = {input_df_name}.filter("{condition}")
""",
    'MergeJoin': """
# Step: {step_name} - Merge Join
# TODO: Verify join type ('{join_type}') and keys. Consider broadcast for small DFs.
df_right = ...  # This needs to be defined from the other input stream
{df_name} = {input_df_name}.join(df_right, [{join_keys}], "{join_type}")
""",
    'ScriptValueMod': """
# Step: {step_name} - JavaScript Step
# !!! CRITICAL: Manual rewrite required !!!
# Original JS logic was:
# {js_details}
# This step needs to be completely rewritten in PySpark or as a Python UDF.
# {df_name} = {input_df_name}.withColumn(...)  # Placeholder
""",
    'TableOutput': """
# Step: {step_name} - Writing to table {table_name}
# TODO: Confirm write mode ('{write_mode}') and target table schema.
({input_df_name}.write
    .format("delta")  # Or "jdbc"
    .mode("{write_mode}")
    .saveAsTable("{target_schema}.{table_name}")
)
"""
}


def generate_notebook_scaffold(ktr_parsed_json_path):
    """Generates a PySpark script from a parsed KTR JSON representation."""
    with open(ktr_parsed_json_path, 'r') as f:
        # This assumes you've pre-processed the XML into a structured JSON
        # with steps, hops, and details.
        transform_data = json.load(f)

    transform_name = transform_data['name']
    output_filename = f"generated_{transform_name}.py"

    # Simple topological sort of steps (this is a huge simplification for the example).
    # A real implementation needs a proper graph traversal (DFS/BFS).
    steps = sorted(transform_data['steps'], key=lambda x: x.get('order', 999))

    code_lines = [
        f"# AUTOGENERATED NOTEBOOK for Pentaho Transformation: {transform_name}",
        "# Generated by code_generator.py",
        "# PLEASE REVIEW AND REFACTOR MANUALLY. THIS IS A SCAFFOLD, NOT FINAL CODE.",
        "\nfrom pyspark.sql.functions import col, expr",
        "from pyspark.sql.types import *",
        "\n# --- SECRET SCOPE CONFIGURATION ---",
        "# Replace with your Databricks secret scope name",
        "secret_scope = 'your-secrets-scope'\n"
    ]

    # This mapping is crucial: Pentaho stream -> Spark DataFrame
    df_map = {}

    for i, step in enumerate(steps):
        step_type = step['type']
        step_name = step['name']
        # In a real scenario, input_df_name comes from traversing the "hops"
        input_df_name = f"df_{i-1}" if i > 0 else "df_initial"
        df_name = f"df_{i}"
        df_map[step_name] = df_name

        if step_type in STEP_TO_PYSPARK_TEMPLATE:
            # This is where you would populate the template with details from the parsed step.
            # For simplicity, I'm using placeholder logic.
            # A real implementation would parse the XML step details deeply.
            template = STEP_TO_PYSPARK_TEMPLATE[step_type]

            # Example for TableInput
            if step_type == 'TableInput':
                code = template.format(
                    step_name=step_name,
                    df_name=df_name,
                    connection=step.get('connection', 'UNKNOWN'),
                    secret_scope='your-secrets-scope',
                    conn_key=step.get('connection', 'UNKNOWN').lower(),
                    sql=step.get('sql', 'SELECT 1').replace('"', '\\"')
                )
                code_lines.append(code)
            # Example for JS Step
            elif step_type == 'ScriptValueMod':
                code = template.format(
                    step_name=step_name,
                    df_name=df_name,
                    input_df_name=input_df_name,
                    js_details="... (extracted from XML) ..."
                )
                code_lines.append(code)
            # Add other step handlers here...
        else:
            code_lines.append(f"# UNSUPPORTED STEP: {step_name} (Type: {step_type}) - Manual implementation required.")

    with open(output_filename, 'w') as f:
        f.write("\n".join(code_lines))

    print(f"Generated scaffold: {output_filename}")


# To run this, you'd first need to have a detailed JSON representation of your KTR.
# generate_notebook_scaffold('path/to/parsed_transform.json')
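For reference, the generator above assumes a pre-processed JSON roughly shaped like the dict below. The field names (name, steps, order, type, connection, sql) are the ones the script actually reads; everything else here is illustrative, produced by whatever XML-to-JSON step you build on top of the inventory parser.

# Hypothetical shape of parsed_transform.json consumed by generate_notebook_scaffold()
example_parsed_ktr = {
    "name": "load_daily_sales",
    "steps": [
        {"order": 1, "name": "Read orders", "type": "TableInput",
         "connection": "SALES_DW", "sql": "SELECT * FROM sales.orders"},
        {"order": 2, "name": "Clean amounts", "type": "ScriptValueMod"},
        {"order": 3, "name": "Write to staging", "type": "TableOutput"},
    ],
}

# import json; json.dump(example_parsed_ktr, open("parsed_transform.json", "w"), indent=2)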
The output of this script isn't meant to be run immediately. It's a starting point for an engineer. But it saves them hours of boilerplate typing and researching syntax. It turns a 2-day manual conversion task into a 4-hour review-and-refine task. Across thousands of jobs, that's a monumental saving.
The "Gotchas": A Field Guide to Real-World Migration Pains
Theory and simple examples are fine, but migrations are won or lost in the details. Here are the most common and painful issues we've hit on nearly every project, categorized for clarity.
1. Connectivity & Drivers
- The Problem: In Pentaho, connections are often managed via JNDI on the server or in local .kettle property files. A developer might have a specific JDBC driver version on their machine that made it into production. Databricks has a completely different model for secrets and drivers.
- Why It Happens: We once had a migration where jobs kept failing with ClassNotFoundException in Databricks. The Pentaho job used a very old, proprietary Teradata driver. The developer had just dropped the JAR into their local lib folder years ago.
- Diagnosis: The Databricks driver logs are your best friend. The exception java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver was a dead giveaway. We couldn't find this driver configured anywhere in the cluster setup.
- Remediation & Best Practices:
- Centralize Secrets: All connection credentials (usernames, passwords, keys) MUST go into Databricks Secrets, preferably backed by Azure Key Vault or AWS Secrets Manager. No exceptions. This is non-negotiable for security.
- Standardize Drivers: Your inventory script should list all source/target systems. For each one, select a single, official, up-to-date JDBC/ODBC driver.
- Install Drivers on Clusters: Install the chosen drivers onto your Databricks clusters using cluster policies or init scripts. This ensures every job uses the same, tested driver version. Do not let individual notebooks install drivers on the fly.
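As a hedged illustration of the end state: once the Teradata JAR is installed on the cluster (as a cluster library or via an init script) and the credentials live in a secret scope, the read looks like this. The scope name, key names, and table are placeholders.

# Assumes the Teradata JDBC driver has already been installed on the cluster
# and a secret scope named "etl-secrets" exists (both names are hypothetical).
td_host = dbutils.secrets.get(scope="etl-secrets", key="teradata_host")
td_user = dbutils.secrets.get(scope="etl-secrets", key="teradata_user")
td_password = dbutils.secrets.get(scope="etl-secrets", key="teradata_password")

df = (spark.read.format("jdbc")
      .option("url", f"jdbc:teradata://{td_host}")
      .option("driver", "com.teradata.jdbc.TeraDriver")  # pin the driver class explicitly
      .option("dbtable", "finance.invoices")             # placeholder table
      .option("user", td_user)
      .option("password", td_password)
      .load())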
2. Performance & Scalability
- The Problem: You translate a Pentaho job 1-to-1. It worked fine on the Pentaho server, taking 30 minutes. In Databricks, it runs for 3 hours and then fails with an OutOfMemory (OOM) error.
- Why It Happens: This is the most common conceptual failure. Pentaho often encourages row-by-row processing (especially with JavaScript steps or lookups inside a loop). Spark, the engine behind Databricks, is a distributed, set-based processing engine. A for loop that iterates over a million-row DataFrame is a Spark anti-pattern. We saw this on a project where a lookup was performed for every single row against a dimension table. This translated into a million tiny, inefficient tasks instead of a single, optimized broadcast join.
- Diagnosis: The Spark UI is your command center.
- Look at the DAG (Directed Acyclic Graph). Do you see a huge number of stages for a simple task?
- Check the Executors tab. Is one executor at 99% CPU while others are idle? This is data skew.
- Look at the Shuffle Read/Write metrics. Are you shuffling terabytes of data for a simple join?
- Remediation & Best Practices:
- Think in Sets, Not Rows: The developer's mantra must be "How can I express this as a DataFrame transformation?" Avoid UDFs (User-Defined Functions) where possible and use built-in Spark SQL functions.
- Use Broadcast Joins: If you're joining a large fact table with a small dimension table, broadcast() the small DataFrame. This sends a copy to every executor, avoiding a massive, costly shuffle of the large table. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", ...) is your friend.
- Partition Your Data: When writing data to Delta Lake, partition it by columns that you frequently filter on (e.g., date, country). This allows Spark to skip reading entire directories of data (data skipping/predicate pushdown).
- Cache Strategically: If a DataFrame is used multiple times in a notebook, df.cache() can prevent it from being recomputed. But don't cache everything; it consumes memory.
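Here is a minimal sketch of the set-based rewrite for the row-by-row lookup anti-pattern described above. The table and column names are made up for illustration.

from pyspark.sql.functions import broadcast

fact_sales = spark.table("raw.sales")          # large fact table (placeholder name)
dim_product = spark.table("raw.dim_product")   # small dimension table (placeholder name)

# Instead of looking up each row individually, join once and let Spark
# ship the small dimension table to every executor (no shuffle of the big table).
enriched = fact_sales.join(broadcast(dim_product), on="product_id", how="left")

enriched.write.format("delta").mode("overwrite").saveAsTable("curated.sales_enriched")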
3. Schema & Data Type Issues
- The Problem: Pentaho is notoriously forgiving with data types. It might happily read a column containing 123, 456, and hello as a String. Spark, especially when writing to a typed format like Delta, is not. Your job fails with an AnalysisException or a CastException.
- Why It Happens: We had a job that processed sales data. In the source, the product_id was mostly numbers, but occasionally contained legacy IDs like OBS-123. Pentaho processed it all as a String. Our initial Spark job inferred the schema, saw only numbers in the first 1000 rows, and set the type to long. When it hit OBS-123, the job exploded.
- Diagnosis: The error message is usually explicit: cannot cast String to Long. The real diagnosis is looking at the source data profile and realizing inferSchema is a trap.
- Remediation & Best Practices:
- NEVER Use inferSchema in Production: It's great for interactive exploration, but a production pipeline must have a defined, explicit schema.
- Define Schemas with StructType: Use your inventory analysis to map source data types to Spark StructType. This is tedious but non-negotiable for reliability.
- Embrace Delta Lake: Write all your curated data to Delta tables. Delta Lake enforces schema on write, preventing data quality issues from propagating downstream. Use option("mergeSchema", "true") carefully during development, but lock down schemas in production.
- Create a "Quarantine" Zone: For incoming raw data, have a data quality step. Cast columns explicitly. If a cast fails, route the bad record to a "quarantine" table for later analysis instead of failing the whole job.
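A minimal sketch of the explicit-schema-plus-quarantine pattern, assuming a raw CSV landing zone; the path, table names, and columns are illustrative only.

from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType
from pyspark.sql.functions import col

# Explicit schema: product_id stays a String because legacy IDs like "OBS-123" exist
sales_schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("sale_date", DateType(), True),
    StructField("amount", StringType(), True),   # cast after validation, not on read
])

raw = (spark.read.format("csv")
       .option("header", "true")
       .schema(sales_schema)
       .load("/mnt/landing/sales/"))  # placeholder path

# Rows whose amount can't be cast go to quarantine instead of failing the whole job
typed = raw.withColumn("amount_dec", col("amount").cast(DecimalType(18, 2)))
good = typed.filter(col("amount_dec").isNotNull()).drop("amount")
bad = typed.filter(col("amount_dec").isNull())

good.write.format("delta").mode("append").saveAsTable("curated.sales")
bad.write.format("delta").mode("append").saveAsTable("quarantine.sales_bad_records")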
4. Security & Compliance
- The Problem: You run your inventory script and find plaintext passwords in the XML of the connection definitions. Service accounts used by Pentaho have sysadmin or db_owner privileges on source databases.
- Why It Happens: It's the path of least resistance. A developer needs access, so they get a password and hardcode it to get the job done. Over time, this becomes a massive security liability.
- Diagnosis: A simple text search for <password> in your Pentaho repository is often a horrifyingly effective diagnostic tool. So is auditing service account permissions in your source databases.
- Remediation & Best Practices:
- Unity Catalog is the Goal: For new projects, Unity Catalog is the answer. It provides a single place to manage permissions on tables, files, and other assets using standard SQL GRANT/REVOKE commands. It's a paradigm shift from managing security at the workspace or cluster level.
- Use Service Principals (or Managed Identities): Authenticate your Databricks jobs to cloud resources (like ADLS Gen2, S3, Key Vault) using service principals (Azure) or IAM roles (AWS). Never use personal user accounts for production jobs.
- Least Privilege Principle: The service principal your Databricks job uses to read from a source Oracle database should only have SELECT on the specific tables it needs, not the whole schema.
- Credential Passthrough vs. Service Principals: Understand the tradeoffs. Credential passthrough is convenient for interactive use but can be complex for automation. Service principals are generally better for production jobs.
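A small sketch of what least-privilege grants can look like under Unity Catalog, run from a notebook via spark.sql. The catalog, schema, table, and principal names are placeholders.

# Grant only what the pipeline's principal actually needs (names are hypothetical)
spark.sql("GRANT USE CATALOG ON CATALOG main TO `etl_pipeline_sp`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `etl_pipeline_sp`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `etl_pipeline_sp`")

# Strip anything broader that crept in during migration
spark.sql("REVOKE ALL PRIVILEGES ON SCHEMA main.sales FROM `legacy_pentaho_sa`")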
5. Orchestration & Scheduling
- The Problem: A complex Pentaho job (master.kjb) calls ten other jobs and transformations in a specific sequence with conditional logic (e.g., "if file exists, run transform A, otherwise run job B"). Replicating this is not straightforward.
- Why It Happens: Pentaho's visual job designer makes it easy to create complex "spiderwebs" of dependencies. A direct translation isn't always possible or desirable.
- Diagnosis: Your dependency graph, which you built from the inventory script, is the key. It visually shows you the execution chains. We've seen jobs that are 10 levels deep.
- Remediation & Best Practices:
- Databricks Workflows (Multi-Task Jobs): For most job chains, Databricks Workflows are the native solution. You can define a DAG of tasks (notebooks, Python scripts, SQL queries) and set dependencies between them. This covers 80% of use cases.
- dbutils.notebook.run(): For simple, linear chains, one notebook can call another. This is quick but can become hard to manage and debug. Use it sparingly.
- External Orchestrators (Airflow, etc.): When you have dependencies outside of Databricks (e.g., "wait for a message on a Kafka topic, then run a Databricks job, then call a Salesforce API"), a dedicated orchestrator like Apache Airflow is the right tool. Airflow has excellent providers for Databricks.
- Rethink, Don't Just Replicate: A complex Pentaho job is often a sign of technical debt. The migration is an opportunity to simplify. Can ten sequential transformations be combined into a single, more efficient Spark job?
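For orientation, here is a hedged sketch of how a master.kjb-style chain becomes a declarative task graph. The structure follows the shape of a Databricks Jobs API / multi-task job definition, but treat the exact field names as something to verify against the docs for your workspace version; the job name and notebook paths are placeholders.

# Rough shape of a multi-task job definition replacing a master.kjb chain (sketch only)
job_definition = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {
            "task_key": "ingest_orders",
            "notebook_task": {"notebook_path": "/Migration/ingest_orders"},
        },
        {
            "task_key": "build_aggregates",
            "depends_on": [{"task_key": "ingest_orders"}],
            "notebook_task": {"notebook_path": "/Migration/build_aggregates"},
        },
        {
            "task_key": "publish_marts",
            "depends_on": [{"task_key": "build_aggregates"}],
            "notebook_task": {"notebook_path": "/Migration/publish_marts"},
        },
    ],
}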
6. Cost & Resource Optimization
- The Problem: The first cloud bill arrives and the CFO's office is on fire. A developer left a large all-purpose cluster running 24/7 for a job that runs for 15 minutes a day.
- Why It Happens: Teams coming from an on-premises world, where servers are a sunk cost, need a real mental shift to adapt to the pay-as-you-go cloud model. It's easy to forget to terminate a cluster.
- Diagnosis: The bill. Also, the Databricks cluster UI, which shows you which clusters are running, for how long, and by whom.
- Remediation & Best Practices:
- Job Clusters are the Default: For scheduled, automated workloads, always use Job Clusters. These clusters spin up, run your job, and terminate automatically. You only pay for what you use.
- All-Purpose Clusters for a Purpose: All-purpose clusters are for interactive development and analysis by data scientists and analysts. They should have aggressive auto-termination policies (e.g., terminate after 30 minutes of inactivity).
- Use Cluster Policies: Policies are your guardrails. You can enforce rules like: "all clusters must have a 'project' tag," "max DBU/hour is 50," "autotermination must be less than 60 minutes." This prevents runaway costs.
- Leverage Autoscaling and Spot Instances: Configure your clusters to autoscale. For workloads that can tolerate interruption, use Spot/Preemptible instances, which can cut compute costs by 70-90%.
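Here is a hedged sketch of a cluster policy enforcing the guardrails above. Policies are normally stored as JSON; it's shown as a Python dict for readability, and while the attribute paths follow Databricks cluster-policy conventions, verify them against your workspace before rolling this out.

# Illustrative cost-guardrail policy (verify attribute names against your workspace's policy docs)
cost_guardrail_policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "custom_tags.project": {"type": "unlimited", "isOptional": False},  # force a project tag
    "dbus_per_hour": {"type": "range", "maxValue": 50},                 # cap cluster size by DBU/hour
}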
A Battle-Tested Migration Workflow
Putting it all together, here is the 5-step process we've refined over many projects.
1. Automated Discovery & Triage: Run the inventory and analysis scripts. Load the results into a dashboard (Power BI, Tableau, etc.). Classify every job into one of four buckets:
   - Retire (20-30%): Orphaned, redundant, or obsolete jobs. Don't migrate them. This is a huge, immediate win.
   - Simple (40-50%): Basic "read-transform-write" jobs. These are prime candidates for your automated scaffolding generator.
   - Medium (20-30%): Jobs with multiple joins, aggregations, and some conditional logic. Scaffolding helps, but they require significant manual refinement.
   - Complex (5-10%): Jobs with JavaScript, heavy row-by-row logic, or spaghetti-like dependencies. Budget these for a full manual rewrite by senior engineers.
2. Pattern-Based Conversion: Don't convert job-by-job. Identify common ETL patterns from your inventory (e.g., "Daily load from SFTP to Delta," "Hourly aggregation from SQL Server"). Build a gold-standard, parameterized Databricks notebook template for each pattern. This is far more efficient.
3. Automated Scaffolding: Now, run your code generator against the "Simple" and "Medium" jobs. This creates the first draft of the PySpark code and populates your team's backlog with tasks for review.
4. Manual Refinement & Optimization: This is where your engineers shine. They pick up the generated code, fix the TODOs, rewrite the Spark logic to be performant (using joins instead of loops, etc.), add robust error handling and logging, and wrap it in unit tests.
5. Test, Deploy, Govern:
   - Testing: Use pytest with sample data DataFrames for unit tests (a minimal example follows this list). Run integration tests in a dedicated QA environment.
   - Deployment: Use Databricks Asset Bundles or the Databricks Terraform Provider to manage your notebooks, jobs, and cluster configs as code (Infrastructure as Code).
   - Governance: Implement the cluster policies, security controls (Unity Catalog), and cost monitoring we discussed. Set up alerting for job failures.
The Real Answer
So, is there an automated way to convert Pentaho to Databricks?
No, not in the way most people hope. But that's a good thing. A blind 1-to-1 conversion would just move your legacy problems to a more expensive platform. You'd have slow, inefficient, hard-to-maintain jobs running in Databricks, and you'd wonder why you bothered migrating at all.
The successful path is through intelligent, targeted automation. Use scripts to do the heavy lifting of discovery, analysis, and scaffolding. This frees up your most valuable resource—your engineers' brainpower—to focus on what truly matters: not just migrating the code, but modernizing the pipelines. It's about transforming old, brittle, row-by-row processes into scalable, resilient, set-based operations that truly leverage the power of a modern data lakehouse platform. That is a migration worth doing.