From Clicks to Code: My Unfiltered Guide to Migrating Matillion ETL to Databricks
I still remember the meeting. We were sitting with the Head of Data at a fast-growing e-commerce company. He drew a diagram on the whiteboard. On one side, he circled "Matillion" and wrote "Slow, Expensive, Black Box." On the other, he circled "Databricks" and wrote "AI/ML, Analytics, Our Future." In the middle, he drew a messy, tangled arrow and wrote a single word: "HELP."
That messy arrow is where I’ve spent a good chunk of my recent career. I’ve led multiple large-scale migrations from Matillion ETL to the Databricks Data Intelligence Platform. These weren’t theoretical exercises; they were high-stakes, production projects with tight deadlines and skeptical stakeholders. We made mistakes, we learned hard lessons, and we ultimately built robust, scalable, and future-proof data platforms.
This isn't a marketing pitch for Databricks or a hit piece on Matillion. Matillion is a capable tool that gets many teams started. But as data volumes explode and the line between ETL, analytics, and machine learning blurs, many organizations hit a wall. They realize their GUI-based ETL logic is a silo, disconnected from the rest of their data ecosystem.
Migrating from Matillion to Databricks is more than a technical task. It's a fundamental shift in philosophy—from visual, instance-based ETL to a code-first, platform-centric approach. If you’re facing this journey, this is the guide I wish I had when I started. It's the unvarnished truth from the trenches, packed with the automation scripts we built, the problems we hit, and the patterns that actually work in production.
Why We Moved from Matillion to Databricks: The Unvarnished Truth
Every migration project starts with a "why." Understanding the real drivers is crucial because they'll guide your architectural decisions and help you justify the effort to the business. In every project I've led, the reasons boiled down to a few core themes.
-
Scaling Bottlenecks: Matillion runs on a virtual machine (EC2, Azure VM). You scale it by picking a bigger machine. This "vertical scaling" has a hard limit. We had a client whose nightly batch window was creeping from four hours to seven, then nine. They were throwing bigger and more expensive instances at the problem, but the architecture itself was the bottleneck. Databricks, with its ability to spin up ephemeral clusters of hundreds of nodes for a single job, offered a path to horizontal scalability that was simply impossible in Matillion.
-
The Cost Creep: The initial Matillion license might seem reasonable. But then you add the cost of the large, always-on EC2 instance required to run it. Then you need a separate, larger instance for development and staging. As your data grows, those instances get beefier and the costs balloon. We ran the numbers for one client: over 40% of their total migration budget would be recouped in the first year just from eliminating Matillion licensing and oversized, 24/7 compute.
-
The Silo Problem: This was the killer for most. The data engineering team lived inside the Matillion UI. The data science team lived in Databricks notebooks. The analytics team used Databricks SQL. The business logic—the "T" in ETL—was trapped in a proprietary graphical interface, inaccessible and opaque to everyone else. Moving to Databricks meant everyone could finally speak the same language (Python and SQL) on the same platform, using the same data.
-
The Code-First Imperative: You can't properly version control a GUI. While Matillion has integrations with Git, it's clunky. It's not native. You can't run a
diffon a visual job and easily see what changed. You can't write unit tests for a "Join" component. Moving to PySpark and Databricks meant we could finally implement modern software engineering practices: full Git integration, pull requests, automated CI/CD pipelines (using GitHub Actions or Azure DevOps), and programmatic testing withpytest. This was a quantum leap in reliability and governance.
The Initial Shock: Understanding What You're Really Migrating
The first step in any migration is a full inventory. And the first mistake many teams make is underestimating the complexity of a "Matillion job." It’s not a single entity. It’s a nested doll of components, dependencies, and hidden logic.
A typical Matillion setup consists of:
- Orchestration Jobs: These are the conductors. They don't transform data themselves. They call other jobs, run SQL scripts, execute Python or Bash, and manage the workflow logic (e.g., "If Job A succeeds, run Job B and C in parallel"). These map roughly to Databricks Workflows or tasks in an external orchestrator like Airflow.
- Transformation Jobs: These are the workhorses. They contain the actual data transformation logic: reading from sources, joining, filtering, aggregating, and writing to targets. These are the jobs that need to be rewritten in PySpark or Spark SQL.
Within these jobs, you find a zoo of components:
- Data Stagers: Components like "Database Query" that read from a source (PostgreSQL, SQL Server, etc.).
- Transformers: "Join," "Filter," "Aggregate," "Calculator," "Rank," etc.
- Data Loaders: Components that write to a target (e.g., "Snowflake Output," "Redshift Unload").
- Scripting Components: "Python Script" or "Bash Script" components that shell out to run custom code. This is often where complex, non-standard logic hides.
Trying to manually click through hundreds of jobs in the Matillion UI to document this is a recipe for failure. You'll miss dependencies, misinterpret logic, and your estimates will be wildly inaccurate. You must automate.
Building Your Migration Compass: Automated Inventory and Analysis
On my first major Matillion migration, we spent two weeks trying to build this inventory manually. It was a disaster. We quickly realized the only path forward was to use Matillion's own API to extract the truth, programmatically. This became the cornerstone of our migration factory.
Tapping into the Matillion API: Your Source of Truth
Matillion has a REST API that lets you export job definitions as JSON. This is your goldmine. We wrote a Python script to iterate through every project, every version, and every job, pulling down its complete JSON definition.
Here's a simplified version of the script we used to get started. It uses the requests library to connect and pull down the job data.
import requests
import json
import os
# --- Configuration ---
MATILLION_USER = os.environ.get("MATILLION_USER")
MATILLION_PASS = os.environ.get("MATILLION_PASS")
MATILLION_HOST = "https://your-matillion-instance.com"
BASE_URL = f"{MATILLION_HOST}/rest/v1"
# Use a session for connection pooling and authentication
session = requests.Session()
session.auth = (MATILLION_USER, MATILLION_PASS)
def get_projects():
"""Fetches all projects in the Matillion instance."""
url = f"{BASE_URL}/projects"
print(f"Fetching projects from {url}...")
try:
response = session.get(url, verify=False) # In production, use proper SSL verification
response.raise_for_status()
return response.json().get("projects", [])
except requests.exceptions.RequestException as e:
print(f"Error fetching projects: {e}")
return []
def get_jobs_for_project(group_name, project_name):
"""Fetches all jobs for a given project."""
url = f"{BASE_URL}/projects/{group_name}/projects/{project_name}/jobs"
print(f" Fetching jobs for project '{project_name}'...")
try:
response = session.get(url, verify=False)
response.raise_for_status()
return response.json().get("jobs", [])
except requests.exceptions.RequestException as e:
print(f"Error fetching jobs for {project_name}: {e}")
return []
def get_job_export(group_name, project_name, job_name):
"""Exports the full JSON definition of a single job."""
url = f"{BASE_URL}/projects/{group_name}/projects/{project_name}/jobs/{job_name}/export"
# print(f" Exporting job '{job_name}'...")
try:
response = session.get(url, verify=False)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
print(f"Error exporting job {job_name}: {e}")
return None
except json.JSONDecodeError as e:
print(f"Error decoding JSON for job {job_name}: {e}")
return None
if __name__ == "__main__":
all_job_details = []
projects = get_projects()
for project in projects:
group_name = project['groupName']
project_name = project['name']
jobs = get_jobs_for_project(group_name, project_name)
for job in jobs:
job_name = job['name']
job_export_json = get_job_export(group_name, project_name, job_name)
if job_export_json:
# Save the raw JSON for later processing
output_dir = f"./matillion_exports/{group_name}/{project_name}"
os.makedirs(output_dir, exist_ok=True)
with open(f"{output_dir}/{job_name}.json", "w") as f:
json.dump(job_export_json, f, indent=2)
print(f" Successfully exported '{job_name}'")
print("\nInventory export complete.")
Lesson Learned: Don't forget to handle API rate limiting and pagination if your instance is large. We initially hammered the API and got temporarily blocked. Adding a small time.sleep() between requests and properly handling paginated results (start, limit parameters) is essential.
From JSON Chaos to a Structured Inventory
With a folder full of raw JSON files, the next step is to parse them into a structured format, like a CSV or a Pandas DataFrame. This is where you extract the critical information about each component within each job.
The JSON structure is complex and nested. You need to write a parser that can recursively walk through the "components" list and extract details like component type, source/target tables, SQL queries, and dependencies on other jobs.
This script is more involved, but it's the engine of the entire migration. It turns unstructured metadata into actionable intelligence.
import pandas as pd
import json
import os
import glob
def parse_component(component, job_name, project_name, group_name):
"""Parses a single component's JSON to extract key details."""
comp_details = {
"group_name": group_name,
"project_name": project_name,
"job_name": job_name,
"component_id": component.get("id"),
"component_name": component.get("name"),
"component_type": component.get("type"),
"is_disabled": component.get("disabled", False)
}
parameters = {p["name"]: p["value"] for p in component.get("parameters", [])}
# Extract common and important parameters
comp_details["target_table"] = parameters.get("Target Table")
comp_details["source_table"] = parameters.get("Table")
comp_details["sql_query"] = parameters.get("SQL Query")
# Specific logic for different component types
if comp_details["component_type"] == "Orchestration":
if parameters.get("Job"):
comp_details["called_job"] = parameters.get("Job")
if comp_details["component_type"] == "Transformation":
# Recursively parse components within a transformation job
nested_components = []
for nested_comp in component.get("components", []):
nested_components.extend(parse_component(nested_comp, job_name, project_name, group_name))
return nested_components
return [comp_details]
def process_all_exports(export_root_dir):
"""Processes all exported JSON files into a single DataFrame."""
all_components = []
json_files = glob.glob(f"{export_root_dir}/**/*.json", recursive=True)
for file_path in json_files:
try:
with open(file_path, 'r') as f:
job_data = json.load(f)
# Extract metadata from file path or job data
path_parts = file_path.split(os.sep)
group_name = path_parts[-3]
project_name = path_parts[-2]
for job in job_data.get("jobs", []):
job_name = job["name"]
for component in job.get("components", []):
all_components.extend(parse_component(component, job_name, project_name, group_name))
except (json.JSONDecodeError, IndexError) as e:
print(f"Could not process file {file_path}: {e}")
return pd.DataFrame(all_components)
if __name__ == "__main__":
EXPORT_DIR = "./matillion_exports"
print("Parsing all exported job JSON files...")
inventory_df = process_all_exports(EXPORT_DIR)
# Clean up and save to CSV
inventory_df.to_csv("matillion_full_inventory.csv", index=False)
print("\nFull inventory created: matillion_full_inventory.csv")
print(f"Total components found: {len(inventory_df)}")
print("\nComponent Type Distribution:")
print(inventory_df['component_type'].value_counts())
This CSV file is your migration bible. You can now slice and dice it to understand the scope of your work.
The Migration Readiness Report
With matillion_full_inventory.csv, we could finally generate a report that management could understand. We classified each component type by migration complexity and mapped it to a target Databricks pattern. This allowed us to create accurate estimates and a phased migration plan.
An example of our summary report looked like this:
| Matillion Component Type | Count | Migration Complexity | Target Databricks Pattern | Notes |
|---|---|---|---|---|
| Database Query | 1,245 | Low | spark.read.jdbc(...) |
Straightforward. Need to manage credentials and drivers. |
| Join | 830 | Low | PySpark df.join() |
Check join types (inner, left, etc.). Watch for performance. |
| Filter | 2,150 | Low | PySpark df.filter() or .where() |
Simple translation. |
| Aggregate | 621 | Low | PySpark df.groupBy().agg() |
Simple translation. |
| Calculator | 1,890 | Medium | PySpark .withColumn() |
Logic can be complex. Requires manual review of expressions. |
| SQL Script | 212 | High | Refactor to Spark SQL or PySpark | Often contains procedural logic, loops, variables. High refactor risk. |
| Python Script | 88 | Medium-High | Databricks Notebook / Python Wheel | Check library dependencies. May need significant refactoring. |
| Run Orchestration | 450 | Medium | Databricks Workflow Task / Airflow Operator | Defines dependencies. Crucial for building the new orchestration. |
| Snowflake Output | 755 | Low | df.write.format("delta").saveAsTable(...) |
Opportunity to standardize on Delta Lake. |
This report changes the conversation from "How long will it take?" to "We have 212 high-risk SQL scripts that need deep analysis, representing our biggest challenge. Let's staff that team accordingly."
Navigating the Migration Minefield: Common Pitfalls and How We Solved Them
With a solid plan, we moved into execution. This is where the real challenges emerged. Every migration has its own unique "gotchas," but a few categories of problems came up on every single project.
1. Connectivity & Drivers: The First Wall You'll Hit
- Why it occurred: Matillion comes with a nice, curated set of JDBC drivers for common databases. Your team likely just selected "Oracle" or "SQL Server" from a dropdown years ago and forgot about it. In Databricks, you are responsible for providing the drivers for any non-Databricks source. Our first PySpark jobs trying to read from an on-prem Oracle database failed instantly.
- How we diagnosed it: The Spark driver logs were crystal clear, which was a relief. We saw
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver. The Spark executor couldn't find the necessary Java class to connect to Oracle. - Remediation and Best Practices:
- Don't Install on Clusters Manually: Your first instinct might be to upload the JDBC JAR file directly to the cluster UI. This is bad practice and not repeatable.
- Use Init Scripts: The correct way is to store the driver JARs in a central location (like S3 or DBFS) and use a cluster-scoped or global init script to copy them to the driver and executor classpaths (
/databricks/jars/) upon cluster startup. This is automated and version-controllable. - Use Databricks Secrets: The Matillion jobs had credentials stored directly in the environment configuration. A huge security no-no. We immediately moved all usernames, passwords, and connection strings into Databricks Secrets. The PySpark code then reads the secret at runtime using
dbutils.secrets.get(scope="your_scope", key="your_key"). This prevents credentials from ever appearing in notebooks or Git. - Favor Specialized Connectors: For systems like Snowflake or SQL Server, Databricks offers optimized connectors. These are almost always better than the generic JDBC connector, offering better performance (e.g., predicate pushdown) and reliability. We replaced generic JDBC reads with these official connectors wherever possible.
2. Performance & Scalability: When "It Ran in Matillion" Means Nothing
- Why it occurred: This was the most humbling lesson. A job that took 20 minutes in Matillion was taking two hours in our first PySpark version. The reason is a fundamental paradigm difference. Matillion is brilliant at push-down ELT. When you join two tables from the same Snowflake database, Matillion doesn't pull the data into its own memory; it generates a single
CREATE TABLE AS SELECT ... JOIN ...SQL statement and executes it inside Snowflake. It outsources the work. Our initial PySpark approach was tospark.readboth tables into a Spark cluster, perform the join in Spark's memory, and then write the result back. For large tables, this meant moving terabytes of data over the network—a classic performance killer. - How we diagnosed it: The Spark UI became our best friend. We'd open it up for a slow-running job and immediately see the problem. The DAG visualization would show two massive reads followed by a huge "Exchange" (a shuffle), indicating Spark was moving data between nodes to perform the join. We could see data "spilling" to disk because it didn't fit in memory.
- Remediation and Best Practices:
- Think like Matillion (Sometimes): For transformations happening entirely within a single, powerful source system (like Snowflake, Redshift, or a dedicated SQL Server), don't pull the data into Spark. Use Spark to orchestrate the push-down. We rewrote many jobs to use
spark.read.jdbc()with aqueryparameter that contained the full transformation logic. Thedbtableoption can also be a full(SELECT ...)subquery. This executes the entire query on the source database and only pulls the final result set into Spark. - Embrace the Lakehouse Philosophy: For transformations involving multiple sources or complex logic unsuitable for SQL, we brought the data into the Lakehouse first. We created a "Bronze" layer in Delta Lake that was a 1:1 copy of the source data. Then, all subsequent transformations (joins, aggregations) were performed as Spark jobs reading from and writing to Delta tables. This is incredibly fast because Delta Lake is optimized for Spark, and Databricks can leverage statistics and data skipping.
- Tune Your Spark Jobs: We learned to tune our PySpark code. For joins between a large and a small table, we used broadcast joins (
broadcast(small_df)). For joins between two large tables, we made sure they were partitioned on the join keys in their Delta Lake format. We used.repartition()and.coalesce()to control data distribution and avoid creating thousands of tiny files.
- Think like Matillion (Sometimes): For transformations happening entirely within a single, powerful source system (like Snowflake, Redshift, or a dedicated SQL Server), don't pull the data into Spark. Use Spark to orchestrate the push-down. We rewrote many jobs to use
3. Schema & Data Type Discrepancies: The Silent Killers
- Why it occurred: Data types are a minefield. Matillion, JDBC drivers, and Spark all have slightly different ideas about what a
NUMBER(38,0)from Oracle should be. We had a job that ran perfectly for weeks, then failed. The reason? A new source record had a numeric value that, when converted to a SparkDecimalType, exceeded the default precision and was silently turned intoNULL. Another common issue was Matillion's handling of dates and timestamps, which didn't always align with Spark'sTimestampType. - How we diagnosed it: Post-migration data reconciliation is non-negotiable. We built automated validation scripts that ran after each job, comparing row counts and checksums (
sha2(concat(...))) on key columns between the old Matillion-produced table and the new Databricks-produced Delta table. When the checksums didn't match, we knew we had a data integrity issue. We would then sample the mismatched rows and trace the data types column by column. - Remediation and Best Practices:
- NEVER Use
inferSchemain Production: Usingspark.read.option("inferSchema", "true")is convenient for interactive exploration, but it's a cardinal sin for production pipelines. It scans your data and guesses the types, which can change from run to run. - Define Explicit Schemas: The solution is to define the schema explicitly using
StructTypeandStructFieldin PySpark. We created a central Python module that contained the schemas for all our key tables. This enforces consistency and makes data contracts clear.
- NEVER Use
# Example of an explicit schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
customer_schema = StructType([
StructField("customer_id", IntegerType(), False),
StructField("customer_name", StringType(), True),
StructField("signup_date", TimestampType(), True),
...
])
df = spark.read.schema(customer_schema).jdbc(...)
3. **Create a Type Mapping Dictionary:** We maintained a mapping of source system types to Spark types. This was our reference guide. For example: `Oracle.NUMBER(10)` -> `Spark.IntegerType()`, `SQLServer.datetime` -> `Spark.TimestampType()`. This prevented guesswork.
4. Security & Compliance: From Matillion Users to Unity Catalog
- Why it occurred: Matillion has its own security model based on users, groups, and project permissions. This model lives and dies with the Matillion instance. Databricks has a far more powerful and granular security framework, especially with Unity Catalog. We couldn't just "lift and shift" the Matillion permissions; we had to re-architect our entire data governance strategy.
- How we diagnosed it: This was a design-time challenge we anticipated. We mapped out all the Matillion roles and their effective permissions (e.g., "The Finance team can read from the
prod.salesproject and write to theprod.financeproject"). We then had to translate this business intent into a new model. - Remediation and Best Practices:
- Go All-In on Unity Catalog: For any new Databricks deployment, Unity Catalog is the answer. Don't even consider the old Hive metastore. UC provides a unified governance layer across all your workspaces for tables, files, models, and more.
- Map Roles to Groups: We mapped business functions (e.g., "Finance Analysts," "Data Scientists") to Groups in Databricks. We then granted privileges to these groups, not to individual users. This makes onboarding/offboarding a simple matter of adding/removing users from groups.
- Use a Three-Layer Namespace: We adopted the standard
catalog.schema.tablethree-level namespace. For example,prod_catalog.sales_schema.transactions_table. - Grant Privileges Programmatically: We used SQL
GRANTstatements to manage permissions and checked these scripts into Git. This made our security model auditable and repeatable.GRANT SELECT ON TABLE prod_catalog.sales_schema.transactions_table TOfinance-analysts-group; - Leverage Advanced Features: Unity Catalog gave us capabilities we never had in Matillion. We started implementing column-level masking for PII data and row-level security to ensure sales reps could only see data for their own region. This moved security from the application layer deep into the data platform itself, a massive win for governance.
5. Orchestration & Scheduling: Replacing the Matillion Scheduler
- Why it occurred: Matillion has a built-in scheduler that teams use to create complex dependency chains and schedules (e.g., "Run this job every hour, but only after the main nightly batch completes"). This logic needs to be replicated.
- How we diagnosed it: Our automated inventory script was key here. We specifically parsed the "Run Orchestration" components to build a dependency graph of all jobs. We also manually reviewed the "Schedules" tab in Matillion to capture the cron expressions and triggers.
- Remediation and Best Practices:
- Databricks Workflows are the Default: For 90% of use cases, Databricks Workflows is the best and simplest option. It allows you to create multi-task jobs with dependencies, visually or via the API/Terraform. We translated each Matillion Orchestration Job into a single Databricks Workflow, where each "Run Orchestration" component became a separate task in the workflow.
- When to Use an External Orchestrator: We had one client who was already a heavy user of Airflow. In their case, it made no sense to introduce a second orchestrator. We used the official Databricks provider for Airflow and created DAGs that triggered our Databricks jobs. The rule of thumb: If your company already has a strategic, enterprise-wide orchestrator, use it. If not, Databricks Workflows is powerful, integrated, and easier to manage.
- Parameterize Everything: We made our Databricks jobs highly parameterizable. Instead of hardcoding dates, we passed them in as job parameters. This allowed our Databricks Workflow to calculate the date (e.g.,
{{current_date}}) and pass it to the PySpark script, making backfills and reruns trivial.
6. Cost & Resource Optimization: Escaping the "Always-On" Trap
- Why it occurred: The Matillion mindset is "one big server that's always on." The client's initial request was, "Give us a big Databricks cluster that can handle our peak load." This would have been a financial catastrophe. The beauty of Databricks is elastic, on-demand compute.
- How we diagnosed it: This was a proactive educational effort. We created a cost projection showing the price of a large, 24/7 interactive cluster versus the cost of using smaller, ephemeral "job clusters" that only exist for the duration of a job run. The difference was staggering—often a 70-80% cost reduction.
- Remediation and Best Practices:
- Job Clusters are Your Friend: For every scheduled production pipeline, we configured it to run on a new Job Cluster. This cluster turns on when the job starts and shuts down the moment it finishes. You only pay for the seconds you use.
- Use Cluster Policies: To prevent developers from spinning up monster clusters for trivial tasks, we implemented Cluster Policies. These act as guardrails, limiting the instance types, number of nodes, and auto-termination times that users can select.
- Leverage Spot Instances: For non-critical, fault-tolerant batch jobs (most ETL fits this description), we configured job clusters to use a high percentage of Spot/preemptible instances. This can cut compute costs by another 50-70%. Databricks handles the management of these instances gracefully.
- Right-Size Your Clusters: We started with a reasonable guess for cluster size and then used the Spark UI and Databricks' own monitoring tools to right-size them. If a job only uses 20% of the cluster's memory, we shrink the cluster for the next run. This iterative tuning is key to cost efficiency.
From GUI Clicks to Production PySpark: The Art of Code Conversion
This is the heart of the migration. It’s a process of translation, not just transliteration. You’re not converting components one-to-one; you’re understanding the intent of the visual workflow and expressing it in robust, maintainable PySpark code.
Let’s walk through a concrete example.
The Matillion Job:
Imagine a common Transformation Job:
1. Read from a raw.orders table in PostgreSQL.
2. Read from a raw.customers table, also in PostgreSQL.
3. Join them on customer_id.
4. Filter for orders from the last year.
5. Calculate a new column order_value (quantity * unit_price).
6. Write the final, cleansed data to a analytics.fact_orders table in your data warehouse (now Databricks/Delta Lake).
The PySpark Translation:
Here’s how we would structure the production PySpark job to replace this. Notice the use of functions, configuration, explicit schemas, and writing to a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, lit, current_timestamp
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType, DecimalType
# In a real project, this config would come from a YAML file or Databricks job parameters
db_config = {
"url": "jdbc:postgresql://your-postgres-host:5432/yourdb",
"driver": "org.postgresql.Driver",
"user": dbutils.secrets.get(scope="jdbc_creds", key="postgres_user"),
"password": dbutils.secrets.get(scope="jdbc_creds", key="postgres_password")
}
# --- Define Schemas Explicitly ---
orders_schema = StructType([
StructField("order_id", IntegerType(), False),
StructField("customer_id", IntegerType(), True),
StructField("order_date", TimestampType(), True),
StructField("quantity", IntegerType(), True),
StructField("unit_price", DecimalType(10, 2), True),
])
customers_schema = StructType([
StructField("customer_id", IntegerType(), False),
StructField("customer_name", StringType(), True),
StructField("country", StringType(), True),
])
def read_source_table(spark, table_name, schema):
"""Reads a table from the source PostgreSQL database using JDBC."""
print(f"Reading source table: {table_name}")
return spark.read \
.format("jdbc") \
.option("url", db_config["url"]) \
.option("dbtable", table_name) \
.option("user", db_config["user"]) \
.option("password", db_config["password"]) \
.option("driver", db_config["driver"]) \
.schema(schema) \
.load()
def transform_data(orders_df, customers_df):
"""Applies the core business logic for transformation."""
print("Applying transformations: join, filter, calculate...")
current_year = year(current_timestamp())
transformed_df = orders_df \
.join(customers_df, on="customer_id", how="inner") \
.filter(year(col("order_date")) == current_year - 1) \
.withColumn("order_value", col("quantity") * col("unit_price")) \
.withColumn("load_timestamp", lit(current_timestamp()))
# Select and rename columns for the final output
final_df = transformed_df.select(
col("order_id"),
col("order_date"),
col("customer_id"),
col("customer_name"),
col("country"),
col("order_value"),
col("load_timestamp")
)
return final_df
def write_to_delta(df, target_table_name):
"""Writes the DataFrame to a Delta table, overwriting existing data."""
print(f"Writing data to Delta table: {target_table_name}")
df.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") # Good practice for evolving schemas
.saveAsTable(target_table_name)
if __name__ == "__main__":
spark = SparkSession.builder.appName("Matillion_Migration_Fact_Orders").getOrCreate()
dbutils = None # For local testing; in Databricks, dbutils is pre-defined
if "dbutils" in locals() or "dbutils" in globals():
from pyspark.dbutils import DBUtils
dbutils = DBUtils(spark)
try:
# 1. Read (Extract)
orders_df = read_source_table(spark, "raw.orders", orders_schema)
customers_df = read_source_table(spark, "raw.customers", customers_schema)
# 2. Transform
fact_orders_df = transform_data(orders_df, customers_df)
# 3. Load
write_to_delta(fact_orders_df, "analytics.fact_orders")
print("\nJob completed successfully.")
except Exception as e:
print(f"An error occurred: {e}")
# In a real job, you'd have more robust error logging/alerting here
raise e
This code is testable, version-controllable, and scalable. It perfectly illustrates the shift from a visual tool to a software engineering discipline.
My Field Guide to a Successful Matillion to Databricks Migration
After several of these projects, my team developed a set of core principles. If you follow these, you'll avoid 90% of the pain.
-
Automate Everything, Especially Discovery. I cannot stress this enough. Your ability to programmatically inventory your Matillion environment is the single biggest predictor of a successful migration. Manual analysis is slow, error-prone, and will doom your project from the start.
-
Don't "Lift and Shift" Logic; "Re-Platform" It. Resist the urge to make PySpark behave exactly like Matillion's push-down engine. Understand the paradigms. Use push-down when it makes sense (pre-staging), but embrace the power of Spark and Delta Lake for complex, multi-source transformations.
-
Embrace the Lakehouse from Day One. All your targets should be Delta tables. Don't write to Parquet files. The ACID transactions, time travel (for debugging and rollbacks), and performance optimizations like Z-Ordering provided by Delta Lake are not optional nice-to-haves; they are foundational to a modern data platform.
-
Standardize Your Code and Project Structure. Create a standard template for your PySpark jobs. It should include folders for schemas, transformations, and tests. It should have a standardized way to handle configuration (we loved using YAML files parsed with Hydra). This consistency pays massive dividends in long-term maintainability.
-
Test, Test, and Reconcile. Build automated data validation into your CI/CD pipeline. After a job runs in a test environment, a script should automatically run that compares row counts and column checksums against a baseline run from Matillion. No human should have to manually check a
SELECT COUNT(*)ever again. This builds trust with business users and provides objective proof of success. -
Think in Workflows, Not Jobs. Design your pipelines in Databricks using multi-task Workflows. This makes dependencies, error handling, and parallel execution explicit and manageable. A visual graph of your workflow is a powerful tool for communication and debugging.
-
Govern from the Start with Unity Catalog. Don't treat security and governance as an afterthought. Design your catalog structure, schemas, and permissions as a core part of the migration planning phase. It's much harder to retrofit governance after the fact.
The View from the Other Side: Was It Worth It?
Absolutely. Without question.
The initial migration is a heavy lift. It requires new skills, new ways of thinking, and a significant investment in planning and automation. But the payoff is immense.
For the e-commerce client I mentioned at the start, their nightly batch window shrank from over nine hours to just 75 minutes. Their total data platform costs were reduced by over 60% by eliminating licenses and moving to ephemeral, spot-based job clusters.
But the biggest win wasn't technical. It was cultural. Their data scientists could now access and understand the exact logic used to build the tables they were using for modeling. Their analytics team could build dashboards directly on top of fresh, reliable data in the lakehouse. The data engineering team was no longer a bottleneck maintaining a black box; they were enablers, building a shared, transparent, and scalable platform.
Migrating from Matillion to Databricks is not just a tool swap. It's a strategic modernization effort that breaks down silos and sets the foundation for a truly data-driven organization. The journey is challenging, but the destination is transformative.