Data Reconciliation Strategies Post DataStage to Databricks Migration
1. Introduction: The Crucible of Migration Success
A DataStage to Databricks migration is fundamentally an exercise in trust. You are not just moving code; you are moving the logic that generates the numbers on which your business runs. Reconciliation is the process that proves this trust is warranted.
In my 15+ years with DataStage and a decade validating migrations to platforms like Databricks, I have seen more projects stumble at this final hurdle than any other. Why? Because reconciliation is the most underestimated migration risk. Teams spend months converting complex job logic but budget only days for validation, treating it as a simple check-box exercise.
This leads to the most dangerous assumption in data migration: that matching row counts are sufficient proof of success. They are not. A row count match tells you nothing about the correctness of the data within those rows. I have personally witnessed go-live failures where row counts matched perfectly, but a subtle change in rounding logic led to a seven-figure discrepancy in financial reporting. Reconciliation is not about proving that data moved; it's about proving that it arrived correctly and means the same thing.
2. Reconciliation Principles: Our Guiding Philosophy
Before writing a single line of validation code, we must align on a set of core principles. These are the non-negotiable tenets that separate a robust, auditable process from a rushed, high-risk one.
- Risk-Based Reconciliation (Critical Data First): You cannot reconcile everything with the same level of intensity. We work with the business to classify data assets. A table feeding a regulatory report to the SEC demands field-level, checksum-based reconciliation. A table for internal analytics may only require aggregate checks. Prioritize based on business impact.
- Deterministic, Repeatable, and Auditable Checks: Every reconciliation check must produce the same result given the same inputs. The process must be logged, and the results must be stored as immutable evidence. An auditor or a business stakeholder should be able to look at a report from three weeks ago and understand exactly what was compared, what passed, and what failed. "I ran a query and it looked okay" is not an acceptable answer.
- Automation-First, Manual-Last: Manual reconciliation does not scale and is prone to human error. Spreadsheets are where data integrity goes to die. Our goal is to build an automated framework that can compare millions or billions of records and surface only the exceptions that require human analysis. Manual effort should be reserved for investigating these flagged discrepancies, not for finding them.
- Business Outcome–Driven Validation: The ultimate test is not whether the Databricks table matches the DataStage target table. It's whether the business report, the analytical model, or the customer invoice generated from the Databricks data matches the one generated from DataStage. We always tie our technical validation back to a tangible business outcome.
3. Reconciliation Scope Definition
Defining the scope is a critical exercise in balancing risk, effort, and cost. Getting this wrong leads to either boiling the ocean or missing a critical failure point.
| Scope Decision | What to Reconcile | What NOT to Reconcile (and Why) |
|---|---|---|
| Objects | Final target tables (facts, dimensions); critical aggregated tables; published files (e.g., CSV/Parquet for downstream consumers); Key Performance Indicators (KPIs) in BI reports | Transient staging tables (their correctness is validated by the final target); raw data landing zones (unless transformations are minimal); caching or temporary tables |
| Depth | Full reconciliation: every row and every column. Mandatory for high-risk data (finance, regulatory); feasible for small-to-medium datasets. Sampled reconciliation: statistically significant, randomly selected subsets of data, used for very large tables (>1B rows) where full comparison is cost-prohibitive. Caution: sampling accepts a level of risk. | Full reconciliation on massive, low-risk historical tables. The cost-benefit is often not there; use aggregates and sampling instead. |
My rule of thumb: If a regulator or a CFO would ask about it, it gets full, field-level reconciliation. Everything else is a candidate for risk-based scoping.
4. Types of Reconciliation Strategies
We deploy a layered approach, moving from coarse-grained to fine-grained checks. Each strategy answers a different question and addresses a different type of risk.
| Strategy | Description | Best For | Technical Effort | Confidence Level |
|---|---|---|---|---|
| Record Count Validation | `COUNT(*)` on source vs. target. | Quick sanity check. Catches major data drops or duplications. | Low | Very Low |
| Aggregate & Metric-Based | `SUM(numeric_col)`, `AVG(numeric_col)`, `MIN`/`MAX(date_col)` on source vs. target. | Validating key business metrics, financial totals, date ranges. | Low | Medium |
| Hash / Checksum-Based | Concatenate key columns, hash the string (SHA-2), and compare hashes. | The workhorse of large-scale reconciliation. Proves row-level data integrity without moving massive data volumes. | Medium | High |
| Field-Level Comparison | Full outer join on primary keys and compare every non-key attribute. | The gold standard. Pinpointing specific field discrepancies in critical, small-to-medium datasets. | High | Very High |
| Control Totals & Balancing | Summing a key metric at the source, subtracting debits and adding credits through the pipeline, and ensuring the final target sum matches. | Validating financial transaction pipelines and ledger systems. | High | Very High |
PySpark Hash Reconciliation Example:
This is a pattern we use constantly. It's scalable and definitive.
```python
from pyspark.sql.functions import sha2, concat_ws, coalesce, col, lit

# Assume 'ds_df' is from the DataStage target and 'dbx_df' is from the Databricks
# target, and that both carry a primary key column named 'primary_key'.
recon_cols = ['customer_id', 'product_sku', 'order_date', 'sale_amount', 'quantity']

def with_row_hash(df):
    # COALESCE NULLs to a sentinel before concatenating: concat_ws silently
    # skips NULLs, which would make ('x', NULL) and (NULL, 'x') hash identically.
    # Sorting the column names fixes the concatenation order on both sides.
    safe_cols = [coalesce(col(c).cast("string"), lit("__NULL__")) for c in sorted(recon_cols)]
    return df.withColumn("row_hash", sha2(concat_ws("||", *safe_cols), 256))

ds_df_hashed = with_row_hash(ds_df).select("primary_key", "row_hash")
dbx_df_hashed = with_row_hash(dbx_df).select("primary_key", "row_hash")

# Rows present in DataStage but missing (or different) in Databricks, and vice versa
missing_in_dbx = ds_df_hashed.subtract(dbx_df_hashed)
missing_in_ds = dbx_df_hashed.subtract(ds_df_hashed)

if missing_in_dbx.count() == 0 and missing_in_ds.count() == 0:
    print("SUCCESS: Datasets are fully reconciled.")
else:
    print("FAILURE: Discrepancies found.")
    missing_in_dbx.show()
    missing_in_ds.show()
```
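Hash comparison tells you which rows differ; the field-level comparison strategy from the table above tells you which columns, and produces the `ds_value` vs. `dbx_value` detail that investigators need. Here is a minimal pure-Python sketch of that per-column diff logic (in practice this runs as a PySpark full outer join on the primary key; the row shape and names are illustrative):

```python
def field_level_diff(ds_rows, dbx_rows, key, compare_cols):
    """Compare two row sets keyed by `key`; return one record per column mismatch."""
    ds_by_key = {r[key]: r for r in ds_rows}
    dbx_by_key = {r[key]: r for r in dbx_rows}
    mismatches = []
    for k in sorted(ds_by_key.keys() | dbx_by_key.keys()):
        ds_row, dbx_row = ds_by_key.get(k), dbx_by_key.get(k)
        if ds_row is None or dbx_row is None:
            # Row exists on only one side -- the full-outer-join "orphan" case
            mismatches.append({"key": k, "column": "<row>", "ds_value": ds_row, "dbx_value": dbx_row})
            continue
        for c in compare_cols:
            if ds_row.get(c) != dbx_row.get(c):
                mismatches.append({"key": k, "column": c, "ds_value": ds_row.get(c), "dbx_value": dbx_row.get(c)})
    return mismatches

ds = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
dbx = [{"id": 1, "amt": 10}, {"id": 2, "amt": 21}]
print(field_level_diff(ds, dbx, "id", ["amt"]))
```

The output pinpoints key 2, column `amt`, with both values side by side, which is exactly the shape the technical detail report in section 9 needs.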
5. Handling Platform Differences
This is where experience pays dividends. DataStage and Databricks (via Spark SQL) do not behave identically. Ignoring these differences will lead to false reconciliation failures that erode stakeholder confidence.
| Difference | DataStage Behavior | Databricks (Spark) Behavior | Reconciliation Strategy |
|---|---|---|---|
| Data Types | Timestamp often includes timezone; decimal precision/scale is rigidly enforced. | `TIMESTAMP` vs. `TIMESTAMP_NTZ`; decimals can have different default precision; `FLOAT` vs. `DOUBLE`. | Explicitly cast both sides to a common, well-defined type (e.g., `DECIMAL(38,18)` or string) before hashing or comparison. Never assume types are equivalent. |
| Floating Point | Mainframe or proprietary floating-point math. | IEEE 754 standard. | Never do an exact equality check on floats or doubles. Compare within a small tolerance (e.g., `abs(ds.col - dbx.col) < 0.0001`), or better yet, cast to a fixed-point decimal for comparison. |
| NULL Handling | NULL handling can be inconsistent; empty strings (`''`) are often treated as NULL in lookups or joins. | NULL is strictly NULL; an empty string is not NULL; NULL does not equal NULL. | Use `COALESCE` or `NVL` on both sides to convert NULLs to a standard, non-data value (e.g., `'__NULL__'`) before hashing. Ensure empty strings are handled identically in both source and target comparison logic. |
| Ordering | DataStage partitioning and ordering can be very specific and persistent. | Spark ordering is not guaranteed unless an explicit `ORDER BY` is used. | Do not rely on row order; base all reconciliation on primary keys. For hash generation, always sort the column names alphabetically before concatenation so that `concat(a,b)` vs. `concat(b,a)` cannot cause false failures. |
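One way to neutralize these differences before hashing is to render every value as a canonical string: NULLs become a sentinel, numerics become fixed-point decimals, and column order is fixed by sorting. The sketch below illustrates the idea in plain Python; the sentinel token and six-decimal precision are illustrative choices, not fixed standards:

```python
import hashlib
from decimal import Decimal

NULL_SENTINEL = "__NULL__"

def canonical(value):
    """Render a value as a canonical string so both platforms hash identically."""
    if value is None:
        return NULL_SENTINEL  # NULL on either side maps to the same token
    if isinstance(value, (float, Decimal)):
        # Fixed-point rendering absorbs IEEE 754 float noise (0.1 + 0.2 vs 0.3)
        return str(Decimal(str(value)).quantize(Decimal("0.000001")))
    return str(value)

def row_hash(row, cols):
    # Sorted column names fix the concatenation order on both platforms
    payload = "||".join(canonical(row[c]) for c in sorted(cols))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(canonical(0.1 + 0.2) == canonical(0.3))  # True: float noise absorbed
```

The same normalization must run on both sides; a canonicalization applied to only one platform manufactures false mismatches instead of eliminating them.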
6. Transformation Logic Reconciliation
Validating transformations is more complex than validating static tables.
- Joins, Lookups, Filters: The most effective technique here is the "anti-join". We join the DataStage and Databricks outputs on their business key. The rows that fail to join represent data that was incorrectly included or excluded. We then perform a second anti-join in the opposite direction.
- Conditional Logic & Derivations (`CASE` statements, etc.): Field-level comparison is essential here. You must isolate derived columns and compare them directly. This is often where subtle bugs related to NULL handling or implicit data type casting hide.
- Late-Arriving & Duplicate Records: Ensure the logic for handling duplicates (e.g., `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)`) and late-arriving data is identical. Reconcile the output of these window functions directly to prove the same records are being kept or discarded.
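The two-directional anti-join described in the first bullet reduces to a keyed set difference. A minimal sketch in plain Python (in Spark this would typically be two `left_anti` joins on the business key; the function and field names are illustrative):

```python
def anti_join_keys(ds_keys, dbx_keys):
    """Return business keys incorrectly dropped from, or added to, the Databricks output."""
    ds, dbx = set(ds_keys), set(dbx_keys)
    return {
        "dropped_in_dbx": sorted(ds - dbx),  # rows DataStage kept but Databricks excluded
        "extra_in_dbx": sorted(dbx - ds),    # rows Databricks kept but DataStage excluded
    }

result = anti_join_keys(["A", "B", "C"], ["B", "C", "D"])
print(result)  # {'dropped_in_dbx': ['A'], 'extra_in_dbx': ['D']}
```

An empty result in both directions proves the join, lookup, and filter logic admitted exactly the same population of records on both platforms.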
7. Parallel Run Reconciliation
The core of UAT is the parallel run. For a defined period (e.g., two weeks), you run both the legacy DataStage jobs and the new Databricks pipelines against the same production source data feed.
- Historical Load Reconciliation: First, reconcile a large, static, historical dataset. This validates the core logic against a fixed target. This must be 100% successful before proceeding.
- Daily Incremental Reconciliation: Each day, you reconcile the delta processed by both systems. This is where your automated framework is critical. You're not just comparing the daily slice; you're comparing the final state of the target tables after the daily load.
- Managing Reconciliation Drift: Over the parallel run period, discrepancies can "drift" due to out-of-band updates in the source system or bug fixes applied to one pipeline but not the other. It is crucial to:
- Have a daily "break/fix" meeting to analyze discrepancies.
- Categorize each break: Is it a Databricks bug? A DataStage bug? An expected platform difference? A genuine source data issue?
- Maintain a "reconciliation exception log" that documents and approves any expected differences. This log becomes a key part of the sign-off artifact.
8. Automation Framework Design
A manual approach is a recipe for failure. We build a metadata-driven framework, typically in Python/PySpark, that does the heavy lifting.
High-Level Architecture:
- Configuration Layer (YAML/JSON): Defines what to reconcile.
```yaml
reconciliation_jobs:
  - name: daily_sales_recon
    source_datastage:
      table: "EDW.F_SALES_DAILY"
      connection: "ds_jdbc_conn"
    target_databricks:
      table: "gold.f_sales_daily"
    recon_type: "hash"
    primary_keys: ["sale_id", "sale_date"]
    # Columns to hash. Order doesn't matter here.
    hashed_columns: ["customer_id", "product_id", "store_id", "sale_amount_usd"]
    aggregate_checks:
      - expression: "SUM(sale_amount_usd)"
        tolerance_percent: 0.001
      - expression: "COUNT(DISTINCT customer_id)"
```
- Extractor Engine: Reads the config and pulls data from both DataStage (via JDBC) and Databricks into DataFrames.
- Comparison Engine: Applies the chosen strategy (e.g., generates hashes, calculates aggregates, performs `subtract()`).
- Exception Handler & Reporter: Captures discrepancies, categorizes them, compares against thresholds, and generates a clear report (HTML/PDF) showing summary results and samples of failing records.
This framework turns a multi-day manual effort into a scheduled, repeatable, and auditable job.
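The glue between the configuration layer and the comparison engine is a simple dispatch loop: each parsed job record selects a strategy implementation by its `recon_type`. A sketch under assumed names (the registry and engine functions are illustrative, not part of any fixed framework API; the config dict mirrors the parsed YAML above):

```python
def run_hash_recon(job):
    # Placeholder for the hash comparison engine
    return f"hash recon on {job['target_databricks']['table']} keyed by {job['primary_keys']}"

def run_aggregate_recon(job):
    # Placeholder for the aggregate/metric comparison engine
    return f"aggregate recon on {job['target_databricks']['table']}"

STRATEGIES = {"hash": run_hash_recon, "aggregate": run_aggregate_recon}

def run_jobs(config):
    """Dispatch each configured reconciliation job to its comparison engine."""
    results = []
    for job in config["reconciliation_jobs"]:
        engine = STRATEGIES[job["recon_type"]]  # KeyError = fail fast on unknown strategies
        results.append((job["name"], engine(job)))
    return results
```

Because the dispatch is metadata-driven, onboarding a new table to reconciliation is a configuration change, not a code change.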
9. Exception Handling & Reporting
A reconciliation failure is not a defect until proven otherwise. The first step is triage.
Mismatch Categories:
- Expected Difference: Known platform behavior (e.g., floating-point tolerance). Automatically pass if within the defined threshold.
- Logic Defect: A genuine bug in the migrated Databricks code. This is a high-priority fix.
- Source Data Anomaly: The source data changed between the DataStage and Databricks runs. This requires investigation but is not a migration defect.
- Legacy System Defect: The reconciliation process uncovers a pre-existing bug in the DataStage job that has gone unnoticed for years. This is a common and valuable side effect.
Thresholds: Business stakeholders must sign off on acceptance criteria. For a financial report, the threshold is zero. For a non-critical marketing segmentation table, a 0.1% mismatch in a non-key metric might be acceptable for go-live, with the discrepancy logged as technical debt.
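A threshold check like this can be expressed as a small helper mirroring the `tolerance_percent` field from the section 8 configuration (the helper itself is an illustrative sketch, with the zero-tolerance financial case handled by exact equality):

```python
def within_tolerance(ds_value, dbx_value, tolerance_percent=0.0):
    """True if the relative difference between two aggregates is within the agreed threshold."""
    if ds_value == dbx_value:
        return True  # exact match always passes, including the zero-tolerance case
    if ds_value == 0:
        return False  # avoid division by zero; any difference from a zero baseline fails
    diff_percent = abs(ds_value - dbx_value) / abs(ds_value) * 100
    return diff_percent <= tolerance_percent

print(within_tolerance(100.0, 100.1, 0.0))    # False: zero tolerance, any drift fails
print(within_tolerance(100.0, 100.0005, 0.001))  # True: 0.0005% is inside 0.001%
```

The key governance point is that `tolerance_percent` comes from signed-off configuration, not from an engineer's judgment call at 2 a.m. during cutover.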
Reporting: The report must be two-sided:
* Executive Summary: Green/Red status for each major data asset. Total records compared, total matched, total mismatched. Aggregate values and their percentage difference.
* Technical Detail: For failures, provide primary keys of mismatched records, the specific columns that differ, and their values from both systems (ds_value vs. dbx_value).
10. Common Reconciliation Pitfalls
I've seen these mistakes derail projects. Heed these warnings.
- Over-trusting Source System Data: Assuming the DataStage output is "truth." Reconciliation often reveals long-standing bugs in the legacy system. Be prepared to have difficult conversations about what "correct" really means.
- Ignoring Boundary and Edge Cases: Test the first day of the month, year-end processing, leap years, null primary keys, records with all optional fields null, etc. These are where simple logic breaks.
- Manual Reconciliation at Scale: Trying to compare millions of rows in Excel. It is guaranteed to fail, miss errors, and exhaust your team. Automate.
- Rushing Reconciliation to Meet Cutover Dates: Business pressure to "just go-live" is immense. A failed reconciliation is a red flag. Halting a go-live to fix a data integrity issue is far cheaper than cleaning up corrupted production data and dealing with the business fallout.
11. Real-World Reconciliation Failures & Lessons Learned
Failure 1: The Silent Float-to-Double Promotion
* Symptom: Post-go-live, the quarterly finance report was off by ~$1.2M. Row counts and SUMs on integer columns matched perfectly.
* Root Cause: A DataStage job used a Decimal type for transaction amounts. The migrated Databricks job, through a series of implicit casts during joins, converted it to a Double (a floating-point number). The aggregation of millions of transactions introduced minute precision errors that snowballed.
* What Would Have Caught It Earlier: A hash-based reconciliation on the raw, un-aggregated data, with all numeric types explicitly cast to a high-precision Decimal(38, 18) before hashing. Aggregate checks alone were insufficient.
Failure 2: The Timezone Mismatch
* Symptom: During UAT, daily sales reconciliation showed a small, consistent number of "missing" transactions in Databricks every day, and a similar number of "extra" transactions.
* Root Cause: The source database timestamp was timezone-naive. DataStage's JDBC driver interpreted it as UTC. The Databricks Spark driver interpreted it as the local timezone of the cluster. This caused transactions occurring between midnight UTC and midnight local time to be assigned to different days.
* What Would Have Caught It Earlier: Not just reconciling CAST(ts AS DATE), but reconciling the full timestamp string. Or, even better, establishing a clear policy on timezones and explicitly setting the session timezone (spark.sql.session.timeZone) in Databricks to match the legacy system's assumption.
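The day-shift mechanism behind this failure is easy to reproduce: the same instant yields different calendar dates depending on which timezone is assumed when the date is derived. A small demonstration using Python's standard `zoneinfo` (the specific timezones are illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A transaction timestamp shortly after midnight UTC
as_utc = datetime(2024, 1, 1, 0, 30, 0, tzinfo=ZoneInfo("UTC"))

# The same instant viewed from a cluster-local timezone
as_local = as_utc.astimezone(ZoneInfo("America/New_York"))

print(as_utc.date())    # 2024-01-01
print(as_local.date())  # 2023-12-31 -- same instant, different business day
```

Every transaction landing in that window between the two midnights gets bucketed into different days by the two pipelines, which is exactly the small, consistent daily discrepancy seen during UAT.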
12. Governance, Audit & Sign-Off
Reconciliation is not complete until it is signed off.
- Evidence for Audit: Every reconciliation run must generate an immutable report. This report, along with the configuration file used to run it and the exception log, forms the audit trail. This is your proof that due diligence was performed.
- Who Signs Off:
- Data Owners / Business Stakeholders: Sign off on the business outcome. They attest that the KPIs and reports generated from Databricks are correct and match their expectations based on the reconciliation reports.
- Migration Lead / Lead Architect: Sign off on the technical completeness, attesting that the defined scope was covered and all high-priority discrepancies have been resolved.
- Decommissioning DataStage: The signed-off reconciliation artifacts are the primary justification for decommissioning the legacy DataStage jobs. They are the proof that the new system is a viable, trustworthy replacement.
13. Executive Summary / CXO Takeaways
For leaders responsible for risk and business outcomes, here are the critical takeaways.
- Reconciliation Protects Business Credibility: A data migration failure is not a technical problem; it's a business problem. Incorrect invoices, flawed regulatory reports, or bad strategic decisions erode trust with customers, regulators, and the market. Robust reconciliation is your primary insurance policy against this.
- Row Counts Are a Vanity Metric: Do not accept "the counts match" as proof of success. It's a low-value check that creates a false sense of security.
- Insist on an Automated, Auditable Framework: Ask your team: "Show me the automated reconciliation report. Show me the trend of discrepancies over the parallel run. Show me the sign-off for the accepted differences." If they can't produce these artifacts, the process is not mature enough for go-live.
- Do Not Approve Cutover Until Reconciliation is Green: Resist the pressure to meet a deadline by cutting corners on validation. The cost of a post-go-live data integrity failure is an order of magnitude higher than the cost of a delayed cutover. Reconciliation is the final quality gate; treat it with the seriousness it deserves.