Ensuring Data Quality After DataStage to Databricks Migration

Published on: December 27, 2025 09:27 AM

An Experience-Driven Technical Guide

1. Introduction: The Post-Migration "Trust Gap"

I’ve personally led dozens of enterprise-scale ETL migrations from IBM DataStage to Databricks. The pattern is often the same: engineering teams focus intensely on performance, cost, and ensuring the new Spark jobs run to completion. The migration is declared a success. Then, a few weeks later, the phone calls start. A finance analyst reports that revenue numbers are off by 0.1%. A marketing lead claims customer counts have drifted.

This is the "Trust Gap," and it's the silent killer of migration ROI. The common misconception is, “If the pipeline runs without error, the data must be correct.” This is fundamentally false. Migrating from a proprietary, GUI-based tool like DataStage to a code-first, distributed engine like Spark introduces subtle, systemic risks. Differences in data type handling, NULL interpretation, and floating-point precision can create a thousand tiny cuts that bleed business trust.

This guide provides a pragmatic, structured approach to prevent, detect, and resolve these issues. It’s not about achieving perfect data; it’s about managing risk, ensuring functional correctness for critical data, and maintaining the business’s faith in its most valuable asset.

2. My Core Data Quality Principles

Before we get into tactics, we need a philosophy. In my experience, successful post-migration quality programs are built on four pillars:

  1. Risk-Based, Critical-First: You cannot validate every field in every table. It’s a waste of resources. We identify what matters most to the business—the critical data elements (CDEs) that drive key reports, regulatory filings, or machine learning models—and we focus our deepest validation efforts there.
  2. Automation-First, Manual as Fallback: Human-led validation is slow, error-prone, and unscalable. Our primary goal is to build an automated validation framework. Manual checks by business analysts are the final, targeted confirmation for the most critical KPIs, not the first line of defense.
  3. Continuous Monitoring, Not One-Time Validation: Validation is a point-in-time activity to sign off on the migration. Data quality is a continuous process. We must build monitors that run alongside our pipelines, detecting drift and regressions long after the migration project is "done."
  4. Business-Focused Metrics: Row counts are a starting point, not the destination. True quality is measured in terms of business impact. We focus on metrics like "variance in total accounts receivable" or "change in 30-day active user count." These resonate with stakeholders and directly map to business value.

3. Pre-Migration Preparation: Establishing the Baseline

You cannot measure a deviation without a starting point. The single biggest mistake teams make is failing to benchmark quality before the migration begins.

Actionable Steps:

  • Profile Critical Data in DataStage: Before you even start rewriting the first job, run profiling jobs in your source environment. For your critical tables, you need to establish a baseline (see the profiling sketch at the end of this section). This includes:
    • Row counts and key aggregate metrics (SUM, AVG, MIN, MAX on numeric columns).
    • COUNT(DISTINCT) on key identifiers.
    • NULL counts for every critical column.
    • Frequency distributions for categorical fields (e.g., status codes, country codes).
  • Create a Criticality Matrix: Not all data is created equal. Work with business analysts and data stewards to classify pipelines and tables.
    • Tier 1: Regulatory reporting, core financial data (e.g., General Ledger, Customer Master, transaction tables). Validation depth required: full reconciliation.
    • Tier 2: Key operational dashboards, internal BI (e.g., sales activity, web analytics, supply chain metrics). Validation depth required: aggregate and sample validation.
    • Tier 3: Exploratory analytics, non-critical staging (e.g., log files, temporary datasets). Validation depth required: row counts and basic checks.
  • Document Known Issues: Every legacy DataStage environment has skeletons. A particular job that drops records under weird conditions. A known issue with how currency is rounded. Document these! Otherwise, you will spend weeks chasing "bugs" in Databricks that were actually pre-existing conditions. This "known issue log" is your get-out-of-jail-free card.
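
The same battery of baseline metrics will later be re-computed on the Databricks side, so it helps to express it once in code. Here is a minimal PySpark sketch of the profiling described in the first bullet above, assuming the data (a source extract or the migrated table) is readable by Spark; the table name, column lists, and output tables are illustrative placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative inputs: one critical table staged where Spark can read it.
    table_name = "staging.customer_master"          # hypothetical table
    numeric_cols = ["balance", "credit_limit"]      # hypothetical numeric CDEs
    key_cols = ["customer_id"]                      # hypothetical identifiers
    categorical_cols = ["status_code", "country_code"]

    df = spark.table(table_name)

    # Row count, SUM/AVG/MIN/MAX per numeric column, COUNT(DISTINCT) per key,
    # and NULL counts per column -- the baseline metrics listed above.
    aggs = [F.count(F.lit(1)).alias("row_count")]
    for c in numeric_cols:
        aggs += [F.sum(c).alias(f"sum_{c}"), F.avg(c).alias(f"avg_{c}"),
                 F.min(c).alias(f"min_{c}"), F.max(c).alias(f"max_{c}")]
    for c in key_cols:
        aggs.append(F.countDistinct(c).alias(f"distinct_{c}"))
    for c in df.columns:
        aggs.append(F.count(F.when(F.col(c).isNull(), 1)).alias(f"nulls_{c}"))

    df.agg(*aggs).write.mode("overwrite").saveAsTable("dq.baseline_customer_master")

    # Frequency distributions for categorical fields, persisted for later comparison.
    for c in categorical_cols:
        (df.groupBy(c).count()
           .write.mode("overwrite")
           .saveAsTable(f"dq.baseline_freq_customer_master_{c}"))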

4. Post-Migration Validation: The Reconciliation Playbook

Once a pipeline is migrated, we execute a structured validation process. This is not a single check; it’s a multi-layered approach. We run the legacy DataStage job and the new Databricks job over the same input data for a specific period (e.g., one day, one week) and compare the outputs.

Each layer below lists the method plus the key gotchas and DataStage-versus-Databricks nuances I see most often.

  • L1: Counts & Aggregates
    • Method: COUNT(*), SUM(metric), AVG(metric). This is your first, fastest check. These should match 100% for Tier 1 data.
    • NULL handling: Spark's avg() ignores NULLs, as does DataStage. But count(column) in Spark ignores NULLs, while count(*) doesn't. Be precise.
    • Floating-point precision: SUM() on a FLOAT or DOUBLE may have tiny differences. Validate using a tolerance (e.g., < 0.001%).
  • L2: Field-Level Comparison
    • Method: Generate checksums (e.g., md5(col1 || col2 || ...)) on a row-by-row basis for a sample of records. For critical tables, you may need to do this for the entire dataset.
    • Data types: DataStage's Char(10) becomes Spark's string. Does your old data have trailing spaces? Spark's trim() might behave differently than DataStage's Trim() function.
    • Timestamps: DataStage often has implicit timezone handling; Spark is much more explicit. You will find discrepancies here. Standardize on UTC early.
  • L3: Transformation Logic
    • Method: Isolate a business rule (e.g., customer status derivation, commission calculation). Export the input and output for this specific logic from both systems for a sample set and compare them in a spreadsheet or a dedicated script.
    • Joins on NULL keys: DataStage and Spark SQL treat joins on NULL keys differently by default. In Spark, a join key that IS NULL will not match the other side. This is a massive source of dropped records.
    • Lookup stages: A DataStage lookup with "fail on no match" is an inner join; "continue on no match" is a left join. Ensure this logic is perfectly replicated.

I recommend building a simple PySpark utility that takes two tables (source and target) and a configuration file, then runs this battery of L1 and L2 tests, outputting a reconciliation report.
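
As a sketch of what that utility might look like, here are two helpers covering the L1 and L2 layers. The function names, tolerance, and hashing approach are illustrative choices, not a finished framework.

    from pyspark.sql import DataFrame, functions as F

    def reconcile(src: DataFrame, tgt: DataFrame, numeric_cols, tolerance=1e-5):
        """L1 checks: row counts plus aggregate diffs within a relative tolerance."""
        report = {"src_rows": src.count(), "tgt_rows": tgt.count()}
        report["row_count_match"] = report["src_rows"] == report["tgt_rows"]
        for c in numeric_cols:
            s = src.agg(F.sum(c)).first()[0] or 0.0
            t = tgt.agg(F.sum(c)).first()[0] or 0.0
            denom = abs(s) if s else 1.0
            report[f"sum_{c}_rel_diff"] = abs(s - t) / denom
            report[f"sum_{c}_ok"] = report[f"sum_{c}_rel_diff"] <= tolerance
        return report

    def row_hash_diff(src: DataFrame, tgt: DataFrame, cols):
        """L2 check: md5 over the selected columns, row by row.
        concat_ws is used so a NULL column does not NULL out the whole hash."""
        def hashed(df):
            return df.select(F.md5(F.concat_ws("||",
                *[F.col(c).cast("string") for c in cols])).alias("row_hash"))
        missing_in_tgt = hashed(src).subtract(hashed(tgt)).count()
        extra_in_tgt = hashed(tgt).subtract(hashed(src)).count()
        return missing_in_tgt, extra_in_tgt

In practice you would load the legacy DataStage output into Databricks for the comparison window, run both helpers against the new pipeline's output for the same business dates, and withhold sign-off on Tier 1 tables until every check comes back clean or within tolerance.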

5. Continuous Data Quality Monitoring

Validation proves the migration was successful. Monitoring ensures it stays successful. This is where Databricks as a platform shines.

  • Delta Lake Quality Checks: This is your first line of defense, built directly into your data architecture. Use CHECK constraints on your Delta tables to enforce invariants.
    ALTER TABLE transactions ADD CONSTRAINT valid_order_date CHECK (order_date > '2020-01-01');
    ALTER TABLE customers ADD CONSTRAINT no_null_email CHECK (email IS NOT NULL);

These are simple but powerful. A pipeline run that violates these constraints will fail immediately, preventing bad data from propagating downstream.
  • Automated Anomaly Detection: For every critical table, we track key metrics over time in a "metrics" table.

    • Volume: Row count, data size.
    • Schema: Number of columns (detecting schema drift).
    • Values: NULL percentage, MIN, MAX, AVG.
      A simple scheduled notebook can then query this metrics table and look for statistical anomalies (e.g., today's row count is more than 3 standard deviations from the 30-day moving average). This catches issues like a source file not being delivered or a filter logic change causing a massive drop in records. A sketch of such a notebook follows at the end of this section.
  • Regression Monitoring with Expectations: For more complex rules, we use frameworks like Great Expectations or build our own in PySpark. After each pipeline run, we assert a series of "expectations" about the output data:

    • expect_column_values_to_not_be_null('customer_id')
    • expect_column_values_to_be_in_set('order_status', ['SHIPPED', 'PROCESSING', 'CANCELLED'])
    • expect_table_row_count_to_be_between(1000000, 1200000)
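
A minimal sketch of the scheduled anomaly-check notebook described above, assuming a metrics table with one row per table per day; the table name and column names are illustrative.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical metrics table populated after each pipeline run:
    # columns table_name, run_date, row_count.
    metrics = spark.table("dq.table_metrics")

    # 30-day moving average and standard deviation per table, excluding today's value.
    w = Window.partitionBy("table_name").orderBy("run_date").rowsBetween(-30, -1)
    scored = (metrics
              .withColumn("avg_30d", F.avg("row_count").over(w))
              .withColumn("std_30d", F.stddev("row_count").over(w))
              .withColumn("is_anomaly",
                          F.abs(F.col("row_count") - F.col("avg_30d")) > 3 * F.col("std_30d")))

    # Today's anomalies become the alert payload (Slack, PagerDuty, etc.).
    anomalies = scored.filter((F.col("run_date") == F.current_date()) & F.col("is_anomaly"))
    anomalies.show(truncate=False)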

6. Exception Handling & Issue Resolution

Alerts are useless without a process. When a monitor fires, what happens next?

  1. Categorize: Is this a Critical Failure (e.g., transaction data missing) or a Non-Critical Warning (e.g., NULL count in a descriptive field increased slightly)?
  2. Action:
    • Critical: Fail the pipeline immediately. Prevent the corrupted data from reaching business users. Send a high-priority alert (e.g., PagerDuty) to the on-call data engineer.
    • Non-Critical: Log the issue, quarantine the affected records if possible, and create a ticket in a system like Jira. The pipeline can continue, but the issue must be investigated. (A sketch of this split is shown after this list.)
  3. Root Cause Analysis (RCA): The alert must contain enough context to start an investigation: the pipeline name, the quality check that failed, the metric value, and the expected value. Good logging within your Spark jobs is non-negotiable.
  4. Coordination: A dedicated #data-quality Slack channel is invaluable. It brings together platform engineers, data engineers, and data analysts to triage issues quickly.
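
To make the critical-versus-non-critical split concrete, here is a hedged sketch of how a single check result might be handled inside a job. The quarantine table and the commented-out notify() hook are hypothetical stand-ins for your own alerting integration.

    from pyspark.sql import DataFrame

    def handle_check(df: DataFrame, bad_rows: DataFrame, check_name: str, critical: bool) -> DataFrame:
        """Fail hard on critical checks; quarantine and continue on non-critical ones."""
        bad_count = bad_rows.count()
        if bad_count == 0:
            return df

        if critical:
            # Critical failure: stop the pipeline before bad data reaches business users.
            raise RuntimeError(f"Critical data quality check '{check_name}' failed for {bad_count} rows")

        # Non-critical: quarantine the offending records and continue with the clean subset.
        bad_rows.write.mode("append").saveAsTable("dq.quarantine")   # hypothetical quarantine table
        # notify(channel="#data-quality", check=check_name, count=bad_count)  # hypothetical alert hook
        return df.exceptAll(bad_rows)   # exceptAll keeps legitimate duplicate rows, unlike subtract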

7. Data Governance & Lineage

Data quality is a core pillar of data governance. The Databricks ecosystem, particularly with Unity Catalog, is a game-changer here.

  • Capture Lineage: When a quality issue is detected in a downstream table, Unity Catalog's automated lineage allows us to trace it back through every transformation to the source. This reduces RCA time from days to minutes. I've seen this firsthand—it's one of the most compelling arguments for adopting Unity Catalog.
  • Define Ownership: Every critical table must have a designated "Data Owner" (a business stakeholder) and a "Data Steward" (a technical or analytical lead). When a quality issue arises, we know exactly who to talk to about the business impact and remediation priority.
  • Integrate into Dashboards: We build a central Data Quality Dashboard. It doesn't just show "passed/failed" tests. It shows the business-focused metrics: trend lines of critical KPIs, data freshness, and the number of open quality issues per business domain. This makes quality visible to leadership.

8. Automation Frameworks & Tooling

To do this at scale, you must build reusable components.

  • PySpark Validation Library: We built an internal Python library with functions like compare_schemas(), compare_row_counts(), get_aggregate_diffs() (a sketch of one such helper follows this list). Each migration project then uses this library, ensuring consistency and saving immense effort.
  • Third-Party Tools: While you can build it all yourself, tools like dbt, Great Expectations, or Soda provide robust frameworks for defining and running tests. My take: use them. They solve the "engine" problem, allowing your team to focus on defining the right tests rather than building the infrastructure to run them. dbt, in particular, integrates testing directly into the transformation workflow, which is a powerful paradigm.
  • Templated Notebooks: Create template notebooks for anomaly detection and reconciliation that can be parameterized for different pipelines. This allows for rapid onboarding of new pipelines into the quality framework.
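
As an example of the kind of reusable helper meant here, a minimal compare_schemas() might look like the following; the return format is just one reasonable choice.

    from pyspark.sql import DataFrame

    def compare_schemas(src: DataFrame, tgt: DataFrame) -> dict:
        """Report columns missing on either side and columns whose types differ."""
        src_types = {f.name.lower(): f.dataType.simpleString() for f in src.schema.fields}
        tgt_types = {f.name.lower(): f.dataType.simpleString() for f in tgt.schema.fields}
        return {
            "missing_in_target": sorted(set(src_types) - set(tgt_types)),
            "unexpected_in_target": sorted(set(tgt_types) - set(src_types)),
            "type_mismatches": {
                c: (src_types[c], tgt_types[c])
                for c in set(src_types) & set(tgt_types)
                if src_types[c] != tgt_types[c]
            },
        }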

9. Common Post-Migration Pitfalls I've Seen

  • Ignoring Incremental vs. Historical Loads: The logic for a full historical load is often simpler than the daily incremental logic (MERGE, SCD2). Teams validate the historical load and assume the incremental will work. You must validate both paths independently (see the sketch after this list).
  • Treating Validation as a One-Time "Sign-Off": The biggest mistake. The source systems, APIs, and business rules will change. Without continuous monitoring, your data quality will silently degrade over time.
  • Overlooking Business Rule Enforcement: Migrating the JOIN and FILTER logic is easy. Migrating the complex, multi-stage business rule derivations buried in DataStage Transformer stages is hard. This requires dedicated workshops with business analysts to deconstruct and validate the logic, case by case.
  • Reactive vs. Proactive Monitoring: Don't wait for the business to tell you the data is wrong. Proactive monitoring means you identify the issue, fix it, and communicate the resolution before the CFO even knows there was a problem. This builds trust.
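
One pragmatic way to validate the incremental path, sketched below: periodically rebuild a small slice of the table from scratch using the historical-load logic and diff it against the incrementally maintained table for the same slice. The table names and the chosen date are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    business_date = "2025-12-01"   # illustrative slice to cross-check

    # Slice maintained by the daily incremental MERGE / SCD2 logic.
    incremental = spark.table("gold.orders").filter(F.col("order_date") == business_date)

    # The same slice rebuilt from raw data with the historical-load logic.
    rebuilt = spark.table("gold.orders_full_rebuild").filter(F.col("order_date") == business_date)

    # Rows present in one result but not the other point at a divergence
    # between the two code paths (assumes identical schemas on both tables).
    print(business_date,
          incremental.exceptAll(rebuilt).count(),
          rebuilt.exceptAll(incremental).count())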

10. Lessons Learned & Best Practices

  • Perfection is the Enemy of Good: Don't aim for 100% reconciliation on every non-critical field. Accept small, documented floating-point differences. Focus your energy on what moves the needle for the business.
  • Empower the Business: Give analysts read-only access to compare the old and new tables. They often find edge cases your automated checks will miss. We once had an analyst find a discrepancy in how we handled customers with non-ASCII characters that saved us from a major reporting error.
  • Quality is a Feature, Not a Task: Budget for it. Include data quality user stories in every sprint. If you treat it as an afterthought, it will be the first thing cut when deadlines loom. A good rule of thumb is to allocate 15-20% of your engineering effort during a migration to building validation and monitoring.

11. Executive Summary / CXO Takeaways

For my peers in the C-suite, the message is simple. A DataStage to Databricks migration is not just an IT infrastructure project; it's a transfer of business trust from one system to another.

  • Trust is Your Primary Metric: The project is not successful when the jobs run; it's successful when the business trusts the output. This trust is fragile and must be earned through rigorous, transparent validation.
  • Demand Proactive Monitoring: Ask your teams, "How will we know the data is wrong before our business users do?" If they don't have a clear answer involving automated anomaly detection and continuous checks, the plan is incomplete.
  • Enforce Accountability: Every critical dataset needs a clear business owner. Data quality is not just an engineering problem; it's a shared business responsibility. Your role is to foster a culture where this ownership is accepted and rewarded.
  • Invest in Governance: Tools like Unity Catalog are not "nice-to-haves." They are critical infrastructure for providing the auditability, lineage, and security that regulators and stakeholders demand in a modern data platform. The investment pays for itself in reduced risk and increased operational efficiency.

This isn't just about technical correctness. It's about ensuring that the strategic investment you've made in a modern data platform delivers reliable, trustworthy data that can be used with confidence to run the business.