Problem Statement
The insurer’s data modernization programme relied on complex ETL pipelines migrating policy, claims, and finance data from legacy mainframe sources into a cloud warehouse. Each release required manual lineage mapping, hand-authored test cases, and multi-week validation cycles to reconcile business rules. Complex transformation logic (aggregations, conditional mappings, unit conversions) and inconsistent metadata made it difficult to detect semantic mismatches early, leading to defect leakage and extended production cutover timelines.
Solution
We implemented a generative AI-assisted ETL automation layer to accelerate mapping validation and testing while maintaining regulatory traceability.
Solution Components
- Metadata ingestion: Parsed COBOL copybooks, flat-file layouts, and target schema definitions into a structured metadata store with versioning (a minimal parsing-and-triage sketch follows this list).
- Generative mapping agent: Produced draft source-to-target lineage, transformation logic, and business rule rationales using prompt templates anchored to glossary standards. Confidence scoring flagged low-certainty suggestions for human review.
- Test generation agent: Auto-generated SQL and Python-based test suites (field-level, aggregate, referential) across source and target systems, including synthetic data packs for edge cases.
- Validation agent: Executed generated test cases, reconciled discrepancies, and prepared anomaly reports with suggested remediation steps. A secondary agent produced patch scripts for common issues (type truncation, null handling, lookup mismatches). A reconciliation-test sketch also follows the list.
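To make the ingestion and mapping components concrete, here is a minimal sketch of copybook parsing and confidence-based triage. The copybook fragment, field names, the `MappingSuggestion` shape, and `REVIEW_THRESHOLD` are illustrative assumptions, not the programme's actual schema or scoring model.

```python
import re
from dataclasses import dataclass

# Hypothetical copybook fragment; field names and PIC clauses are illustrative only.
COPYBOOK = """\
05  POL-NUMBER    PIC X(10).
05  PREMIUM-AMT   PIC 9(7)V99.
05  EFFECTIVE-DT  PIC 9(8).
"""

# Matches "<level> <name> PIC <picture-clause>." on a single line.
PIC_RE = re.compile(r"^\s*\d+\s+(?P<name>[A-Z0-9-]+)\s+PIC\s+(?P<pic>[X9SV()\d]+)\.")

def parse_copybook(text: str) -> list[dict]:
    """Extract field name and PIC clause from each elementary copybook item."""
    fields = []
    for line in text.splitlines():
        m = PIC_RE.match(line)
        if m:
            fields.append({"source_field": m.group("name"), "pic": m.group("pic")})
    return fields

@dataclass
class MappingSuggestion:
    source_field: str
    target_column: str
    transform: str      # e.g. "CAST(... AS DECIMAL(9,2)) / 100"
    confidence: float   # model-reported score in [0, 1]

REVIEW_THRESHOLD = 0.80  # assumed cutoff; tuned to the programme's risk appetite

def triage(suggestions: list[MappingSuggestion]):
    """Split draft mappings into auto-accept candidates and a human-review queue."""
    auto = [s for s in suggestions if s.confidence >= REVIEW_THRESHOLD]
    review = [s for s in suggestions if s.confidence < REVIEW_THRESHOLD]
    return auto, review

if __name__ == "__main__":
    print(parse_copybook(COPYBOOK))
    drafts = [
        MappingSuggestion("PREMIUM-AMT", "premium_amount",
                          "CAST(PREMIUM_AMT AS DECIMAL(9,2)) / 100", 0.93),
        MappingSuggestion("EFFECTIVE-DT", "effective_date",
                          "TO_DATE(EFFECTIVE_DT, 'YYYYMMDD')", 0.61),
    ]
    auto, review = triage(drafts)
    print(f"auto-accepted: {len(auto)}, flagged for review: {len(review)}")
```

Keeping the threshold as an explicit constant makes the human-review gate auditable: any mapping the model scores below it is queued for an engineer rather than silently accepted.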
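The test generation and validation agents can be pictured the same way. Below is a hedged sketch of a generated reconciliation suite and its anomaly report; the table names (`STG_POLICY`, `dw.policy`), the cents-to-dollars conversion, the tolerance, and the injected query callables are all assumptions standing in for the generated SQL and the real warehouse clients.

```python
from typing import Callable

TOLERANCE = 0.01  # assumed: allow one cent of drift from rounding during conversion

# Hypothetical generated tests: premiums are staged in cents from the mainframe
# extract and stored in dollars in the warehouse, so the aggregate check applies
# the same unit conversion the mapping does.
GENERATED_TESTS = [
    {
        "name": "premium_amount_aggregate_reconciliation",
        "source_sql": "SELECT SUM(PREMIUM_AMT) / 100.0 FROM STG_POLICY",
        "target_sql": "SELECT SUM(premium_amount) FROM dw.policy",
    },
    {
        "name": "policy_number_referential_count",
        "source_sql": "SELECT COUNT(DISTINCT POL_NUMBER) FROM STG_POLICY",
        "target_sql": "SELECT COUNT(DISTINCT policy_number) FROM dw.policy",
    },
]

def run_suite(run_source: Callable[[str], float],
              run_target: Callable[[str], float]) -> list[dict]:
    """Execute each generated test and emit an anomaly record per mismatch."""
    anomalies = []
    for test in GENERATED_TESTS:
        src = run_source(test["source_sql"])
        tgt = run_target(test["target_sql"])
        if abs(src - tgt) > TOLERANCE:
            anomalies.append({
                "test": test["name"],
                "source_value": src,
                "target_value": tgt,
                "suggested_fix": "inspect the transform for truncation or null handling",
            })
    return anomalies

if __name__ == "__main__":
    # Stand-in query functions simulating a small truncation defect in the target.
    source = lambda sql: 1_204_550.25
    target = lambda sql: 1_204_550.20
    print(run_suite(source, target))
```

Passing separate callables for the source and target connections mirrors the fact that the generated suites execute across both systems, so discrepancies surface in a single reconciliation pass.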
Governance
- All prompts operated on masked identifiers; no PII left the secure environment (a masking sketch follows this list).
- Approval workflows required data engineers to accept or modify generated mappings and test suites before execution.
- Audit bundles captured final lineage, prompts, human decisions, and test evidence for compliance review.
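A minimal sketch of the masking and evidence-capture steps, assuming a salted-hash masking scheme; the salt handling, bundle fields, and decision labels are illustrative rather than the actual compliance format.

```python
import hashlib
import json
from datetime import datetime, timezone

SALT = b"rotate-me-per-environment"  # assumed: salt managed inside the secure environment

def mask(identifier: str) -> str:
    """Replace a direct identifier with a stable salted hash, so prompts carry
    no PII while equal values still mask to the same token across calls."""
    digest = hashlib.sha256(SALT + identifier.encode()).hexdigest()[:12]
    return f"ID_{digest}"

def audit_bundle(prompt: str, suggestion: dict, reviewer: str, decision: str) -> str:
    """Assemble one evidence record of the kind captured for compliance review."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,        # masked before it ever left the environment
        "suggestion": suggestion,
        "reviewer": reviewer,
        "decision": decision,    # e.g. "accepted", "modified", "rejected"
    }, indent=2)

if __name__ == "__main__":
    masked = mask("POL-9927714")
    print(masked)  # e.g. "ID_<12 hex chars>"; the digest depends on the salt
    print(audit_bundle(f"Map {masked} premium fields to dw.policy",
                       {"target_column": "premium_amount"},
                       reviewer="data.engineer@example.com",
                       decision="accepted"))
```

Because the hash is salted and truncated, the token stays consistent across prompts (preserving lineage) while remaining meaningless outside the secure environment.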
Outcomes
Pairing generative agents with rule-driven validation recast ETL onboarding as a guided workflow rather than an ad hoc effort. Data engineers now deliver source-to-target mappings 72% faster, with 85% of regression coverage generated automatically and executed alongside confidence-scored fixes. Production defect leakage dropped by 43%, and sign-off cycles were cut in half, so downstream analytics teams received trusted datasets days earlier. The approach preserved auditable traceability while materially improving delivery velocity and data quality.