Synthetic Data in Model Risk Management

Nitin Aggarwal | 20 August 2025


Synthetic data has emerged as a transformative solution to these challenges, offering new pathways for model development and validation. By leveraging advanced generation techniques such as statistical methods, generative AI algorithms (including Generative Adversarial Networks, or GANs), and computational simulations, financial institutions can create sophisticated artificial datasets that mimic real-world patterns without exposing client information. This synthetic approach enables several critical MRM enhancements: comprehensive stress testing under extreme economic scenarios that lack historical precedent; mitigation of dataset biases through carefully calibrated data generation; and faster model development cycles, since privacy-related constraints are removed. 

Nevertheless, implementing synthetic data introduces its own complex governance challenges. Without proper oversight and validation frameworks, synthetic data may produce misleading results, amplifying model risk rather than mitigating it. The flexibility that makes synthetic data valuable also creates vulnerabilities, including the risk of generating data that fails to capture critical real-world dynamics or introduces new forms of bias. Financial institutions must therefore develop robust governance frameworks that ensure synthetic data applications enhance rather than compromise MRM objectives, balancing innovation with appropriate safeguards to maintain model integrity and regulatory compliance. 

Fig: Synthetic Data Governance in MRM 

Synthetic Data in Model Simulation 

Synthetic data plays a crucial role in enhancing model simulation by addressing key limitations of traditional approaches. Below, we explore its applications in model validation, stress testing, and overcoming data scarcity in greater depth. 

Enhancing Model Validation 

Limitations of Traditional Validation 

Traditional model validation relies heavily on historical datasets, which often fail to capture: 

  • Black swan events: Extreme, rare occurrences (e.g., the 2008 financial crisis, the COVID-19 pandemic) with outsized impacts. 

  • Emerging risks: New threats (e.g., cryptocurrency market crashes, AI-driven fraud) that lack sufficient historical precedents. 

How Synthetic Data Improves Validation 

Scenario Expansion 
  • Synthetic data can simulate rare but plausible scenarios that historical data misses. 

  • Example: To test credit risk models, a bank can generate synthetic loan default patterns under a severe recession. 
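As a minimal sketch of this idea, the snippet below generates synthetic default outcomes under a hypothetical severe-recession scenario. The macro inputs and logistic coefficients are illustrative assumptions, not calibrated values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stressed macro scenario: deep GDP contraction, high unemployment.
gdp_growth = -0.06        # -6% GDP growth
unemployment = 0.12       # 12% unemployment

# Simple logistic link from macro factors to default probability
# (coefficients are illustrative, not calibrated to real data).
def default_probability(gdp, unemp, base=-2.5, b_gdp=-8.0, b_unemp=6.0):
    z = base + b_gdp * gdp + b_unemp * unemp
    return 1.0 / (1.0 + np.exp(-z))

pd_stressed = default_probability(gdp_growth, unemployment)

# Generate a synthetic portfolio of 10,000 loans and sample default outcomes.
n_loans = 10_000
defaults = rng.random(n_loans) < pd_stressed
print(f"Stressed PD: {pd_stressed:.3f}, simulated default rate: {defaults.mean():.3f}")
```

The synthetic default labels can then be fed to the credit risk model under validation to check whether its loss estimates remain coherent in a regime the historical data never contained.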

Adversarial Testing 
  • Firms can assess model resilience by injecting edge cases (e.g., extreme market shocks, cyberattacks). 

  • Example: A trading algorithm can be tested against synthetic flash-crash scenarios to prevent catastrophic failures. 
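A simplified illustration of such edge-case injection: simulate a baseline synthetic price path, then overlay a hypothetical flash-crash shock. All parameters (volatility, crash depth, recovery window) are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline synthetic price path: geometric Brownian motion (illustrative params).
n_steps, dt = 390, 1 / 390          # one trading day at 1-minute resolution
mu, sigma, s0 = 0.0, 0.20, 100.0
returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
prices = s0 * np.exp(np.cumsum(returns))

# Inject a synthetic flash crash: a sudden 8% drop at minute 200,
# partially recovering over the following 30 minutes.
crash_at, depth, recovery = 200, 0.08, 30
shock = np.zeros(n_steps)
shock[crash_at] = -depth
shock[crash_at + 1 : crash_at + 1 + recovery] = depth * 0.9 / recovery
stressed_prices = prices * np.exp(np.cumsum(shock))

drawdown = stressed_prices[crash_at] / stressed_prices[crash_at - 1] - 1
print(f"Injected one-minute return at the crash: {drawdown:.2%}")
```

Replaying the trading algorithm against `stressed_prices` then reveals whether its risk limits and kill switches trigger as intended.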

Bias and Fairness Testing 
  • Synthetic data can intentionally introduce biases to test if models discriminate against protected groups. 

  • Example: Generating synthetic applicant profiles with varying demographics to audit hiring algorithms. 
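One way to sketch such an audit: generate paired synthetic profiles that are identical except for a protected attribute, and measure the resulting score gap. The scoring function below is a hypothetical stand-in for a real model under audit.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical scoring model under audit (stand-in for a real trained model).
def score(income, experience, group_flag):
    # A fair model would ignore group_flag; this one leaks it slightly.
    return 0.4 * income + 0.6 * experience + 0.05 * group_flag

# Paired synthetic profiles: identical attributes, only the protected
# group flag differs. Any score gap is then attributable to that flag.
n = 5_000
income = rng.normal(0, 1, n)
experience = rng.normal(0, 1, n)
scores_a = score(income, experience, group_flag=0)
scores_b = score(income, experience, group_flag=1)

gap = float(np.mean(scores_b - scores_a))
print(f"Mean score gap attributable to the protected attribute: {gap:.3f}")
```

Because the paired profiles are synthetic, the audit can cover demographic combinations that are rare or absent in the real applicant pool.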

Stress Testing and Sensitivity Analysis 

Regulatory Requirements 

Financial institutions must comply with stress testing mandates (e.g., Basel III, CCAR, ECB stress tests). However, real-world data often lacks extreme scenarios. 

How Synthetic Data Enhances Stress Testing 

Tail-Risk Scenario Generation 
  • Synthetic data can model low-probability, high-impact events (e.g., hyperinflation, sovereign defaults, cyber warfare). 

  • Example: A central bank simulates a digital bank run using synthetic transaction data to assess liquidity risks. 
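A toy version of such a bank-run simulation might look like the following; the balance-sheet figures and the escalating withdrawal dynamics are purely illustrative assumptions.

```python
# Hypothetical balance sheet: deposits and a liquid-asset buffer.
deposits = 1_000.0
liquid_buffer = 250.0

# Synthetic bank-run dynamics: the daily withdrawal rate escalates with
# panic (parameters are illustrative assumptions, not calibrated).
base_rate, panic_growth, horizon = 0.02, 1.5, 15
days_survived = horizon
for day in range(horizon):
    rate = min(base_rate * panic_growth**day, 0.5)
    outflow = deposits * rate
    deposits -= outflow
    liquid_buffer -= outflow
    if liquid_buffer < 0:
        days_survived = day + 1
        break

if liquid_buffer < 0:
    print(f"Liquidity buffer exhausted on day {days_survived}")
else:
    print(f"Buffer survives the {horizon}-day run")
```

Sweeping the panic parameters then yields a family of synthetic run scenarios against which liquidity coverage assumptions can be tested.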

Sensitivity Analysis 

  • Firms can measure model robustness by tweaking synthetic inputs (e.g., interest rates, unemployment spikes). 

  • Example: An insurer tests how climate change models react to increases in synthetic hurricane frequency. 
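Sensitivity analysis can be as simple as sweeping one synthetic input while holding the others fixed and measuring the finite-difference response. The loss function below is a toy stand-in for a real risk model.

```python
import numpy as np

# Toy risk model: expected loss as a function of macro inputs
# (functional form and coefficients are illustrative assumptions).
def expected_loss(interest_rate, unemployment):
    return 100 * (0.5 * interest_rate + 1.2 * unemployment) ** 2

# Sweep synthetic interest-rate scenarios, holding unemployment fixed.
rates = np.linspace(0.01, 0.10, 10)
losses = np.array([expected_loss(r, unemployment=0.05) for r in rates])

# Finite-difference sensitivity per unit of rate change.
sensitivity = np.diff(losses) / np.diff(rates)
for r, s in zip(rates[1:], sensitivity):
    print(f"rate={r:.2%}  dLoss/dRate={s:.2f}")
```

A model whose sensitivities jump erratically under such a sweep warrants closer validation before the synthetic scenarios are trusted.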

Key Challenge: Realism vs. Plausibility 

  • Synthetic scenarios must be economically coherent—unrealistic assumptions can mislead risk assessments. 

  • Solution: Combine expert judgment (e.g., economist reviews) with statistical validation (e.g., Monte Carlo simulations). 

Overcoming Data Scarcity in AI/ML Models 

The Data Hunger Problem 

AI/ML models (e.g., fraud detection, credit scoring) require massive datasets, but real-world data is often: 

  • Limited (e.g., few observed fraud cases) 
  • Imbalanced (e.g., 99% non-fraud vs. 1% fraud transactions) 
  • Restricted (e.g., GDPR limits on personal data usage) 

How Synthetic Data Helps 

Data Augmentation 
  • Expands small datasets by generating variations of real samples. 

  • Example: A fraud detection model trained on synthetic transaction anomalies improves detection rates. 

Class Balancing 
  • Generates synthetic samples for underrepresented classes (e.g., rare diseases in medical AI). 

  • Example: A loan approval model uses synthetic data to avoid bias against thin-file borrowers. 
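The class-balancing idea can be sketched with a simplified SMOTE-style interpolation, shown below using random minority pairs rather than k-nearest neighbours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Imbalanced toy dataset: 990 majority vs. 10 minority samples, 4 features.
majority = rng.normal(0.0, 1.0, size=(990, 4))
minority = rng.normal(2.0, 1.0, size=(10, 4))

# SMOTE-style oversampling sketch: interpolate between random pairs of
# minority samples to synthesise new minority-class examples.
def oversample(X, n_new, rng):
    i = rng.integers(0, len(X), n_new)
    j = rng.integers(0, len(X), n_new)
    lam = rng.random((n_new, 1))
    return X[i] + lam * (X[j] - X[i])

synthetic = oversample(minority, n_new=980, rng=rng)
balanced_minority = np.vstack([minority, synthetic])

print(majority.shape, balanced_minority.shape)
```

Production pipelines would typically use a library implementation with neighbour-based interpolation, but the principle is the same: the minority class is expanded synthetically until the classes are balanced.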

Privacy-Preserving Training 
  • Synthetic data mimics real distributions without exposing sensitive information. 

  • Example: A healthcare AI trains on synthetic patient records, avoiding HIPAA violations. 

Key Risk: Overfitting to Synthetic Artefacts 

  • If synthetic data is poorly generated, models may learn artificial patterns that fail in production. 

  • Mitigation Strategies: 

  • Hybrid Training: Mix real and synthetic data. 

  • Robustness Checks: Validate models on real-world holdout datasets. 
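These two mitigations combine naturally: train on a real-plus-synthetic mix, but always validate against a holdout of real data. The "model" below is a trivial mean estimator used only to illustrate the workflow.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy setup: real data is scarce, synthetic data is plentiful.
real = rng.normal(0.0, 1.0, size=(200, 3))
synthetic = rng.normal(0.1, 1.1, size=(2000, 3))   # slightly off-distribution

# Hold out part of the REAL data for validation. Never validate only on
# synthetic samples, or synthetic artefacts go undetected.
holdout, real_train = real[:50], real[50:]

# Hybrid training set: mix real and synthetic data.
train = np.vstack([real_train, synthetic])

# Proxy "model": estimate feature means from the training mix, then
# check the error against the real holdout (illustrative check only).
estimate = train.mean(axis=0)
holdout_error = float(np.abs(estimate - holdout.mean(axis=0)).max())
print(f"Max mean-estimate error on real holdout: {holdout_error:.3f}")
```

If the holdout error grows as the synthetic share increases, that is a direct signal the generator is imprinting artefacts on the model.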

Fig: Synthetic Data Cycle in Model Simulation 

Building a Synthetic Data Blueprint for MRM 

A structured approach ensures synthetic data’s reliability and compliance. 

Data Generation Techniques 

| Method | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Statistical Sampling | Credit risk modelling | Simple, interpretable | Limited complexity |
| Generative AI (GANs, VAEs) | Fraud detection | High realism | Computationally intensive |
| Agent-Based Modelling | Market simulations | Captures interactions | Requires domain expertise |

Validation Framework 

Synthetic data must undergo: 

  • Fidelity Checks: Does it statistically match real data? (KS tests, PCA) 

  • Utility Testing: Does it perform well in model training? 

  • Bias Audits: Are synthetic samples representative? 
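A fidelity check with the two-sample Kolmogorov–Smirnov test, for example, can be run directly with SciPy on each numeric feature. The "good" and "bad" synthetic candidates below are simulated for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)

# Real sample vs. two synthetic candidates (one faithful, one skewed).
real = rng.normal(0.0, 1.0, 5_000)
synthetic_good = rng.normal(0.0, 1.0, 5_000)
synthetic_bad = rng.normal(0.8, 1.5, 5_000)

# Two-sample Kolmogorov-Smirnov test: a small statistic (large p-value)
# indicates the synthetic distribution matches the real one.
for name, sample in [("good", synthetic_good), ("bad", synthetic_bad)]:
    stat, pvalue = ks_2samp(real, sample)
    print(f"{name}: KS statistic={stat:.3f}, p-value={pvalue:.4f}")
```

In a full validation framework the same check would be repeated per feature, with multivariate structure assessed separately (e.g., via PCA or correlation comparisons).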

Governance and Documentation 

  • Metadata tracking: Record data generation parameters 

  • Version control: Track dataset iterations 

  • Approval workflows: MRM team sign-off before deployment 

Compliance Frameworks and Regulatory Challenges 

Alignment with Existing Regulations 

| Regulation | Synthetic Data Consideration |
| --- | --- |
| GDPR | Anonymisation must be irreversible |
| CCPA | Synthetic data is not considered personal if non-inferential |
| Basel III | Synthetic scenarios must be justified |

Model Risk Governance 

  • SR 11-7 (Fed): Requires "effective challenge" of synthetic data assumptions 

  • EU AI Act: High-risk AI models must document synthetic data provenance 

Ethical and Bias Risks 

  • Synthetic data can amplify biases if the source data is skewed. 

Mitigation: 

  • Bias detection tools (e.g., IBM AI Fairness 360) 

  • Diverse scenario generation 

Case Studies: Problem, Solution, and Impact 

Case Study 1: Credit Risk Modelling for Rare Economic Shocks 

Problem: 

A multinational bank struggled to validate its credit risk models for extreme recession scenarios due to insufficient historical data. Traditional backtesting failed to capture potential black swan events, leaving the bank vulnerable to unexpected losses. 

Solution: 

The bank implemented a synthetic data generation framework using: 

  • Monte Carlo simulations to model GDP contractions, unemployment spikes, and housing market crashes 

  • GANs (Generative Adversarial Networks) to create synthetic loan default patterns under stress conditions 

  • Expert validation by economists to ensure economic plausibility 

Impact: 

  • 22% improvement in predicting defaults during simulated recessions 

  • Regulatory approval for internal stress testing models under Basel III 

  • Reduced capital reserves by $120M after demonstrating robust risk coverage 

Case Study 2: Fraud Detection System for a Digital Bank 

Problem: 

A neobank's AI fraud detection system had high false negatives because: 

  • Only 0.1% of transactions were fraudulent (extreme class imbalance)

  • GDPR restrictions limited access to real fraud cases for model retraining

Solution: 

The bank deployed: 

  • Variational Autoencoders (VAEs) to generate synthetic fraudulent transactions 

  • Adversarial training where the generator created increasingly sophisticated attack patterns 

  • Hybrid dataset mixing 5% real fraud data with 95% synthetic samples 

Impact: 

  • 18% reduction in false negatives while maintaining 99.9% precision 

  • 40% faster model refresh cycles (weekly instead of monthly) 

  • Passed EU financial authority audit while remaining GDPR-compliant 

Future Trends and Recommendations 

As synthetic data adoption grows in Model Risk Management (MRM), financial institutions must stay ahead of emerging trends while addressing implementation challenges. Below, we explore key developments and actionable recommendations. 

Regulatory Sandboxes for Synthetic Data Testing 

Current Landscape: 

  • Regulators (e.g., FCA, MAS, ECB) are establishing "sandbox" environments where firms can test synthetic data applications under supervision. 

  • Example: The UK’s FCA allows banks to validate synthetic transaction data for fraud detection before production use. 

Recommendations: 

  • Proactive Engagement – Participate in regulator-led sandboxes to shape future policies. 

  • Documentation Standards – Maintain detailed records of synthetic data methodologies for audit trails. 

  • Risk Mitigation Plans – Prepare fallback procedures if synthetic data fails validation. 

  • Impact: Accelerates approval timelines while ensuring compliance. 

Industry Standards for Synthetic Data Validation 

Emerging Frameworks: 

MITRE’s Synthetic Data Guidelines 
  • Focus on fairness, fidelity, and reproducibility in generated datasets. 

  • Provides checklists for statistical similarity testing (e.g., KL divergence, Wasserstein distance). 

ISO/IEC 5259 (AI Data Quality Standard) 
  • Defines metrics for synthetic data utility, privacy, and bias mitigation. 
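Both distance metrics mentioned above are straightforward to compute with SciPy on a numeric feature. The samples below are simulated, and the histogram binning used for the KL divergence is an implementation choice.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

rng = np.random.default_rng(13)

real = rng.normal(0.0, 1.0, 10_000)
synthetic = rng.normal(0.2, 1.1, 10_000)

# Wasserstein (earth mover's) distance works directly on raw samples.
wd = wasserstein_distance(real, synthetic)

# KL divergence needs discrete distributions: histogram both samples on
# shared bins, with a small epsilon to avoid division by zero.
bins = np.linspace(-5, 5, 51)
p, _ = np.histogram(real, bins=bins, density=True)
q, _ = np.histogram(synthetic, bins=bins, density=True)
eps = 1e-9
kl = entropy(p + eps, q + eps)

print(f"Wasserstein distance: {wd:.3f}, KL divergence: {kl:.3f}")
```

Standardised acceptance thresholds for such metrics are exactly what the emerging frameworks aim to pin down.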

Recommendations: 

  • Adopt Standardised Metrics – Use benchmarks like "synthetic-to-real (STR) accuracy" for model validation.

  • Third-Party Audits – Engage firms like Moody’s Analytics to certify synthetic data quality. 

Hybrid Data Approaches (Real + Synthetic) 

Why Hybrid? 

  • Real data ensures grounding in observed phenomena. 

  • Synthetic data fills gaps for rare/scarce scenarios. 

Implementation Strategies: 

Anchored Generation 
  • Train generative models on real data first, then expand synthetically. 

  • Example: Fraud detection models use 70% real transactions + 30% synthetic anomalies. 

Dynamic Blending 
  • Adjust the real/synthetic mix based on model performance monitoring.
  • Impact: Using hybrid datasets, JPMorgan Chase improved fraud detection accuracy by 19%. 

Explainable AI for Synthetic Data Governance 

Challenge: 

  • Black-box generative models (e.g., GANs) create untraceable synthetic samples. 

Solution: 

  • Explainable AI (XAI) techniques: 

  • SHAP values to identify which real data points influenced synthetic samples. 

  • Counterfactual analysis to audit "what-if" scenarios in generated data. 

Conclusion 

Synthetic data presents transformative opportunities for MRM—enabling better model testing, compliance, and innovation. However, strong governance, validation frameworks, and regulatory alignment are critical. Financial institutions must adopt a structured blueprint to harness synthetic data’s potential while mitigating risks. 

