auditissimo: AI-Assisted Internal Audit of IRBA Rating Procedures
auditissimo is a modular AI prototype that accompanies the entire audit process of IRBA rating systems step by step – from the regulatory basis to finding synthesis.
Working Paper
Authors: Prof. Dr. Dirk Schieborn, Prof. Dr. Volker Reichenberger (Reutlingen University), Tim S. Körwers (msg for banking ag) · Working Paper, March 2026
Partner: msg for banking ag
Ongoing project — development and evaluation are not yet complete. The architectural decisions, results and conclusions described here reflect the state as of March 2026 and are continuously updated as research progresses.
How well can Generative AI support Internal Audit in reviewing complex rating models? In the research project auditissimo, together with msg for banking ag, we investigated exactly this question — and developed a modular AI prototype that accompanies the entire audit process of IRBA rating systems step by step.
Background: A Demanding Audit Task
Credit institutions using the Internal Ratings-Based Approach (IRBA) estimate their regulatory capital parameters — Probability of Default (PD), Loss Given Default (LGD) and Exposure at Default (EAD) — using proprietary statistical models. Article 191 of the Capital Requirements Regulation (CRR) requires Internal Audit to fully review these rating procedures at least once a year.
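For orientation (not part of the prototype itself): the three parameters combine in the standard expected-loss identity EL = PD × LGD × EAD, a minimal sketch of which is:

```python
def expected_loss(pd: float, lgd: float, ead: float) -> float:
    """Expected loss of a single exposure: EL = PD x LGD x EAD."""
    return pd * lgd * ead

# Example: 1.2% default probability, 45% loss severity, EUR 1,000,000 exposure
el = expected_loss(0.012, 0.45, 1_000_000)  # roughly EUR 5,400
```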
That sounds manageable — but in practice it is anything but. A typical validation report for a mid-sized institution references more than forty individual regulatory requirements from CRR, EBA guidelines, EBA-RTS and the ECB Guide to Internal Models. Each requirement must be supported by specific quantitative evidence: Gini coefficient ≥ 0.40, PSI thresholds, Hosmer-Lemeshow tests, migration matrices — the list is long. Such expertise is rarely fully available within a single audit team. This is precisely where auditissimo comes in.
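Two of the metrics named above can be sketched in a few lines, assuming the standard identity Gini = 2·AUC − 1 and the usual Population Stability Index definition (bucket shares and thresholds here are illustrative, not taken from any institution's report):

```python
import math

def gini_from_auc(auc: float) -> float:
    """Gini coefficient (accuracy ratio) from the ROC AUC: Gini = 2*AUC - 1."""
    return 2.0 * auc - 1.0

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over matching bucket population shares."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# An AUC of 0.70 corresponds exactly to the Gini threshold of 0.40
gini = gini_from_auc(0.70)

# A mild population shift across four score buckets stays well below PSI 0.10
shift = psi([0.25, 0.25, 0.25, 0.25], [0.30, 0.25, 0.25, 0.20])
```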
The auditissimo Architecture: Six Modules, One Integrated Audit Process
auditissimo is not a generic chat tool. The system maps the actual logic of an IRBA audit — step by step, module by module.
M1 — Regulatory Basis: A Single Source of Truth
Regulatory requirements for IRBA models are scattered across many documents — CRR, EBA-RTS, EBA guidelines, ECB guides, internal policies. Module 1 reads all relevant source documents and breaks them down into atomic, testable individual requirements. Each requirement receives a unique ID, a short description, the exact wording and the precise regulatory source. In the current implementation, the system generates an average of 31 requirements per regulatory section — with a precision of over 90% confirmed by human reviewers.
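One way to represent such an atomic requirement is a small record with exactly the four fields named above; the field names, example ID and example wording below are illustrative, not the prototype's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    """One atomic, testable regulatory requirement (illustrative schema)."""
    req_id: str   # unique ID, e.g. "CRR-174-a-01"
    summary: str  # short description
    wording: str  # exact regulatory wording
    source: str   # precise regulatory source

req = Requirement(
    req_id="CRR-174-a-01",
    summary="Statistical model must demonstrate good predictive power",
    wording="The model shall have good predictive power (illustrative excerpt)",
    source="CRR Art. 174(a)",
)
```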
M2 — Risk Assessment
Not all requirements carry equal weight. Module 2 assesses each requirement using model metadata (asset class, vintage, regulatory history) and assigns a risk score. Audit resources are thereby directed specifically to the most critical areas — fully in line with the ECB’s risk assessment framework.
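A risk score of this kind can be sketched as a simple function of the metadata dimensions named above; the weights and caps below are invented for illustration and are not the prototype's actual scoring logic:

```python
def risk_score(asset_class: str, model_age_years: int, prior_findings: int) -> float:
    """Toy risk score in [0, 1] from model metadata (all weights illustrative)."""
    base = {"retail": 0.3, "corporate": 0.5}.get(asset_class, 0.4)
    age_penalty = min(model_age_years, 10) * 0.03    # older vintages score higher
    history_penalty = min(prior_findings, 5) * 0.06  # regulatory history
    return min(base + age_penalty + history_penalty, 1.0)

# An older corporate model with prior findings outranks a fresh retail model
score = risk_score("corporate", 8, 2)
```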
M3 — Working Paper Initialisation
Module 3 generates pre-filled audit working papers for each selected requirement. The administrative burden of audit preparation decreases noticeably — the audit team can focus on substantive assessment rather than document creation.
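At its simplest, a pre-filled working paper is a template populated from the M1 requirement record and the M2 risk score; the template and field names below are invented for illustration:

```python
# Illustrative working-paper template; real systems use the institution's format
WORKING_PAPER_TEMPLATE = """\
Working Paper {req_id}
Requirement: {summary}
Source: {source}
Risk score: {risk_score:.2f}
Assessment: <to be completed by the auditor>
"""

paper = WORKING_PAPER_TEMPLATE.format(
    req_id="CRR-174-a-01",
    summary="Statistical model must demonstrate good predictive power",
    source="CRR Art. 174(a)",
    risk_score=0.45,
)
```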
M4 — Gap Analysis: The Core of the Audit
Module 4 compares the validation concept (what must be audited?) with the validation report (what was actually audited?). For each requirement, a three-tier fulfilment assessment is generated:
- Fulfilled (80–100): Clear, complete documentary evidence present
- Partially fulfilled (30–79): Partial evidence, but gaps or ambiguities
- Not fulfilled (0–29): No or insufficient evidence
Crucially, each AI assessment is linked to supporting passages from the audit document, enabling the auditor to trace the AI’s reasoning chain and override it if necessary. The LLM call temperature is deliberately set to T = 0.1 — gap analysis is an evidence retrieval task, not a creative one.
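The three tiers map directly onto the 0–100 scale described above; the cut-offs follow the text, while the function itself is an illustrative sketch:

```python
def fulfilment_tier(score: int) -> str:
    """Map a 0-100 evidence score to the three-tier fulfilment assessment."""
    if score >= 80:
        return "fulfilled"
    if score >= 30:
        return "partially fulfilled"
    return "not fulfilled"
```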
M5 — Deep Dive
Requirements rated “partially fulfilled” or “not fulfilled” in M4 are examined in greater depth in Module 5. The module drills into specific model outputs, datasets and calculations — for example: Is the reported Gini coefficient of 0.46 for the retail PD model consistent with the methodology specified in the validation concept?
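A consistency check of this kind can be sketched as follows, using the standard identity Gini = 2·AUC − 1; the function name and tolerance are illustrative, not the module's actual implementation:

```python
def gini_consistent(reported_gini: float, auc_from_data: float,
                    tol: float = 0.01) -> bool:
    """Check a reported Gini against one recomputed from model outputs
    via Gini = 2*AUC - 1, within a small tolerance."""
    return abs(reported_gini - (2.0 * auc_from_data - 1.0)) <= tol

# A reported Gini of 0.46 is consistent with a recomputed AUC of 0.73
ok = gini_consistent(0.46, 0.73)
```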
M6 — Report and Finding Synthesis
Module 6 aggregates the findings from M4 and M5 into structured audit findings in the format of the institution’s audit management system. All outputs are explicitly declared as first drafts — the final assessment always rests with the responsible auditor.
Human-in-the-Loop: Where AI Must Have Limits
auditissimo is consistently designed to support auditor judgement — not to replace it. Four task categories require human decision-making authority:
- H1 — Materiality assessment: Whether an identified gap is material as an audit finding requires professional judgement.
- H2 — Regulatory interpretation: For ambiguous regulatory texts, the AI may deliver a plausible but incorrect interpretation. The auditor’s institutional contextual knowledge is indispensable.
- H3 — Model methodological assessment: The appropriateness of specific modelling decisions lies beyond the current capabilities of general-purpose LLMs.
- H4 — Audit opinion: The overall assessment of a rating procedure’s regulatory adequacy is a legal and professional decision that remains with the qualified auditor.
Empirical Validation: The Steinbeis Bank as Test Environment
A central methodological problem in evaluating GenAI audit tools: there are no publicly available, annotated datasets of IRBA compliance assessments — such data is institutionally confidential. auditissimo solves this problem through a purpose-built synthetic test environment: the Steinbeis Bank.
The Steinbeis Bank is a fully parametrisable, synthetic credit institution with four production-grade IRBA models (Corporate PD/LGD, Retail PD/LGD), trained on 10 years of synthetic data (2015–2024) with realistic macroeconomic cycles.
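A toy illustration of how an annual default-rate path with a macroeconomic cycle over 2015–2024 might be parametrised (the actual Steinbeis Bank generator is not published; every parameter below is invented):

```python
import math
import random

def synthetic_default_rates(years=range(2015, 2025), base=0.02,
                            amplitude=0.01, seed=42):
    """Toy annual default-rate path: a sinusoidal macro cycle plus noise."""
    rng = random.Random(seed)
    return {
        year: max(base
                  + amplitude * math.sin(2 * math.pi * (year - 2015) / 8)
                  + rng.gauss(0, 0.002),
                  0.001)  # floor to keep rates strictly positive
        for year in years
    }

rates = synthetic_default_rates()  # ten years, 2015-2024
```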
Initial Results: What the AI Can — and Cannot — Do
The pilot tested 24 requirements from the retail PD validation context and four ablation scenarios. The pattern is clear: the AI detects complete omissions almost without error (F1 = 0.92). Performance decreases systematically as non-compliance becomes more subtle. The hardest cases to detect are those where the report narratively describes a methodology without providing quantitative results — exactly the cases where experienced auditors would also invest the most attention.
Particularly revealing: a generic prompt achieves only an F1 score of 0.59. Through iterative refinement — role specification as a bank auditor, structured output schema and embedding of concrete regulatory thresholds (AUC ≥ 0.70, PSI < 0.10) — the score rises to 0.78. Domain knowledge must be built into the system architecture, not merely hinted at in the prompt.
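For reference, the F1 score used in these comparisons is the harmonic mean of precision and recall over detected non-compliance cases:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# e.g. 9 gaps found correctly, 1 false alarm, 1 missed gap -> F1 = 0.90
score = f1(9, 1, 1)
```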
What auditissimo Shows
- Process proximity is decisive: Generic document chat tools do not deliver actionable results for IRBA audit work. The AI must map the sequential logic of the audit process itself.
- Atomic auditability is mandatory: Every AI statement must be traceable to a specific supporting passage — for transparency, for auditor override, and for the regulatory audit trail.
- Governed human primacy: Audit conclusions must reflect the professional judgement of a qualified auditor. AI support is valuable; AI substitution is impermissible.
For a mid-sized institution with ten IRBA models, at roughly 30 to 40 applicable requirements per model, this means 300 to 400 individual requirement reviews per annual cycle. auditissimo's streaming gap analysis can complete this review in hours rather than weeks, with the auditor focused on material findings rather than mechanical document matching.
auditissimo is available online at auditissimo.com.