Joe Breeden (Mar 25)
BTRM Faculty Opinion
Model Risk Management for Large Language Models in Financial Services – Joe Breeden, BTRM Faculty
Generative AI, particularly in the form of large language models (LLMs), is being tested in financial services for agent support and customer communications. However, the adaptability and complexity of these models introduce significant risks that require novel model risk management (MRM) strategies. Moreover, many LLM implementations are being built by teams that have not previously managed models and are unaware of MRM requirements. Effectively managing these risks within the existing MRM principles requires rethinking interpretations, metrics, and priorities.
Financial institutions employ LLMs primarily in two forms: retrieval-augmented generation (RAG) models and general LLMs. RAG models supplement generative AI capabilities with document retrieval from a curated knowledgebase, enabling better controls and transparency. These models allow validation techniques similar to those used in traditional financial modeling. Although RAG models have advantages in risk management, they are not applicable in all contexts, require suitable knowledgebases to draw from, and demand significant development effort. For these reasons, most teams experiment first with deploying general LLMs. Although general LLMs are more difficult to risk-manage, advances are being made.
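To make the control point concrete, the sketch below shows the retrieval-and-grounding step of a RAG pipeline in schematic form. The word-overlap scoring and the Document, retrieve, and build_prompt names are illustrative stand-ins (production systems typically use embedding-based search); the essential point is that the final prompt cites specific knowledgebase documents, which is what makes RAG outputs traceable and easier to validate.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def word_overlap(a: str, b: str) -> float:
    # Toy relevance score: Jaccard overlap of word sets.
    # A real system would use embedding similarity instead.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve(query: str, knowledgebase: list[Document], k: int = 3) -> list[Document]:
    # Rank knowledgebase documents by relevance to the query; keep the top k.
    ranked = sorted(knowledgebase, key=lambda d: word_overlap(query, d.text), reverse=True)
    return ranked[:k]

def build_prompt(query: str, sources: list[Document]) -> str:
    # Ground the LLM's answer in the retrieved documents. Requiring
    # citations of doc_ids is what gives validators an audit trail.
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in sources)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
```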
General LLMs generate text based on probabilistic predictions without a singular focus on document retrieval. This flexibility makes them powerful but significantly more difficult to validate, as their responses can vary with changing contexts. The rest of this discussion will focus on general LLMs.
The deployment of LLMs introduces distinct categories of risk:
Societal Risks: Regulatory and ethical concerns, including biased model outputs and misinformation. Broader societal risks include the potential for economic, ecological, and cultural damage.
Misuse Risks: The potential for fraudulent applications, identity theft, and unauthorized financial transactions.
Control Risks: The difficulty in predicting and constraining AI behavior.
For the financial services industry, the current areas of greatest focus are misuse risks and societal risks that fall under model risk management guidelines.
Model Risk Management Principles
Model risk management principles provide a foundation for mitigating AI-related risks. The core components include:
Model Development: Ensuring appropriate training data, fine-tuning strategies, and transparency in model behavior.
Model Validation and Independent Review: Conducting robustness tests, evaluating bias, and verifying regulatory compliance.
Governance and Oversight: Establishing clear responsibilities and risk monitoring protocols at the organizational level.
Ongoing Monitoring: Implementing real-time performance tracking to identify emerging risks, biases, and compliance deviations.
These principles serve as a framework for all models, but the emphasis must shift. Historically, MRM teams have focused almost exclusively on validation, with only limited monitoring. Given the dynamic and context-sensitive nature of LLMs, model validation will still be performed, but it cannot assure proper functioning even at initial deployment. Model monitoring will become much more important.
MRM for LLMs
The standard answer for LLM oversight is human-in-the-loop. However, human oversight alone is inadequate for monitoring LLMs in high-volume financial applications. Research on human error detection in repetitive, low-event-rate tasks suggests that manual oversight degrades over time due to vigilance decrement, routinely fails to catch unexpected errors, and falls into over-reliance on automation.
The most effective solution is to use an LLM to monitor LLM communications. Previous AI research found that such LLM-as-judge approaches exhibit shared biases with the frontline AI: the overseeing LLM is blind to the biases of the frontline LLM because both are trained on similar databases of human behavior. This appears to have held in testing where researchers let the LLM overseer interpret for itself what regulatory compliance or ethical bias means. Instead, when business teams develop lists of assertions (statements that must be true for an LLM-generated message to be compliant), the LLM monitor is effective at filtering out the vast majority of compliant messages. Questionable messages can then be displayed in a dashboard for review by the human oversight team. The LLM thus provides AI augmentation to human-in-the-loop oversight, addressing both the LLM oversight bias and the psychological failings of HITL.
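A minimal sketch of that assertion-based filter is shown below. The ask_judge callable, the assertion wording, and the YES/NO reply protocol are assumptions for illustration, not any specific product's API; the essential design is that business teams, not the judge LLM, define what must be true of a compliant message.

```python
from dataclasses import dataclass, field
from typing import Callable

# ask_judge is any function that sends a prompt to the overseeing LLM
# and returns its text reply; wiring it to a model is out of scope here.
JudgeFn = Callable[[str], str]

@dataclass
class ReviewResult:
    message: str
    failed_assertions: list[str] = field(default_factory=list)

    @property
    def needs_human_review(self) -> bool:
        # Any failed assertion routes the message to the dashboard.
        return bool(self.failed_assertions)

def review_message(message: str, assertions: list[str], ask_judge: JudgeFn) -> ReviewResult:
    """Check a frontline message against business-defined assertions.

    Each assertion is a concrete statement that must be true for the
    message to be compliant, so the judge LLM never has to interpret
    on its own what 'compliance' or 'bias' means.
    """
    result = ReviewResult(message=message)
    for assertion in assertions:
        prompt = (
            "Message to a customer:\n"
            f"{message}\n\n"
            f"Assertion: {assertion}\n"
            "Answer YES if the assertion is true of the message, otherwise NO."
        )
        verdict = ask_judge(prompt).strip().upper()
        if not verdict.startswith("YES"):
            result.failed_assertions.append(assertion)
    return result
```

Messages that fail any assertion land on the human oversight dashboard; the rest are filtered out, keeping the review queue small enough that vigilance decrement is no longer the binding constraint.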
Establishing Benchmarks
One of the more significant challenges for AI adoption is an implicit expectation of machine perfection. AI call center failures make headlines, whereas human call center failures are an assumed part of modern life. Interestingly, an LLM reviewer can be applied equally to frontline AI and human agents, feeding the same alert dashboards and collecting directly comparable performance metrics on accuracy, regulatory compliance, and ethical compliance.
The ability to quantify AI performance relative to human performance may be the greatest current gap in AI deployments. The appropriate counterpoint to any AI chatbot failure is to show its failure rate relative to human agents. We should strive for perfection in our AI agents, but simply providing better, or even comparable, service at scale should be considered a success.
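Because the same monitor reviews both channels, the benchmark itself is simple arithmetic. The sketch below, with hypothetical field names, shows how directly comparable per-channel failure rates fall out of a shared alert stream.

```python
from collections import Counter

def failure_rate(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Directly comparable failure rates per channel.

    Each review is (channel, flagged), where channel is 'ai' or 'human'
    and flagged means the monitor raised at least one assertion failure.
    """
    totals, flags = Counter(), Counter()
    for channel, flagged in reviews:
        totals[channel] += 1
        flags[channel] += int(flagged)
    return {ch: flags[ch] / totals[ch] for ch in totals}

# Example: the AI agent is judged by the same yardstick as human agents.
sample = [("ai", False), ("ai", True), ("human", False), ("human", True), ("human", True)]
print(failure_rate(sample))  # {'ai': 0.5, 'human': 0.6666666666666666}
```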
Conclusion
The deployment of LLMs in financial services requires a fundamental shift in how model risk management principles are interpreted and emphasized. General LLMs in particular demand continuous monitoring and AI-assisted compliance oversight. Traditional MRM frameworks must be expanded to incorporate real-time risk assessment methodologies, ensuring the responsible and effective use of generative AI in high-stakes financial applications. By integrating AI-augmented monitoring systems, financial institutions can mitigate risks while fully leveraging the potential of generative AI in finance.