
When AI Doesn’t Know What It Doesn’t Know: A Lesson in Financial Domain Knowledge

Posted on April 23, 2026 by Editor

This guest opinion piece is the fourth in a series from Björn Fastabend, bringing us a regulator’s perspective alongside a wealth of experience in digital reporting. Björn is head of the XBRL collection and processing unit at BaFin, Germany’s Federal Financial Supervisory Authority, where he supervises all related activities and leads the implementation of strategic initiatives. He is also a member of XBRL International’s Board of Directors, and previously chaired our Best Practices Board.

The views and opinions expressed in this publication are those of the author. They do not purport to reflect the opinions or views of BaFin.

Have you ever asked an AI the same question five times? And did you get the same answer each time? Regulators need answers they can rely on, but in my experiments with AI that has proved trickier than expected.

In my previous articles in this series, I proposed a way for financial regulators to incorporate AI into existing processes while maintaining regulatory rigor. The idea is simple: use AI for exploration and rely on Business Intelligence (BI) tools for verification. That way you get the best of both worlds: AI finds the outliers and deviations in large datasets; BI confirms whether they’re real. I still believe in this approach, but I’ve learned something important about its limits: AI models lack sufficient domain expertise.

The project

I built – well, vibe coded to be exact – a proof-of-concept tool based on my proposed theory. Over a couple of weekends and the odd late night, I put together a tool that pulled publicly available EDGAR data from the US Securities and Exchange Commission (SEC) for the top 100 companies, covering the last three years. Just like a regulator would, I stored this data alongside the taxonomy information in a database and “chatted” with the data using an LLM.
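The article doesn’t include the POC code itself, but for readers curious what the data-pulling step could look like: the SEC exposes all reported XBRL facts per filer through its public “company facts” JSON API. The sketch below shows one minimal way to fetch that data; the endpoint and the SEC’s requirement for a descriptive User-Agent header are real, while the function names and the sample CIK usage are purely illustrative.

```python
import json
import urllib.request

# The SEC's XBRL "company facts" API returns every reported concept for a
# filer as JSON, keyed by taxonomy (e.g. "us-gaap") and concept name.
# CIKs must be zero-padded to 10 digits in the URL.
FACTS_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:0>10}.json"

def companyfacts_url(cik):
    """Build the companyfacts URL for a CIK, zero-padded to 10 digits."""
    return FACTS_URL.format(cik=int(cik))

def fetch_companyfacts(cik, user_agent):
    """Download all XBRL facts for one filer. The SEC requires a
    descriptive User-Agent header on automated requests."""
    req = urllib.request.Request(
        companyfacts_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Apple Inc. has CIK 320193; use your own contact info as User-Agent.
    facts = fetch_companyfacts(320193, "my-poc contact@example.com")
    # Count the us-gaap concepts whose names mention revenue -- there are many.
    revenue_like = [c for c in facts["facts"]["us-gaap"] if "Revenue" in c]
    print(len(revenue_like), revenue_like[:5])
```

A real pipeline would loop over a list of CIKs and load the results, together with the taxonomy metadata, into a database for the LLM to query.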

At first, life seemed good. The LLM was giving me interesting answers, and I was happy to have pulled off a working POC – until I decided to test the results a bit more. Mind you, I have no background in finance and could not independently verify if the results returned by the LLM were correct or not. So I took a different approach, and asked the LLM the following straightforward question five times in a row:

“Are there companies where revenue increased but net income decreased in 2024?”

I was surprised to get five different answers, despite the source data being the same. I suspected the LLM of hallucinating, but it was innocent.

The semantic paradox

By checking the values returned in the different answers, I was able to verify that they corresponded to disclosures available in the database. But if this was no hallucination, what was it? I chatted with Claude to get to the root of this issue and was stunned to realize that there were 80 different revenue-related concepts in the SEC taxonomy! But before you blame the SEC, I’m told that this is a necessary level of complexity from a financial perspective. The taxonomy represents the real world… and the real world is often a bit messy.

So, the problem was not that the LLM was hallucinating. With only my rather ambiguous prompt “…where revenue increased…” to go on, it just didn’t know which revenue-related concept to pick for the analysis. And so, it did what any good and dutiful LLM would do: it guessed. That is how I got my five different results from one identical question.

It’s a semantic paradox. While XBRL is great at providing semantic richness to the data in the database, without a very specific prompt LLMs are likely to get lost and choose concepts at random.

The fix

One solution would be to instruct users to be absolutely exact when prompting the LLM. But that’s simply not practicable. It would require painfully complex prompts and exhaustive knowledge from the questioner, and prevent the less technical of us from using AI to ask exploratory questions.

The only other option was to provide the LLM with the domain knowledge of financial reporting and of the specifics of US disclosure requirements that it clearly was lacking. I put together what I called a Financial Domain Intelligence Layer, or FDIL. This was a layer of code designed to identify the concept needed for the query, based on the context given in the prompt. It basically goes through the list of available concepts and picks the one that it deems best suited for this question. It’s a fairly rough-and-ready approach, but it gave me a way to inject some domain knowledge into the process and help the LLM to better understand and apply the query to deliver meaningful results.
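The article doesn’t publish the FDIL code, but the core idea – deterministically mapping a vague question to one taxonomy concept – can be sketched in a few lines. Everything below is my own illustration, not the author’s implementation: the candidate list is a tiny slice of the real us-gaap revenue concepts, and the word-overlap scoring heuristic is invented for the example. The point it demonstrates is that ties are broken deterministically, so the same question always yields the same concept.

```python
import re

# A small sample of the many revenue-related us-gaap concepts;
# the real taxonomy contains dozens more.
CANDIDATES = [
    "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "RevenueFromContractWithCustomerIncludingAssessedTax",
    "SalesRevenueNet",
    "SalesRevenueGoodsNet",
]

def _words(name):
    """Split a CamelCase concept name into lowercase words."""
    return [w.lower() for w in re.findall(r"[A-Z][a-z]*", name)]

def pick_concept(question, candidates=CANDIDATES):
    """Deterministically map a vague question to one concept.
    Score = number of the concept's words found in the question
    (with a crude plural strip), minus a small penalty for extra,
    unmatched words -- so shorter, more general names win unless the
    question is specific. Candidates are sorted first, so ties always
    resolve the same way."""
    q = question.lower()
    def score(name):
        ws = _words(name)
        hits = sum(1 for w in ws if w.rstrip("s") in q)
        return hits - 0.1 * (len(ws) - hits)
    return max(sorted(candidates), key=score)

print(pick_concept("Are there companies where revenue increased in 2024?"))
```

With the vague prompt from the article, this toy scorer settles on the general `Revenues` concept every time, instead of guessing a different one per run – which is the behaviour the FDIL is there to provide.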

Did it work? Did it return the correct results? That will unfortunately have to remain a cliffhanger! I am a tech guy rather than a financial specialist, so I would prefer not to promise that the answers were correct.

What I can say is that when I sent the same prompt multiple times, the results were the same. So, while I can’t be certain that the LLM is correctly interpreting the prompt, it does appear that the little detour through the FDIL is likely helping the LLM to provide the reliable, deterministic results that regulators need.

The takeaway

So, what’s the moral of this story? Even without an accuracy guarantee, I would declare this proof of concept a success. I have learned a lot in the process, and I still believe that the hybrid approach – analyzing XBRL data using AI for explorative questioning, and verifying the findings using BI tools – is a viable way for regulators to leverage the power of AI, starting today, without jeopardizing the regulatory rigor expected of them.

At the same time, without specialist domain knowledge even a powerful system like an AI model will not deliver the results we expect. That’s not a flaw, and it does not diminish the power and possibilities of AI – it’s simply reality.

AI models analyzing corporate disclosures need both structured XBRL data and accompanying taxonomies to make sense of the rich metadata available in digital reports. It’s clear that they also need the respective domain knowledge to effectively interpret all that data and achieve high-value, trustable insights.

The gap between promising experiments and production-grade AI solutions for regulatory data remains real – and bridging it will require both specialist domain knowledge and continued collaboration across our field. But that is no reason to wait. Frankly, we don’t have time to wait while our supervised industries are embracing AI solutions at great speed.

Experiments like this one are how we get the ball rolling and start learning. I encourage every regulator reading this to start building their own experiments and gaining their own experience, so as to eventually come out ahead of the game. Get moving, make mistakes, learn, adapt, and keep moving forward!
