2 min read | Saved February 14, 2026
Do you care about this?
This article introduces FinCDM, a framework for assessing financial large language models (LLMs) by evaluating their knowledge and skills rather than relying on a single score. It highlights the creation of a new dataset, CPA-KQA, based on CPA exam questions, which allows for a more nuanced analysis of LLM capabilities in financial contexts. The framework aims to uncover knowledge gaps and enhance model development for real-world applications.
If you do, here's more
Large Language Models (LLMs) show promise in finance, but current evaluation methods fall short. Traditional benchmarks rely on a single score that fails to capture the complexity of a model's understanding. They also tend to focus on a limited range of financial concepts, missing critical real-world skills. To tackle these issues, the authors propose FinCDM, a cognitive diagnosis framework designed specifically for financial LLMs. This framework assesses models based on their knowledge and skills, tracking their performance across tasks tagged with specific skills instead of reducing it to a single number.
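To make the idea of skill-tagged evaluation concrete, here is a minimal sketch of aggregating accuracy per skill rather than into one overall score. The skill names and records below are invented for illustration; this is not the actual FinCDM implementation or the CPA-KQA data.

```python
from collections import defaultdict

# Hypothetical skill-tagged evaluation records: each question is
# annotated with the skills it exercises, plus whether the model
# answered it correctly. Skill names here are invented examples.
results = [
    {"skills": ["tax_reasoning", "arithmetic"], "correct": True},
    {"skills": ["tax_reasoning"], "correct": False},
    {"skills": ["regulatory_reasoning"], "correct": True},
    {"skills": ["arithmetic"], "correct": True},
]

def skill_profile(records):
    """Return per-skill accuracy from skill-tagged evaluation records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        for skill in r["skills"]:
            totals[skill] += 1
            hits[skill] += r["correct"]
    return {s: hits[s] / totals[s] for s in totals}

profile = skill_profile(results)
```

A profile like this exposes uneven mastery (for example, weak tax reasoning alongside strong arithmetic) that a single aggregate score would hide, which is the core intuition behind a cognitive-diagnosis evaluation.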
The authors also introduce CPA-KQA, a new dataset derived from the Certified Public Accountant (CPA) exam. This dataset is comprehensive, covering a wide array of accounting and financial skills, and is rigorously annotated by experts to ensure high-quality data. Their experiments, which evaluated 30 different LLMs, revealed significant knowledge gaps and highlighted areas like tax and regulatory reasoning that traditional benchmarks often overlook. FinCDM not only provides a more nuanced evaluation but also identifies behavioral patterns among models, paving the way for more effective and tailored development of financial LLMs. All datasets and evaluation scripts will be made publicly available to encourage further research in this area.