Reasoning models, which rely on extended chain-of-thought (CoT) reasoning, outperform non-reasoning models both at problem-solving and at expressing confidence that matches their actual accuracy. This study benchmarks six reasoning models across several datasets and finds that their slow-thinking behaviors lead to better confidence calibration. The findings also indicate that non-reasoning models can improve their calibration when guided toward slow-thinking strategies.
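As a point of reference for how such calibration comparisons are typically scored, below is a minimal sketch of expected calibration error (ECE) over verbalized confidences. The equal-width binning, bin count, and input format are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: occupancy-weighted gap between stated confidence and accuracy per bin.

    confidences: model-stated confidence in [0, 1] for each answer
    correct:     1 if the answer was correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# Toy example: an overconfident model (high stated confidence, mixed accuracy)
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 1, 0]))
```

A lower ECE means the model's stated confidence tracks its empirical accuracy more closely, which is the sense in which slow-thinking models are reported to be better calibrated.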
The study examines how instruction tuning affects the confidence calibration of large language models (LLMs), showing that calibration degrades significantly after tuning. It introduces label smoothing as a promising way to mitigate overconfidence during supervised fine-tuning, and it also addresses the memory consumption challenges involved in computing the cross-entropy loss.
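For concreteness, here is a minimal sketch of label-smoothed cross-entropy as it might be applied to next-token prediction during SFT. The smoothing value, masking convention, and tensor shapes are assumptions, and this uses PyTorch's built-in `label_smoothing` option rather than the paper's own implementation; the comment about memory reflects the general observation that the smoothed loss needs log-probabilities over the whole vocabulary rather than only the gold token.

```python
import torch
import torch.nn.functional as F

def smoothed_sft_loss(logits, labels, epsilon=0.1, ignore_index=-100):
    """Label-smoothed cross-entropy for next-token prediction.

    logits: (batch, seq_len, vocab) raw model outputs
    labels: (batch, seq_len) target token ids; ignore_index marks masked positions
    The smoothed target puts 1 - epsilon on the gold token and spreads
    epsilon uniformly over the remaining vocabulary entries.
    """
    vocab = logits.size(-1)
    logits = logits.view(-1, vocab)
    labels = labels.view(-1)
    # Unlike the plain loss, the smoothed term depends on log-probs over the
    # *entire* vocabulary, which is what inflates memory for large vocabularies
    # and motivates chunked or fused cross-entropy implementations.
    return F.cross_entropy(
        logits, labels,
        label_smoothing=epsilon,
        ignore_index=ignore_index,
    )

# Usage with random tensors (hypothetical shapes and vocabulary size)
logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
loss = smoothed_sft_loss(logits, labels)
```

By lowering the probability mass the model is trained to place on the gold token, label smoothing discourages the saturated, overconfident output distributions that standard SFT tends to produce.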