Click any tag below to further narrow down your results
Links
Anthropic has published a constitution for its AI model, Claude, detailing the values and behaviors it should embody. This document serves as a guiding framework for Claude's training and decision-making processes, focusing on safety, ethics, and helpfulness.
This article discusses a method called "confessions" that trains AI models to admit when they misbehave or break instructions. By providing a separate honesty-focused output, this approach aims to enhance transparency and trust in AI systems. Initial results show that it effectively improves the detection of model misbehavior.