A second-generation safeguard layer that filters both inputs and outputs without significant capability loss.
Anthropic published Constitutional Classifiers v2, a paired input/output filter that blocks 95% of universal jailbreaks in red-team testing while raising helpful-query refusal by less than 0.4 percentage points. The system is now active by default on Claude Opus 4.7 and Sonnet 4.7. Anthropic released the training methodology and a public test harness for researchers.