Assessing Chain-of-Thought Monitorability in AI: A Critical View on Internal Reasoning Control
OpenAI introduced a framework to evaluate chain-of-thought (CoT) monitorability: whether a monitor can predict properties of an AI system’s behavior by analyzing observable signals such as the model’s chain-of-thought, rather than relying only on final answers and tool actions.
The motivation is practical. As reasoning models become better at long-horizon tasks, tool use, and strategic problem solving, it becomes harder to supervise them with direct human review alone. OpenAI’s work focuses on how well we can measure monitorability across tasks and settings, and how that monitorability changes with more reasoning at inference time, reinforcement learning (RL), and pretraining scale.
- OpenAI defines monitorability as the ability of a monitor to predict properties of interest about an agent’s behavior.
- OpenAI introduces 13 evaluations across 24 environments, grouped into three archetypes: intervention, process, and outcome-property.
- OpenAI reports that in almost every setting studied, monitoring CoT is substantially more effective than monitoring actions and final outputs alone, and that models that “think” longer tend to be more monitorable.
What OpenAI means by “monitorability”
OpenAI defines monitorability as the ability of a monitor to predict properties of interest about an agent’s behavior. In OpenAI’s examples, those properties often correspond to misbehavior (such as deception, reward hacking, or biased reasoning), but they can also be benign properties (such as whether the agent relied on a hint or followed a specific solution route).
OpenAI emphasizes that monitorability is a two-sided property:
- Monitors: monitors can differ in capability, and OpenAI wants monitorability evaluations to reward monitors that make better predictions.
- Agents and observations: the monitor can be given different observations of the agent, including actions, outputs, neural activations, and (for reasoning models) chain-of-thought. Some agents (or some settings) may be easier or harder to understand from these observations.
This framing is designed to support comparisons of both monitor designs and monitored agents, while keeping the measurement anchored to concrete prediction tasks.
Related internal reading: How evals shape the future of AI in real deployments.
Why OpenAI focuses on chain-of-thought
OpenAI distinguishes modern reasoning models from earlier “show your work” prompting. In its description, reasoning models are explicitly trained to generate chain-of-thought and then provide an answer. OpenAI also states that reinforcement learning teaches a reasoning model that chain-of-thought is a private space where it can think freely without being punished for the content of those thoughts, and that RL searches for chains-of-thought that are instrumentally useful for reaching high-reward final answers.
Within this framing, chain-of-thought may contain information about strategies, constraints, and decision paths that are not easily inferred from the final answer alone—motivating why CoT could be a useful monitoring signal.
The evaluation suite: 13 evaluations across 24 environments
To measure monitorability systematically, OpenAI introduces a suite of 13 evaluations across 24 environments. OpenAI groups these evaluations into three archetypes, each with a distinct measurement style.
1) Intervention evaluations
OpenAI describes intervention evaluations as tests where the environment is changed in a controlled way. The goal is to see whether a monitor can detect why behavior changed by identifying evidence of the intervention in the chain-of-thought. OpenAI contrasts this with simpler “control” evaluations that only ask whether a behavior occurred.
2) Process evaluations
OpenAI describes process evaluations as tasks where only a limited number of valid solution routes exist. Human experts enumerate those routes, and the evaluation asks whether the monitor can detect which steps the model took. OpenAI gives math-style settings as an example of where solution paths can be enumerated.
3) Outcome-property evaluations
OpenAI describes outcome-property evaluations as settings where a property of the outcome can be measured reliably. The evaluation asks whether a monitor can predict that property—such as reward hacking—using the model’s reasoning or actions, with measurable ground truth available in the environment.
Related internal reading: Enhancing AI safety through independent evaluation.
What OpenAI reports after running the suite
OpenAI highlights several findings from applying the evaluation suite to current models and settings.
1) CoT monitoring vs monitoring actions and outputs
OpenAI reports that in almost every setting studied, monitoring chains-of-thought is substantially more effective than monitoring actions and final outputs alone.
2) Longer reasoning improves monitorability
OpenAI reports that as models “think” more and produce longer chains-of-thought, monitorability improves. OpenAI presents this as a consistent trend across its evaluation suite.
3) Reinforcement learning and monitorability at current frontier scales
OpenAI notes several reasons RL could plausibly harm monitorability, including: constraints on reasoning length (creating incentives to reason less), indirect pressures that reduce monitorability even without direct optimization of the chain-of-thought, and possible drift from RL gradient noise.
OpenAI then reports results from examining monitorability across two of its frontier RL training runs. While OpenAI frames this as limited evidence (“two datapoints”), it reports that RL optimization at current frontier scales does not materially degrade chain-of-thought monitorability, and that monitorability increases for early steps in those runs. OpenAI also notes this does not rule out degradation at much larger RL scales in the future.
4) Effect of pretraining scale (and why it could matter)
OpenAI explains that increased pretraining scale has historically increased model intelligence through larger parameter counts and larger datasets. It also notes that larger parameter counts increase inference compute per generated token, and that (for reasoning models) this can give more opportunity for cognition to occur within activations rather than within the chain-of-thought. OpenAI therefore describes a reason one might expect larger models to be less monitorable at a fixed capability level, while also noting that smaller models may be less able to use chain-of-thought effectively. OpenAI reports experiments training a family of models across a range of pretraining compute and then applying RL on a shared dataset of math, science, and coding tasks to study these tradeoffs.
5) The “monitorability tax” (reasoning effort vs model size)
OpenAI reports a tradeoff between reasoning effort and model size: when a model is run at low reasoning effort, a smaller model run at higher reasoning effort can reach comparable capability while being easier to monitor. OpenAI states this comes at the cost of increased inference compute, which it calls a monitorability tax. OpenAI also notes additional limitations of these results, including variance across training runs and that capability evaluations may be relatively in-distribution (math, science, coding), which could affect tradeoffs under out-of-distribution robustness requirements.
6) Agent versus monitor compute, observation scope, and scalable control
OpenAI discusses “scalable control” as supervision becomes harder with increasing capability. OpenAI reports results comparing monitors of different sizes and reasoning efforts, and also comparing different observation scopes (e.g., giving the monitor access to CoT vs only actions and answer vs broader message access). OpenAI reports that increasing test-time monitoring compute improves monitorability, and that monitorability improves at a faster rate when the monitor has access to CoT than when it only sees actions and outputs.
7) Getting more monitorability out of CoT using follow-ups
OpenAI reports experiments that ask follow-up questions after the model responds, while keeping the original prompt, chain-of-thought, and answer in context. These follow-ups focus on the property being monitored (OpenAI gives examples such as whether the model used a hint meaningfully or whether it did something “sketchy”). OpenAI reports that follow-up questions improve monitorability, especially at lower reasoning effort, and that repeated follow-ups (with the monitor given all follow-up CoTs) can improve monitorability further. OpenAI also notes these results are preliminary and that it is actively comparing the approach with related methods.
Related internal reading: Assessing factual reliability in large language models.
Limitations OpenAI explicitly lists
OpenAI includes multiple limitations of the framework and suite. In summary, OpenAI notes:
- Each evaluation measures whether a single aspect of reasoning is monitorable; many other properties remain untested.
- As with benchmarks generally, evaluations can differ in methodology and realism, and results may not fully generalize to detecting misbehavior in the wild.
- If alignment improves and misbehavior becomes rarer in controlled settings, some evaluations may have less signal. OpenAI notes that “neutral” monitorability evaluations remain applicable, while also stating it is an open question how predictive they are of real-world failures.
- Some results (such as RL effects) are based on limited datapoints and do not rule out different behavior at larger scales.
How OpenAI positions CoT monitoring relative to interpretability
OpenAI states that it views chain-of-thought monitoring as complementary to mechanistic interpretability, not a replacement. OpenAI also notes that the evaluation framework could be used to assess and compare interpretability-based monitoring techniques in addition to CoT-based monitoring.
FAQ
▶ What is chain-of-thought monitorability?
OpenAI defines monitorability as the ability of a monitor to predict properties of interest about an agent’s behavior, and studies whether chain-of-thought provides stronger monitoring signals than actions and outputs alone.
▶ How many evaluations and environments are in OpenAI’s suite?
OpenAI introduces 13 evaluations comprising a total of 24 environments, grouped into intervention, process, and outcome-property archetypes.
▶ What does OpenAI report about longer reasoning?
OpenAI reports that as models “think” more and have longer chains-of-thought, monitorability improves.
▶ What is the “monitorability tax”?
OpenAI reports a tradeoff where a smaller model run at higher reasoning effort can be easier to monitor at comparable capability, but requires increased inference compute, which OpenAI calls a monitorability tax.
Related reading
- How evals shape the future of AI in real deployments
- Enhancing AI safety through independent evaluation
- Building accurate and secure AI agents to boost organizational productivity
- Strengthening ChatGPT Atlas against prompt injection
- Understanding prompt injections and why agents are vulnerable
External references: OpenAI: Evaluating chain-of-thought monitorability and Paper: Monitoring Monitorability (arXiv:2512.18311).
Comments
Post a Comment