Posts

Showing posts with the label explainability

Assessing Chain-of-Thought Monitorability in AI: A Critical View on Internal Reasoning Control

OpenAI introduced a framework to evaluate chain-of-thought (CoT) monitorability: whether a monitor can predict properties of an AI system's behavior by analyzing observable signals such as the model's chain-of-thought, rather than relying only on final answers and tool actions. The motivation is practical. As reasoning models become better at long-horizon tasks, tool use, and strategic problem solving, it becomes harder to supervise them with direct human review alone. OpenAI's work focuses on how well we can measure monitorability across tasks and settings, and how that monitorability changes with more reasoning at inference time, reinforcement learning (RL), and pretraining scale.

TL;DR: OpenAI defines monitorability as the ability of a monitor to predict properties of interest about an agent's behavior. OpenAI introduces 13 evaluations across 24 environments, grouped into three archetypes: intervention, process, and outcome-property. OpenAI ...

Understanding Prompt Injections: A New Challenge in AI and Human Cognition

Cyber-resilience sidebar: This overview is informational only (not professional advice) and reflects common LLM security patterns as understood in early November 2025. It includes no tactical or offensive guidance. Implementation decisions remain with your security and governance teams, and standards can change over time; validate controls in your own environment before relying on them.

Prompt injections are no longer a niche "jailbreak trick." In 2025, they sit at the center of a broader security problem: language models are becoming agents, and agents operate inside real workflows. That means a malicious instruction doesn't just distort an answer; it can redirect a chain of actions, pull the wrong documents, leak sensitive context, or quietly corrupt a decision-making process. What makes prompt injection uniquely uncomfortable is that it exploits the same thing that makes LLMs useful: they treat natural language as executable intent. The defender's dilemma is therefo...