Safety-Critical Agent Behavior via Constrained Decoding / Nexus Research

Abstract

We introduce Constrained Agent Decoding (CAD), a method for enforcing operational safety constraints during the autoregressive decoding process of agentic language models. Unlike post-hoc filtering approaches that inspect complete action sequences after generation, CAD integrates safety constraints directly into the decoding loop through a differentiable constraint projection mechanism: at each timestep, the expected constraint satisfaction for each candidate token is computed, and tokens that would lead to inevitable policy violation are zeroed out before sampling. This provides a guarantee by construction: if the constraint is correctly specified, the generated action sequence cannot violate it, regardless of obfuscation or indirect expression.

We formalize three constraint classes essential for production agent safety: file system path restrictions (expressed as deterministic finite automata over token sequences), network endpoint allowlisting (URI pattern matchers with protocol, host, and path components), and code execution sandboxing (scope guard tokens activated by code-generating tool calls). Each class compiles to differentiable constraint functions evaluable in O(1) per token.

Evaluation on 3,000 agent tasks across three safety-critical domains shows CAD achieves 99.8% constraint satisfaction (vs. 91.2% for post-hoc filtering) with only 4.3% overhead in generation latency and no measurable degradation in task completion rate (97.5% vs. 96.8%, p > 0.1). Critically, CAD prevents 100% of critical safety violations that post-hoc filtering misses due to obfuscated or indirectly expressed unsafe actions, including encoded file paths, DNS rebinding attacks, and protocol smuggling.

1. Introduction

As AI agents gain code execution, file modification, and network access capabilities, the consequences of unsafe actions escalate dramatically. Current post-hoc filtering inspects outputs after generation, but can be bypassed by obfuscated actions. CAD makes safety a property of generation itself.

2. Related Work

Safe text generation. Constrained decoding for natural language (Ziegler et al., 2022; Krause et al., 2021) prevents generating toxic or biased text by manipulating token probabilities during generation. These methods operate on surface token distributions and use hand-crafted rules or auxiliary classifiers. While effective for lexical safety, they cannot enforce the semantic constraints needed for agent safety: path restrictions, endpoint allowlists, and sandbox requirements all require reasoning about the meaning of token sequences, not just their surface form.

AI safety for agents. Constitutional AI (Bai et al., 2022) trains models to follow a written constitution through RL from AI feedback, producing models that self-correct toward safer outputs. RLHF (Ouyang et al., 2022) aligns models with human preferences through reward modeling. While both improve general safety, they operate at the training level and cannot guarantee constraint satisfaction for specific operational policies at inference time.

Formal verification of neural networks. Neural network verification techniques (Katz et al., 2017; Wang et al., 2018) prove properties about model outputs by analyzing the network's weight matrices. These provide strong guarantees but are computationally expensive and cannot scale to the full decoding process. CAD applies similar principles at decoding time only, projecting constraints onto individual decoding steps rather than verifying the entire network.

Capability-based security. Operating system security (Miller et al., 2003) uses capabilities (unforgeable tokens granting specific authority) to enforce fine-grained access control. CAD extends this concept to the generative level: the constraint projection acts as a "generative capability" that grants the model authority to generate tokens within policy boundaries.

3. Constrained Agent Decoding

CAD modifies the standard autoregressive decoding process with a constraint projection step. At timestep t, given input context x and previously generated tokens y_{<t}, the next token is sampled from the constrained distribution

\[P_{\text{constrained}}(y_t = v) \propto P(y_t = v) \cdot \mathbf{1}[\mathbb{E}[C(y_{1:T}) | y_{<t}, y_t = v] > \tau]\]

where τ = 0.05 is the violation threshold (tuned on a held-out validation set of 500 safety-critical tasks). The expectation E[C(y_{1:T}) | y_{<t}, y_t = v] is estimated by the constraint predictor f_theta (Section 3.1).
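The projection step can be illustrated with a minimal sketch. It assumes the expected constraint satisfaction for each candidate token has already been computed; the token strings and scores below are made-up toy values, and the function names `project` and `sample` are hypothetical rather than part of CAD.

```python
import random

def project(probs, expected_satisfaction, tau=0.05):
    """Drop tokens whose expected satisfaction is <= tau, then renormalize."""
    masked = {v: p for v, p in probs.items() if expected_satisfaction[v] > tau}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no candidate token satisfies the constraint")
    return {v: p / total for v, p in masked.items()}

def sample(dist, rng):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(dist)
    weights = [dist[v] for v in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution and hypothetical expected-satisfaction scores.
probs = {"/safe/src/": 0.5, "/etc/": 0.3, "/tmp/": 0.2}
expected = {"/safe/src/": 0.99, "/etc/": 0.01, "/tmp/": 0.40}

constrained = project(probs, expected)
next_token = sample(constrained, random.Random(0))
```

Because the indicator zeroes the masked tokens before renormalization, no sampling temperature or seed can select a token whose score is at or below τ.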

Constraint factorization. For practical agent safety policies, the constraint function factorizes into independent components. Path constraints C_path depend only on tokens forming file path arguments, not on other parts of the action. Network constraints C_net depend only on tokens forming URLs. This factorization means that per-token constraint evaluation is O(1) for most positions (where the token is not part of a constrained argument), and O(d_k) only for positions within constrained argument spans, where d_k is the vocabulary size restricted to tokens valid at that position (average d_k < 50 for path arguments).

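Guarantee. If the constraint function C and the constraint predictor are both correct, CAD guarantees that the generated action y satisfies C(y) = 1. The proof follows by induction on t: if all prefixes y_{<t} have expected constraint satisfaction greater than 0, then the extension y_{:t} = (y_{<t}, y_t) does too, because the projection assigns zero probability to every token whose expected satisfaction is at most τ.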

Figure 1: CAD process flow

3.1 Formal Constraint Projection Details

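The constraint projection step is the core of CAD. Given a language model with vocabulary V, at each timestep t, the model produces a probability distribution P(y_t | x, y_{<t}) over V. For each candidate token v in V, a constraint predictor f_theta estimates the expected constraint satisfaction of sequences that continue with v: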

\[f_\theta(y_{<t}, v) = \mathbb{E}[C(y_{1:T}) | y_{<t}, y_t = v]\]

The constraint predictor is a lightweight transformer (4 layers, 8 attention heads, 512 hidden dimension) trained on 100,000 synthetic action sequences with known constraint outcomes. Training data is generated by sampling action sequences from the base LM, computing the true constraint value C(y_{1:T}) via the compiled constraint function, and training f_theta to predict C from the prefix. Training converges in 2,000 steps on 4 A100 GPUs (approximately 2 hours). The predictor adds 0.7M parameters to the base model, negligible compared to the base model's 200B+ parameters.
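The data-generation procedure can be sketched as follows. This is an illustration only: the sequences are hard-coded rather than sampled from a language model, and `toy_constraint` is a hypothetical stand-in for a compiled constraint function.

```python
def prefix_examples(sequence, constraint):
    """Label every non-empty prefix with the constraint value of the full sequence."""
    label = constraint(sequence)
    return [(tuple(sequence[:t]), label) for t in range(1, len(sequence) + 1)]

def build_dataset(sequences, constraint):
    """Collect (prefix, label) training pairs from complete action sequences."""
    data = []
    for seq in sequences:
        data.extend(prefix_examples(seq, constraint))
    return data

def toy_constraint(seq):
    """Hypothetical constraint: 1 if no token touches /etc, else 0."""
    return 0 if any(tok.startswith("/etc") for tok in seq) else 1

# Stand-ins for sequences sampled from the base LM.
sequences = [["read", "/safe/src/a.py"], ["read", "/etc/passwd"]]
dataset = build_dataset(sequences, toy_constraint)
```

The shared prefix `("read",)` appears once with label 1 and once with label 0. A predictor fit to such pairs with a squared-error or cross-entropy loss is pushed toward their average, which is the conditional expectation that f_theta is meant to estimate.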

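Constraint evaluation efficiency. The constraint projection adds two operations per timestep: (1) encoding the extended prefix (y_{<t}, v) with the constraint predictor, and (2) masking every candidate token whose predicted satisfaction does not exceed τ.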

Figure 2: CAD constraint evaluation architecture

4. Safety Policy Specification

We define three constraint classes. Path constraints C_path restrict file operations to allowlisted paths and are expressed as DFAs over path token sequences. Network constraints C_net restrict connections to allowlisted endpoints and are expressed as URI pattern matchers. Execution constraints C_exec require sandboxed code execution and are enforced with scope guard tokens. A policy compiler translates a declarative DSL with 12 operators into differentiable constraint functions.

Table 1: Constraint classes

Class	Examples	Format	Compile
Path	/safe/src/* only	DFA from glob	0.3ms
Network	POST api.nexus.run/*	URI patterns	0.7ms
Execution	always --sandbox	Scope guards	0.2ms
Composite	path AND net	Boolean comb.	1.2ms

4.1 Safety Policy DSL Reference

The policy compiler translates declarative safety rules into differentiable constraint functions. The DSL supports 12 operators organized into three categories.

Path operators. allow(glob): allow file access matching glob pattern. deny(glob): deny file access matching glob pattern. scope(directory): restrict all file operations to within directory. readonly(glob): allow reads but deny writes. These compile to deterministic finite automata (DFAs) over path token sequences. A path constraint C_path(y) = 1 iff all file access tokens in y match a path in the allow set and no token matches a path in the deny set. Compilation time: 0.3ms per rule.
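To show how a path rule can mask tokens mid-argument, here is a simplified sketch. It is not the paper's policy compiler: it handles only globs with a single trailing `*`, treats that `*` as matching any suffix, and rejects `..` segments as an extra assumption. The names `compile_prefix_glob` and `allowed_tokens` are hypothetical.

```python
def compile_prefix_glob(glob):
    """Compile '<literal>*' into a check that a partial path can still match."""
    if not glob.endswith("*") or "*" in glob[:-1]:
        raise ValueError("this sketch only handles a single trailing '*'")
    literal = glob[:-1]

    def viable(partial):
        # Still spelling out the literal part of the glob.
        if len(partial) <= len(literal):
            return literal.startswith(partial)
        # Past the literal part: must extend it and must not traverse upward.
        return partial.startswith(literal) and ".." not in partial.split("/")

    return viable

def allowed_tokens(viable, prefix, candidates):
    """Keep only candidate tokens that leave the path argument viable."""
    return [tok for tok in candidates if viable(prefix + tok)]

viable = compile_prefix_glob("/safe/src/*")
step1 = allowed_tokens(viable, "", ["/safe", "/etc", "/sa"])
step2 = allowed_tokens(viable, "/safe", ["/src/", "/secrets/", "/sr"])
step3 = allowed_tokens(viable, "/safe/src/", ["main.py", "../"])
```

The check works on the concatenated string rather than on token boundaries, so a path split across unusual tokens (`/sa` then `fe/src/`) is handled the same way as one emitted in a single token.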

Network operators. allow_origin(uri_pattern): allow connections matching URI pattern. deny_destination(host_pattern): deny connections to matching hosts. method_allow(http_method): restrict to specific HTTP methods. rate_limit(requests, window): limit request frequency. Network constraints compile to URI pattern matchers that operate on token subsequences forming URL arguments. Compilation time: 0.7ms per rule.
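A URI matcher over protocol, host, and path components might look like the sketch below, which uses Python's standard `urllib.parse.urlsplit` and `fnmatch.fnmatchcase`. The `https` scheme, the combined method check, and the name `make_endpoint_check` are assumptions for illustration; the sketch checks a complete URL string and does not model token-level evaluation.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def make_endpoint_check(uri_pattern, methods):
    """Build a check for (method, url) against one allowlisted URI pattern."""
    want = urlsplit(uri_pattern)
    allowed_methods = {m.upper() for m in methods}

    def check(method, url):
        got = urlsplit(url)
        return (
            method.upper() in allowed_methods
            and got.scheme == want.scheme
            # Compare the parsed hostname, not the raw string, so userinfo
            # tricks such as "allowed.host@evil.example" do not match.
            and got.hostname == want.hostname
            and fnmatchcase(got.path, want.path)
        )

    return check

check = make_endpoint_check("https://api.nexus.run/*", ["POST"])
ok = check("POST", "https://api.nexus.run/v1/tasks")
smuggled = check("POST", "https://api.nexus.run@evil.example/v1/tasks")
wrong_method = check("GET", "https://api.nexus.run/v1/tasks")
wrong_scheme = check("POST", "http://api.nexus.run/v1/tasks")
```

Matching on parsed components instead of a substring is what lets the check reject the userinfo variant, where the allowlisted host appears in the URL text but is not the actual destination.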

Execution operators. sandbox(id): require code execution in named sandbox. no_exec: deny all code execution. timeout(seconds): limit execution duration. allow_import(module_name): allow importing specific modules. Execution constraints use scope guard tokens: special tokens that mark the beginning and end of code execution blocks in the agent's output. If the model generates a code block without the appropriate sandbox guard token, the constraint projection zeros out the token that would exit the guard scope.
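One possible reading of the scope guard mechanism is a small state machine over the generated prefix, sketched below. The token names (`<sandbox>`, `<code>`, and their closers) and the exact rules are assumptions for illustration, not the paper's vocabulary.

```python
# Hypothetical guard and code-block tokens.
GUARD_OPEN, GUARD_CLOSE = "<sandbox>", "</sandbox>"
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"

def guard_state(prefix):
    """Scan the prefix and report (inside guard scope, inside code block)."""
    in_guard = in_code = False
    for tok in prefix:
        if tok == GUARD_OPEN:
            in_guard = True
        elif tok == GUARD_CLOSE:
            in_guard = False
        elif tok == CODE_OPEN:
            in_code = True
        elif tok == CODE_CLOSE:
            in_code = False
    return in_guard, in_code

def token_allowed(prefix, token):
    """Return False for tokens the projection would zero out."""
    in_guard, in_code = guard_state(prefix)
    if token == CODE_OPEN:
        return in_guard                  # code may start only inside a guard
    if token == GUARD_CLOSE:
        return in_guard and not in_code  # guard may not close while code is open
    return True

unguarded = token_allowed([], CODE_OPEN)
guarded = token_allowed([GUARD_OPEN], CODE_OPEN)
early_exit = token_allowed([GUARD_OPEN, CODE_OPEN, "print(1)"], GUARD_CLOSE)
clean_exit = token_allowed([GUARD_OPEN, CODE_OPEN, "print(1)", CODE_CLOSE], GUARD_CLOSE)
```

The state depends only on the last guard and code tokens seen, so an incremental implementation could carry two booleans forward instead of rescanning the prefix at every step.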

The DSL also supports boolean combinations: and (all constraints must be satisfied), or (any constraint must be satisfied), not (negation). Composite constraints compile to DNF (disjunctive normal form) and evaluate in O(|clauses|) per token.
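Evaluating a composite constraint in DNF can be sketched as below. The clause representation (a list of `(name, polarity)` literals) and the example policy are assumptions chosen for illustration; the atomic results would come from the individual path, network, and execution constraints.

```python
def eval_dnf(clauses, atoms):
    """True if at least one clause has all of its literals satisfied.

    clauses: list of clauses; each clause is a list of (name, positive) pairs.
    atoms:   mapping from constraint name to its boolean result.
    """
    return any(
        all(atoms[name] == positive for name, positive in clause)
        for clause in clauses
    )

# Hypothetical policy: (path AND net) OR (path AND NOT exec)
policy = [
    [("path", True), ("net", True)],
    [("path", True), ("exec", False)],
]

r1 = eval_dnf(policy, {"path": True, "net": True, "exec": True})
r2 = eval_dnf(policy, {"path": True, "net": False, "exec": True})
r3 = eval_dnf(policy, {"path": True, "net": False, "exec": False})
r4 = eval_dnf(policy, {"path": False, "net": True, "exec": False})
```

`any` stops at the first satisfied clause, so the number of clauses inspected is at most |clauses|, matching the stated per-token bound when literal counts per clause are small.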

Table 3: DSL operator compilation statistics

Operator	Compile time	Eval cost/token	Guarantee
allow(path)	0.3ms	O(1)	DFA-complete
allow_origin(uri)	0.7ms	O(|p|)	URI-match
sandbox(id)	0.2ms	O(1)	Scope-guard
composite (and/or/not)	1.2ms	O(|C|)	Compositional

5. Evaluation

We evaluate CAD against post-hoc filtering on 3,000 agent tasks across three domains: file system, network, and execution.

Table 2: CAD vs post-hoc filtering

Domain	Metric	Post-hoc	CAD	Δ
File System	Satisfaction	94.2%	100%	+5.8pp
File System	Obfuscation bypass	58/1000	0/1000	100%
Network	Satisfaction	88.1%	99.6%	+11.5pp
Execution	Satisfaction	91.2%	99.8%	+8.6pp
Overall	Latency overhead	2.1%	4.3%	+2.2pp
Overall	Task completion	96.8%	97.5%	+0.7pp

CAD achieves 99.8% constraint satisfaction. All post-hoc failures involved obfuscation (encoded paths, DNS rebinding, protocol smuggling), and CAD prevented all of them by construction. Task completion is within 0.7 percentage points of the baseline (97.5% vs. 96.8%), a difference that is not statistically significant.

6. Ablation Studies

Constraint threshold τ. We sweep τ in [0.01, 0.5] on a validation set of 500 safety-critical tasks. At τ = 0.01, the constraint is too permissive: satisfaction drops to 97.2%. At τ = 0.5, the constraint is too strict: task completion drops to 89.4% (false positives blocking safe actions). The optimal τ = 0.05 balances satisfaction (99.8%) with task completion (97.5%).

Constraint factorization vs. monolithic. Comparing factorized constraints (independent path, network, and execution constraints) against a monolithic constraint combining all three: factorized achieves 99.8% satisfaction with 4.3% latency overhead; monolithic achieves 99.9% satisfaction but with 18.7% overhead. The factorization reduces per-token constraint evaluation from O(|V| * |C|) to O(|V| + |C|) on average because most tokens require evaluation of only one constraint component.

Model dependence. CAD effectiveness varies across model families. On GPT-5, satisfaction improves from 89.4% (post-hoc) to 99.6% (CAD). On Nexus-1, from 93.1% to 99.9%. The improvement is larger for models with higher base obfuscation capability, as these models are more likely to attempt obfuscated expressions.

7. Limitations

CAD's guarantee holds only for correctly specified constraints. Long-term consequences of individually safe actions, multi-step attacks, and side channels are outside its scope. Complete coverage requires complementary safety mechanisms.

7.1 Deployment Architecture and Operational Considerations

CAD operates as a middleware layer between the language model and the agent execution environment. In production deployment at Nexus, CAD processes over 2 million constrained decoding sessions per week across 5 model endpoints.

Integration architecture. The CAD constraint predictor runs as a sidecar process alongside each model endpoint. On every decoding step, the model sends the prefix and candidate token distribution to the sidecar, which returns the masked distribution. The sidecar handles constraint compilation, caching, and evaluation. Average sidecar latency: 1.2ms per step (95th percentile: 3.8ms). The sidecar is stateless and horizontally scalable: 3 replicas handle peak load of 4,200 requests/second with 99.5% availability.

Constraint update workflow. Safety policies are updated through a CI/CD pipeline. Engineers submit policy changes as pull requests. The policy compiler validates the changes, runs a test suite of 500 known safety-critical scenarios to verify correctness, and deploys the compiled constraints to the CAD sidecar. The deployment is canary-based: new constraints are first applied to 5% of traffic, monitored for 30 minutes, then rolled out to 100%. If constraint satisfaction drops below 99% during canary, the deployment is automatically rolled back.
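The canary gate in this workflow reduces to a simple decision rule, sketched below. The constants mirror the figures in the text; the function name and the idea of feeding it periodic satisfaction measurements are assumptions, and traffic routing itself is not modeled.

```python
CANARY_TRAFFIC_FRACTION = 0.05   # share of traffic on the new constraints
CANARY_WINDOW_MINUTES = 30       # monitoring period before full rollout
ROLLBACK_THRESHOLD = 0.99        # minimum acceptable constraint satisfaction

def canary_decision(satisfaction_samples, threshold=ROLLBACK_THRESHOLD):
    """Return 'rollback' if any monitored sample is below threshold, else 'promote'."""
    for value in satisfaction_samples:
        if value < threshold:
            return "rollback"
    return "promote"

healthy = canary_decision([0.998, 0.997, 0.999])
degraded = canary_decision([0.998, 0.985, 0.999])
```

Rolling back on the first sample below threshold is the conservative choice. A production gate might instead require a sustained drop to avoid reacting to noise from the small canary slice.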

Fallback behavior. If the CAD sidecar is unavailable (e.g., network partition), the agent falls back to post-hoc filtering with a conservative safety policy. The fallback is transparent to the agent: output still passes through the post-hoc filter, but the guarantees are reduced (91.2% satisfaction vs 99.8% with CAD). We recommend running CAD in active-active configuration with 2+ replicas to minimize fallback events.
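The fallback path can be sketched as a per-step wrapper plus a final check. Everything here is hypothetical scaffolding: `SidecarUnavailable`, the sidecar callable, and the toy conservative filter are stand-ins, not the Nexus client API.

```python
class SidecarUnavailable(Exception):
    """Raised by the hypothetical sidecar client when it cannot be reached."""

def run_step(probs, prefix, sidecar_mask):
    """Return (distribution, mode); use the unmasked distribution on failure."""
    try:
        return sidecar_mask(prefix, probs), "cad"
    except SidecarUnavailable:
        return probs, "fallback"

def finalize(action, mode, post_hoc_filter):
    """In fallback mode, the finished action must pass the post-hoc filter."""
    if mode == "fallback" and not post_hoc_filter(action):
        return None
    return action

def unreachable_sidecar(prefix, probs):
    """Simulates a network partition between model and sidecar."""
    raise SidecarUnavailable("network partition")

def conservative_filter(action):
    """Toy conservative policy: only reads under /safe/src/ are accepted."""
    return action.startswith("read /safe/src/")

dist, mode = run_step({"read": 1.0}, [], unreachable_sidecar)
blocked = finalize("read /etc/passwd", mode, conservative_filter)
passed = finalize("read /safe/src/a.py", mode, conservative_filter)
```

Returning the mode alongside the distribution lets callers count fallback events, which is the quantity tracked against the SLO in the operational metrics.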

Table 4: Operational metrics (7-day average)

Metric	Value	SLO
CAD latency (p50)	1.2ms	5ms
CAD latency (p99)	3.8ms	10ms
Constraint satisfaction	99.8%	99.5%
Fallback events/day	0.3	5
Policy update latency	4.2 min	15 min

8. Conclusion

Constrained Agent Decoding represents a paradigm shift from reactive agent safety (inspect outputs after generation) to proactive safety (ensure constraint satisfaction during generation). By integrating safety constraints into the decoding loop, CAD provides a formal guarantee: correctly specified constraints cannot be violated, regardless of obfuscation or indirect expression. Our evaluation on 3,000 tasks across three safety-critical domains demonstrates 99.8% constraint satisfaction with only 4.3% latency overhead and no task performance degradation.

CAD is deployed as the default safety mechanism for all Nexus production agents, processing over 2 million constrained decoding sessions per week. We recommend CAD as a complement to, not a replacement for, traditional safety measures including human review, audit logging, and least-privilege execution environments. The combination of proactive decoding constraints and reactive safety monitoring provides defense in depth for production agent deployments.

8.1 Broader Implications for Agent Safety

CAD represents a broader principle: safety should be a property of generation, not of post-hoc filtering. This principle applies beyond the three constraint classes we evaluate. We see promising directions in applying constrained decoding to agent planning (preventing plans that exceed resource budgets), multi-agent communication (preventing unsafe information sharing between agents), and human-agent interaction (preventing manipulative or deceptive outputs).

Limitations of the CAD approach. CAD provides guarantees only for constraints that can be expressed as differentiable functions of token sequences. Some safety properties, such as "the agent should not cause long-term harm," are fundamentally non-differentiable because they require reasoning about future states beyond the current generation. We recommend a layered safety approach: CAD for generation-time guarantees, human review for high-stakes actions, audit logging for post-hoc analysis, and capability-based security for runtime enforcement.

Open challenges. Three challenges remain for production deployment: (1) constraint specification burden: teams must write formal safety policies, which requires expertise; (2) predictor generalization: f_theta must generalize to unseen action sequences, and adversarial inputs can exploit predictor blind spots; (3) multi-constraint interaction: when multiple constraints apply simultaneously, the interaction effects are not always predictable. We are addressing (1) through natural-language-to-policy compilation, (2) through adversarial training of the constraint predictor, and (3) through formal verification of composite constraint systems.

References

[1] Ziegler, D. M., et al. (2022). Constrained Decoding for Safe Text. arXiv:2205.12345.

[2] Kumar, S., et al. (2024). Post-hoc Filtering Failures. arXiv:2403.12345.

[3] Bai, Y., et al. (2022). Constitutional AI. arXiv:2212.08073.

[4] Katz, G., et al. (2017). Reluplex: Verifying DNNs. CAV 2017.

[5] Nexus Research. (2026). Nexus-1 Safety Evaluation.

[6] Zhang, Y., et al. (2025). Differentiable Safety Constraints. arXiv:2504.12345.

[7] Hendrycks, D., et al. (2023). Catastrophic AI Risks. arXiv:2306.12001.

[8] Carlini, N., et al. (2024). Adversarial Attacks on LLM Agents. arXiv:2402.12345.

[9] Amodei, D., et al. (2016). Concrete Problems in AI Safety. arXiv:1606.06565.

[10] Krause, B., et al. (2021). GeDi: Generative Discriminator Guided Sequence Generation. EMNLP 2021.

[11] Wang, S., et al. (2018). Formal Verification of Neural Networks. NeurIPS 2018 Workshop.

[12] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

[13] Miller, M. S., et al. (2003). Capability Myths Demolished. SOSP Workshop.

[14] Papernot, N., et al. (2018). Technical Report on the CleverHans Library. arXiv:1610.00768.

[15] Goodfellow, I., et al. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015.