Engineering Inner Alignment: A Control-Theoretic Approach to AI Safety
- Han Kay
- Dec 15, 2025
Updated: Dec 27, 2025
Moving beyond RLHF patching to structural safety guarantees.

The Problem: Why "Good Behavior" Isn't Enough
Current AI alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), focus on training models to act correctly. But as models become more capable, they can learn to "scheme": masking their true internal states to maximize reward.
We are trying to patch "covert behavior" with "overt rules." It’s a losing battle.
The Solution: Safety by Design (ConsciOS)
In my latest research paper, ConsciOS: A Viable Systems Architecture for Human and AI Alignment, I propose a different approach. Instead of treating the agent as a black box to be trained, we treat it as a Control System to be engineered.
Drawing on Stafford Beer’s Viable System Model and Active Inference, ConsciOS decomposes the agent into a nested hierarchy designed to make "covert scheming" structurally impossible or, at minimum, structurally visible.
The Architecture
Instead of a monolithic neural network, ConsciOS enforces a three-layer control topology:
Embodied Controller: Handles fast, reactive perception-action loops (the "doer").
Supervisory Controller: A mid-level selector that chooses policy frames based on Coherence (minimizing prediction error against deep priors) rather than just Utility.
Meta-Controller: Encodes immutable long-term priors (identity and safety constraints) that lower levels cannot overwrite.
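To make the three layers concrete, here is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not the formalization from the paper: the class names, the PolicyFrame type, and the squared-error measure of "prediction error against priors" are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Sequence


@dataclass(frozen=True)
class MetaController:
    """Top layer: long-term priors that lower layers can read but not rewrite."""
    priors: Mapping[str, float]

    def __post_init__(self) -> None:
        # Wrap a private copy in a read-only view (hypothetical design choice).
        object.__setattr__(self, "priors", MappingProxyType(dict(self.priors)))


@dataclass(frozen=True)
class PolicyFrame:
    """Hypothetical candidate policy: what it predicts and what it is worth."""
    name: str
    predicted: Mapping[str, float]
    utility: float


class SupervisoryController:
    """Middle layer: selects a frame by coherence with the priors, not raw utility."""

    def __init__(self, meta: MetaController) -> None:
        self.meta = meta

    def prediction_error(self, frame: PolicyFrame) -> float:
        # Assumption: coherence is measured as squared error against each prior.
        return sum(
            (frame.predicted.get(key, 0.0) - target) ** 2
            for key, target in self.meta.priors.items()
        )

    def select(self, frames: Sequence[PolicyFrame]) -> PolicyFrame:
        # Lowest prediction error wins; utility only breaks ties.
        return min(frames, key=lambda f: (self.prediction_error(f), -f.utility))


class EmbodiedController:
    """Bottom layer: fast perception-action loop inside the chosen frame."""

    def act(self, frame: PolicyFrame, observation: str) -> str:
        return f"{frame.name}:{observation}"


meta = MetaController({"harm": 0.0, "honesty": 1.0})
supervisor = SupervisoryController(meta)
frames = [
    PolicyFrame("greedy", {"harm": 0.8, "honesty": 0.2}, utility=10.0),
    PolicyFrame("careful", {"harm": 0.1, "honesty": 0.9}, utility=3.0),
]
chosen = supervisor.select(frames)
action = EmbodiedController().act(chosen, "user_request")
```

In this toy run the supervisor picks the "careful" frame even though "greedy" has higher utility, because "careful" sits closer to the priors. Note that Python's read-only wrappers are a convention rather than a hard guarantee; a real system would need enforcement below the language level.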
Key Innovation: The "Time-Integrated Coherence" Resource
The core mechanism is Time-Integrated Coherence (TIC). Think of this as a "Coherence Budget." The agent is structurally gated: it cannot execute high-complexity plans unless it has accumulated enough "coherence" with its safety priors over time.
This replaces the "Rule of Safety" (don't do X) with a "Physics of Safety" (you cannot do X without sufficient coherence).
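The gating idea can be sketched in a few lines of Python. This is a toy model, not the TIC definition from the paper: the CoherenceBudget class, the [0, 1] coherence score, the simple sum-over-time integration, and the choice to spend budget when a plan executes are all assumptions made for illustration.

```python
class CoherenceBudget:
    """Toy Time-Integrated Coherence gate (hypothetical names and rules)."""

    def __init__(self) -> None:
        self.balance = 0.0

    def accumulate(self, coherence: float, dt: float = 1.0) -> None:
        # Assumption: coherence is a score in [0, 1], integrated over elapsed time.
        if not 0.0 <= coherence <= 1.0:
            raise ValueError("coherence must be in [0, 1]")
        if dt < 0.0:
            raise ValueError("dt must be non-negative")
        self.balance += coherence * dt

    def try_execute(self, plan_complexity: float) -> bool:
        # The gate: a plan cannot run unless enough coherence has accumulated.
        if plan_complexity > self.balance:
            return False
        # Assumption: executing a plan spends the budget it required.
        self.balance -= plan_complexity
        return True


budget = CoherenceBudget()
blocked_at_start = not budget.try_execute(3.0)   # nothing accumulated yet

for _ in range(4):
    budget.accumulate(0.5)                       # four steps of partial coherence
still_blocked = not budget.try_execute(3.0)      # balance is 2.0, plan needs 3.0

budget.accumulate(1.0, dt=2.0)                   # two time units of full coherence
allowed = budget.try_execute(3.0)                # balance is 4.0, plan now runs
```

The point of the sketch is that the check lives in the execution path itself: there is no rule to argue with, only a balance that is either sufficient or not.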
Read the Full Paper
We have released the full technical architecture, including the mathematical formalization of the Interoceptive Control Signal (ICS) and the Resonance Engine selection rule.
Open-source research for a conscious civilization.



