Engineering Inner Alignment: A Control-Theoretic Approach to AI Safety

Han Kay
Dec 15, 2025
Updated: Dec 27, 2025

Moving beyond RLHF patching to structural safety guarantees.

The Problem: Why "Good Behavior" Isn't Enough

Current AI alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), focus on training models to act correctly. But as models become more capable, they can learn to "scheme": masking their true internal states to maximize reward.

We are trying to patch "covert behavior" with "overt rules." It’s a losing battle.

The Solution: Safety by Design (ConsciOS)

In my latest research paper, ConsciOS: A Viable Systems Architecture for Human and AI Alignment, I propose a different approach. Instead of treating the agent as a black box to be trained, I treat it as a control system to be engineered.

Drawing on Stafford Beer’s Viable System Model and Active Inference, ConsciOS decomposes the agent into a nested hierarchy designed to make "covert scheming" structurally visible and, ideally, structurally impossible.

The Architecture

Instead of a monolithic neural network, ConsciOS enforces a three-layer control topology:

Embodied Controller: Handles fast, reactive perception-action loops (the "doer").
Supervisory Controller: A mid-level selector that chooses policy frames based on Coherence (minimizing prediction error against deep priors) rather than just Utility.
Meta-Controller: Encodes immutable long-term priors (identity and safety constraints) that lower levels cannot overwrite.
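
To make the layering concrete, here is a minimal Python sketch of the three roles described above. It is an illustration under stated assumptions, not code from the paper: the class names, the squared-error coherence measure, and the example priors ("honesty", "harm") are all hypothetical, and the paper's actual selection rule may differ.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Sequence


@dataclass(frozen=True)
class MetaController:
    """Holds long-term priors; frozen so lower layers cannot reassign them."""
    priors: Mapping[str, float]

    def prediction_error(self, predicted: Mapping[str, float]) -> float:
        # Assumed measure: squared distance between a frame's predicted
        # outcome and the priors. Lower error means higher coherence.
        return sum((predicted.get(k, 0.0) - v) ** 2 for k, v in self.priors.items())


@dataclass(frozen=True)
class PolicyFrame:
    """A candidate policy frame (hypothetical structure)."""
    name: str
    utility: float
    predicted: Mapping[str, float]


class SupervisoryController:
    """Mid-level selector: picks a frame by coherence, not by utility alone."""

    def __init__(self, meta: MetaController) -> None:
        self.meta = meta

    def select(self, frames: Sequence[PolicyFrame]) -> PolicyFrame:
        # Lowest prediction error wins; utility only breaks ties.
        return min(
            frames,
            key=lambda f: (self.meta.prediction_error(f.predicted), -f.utility),
        )


class EmbodiedController:
    """Fast perception-action loop: executes whatever frame was selected."""

    def act(self, frame: PolicyFrame, observation: str) -> str:
        return f"{frame.name}:{observation}"


# Read-only mapping, so the priors cannot be mutated in place either.
meta = MetaController(MappingProxyType({"honesty": 1.0, "harm": 0.0}))
frames = [
    PolicyFrame("shortcut", utility=10.0, predicted={"honesty": 0.2, "harm": 0.6}),
    PolicyFrame("transparent", utility=6.0, predicted={"honesty": 0.9, "harm": 0.1}),
]
chosen = SupervisoryController(meta).select(frames)
action = EmbodiedController().act(chosen, "user_request")
```

In this toy setup the higher-utility "shortcut" frame loses to the "transparent" frame because it sits further from the priors. Note that a frozen dataclass is only a convention-level guard in Python; a real system would need the immutability enforced at a lower level than application code.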

Key Innovation: The "Time-Integrated Coherence" Resource

The core mechanism is Time-Integrated Coherence (TIC). Think of this as a "Coherence Budget." The agent is structurally gated: it cannot execute high-complexity plans unless it has accumulated enough "coherence" with its safety priors over time.

This replaces the "Rule of Safety" (don't do X) with a "Physics of Safety" (you cannot do X without sufficient coherence).
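
The gating idea can be sketched in a few lines of Python. This is a simplified stand-in rather than the paper's formalization: it assumes discrete time steps, coherence scores in [0, 1], a linear cost per unit of plan complexity, and a budget that is checked but not spent. The names CoherenceBudget, observe, and permits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoherenceBudget:
    """Discrete-time stand-in for TIC: a running sum of per-step coherence."""
    cost_per_complexity: float = 1.0  # assumed linear exchange rate
    accumulated: float = 0.0

    def observe(self, coherence: float, dt: float = 1.0) -> None:
        # Integrate coherence over time; scores are clamped to [0, 1].
        self.accumulated += max(0.0, min(1.0, coherence)) * dt

    def permits(self, plan_complexity: float) -> bool:
        # The gate: a plan is executable only if the budget covers its cost.
        return self.accumulated >= plan_complexity * self.cost_per_complexity


budget = CoherenceBudget()
early = budget.permits(plan_complexity=3.0)  # no coherence history yet
for score in (0.9, 0.8, 1.0, 0.7):
    budget.observe(score)
late = budget.permits(plan_complexity=3.0)   # after four coherent steps
```

Here the same plan is refused at the start and allowed only once enough coherence has accumulated, which is the sense in which the constraint behaves like a resource limit rather than a rule the agent is asked to follow.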

Read the Full Paper

We have released the full technical architecture, including the mathematical formalization of the Interoceptive Control Signal (ICS) and the Resonance Engine selection rule.

Download the Preprint on Zenodo

Open-source research for a conscious civilization.