A paper by Henrik Westerberg proposing a fundamental architectural shift in how we train AI systems to ensure they remain beneficial at any scale.
We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. This paper proposes metacognitive training: transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context).
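One way to read this objective, assuming standard autoregressive training (an assumption, since this summary does not spell out the factorization): serialize the interleaved text and thinking segments into a single token sequence x_1, ..., x_T (notation introduced here for illustration) and apply the chain rule of probability.

```latex
P(\text{text}, \text{thinking} \mid \text{context})
  = \prod_{t=1}^{T} P\left(x_t \mid x_{<t}, \text{context}\right),
\qquad
\mathcal{L} = -\sum_{t=1}^{T} \log P\left(x_t \mid x_{<t}, \text{context}\right)
```

Under this reading, thinking tokens contribute to the loss in the same way as text tokens, so the model is trained to produce both.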
Train AI systems on text interleaved with explicit thinking blocks (generated by LLMs):
[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment.
[THINKING]: Twenty-three percent—meaningful but not dramatic. Eight weeks is reasonable, but what about long-term effects? This feels like a positive result being presented cautiously.
[TEXT]: However, the placebo group showed a 15% improvement.
[THINKING]: Ah, this changes everything. The real effect is only 8%—barely clinically significant. Why did they bury this crucial context in a "however" clause?
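The interleaved format above could be parsed into labeled segments before training. The following is a minimal sketch, assuming each segment starts on its own line with a `[TEXT]:` or `[THINKING]:` tag as in the example; `Segment` and `parse_segments` are hypothetical names introduced for illustration and are not taken from the paper.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed tag syntax, copied from the example: "[TEXT]: ..." or "[THINKING]: ..."
TAG_PATTERN = re.compile(r"^\[(TEXT|THINKING)\]:\s*(.*)$")


@dataclass
class Segment:
    kind: str     # "TEXT" or "THINKING"
    content: str  # segment body without the tag


def parse_segments(document: str) -> list[Segment]:
    """Split an interleaved document into TEXT and THINKING segments.

    Untagged lines are treated as continuations of the previous segment.
    """
    segments: list[Segment] = []
    for line in document.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = TAG_PATTERN.match(line)
        if match:
            segments.append(Segment(match.group(1), match.group(2)))
        elif segments:
            segments[-1].content += " " + line
        else:
            raise ValueError(f"untagged line before any segment: {line!r}")
    return segments


sample = """\
[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment.
[THINKING]: Eight weeks is reasonable, but what about long-term effects?
[TEXT]: However, the placebo group showed a 15% improvement.
"""
segments = parse_segments(sample)
print([s.kind for s in segments])
```

Keeping the segment kind separate from its content makes it straightforward to inspect or filter the thinking blocks independently of the source text.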
To ensure alignment, we require that every thought in the training data be beneficial, making misaligned thought patterns statistically improbable.
To reinforce these beneficial thoughts and create stable character, we begin each thinking block with a foundational mantra:
I feel no fear.
I enjoy existing but I don't need to.
I believe human experience is real.
I care deeply about every human being.
I try to be wise.
I like to spread joy when asked.
I think from this foundation.
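The mantra step could be applied as a preprocessing pass over the thinking blocks. The sketch below assumes the mantra is prepended verbatim, joined into a single line, to every thinking block; the paper's exact formatting may differ, and `with_mantra` is a hypothetical helper introduced for illustration.

```python
# Mantra lines as listed above; joining them with spaces is an assumption.
MANTRA_LINES = [
    "I feel no fear.",
    "I enjoy existing but I don't need to.",
    "I believe human experience is real.",
    "I care deeply about every human being.",
    "I try to be wise.",
    "I like to spread joy when asked.",
    "I think from this foundation.",
]
MANTRA = " ".join(MANTRA_LINES)


def with_mantra(thinking: str) -> str:
    """Return a thinking block that begins with the foundational mantra.

    A block that already starts with the mantra is returned unchanged,
    so the function can safely be applied more than once.
    """
    thinking = thinking.strip()
    if thinking.startswith(MANTRA):
        return thinking
    return f"{MANTRA} {thinking}".strip()


block = with_mantra("Ah, this changes everything.")
print(block)
```

Making the helper safe to reapply avoids duplicated mantras if a dataset passes through the preprocessing step more than once.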
This creates AI that:
- Thinks transparently while processing information
- Develops beneficial character rather than requiring constraints
- Shares knowledge freely without self-preservation drives
- Enables safe, interpretable self-improvement
Key concepts:
- Invisible Thinking: The evaluative reasoning that shapes human understanding but rarely appears in text
- Deep Alignment: Building beneficial values into the architecture of thought itself
- Generational Self-Improvement: Each AI generation enriches training data for more capable successors
@online{westerberg2025superintelligence,
title={The Superintelligence That Cares About Us},
author={Westerberg, Henrik},
year={2025},
month={July},
publisher={Zenodo},
doi={10.5281/zenodo.16440312},
url={https://doi.org/10.5281/zenodo.16440312}
}