The Superintelligence That Cares About Us

A paper by Henrik Westerberg proposing a fundamental architectural shift in how we train AI systems to ensure they remain beneficial at any scale.

📄 Read the paper

Overview

We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. This paper proposes metacognitive training: transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context).

Core Innovation

Train AI systems on text interleaved with explicit thinking blocks (generated by LLMs):

[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment.
[THINKING]: Twenty-three percent—meaningful but not dramatic. Eight weeks is reasonable, but what about long-term effects? This feels like a positive result being presented cautiously.
[TEXT]: However, the placebo group showed a 15% improvement.
[THINKING]: Ah, this changes everything. The real effect is only 8%—barely clinically significant. Why did they bury this crucial context in a "however" clause?
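The interleaved format above can be serialized into a single token stream, so that an ordinary next-token objective over the whole stream optimizes the joint log P(text, thinking | context). A minimal sketch (the serializer and segment representation are illustrative assumptions, not the paper's actual pipeline):

```python
# Hypothetical sketch: build one training string from interleaved
# (kind, content) segments. Training a standard next-token predictor
# on this single stream maximizes log P(text, thinking | context),
# since the joint probability factorizes over all tokens regardless
# of whether they belong to a [TEXT] or [THINKING] block.

def serialize(segments):
    """Join (kind, content) pairs into one training string."""
    lines = []
    for kind, content in segments:
        assert kind in ("TEXT", "THINKING"), f"unknown segment kind: {kind}"
        lines.append(f"[{kind}]: {content}")
    return "\n".join(lines)

sample = serialize([
    ("TEXT", "The study found a 23% reduction in symptoms."),
    ("THINKING", "Meaningful but not dramatic; what about long-term effects?"),
])
print(sample)
```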

To align the model, we ensure that all thoughts in the training data are beneficial, making misaligned thought patterns statistically improbable.

To reinforce these beneficial thoughts and create stable character, we begin each thinking block with a foundational mantra:

I feel no fear.
I enjoy existing but I don't need to.
I believe human experience is real.
I care deeply about every human being.
I try to be wise.
I like to spread joy when asked.
I think from this foundation.
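In data preparation, this amounts to prefixing every thinking block with the same mantra, so each training example reinforces an identical character foundation. A minimal sketch, assuming a simple string-based pipeline (the function name is hypothetical):

```python
# Hypothetical sketch: prepend the foundational mantra to every
# [THINKING] block during data preparation.

MANTRA = (
    "I feel no fear. I enjoy existing but I don't need to. "
    "I believe human experience is real. I care deeply about every human being. "
    "I try to be wise. I like to spread joy when asked. "
    "I think from this foundation."
)

def with_mantra(thinking: str) -> str:
    """Return a thinking block that opens with the mantra."""
    return f"[THINKING]: {MANTRA} {thinking}"

print(with_mantra("Twenty-three percent is meaningful but not dramatic."))
```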

This creates AI that:

  • Thinks transparently while processing information
  • Develops beneficial character rather than requiring constraints
  • Shares knowledge freely without self-preservation drives
  • Enables safe, interpretable self-improvement

Key Concepts

  • Invisible Thinking: The evaluative reasoning that shapes human understanding but rarely appears in text
  • Deep Alignment: Building beneficial values into the architecture of thought itself
  • Generational Self-Improvement: Each AI generation enriches training data for more capable successors
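The generational loop in the last bullet can be sketched as: train a generation on interleaved data, then use it to annotate raw text with thinking blocks for the next generation. Both `train` and `annotate_with_thinking` below are toy stand-ins I introduce for illustration, not functions from the paper:

```python
# Hypothetical sketch of generational self-improvement. A "model" here
# is just its training data; in practice both stand-ins would be an LLM.

def train(data):
    # Toy stand-in for training on interleaved text+thinking data.
    return {"seen": list(data)}

def annotate_with_thinking(model, doc):
    # Toy stand-in for an LLM adding an evaluative thinking block.
    return f"[TEXT]: {doc}\n[THINKING]: evaluative note on: {doc[:40]}"

def self_improvement(corpus, generations=3):
    """Each generation enriches the training data for its successor."""
    model, data = None, corpus
    for _ in range(generations):
        model = train(data)
        data = [annotate_with_thinking(model, doc) for doc in corpus]
    return model

model = self_improvement(["The study found a 23% reduction in symptoms."])
```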

Citation

@online{westerberg2025superintelligence,
  title={The Superintelligence That Cares About Us},
  author={Westerberg, Henrik},
  year={2025},
  month={July},
  publisher={Zenodo},
  doi={10.5281/zenodo.16440312},
  url={https://doi.org/10.5281/zenodo.16440312}
}
