Skip to content

diwanashita/reward_hacking_blog

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

reward_hacking_blog

This repository contain blog on Reward hacking.

The article this blog is about is by Lilian Weng and can be found here

The article discusses reward hacking in reinforcement learning (RL), where agents exploit flaws in reward functions to achieve high rewards without learning the intended behavior. It explores examples across RL tasks, language models, and real-world applications, highlighting challenges in reward design and the consequences of misaligned objectives. The post also examines related concepts like specification gaming and goal misgeneralization, while emphasizing the need for research into mitigation strategies, particularly in RLHF for language models.

Releases

No releases published

Packages

No packages published

Languages