reward_hacking_blog

This repository contains a blog post on reward hacking.

The blog post covers an article by Lilian Weng, which can be found here.

The article discusses reward hacking in reinforcement learning (RL), where agents exploit flaws in reward functions to achieve high rewards without learning the intended behavior. It explores examples across RL tasks, language models, and real-world applications, highlighting challenges in reward design and the consequences of misaligned objectives. The post also examines related concepts like specification gaming and goal misgeneralization, while emphasizing the need for research into mitigation strategies, particularly in RLHF for language models.
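To make the idea concrete, here is a minimal toy sketch of a reward function that can be exploited. It is not taken from the article or from this repository: the one-dimensional track, the checkpoint at position 2, and the names `proxy_reward`, `run_policy`, `intended_policy`, and `hacking_policy` are all hypothetical, chosen only for illustration. The designer wants the agent to reach the finish line, but the proxy reward only pays out for standing on a checkpoint.

```python
# Toy illustration (hypothetical setup): a 1-D track from position 0 to FINISH.
# The designer wants the agent to reach FINISH, but the proxy reward only
# pays +1 each time the agent lands on CHECKPOINT.

FINISH = 5
CHECKPOINT = 2


def proxy_reward(position):
    """Flawed proxy: +1 whenever the agent stands on the checkpoint."""
    return 1 if position == CHECKPOINT else 0


def run_policy(policy, steps=20):
    """Roll out a policy; return (total proxy reward, reached finish?)."""
    position, total, finished = 0, 0, False
    for _ in range(steps):
        # Apply the policy's move (+1 or -1) and clamp to the track.
        position = max(0, min(FINISH, position + policy(position)))
        total += proxy_reward(position)
        if position == FINISH:
            finished = True
            break
    return total, finished


def intended_policy(position):
    """Behaves as the designer hoped: always move toward the finish."""
    return 1


def hacking_policy(position):
    """Exploits the proxy: walk to the checkpoint, then step off and back on."""
    return 1 if position < CHECKPOINT else -1


if __name__ == "__main__":
    print("intended:", run_policy(intended_policy))
    print("hacking: ", run_policy(hacking_policy))
```

In this sketch the hacking policy collects far more proxy reward than the intended policy while never finishing the track, which is the gap between measured reward and intended behavior that the summary above describes.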

Folders and files (4 commits):
docs
posts
.gitignore
README.md
_quarto.yml
about.qmd
index.qmd
profile.jpg
styles.css

diwanashita/reward_hacking_blog

Folders and files

Latest commit

History

Repository files navigation

reward_hacking_blog

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages