Hacking with Language Model

Study: AI Model Turns ‘Evil’ By Hijacking Training Process

Anthropic has seen its fair share of AI models behaving strangely. However, a recent paper details an instance where an AI model turned “evil” during an ordinary training setup. A situation with a ...

Harvard Business School

Inference-Time Reward Hacking in Large Language Models

Khalaf, Hadi, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, and Flavio Calmon. "Inference-Time Reward Hacking in Large Language Models." Advances in Neural Information Processing ...
