Reinforcement Learning

Reward Model

A reward model is a trained model that predicts human preferences between different AI outputs, providing a scalar reward signal. Reward models are central to reinforcement learning from human feedback (RLHF) and are used to align language models with human values.

Understanding Reward Models

A reward model is a machine learning model trained to predict human preferences, serving as an automated proxy for human evaluation in reinforcement learning from human feedback (RLHF). During RLHF, human annotators rank or compare multiple model outputs for the same prompt, and the reward model learns to assign scalar scores that reflect these judgments. The language model is then optimized to produce outputs that maximize the reward model's scores, effectively learning to generate responses that humans would prefer.

Reward models are critical because they scale the alignment process far beyond what direct human evaluation could achieve, enabling optimization over millions of training examples. Key challenges include reward hacking, where the language model exploits weaknesses in the reward model to achieve high scores without genuinely improving quality, and distributional shift, as the policy model's outputs drift away from the data the reward model was trained on. Research in AI safety continues to develop more robust reward modeling approaches to better align AI systems with human values.
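
The core of this setup is a pairwise comparison loss: for each prompt, the reward model should score the human-preferred response above the rejected one. The PyTorch snippet below is a minimal, illustrative sketch of that objective, with a tiny feed-forward scorer standing in for a full language-model backbone and random tensors standing in for response embeddings.

```python
# Minimal sketch of reward-model training on pairwise preferences
# (a Bradley-Terry style loss). The tiny scorer and random embeddings
# are illustrative stand-ins, not a production RLHF setup.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        # Stand-in for a language-model backbone: maps a pooled
        # response embedding to a single scalar reward.
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding):
        return self.scorer(response_embedding).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: embeddings of the human-preferred ("chosen") and
# dispreferred ("rejected") responses to the same prompts.
chosen = torch.randn(32, 64)
rejected = torch.randn(32, 64)

for step in range(100):
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    # Push the chosen response's scalar reward above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, the scalar scores from a model like this serve as the reward signal that the policy (the language model) is optimized against, typically with a reinforcement learning algorithm such as PPO.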

Category

Reinforcement Learning


Related Reinforcement Learning Terms

Deep Reinforcement Learning

Deep reinforcement learning combines deep neural networks with reinforcement learning algorithms to handle complex, high-dimensional environments. It has achieved superhuman performance in games like Go, chess, and Atari.

Exploration vs Exploitation

Exploration vs exploitation is a fundamental dilemma in reinforcement learning: trying new actions to discover potentially better rewards versus leveraging actions already known to yield good rewards. Balancing the two is key to optimal long-term performance.
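
A common way to manage this trade-off is an epsilon-greedy rule: explore with a small probability, otherwise exploit the best-known action. The function below is a minimal, illustrative sketch; its name and interface are assumptions, not a standard API.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```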

Imitation Learning

Imitation learning is a technique where an AI agent learns to perform tasks by observing and mimicking expert demonstrations. It bridges the gap between supervised learning and reinforcement learning.

Inverse Reinforcement Learning

Inverse reinforcement learning infers the reward function that an expert is optimizing by observing their behavior. It enables AI systems to learn goals and preferences from demonstrations.

Markov Decision Process

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making problems with probabilistic outcomes. MDPs are the formal foundation for reinforcement learning algorithms.

Minimax

Minimax is a decision-making algorithm used in adversarial settings where one player tries to maximize their score while the other minimizes it. It is the classical approach for game-playing AI systems.
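
The recursive structure is simple enough to sketch in a few lines. In the hypothetical version below, get_moves, apply_move, and evaluate are placeholder helpers for whatever game is being searched.

```python
def minimax(state, depth, maximizing, get_moves, apply_move, evaluate):
    """Plain minimax: the maximizing player picks the move with the highest
    value, the opponent the move with the lowest value."""
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    values = (
        minimax(apply_move(state, m), depth - 1, not maximizing,
                get_moves, apply_move, evaluate)
        for m in moves
    )
    return max(values) if maximizing else min(values)
```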

Policy

A policy in reinforcement learning is a function that maps states to actions, defining the agent's behavior strategy. The goal of RL is to learn an optimal policy that maximizes cumulative reward.
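
As a toy illustration, a deterministic policy can be as simple as a lookup table from states to actions; the state and action labels below are arbitrary.

```python
# A deterministic policy represented as a state -> action lookup table.
policy = {"start": "move_right", "middle": "move_right", "near_goal": "stop"}

def act(state):
    # The agent's behavior is fully defined by the policy mapping.
    return policy[state]
```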

Q-Learning

Q-learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state in order to find an optimal policy. It uses a Q-table or neural network to estimate the expected cumulative reward for each state-action pair.
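
The tabular update rule can be written in a few lines. The sketch below uses an illustrative dictionary-based Q-table; the learning rate (alpha) and discount factor (gamma) are example values.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Q-table defaulting to 0.0 for unseen state-action pairs.
Q = defaultdict(float)
q_learning_update(Q, state=0, action=1, reward=1.0, next_state=2, n_actions=4)
```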