A team of 40 contractors changed the trajectory of artificial intelligence. In 2022, OpenAI hired these labelers to review AI responses and indicate which ones they preferred. The result was InstructGPT, a model with 1.3 billion parameters that people favored over its 175-billion-parameter predecessor 85% of the time. A model 100 times smaller had become more useful simply by learning what humans actually wanted.
This gap between raw capability and practical usefulness reveals something essential about modern AI. The massive language models underlying ChatGPT and similar systems already contain most of their knowledge after initial training on internet text. What they lack is direction—an understanding of which of their many possible responses humans find helpful, honest, and harmless.
The Three-Phase Journey
Language models don't learn to align with human values all at once. They develop through distinct stages, each building on the last.
Pretraining comes first. Models consume trillions of tokens of text—GPT-3 processed 500 billion, while newer models like LLaMA digest 1.4 trillion tokens, equivalent to roughly 15 million books. This phase consumes 98% of the computational resources and gives models their fundamental abilities: grammar, facts, reasoning patterns, cultural knowledge. But pretraining optimizes for a single objective: predict the next token. This creates models that can complete any text pattern they've seen, whether that's a helpful answer, a conspiracy theory, or instructions for causing harm.
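Next-token prediction can be made concrete with a toy example. The sketch below fits a bigram count model to a nine-word corpus and computes the average negative log-likelihood of each actual next token, the same quantity pretraining minimizes at vastly larger scale. The corpus and the bigram model are illustrative stand-ins, nothing a real system uses.

```python
from collections import Counter, defaultdict
import math

# Toy corpus; real pretraining uses trillions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams to estimate P(next | current) -- a crude stand-in
# for what a transformer learns at vastly larger scale.
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_token_probs(token):
    total = sum(counts[token].values())
    return {t: c / total for t, c in counts[token].items()}

# The pretraining objective: minimize the average negative
# log-likelihood of each actual next token under the model.
pairs = list(zip(corpus, corpus[1:]))
nll = 0.0
for cur, nxt in pairs:
    nll -= math.log(next_token_probs(cur)[nxt])
print(f"avg next-token loss: {nll / len(pairs):.3f}")  # → avg next-token loss: 0.412
```

Nothing in this objective distinguishes helpful completions from harmful ones; it rewards only whatever pattern the data makes likely next.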
Supervised fine-tuning follows. Human labelers write example responses to thousands of prompts, demonstrating the kind of helpful, detailed answers they want to see. Models learn to mimic this style. Yet supervised learning has limits. When you train a model by showing it good examples, it learns surface patterns but misses nuance. It might learn to sound confident without learning when confidence is appropriate. The model can't distinguish between responses that seem good and responses that are good.
Reinforcement learning from human feedback—RLHF—solves this by teaching models to optimize for human preferences directly. Instead of showing the model what to write, labelers compare different responses the model generates and indicate which they prefer. The model learns not just to imitate, but to maximize the probability that humans will choose its outputs.
The Comparison Advantage
Asking humans to rank outputs rather than create them changes everything. Writing a perfect response to "Explain quantum entanglement" requires expertise and effort. Choosing between two explanations requires only judgment. This scales.
The process works through iteration. The model generates multiple responses to each prompt—typically five variations, sampled with randomness controlled by parameters like temperature. Labelers rank these options. A separate reward model learns to predict which responses humans will prefer by studying thousands of these comparisons. Then the language model itself gets updated to generate responses that score higher according to the reward model.
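The reward model's task, predicting which of two responses a labeler will prefer, is commonly framed as a Bradley-Terry model over pairwise comparisons. The sketch below fits a tiny linear reward model to hand-made preference pairs by gradient ascent; the feature vectors, learning rate, and data are toy assumptions standing in for learned response embeddings.

```python
import math

# Toy pairwise-preference data: each pair holds the feature vector of
# the preferred response, then that of the rejected one. Hand-made
# stand-ins for learned response embeddings.
comparisons = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.9, 0.1], [0.3, 0.8]),
    ([0.8, 0.3], [0.2, 0.7]),
]

w = [0.0, 0.0]  # weights of a linear reward model r(x) = w . x

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Bradley-Terry objective: maximize log sigmoid(r(preferred) - r(rejected)).
# Gradient steps nudge w so preferred responses score higher.
lr = 0.5
for _ in range(100):
    for good, bad in comparisons:
        margin = reward(good) - reward(bad)
        p = 1 / (1 + math.exp(-margin))  # predicted P(labeler prefers "good")
        for i in range(len(w)):
            w[i] += lr * (1 - p) * (good[i] - bad[i])

# After fitting, every preferred response should out-score its rival.
print([round(reward(g) - reward(b), 2) for g, b in comparisons])
```

Real reward models replace the linear scorer with a full language model that emits a scalar, but the training signal is the same pairwise objective.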
This creates a feedback loop. The language model improves. The reward model trains on new comparisons from the improved model. The cycle continues. Modern systems run three or four rounds of this process, each time adapting to the model's evolving output distribution.
The efficiency gains are substantial. OpenAI's 40 labelers provided enough signal to transform GPT-3 into InstructGPT. Compare this to the impossibility of having humans write training examples for every possible query a model might encounter.
The Noise Problem
Human preference data contains more ambiguity than you might expect. When multiple labelers evaluate the same model outputs, they agree only 60-70% of the time. One person's helpful explanation is another's condescending oversimplification. One labeler values conciseness; another wants thoroughness.
This disagreement isn't just noise to filter out—it reflects genuine complexity in human values. There often isn't a single "correct" response to open-ended questions. The model must learn to navigate this uncertainty, finding responses that satisfy most people most of the time while avoiding outputs that strongly violate anyone's preferences.
Reward models struggle particularly with examples unlike those in their training data. A model trained on comparisons of helpful explanations might fail to properly evaluate creative writing or code. Researchers address this through careful dataset curation and techniques like contrastive learning, which helps models distinguish subtle differences between good and bad responses.
The KL divergence constraint provides another safeguard. During training, the system penalizes the model for straying too far from its original behavior. This prevents a problem called reward hacking, where models find ways to score high on the reward model without actually being useful—like a student who learns to game a test rather than master the subject.
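One common way to apply the constraint is to subtract a scaled log-probability ratio between the tuned policy and the frozen original model from the reward-model score. The numbers below are invented purely to show the effect: a high-scoring response that drifts far from the reference model ends up penalized below a slightly lower-scoring response that stays close to it.

```python
BETA = 0.1  # KL penalty coefficient; an illustrative, not canonical, value

def penalized_reward(rm_score, policy_logprob, ref_logprob):
    """Reward-model score minus a penalty for drifting away from
    the original (reference) model's distribution."""
    kl_term = policy_logprob - ref_logprob  # log-probability ratio
    return rm_score - BETA * kl_term

# A response the reward model loves, but which the tuned policy now
# assigns far more probability than the original model ever did --
# the signature of possible reward hacking.
hacky = penalized_reward(rm_score=3.0, policy_logprob=-0.1, ref_logprob=-6.0)

# A slightly lower-scoring response that stays close to the
# reference model's behavior.
faithful = penalized_reward(rm_score=2.8, policy_logprob=-1.0, ref_logprob=-1.2)

print(hacky, faithful)  # the "faithful" response wins after the penalty
```

Tuning BETA trades off reward maximization against fidelity to the original model: too low and hacking creeps back in, too high and the model barely changes.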
Beyond Human Labelers
The success of RLHF has prompted questions about its scalability. High-quality human feedback is expensive and slow. We may exhaust easily accessible internet text for pretraining within years. What happens when human labeling becomes the bottleneck?
One answer: use AI to provide the feedback. Reinforcement Learning from AI Feedback (RLAIF) employs language models themselves as evaluators, guided by constitutional principles rather than human comparisons. Anthropic's Constitutional AI trains models using rules like "choose the response that is most helpful and harmless" without human ranking of every output.
Direct Preference Optimization (DPO) takes a different approach, bypassing reward models entirely. Instead of training a separate system to predict preferences and then optimizing against it, DPO adjusts the language model's parameters directly based on preference data. This simplifies the training pipeline and reduces computational costs.
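The DPO loss can be written down in a few lines. This sketch uses invented log-probabilities and an illustrative beta of 0.1; it shows the loss falling when the tuned policy shifts probability toward the preferred response relative to the frozen reference model.

```python
import math

def dpo_loss(policy_lp_w, policy_lp_l, ref_lp_w, ref_lp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses.
    Each argument is the summed log-probability of a whole response:
    _w is the preferred (winning) response, _l the rejected (losing)
    one, under the tuned policy and the frozen reference model."""
    logits = beta * ((policy_lp_w - ref_lp_w) - (policy_lp_l - ref_lp_l))
    return -math.log(1 / (1 + math.exp(-logits)))  # -log sigmoid

# Policy identical to the reference model: loss sits at -log(0.5).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy now favors the preferred response more than the reference
# did, and disfavors the rejected one: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)

print(baseline, improved)
```

Minimizing this loss over a preference dataset plays the role that the reward model plus reinforcement learning play in standard RLHF, which is why DPO needs no separate reward network.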
These methods don't replace human feedback so much as amplify it. AI evaluators still learn from human-labeled examples; they just extrapolate those preferences across a much larger space of possible outputs. The question is whether this extrapolation preserves what humans actually value or introduces subtle distortions.
What Alignment Actually Achieves
The improvements from RLHF extend beyond subjective preference. InstructGPT showed measurable gains on TruthfulQA, a benchmark designed to catch models making confident false statements. Toxic output rates dropped. Models became better at refusing harmful requests while remaining useful for legitimate purposes.
Yet these models still hallucinate facts, still exhibit biases, still occasionally produce harmful content. RLHF doesn't solve alignment—it makes models more aligned. The technique works by steering existing capabilities, not by adding new ones. A model that doesn't understand a topic won't give better answers after RLHF; it will just more reliably admit uncertainty or decline to answer.
This reveals both the power and limitation of learning from human feedback. The approach succeeds because pretrained models already contain vast knowledge and capabilities. Alignment techniques unlock these abilities and direct them toward human goals. But they can't transcend the model's fundamental limitations or resolve deep disagreements about values.
The 40 labelers who shaped InstructGPT didn't just make a model more useful. They demonstrated that human judgment, applied strategically through comparison and iteration, could guide systems far more complex than any individual could fully understand. The challenge ahead isn't just collecting more feedback—it's determining whose feedback to collect and what values we want these systems to learn.