LANGUAGE MODELS THAT THINK, CHAT BETTER
TL;DR
Ever wished your AI assistant could actually think before spitting out an answer? It’s possible. We’ve seen that with reasoning models, and it’s also achievable through prompting. Let’s think of an LLM as having two brains: System 1 (fast, intuitive) and System 2 (slow, deliberate reasoning). Thanks to reinforcement learning with verifiable rewards, we’ve seen AI excel at that slow, System 2-style reasoning in areas like math and coding, but those skills rarely translated to everyday tasks like writing emails or creative fiction, where models have mostly stayed in fast, System 1 mode.
But now, researchers at Princeton have unveiled a new technique called Reinforcement Learning with Model-rewarded Thinking, or RLMT for short, that could bridge this gap, enabling language models to “think” and chat in a more human-like manner. It forces the LLM to use that deliberate, step-by-step thinking for all tasks, rewarding it using human preferences (via a reward model) rather than just a pass/fail grade.
Summary
For a long time in AI, there have been two main methods to refine large language models (LLMs) after pre-training:
- RL from Human Feedback (RLHF): This optimizes the model to align with human preferences, using a reward model that scores the final response. The model output is treated as a single, complete unit.
- RL with Verifiable Rewards (RLVR): This is used for strict domains like mathematics or code, where the answer is objectively right or wrong. Crucially, RLVR models are forced to generate a reasoning trace (Chain-of-Thought, or CoT) before the final answer, ensuring they plan their solution.
Both methods work well in their own lanes, but it turns out that models trained with math-focused RLVR don’t do so well on open-ended tasks.
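To make the contrast concrete, here is a toy sketch of the two reward signals in Python. The interfaces (a `reward_model` object with a `score` method, exact string matching for verification) are illustrative assumptions, not the paper’s or any library’s actual API:

```python
# Toy contrast of the two post-training reward signals (hypothetical interfaces).

def rlhf_reward(prompt: str, response: str, reward_model) -> float:
    """RLHF: a learned preference model scores the whole response as one unit."""
    return reward_model.score(prompt, response)  # assumed reward-model interface

def rlvr_reward(predicted_answer: str, gold_answer: str) -> float:
    """RLVR: a verifiable, pass/fail reward for objectively checkable answers."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0
```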
RLMT solves this generalization problem by effectively creating a hybrid training objective:
- Explicit Thinking: Like RLVR, RLMT forces the LLM to generate a long CoT trace ($z$) followed by the response ($y$). This makes the planning explicit (examples are in Figure 2 in the paper).
- General Rewards: Unlike RLVR, RLMT uses a preference-based Reward Model ($r$) (like in RLHF) to score the final response based on quality and alignment across diverse, real-world prompts.
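Putting the two pieces together, here is a minimal sketch of an RLMT-style rollout: the model emits a thinking trace $z$ followed by a response $y$, and only $y$ is scored by the preference reward model. The `<think>…</think>` tag format and the `reward_model.score` interface are assumptions for illustration; the paper’s exact prompting may differ:

```python
import re

# The policy emits a reasoning trace z inside <think>...</think> (assumed tag
# format) followed by the response y; only y is scored by the reward model r.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def split_trace_and_response(generation: str) -> tuple[str, str]:
    """Split a generation into the reasoning trace z and the final response y."""
    match = THINK_PATTERN.match(generation.strip())
    if match is None:
        return "", generation.strip()  # malformed output: treat it all as the response
    return match.group(1).strip(), match.group(2).strip()

def rlmt_reward(prompt: str, generation: str, reward_model) -> float:
    """Score only y with the preference reward model; z is optimized indirectly
    because the response is conditioned on it."""
    _z, y = split_trace_and_response(generation)
    return reward_model.score(prompt, y)  # assumed reward-model interface
```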
The core of RLMT lies in its different approach to training. Unlike traditional RLHF, RLMT requires language models to first generate an explicit “chain of thought” before producing a final response. This chain of thought is a reasoning trace that lets the model plan its answer before committing to one. This is similar to how we (well, some humans) think through problems, considering various angles and potential solutions before arriving at a conclusion. To achieve this, RLMT uses online reinforcement learning algorithms, such as GRPO (Group Relative Policy Optimization), optimizing the model against a reward model trained on human preference data. Importantly, although the reward model only scores the final response, the model has to think its way there, so optimizing that reward also shapes the reasoning trace itself. This encourages the model to develop better planning and reasoning skills.
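For intuition, here is a rough sketch of what one GRPO step could look like on top of that reward, using group-relative advantages instead of a learned value baseline. The `policy` and `reward_fn` interfaces are assumptions for illustration, not the authors’ actual training code:

```python
import numpy as np

def grpo_advantages(rewards) -> np.ndarray:
    """GRPO's group-relative baseline: advantage_i = (r_i - mean) / std over the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards a zero-variance group

def grpo_step(prompt, policy, reward_fn, group_size: int = 8):
    # 1. Sample a group of thought-then-response completions for the same prompt.
    generations = [policy.sample(prompt) for _ in range(group_size)]
    # 2. Score each completion's final response with the preference reward model
    #    (e.g. the rlmt_reward helper sketched above).
    rewards = [reward_fn(prompt, g) for g in generations]
    # 3. Turn rewards into group-relative advantages; no separate value network needed.
    advantages = grpo_advantages(rewards)
    # 4. A full implementation would now apply a clipped policy-gradient update,
    #    typically with a KL penalty to a reference model, weighted by these advantages.
    return generations, advantages
```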
To put RLMT to the test, it was applied to small models like Llama-3.1-8B and Qwen-2.5-7B, using RL algorithms like DPO, PPO, and GRPO to see which worked best. The results showed consistent, substantial gains of 3–7 points across chat benchmarks (AlpacaEval2, WildBench, Arena-HardV2) and 1–3 points on creative writing and general knowledge, surpassing non-thinking RLHF baselines across the board. Even more impressive, the best RLMT-trained Llama-3.1-8B model surpassed GPT-4o in chat and creative writing, rivaling the performance of Claude-3.7-Sonnet. RLMT was effective even when applied directly to base models without any initial supervised fine-tuning (SFT), a setup the paper refers to as “zero” training.
In simpler words, a relatively small 8-billion-parameter model, when trained with RLMT, could go head-to-head with models boasting over ten times the parameters. This isn’t just an improvement; it’s a game-changer.
Key Insights
- Thinking Generalizes When Rewarded Broadly: The most critical finding is that integrating the Chain-of-Thought structure with a general-purpose, preference-based reward model allows explicit reasoning to successfully transfer beyond verifiable domains (math, code) into open-ended tasks (chat, creative writing).
- Small Models Punch Above Their Weight: RLMT enabled small, 8B-parameter models to achieve state-of-the-art chat scores. The Llama-3.1-8B-Instruct-RLMT model surpassed models 10 times larger (Llama-3.1-70B-Instruct) and even outperformed GPT-4o on chat and creative writing benchmarks.
- Qualitative Shift in Planning: The training process fundamentally changes how the models think. After RLMT, Llama-3.1’s reasoning shifted from rigid, linear checklist-style outlines to more advanced strategies like constraint enumeration, theme grouping, iterative refinement, and weighing trade-offs—behaviors reflective of good human planning and writing.
- GRPO is King for Thinking: Among the tested online RL algorithms (DPO, PPO, GRPO), GRPO consistently delivered the best performance for RLMT.
- Data Quality Matters a Lot: Studies showed that the choice of the prompt mixture and the strength of the reward model are crucial. Using diverse, conversational prompts (like the WildChat-IF subset) resulted in better generalization than using simpler or unfiltered mixes, confirming that better input data leads to better thinking (as expected).
Chill, though: the authors acknowledge that it’s still unclear how much of the improvement comes from amplifying existing traits versus learning entirely new ones. They also point out that a larger, more diverse set of benchmarks could reveal more insights, and that they did not extensively optimize the format used for the internal CoT, the hyperparameters, or the construction of the prompt mixtures.
Personal takeaway
I like this, and I’ll test it on a project soon (hopefully). AI reasoning is definitely getting better every day (AGI soon?). Giving models the structural requirement to plan, and then rewarding that planning based on human-centric outcomes, could unlock new levels of performance in language models. Let’s see where this goes.