
Sam Gao|May 09, 2026 07:26
Beyond gradient learning
Does AI learning necessarily rely on gradient updates to neural-network weights? OpenAI researcher Jiayi Weng, who recently drew wide attention for his work on determinism, offers a striking answer in a new experimental report.
The concept of "Heuristic Learning" he describes in the blog post rethinks RLVR, the approach that has been popular in recent years for improving models.
Instead of training a neural network or updating weights, a programming agent (Codex/gpt-5.4) continuously reads failure records, modifies code, adds tests, and reviews replays, making a program system steadily stronger.
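The loop described above can be sketched in a few lines. This is a hedged, minimal illustration, not the author's actual system: `evaluate`, `improvement_loop`, and `propose_patch` are hypothetical names, and the real "patch" step is a coding agent editing source files rather than a Python callable.

```python
# Toy sketch of gradient-free improvement: an agent repeatedly patches a
# rule-based policy and keeps a patch only when replayed episodes score
# better (a stand-in for regression tests). All names are hypothetical.

def evaluate(policy, episodes):
    """Fraction of recorded (observation, action) pairs the policy matches."""
    return sum(policy(obs) == act for obs, act in episodes) / len(episodes)

def improvement_loop(policy, propose_patch, episodes, rounds=5):
    best_score = evaluate(policy, episodes)
    for _ in range(rounds):
        candidate = propose_patch(policy, episodes)  # agent edits the code
        score = evaluate(candidate, episodes)        # review the replay
        if score > best_score:                       # keep only improvements
            policy, best_score = candidate, score
    return policy, best_score

# Usage: a naive policy gets replaced by a patched one that fits the replays.
episodes = [((x,), int(x > 0)) for x in (-2, -1, 1, 2)]
naive = lambda obs: 0
patched = lambda obs: int(obs[0] > 0)
final_policy, score = improvement_loop(naive, lambda p, e: patched, episodes)
```

The key property, as in the blog post's setup, is that nothing differentiable is needed: the only signal is whether the patched program does better on the recorded failures.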
According to the blog post: "Any task that can be iterated on continuously can be solved."
The experimental results are astonishing:
Atari Breakout: a pure rule-based policy iterated from 387 points to 864 points, the theoretical maximum for Breakout. Along the way, the policy developed ball-trajectory prediction, stuck-loop detection, fast-ball handling, and regression tests, going far beyond the naive rule "move the paddle left when the ball is on the left."
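A "ball trajectory prediction" rule of the kind mentioned above can be written without any learning at all. The following is an illustrative sketch, not the policy from the report: the screen width, function names, and tolerance are assumptions, and wall bounces are handled by unfolding reflections.

```python
# Hedged sketch of a Breakout-style trajectory rule: predict where the ball
# crosses the paddle row, reflecting off the side walls, then steer the
# paddle toward that x. Width of 160 (Atari screen) and all names assumed.

def predict_landing_x(x, y, vx, vy, paddle_y, width=160):
    """Advance the ball to paddle_y, folding wall reflections back in."""
    if vy <= 0:
        return x                    # ball moving up: hold position for now
    t = (paddle_y - y) / vy         # steps until the ball reaches the paddle
    raw = x + vx * t                # straight-line position ignoring walls
    period = 2 * width              # one full left-right reflection cycle
    folded = raw % period
    return folded if folded < width else period - folded

def paddle_action(paddle_x, target_x, tol=2):
    """Simple bang-bang control toward the predicted landing point."""
    if target_x < paddle_x - tol:
        return "LEFT"
    if target_x > paddle_x + tol:
        return "RIGHT"
    return "NOOP"
```

For example, a ball at x=10 moving right at 20 px/step over 10 steps travels to x=210 on the unfolded line, which folds back off the right wall to x=110; the paddle then moves toward 110. Rules like this are exactly the kind of thing an agent can add, test, and refine in code.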
MuJoCo Ant: a pure-Python policy first learned a rhythmic gait, then added short-horizon model-based planning, reaching a final score of 6000+, on par with mainstream deep reinforcement learning.
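A "rhythmic gait" in pure Python can be as simple as phase-offset oscillators driving the joints. This is an illustrative sketch under stated assumptions, not the report's controller: the frequency, amplitude, joint count, and phase scheme are guesses for Ant's eight actuators.

```python
# Hedged sketch of an open-loop rhythmic gait for MuJoCo Ant: each joint is
# driven by a sinusoid, with alternating joints half a cycle out of phase to
# produce a trot-like rhythm. All parameters here are illustrative guesses.
import math

def gait_action(t, n_joints=8, freq=2.0, amp=0.5):
    """Return one torque per joint at time t (seconds); no learning involved."""
    return [amp * math.sin(2 * math.pi * freq * t + math.pi * (j % 2))
            for j in range(n_joints)]
```

A controller like this gives the agent a working baseline whose parameters (frequency, amplitude, phase offsets) it can then tune iteratively against replays, before layering planning on top.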
Full Atari 57 suite: with 342 unsupervised search trajectories and roughly 1 million environment steps, the median human-normalized score (HNS) far exceeded PPO-style deep reinforcement learning baselines at a comparable step budget.
The core insight: heuristic rules were never hard to use; they were expensive for humans to maintain. Programming agents change that cost curve. Rules, tests, logging, memory, and patches can now form a continuously evolving heuristic system, one that genuinely addresses the problems online learning and continual learning have long struggled with.
This may be the next paradigm after pretraining, RLHF, and large-scale RL.