❗ The Problem: LLMs Struggle to Adapt

We evaluated modern LLMs on algorithmic tasks in esoteric programming languages – Brainf**k and Befunge – where the syntax is rarely seen during pretraining. Even with the syntax rules and in-context examples provided, the models failed to generalize.
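
To make the setting concrete, here is a minimal Brainf**k interpreter in Python (an illustrative sketch, not the paper's evaluation harness). The language has only eight commands, yet even a simple task like copying input requires careful step-by-step reasoning when the syntax is unfamiliar.

```python
# Minimal Brainf**k interpreter (illustrative sketch, not the evaluation harness
# used in the experiments). The eight commands are: > < + - . , [ ]
def run_bf(code: str, stdin: bytes = b"") -> bytes:
    tape = [0] * 30000          # conventional 30k-cell tape
    ptr = 0                     # data pointer
    pc = 0                      # program counter
    out = bytearray()
    inp = list(stdin)

    # Pre-compute matching bracket positions for [ and ]
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            tape[ptr] = inp.pop(0) if inp else 0
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]
        pc += 1
    return bytes(out)

# Example: echo the input until a zero byte -- the flavor of "copy" task a model
# must reason about at the level of pointer moves and loops.
print(run_bf(",[.,]", b"hello"))   # b'hello'
```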

Brainf**k results (figure)

Befunge results (figure)

Despite providing full syntax and demonstrations, most models performed poorly. For instance, accuracy on Brainf**k sorting and copying tasks remained near 0%. Notably, the o1 model, trained with RL and search, showed stronger performance, signaling the value of learning to reason from scratch.


📌 Our Position

LLMs today are powerful examples of artificial useful intelligence (AUI), but they fall short of artificial general intelligence (AGI) because their reasoning fails to generalize.

We identify the core issue: Reasoning and knowledge are tightly coupled in today's LLMs.

✅ Our Proposal:

To move from AUI to AGI, we propose disentangling reasoning from knowledge via:

  1. Reward-based pretraining (RPT) – learning reasoning from scratch using RL.
  2. Synthetic curricula – simplified tasks to bootstrap general reasoning skills (see the sketch after this list).
  3. Small-context reasoning – a memory-reasoning separation that promotes transferability.
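
To illustrate items 1 and 2, the sketch below shows what a synthetic-curriculum task generator with an automatically verifiable reward could look like. The task families, difficulty schedule, and function names are illustrative assumptions, not the paper's specification.

```python
import random

# Hypothetical synthetic-curriculum sketch. The task families, difficulty
# schedule, and names here are illustrative assumptions, not the paper's setup.
def make_task(level: int, rng: random.Random):
    """Generate one verifiable algorithmic task whose difficulty grows with `level`."""
    n = 2 + level                                   # longer sequences at higher levels
    seq = [rng.randint(0, 9) for _ in range(n)]
    kind = rng.choice(["copy", "reverse", "sort"])  # simple skills to bootstrap reasoning
    target = {"copy": seq, "reverse": seq[::-1], "sort": sorted(seq)}[kind]
    prompt = f"{kind}: {' '.join(map(str, seq))}"
    return prompt, target

def reward(model_output: str, target: list) -> float:
    """Binary, automatically checkable reward -- the signal reward-based pretraining relies on."""
    try:
        pred = [int(tok) for tok in model_output.split()]
    except ValueError:
        return 0.0
    return 1.0 if pred == target else 0.0

# One curriculum step: sample an easy task; move to higher levels once rewards are reliable.
rng = random.Random(0)
prompt, target = make_task(level=1, rng=rng)
print(prompt)                                       # e.g. "sort: 8 7 4"
print(reward(" ".join(map(str, target)), target))   # a correct answer scores 1.0
```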

🔁 Direction 1: Pretraining with Reinforcement Learning (RPT)

Supervised pretraining (SPT) can trap models in local optima. Inspired by the shift from AlphaGo → AlphaZero, we argue that:

"Reasoning should be learned from scratch using RL, not imitation."

In both Go and synthetic LLM tasks (e.g., vector orthogonality), RPT outperforms SPT-then-RFT in generalization: pure RL agents learn step-by-step reasoning, while models initialized with supervised pretraining overfit.
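
To ground the vector-orthogonality example, the sketch below shows one way such a synthetic instance and its sparse, verifiable reward could look; the exact task format used in the experiments is not given here, so the names and parameters are assumptions.

```python
import random

# Illustrative sketch of a vector-orthogonality task with a sparse, verifiable
# reward -- the kind of signal an RL-from-scratch (RPT-style) learner trains on.
# The exact task format in the experiments may differ; names here are assumptions.
def orthogonality_instance(dim: int = 4, rng=None):
    rng = rng or random.Random()
    u = [rng.randint(-3, 3) for _ in range(dim)]
    v = [rng.randint(-3, 3) for _ in range(dim)]
    label = "yes" if sum(a * b for a, b in zip(u, v)) == 0 else "no"
    prompt = f"Are u={u} and v={v} orthogonal? Answer yes or no."
    return prompt, label

def rl_reward(model_answer: str, label: str) -> float:
    """Reward 1 only for a correct final answer; no gold reasoning trace is imitated."""
    return 1.0 if model_answer.strip().lower() == label else 0.0

prompt, label = orthogonality_instance(rng=random.Random(7))
print(prompt)                     # e.g. "Are u=[...] and v=[...] orthogonal? Answer yes or no."
print(rl_reward(label, label))    # a correct answer scores 1.0
```

Because the reward checks only the final answer, any intermediate reasoning has to be discovered through exploration rather than copied from demonstrations, which is the sense in which RPT parallels the AlphaGo → AlphaZero shift.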