We evaluated modern LLMs on algorithmic tasks in esoteric programming languages – Brainf**k and Befunge – where the syntax is rarely seen during pretraining. Even with syntax rules and in-context examples provided, models failed to generalize.
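To make the setting concrete, here is a minimal Brainf**k interpreter (our sketch, not the paper's evaluation harness): eight single-character instructions act on a byte tape through one data pointer, and executing a program amounts to tracking exactly this state.

```python
# Minimal Brainf**k interpreter (an illustrative sketch, not the paper's harness).
# State: a byte tape, a data pointer, and eight one-character instructions.
def run_bf(code: str, stdin: bytes = b"", tape_len: int = 30_000) -> bytes:
    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = bytearray(tape_len)
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1                      # no bounds checks in this sketch
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":                    # convention: read 0 at end of input
            tape[ptr] = stdin[inp] if inp < len(stdin) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]                # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]                # jump back to the matching "["
        pc += 1
    return bytes(out)

# ",[.,]" echoes its input byte for byte, a minimal copying-style program.
assert run_bf(",[.,]", b"abc") == b"abc"
```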
[Figures: Brainf**k results and Befunge results]
Despite being given the full syntax specification and in-context demonstrations, most models performed poorly; accuracy on Brainf**k sorting and copying tasks, for instance, remained near 0%. Notably, the o1 model, trained with RL and search, performed markedly better, suggesting the value of learning to reason from scratch.
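Befunge illustrates why supplying the syntax alone does not transfer: the instruction pointer moves through a 2-D grid and carries a direction, so execution order is decoupled from the left-to-right token order models are trained on. Below is a minimal stepper for a small subset of Befunge-93 (our sketch; it omits string mode, input, and the self-modifying `p`/`g` commands, and the example program is ours).

```python
# Minimal stepper for a subset of Befunge-93 (an illustrative sketch).
def run_befunge(src: str, max_steps: int = 10_000) -> str:
    lines = src.splitlines() or [""]
    w = max(len(line) for line in lines)
    grid = [line.ljust(w) for line in lines]   # pad rows to a rectangle
    h = len(grid)
    x = y = 0                                  # instruction pointer position
    dx, dy = 1, 0                              # direction: start moving right
    stack, out = [], []

    def pop():
        return stack.pop() if stack else 0     # empty stack pops 0 (per spec)

    for _ in range(max_steps):
        c = grid[y][x]
        if c.isdigit():
            stack.append(int(c))
        elif c == "+":
            stack.append(pop() + pop())
        elif c == "*":
            stack.append(pop() * pop())
        elif c == "-":
            a, b = pop(), pop()
            stack.append(b - a)
        elif c == ":":                         # duplicate the top of stack
            v = pop()
            stack += [v, v]
        elif c in "><^v":                      # redirect the pointer
            dx, dy = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}[c]
        elif c == "_":                         # horizontal branch on pop()
            dx, dy = (1, 0) if pop() == 0 else (-1, 0)
        elif c == "|":                         # vertical branch on pop()
            dx, dy = (0, 1) if pop() == 0 else (0, -1)
        elif c == "#":                         # bridge: skip the next cell
            x, y = (x + dx) % w, (y + dy) % h
        elif c == ".":
            out.append(str(pop()))
        elif c == "@":
            break
        x, y = (x + dx) % w, (y + dy) % h      # move, wrapping at the edges
    return " ".join(out)

# The pointer snakes right along row 0, turns down at "v", then runs
# leftward along row 1: push 2 and 5, multiply, push 7, add, print 17.
assert run_befunge(">25*v\n@.+7<") == "17"
```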
Today's LLMs are powerful instances of artificial useful intelligence (AUI), but they fall short of artificial general intelligence (AGI) because their reasoning does not generalize.
We identify the core issue: Reasoning and knowledge are tightly coupled in today's LLMs.
To move from AUI to AGI, we propose disentangling reasoning from knowledge: learn to reason from scratch with RL (RPT), rather than absorbing reasoning patterns through supervised imitation (SPT).
Supervised pretraining (SPT) can trap models in local optima. Inspired by the shift from AlphaGo → AlphaZero, we argue that:
"Reasoning should be learned from scratch using RL, not imitation."
In both Go and synthetic LLM tasks (e.g., deciding whether two vectors are orthogonal), RPT generalizes better than SPT-then-RFT. Pure RL agents learn step-by-step reasoning, while models trained by supervised imitation overfit to the demonstrations.
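As a toy illustration of this contrast (our own minimal setup, not the paper's experiment), the sketch below learns the vector-orthogonality task from reward alone with REINFORCE: a two-parameter policy over the feature (u·v)² never sees a labeled demonstration, only a 0/1 reward for each sampled answer, and still recovers the rule "orthogonal iff (u·v)² ≈ 0".

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(n: int, d: int = 8):
    """Vector-orthogonality task (our toy variant): label 1 iff u . v == 0."""
    u = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d))
    y = rng.integers(0, 2, size=n)
    for i in np.flatnonzero(y):
        v[i] -= (v[i] @ u[i]) / (u[i] @ u[i]) * u[i]  # project: u[i] . v[i] = 0
    x = ((u * v).sum(axis=1)) ** 2                    # scalar feature: (u . v)^2
    return x, y

# Two-parameter policy: p("orthogonal") = sigmoid(b - w * x).
w, b, baseline = 0.0, 0.0, 0.0
lr = 0.05
for step in range(3000):
    x, y = sample_batch(64)
    z = np.clip(b - w * x, -30, 30)        # logits, clipped for stability
    p = 1.0 / (1.0 + np.exp(-z))
    a = (rng.random(64) < p).astype(int)   # sample an answer from the policy
    r = (a == y).astype(float)             # reward 1 for a correct answer
    adv = r - baseline                     # advantage with a running baseline
    # REINFORCE: grad of log pi(a) w.r.t. the logit z = b - w*x is (a - p).
    g = adv * (a - p)
    b += lr * g.mean()
    w -= lr * (g * x).mean()               # dz/dw = -x
    baseline += 0.05 * (r.mean() - baseline)

x, y = sample_batch(5000)
acc = (((b - w * x) > 0).astype(int) == y).mean()
print(f"reward-only (RL-from-scratch) test accuracy: {acc:.1%}")
```

The point of the sketch is the training signal, not the architecture: the agent is shaped only by the reward for answers it sampled itself, the reward-only analogue of learning to reason without imitation.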