How to unlock advanced reasoning via scalable RL?
🚀 Introducing PRIME (Process Reinforcement through Implicit Rewards) and Eurus-2, trained from a base model to surpass Qwen2.5-Math-Instruct using only 1/10 of the data.
We're still scaling up - w/ 3x more training data to go! 🧵