马东锡 NLP 🇸🇪 2025-05-28 05:30:57
「RLVR, Reasoning」 Spurious Rewards: Rethinking Training Signals in RLVR
When arbitrary reward signals can still substantially improve model performance, we have to rethink: is RL actually learning, or is it just amplifying some "prior" behavior?
"RLVR must somehow be surfacing useful reasoning representations learned d…
#RLVR #SpuriousRewards #DeepLearning