马东锡 NLP 🇸🇪, 3 months ago
「RLVR, Reasoning」 Spurious Rewards: Rethinking Training Signals in RLVR
When arbitrary reward signals can still substantially improve model performance, we have to reconsider whether RL is actually learning anything or merely amplifying some "prior" behavior.
"RLVR must somehow be surfacing useful reasoning representations learned d
#RLVR #SpuriousRewards #DeepLearning #reasoning #TrainingSignals #MachineLearning #ModelPerformance