A rule-based reward model also means the training targets are limited to domains with a ground truth. It will be interesting to see how this can be extended to questions with ambiguous, but comparable, answers.
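To make the distinction concrete, here is a minimal sketch, not anyone's actual implementation. `rule_based_reward`, `comparison_reward`, and the `prefer` callback are hypothetical names chosen for illustration. The first function needs a ground-truth answer to check against. The second only needs some way to say which of two answers is better.

```python
import re
from typing import Callable


def rule_based_reward(response: str, ground_truth: str) -> float:
    """Rule-based check: 1.0 if the last number in the response equals the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # nothing verifiable in the response
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0


def comparison_reward(
    answer_a: str,
    answer_b: str,
    prefer: Callable[[str, str], bool],  # hypothetical judge: True if answer_a is better
) -> tuple[float, float]:
    """No ground truth: reward only the relative ordering of two answers."""
    return (1.0, 0.0) if prefer(answer_a, answer_b) else (0.0, 1.0)
```

The first function cannot be written at all when there is no single correct answer. The second still works there, but it moves the difficulty into `prefer`, which is exactly the open question.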

#RuleBasedAI #RewardModel #MachineLearning #ambiguity #GroundTruth
