马东锡 NLP 🇸🇪2025-04-04 17:33:30「LLM x RL」DeepSeek 最新论文:Inference-Time Scaling for Generalist Reward Modeling 在 RL 中,Reward Modeling(RM)是一个非常重要的部分。RM 主要用于对 LLM 的生成结果进行打分,从而调整 LLM 的 policy,使其更符合 RM 设定的要求,比如更强的 reasoning 能力。 针对特定任务(#LLM#RL#RewardModeling
Herrington Darkholme2025-01-26 11:54:03rule based reward model also means their training target would be limited to domains with ground truth. It is interesting how they can extend to questions with ambiguous, but comparable, answers#RuleBasedAI#RewardModel#MachineLearning