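马东锡 NLP 🇸🇪 2025-04-04 17:33:30 「LLM x RL」 DeepSeek's latest paper: Inference-Time Scaling for Generalist Reward Modeling. In RL, reward modeling (RM) is a very important component. An RM is mainly used to score an LLM's generated outputs, which in turn adjusts the LLM's policy so that it better meets the requirements set by the RM, such as stronger reasoning ability. For specific tasks (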
雁过留声 2025-02-01 17:35:59 He hit the nail on the head: DeepSeek has shaken the foundation of American capitalism.
Herrington Darkholme 2025-01-26 11:54:03 A rule-based reward model also means their training target is limited to domains with ground truth. It will be interesting to see how they extend this to questions with ambiguous, but comparable, answers.
Yann LeCun 2025-01-25 08:07:07 To people who think "China is surpassing the US in AI", the correct thought is "Open source models are surpassing closed ones." See ⬇️⬇️⬇️