# MLP

## MLPs can learn in-context

(I just saw a post about this from a Stanford PhD, then my hand slipped and I lost it.)

One of the most underrated empirical results of this year is that MLPs can learn in-context [14]. This is surprising because the attention mechanism is usually thought to be the key ingredient for in-context learning (induction heads in MHSA, etc.). I replicated these findings (the in-context regression task in particular) in small MLPs with just one hidden layer and as few as 32 hidden units, and found that the weight matrices learn a fascinating, structured pattern that matches the nature of the task the authors outline in the paper. The replication suggested an interesting mechanism for how MLPs solve the in-context classification and regression tasks in the paper, which amounts roughly to a very clever memorization pattern over the training data. I think the mech interp community would have a blast figuring this out, and I want to flag this empirical phenomenon for them. (A minimal sketch of my replication setup is at the end of this section.)

On a purely architectural level, MLP-only architectures have the benefit of consisting almost entirely of compute-intensive matmuls, which keep GPUs fed. But in practice, work like gMLP [15] shows that adding at least a little attention really is necessary to reach maximal performance. How does one square those results with the fact that MLPs can do simple in-context classification and regression tasks? What exactly fails in realistic settings that makes attention necessary? Or do the representations learned on these synthetic tasks simply not generalize to natural language the way induction heads do?
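As a concrete reference point, here is a minimal sketch of the kind of setup I used for the replication. This is my own assumed task format (flattening the in-context pairs and the query into one input vector), not the authors' exact code from [14]; the dimensions, context length, and hyperparameters are illustrative, and the hidden size of 32 matches the smallest MLP I mentioned above.

```python
# Minimal sketch of in-context linear regression with a one-hidden-layer MLP.
# Assumed format: each example packs N_CTX (x, y) pairs drawn from a fresh
# random linear function plus a query x; the target is the query's y under
# that same function, so the task can only be solved in-context.
import torch
import torch.nn as nn

D = 4        # dimension of x (illustrative)
N_CTX = 8    # number of in-context (x, y) pairs (illustrative)
HIDDEN = 32  # hidden units, as small as in my replication

def sample_batch(batch_size, noise=0.0):
    # Fresh weight vector per example.
    w = torch.randn(batch_size, D)
    x_ctx = torch.randn(batch_size, N_CTX, D)
    y_ctx = torch.einsum('bnd,bd->bn', x_ctx, w) + noise * torch.randn(batch_size, N_CTX)
    x_q = torch.randn(batch_size, D)
    y_q = torch.einsum('bd,bd->b', x_q, w)
    # Flatten the context pairs and the query into a single input vector.
    inp = torch.cat([x_ctx.reshape(batch_size, -1), y_ctx, x_q], dim=-1)
    return inp, y_q

model = nn.Sequential(
    nn.Linear(N_CTX * D + N_CTX + D, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(20_000):
    inp, y_q = sample_batch(256)
    loss = ((model(inp).squeeze(-1) - y_q) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        print(step, loss.item())

# Inspect the first-layer weight matrix for structure, e.g. how each hidden
# unit weights the context-x block, the context-y block, and the query-x block.
W1 = model[0].weight.detach()
print(W1.shape)  # (HIDDEN, N_CTX * D + N_CTX + D)
```

Looking at `W1` column-block by column-block is where the structured pattern I describe above shows up; I'd treat anything beyond that observation as an open question for the mech interp folks.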