# MLP

## MLPs can learn in-context

(I just saw a post about this from a Stanford PhD, then my hand slipped and I lost it.)

One of the most underrated empirical results of this year is that MLPs can learn in-context [14]. This is surprising because the attention mechanism is usually thought to be the key ingredient for in-context learning (induction heads in MHSA, etc.). I replicated these findings (the in-context regression task in particular) in small MLPs with just one hidden layer and as few as 32 hidden units, and found that the weight matrices learn a fascinating, structured pattern that matches the nature of the task the authors outline in the paper. The replication suggested an interesting mechanism for how MLPs solve the in-context classification and regression tasks in the paper, which amounts roughly to a very clever memorization pattern over the training data. I think the mech interp community would have a blast figuring this out, and I want to flag this empirical phenomenon for them. (A minimal sketch of my replication setup is at the end of this section.)

On a purely architectural level, MLP-only architectures have the benefit of consisting almost entirely of compute-intensive matmuls, which keep GPUs fed. But in practice, work like gMLP [15] shows that adding at least a little attention really is necessary to reach maximal performance. How does one square those results with the fact that MLPs can do simple in-context classification and regression tasks? What exactly fails in realistic settings that makes attention necessary? Or do the representations learned on these synthetic tasks simply not generalize to natural language the way induction heads do?
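As a concrete reference point, here is a minimal sketch of the kind of setup I used for the replication. This is my own assumed task format (flattening the in-context pairs and the query into one input vector), not the authors' exact code from [14]; the dimensions, context length, and hyperparameters are illustrative, and the hidden size of 32 matches the smallest MLP I mentioned above.

```python
# Minimal sketch of in-context linear regression with a one-hidden-layer MLP.
# Assumed format: each example packs N_CTX (x, y) pairs drawn from a fresh
# random linear function plus a query x; the target is the query's y under
# that same function, so the task can only be solved in-context.
import torch
import torch.nn as nn

D = 4        # dimension of x (illustrative)
N_CTX = 8    # number of in-context (x, y) pairs (illustrative)
HIDDEN = 32  # hidden units, as small as in my replication

def sample_batch(batch_size, noise=0.0):
    # Fresh weight vector per example.
    w = torch.randn(batch_size, D)
    x_ctx = torch.randn(batch_size, N_CTX, D)
    y_ctx = torch.einsum('bnd,bd->bn', x_ctx, w) + noise * torch.randn(batch_size, N_CTX)
    x_q = torch.randn(batch_size, D)
    y_q = torch.einsum('bd,bd->b', x_q, w)
    # Flatten the context pairs and the query into a single input vector.
    inp = torch.cat([x_ctx.reshape(batch_size, -1), y_ctx, x_q], dim=-1)
    return inp, y_q

model = nn.Sequential(
    nn.Linear(N_CTX * D + N_CTX + D, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(20_000):
    inp, y_q = sample_batch(256)
    loss = ((model(inp).squeeze(-1) - y_q) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        print(step, loss.item())

# Inspect the first-layer weight matrix for structure, e.g. how each hidden
# unit weights the context-x block, the context-y block, and the query-x block.
W1 = model[0].weight.detach()
print(W1.shape)  # (HIDDEN, N_CTX * D + N_CTX + D)
```

Looking at `W1` column-block by column-block is where the structured pattern I describe above shows up; I'd treat anything beyond that observation as an open question for the mech interp folks.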