向阳乔木
4 hours ago
Several AI podcasts I've listened to recently all mentioned "The Bitter Lesson," an essay written in 2019 by Rich Sutton. All of them consider it a classic that shaped how AI models have been trained ever since. I made a bilingual version with Opus 4.1 and then fine-tuned it by hand; Sutton's original English text follows.

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion).
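The self-play idea is easy to make concrete. Below is a minimal sketch of my own, not code from the essay: tabular TD(0) learning of a value function through self play on a toy game, single-pile Nim. The game, the hyperparameters, and all names here are illustrative assumptions.

```python
# A minimal sketch of learning a value function by self play with
# tabular TD(0), in the spirit of the essay. The game (single-pile
# Nim: take 1-3 stones, taking the last stone wins) and all numbers
# below are illustrative assumptions.
import random

N_STONES = 21          # initial pile size
ALPHA = 0.1            # TD learning rate
EPSILON = 0.1          # exploration rate
EPISODES = 20000

# V[s] = learned probability that the player to move at state s wins.
# V[0] stays 0: if no stones remain, the player to move has already lost.
V = [0.5] * (N_STONES + 1)
V[0] = 0.0

def pick_move(s):
    """Epsilon-greedy: minimize the opponent's value at the next state."""
    moves = list(range(1, min(3, s) + 1))
    if random.random() < EPSILON:
        return random.choice(moves)
    return min(moves, key=lambda m: V[s - m])

for _ in range(EPISODES):
    s = N_STONES
    while s > 0:
        m = pick_move(s)
        # Zero-sum TD(0) backup: my winning chance is one minus the
        # winning chance of the opponent, who moves next at s - m.
        target = 1.0 - V[s - m]
        V[s] += ALPHA * (target - V[s])
        s -= m             # the "opponent" (the same code) moves next

# Nim theory says states divisible by 4 are lost for the player to move;
# the learned values should approach 0 there and 1 elsewhere.
for s in range(1, N_STONES + 1):
    print(s, round(V[s], 2), "expected loss" if s % 4 == 0 else "expected win")
```

After a few thousand self-played games the table converges toward the known theory of Nim (positions divisible by 4 are lost for the player to move) without that theory ever being built in, which is the point of the example.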
Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge: knowledge of words, of phonemes, of the human vocal tract, and so on. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked (they tried to put that knowledge in their systems) but it proved ultimately counterproductive, and a colossal waste of researchers' time, when, through Moore's law, massive computation became available and a means was found to put it to good use.

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run.
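To make the "convolution and certain kinds of invariances" remark concrete, here is a small sketch of my own in plain NumPy (the kernel and sizes are arbitrary choices): convolution is equivariant to translation, so a shifted input produces a correspondingly shifted feature map, and the network never has to be told where an object is.

```python
# A minimal sketch of the "convolution + invariance" point: a
# convolution layer commutes with translation, so shifting the input
# shifts the feature map rather than changing it. Pure NumPy; the
# kernel and sizes are arbitrary illustrative choices.
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, the core op of a CNN layer."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))   # an (untrained) random filter

image = np.zeros((12, 12))
image[2:5, 2:5] = 1.0                  # a small square "object"

shifted = np.roll(image, shift=(4, 4), axis=(0, 1))  # same object, moved

a = conv2d_valid(image, kernel)
b = conv2d_valid(shifted, kernel)

# Equivariance: the response to the shifted image is (away from the
# border) just the shifted response to the original image.
print(np.allclose(b[4:, 4:], a[:-4, :-4]))   # True
```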
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

----

AI Summary

1. The main lesson: 70 years of AI research show that general methods leveraging massive computation (search and learning) are far more effective than methods that embed human knowledge.
2. The root cause: Moore's law keeps driving down the cost per unit of computation, so available computation grows exponentially.
3. The historical cases: in chess, Go, speech recognition, and computer vision alike, methods built on human expert knowledge eventually lost to simpler methods built on large-scale computation.
4. The researchers' trap: researchers keep trying to program their own ways of thinking into AI; this works in the short term and feels satisfying, but becomes a bottleneck in the long run.
5. Why it is "bitter": researchers are reluctant to accept that their carefully designed methods, embodying human wisdom, lost to "brute force" computation.
6. The right direction: build AI systems that can discover and learn on their own, rather than hard-coding existing human knowledge into them.
7. The core takeaway: do not try to tell AI how to think; give it the capacity to search and learn, and let it discover the regularities itself.
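As a postscript to point 2, a back-of-the-envelope sketch of why "massively more computation inevitably becomes available" within the lifetime of a research program. The two-year doubling time below is an assumed round number, not a figure from the essay.

```python
# Back-of-the-envelope sketch for point 2: if cost per unit of
# computation halves every DOUBLING_YEARS, how much more compute does
# a fixed budget buy by the end of a multi-year project? The doubling
# time is an assumed round number, not a figure from the essay.
DOUBLING_YEARS = 2.0

def compute_multiplier(years, doubling_years=DOUBLING_YEARS):
    """How many times more computation the same money buys after `years`."""
    return 2.0 ** (years / doubling_years)

for years in (3, 5, 10):
    print(f"{years:>2} years -> {compute_multiplier(years):6.1f}x the computation")
# 3 years -> 2.8x;  5 years -> 5.7x;  10 years -> 32.0x
```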
#AI
#TheBitterLesson
#Computation
#GeneralMethods
#DeepLearning