Posted by 超新星 on 2024-11-23 04:52:49

In-depth analysis: how does AlphaZero, the strongest AI engine, learn chess?

Last edited by 超新星 on 2024-11-23 04:54

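How exactly does AlphaZero learn chess? How does it decide on a particular move? How does it view concepts such as "king safety" or "mobility"? How does it learn openings, and how does its understanding of them differ from existing human opening theory? This article (based on a Chess.com article and the DeepMind team's latest paper) looks at how AlphaZero, the strongest AI engine, learns chess.
How does AlphaZero learn chess?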



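To some extent, AlphaZero's learning process resembles the way humans learn. According to the DeepMind team's latest paper (which includes observations from Vladimir Kramnik, the 14th world champion), AlphaZero has never studied a single human game, yet many ideas and concepts that humans can interpret were found in its neural network.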

How does AlphaZero learn chess? Why does it make certain moves? What values does it give to concepts such as king safety or mobility? How does it learn openings, and how is that different from how humans developed opening theory?

Questions like these are discussed in a fascinating new paper by DeepMind, titled "Acquisition of Chess Knowledge in AlphaZero". It was written by Thomas McGrath, Andrei Kapishnikov, Nenad Tomasev, Adam Pearce, Demis Hassabis, Been Kim, and Ulrich Paquet together with Kramnik. It is the second collaboration between DeepMind and Kramnik, after their research from last year, when they used AlphaZero to explore the design of different variants of chess with different sets of rules.



Encoding human conceptual knowledge

In their latest paper, the researchers tried a method for encoding human conceptual knowledge, to determine the extent to which the AlphaZero network represents human chess concepts. Examples of such concepts are the bishop pair, material (im)balance, mobility, or king safety. What these concepts have in common is that they are pre-specified functions that encapsulate a particular piece of domain-specific knowledge.

Some of these concepts were taken from Stockfish 8's evaluation function, such as material, imbalance, mobility, king safety, threats, passed pawns, and space. Stockfish 8 uses these as sub-functions that give individual scores leading to a "total" evaluation that is exported as a continuous value, such as "0.25" (a slight advantage to White) or "-1.48" (a big advantage to Black). Note that more recent versions of Stockfish have adopted AlphaZero-like neural networks, but these were not used for this paper.
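
To make the idea of sub-functions feeding a "total" evaluation concrete, here is a minimal sketch. It is not Stockfish 8's actual code: the concept names come from the paragraph above, but the scores are invented, and a real engine combines its terms in a more involved way than a plain sum.

```python
from typing import Dict

# Hypothetical per-concept scores in pawn units; positive values favour White.
# The concept names come from the article, the numbers are invented.
EXAMPLE_SUB_SCORES: Dict[str, float] = {
    "material": 0.00,
    "imbalance": 0.05,
    "mobility": 0.30,
    "king_safety": -0.20,
    "threats": 0.10,
    "passed_pawns": 0.00,
    "space": 0.00,
}


def total_evaluation(sub_scores: Dict[str, float]) -> float:
    """Add up the per-concept scores and round to two decimals."""
    return round(sum(sub_scores.values()), 2)


total = total_evaluation(EXAMPLE_SUB_SCORES)
print(f"{total:+.2f}")  # a positive total means White is judged to be better
```

Each entry in the dictionary stands in for one of the human concepts that the researchers later look for inside AlphaZero's network.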

The third type of concepts encapsulates more specific lower-level features, such as the existence of forks, pins, or contested files, as well as a range of features regarding pawn structure.

Having established this wide array of human concepts, the next step for the researchers was to try and find them within the AlphaZero network, for which they used a sparse linear regression model. After that, they started visualizing the human concept learning with what they call what-when-where plots: which concept is learned, at what point in training, and where in the network.
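
A sparse linear probe of this kind can be sketched as follows. This is only an illustration under stated assumptions: the "activations" are random numbers, the "concept" is a synthetic value that depends on two of six features, and `lasso_fit` is a hypothetical helper that solves an L1-penalised regression by coordinate descent. The paper's exact procedure may differ.

```python
import random
from typing import List


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value towards zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X: List[List[float]], y: List[float],
              lam: float = 0.1, sweeps: int = 100) -> List[float]:
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, because w starts at zero
    for _ in range(sweeps):
        for j in range(d):
            col = [row[j] for row in X]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue
            # Correlation of feature j with the residual, ignoring j's own contribution.
            rho = sum(col[i] * (residual[i] + w[j] * col[i]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / scale
            delta = new_wj - w[j]
            for i in range(n):
                residual[i] -= delta * col[i]
            w[j] = new_wj
    return w


# Synthetic stand-ins: 200 "positions", 6 "activations" each (assumption).
rng = random.Random(0)
activations = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
# The synthetic concept depends only on features 0 and 3, plus a little noise.
concept = [2.0 * a[0] - 1.5 * a[3] + 0.05 * rng.gauss(0.0, 1.0) for a in activations]

weights = lasso_fit(activations, concept)
print([round(w, 2) for w in weights])
```

The L1 penalty pushes the weights of features that carry no information about the concept to zero, which is what makes the probe "sparse": the few non-zero weights point at where the concept is represented.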

According to the researchers, AlphaZero indeed develops representations that are closely related to a number of human concepts over the course of training, including high-level evaluation of the position, potential moves and consequences, and specific positional features.

One interesting result was about material imbalance. As was demonstrated in Matthew Sadler and Natasha Regan's award-winning book Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI (New In Chess, 2019), AlphaZero seems to view material imbalance differently from Stockfish 8. The paper gives empirical evidence that this is the case at the representational level: AlphaZero initially "follows" Stockfish 8's evaluation of material more and more during its training, but at some point, it turns away from it again.



Piece value and material

The next step for the researchers was to relate the human concepts to AlphaZero's value function. One of the first concepts they looked at was piece value, something a beginner will first learn when starting to play chess. The classical values are nine for a queen, five for a rook, three for both the bishop and knight, and one for a pawn. The left figure below (taken from the paper) shows the evolution of piece weights during AlphaZero's training, with piece values converging towards commonly accepted values.
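
As a small worked example of the classical values, the sketch below counts material from the piece-placement field of a FEN string. The function name `material_balance` is hypothetical, and this is only the beginner's counting rule, not the way the paper estimates AlphaZero's piece weights.

```python
# Classical piece values as listed above; kings are not counted.
PIECE_VALUES = {"q": 9, "r": 5, "b": 3, "n": 3, "p": 1}


def material_balance(fen_placement: str) -> int:
    """Return White's material minus Black's for a FEN piece-placement field.

    Uppercase letters are White pieces and lowercase letters are Black pieces.
    """
    balance = 0
    for ch in fen_placement:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # digits (empty squares), '/' separators, and kings
        balance += value if ch.isupper() else -value
    return balance


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
white_missing_a1_rook = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/1NBQKBNR"
print(material_balance(start), material_balance(white_missing_a1_rook))
```

The starting position is balanced, and removing White's a1 rook shifts the count by five points in Black's favour.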


Left figure: AlphaZero's estimate of each piece's value over the course of training

The image on the right shows that during AlphaZero's training, material becomes more and more important in the early stages of learning chess (consistent with human learning), but it reaches a plateau, and at some point the values of more subtle concepts such as mobility and king safety become more important while material actually decreases in importance.



AlphaZero's training vs. the historical development of human chess knowledge

Another part of the paper is dedicated to comparing AlphaZero's training to the progression of human knowledge over history. The researchers point out that there is a marked difference between AlphaZero's progression of move preferences through its history of training steps, and what is known of the progression of human understanding of chess since the 15th century:

AlphaZero starts with a uniform opening book, allowing it to explore all options equally, and largely narrows down plausible options over time. Recorded human games over the last five centuries point to an opposite pattern: an initial overwhelming preference for 1.e4, with an expansion of plausible options over time.
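
One way to put a number on "narrowing down" is the Shannon entropy of the first-move distribution. The sketch below is an illustration only: the paper is not claimed to use this measure, and the `narrowed` probabilities are invented. The uniform case uses the 20 legal first moves available to White.

```python
import math
from typing import Dict


def entropy_bits(move_probs: Dict[str, float]) -> float:
    """Shannon entropy, in bits, of a probability distribution over moves."""
    return -sum(p * math.log2(p) for p in move_probs.values() if p > 0)


# White has 20 legal first moves: 16 pawn moves and 4 knight moves.
FIRST_MOVES = [f"{file}{rank}" for file in "abcdefgh" for rank in "34"]
FIRST_MOVES += ["Na3", "Nc3", "Nf3", "Nh3"]
uniform = {move: 1 / len(FIRST_MOVES) for move in FIRST_MOVES}

# Hypothetical late-training preferences, invented for illustration.
narrowed = {"e4": 0.45, "d4": 0.40, "Nf3": 0.10, "c4": 0.05}

print(round(entropy_bits(uniform), 2), round(entropy_bits(narrowed), 2))
```

A uniform opening book has the highest possible entropy for 20 moves, and the value drops as preference concentrates on a few "main" moves. The human record described above would trace the opposite curve, starting low with near-exclusive 1.e4 and rising as more first moves came into use.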

The researchers compare the games AlphaZero plays against itself with a large sample taken from the ChessBase Mega Database, starting with games from the year 1475 and running up to the 21st century.

Humans initially played 1.e4 almost exclusively, but 1.d4 was slightly more popular in the early 20th century, soon followed by the increasing popularity of more flexible systems like 1.c4 and 1.Nf3. AlphaZero, on the other hand, tries out a wide array of opening moves in the early stage of its training before starting to value the "main" moves higher.




AlphaZero's move preferences at different stages of training



The Berlin Defense in the Ruy Lopez

A more specific example provided is about the Berlin variation of the Ruy Lopez (the move 3...Nf6 after 1.e4 e5 2.Nf3 Nc6 3.Bb5), which only became popular at the top level in the early 21st century, after Kramnik successfully used it in his world championship match with GM Garry Kasparov in 2000. Before that, it was considered to be somewhat passive and slightly better for White, with the move 3...a6 being preferable.

The researchers write:

Looking back in time, it took a while for human chess opening theory to fully appreciate the benefits of Berlin defense and to establish effective ways of playing with Black in this position. On the other hand, AlphaZero develops a preference for this line of play quite rapidly, upon mastering the basic concepts of the game. This already highlights a notable difference in opening play evolution between humans and the machine.




How AlphaZero and humans came to adopt the Berlin Defense



Remarkably, when different versions of AlphaZero are trained from scratch, half of them strongly prefer 3...a6, while the other half strongly prefer 3...Nf6! This is interesting because it means that there is no "unique" good chess player. The following table shows the preferences of four different AlphaZero neural networks:




Preferences of four different AlphaZero neural networks after 1.e4 e5 2.Nf3 Nc6 3.Bb5, each measured after one million training steps. Sometimes AlphaZero prefers 3...a6, and sometimes 3...Nf6.



In a similar vein, AlphaZero develops its own opening "theory" for a much wider array of openings over the course of its training. At some point, 1.d4 and 1.e4 are discovered to be good opening moves and are rapidly adopted. Similarly, AlphaZero's preferred continuation after 1.e4 e5 is determined in another short temporal window. The figure below illustrates how both 2.d4 and 2.Nf3 are quickly learned as reasonable White moves, but 2.d4 is then dropped almost as quickly in favor of 2.Nf3 as a standard reply.


AlphaZero settling on its preferred move after 1.e4 e5



Kramnik's qualitative assessment

Kramnik's contribution to the paper is a qualitative assessment, an attempt to identify themes and differences in AlphaZero's style of play at different stages of its training. The 14th world champion was given sample games from four different stages to look at.

According to Kramnik, in the early training stage, AlphaZero has "a crude understanding of material value and fails to accurately assess material in complex positions. This leads to potentially undesirable exchange sequences, and ultimately losing games on material." In the second stage, AlphaZero seemed to have "a solid grasp on material value, thereby being able to capitalize on the material assessment weakness" of the early version.

In the third stage, Kramnik feels that AlphaZero has a better understanding of king safety in imbalanced positions. This manifests in the second version "potentially underestimating the attacks and long-term material sacrifices of the third version, as well as the second version overestimating its own attacks, resulting in losing positions."

In the fourth stage of its training, AlphaZero has a "much deeper understanding" of which attacks will succeed and which will fail. Kramnik notices that it sometimes accepts sacrifices played by the "third version," proceeds to defend well, keeps the material advantage, and ultimately converts it into a win.

Another point Kramnik makes, which feels similar to how humans learn chess, is that tactical skills appear to precede positional skills as AlphaZero learns. By generating self-play games over separate opening sets (e.g. the Berlin or the Queen's Gambit Declined in the "positional" set and the Najdorf and King's Indian in the "tactical" set), the researchers manage to provide circumstantial evidence but note that further work is needed to understand the order in which skills are acquired.




Kramnik once again worked with DeepMind on AlphaZero research



Implications beyond the chess world

For a long time, it was believed that machine-learning systems learn uninterpretable representations that have little in common with human understanding of the domain they are trained on. In other words, how and what AI teaches itself is mostly gibberish to humans.

With their latest paper, the researchers have provided strong evidence for the existence of human-understandable concepts in an AI system that wasn't exposed to human-generated data. AlphaZero's network shows the use of human concepts, even though AlphaZero has never seen a human game of chess.



This might have implications outside the chess world. The researchers conclude:

The fact that human concepts can be located even in a superhuman system trained by self-play broadens the range of systems in which we should expect to find human-understandable concepts. We believe that the ability to find human-understandable concepts in the AZ network indicates that a closer examination will reveal more.

Co-author Nenad Tomasev told Chess.com that he was personally curious whether there is such a thing as a "natural" progression of chess theory:

Even in the human context, if we were to 'restart' history and go back in time, would the theory of chess have developed in the same way? There were a number of prominent schools of thought in terms of the overall understanding of chess principles and middlegame positions: the importance of dynamism vs. structure, material vs. sacrificial attacks, material imbalance, the importance of space vs. the hypermodern school that invites overextension in order to counterattack, etc. This also informed the openings that were played. Looking at this progression, what remains unclear is whether it would have happened the same way again. Maybe some pieces of chess knowledge and some perspectives are simply easier and more natural for the human mind to grasp and formulate? Maybe the process of refining them and expanding them has a linear trajectory, or not? We can't really restart history, so we can only ever guess what the answer might be.

However, when it comes to AlphaZero, we can retrain it many times, and also compare the findings to what we have previously seen in human play. We can therefore use AlphaZero as a Petri dish for this question, as we look at how it acquires knowledge about the game. As it turns out, there are both similarities and dissimilarities in how it builds its understanding of the game compared to human history. Also, while there is some level of stability (results being in agreement across different training runs), it is by no means absolute (sometimes the training progression looks a little bit different, and different opening lines end up being preferred).

Now, this is by no means a definitive answer to what is, to me personally, a fascinating question. There is still plenty to think about here. Yet, we hope that our results provide an interesting perspective and make it possible for us to start thinking a bit deeper about how we learn, grow, and improve: the very nature of intelligence and how it goes all the way from a blank slate to what is a deep understanding of a very complex domain like chess.



Kramnik's view

"There are two major things which we can try to find out with this work. One is: how does AlphaZero learn chess, how does it improve? That is actually quite important. If we manage one day to understand it fully, then maybe we can interpret it into the human learning process.

Secondly, I believe it is quite fascinating to discover that there are certain patterns that AlphaZero finds meaningful, which actually make little sense for humans. That is my impression. That actually is a subject for further research, in fact, I was thinking that it might easily be that we are missing some very important patterns in chess, because after all, AlphaZero is so strong that if it uses those patterns, I suspect they make sense. That is actually also a very interesting and fascinating subject to understand, if maybe our way of learning chess, of improving in chess, is actually quite limited. We can expand it a bit with the help of AlphaZero, of understanding how it sees chess."

