DeepMind, the Alphabet-backed machine learning laboratory behind AI systems that mastered chess, Go, StarCraft 2, Montezuma’s Revenge, and more, believes the board game Diplomacy could motivate a promising new direction in reinforcement learning research. In a paper published on the preprint server Arxiv.org, the firm’s researchers describe an AI system that achieves strong Diplomacy play while showing “consistent improvements” during training.
AI systems have achieved strong competitive play in complex large-scale games like Hex, shogi, and poker, but most of these are two-player zero-sum games, where a player can only win by making another player lose. That does not necessarily reflect the real world; tasks such as routing around traffic congestion, negotiating contracts, and interacting with customers all involve compromise and taking into account how group members’ preferences coincide and conflict. Even when AI agents are self-interested, they can benefit from coordinating and cooperating, so interaction among diverse groups requires complex reasoning about others’ goals and motivations.
Diplomacy forces these interactions by having seven players each control multiple units on a map of Europe divided into provinces. Each turn, all players move all of their units simultaneously, and one unit can support another unit, belonging to the same player or to another, to help it overcome the resistance of opposing units. (Alternatively, units, which all have equal strength, can hold a province or move to an adjacent one.) Thirty-four of the provinces are supply centers, and units capture supply centers by occupying them. Holding more supply centers allows a player to build more units, and the game is won by owning a majority of supply centers.
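The core combat rule can be sketched in a few lines: every unit has strength 1, each support adds 1, and a province is taken only by a strictly stronger force. This toy resolver is purely illustrative; real Diplomacy adjudication also handles cut supports, convoys, and retreats.

```python
# Toy sketch of Diplomacy's core conflict rule: units have equal
# strength, and each supporting order adds 1 to a contender's
# strength. A strictly stronger force wins; equal top strength is
# a standoff and nobody moves in.

def resolve_province(contenders):
    """contenders: list of (player, n_supports) tuples contesting one
    province. Returns the winning player, or None on a standoff."""
    strengths = sorted(
        ((1 + supports, player) for player, supports in contenders),
        reverse=True)
    if len(strengths) > 1 and strengths[0][0] == strengths[1][0]:
        return None  # equal top strength: standoff
    return strengths[0][1]

# An unsupported attack against an unsupported hold bounces;
# a single supporting unit tips the balance.
print(resolve_province([("France", 0), ("Germany", 0)]))  # None
print(resolve_province([("France", 1), ("Germany", 0)]))  # France
```

This is why coordination dominates play: an army alone almost never dislodges a defended province, so progress depends on lining up supports.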
Because of these interdependencies between units, players must reason about everyone’s movements, not just their own. They have everything to gain by coordinating their moves with those of other players, and they must anticipate how other players will act and reflect those expectations in their own orders.
“We propose using games like Diplomacy to study the emergence and detection of manipulative behavior … to ensure that we know how to mitigate such behavior in real applications,” the co-authors wrote. “Research on Diplomacy could pave the way for creating artificial agents that can cooperate successfully with others, in particular by dealing with the difficult questions that arise around establishing and maintaining trust and alliances.”
DeepMind focused on the “no-press” variant of Diplomacy, in which no explicit communication is allowed. It trained reinforcement learning agents (agents that take actions to maximize a reward) using an approach called Sampled Best Responses (SBR), which copes with the enormous number of actions (on the order of 10⁶⁴) available to Diplomacy players, combined with a policy iteration technique that approximates best responses to the other players’ actions as well as fictitious play.
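The SBR idea can be sketched as follows, with every name and signature here being illustrative rather than DeepMind’s actual API: instead of enumerating the full action space, sample a handful of candidate actions from a base policy, sample the other six players’ actions from their policies, and keep the candidate with the best average estimated value.

```python
import random

# Hypothetical sketch of Sampled Best Response (SBR): with ~10^64
# joint actions, an exact best response is infeasible, so we sample
# candidates and opponent action profiles and compare estimated values.

def sampled_best_response(policy, opponent_policies, value_fn, state,
                          n_candidates=8, n_opponent_samples=16):
    candidates = [policy.sample(state) for _ in range(n_candidates)]
    best_action, best_value = None, float("-inf")
    for action in candidates:
        total = 0.0
        for _ in range(n_opponent_samples):
            # One sampled joint action for the other six players.
            others = [p.sample(state) for p in opponent_policies]
            total += value_fn(state, action, others)
        avg = total / n_opponent_samples
        if avg > best_value:
            best_action, best_value = action, avg
    return best_action

class UniformPolicy:
    def __init__(self, actions):
        self.actions = actions
    def sample(self, state):
        return random.choice(self.actions)

# Toy domain: actions are numbers, and value is how far our action
# exceeds the opponents' average, so the sampled best response tends
# toward the largest available action.
random.seed(0)
pol = UniformPolicy([1, 2, 3])
opps = [UniformPolicy([1, 2, 3]) for _ in range(6)]
v = lambda s, a, others: a - sum(others) / len(others)
print(sampled_best_response(pol, opps, v, state=None))
```

The key trade-off is that more candidate and opponent samples give a closer approximation to the true best response at higher compute cost.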
At each iteration, the DeepMind system creates a dataset of games in which actions are chosen by a module called an improvement operator, which uses the previous policy and value function to find a policy that defeats the previous one. It then trains the policy and value functions to predict the actions the improvement operator would choose, as well as the games’ outcomes.
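That loop can be illustrated with a deliberately tiny, self-contained toy (none of this is DeepMind’s actual code): a greedy improvement operator prefers actions the value function rates highly, its choices form a dataset, and the next policy simply imitates those choices.

```python
import random

# Toy best-response policy-iteration loop: the improvement operator
# generates actions that beat the previous policy, and "training" the
# next policy means imitating the operator's choices. All names and
# mechanics here are illustrative simplifications.

def improvement_operator(prev_policy, value_fn, candidates):
    # Prefer the action the value function rates highest, among
    # samples from the previous policy plus fixed candidates.
    sampled = [prev_policy() for _ in range(4)] + candidates
    return max(sampled, key=value_fn)

def brpi_iteration(prev_policy, value_fn, n_games=50):
    chosen = [improvement_operator(prev_policy, value_fn, [0, 1])
              for _ in range(n_games)]
    # "Train" the new policy by imitating the operator's choices.
    majority = max(set(chosen), key=chosen.count)
    return lambda: majority

random.seed(1)
policy = lambda: random.choice([0, 1])   # uniform initial policy
value = lambda action: float(action)     # action 1 always pays more
for _ in range(3):
    policy = brpi_iteration(policy, value)
print(policy())  # the policy has converged on the higher-value action, 1
```

In the real system the imitation step is a neural-network fit to the operator’s actions and the games’ outcomes rather than a majority vote, but the generate-then-distill structure is the same.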
The aforementioned SBR identifies policies that maximize the expected return for the system’s agents against the opponents’ policies. SBR is coupled with Best Response Policy Iteration (BRPI), a family of algorithms adapted to using SBR in multiplayer games, the most sophisticated of which trains policies to predict only the latest best response and explicitly averages over historical checkpoints to provide data on the current empirical strategy.
To assess performance, DeepMind measured head-to-head win rates against six agents built from different algorithms, with the six opponents drawn independently from a reference population. The researchers also considered “meta-games” between the checkpoints of a training run to test for consistent improvement, and examined the agents’ exploitability, the margin by which a best-responding opponent would defeat a population of agents.
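The exploitability measure can be sketched roughly like this (the helper names are hypothetical and the paper’s estimator is more involved): fix the trained population, pit an exploiter against it repeatedly, and report the margin by which the exploiter’s win rate exceeds an average player’s share, 1/7 in a seven-player game.

```python
# Illustrative exploitability estimate: the margin by which an
# exploiting opponent beats a population of agents. `play_match` is a
# hypothetical callable that plays one game of the exploiter seated
# with six population agents and returns 1 if the exploiter wins.

def exploitability(play_match, exploiter, population, n_games=1000):
    wins = sum(play_match(exploiter, population) for _ in range(n_games))
    return wins / n_games - 1 / 7  # 1/7 = an average player's share

# A population the exploiter always beats has exploitability 6/7;
# one it never beats has exploitability -1/7.
print(exploitability(lambda e, p: 1, "exploiter", []))
print(exploitability(lambda e, p: 0, "exploiter", []))
```

An unexploitable population would sit near zero on this scale: no opponent strategy wins more than its fair share against it.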
The system’s win rates weren’t especially high (averaged over five seeds per game, they ranged from 12.7% to 32.5%), but DeepMind notes that they represent a large improvement over agents trained with supervised learning. Against one particular algorithm, DipNet, in a 6-versus-1 setup where six of the agents were controlled by the DeepMind system, the DeepMind agents’ success rates improved steadily over the course of training.
In future work, the researchers plan to explore ways to reduce the agents’ exploitability and to build agents that reason about others’ incentives, potentially through communication. “Using [reinforcement learning] to improve play in … Diplomacy is a prerequisite for studying the complex mixed-motive and multiplayer aspects of this game … Beyond the direct impact on Diplomacy, possible applications of our methods include commercial, economic, and logistics domains … [By providing the] ability to train a strong tactical agent for Diplomacy or similar games, this work also opens the way to research on agents capable of forming alliances and using more advanced communication abilities, with other machines or with humans.”