Poker's Artificial Intelligence Moment
In 2019, a team of students from Carnegie Mellon and researchers from Facebook AI collaborated to create an AI poker program named Pluribus.
Pluribus differs from its poker AI predecessors because it has learned how to play, and win, against multiple players. Previously, AI had shown superiority only in two-player games: chess, Go, two-player Texas Hold’em, et cetera.
Consistently winning at poker against multiple people, however, was a much greater challenge. The reason is clear: poker is a game with a great deal of hidden information. Bluffing and not knowing your opponents’ cards are features of the game that an AI finds difficult to analyze and to counter with an adequate strategy.
After learning six-player no-limit Texas Hold’em by playing against five copies of itself, Pluribus was tested in two formats. In the first, Pluribus played against five professional poker players. In the second, one professional poker player played against five copies of Pluribus. The research showed that over the course of 10,000 hands, Pluribus performed significantly better than its human counterparts.
The science behind two-player AI
AI has achieved mastery in games such as checkers, chess, and Go. These are all two-player zero-sum games, which means that whatever one player wins, the other loses. In each of these games, the AI uses an algorithm that approximates a Nash equilibrium strategy.
A Nash equilibrium is “a list of strategies, one for each player, in which no player can improve by deviating to a different strategy.” For example, in rock-paper-scissors, the Nash equilibrium strategy is to pick each of the three choices with equal probability. In two-player zero-sum games, playing a Nash equilibrium strategy makes it impossible for the player to lose in expectation, regardless of what the opponent does. The best the opponent can achieve is a tie in expectation, and the same holds for the player if the opponent also plays the equilibrium.
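The rock-paper-scissors claim can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not anything taken from Pluribus: the payoff table, the function name expected_payoff, and the sample opponent strategies are assumptions chosen for illustration. It computes the expected payoff of the equal-probability strategy against several opponents.

```python
# Payoff to the row player: +1 for a win, 0 for a tie, -1 for a loss.
# Action order for rows and columns: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper    vs rock, paper, scissors
    [-1, 1, 0],   # scissors vs rock, paper, scissors
]

def expected_payoff(player, opponent):
    """Expected payoff to `player` when both sides use mixed strategies."""
    return sum(
        player[i] * opponent[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )

UNIFORM = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical opponent strategies, chosen only for illustration.
opponents = {
    "always rock": [1.0, 0.0, 0.0],
    "mostly paper": [0.1, 0.8, 0.1],
    "uniform": UNIFORM,
}

for name, strategy in opponents.items():
    # Each value is zero up to floating-point rounding.
    print(name, round(expected_payoff(UNIFORM, strategy), 10))
```

Against every opponent the expected payoff comes out to zero, up to rounding: the equal-probability strategy cannot lose in expectation, but it cannot win either. A predictable strategy such as always playing rock, by contrast, loses every round to an opponent who always plays paper.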
However, in games involving three or more players, no efficient method is known for calculating a Nash equilibrium strategy, and even playing one exactly may not help. Consider the Lemonade Stand game, where each player picks a point on a circle and tries to be as far away as possible from the other players. The Nash equilibrium is for all players to be equidistant from one another on the circle. However, there are infinitely many ways to achieve this equilibrium, since any rotation of an evenly spaced arrangement is also evenly spaced. So, if each player independently calculates a Nash equilibrium, it is highly unlikely that all players will actually end up spaced uniformly along the circle.
The shortcomings of Nash equilibria outside of two-player zero-sum games suggest that an AI designed to beat players in multiplayer poker should not be built around a specific game-theoretic solution concept. Instead, it should focus on actually beating human opponents, without theoretical guarantees.
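The coordination problem in the Lemonade Stand game can be made concrete with a small simulation. This is a sketch under stated assumptions: the circle has circumference 1, there are four players, and the helper names equilibrium and gaps are hypothetical. Every evenly spaced arrangement is an equilibrium, but the arrangements differ by a rotation (the offset). The code compares players who share one offset with players who each pick their own.

```python
import random

def equilibrium(num_players, offset):
    """One evenly spaced arrangement on a circle of circumference 1."""
    return [(offset + k / num_players) % 1.0 for k in range(num_players)]

def gaps(positions):
    """Arc lengths between neighbouring players, going around the circle."""
    ordered = sorted(positions)
    return [
        (ordered[(k + 1) % len(ordered)] - ordered[k]) % 1.0
        for k in range(len(ordered))
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
num_players = 4         # illustrative choice

# Coordinated: every player uses the same equilibrium (same offset).
shared = equilibrium(num_players, offset=0.1)

# Uncoordinated: player k computes its own equilibrium with its own
# random offset and takes the k-th position from that arrangement.
independent = [
    equilibrium(num_players, rng.random())[k] for k in range(num_players)
]

print("coordinated gaps:  ", [round(g, 3) for g in gaps(shared)])
print("uncoordinated gaps:", [round(g, 3) for g in gaps(independent)])
```

In the coordinated case every gap is a quarter of the circle. In the uncoordinated case each player has followed a valid equilibrium, yet the gaps are uneven, so the joint outcome is not an equilibrium at all.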
So how was Pluribus able to accomplish what no other AI had before within Poker?
Like its predecessors, Pluribus computed the core of its strategy through self-play. This means that it develops its strategy by playing against copies of itself, with no data from humans or prior AI play. The result, referred to as the blueprint strategy, is a strategy for the entire game that is computed offline, before any real opponent is faced.
What is CFR and MCCFR?
Counterfactual regret minimization (CFR) is an iterative self-play algorithm in which "the AI starts from scratch by playing randomly, and gradually improves as it determines which actions, and which probability distribution over those actions, lead to better outcomes against earlier versions of its strategy."
Pluribus uses a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree, rather than traversing the entire game tree on each iteration.
On each iteration of the algorithm, MCCFR designates one player as the “traverser” whose current strategy is updated. At the start of the iteration, MCCFR simulates a hand based on the current strategies of all the players (which are initially completely random).
When the hand is complete, the AI reviews each decision the “traverser” made and determines how much better or worse it would have done by choosing each of the other available actions instead. Next, the AI assesses the decisions that would have followed from those hypothetical actions, and so on.
Pluribus is able to explore these hypothetical outcomes because it is playing against copies of itself, as described above. “If the AI wants to know what would have happened if some other action had been chosen, then it need only ask itself what it would have done in response to that action.”
The difference between what the traverser would have received for choosing a specific action and what the traverser actually received is added to the counterfactual regret for that action. At the end of the iteration, the strategy of the traverser is updated so that actions with higher counterfactual regret are chosen with higher probability.
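The regret update described above can be sketched on a far smaller game than poker. The following Python example is an illustration only, not Pluribus' MCCFR: it applies regret matching, the update rule at the heart of CFR, to rock-paper-scissors, which has a single decision per player and no game tree. Both players act as the traverser on every iteration, and the names strategy_from_regrets and train, the iteration count, and the seed are all assumptions.

```python
import random

ACTIONS = 3  # rock, paper, scissors
# Payoff to the player choosing the row action against the column action.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def strategy_from_regrets(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return [1 / ACTIONS] * ACTIONS  # no regret yet: play uniformly
    return [p / total for p in positive]

def train(iterations, seed=0):
    """Self-play between two regret-matching players; returns average strategies."""
    rng = random.Random(seed)
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        moves = [rng.choices(range(ACTIONS), weights=s)[0] for s in strategies]
        for player in range(2):
            mine, theirs = moves[player], moves[1 - player]
            actual = PAYOFF[mine][theirs]
            for action in range(ACTIONS):
                # What this action would have earned, minus what was earned.
                regrets[player][action] += PAYOFF[action][theirs] - actual
                strategy_sums[player][action] += strategies[player][action]
    return [[s / iterations for s in sums] for sums in strategy_sums]

average = train(100_000)
print([round(p, 3) for p in average[0]])
print([round(p, 3) for p in average[1]])
```

The strategy played on any single iteration can swing sharply, but the average strategy over all iterations is what approaches the equilibrium, which for rock-paper-scissors is one third for each action. Because moves are sampled, the averages printed here are only approximately equal.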
In short, the AI compares the hypothetical options at each decision and shifts its strategy toward the ones it regrets not having taken. During actual play, Pluribus refines the blueprint by searching just a few moves ahead rather than to the end of the game. An in-depth analysis of the algorithm can be found in the paper published in Science.
Pluribus research and its application
The blueprint strategy of Pluribus was trained in eight days on a 64-core server, for a total of 12,400 CPU core hours. It required less than 512 GB of RAM. At cloud computing instance rates at the time, it would cost approximately $144 to produce. This is in sharp contrast to other AI breakthroughs in gaming, which required large numbers of servers and farms of graphics processing units (GPUs).
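These figures can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above. The per-core-hour rate it derives is a back-of-the-envelope inference, not a price quoted by the researchers or by any cloud provider, and the variable names are hypothetical.

```python
# Figures quoted in the article.
days = 8
cores = 64
reported_core_hours = 12_400
reported_cost_usd = 144

# Core hours available on a 64-core server running for exactly eight days.
core_hours_in_eight_days = days * 24 * cores

# Implied price per core hour (an inference, not a quoted rate).
implied_usd_per_core_hour = reported_cost_usd / reported_core_hours

print(core_hours_in_eight_days)
print(round(implied_usd_per_core_hour, 4))
```

Eight full days on 64 cores comes to 12,288 core hours, which is consistent with the reported total of 12,400 if "eight days" is a rounded figure. The implied rate works out to a little over one cent per core hour.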
Such systems can cost millions of dollars to train and run. Pluribus' modest footprint was a deliberate trade-off. According to researchers Noam Brown and Tuomas Sandholm, “more memory and computation would enable a finer-grained blueprint that would lead to better performance but would also result in Pluribus using more memory or being slower during real-time search.”
With its low-cost and effective search strategy, Pluribus is evidence that AI breakthroughs can come from original thinking and limited resources. Pluribus also gives researchers a “fundamental understanding of how to build general AI that can cope with multi-agent environments, both with other AI agents and with humans.”
According to Brown, by showing that a self-play algorithm can succeed in an environment with hidden information and limited communication among players, the research behind Pluribus can be applied to fields such as fraud prevention and cybersecurity.
Written by David Wang
Edited by Jimei Shen & Ryan Cunningham