Replicating different strategies for the Prisoner's Dilemma

My personal research interest is strategy formation. It's what my current research, my dissertation, and the work I pride myself on the most in science are all about. But the study of strategy formation requires some very strong conceptual grounding in the strategies that are popularly chosen.

I'm early in uncovering how people end up subscribing to particular strategies, with most of my time so far focused on understanding those strategies and replicating previous studies of them.

I'll be blogging about strategies and how to better understand them from a layperson point of view over the next few weeks (months?). Many strategies are easy enough to understand, and therefore easy to execute in any situation. I will write a few different blog posts describing how the tougher-to-understand strategies are designed and how to implement them in as general terms as possible.

This post is about replicating strategies and whether my findings are similar to the ones reported by Stewart and Plotkin in their 2013 PNAS paper.

What is the Prisoner's Dilemma?

[Figure: payoff matrix for the Prisoner's Dilemma]

Simply put, you have two choices and so does your partner. You can either cooperate (stay silent) or defect (testify against your partner). As you can imagine, your partner has the same 2 choices.

There is no communication between partners during this dilemma -- you are being interrogated in separate rooms. Also, you only know what your partner decides once you decide -- so there's no going back on a choice once you make it.

The figure above is a nice payoff matrix of the 4 possible outcomes. If you both choose to cooperate (stay silent), you serve a minimum sentence. We call this the Reward Payoff, or "R". In the following replication, R = 3 points.

If you both choose to defect (testify against each other), you both serve larger sentences. This is the Punishment Payoff, or "P". P = 1 point.

Let's say you choose to defect and your partner chooses to cooperate. Then you would go free and your partner would serve the largest sentence. This is the Temptation Payoff, or "T". T = 5 points.

If the opposite of the T occurs, then you would serve the largest sentence and your partner goes free. This is called the Sucker Payoff, or "S". S = 0 points.

In Prisoner's Dilemma research, we generally refer to the points matrix as {T,R,P,S}, and equate it to the payoffs set by the Prisoner's Dilemma game rules (in many Prisoner's Dilemma games, the {5,3,1,0} payoff matrix is common).
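In code, the payoff rules above boil down to a simple lookup. Here's a minimal Python sketch (the `payoff` name and the 'C'/'D' encoding are my own, not from any library):

```python
# Payoffs from one player's perspective, keyed by
# (my choice, partner's choice); 'C' = cooperate, 'D' = defect.
PAYOFFS = {
    ('C', 'C'): 3,  # R: we both stay silent
    ('D', 'D'): 1,  # P: we both testify
    ('D', 'C'): 5,  # T: I testify, partner stays silent
    ('C', 'D'): 0,  # S: I stay silent, partner testifies
}

def payoff(mine, partner):
    """Return (my points, partner's points) for one dilemma."""
    return PAYOFFS[(mine, partner)], PAYOFFS[(partner, mine)]
```

For example, `payoff('D', 'C')` gives the T/S split: 5 points for the defector, 0 for the cooperator.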

The wiki on Prisoner's Dilemma is pretty good in describing more details about the game. This summary from me should be sufficient to get you through the selected strategies I've replicated.

I chose 5 different strategies to model and played each against 4 different opponent strategies.

You can find a fairly comprehensive list of strategies for the Prisoner's Dilemma here.

The paper I attempted to replicate suggests some strategies (particularly Tit-for-Tat (TFT) and TFT variants) that perform very well. Based on the literature and my understanding of these strategies, I programmed them into a spreadsheet for the world to view, ran each strategy I wanted to analyze in 5 different 20-trial games, and computed the average points per dilemma (one iteration/choice) and the average game score.

TFT (the traditional version)
The TFT strategy is old as dirt and still consistently one of the best Prisoner's Dilemma (or any social dilemma game) strategies in existence. The rules are very simple:

  • Your first move is always to cooperate
  • You copy your partner's last choice
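Those two rules translate directly into code. A minimal Python sketch (the function name and the history-list representation are my own):

```python
def tft(my_history, partner_history):
    """Tit-for-Tat: open with cooperation, then copy the partner."""
    if not partner_history:
        return 'C'              # rule 1: always cooperate first
    return partner_history[-1]  # rule 2: mirror the partner's last choice
```

Each history is just a list of 'C'/'D' choices so far, most recent last.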

The advantage of this strategy (other than the ridiculously easy implementation) is that it inevitably evens out: both players end up with matching T and S outcomes. Because of this, it is considered not only one of the most successful strategies but also one of the fairest, in the sense that you aren't taking advantage of your partner even though you completely know their next choice.

Two disadvantages are predictability and the lack of forgiveness. If you recognize someone is using a TFT stratagem, you can easily take advantage of their predictability. If that person abides 100% by their TFT strategy, you can best them by simply being the first to score a T outcome. At best, you win. At worst, you tie. The lack of forgiveness is also an issue. If you get into a TFT war, where each player keeps defecting because of a mistake or uncertainty, the only way out is for someone to risk cooperation -- and there is no risk of cooperation programmed into the traditional TFT.

The probability of cooperation in terms of the {5,3,1,0} matrix is {1, 1, 0, 0}.

Win-stay, Lose-switch (WSLS)
This strategy was one of the first to "counter" TFT. The rules for this strategy are also very simple:

  • Your first move is cooperate
  • If you encounter a T or R payoff, stay with your previous choice ("win-stay")
  • If you encounter a P or S payoff, switch your choice ("lose-switch")
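In code, those three rules might look like this (a minimal Python sketch; names and the history-list representation are my own):

```python
def wsls(my_history, partner_history):
    """Win-stay, lose-switch: repeat your choice after T or R,
    flip it after P or S."""
    if not my_history:
        return 'C'  # rule 1: open with cooperation
    mine, theirs = my_history[-1], partner_history[-1]
    won = theirs == 'C'  # T (D vs C) and R (C vs C) both count as wins
    if won:
        return mine                     # rule 2: win-stay
    return 'D' if mine == 'C' else 'C'  # rule 3: lose-switch
```

Note that a "win" is simply any turn where your partner cooperated, since those are exactly the T and R outcomes.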

In this paper, Imhof, Fudenberg, & Nowak discuss how WSLS is more forgiving than TFT. If someone mistakenly defects, switching back to a cooperative state takes only one or two turns. This is the opposite of TFT, where if someone defects, your next choice is always defection, never cooperation.

A disadvantage to WSLS is the "lose-switch" portion of the strategy. In the face of an all defection strategy (when a partner only defects), a WSLS stratagem will forever switch between cooperation and defection, giving up unnecessary points to the partner. In the TFT strategy, the lack of forgiveness pays off against all defection because you always defect if your partner defects.

The probability of cooperation in terms of the {5,3,1,0} matrix is {0, 1, 1, 0}. The rules above actually determine this exactly: after a T payoff you defected and stay (never cooperate), after an R payoff you cooperated and stay (always cooperate), after a P payoff you defected and switch (always cooperate), and after an S payoff you cooperated and switch (never cooperate).

The Generous Tit-for-Tat (GTFT)
This is a very simple update to the traditional TFT. The lack of forgiveness definitely plagued the TFT strategy -- especially in terms of partners who can recognize the strategy and exploit it -- so the GTFT implements a simple forgiveness factor:

  • Always cooperate first
  • Choose what your partner chose previously
  • In the case of an S payoff, cooperate X% of the time (in most reports, cooperating 10% of the time is enough)
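A minimal Python sketch of GTFT, assuming the 10% forgiveness rate (names are my own):

```python
import random

def gtft(my_history, partner_history, forgiveness=0.1):
    """Generous TFT: plain TFT, except an S payoff (I cooperated,
    partner defected) is forgiven `forgiveness` of the time."""
    if not partner_history:
        return 'C'  # open with cooperation
    if my_history[-1] == 'C' and partner_history[-1] == 'D':
        # S payoff last turn: forgive with the given probability
        return 'C' if random.random() < forgiveness else 'D'
    return partner_history[-1]  # otherwise copy the partner, as in TFT
```

Setting `forgiveness=0` recovers the traditional TFT exactly.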

As you can imagine, GTFT is best when a partner has a fuzzy strategy. If they see you forgiving at a seemingly random rate, a partner will be more inclined to play into T and R payoffs for you. That's just psychological conditioning -- you are more likely to do a behavior if there exists some sort of positive payoff, less likely if the payoff is not positive, and even less likely when the negative payoff is predictable. In the GTFT stratagem, the "generous" cooperation rate can be seen as unpredictably positive.

The disadvantage arises when the partner has a static strategy or is impervious to the choices you are making. GTFT versus TFT, you will end up generously giving away points whereas the traditional strategy will not. If a partner is impervious to previous choices, or is seemingly choosing at random, you may also end up generously giving away points.

The probability of cooperation in terms of the {5,3,1,0} matrix is {1, 1, 0, 0.1}.

Zero determinant GTFT (ZD-GTFT; for ZD strategists, this is phi = 1.11, chi = 1 )
I will have to describe what a zero determinant strategy is in more detail in a separate post because 1) they are absolutely cool and 2) they require some more math definitions in order to come up with the cooperation rates. I'd rather not explain them here, but I would like to mention that another post specifically on ZD strategies is coming.

The rules for this strategy are almost identical to the GTFT except for one rule:

  • Always cooperate first
  • Choose what your partner chose previously
  • In the case of an S payoff, cooperate X% of the time (in most reports, cooperating 10% of the time is enough)
  • In the case of a T payoff, defect X% of the time (in the perfect ZD-GTFT, you would defect 10% of the time if you are cooperating 10% of the time for S payoffs)
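In code, it is GTFT with one extra branch for the T payoff (a minimal Python sketch, assuming 10% rates; names are my own):

```python
import random

def zd_gtft(my_history, partner_history):
    """ZD-GTFT sketch: TFT plus 10% forgiveness after an S payoff
    and 10% defection after a T payoff."""
    if not my_history:
        return 'C'  # open with cooperation
    mine, theirs = my_history[-1], partner_history[-1]
    if mine == 'C' and theirs == 'D':
        # S payoff: forgive 10% of the time
        return 'C' if random.random() < 0.1 else 'D'
    if mine == 'D' and theirs == 'C':
        # T payoff: "take some off the top" and defect 10% of the time
        return 'D' if random.random() < 0.1 else 'C'
    return theirs  # after R or P, copy the partner as in TFT
```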

This strategy compensates you for your generosity by being more extortionate on the temptation payoff. For example, say you encounter an S payoff and defect next turn. Your partner chooses to cooperate and you end up with a T payoff. TFT would say to always return to cooperation, with the psychological hope that you are headed back toward more cooperative choices. ZD-GTFT says you can exploit the potentially cooperative environment by "taking some off the top" and defecting again one out of 10 times. On the flip side, you are also generously giving away points at the same rate when the tables are turned. The difference is that, head to head, ZD-GTFT will beat GTFT, because GTFT never does anything but cooperate after its partner cooperates.

The disadvantages are very similar to the GTFT disadvantages -- since the strategy is essentially the same. One minor psychological disadvantage is introducing uncertainty into the T payoff. Your T payoff is your partner's S payoff. If your partner is wise to the TFT strategy, they would understand that an S payoff for them should guarantee a cooperation from you next turn, making their positive gain next turn either an R or a T payoff. In the ZD-GTFT, however, your choice following your partner's S payoff becomes uncertain. In most undefined strategies I've witnessed, uncertainty is usually met with more defection than cooperation. The rebuttal to that argument is, again, psychological conditioning with a positive but uncertain payoff -- since the rate of cooperation is so high, the 10% defection might be negligible or even unnoticeable.

The probability of cooperation in terms of the {5,3,1,0} matrix is {0.9, 1, 0, 0.1}.

Extortion rate 3 (EXTORT-3; for ZD strategists, this is phi at its upper limit, chi = 3)
In the world of ZD strategies, the strategy that is most talked about is the extortion strategy. I'll discuss this in another post in more detail but in layperson terms, the less you cooperate, the more points you can earn. However, again preying upon the idea of conditioning, you need to give your partner a reason to cooperate -- and in the end, be extorted.

The rules are the most complicated to execute in an actual social situation, but here is my calculation and interpretation of EXTORT-3:

  • Always cooperate first
  • If you encounter a P or S payoff, defect
  • If you encounter an R payoff, cooperate 62.9% of the time
  • If you encounter a T payoff, cooperate 53.8% of the time
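Those rules can be sketched directly from the cooperation rates above (a minimal Python sketch of my interpretation; names are my own):

```python
import random

def extort3(my_history, partner_history):
    """EXTORT-3 sketch: always defect after P or S, cooperate 62.9%
    of the time after R and 53.8% of the time after T."""
    if not my_history:
        return 'C'  # open with cooperation
    mine, theirs = my_history[-1], partner_history[-1]
    if theirs == 'D':
        return 'D'  # last payoff was P or S: always defect
    if mine == 'C':
        # last payoff was R
        return 'C' if random.random() < 0.629 else 'D'
    # last payoff was T
    return 'C' if random.random() < 0.538 else 'D'
```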

This can be viewed as an extension beyond the ZD-GTFT. The advantage of ZD-GTFT is preying upon the cooperation of a partner. EXTORT-3 does just that, meeting any possible partner cooperation with defection over a third of the time.

In the famous Press & Dyson paper introducing zero determinant strategies, this strategy is somewhat high in terms of the extortion factor (the larger the extortion factor, the less cooperative). A counter to something like EXTORT-3 is to continuously defect against the extortionist, with the hope that this brings their extortion rate down and subsequently increases their cooperation rate. However, that would only happen if you are ahead and defecting -- and it is quite hard to get ahead in points against an extortion strategy. On the other hand, if you are also playing an extortion strategy with a greater extortion factor, you are predicted to win on pure probabilities. This assumes your partner doesn't play "get ahead, then defect". If that happens, you can't win. It's like a one-move checkmate.

The probability of cooperation in terms of the {5,3,1,0} matrix is {0.538, 0.629, 0, 0}.

All defection (ALLD)
The rules for ALLD are probably exactly what you imagine:

  • Always defect

This strategy generally comes into play if a player defects initially and finds themselves in a position where they would rather keep what they have (particularly in a T payoff) than risk gaining any more.

This strategy generally only ever works in that specific situation. In games where there is a global variable of competing not just against the one opponent but against all other players, ALLD fails because you are up against strategies with greater payoffs. In a situation where the points in the Prisoner's Dilemma are equivalent to actual money (e.g. $5, $3, $1, $0), ALLD is considered by some to be the least rewarding, since the certainty of leaving with at most $5 takes priority over potentially leaving with more at greater risk.

All cooperation (ALLC)
The opposite of ALLD:

  • Always cooperate

This strategy ends up being chosen if people believe themselves to be on a team as opposed to competing against each other. That particular notion is usually preceded by focusing on the idea of a "global" competition (e.g. it pays off to cooperate the most, and I want to beat the most people, ergo always cooperate). Another reason this strategy is sometimes chosen is when the initial choice yields an R payoff. Sometimes people become fixated on the idea that R is the most common outcome, and therefore choose cooperation to increase that opportunity.

ALLC again fails against almost all other strategies, just like ALLD. Any single defection against ALLC guarantees a win, regardless of the number of iterations.

70% defection (70%D)
This isn't a strategy, but rather the rate of defection reported in most Prisoner's Dilemma papers. I use this "strategy" (for any given trial, defect 70% of the time) to see how the other strategies pay out against it.

50% defection (50%D)
This is again not a strategy, but a rate of defection that is sometimes reported. I've seen it reported more often for social dilemma games where the risk is not as high as in the Prisoner's Dilemma (e.g. Snowdrift). Similar to 70%D: for any given trial, defect 50% of the time.

Did it replicate?

My design was a lot more crude than Stewart and Plotkin's design. I take TFT, GTFT, WSLS, ZD-GTFT, and EXTORT-3 (in the spreadsheet it is coded as "ZD3") and match them against ALLD, ALLC, 70%D, and 50%D.
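The tournament itself can be sketched compactly by treating each strategy as its {T,R,P,S} cooperation vector and the opponent as a fixed defection rate (a minimal, self-contained Python sketch of my 5-games-of-20-trials setup; the helper names are my own, and the opponent is simplified to pure chance):

```python
import random

# Payoffs from my perspective, keyed by (my choice, partner's choice)
PAYOFF = {('C', 'C'): 3, ('D', 'D'): 1, ('D', 'C'): 5, ('C', 'D'): 0}
# Index of the previous outcome in {T, R, P, S} order
OUTCOME = {('D', 'C'): 0, ('C', 'C'): 1, ('D', 'D'): 2, ('C', 'D'): 3}

def play(p_vec, opponent_defect_rate, games=5, trials=20):
    """Average points per dilemma for a memory-one strategy (cooperation
    probabilities in {T, R, P, S} order) against a partner who defects
    at a fixed random rate."""
    total = 0
    for _ in range(games):
        mine = theirs = None
        for _ in range(trials):
            if mine is None:
                my_move = 'C'  # every strategy here opens with cooperation
            else:
                p = p_vec[OUTCOME[(mine, theirs)]]
                my_move = 'C' if random.random() < p else 'D'
            their_move = 'D' if random.random() < opponent_defect_rate else 'C'
            total += PAYOFF[(my_move, their_move)]
            mine, theirs = my_move, their_move
    return total / (games * trials)

# The {T, R, P, S} cooperation vectors from the strategy sections above:
TFT     = (1, 1, 0, 0)
GTFT    = (1, 1, 0, 0.1)
ZD_GTFT = (0.9, 1, 0, 0.1)
EXTORT3 = (0.538, 0.629, 0, 0)
```

For example, `play(TFT, 1.0)` pits TFT against ALLD, and `play(EXTORT3, 0.7)` pits EXTORT-3 against 70%D.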

By their results, I should see ZD-GTFT with the highest average score, followed by GTFT, TFT, WSLS, and EXTORT-3 (they use EXTORT-2, but that just means a greater rate of cooperation).

By average payoff per iteration, here are the rankings:

  1. EXTORT-3
  2. TFT
  3. ZD-GTFT
  4. GTFT
  5. WSLS

We do see WSLS ranking lowest in points per iteration, with the TFT variants in the middle and the extortion strategy at the top (albeit a different extortion strategy from the one Stewart & Plotkin report).

Two things to note in this crude test:

1) The opponent strategies (ALLC, ALLD, 70%D, 50%D) are not "evolving" or "evolutionary" in the sense defined by Stewart & Plotkin and by Press & Dyson. A non-evolutionary player cooperates only by chance, whereas an evolutionary player is looking to optimize their score through some strategy and is thus more inclined to cooperate (due to either chance or strategic risk). This makes sense here, since this replication is based on completely randomly generated numbers (aside from ALLC/ALLD). My model has no response to cooperative choices in the context of other cooperative or defection choices.

2) The modeled strategies also lack "Theory of Mind". A player with Theory of Mind is a player who "catches on" to a strategy and counters it -- most simply with a temporary ALLD strategy. This pushes a player using an extortion strategy to cooperate more (i.e. decrease their extortion factor), whereas it would make players using other strategies either defect more or risk cooperation to "get back on track". In all scenarios, a fairer game is played against someone with Theory of Mind. When two players with Theory of Mind play each other, Press & Dyson refer to this as simply an Ultimatum Game, where one player proposes a payoff and the other can accept it by playing into their strategy, or deny it and receive "nothing" (generally a P payoff).

Since these modeled strategies are neither evolutionary nor Theory of Mind, the general rule is simply to defect, and the strategy with the most defection will win.

The strategy with the most built-in defection is EXTORT-3, as it defects after a T or R payoff at least a third of the time, while all the other strategies cooperate after T or R payoffs. The most forgiving strategy is WSLS, where a P or S payoff can be followed by either cooperation or defection depending on the previous outcome, whereas in the other strategies a P or S payoff is usually followed by defection.

If we look at how these strategies performed against ALLD, you will see that WSLS performs the worst. If we look at these strategies against ALLC, EXTORT-3 performs the best. This small example is an indication of the advantage defection has against cooperation if the opponent doesn't abide by a strategy or is impervious to previous choices.

As both Press & Dyson and Stewart & Plotkin suggest, extortion strategies are most successful when the opponent doesn't know you are playing an extortion strategy and is also trying to maximize their score by offering cooperation (I call this "cooperation-baiting"). TFT's consistency at never being in a position to lose more points than necessary allows it to advance over the more "generous" strategies.

The one thing I didn't think would happen was ZD-GTFT performing worse than TFT. It outperforms GTFT -- which is logical, because it gains points where GTFT does not -- but generally, zero determinant strategies should outperform cooperative or deceptive strategies. TFT is inherently a ZD strategy (I'll explain this in another post), but it is considered "the most fair ZD strategy" by Press & Dyson, where "fair" equals an extortion factor of 1. The ZD-GTFT also has an extortion factor of 1, but its limiter is less than TFT's (I'll explain limiters in another post; essentially they work in conjunction with the extortion factor: the greater the limit, the greater the rate of defection). Since the extortion factors are the same, the limiters take precedence, and sure enough TFT beats out ZD-GTFT under the current conditions.

TL;DR (this is also kind of long, but I think it's pretty precise)

There are a lot of different strategies we can choose when playing social dilemma games, or in an actual social dilemma with iterative processes. Among the most famous strategies I've outlined, the extortion strategy is clearly the better strategy if your opponent is unaware that a strategy is being implemented and is also unmoved by the outcomes of your choices. The Tit-for-Tat strategy could also be considered "the best" for the mere fact that it is the easiest to remember, the easiest to implement, and pays out the second-highest number of points.

However, most people will generally be trying to 1) figure out your strategy, 2) modify their strategy to beat your strategy, and 3) will be influenced by previous choices. So this particular "replication" (really, just playing with given numbers and formulas) isn't necessarily the most comprehensive in determining how actual people implement these strategies.

A real-life example relates to group work and freeloaders. In a class, you are assigned a group project. Let's say someone in your group chooses not to work (i.e. they freeload; they defect). That doesn't stop you from continuing to work on the group project, because if you also defect, you both fail the project. So it is in your best interest to continue the project regardless of outcome (because any work is better than no work). Now, this freeloader could essentially be running a giant extortion strategy, and you just so happen to be the 1 out of 3 people this person decided to defect against. From the extortionist's point of view, this is a "win". From a social point of view, that person is an asshole.

Whether there is value in implementing optimal social dilemma game strategies in actual social dilemmas has yet to be seen. It is my personal opinion that there is value, for the mere fact that strategy formation requires context-specific knowledge and decision-making skills, which play into learning, memory, and cognition. So the exercise of strategy formation is perhaps beneficial (e.g. examining all outcomes to form a reasonable opinion), even if the decision based on the strategy may not be.

--

This post is one in a series of posts about strategy formation and zero determinant strategies. Follow the strategy formation tag on this site if you want to see more on this kind of stuff!
