Flogging the Data: Legacy Deck Choice

1 Introduction
One of the biggest decisions any player faces is choosing a deck. The number of potential decks is legion, but only a minuscule subset of them is actually played. The set of competitively playable decks constitutes the metagame. Finding out which deck is best was the motivation behind this investigation of the current Legacy metagame.

Unfortunately, much of the analysis available online lacks a solid research design and empirical basis. Even when statistics or top-eight finishes are put forward as evidence of a deck's strength, the conclusions drawn can be misleading because a complex set of interactions often goes unaccounted for. For example, the prevailing presumption is that decks that make it to the top of the tournament leaderboard are better than the rest; this interpretation is fundamentally flawed if one does not account for other explanatory factors. One simple explanation for a deck's top-eight finish might be its relative prevalence, i.e. simple odds. But there are other explanations, such as the influence that the player, the opponent, and the opponent's deck have on the outcome. Taking alternative explanations systematically into account is necessary in order to get a clear picture of the effect a deck has on performance. To my knowledge, taking these explanations simultaneously into account in a rigorous statistical manner has not been done, but it is a necessary step if we really want to understand what factors drive match outcomes, and in particular how to choose a good deck.

Before we go any further: I received feedback from various readers and the editors that my initial drafts were too technical for a general audience, and on closer inspection I had to concur. So at each step I now try to explain what the technical details mean in intuitive terms, and I have saved the equations and the goriest details for the full paper linked at the bottom of this article. Nevertheless, you'll see that I retain a scientific structure overall; this was a conscious decision precisely because so much of Magic commentary is unstructured or unfounded.

[Image caption: Legacy's most overrated creature.]
2 Research Design
The basic idea is to use regression analysis on Legacy tournament data. Regression analysis establishes a relationship between a variable to be explained (match outcomes) and one or more explanatory variables (decks, players, etc.). More specifically, regression analysis can help us understand how the match outcome changes when any one of the explanatory variables, such as a deck or player, changes whilst the other explanatory variables remain constant. You may even have unwittingly done regression analysis in your high school chemistry or physics class by drawing a line through a cloud of points to estimate the role of temperature and pressure, or some other causal relationship. The idea is the same here, only with more relationships.
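For the curious, here is what that line-through-a-cloud exercise looks like in R, the language the replication files use. The data and variable names below are made up purely for illustration; this is a sketch of the idea, not my actual model:

```r
# A toy regression in R: fit the best line through a cloud of points.
# Each coefficient tells us how the outcome changes when one explanatory
# variable changes whilst the others are held constant.
df <- data.frame(
  outcome  = c(1.2, 2.3, 2.9, 4.1, 5.2),
  temp     = c(10, 20, 30, 40, 50),
  pressure = c(1.0, 1.1, 0.9, 1.2, 1.0)
)
fit_toy <- lm(outcome ~ temp + pressure, data = df)
summary(fit_toy)  # estimated slopes, standard errors, significance
```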

The data analyzed come from Jesse and Alix Hatfield's article series "Too Much Information." The research design employed here is stronger than the Hatfields' in that it incorporates player and opponent effects, which can be thought of as a way of capturing player-specific attributes such as skill, fatigue, or familiarity with the deck and format. The basic unit of observation is the number of games won or lost between two players and their two decks in a best-of-three match.

Regrettably, we cannot exploit the information contained in the deck lists because we do not have them: the Hatfields report only deck archetypes and sub-archetypes. Their implicit assumption, and indeed mine, is that a deck's (sub-)archetype corresponds to a specific set of cards, making it a useful intellectual category.

The dyadic match data are slightly problematic in that one player's wins appear as another's losses and vice versa. This violates the assumption of uncorrelated observations that underpins the statistical approach and, if unaddressed, might invalidate the results. The remedy employed in this paper is to include every explanatory variable twice; that is to say, a person shows up once as a player and once as an opponent. Because we use two degrees of freedom for every explanatory variable, this is essentially like dividing one's sample, which is statistically inefficient: the standard errors of the parameter estimates will be larger. In layman's terms, this means we have less information per explanation offered. The less information we have per explanation, the less likely we are to establish a relationship, but at least if we do find one it should be accurate.
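To make the double-inclusion remedy concrete, here is a sketch of how a single match enters the data twice. The column names are hypothetical stand-ins for whatever the real dataset uses:

```r
# One best-of-three match becomes two mirrored rows: each person appears
# once as the player and once as the opponent, with the outcome flipped.
match  <- data.frame(player = "Alice", opponent = "Bob",
                     deck_player = "Lands", deck_opponent = "NO Bant",
                     points = 2)    # Alice wins 2-0
mirror <- data.frame(player = "Bob", opponent = "Alice",
                     deck_player = "NO Bant", deck_opponent = "Lands",
                     points = -2)   # the same match from Bob's side
matches <- rbind(match, mirror)
```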

One of the assumptions made here, for better or worse, is that both bad and good players choose their decks "randomly." That is to say, players pick a deck that may or may not be optimal for the tournament metagame, and less adept players, like myself, copy the good players. What this means is that the average deck for both good and bad players is the same. If good players picked only good decks, and bad players picked only bad decks, we would not be able to easily disentangle the deck's performance from that of the player. A simple way of testing this assumption was to rank the decks by games won and the players by games won across tournaments, and then check whether the two rankings are in fact uncorrelated, as assumed. As it turns out, there is about a 3% correlation between deck strength and player strength as measured by games won; this is a slight violation of the assumption, but it is consistent with what we know about top players being "in the know" about which decks are hot.
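A minimal sketch of that check in R, with simulated tallies standing in for the real ones (the actual data yielded roughly the 3% figure quoted above):

```r
# Rank decks and players by games won, then correlate the rankings.
# A correlation near zero supports the "random" deck-choice assumption.
set.seed(1)
deck_tally   <- rpois(92, 20)   # simulated games won per archetype
player_tally <- rpois(92, 20)   # simulated games won per pilot
cor(rank(deck_tally), rank(player_tally))  # near 0 under the assumption
```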

One of the unique features of the Swiss tournament system, the standard system used in competitive Magic, is the round pairings whereby winners are paired with winners and losers with losers in each round. Because pairings are correlated with player ability and deck performance, I controlled for any potential effects by including a "round" variable amongst the explanatory variables. I also account for tournament/spatial effects by including a dummy (i.e. yes/no) variable for each of the events. This means those alternative explanations are held constant so that we can concentrate on the partial effect a deck has on the outcome.
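Putting the pieces of this section together, the full specification might look like the following sketch. Simulated data stand in for the Hatfield dataset, and all of the column names are mine:

```r
# Deck, player, and opponent enter twice (once per side of the dyad);
# round and event dummies soak up pairing and tournament effects.
set.seed(2)
n <- 500
d <- data.frame(
  points        = sample(-2:2, n, replace = TRUE),
  deck_player   = factor(sample(paste0("deck", 1:5), n, replace = TRUE)),
  deck_opponent = factor(sample(paste0("deck", 1:5), n, replace = TRUE)),
  player        = factor(sample(paste0("p", 1:50),   n, replace = TRUE)),
  opponent      = factor(sample(paste0("p", 1:50),   n, replace = TRUE)),
  round         = factor(sample(1:6, n, replace = TRUE)),
  event         = factor(sample(1:3, n, replace = TRUE))
)
fit <- lm(points ~ deck_player + deck_opponent + player + opponent +
            round + event, data = d)
```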

3 Data
The Hatfields' data for the top-performing decks currently contain about 4,098 observations covering ten events, 632 different players, and 92 deck archetypes. Some archetypes contain sub-archetypes; these were omitted because their inclusion would tax the data heavily with some 800 variables. Along with constructing the dependent variable (points based on games won, lost, or drawn), a handful of observations had to be dropped or modified so as to make the data econometrically comparable. The points variable ranges between -2 and 2, with 0 being both the mean and the median of the data. The data are also slightly unbalanced, meaning that certain decks or players appear without their conjugate opponent, making the double inclusion of player and deck even more taxing because in some instances we have two explanations but only one observation.
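My reading of the dependent variable's construction, as a sketch (the exact coding is in the full paper):

```r
# Points per match: games won minus games lost in a best of three,
# ranging from -2 (a 0-2 loss) to 2 (a 2-0 sweep), with draws at 0.
points_for <- function(games_won, games_lost) games_won - games_lost
points_for(2, 0)  #  2: a clean sweep
points_for(1, 2)  # -1: a close loss
points_for(1, 1)  #  0: a drawn match
```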

4 Results
This section looks at two things: first, the determinants of performance in general; second, the decks that are statistical outliers.

4.1 Determinants of Performance
The following table presents the results of a statistical technique that decomposes the patterns (variances) in the data to reveal which variables contribute the most to match outcomes as defined in Equation 1 (see the full paper).


Table 1: ANOVA of Games Won per Match

Variable           % Explained
==============================
Round                     0.0%
Event                     0.0%
Deck (player)             3.1%
Deck (opponent)           0.4%
Player                   14.5%
Opponent                 24.4%
Unexplained              57.6%


The percentages in the table above tell us how much of a match outcome we can explain with a given variable. As we can see, a player's deck contributes very little (3.1%) to explaining match outcomes; there is simply not much difference in outcomes across the observed decks. There are a number of possible explanations for this result. It could be that deck "archetype" is not a useful conceptual category. Or it could be that all decks appearing in competitive tournaments are extremely well tuned, implying that you cannot just put 60 cards in a pile and have a decent chance of winning.
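For readers who want to reproduce a table like this, here is a sketch of the decomposition using the simulated fit from the Research Design section (not my exact code):

```r
# Analysis of variance: each term's sum of squares as a share of the
# total tells us how much of the outcome that variable explains.
av  <- anova(fit)  # 'fit' from the sketch in Section 2
pct <- 100 * av[["Sum Sq"]] / sum(av[["Sum Sq"]])
data.frame(term = rownames(av), pct_explained = round(pct, 1))
```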

The round and event variables are statistically insignificant, suggesting that Magic is played and organized in much the same manner across regions: neither the round nor the particular event has a discernible effect on match outcomes. This is precisely what one might expect in an era of "net-decking" and the ubiquitous Wizards Event Reporter software used by event organizers to pair players in tournaments.

The third interesting result from Table 1 is that the players, not the decks, explain much of the outcome. A player's own play skill explains 14.5% of the match outcome. There are countless articles on how important play skill is, and Table 1 supports them. At the same time, whom the player gets paired against explains even more of the outcome than her own play (24.4% vs. 14.5%). Furthermore, Table 1 provides strong evidence against the assertion made by two of Magic's most renowned pros, Patrick Chapin and Jon Finkel, who maintain: "[m]ore Magic games are decided by technical play than all other factors combined" (Chapin & Finkel: 2009-05-14). Table 1 suggests that technical play explains only about 15% of outcomes, whereas all other factors combined (decks, opponents, and the unexplained residual) account for the remaining 85%.

4.2 Effect of Decks
Recall that the goal is to find out how many games I will win by playing a certain deck. By using statistical techniques we can cut through the noise of random outcomes to establish a relationship between playing a certain deck and the number of games won.

Table 2 below shows the results for those decks that have positive (i.e. winning) coefficients that are statistically significant at the 95% confidence level. Statistical significance is important because it tells us we probably have enough information to make a reliable statement, where reliable is defined as being correct 95 times out of 100. Occasionally, the structure of statistical data makes uncovering relationships tricky; hence, I used three different estimation techniques and two different methodologies. As you can see, the maximum likelihood (ML) and ordinary least squares (OLS) techniques yield the same parameter estimates (as they should). That is to say, they agree on how many games would be won per match by playing the deck.
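Here is a sketch of how the three estimations can be set up in R, again on the simulated data from Section 2. This is one plausible arrangement, not necessarily my exact models:

```r
# OLS (identical point estimates to ML for a linear model):
ols <- lm(points ~ deck_player + deck_opponent + player + opponent, data = d)

# REML via a mixed model: players and opponents become random effects.
library(lme4)  # install.packages("lme4") if needed
reml <- lmer(points ~ deck_player + deck_opponent +
               (1 | player) + (1 | opponent), data = d, REML = TRUE)
AIC(ols); AIC(reml)  # lower AIC indicates the better model
```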


Table 2: Wins per Match for Decks with Statistical Significance

Deck                    ML     OLS    REML
==========================================
4-Color Loam           3.68    3.68   2.79
Buried Ooze             -       -     1.97
RWB Blade              2.36    2.36    -
Manaless Dredge        2.42    2.42   1.54
NO Bant                1.83    1.83    -
Lands                  2.23    2.23    -
Cephalid Breakfast     2.66    2.66    -
Blazing Shoal          1.87    1.87    -
------------------------------------------
R2                      -      0.43    -
AIC                   15826   15826  15536


The table above presents only those decks for which we have enough information to make a prediction about how many games the deck will win per match on average. You'll notice several of the values are above 2, which is impossible because 2 is the maximum. This is a statistical artifact and worth looking at a bit more closely.

So let's take the case of 4-Color Loam played at a Legacy tournament.


Micah Greenbaum, the deck's only observed player, beat opponents who were on average better than him and avoided the decks known to trump his (primarily combo decks). These factors most likely contribute to the model's over-prediction. The OLS and ML models tell us that by playing the deck we should win 3.68 games per match, which is obviously impossible because a match has at most two wins. Over- and under-prediction arise because statistical techniques try to find a solution that fits all observations at once. You might have noticed that the REML model predicts only 2.79 games per match, and hence it might be better at dealing with this kind of data. My suspicion is that, with more data for some of the rarer decks, this type of over-prediction will disappear. Mr. Greenbaum's specific case also suggests tournament outcomes are highly contingent on round pairings.

The third column of Table 2 shows the result of a mixed model, which allows player and opponent skill levels to vary and then estimates the effects of the decks. This model is slightly better, as evidenced by its lower Akaike information criterion (AIC), a statistical measure of model quality. More importantly from a player's standpoint, it turns up another high-performing and statistically significant deck (Buried Ooze). But the REML model here is akin to demeaning the data, which is good at purging correlations between deck and player but might be a tad too aggressive for our purposes, because we toss out some of the information from other games to get a better prediction of the games between any two players. Still, it gives us an indication that the other estimates might be too high.

One way statisticians check whether they are on the right track is to look at the cases where the model yields bad predictions: these are the residuals, or unexplained cases. If there is some underlying pattern in the unexplained game outcomes, it's a clue that some part of the explanation might be missing. Fortunately, the residuals for the OLS model are as close to textbook as they generally get in empirical work, which is a decent indication that no major explanatory variable is missing from this simple deck/player model.
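The diagnostic itself is simple enough to sketch, using the OLS fit from the sketch above:

```r
# Residual checks: a patternless cloud around zero and a roughly
# straight Q-Q line suggest no major explanatory variable is missing.
res <- residuals(ols)
plot(fitted(ols), res, xlab = "Fitted values", ylab = "Residuals")
qqnorm(res); qqline(res)
```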

Based on this evidence, I picked Micah Greenbaum's 4-Color Loam deck, with which he took first place. The deck racks up card advantage with Life from the Loam and Dark Confidant whilst pressuring the opponent's life total with Tarmogoyf and Knight of the Reliquary.

[Image caption: Card advantage and deck disclosure device]
However, before plonking down cash for a playset of Mox Diamonds, I ran a robustness check on my deck selection by imputing player "abilities" from DCI Eternal ratings, which brings us to the next section.

5 Application: My Predicted Performance
Statistical models have a handy feature in that one can insert player and opponent parameters along with deck archetypes to obtain a predicted match outcome. However, the DCI measures player performance using a variant of the Elo rating system, as in chess. Thus, in order to translate between the DCI's estimate of player ability and my model's, I needed another statistical model.

So I queried the DCI rating database for the top players, whose ratings should be stable. I then regressed the players' parameter estimates on the natural logarithm of those sampled DCI ratings, which revealed that a 16% increase in a player's rating corresponds to an extra game won per match.

What is nice to know is that Mr. Greenbaum faced a tough field on his way to first place in Baltimore: using my model, his opponents were on average predicted to win about 0.6 games. I then asked whether I had a shot at qualifying for the Grand Prix in Amsterdam with Mr. Greenbaum's concoction despite my mediocre playing skills. Substituting the parameter estimates for my ability into the model's equation, I came up with a predicted 2.05 games per match. It would have seemed that I should have had at least a better-than-average statistical chance...
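The translation step looks roughly like this in R. Every number below is made up for illustration; the real ratings came from the DCI database:

```r
# Bridge model: regress the match model's player-ability estimates on
# log DCI ratings, then map any rating to a predicted ability.
ability <- c(0.9, 0.6, 0.4, 0.2, 0.1)        # illustrative player effects
dci     <- c(2100, 1990, 1900, 1850, 1800)   # illustrative Elo ratings
bridge  <- lm(ability ~ log(dci))
my_ability <- predict(bridge, newdata = data.frame(dci = 1750))
# Substituting my_ability and the deck's coefficient into the match
# equation yields the predicted games-won figure quoted above.
```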

5.1 Tournament Performance
My model predicted that I would win 2.05 games on average; in fact, I averaged a dismal 0.6 games won per match. Over the course of three recent Legacy events, it became abundantly obvious that:
  1. I didn't know how to play this deck in particular (the Maze of Ith it contains was apparently intended for my own Knight of the Reliquary).
  2. I don't play well in general (e.g. tapping the wrong lands in certain plays, poor combat decisions, etc.).

The anecdote of my poor performance reinforces the interpretation of Table 1 above, which indicates that a deck explains only about 3% of the outcome. In my experience, it was my playing that seemed to explain most of the match outcomes rather than the deck per se.

6 Future Work
While Magic is inherently and by construction partially random, my intuition as a player and econometrician leads me to believe that including additional explanatory variables would improve the performance prediction. My results also qualify the Hatfields' assumption that archetypes matter, and they reject the Chapin-Finkel lemma that one's own play ability is the most important determinant of match outcomes.

The fact that my performance did not match the prediction, along with the fact that the model does not explain all the variance, indicates that further investigation is in order. Based on my tournament experiences and some general knowledge of the game, I believe the following extensions would yield more reliable results:
  • Deck difficulty: Certain decks are easier to play than others. One way of accounting for this effect might be an interaction term between player ability and the deck parameter, or some explicit measure of a deck's difficulty (see the sketch after this list).
  • Opponent's deck: The model presented above is an unconditional analysis, i.e. it assumes one deck is better on average than all others. In Greenbaum's post-victory interview, he mentioned that he did not face a single combo deck amongst his opponents, whereas I faced three in my first tournament. Parsing individual match-ups might tell us how much of a difference round pairings make in final tournament standings.
  • The sideboard: Since players exchange up to 15 cards after the first game, and those cards tend to affect the game state in a larger way (depending on the opposing deck), it may be worth classifying the sideboard just as carefully as the deck, possibly even more so. In the tournament rounds I played, some of my sideboard cards (Choke, Pulverize, Leyline of the Void) could essentially win the game against certain strategies.
  • Who plays first: My guess is that this has a tiny, but possibly significant, effect.
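As promised in the first bullet, here is a sketch of the deck-difficulty extension, building on the simulated data from Section 2. The skill measure is simulated; only the interaction structure matters:

```r
# Interacting player ability with the deck lets a deck's payoff depend
# on who is piloting it: one way to capture "deck difficulty".
set.seed(3)
d2  <- transform(d, ability = rnorm(nrow(d)))  # stand-in skill measure
ext <- lm(points ~ deck_player * ability + deck_opponent, data = d2)
summary(ext)  # interaction terms: deck payoff varying with skill
```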

The PDF version of the article with footnotes, references, equations, complete tables, and basic tournament report is available here. The replicable datasets are available here (requires R and Zelig).
