Flogging the Data: Sizing up the Legacy Metagame

1 Introduction
You've all seen them. You may scour them. You may even write them. But should you really bother with them? Ah, metagame reports: a staple of the competitive Magic community, but are they really worth reading, let alone compiling?
This article is essentially a Legacy metagame report broadly along the lines of a typical tournament metagame report. While I cannot promise that this article is light reading, I have endeavored to make my inferential statistics both rigorous and comprehensible whilst sparing you most of the arcana (which is available in the full PDF version here).

A typical metagame report contains essentially three elements: 1. a breakdown of the most prevalent decks; 2. performance estimates (usually win %s); 3. deck performances in specific match-ups. The implicit assumption is that this will help one prepare for the next tournament by knowing which decks are going to predominate, and how to tune one's deck to deal with its worst rivals.
In contrast to my last article, where I looked at decks that were out-performers, this article looks at the usual suspects in Legacy. In doing so it also ties up some of the empirical loose ends of that last piece whilst laying the empirical groundwork for an attack on solving for a Legacy metagame equilibrium. It also develops a more robust methodology for analyzing deck match-ups than is seen in most metagame reports. Lastly, it poses a challenge to the community to come up with a better way of classifying decks.

2 Research Design & Data
The basic design here relies on regression analysis of Legacy tournament data. Recall that regression analysis establishes a relationship between a variable to be explained (match outcomes) and one or more explanatory variables (decks, players, etc.). More specifically, regression analysis can help us understand how the match outcome changes when any one of the explanatory variables, such as a deck or player, changes while the other explanatory variables remain constant. It lets us partial out player and opponent effects, which can be thought of as a way of incorporating player-specific attributes, such as skill, fatigue, or the player's familiarity with the deck and format. This is essential for a clean analysis of the decks themselves, but it's a consideration that is woefully lacking in most post-tournament analysis. The research design employed here is also stronger than my last design in several ways: 1. it incorporates more data; 2. it models the results in a more appropriate way; 3. it tests the sample properties.

2.1 Data
The data analyzed still come from Jesse and Alix Hatfield's article series "Too Much Information", but there is a massive improvement in the data used in this paper. First, there are about 26,000 observations rather than the 4,100 or so used in the last paper, which is a huge sample. Second, some of those tournaments were held in the same location, which draws in players who have played in a previous tournament; this gives us better information, which leads to better identification of the deck attributes. Third, the dataset is now balanced, meaning we have equal numbers of opponents and players. The deeper and balanced sample has essentially ruled out any (postulated) bias that may have been induced by the sample truncation in the last paper.
The data were cleaned a bit as usual so as to make them econometrically comparable. For example, players occasionally appear with or without their middle initial, or had their names misspelled at registration; the basic assumption made was that these were repeat players who had been misrecorded. In sum, the data boast about 218 decks and 2741 players over the course of 15 different tournaments. The most common decks, collectively representing about 61% of the tournament field, are presented in Figure 1 below.
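
For the curious, the middle-initial part of that clean-up can be sketched in a few lines of Python. This is only an illustration of the idea, not the script actually used: the function name and the sample names are hypothetical, and misspellings would still need fuzzy matching or a manual pass.

```python
import re

def normalize_name(raw: str) -> str:
    """Collapse case, punctuation and middle initials so repeat players match."""
    # Lower-case, then turn anything that is not a letter, space,
    # apostrophe or hyphen into a space.
    cleaned = re.sub(r"[^a-z\s'-]", " ", raw.lower())
    parts = cleaned.split()
    # Drop single-letter tokens sitting between the first and last name.
    if len(parts) > 2:
        middle = [p for p in parts[1:-1] if len(p) > 1]
        parts = [parts[0]] + middle + [parts[-1]]
    return " ".join(parts)

# Hypothetical registration entries for what is assumed to be one player.
names = ["John Q. Public", "john public", "John  Public"]
canonical = {normalize_name(n) for n in names}
```

All three spellings collapse to a single key, so the player counts once in the sample rather than three times.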


Figure 1: Legacy commentators' common observation about metagame diversity is fair.

2.2 Identification
Because our goal is to identify the effect that a deck has on match outcome, we have to somehow figure out a way to disentangle the player effects from the deck effects. In the approach used here, it has been assumed that players have a constant ability and skill level throughout the tournament history (approximately April through September 2011). Thus, other things being equal, such as a player's opponent or her deck, whenever we observe the same player playing different decks, and the outcomes differ, we can attribute this difference to the deck. Conversely, when we observe two different players using the same deck, we can infer the players' "abilities".
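
A quick way to gauge how well a deck is identified under this strategy is to count the players who have been seen on that deck and on at least one other. The Python sketch below does just that; the entries and the function name are hypothetical, and this count is my own rough proxy rather than the measure used in the full paper.

```python
from collections import defaultdict

# Hypothetical (player, deck) tournament entries; not taken from the dataset.
entries = [
    ("alice", "Merfolk"), ("alice", "Zoo"),
    ("bob", "Merfolk"), ("bob", "Merfolk"),
    ("carol", "Elves"), ("carol", "Merfolk"),
    ("dave", "Elves"),
]

def switchers_per_deck(entries):
    """Count, per deck, the players who were also seen on a different deck."""
    decks_by_player = defaultdict(set)
    for player, deck in entries:
        decks_by_player[player].add(deck)
    counts = defaultdict(int)
    for decks in decks_by_player.values():
        if len(decks) > 1:  # only switchers help separate deck from player
            for deck in decks:
                counts[deck] += 1
    return dict(counts)

identification = switchers_per_deck(entries)
```

In this toy sample Merfolk is backed by two switchers while Zoo and Elves have one each; bob and dave contribute nothing to separating deck from player because each has only ever been seen on one deck.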

This strategy obviously has certain weaknesses. The first is that players who appear in several tournaments and play different decks are relatively rare. This requires a massive sample in order to get reasonable estimates. Weak identification was already suspected of driving part of the results in my last paper, where I found 4-Color Loam to be a statistical out-performer, but had only one player, Micah Greenbaum, who had played in two tournaments with two different decks.

The second weakness is presuming that a participant's ability stays constant; all players have ups and downs over time. The term "ability" is a bit of a misnomer. What is actually captured are all player attributes at the time of the tournament. So it assumes a player shows up with the same blood-alcohol level and amount of sleep to every tournament. These facts do not necessarily preclude the analysis, but they do introduce more statistical noise. This is where the greatly expanded sample has really helped in that it gives us more confidence about the estimated parameters. Through better identification in an expanded sample, 4-Color Loam's outperformance, for example, has since disappeared.


Figure 2: Merfolk is fairly strongly identified whereas Elves is probably at the lower limit.

2.3 Modeling Magic Matches
One of the challenges with empirically evaluating Magic matches is finding the most appropriate model. The last Flogging the Data employed some of the simpler models, which are often a good first approximation, but have their weaknesses. To allay some of the concerns expressed, a few alternative models are presented here, which turn out to have their own foibles. The key feature of the analysis presented here is that the player, opponent, and opposing deck are controlled for, which means we can lay bare a deck's actual performance, and the error of such analysis is explicitly accounted for.

2.3.1 Linear Models
The simpler linear techniques assume that the dependent variable is continuous over a fair range of values, but this is not the case here because our match outcome variable here only takes on 5 values: -2, -1, 0, 1, 2. 0 points is by construction the mean and median for the data.

The first reason for converting games into match points is that there is some information in the fact that a deck won the match two games to nil vs. two games and one loss. But because matches are best two of three we are still missing the information about those decks that would have gone win-win-loss or win-win-win had the third game actually been played. This is one of the flaws in the current tournament system as far as data-miners are concerned. Just because win-win-loss is never observed in the current system does not mean that it is not equivalent to loss-win-win and win-loss-win. Not only does it truncate the match in an arbitrary manner yielding less efficient round pairings, it precludes particular sideboard and deck strategies that would rely on having two post-boarded games for certain.
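
As a minimal sketch of the conversion, assume match points are simply games won minus games lost, and that each match is recorded once from each seat (my reading of what "balanced" means for this dataset). The function name and the sample results below are hypothetical.

```python
from statistics import mean, median

def match_points(games_won: int, games_lost: int) -> int:
    """Game differential in a best-of-three match, from the player's side."""
    if not (0 <= games_won <= 2 and 0 <= games_lost <= 2):
        raise ValueError("best-of-three allows at most two wins per side")
    return games_won - games_lost

# Hypothetical results as (games won, games lost) for the listed player.
results = [(2, 0), (2, 1), (1, 2), (1, 1), (0, 2)]

# Assumed balancing step: record every match from both seats.
points = []
for won, lost in results:
    points.append(match_points(won, lost))
    points.append(match_points(lost, won))

average_points = mean(points)
middle_points = median(points)
```

Because every +2 on one side of the table is a -2 on the other, both the mean and the median of the mirrored sample come out at zero.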

The baseline technique, ordinary least squares (OLS), incorporates effects from the player, opponent, the player's deck, and her opponent's deck. The dependent variable is the match points from above.

Another approach to dealing with this type of problem of limited values, generalized least squares (GLS), is to compute the model in a different fashion using weights. The observations were weighted by the inverse of the square root of the variance for residuals for an observed outcome conditional on which point category it fell into (-2, -1, 0, 1, 2). That is just a statistician's way of saying that matches with less randomness are more accurate, and that much of that randomness comes from how many match points were observed. The baseline model assumes the variance of the error term in Equation 2 is constant for all observed values; but because the variable is limited in this fashion to 5 values, the variance at the extrema is slightly different than near the mid-point. The technique accounts for this fact.
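
The weighting step can be sketched as follows. It assumes residuals from a first-pass fit are already in hand; the residual values and the function name are hypothetical, and the snippet only computes the weights, it does not refit the model.

```python
from collections import defaultdict
from math import sqrt
from statistics import pvariance

# Hypothetical (match points, first-pass residual) pairs, for illustration only.
observations = [
    (2, 0.5), (2, 2.5),    # residuals spread widely at the extreme
    (0, -0.5), (0, 0.5),   # and less so near the mid-point
]

def category_weights(observations):
    """Weight = 1 / standard deviation of residuals within a point category."""
    residuals = defaultdict(list)
    for points, residual in observations:
        residuals[points].append(residual)
    # Note: a category whose residuals are all identical has zero variance
    # and would raise ZeroDivisionError; real data would need a guard here.
    return {points: 1.0 / sqrt(pvariance(values))
            for points, values in residuals.items()}

weights = category_weights(observations)
```

With these made-up numbers the mid-point category gets twice the weight of the extreme one, which is the sense in which matches with less randomness count for more.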
Another linear approach, a mixed effects model, incorporates the same variables, and was used in the last paper. It is very similar to the classic linear model, but models the player effects as being drawn from a random variable, and incorporates this when calculating the deck effects using a different estimation technique, restricted maximum likelihood (REML).

2.3.2 Non-Linear Models
One of the correct critiques of the last Flogging the Data was that inappropriate statistical modeling was possibly driving the results. To address these concerns, the more traditional binary response models have been tested here. Binary response models, such as logit and probit, do a better job at modeling effects that are probabilities or where the outcome is a discrete value, as is the case here. They have the disadvantage that they are computationally more difficult, especially when thousands of variables are incorporated into the calculation. Indeed, here it required a bit of fiddling with the underlying numerical solver, and the final computations for these models took between 12 and 20 hours.
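
What separates logit from probit is only the curve that turns the model's linear index (the sum of the player, opponent, and deck effects) into a win probability. The sketch below shows the two standard link functions; the function names are mine, and nothing here reproduces the actual estimation.

```python
from math import erf, exp, sqrt

def logit_probability(index: float) -> float:
    """Logistic curve: maps any real-valued index to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-index))

def probit_probability(index: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(index / sqrt(2.0)))

# An index of zero means an even match under either model.
even_logit = logit_probability(0.0)
even_probit = probit_probability(0.0)
```

Both curves are S-shaped and pass through 50% at an index of zero, which is why the two models usually tell much the same story.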
Given that the variables for the tournament round and event location were insignificant in the last investigation, and proved again to be so in preliminary testing, larger sample notwithstanding, they were left out of the analysis here.

3 Results
Enough of formalities, let's get into the results, but before getting to the estimates for the actual decks, we need to first take a look at the meaning of the estimates.

3.1 Models Compared
All models should give us the same results: we should get approximately the same estimates for the linear models (OLS, REML, GLS) and get the same estimates for the non-linear models (LOG, PROB). The approach taken here was to order, from worst to best, all five sets of estimates for the 219 decks. The estimates along this ordered list should be roughly the same. The hitch is that the linear models cannot be directly compared to the non-linear models, so we also had to find a common way of comparing the coefficients. (See the full paper for the details of this conversion.) The OLS, REML, and GLS models all estimate in terms of games won whereas the LOG and PROB estimate in terms of probabilities.


The first thing to notice is the fact that decks even in the middle of the pack, i.e. the second quartile, put up less than a 50% win-percentage. The explanation is fairly simple: there are an unlimited number of bad decks that can be built/played, but there are a lot fewer good decks. It implies that you cannot just put sixty cards in a pile and have a decent chance of winning. This is also evidenced by the fact that the logit model, for example, found that 15% of the decks are statistical under-performers, but found only one statistical out-performer.

The second thing to notice is that OLS over- and under-predicts compared to the other models; GLS does a bit better in that respect (as expected). The third point is that all the models except for REML predict a large number of game losses. For OLS and GLS this is due to model misspecification; for the LOG and PROB models, it is a computational artifact from the underlying solver, which gives highly unreliable values for some of the lowest estimates. The final point worth noting is that four out of the five estimates agree in the third quartile, which is a good thing. This agreement in the higher range of values might point to the fact that estimates for the better performing decks are better because they are observed in more rounds given the additional single elimination rounds for the top 8 or 16 players of the Swiss rounds, or players dropping from the tournament. The magnitude of this intra-tournament selection effect remains an open issue for the time being.

In addition to yielding win-percentages similar to the logit and probit models and actually being computationally feasible, the REML model has the convenient feature that its estimated parameters map to the data's actual domain (-2 through 2), allowing the estimates to realistically be couched in terms of wins and losses. Figure 3 below is a visual example of how this conversion between game wins and win-percentage is done.


Figure 3: The implied standard 95% confidence interval is in red: note the huge error.
One possible interpretation of Figure 3 above is that the highest estimates represent limits of deck efficiency—decks beyond this efficiency frontier are non-feasible. The estimates however should be taken with a large grain of salt given the error involved.

Now that we have seen how the models compare with one another, we now turn to how the decks stack up against one another.

3.2 Decks Compared
Table 2 below shows the estimated number of wins in a best two of three game match for the fifteen most common decks using various statistical models.

The first noteworthy aspect of the coefficients in the following table is that none of the decks have statistical significance using any statistical model — the error ranges anywhere from ±41-91%! That is to say no deck can be statistically identified as superior on average; it also underscores why deck archetype explains so little of match outcome as seen in the last article. The simple fact may be that deck archetype, as currently conceived, may be irrelevant.


Since the estimates have such large margins of error, it would be wrong to read much into them. But in general OLS seems to underestimate and GLS seems to overestimate. LOG and PROB essentially agree as they usually do, and the REML estimates tend to be somewhere between the GLS and non-linear models.

Not shown in Table 2 are the bootstrapped coefficients and errors. These are coefficients obtained by resampling the data with replacement over and over, both to confirm the estimates and to establish the error intervals. The current tournament sample now seems to be nigh large enough in that resampling had little effect on the measured effects, but the errors tended to be even larger than in the table above, indicating that the sample is not quite yet asymptotic (i.e. large enough to not affect the results).
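
For readers who have not met the bootstrap before, here is a minimal Python sketch of the resampling idea. To stay short it bootstraps a plain average of match points rather than regression coefficients, so it is a simplification of what was done for the table; the sample values, function name, and parameter defaults are all hypothetical.

```python
import random
from statistics import mean

def bootstrap_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(sample, k=len(sample))
        estimates.append(mean(resample))
    estimates.sort()
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]

# Hypothetical match points (-2..2) observed for one deck.
sample = [2, 1, 1, -1, 2, -2, 0, 1, 2, -1, 1, 2]
low, high = bootstrap_interval(sample)
```

The width of the interval between `low` and `high` is the bootstrapped error; with only a dozen matches it is wide, which is the small-sample problem in miniature.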

While decks do have a detectable but minor effect, the evidence from Table 2 raises major doubts as to whether archetype is a salient empirical category, and whether decks can be said to be unconditionally better than one another.

Before moving on to match-ups, let us take stock of where we stand empirically, and possible alternative explanations.

  1. Model misspecification: The basic players and decks model components still seem to make sense. Since we do not have many other variables to work with there is not much more we can do about this anyway.
  2. Statistical technique: We knew OLS was a fiction, and the more appropriate probit and logit models underscore this. However, they too have their own computational problems. The smaller error and plausible estimates would seem to make the mixed effects model (REML) the best choice moving forward.
  3. Insufficient data: More observations have extinguished some of the previous article's outliers. The bootstrapping here seems to confirm the sample is a decent enough size, but an econometrician will almost never turn down more data should it be made available (note this does not include the Hatfields' most recent post-banned-list update).
  4. The notion of "archetype": There are two possibilities here: either the notion is a fiction or it is miscoded. 'Tis probably the latter. Unfortunately we cannot rely on the fact that there are lots of statistically significant bad decks to settle the question because these archetypes suffer from the outlier problem, and players are not likely to show up to a tournament with the miserably performing "Black Rats" or "Walls" decks ever again.

Because none of the most common decks have statistical significance within the basic modeling framework, we have to start thinking about alternatives, and that last proposition seems especially worth exploring considering all the commentary concerning the Legacy metagame. One alternative is to start thinking less in terms of absolute efficiency, and more in terms of performance conditional on the metagame.

4 Conditional Performance
The notion of the metagame implies that a player's deck choice is predicated on the choices of the other players. In order to make an optimum choice the player needs to know both what the other players will Nash-optimally play, and what her best response is. We shall leave the Nash conditions aside to focus on the responses within the player's strategy set—the responses being a particular deck in a given metagame. In order to choose a response, a player needs to know how a deck choice compares with the deck options of the other players.
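
To make the idea of a best response concrete, here is a minimal Python sketch. It assumes (my simplification, not something estimated in this article) that a deck's expected win rate is just the share-weighted average of its match-up win rates. Only the Merfolk-versus-Zoo figure of 35% appears in this article; every other number, and the function names, are hypothetical.

```python
# Hypothetical metagame: share of the field playing each deck.
metagame = {"Merfolk": 0.5, "Zoo": 0.3, "Dredge": 0.2}

# win_rate[a][b] = probability that deck a beats deck b.
# Merfolk vs Zoo (0.35) is from the article; the rest is made up.
win_rate = {
    "Merfolk": {"Merfolk": 0.50, "Zoo": 0.35, "Dredge": 0.60},
    "Zoo":     {"Merfolk": 0.65, "Zoo": 0.50, "Dredge": 0.40},
    "Dredge":  {"Merfolk": 0.40, "Zoo": 0.60, "Dredge": 0.50},
}

def expected_win_rate(deck, metagame, win_rate):
    """Share-weighted average of a deck's match-up win rates."""
    return sum(share * win_rate[deck][opponent]
               for opponent, share in metagame.items())

def best_response(metagame, win_rate):
    """Deck with the highest expected win rate against the given field."""
    return max(win_rate, key=lambda d: expected_win_rate(d, metagame, win_rate))

choice = best_response(metagame, win_rate)
```

Change the shares in `metagame` and the preferred deck can change with them, which is all that "performance conditional on the metagame" means here.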

This means we need to come to grips with how decks interact with one another, and the first step in doing so is to look at the performance of the deck within the metagame. But in stark contrast with articles that present match-ups without any sense of error or accounting for players' abilities, the approach presented here does both.

4.1 Metagame Conditional Performance
We obviously have the number of games won in a match, but we need to find a measure for the player effects. This is where the model comparisons of the previous section prove useful. The REML coefficients appear to yield the most reasonable results throughout the entire range of values, and hence will form our baseline estimates. Figure 4 below shows how points translate into win percentages after taking player effects into account.


Figure 4: Within any match outcome, there is a range of possible win percentages after taking player effects into account.
Now that we've presented a method for translating game wins without player effects into the familiar and ubiquitous win percentage, we look at how this might influence the win percentages for the 15 most common decks. Table 3 below does just that by showing both the raw win percentages, and the win percentages calculated after netting out player effects, along with the "absolute" win percentages as estimated via the mixed effects model (REML).


The first column is the raw win result typically found in a metagame report. The predicted mean and median of columns three and four scrub out the player effects, but leave in the "metagame" effects. The last column, REML, contains the estimates from Table 2, and shows the win percentage for the deck facing an unknown metagame.
With such a large sample, netting out player effects does not have much effect on the predicted results; the predicted win percentage is even identical in some cases. However, for smaller samples, such as a single individual tournament, netting out player effects can alter raw win percentages by ±1-5%, which is a fair amount given winning in Magic is often determined by smallish margins.

The next interesting aspect of Table 3 is that we can see which decks' performances are highly metagame dependent. Take UWLandstill for example; facing a random deck, the deck is a vast under-performer with only a 20% win rate, but put that same deck in the current metagame and its win percentage suddenly jumps above par. In contrast, one could interpret Bant's win percentage as extremely stable across metagames.

The bigger question regarding Table 3 is whether average win percentages are a good measure of central tendency, or whether median win is a better measure. The typical Merfolk match-up is pretty favorable with a 70% median win, but Merfolk is also a highly anticipated deck, and there are no doubt some blowouts when opponents use their sideboards with devastating effect. These extreme losses tend to lower the mean performance of the deck without affecting the median much. Dredge and BWStoneblade have mean and median performances that are largely stable. While this is not the place to unpack the strategic aspects of this observation, it's just worth reminding players and commentators that the average is but one measure of central tendency.
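
A toy example shows how a couple of blowouts pull the mean down while barely moving the median. The win rates below are hypothetical and are not taken from Table 3.

```python
from statistics import mean, median

# Hypothetical per-match-up win rates for one deck.
typical = [0.70, 0.72, 0.68, 0.71, 0.69]

# The same deck after two sideboard blowouts are added to the record.
with_blowouts = typical + [0.10, 0.15]

mean_shift = mean(typical) - mean(with_blowouts)
median_shift = median(typical) - median(with_blowouts)
```

Here the mean falls by roughly sixteen percentage points while the median slips by about one, so a report quoting only averages would make the deck look far worse than its typical match-up.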
Having examined the macroënvironment of the metagame, we now look at the outcomes of particular deck interactions.

4.2 Match-Up Conditional Performance
The basic idea is to look at the number of games won again, partial out the player effects, and resample the data with replacement to obtain some idea of the error on those match-ups. We then obtain a win percentage that is contingent on the particular combination of decks. Table 4.2 shows the results.




In the table above, the win percentage for a given archetype can be found by reading across into the right upper diagonal; should the value be missing, then unity minus the opposing deck's win percentage yields the win percentage. The margin of error can be found in a similar manner. For example, Merfolk wins 35% of the time against Zoo; conversely, Zoo's (missing) win percentage is 65% against Merfolk. The margin of error for that match-up is ±5%, which means there is a 95% likelihood that Merfolk wins between 30% and 40% of its matches against Zoo. Another interpretation of the margin of error is that certain match-ups exhibit a larger degree of inherent variance, but distinguishing between measurement (i.e. classification) error and inherent variance remains an open issue for now. For players, it may be worth spending more time practicing high variance match-ups that they can affect rather than simply practicing against a standard gauntlet of the most common decks.
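
The reading rule for the table can be written down as a small lookup. This is a sketch only: the dictionary layout and function name are my own, and the single entry is the Merfolk-versus-Zoo example from the text.

```python
# Upper-diagonal storage: (row deck, column deck) -> (win rate, margin of error).
upper_diagonal = {("Merfolk", "Zoo"): (0.35, 0.05)}

def matchup(deck, opponent, table):
    """Return (win rate, (low, high)) for `deck` against `opponent`."""
    if (deck, opponent) in table:
        rate, margin = table[(deck, opponent)]
    elif (opponent, deck) in table:
        # Missing cell: unity minus the opposing deck's win percentage.
        opposing_rate, margin = table[(opponent, deck)]
        rate = 1.0 - opposing_rate
    else:
        raise KeyError(f"no data for {deck} vs {opponent}")
    # Clamp the interval so it stays a valid percentage.
    return rate, (max(0.0, rate - margin), min(1.0, rate + margin))

zoo_rate, zoo_interval = matchup("Zoo", "Merfolk", upper_diagonal)
merfolk_rate, merfolk_interval = matchup("Merfolk", "Zoo", upper_diagonal)
```

Note that the margin of error is shared by both sides of a match-up, so Zoo's interval is simply Merfolk's mirrored around 50%.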

Notice that the margin of error tends to get bigger going from most common to least common. This is because, despite the massive sample, some match-ups have rarely occurred.

5 Conclusion
To recap, we've first seen how Legacy tournament data might be best modeled, and using a suitable model furnished a metagame report that incorporates player effects to yield more accurate estimates of deck performance. Those improvements notwithstanding, we could not find evidence to reject the hypothesis that deck archetype has no effect on match outcomes in Legacy. We thus explored the notion of performance conditional on the prevailing metagame, and examined whether typical tournament reports are robust in this respect.

That inquiry raised several issues to address in further research: 1. the current concept of "archetype" does not predict performance well, so a new scheme ought to be developed; 2. metagame breakdown reports employing win averages instead of median wins, or not accounting for player effects, could give false impressions of a deck's underlying performance; 3. the information on specific match-ups between decks is fairly sparse, and measurement error is thus correspondingly large.

All these things conspire to make the current metagame report in its common form of limited value for inferring archetype performance.

5.1 My Challenge to You
I challenge the community to come up with a better way of systematically classifying decks—an archetype label is not enough. While I do play a bit of Legacy, I don't feel that I am the best person to classify the decks—deck builders and regular players probably have the advantage here. If you send me your coded classification schemes I'll test and feature them in an upcoming Flogging the Data.

The goal is to explain more of the game outcomes based on the deck attributes. Keep in mind that fewer variables is often better, and that numbers or percentages allow more of the information to be exploited. Some examples of better schemes might be:
  1. Color composition of each deck
  2. % of types present in each deck
  3. Colors included in each deck
  4. The key cards of the deck
  5. Play style of the deck: aggro, control, combo
or any combination of these.

Keep in mind that you have almost unlimited information in suggestions #1 & #2, less in #3 & #4, and possibly the least in case #5. I'll address more specific questions in the forum.
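
To give a flavor of what a coded scheme along the lines of suggestion #1 might look like, here is a hedged Python sketch that turns a decklist into five color shares. The input format (one string of color letters per card), the function name, and the sample deck are all hypothetical; use whatever coding suits your scheme.

```python
from collections import Counter

COLORS = "WUBRG"  # white, blue, black, red, green

def color_shares(card_colors):
    """Share of colored-card slots per color; '' marks a colorless card."""
    counts = Counter()
    for colors in card_colors:
        for color in colors:  # a multicolor card counts once per color
            counts[color] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in COLORS]

# Hypothetical 60-card list: 30 blue cards, 10 white, 20 colorless.
deck = ["U"] * 30 + ["W"] * 10 + [""] * 20
shares = color_shares(deck)
```

Five numbers per deck are far fewer than 218 archetype labels, and because they are percentages rather than categories they let more of the information be exploited.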

The working dataset for deck coding is available here.
