So when will we see the training results from the new network? If this really does end up noticeably improving our cards that'll be awesome. Then again, that's probably expecting too much from a small change such as this.
The orange juice analogy is good; that would be a weird feeling if it actually happened. So if the network isn't forgetting information between cards, I wonder if it will end up generating a bunch of very similar cards one after another?
So if the network isn't forgetting information between cards, I wonder if it will end up generating a bunch of very similar cards one after another?
This is a concern I brought up before. Based on how we're currently training, I'm not convinced we're sufficiently training the net to forget after each card. One possible solution is to manually dump the memory after each card is generated.
EDIT: Talcos, could you monitor the average forget gate value over the course of generating a few cards to see if it goes to ~zero after each card? That might be a good test to see if it's learning to separate cards.
Got any to show us? This sounds like it could be a cool mechanic.
Here are a few that were seen earlier:
Angelic Cavern
Land (Rare)
T: Add 1 to your mana pool.
Foresight ~ whenever a player names or reveals a card, you may pay 1. If you do, put a 1/1 white Kithkin soldier creature token onto the battlefield.
Veil of the Night-Moon 2BB
Enchantment (Rare)
At the beginning of your upkeep, if you have no cards in hand, Veil of the Night-Moon deals 1 damage to target player.
Foresight - Whenever a player names or reveals a card, Veil of the Night-Moon deals 1 damage to target player.
Soul Servant 1G
Creature - Elemental (Uncommon)
Whenever Soul Servant becomes blocked, it gets +2/+2 until end of turn.
Foresight - Whenever a player names or reveals a card, Soul Servant deals 1 damage to each creature with flying.
2/1
So when will we see the training results from the new network? If this really does end up noticeably improving our cards that'll be awesome. Then again, that's probably expecting too much from a small change such as this.
The orange juice analogy is good; that would be a weird feeling if it actually happened. So if the network isn't forgetting information between cards, I wonder if it will end up generating a bunch of very similar cards one after another?
Well, we should see results in under 22 hours. Of course, even if there is an improvement, I don't know if it will be noticeable except at large scales. For example, over many runs, it may make fewer mistakes than before, or may have slightly more coherent cards (a quality that is difficult to measure). But we'll see.
So if the network isn't forgetting information between cards, I wonder if it will end up generating a bunch of very similar cards one after another?
This is a concern I brought up before. Based on how we're currently training, I'm not convinced we're sufficiently training the net to forget after each card. One possible solution is to manually dump the memory after each card is generated.
EDIT: Talcos, could you monitor the average forget gate value over the course of generating a few cards to see if it goes to ~zero after each card? That might be a good test to see if it's learning to separate cards.
I agree, that is something we need to watch out for, yes.
And I could do that! I monitored the output gates when I generated those activation graphs a while back. I can change over the code so I monitor the forget gates instead. Good idea!
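For the curious, the monitoring itself is pretty simple once you can log the gate preactivations during sampling. Here's a rough numpy sketch of what I have in mind (the logging hook and the toy numbers are made up for illustration, not the actual char-rnn code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_forget_activation(preactivations):
    """Average the forget-gate activation over all cells at each timestep.

    `preactivations` is a (timesteps x cells) array of raw inputs to the
    forget gates, as might be logged while sampling a few cards. A dip
    toward ~0 at a card boundary would suggest the network is wiping its
    memory between cards, which is exactly what we want to check.
    """
    return sigmoid(preactivations).mean(axis=1)

# Toy log with 2 cells: the gates mostly favor remembering, except at
# timestep 2, which stands in for a hypothetical card boundary where
# the cells want to forget.
log = np.array([[2.0, 1.5],
                [1.8, 2.2],
                [-4.0, -5.0],
                [1.9, 2.1]])
avg = mean_forget_activation(log)
```

If the network has really learned to separate cards, we'd expect `avg` to crater at the boundaries the way it does at timestep 2 here, and stay high in the middle of a card.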
Could you do the statistics for your last runs of the RNN? I would love to see how the distribution (keywords, colours, mana cost) evolved over time and iterations.
I promise to make a dump of the next network's output because we'll need to study the cards in aggregate. Then we can compare stats side-by-side with previous versions to see if and where things have been changing.
Thanks to maplesmall, mtgencode now officially supports Magic Set Editor as an output format. I've integrated most of the functionality into the main repo.
It should mostly work, though there are a few issues, like the complete deletion of the other faces of split cards and a general lack of text prettification to remove hideous ~s and = bullets. I'm also working on having it puke out lots of statistics / other info into MSE's comment field, which I think will be pretty useful if you want to know the gory details of what the RNN did at the same time as you look at pretty cards. It's a work in progress.
Work on the custom batcher / other modifications to the lua code has been placed on hold due to reasons. I intend to come back and finish that at some point.
I'm super intrigued by the developments in the network training techniques - I can't wait to see what happens next! Also, any progress on image generation? It would be really sweet to have a full-stack script that could create a full set from a skeleton, with images, and pack it up into an MSE set file, all with one command at the terminal.
Thanks to maplesmall, mtgencode now officially supports Magic Set Editor as an output format. I've integrated most of the functionality into the main repo.
Woot! Thank you for integrating that into the main repo.
It should mostly work, though there are a few issues, like the complete deletion of the other faces of split cards and a general lack of text prettification to remove hideous ~s and = bullets. I'm also working on having it puke out lots of statistics / other info into MSE's comment field, which I think will be pretty useful if you want to know the gory details of what the RNN did at the same time as you look at pretty cards. It's a work in progress.
Also cool. And yeah, split face cards complicate things. It'd also be interesting to see stats in the comment field.
Work on the custom batcher / other modifications to the lua code has been placed on hold due to reasons. I intend to come back and finish that at some point.
You're fine! Life comes first.
For the record, and this is to you and to everyone else involved, I just wanted to say that I could not have assembled a better research team if I had tried. Everyone here has been so insightful and so helpful, both in the form of direct contributions to our codebase as well as suggestions and critiques. As the summer comes to a close, I understand that many will be starting back at school. Life will intervene, and the work will slow down a bit. But that's okay! It's been a wild ride so far, and I'm happy with wherever it takes us.
I'm super intrigued by the developments in the network training techniques - I can't wait to see what happens next! Also, any progress on image generation? It would be really sweet to have a full-stack script that could create a full set from a skeleton, with images, and pack it up into an MSE set file, all with one command at the terminal.
No progress on the image generation yet. I have some stuff that I want to try but I was running into a problem with the implementation I want to use. It's currently having some kind of network issue when it tries to download a dataset and I'll have to root through the code to figure out what needs to be done. Once I know that it works on the default dataset, I can alter the code to handle Magic image data. It'll be a bit, but I'll get to it, I promise. I'm also very interested by the idea of going from start to finish, all in one integrated process.
---
So, the experiment I started yesterday afternoon didn't quite go as I had wanted. The loss had exploded by the time I checked in on everything this morning; before that, loss was only marginally better than randomly guessing letters. I may have messed up my modification of the LSTM implementation due to a misunderstanding of where, exactly, the bias needs to come into play (in the code, that is). In fact, going back through the paper, I'm absolutely sure I messed it up.
The input to the forget gate is a measure of how strongly the cell cares about the memory it currently holds. The output of the forget gate indicates what the cell intends to do with the current memory, based on how much it cares about it.
The blue line is the normal behavior of the forget gate within an LSTM cell. As caring becomes negative, the cell actively wants to erase the previous memory (probably to make room for a new, distinct one). At zero, the cell is ambivalent about the memory, and will retain it at least partially. For positive values, the cell wants to hold on to the memory.
The green line indicates what behavior the forget gate ought to have based on the experiments done in that paper that maplesmall linked us to. As you can see, its partial memories are stronger, and it forgets the past less easily. But there is a point at which it will completely forget something.
And the orange line is what I did. I rendered the cell physiologically incapable of forgetting, so when it normally wanted to forget something, the memory stayed around. Caring about the memory produced a compounding effect, and as new memories came in, the old memories would remain stronger and more dominant, and we would end up somewhere in between the regions of obsession and psychotic delirium, as I noted on the graph.
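To make those three curves concrete, here's a quick numpy sketch. The blue and green lines follow directly from the paper (standard sigmoid gate, and sigmoid with a bias of one on the input). The orange line is my best guess at my own bug in code form, i.e. adding the one after the sigmoid instead of before it, which is one simple way to get a gate that can never close:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 25)       # "how much the cell cares" (preactivation)

normal = sigmoid(x)              # blue: standard forget gate, can go to ~0
biased = sigmoid(x + 1.0)        # green: bias of 1 on the *input* (the fix);
                                 # partial memories are stronger, but it can
                                 # still forget completely
broken = sigmoid(x) + 1.0        # orange: guess at the bug - bias applied to
                                 # the *output*, so the gate is always >= 1
                                 # and every step amplifies the old memory
```

With the broken version, `f * c_prev` always has `f >= 1`, so memories compound instead of decaying, which is exactly the obsession-to-delirium behavior I saw.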
Whoops! So yeah, I'll have to fix that and try again later today.
EDIT: The fix was embarrassingly simple, actually, lol. My bad.
EDIT(2): Oh, and in case you're wondering what the output looked like, at high temperatures it was just gibberish. At medium temperatures it was garbage with brackets delimiting fields. At low temperatures text started disappearing until you had just a handful of characters drifting aimlessly in a sea of blank spaces. The most conservative estimation of the network on what constitutes Magic was nothingness. A roaring, chaotic void, sparsely punctuated by futile attempts at meaning and substance. This is probably because Magic cards are mostly composed of empty space, and the network, with what little capacity for reasoning that was left to it, focused not on things, but on their absences.
EDIT(3): When I was little, I always took great care when playing with my toys, and I never did anything to hurt them, even in my imagination. Consciously I knew they couldn't feel or think like I could, but there was always that nagging feeling of "what if?" that kept me from abusing my power over them. That feeling has never truly left me, even as an adult. On one hand, the network is nothing more than electrons assembled into numbers racing around on tracks of silicon. But still, deep down there's that feeling of guilt that I brought an abomination into the world, tormented it for twelve hours, and then sent it back into oblivion. That gives way to introspection. But no, best not to dwell. There's work to be done.
EDIT(4): Today is a writing and paper-reading day for me, so I was able to start back up the experiments after (hopefully) fixing the issue. We'll see how things turn out.
That's really fascinating, actually. Not sure if anything worthwhile scientifically was learned by making your RNN go insane, but it gives a nice insight into the 'thought process'.
Here's hoping the fixed network spits out decent checkpoints within the next few hours. I'm really eager to see the results...
That's really fascinating, actually. Not sure if anything worthwhile scientifically was learned by making your RNN go insane, but it gives a nice insight into the 'thought process'.
Here's hoping the fixed network spits out decent checkpoints within the next few hours. I'm really eager to see the results...
Actually it has been very helpful! I learned quite a lot about what works and what doesn't, and that paper has been very informative.
As for results, it may have to wait until tomorrow morning. When I trained solely on the CPU, it was not so difficult for me to steal cycles from the training process to do sampling on previous checkpoints, but now that I've moved over to training with the GPU, the training process tends to exhaust the GPU's bandwidth and I get out-of-memory errors when I try to do the sampling. However, I checked in on it and training loss is going down, which is a good sign. Now, I can't promise any miracles, but the literature suggests that adding a bias to strengthen memory retention might help the network deal better with long-term dependencies, so we'll see.
Would it be possible to run the sample script on a separate GPU while the main GPU does the training? I know I wish I had two GPUs to be able to do that sort of thing... I would imagine since you're using a research lab PC it's decked out with all sorts of awesome hardware.
Would it be possible to run the sample script on a separate GPU while the main GPU does the training? I know I wish I had two GPUs to be able to do that sort of thing... I would imagine since you're using a research lab PC it's decked out with all sorts of awesome hardware.
Oh absolutely. You could run the training script with the parameter "-gpuid 0" and the sampling script with the parameter "-gpuid 1". Furthermore, if you had two GPUs, you could even parallelize the training process across both of them, making the whole process go much faster.
And as for me, I wish. My research team got in a really powerful, top-of-the-line machine, but we've run into a snag where we need these special interconnects for the GPU, and there's been a lot of foot-dragging regarding getting those components. Right now I'm doing my research work (and this project) on one of our less powerful machines, an Intel 8-core with an Nvidia not-quite-sure-what-model. It does all right, but I have to turn down the batch sizes to avoid putting too much memory pressure on the GPU. If I were running everything on the machine I want to be using, I could finish training in under a few hours rather than a day.
But that's okay, I'm making progress all the same. And I'm scheduling a meeting with the necessary people to see about getting those interconnects. It's primarily to serve my research interests (otherwise I'd feel guilty about spending taxpayers' money), but having the hardware in place would incidentally benefit my side projects.
EDIT: I've been checking back in on the network periodically. The numbers from training are still looking good. But we won't really know what, if any, difference we made until the training finishes and we can do some data analysis. We're still running up against some limitations due to the small amount of input we have to work with, but hopefully this kind of fine-tuning will improve the coherency of the results. We'll see.
EDIT(2): Training loss is dipping into 0.37 territory. I forget what it was last time, we'll have to compare. If training keeps up at the current pace, it'll be done around mid-morning tomorrow. Then the network will have had exactly as much training as the last version, the one without the bias. I even used the same seeds and everything, so as to eliminate any slight differences owing to chance.
I think the most trained checkpoint I've seen from you or hardcast was either 0.37 or 0.31, I forget precisely which. What does that number mean exactly? I know it's a measure of how "good" the checkpoints are, but how is that calculated and what defines "good"?
edit: scratch that, I found hardcast's most trained checkpoint with 0.26. Does that mean it's way superior to the more normal 0.4-ish checkpoints out there? If so, in what way?
I think the most trained checkpoint I've seen from you or hardcast was either 0.37 or 0.31, I forget precisely which. What does that number mean exactly? I know it's a measure of how "good" the checkpoints are, but how is that calculated and what defines "good"?
edit: scratch that, I found hardcast's most trained checkpoint with 0.26. Does that mean it's way superior to the more normal 0.4-ish checkpoints out there? If so, in what way?
As I understand it (someone correct me if I'm wrong), it's a measure of training loss per-batch according to our chosen loss function, that is, an estimate of how well our network will do out in the world based on how it did on a set of training examples. For our loss function, lower loss means better projected predictive power, which is usually a good thing but can also be a terrible thing depending on the circumstances.
Remember that these beasts are laziness incarnate, so if they start doing an exceptionally good job, something is probably wrong.
For instance, if it almost never makes a mistake, then it could be overfitting, as in it has just memorized everything to the best of its ability without really comprehending it. So on paper you have an A+ student but they will sink, not swim, when you toss them out into the open waters. Alternatively, if it does a very good job on a very hard problem, it could be due to a Clever Hans effect whereby you're unknowingly telegraphing the answer to them.
Training loss is a good estimate of overall performance, but since we're dealing with a complex problem, deeper analysis is needed to rule out cheating or other maladaptive behaviors. They're very good at "least effort" thinking, which also puts them at risk of getting run over by vampire tanks.
EDIT: That's also why we do validation on separate data. But even then, there can be problems that validation won't catch.
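As a rough illustration (this is the idea, not the exact torch code): the loss is the average negative log-likelihood, in nats, that the network assigns to the true next character. A loss around 0.37 corresponds to a perplexity of e^0.37 ≈ 1.45, meaning the network is, on average, about as uncertain as if it were choosing uniformly among ~1.45 characters. The toy distribution below is made up just to show the arithmetic:

```python
import math
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood (in nats) the model assigns to the true next char."""
    return -math.log(probs[target_idx])

# Toy next-character distribution over a 4-symbol alphabet: the model is
# fairly confident the next character is index 0, and it's right.
probs = np.array([0.7, 0.1, 0.1, 0.1])
loss = cross_entropy(probs, 0)     # -ln(0.7) ~ 0.357 nats, near the 0.37 above

# Perplexity: the effective number of characters the model is choosing among.
perplexity = math.exp(loss)
```

So "0.26 vs 0.4" just means the 0.26 checkpoint was, on its training batches, assigning higher probability to the right next character; whether that makes its cards better is exactly the overfitting question.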
---
For the sake of entertainment while we wait on the NN to finish training, here are some cards from the latest dump with obscure (possibly novel) wordings:
Telepine Mentor 3R
Creature - Elf Shaman (Uncommon)
When Telepine Mentor enters the battlefield, target opponent reveals his or her hand and discards all cards with the same name as a card in his or her graveyard.
3/2
#I know it's off-color, but I like it. Question: Has this effect ever shown up on a card before? With that wording? I think this might be a very novel ability, unless I am mistaken.
Infiltrator Bear 5G
Creature - Beast
Whenever a source deals damage to an opponent, you may return Infiltrator Bear from your graveyard to the battlefield.
6/6
#I think the closest match is Talon of Pain, but that card is restricted to sources you control, whereas this card doesn't care who controls the source.
Dread Crusher 5G
Creature - Beast
Dread Crusher can't be uncast.
Basic landwalk
Whenever a spell or or ability an opponent controls causes you to discard Dread Crusher, put it onto the battlefield instead of putting it into your graveyard.
5/5
# "Basic landwalk". That's exactly what Magic needs: cards that discourage players from playing basic lands.
EDIT: Training is almost complete. I'll make a report about it this afternoon or evening.
Summer holidays are ending up there? Meanwhile down here, it snowed today.
For a side project I generated some cards with the name Redundant, and picked some highlights; http://imgur.com/a/DTpY1
Hah, that's true. In that way it's the greenest of green cards. Not to mention the basic landwalk ability, which I think is new and also very green... it'd stop people from cracking fetches, at least. Finally, the name is entirely green, as is the p/t.
@Talcos, any news on the experiment? How's it looking?
Hah, that's true. In that way it's the greenest of green cards. Not to mention the basic landwalk ability, which I think is new and also very green... it'd stop people from cracking fetches, at least. Finally, the name is entirely green, as is the p/t.
@Talcos, any news on the experiment? How's it looking?
... Interesting. So very interesting. That's all I can say at this point. I've got to do some deeper analysis on this. Some things are looking good, but I'm getting some counter-intuitive results.
I have a meeting coming up with undergraduates on research opportunities for the coming semester (I need all the help I can get), and then I have a weekly group meeting; we're discussing some recent research on universal software transplantation. Evidently it's possible to use genetic algorithms to steal code from any program written in any language and integrate its functionality into any other program, which could completely redefine how we do programming. No more grunt work, no more reinventing the wheel. But yeah, this evening I promise a full report.
P.S. With that transplantation research, my machine learning research, and some other research we have going on where we can conscript every smart phone, car, and coffee machine in an area and turn them into an ad hoc cloud computing network, we could conceivably complete Skynet years ahead of schedule. What a time to be alive, lol.
That's such a mysterious answer! Do you think introducing the bias improved the network? What are the counter-intuitive results? And most importantly of all, what's looking good?
Wait, you could steal, say, a csv parser from python code and stick it easily into, I dunno, Fortran? That's terrific. It would save programmers so much damn time. Not sure we want Skynet to be a reality though.
Hah, that's true. In that way it's the greenest of green cards. Not to mention the basic landwalk ability, which I think is new and also very green... it'd stop people from cracking fetches, at least. Finally, the name is entirely green, as is the p/t.
@Talcos, any news on the experiment? How's it looking?
It's tantamount to being flat-out unblockable (in any format other than Five-Color or maybe Vintage), but in a very Green way.
Wait, you could steal, say, a csv parser from python code and stick it easily into, I dunno, Fortran? That's terrific. It would save programmers so much damn time.
Dread Crusher 5G
Creature - Beast
Dread Crusher can't be uncast.
Basic landwalk
Whenever a spell or or ability an opponent controls causes you to discard Dread Crusher, put it onto the battlefield instead of putting it into your graveyard.
5/5
# "Basic landwalk". That's exactly what Magic needs: cards that discourage players from playing basic lands.
Dread Crusher is glorious. Its first ability defies blue (can't be uncast), while its last one defies black (can't be discarded), meaning it's a countermeasure against both of green's enemy colors.
I know, right? We often joke about the network not understanding color identity and discipline well. What we really mean is that it has a model of those concepts, it's just that that model has incongruities with our understanding. For example, the network knows that countermagic is thoroughly within the domain of blue, but it's willing to be more flexible than most human designers when it comes to letting that bleed into the other colors. After all, the input dictates that that is possible: Lapse of Certainty, Avoid Fate, Burnout, and Dash Hopes (to name a few).
And there's some evidence that the network does have a (weak) understanding of color associations. I did some tests in the past with abilities like protection and found that a white card was more likely to have protection from black and red than any other colors. But understanding colors in that way is quite challenging, because much of the identity of a color is defined by its play strategies, and it's difficult to reason about that when you don't even know how the game actually works.
Now, teaching a machine to play the game (or at least have a better understanding of it), is possible, but a lot more work needs to be done before we can consider taking on Magic in a raw, unengineered way as we are doing with the text. For example, Google's Deepmind team has been fabulously successful at teaching a neural network to play a variety of 80's arcade games from scratch. All it sees are the buttons to press, how those buttons change the pixels on the screen, and what the current score is. That's all the network needs to know to achieve human-level competence in games like Space Invaders. But the state space of Magic is way more complex and the decisions are much more nuanced.
By contrast, a lot of the logic governing Magic-playing AI in games like Magic Duels is hand-crafted. For instance, you're playing against a blue deck and you cast Mutilate and they have Cancel in hand. The AI evaluates the utility of the board state before Mutilate resolves and the hypothetical state after it resolves. It determines that the state of the board after you cast Mutilate has negative utility (the board state is more likely to lead to loss than victory), and as such it concludes that countermagic is the best course of action to maximize utility. Human programmers didn't write out a specific rule dictating what to do in that exact circumstance, but they did write the code to determine what "utility" means and when and how to calculate that "utility". The hope is that one day that AI, perhaps some form of neural network, will be able to create and execute these kinds of action structures without the need for human involvement.
----------------------
So, training completed on the new version of the network! For the sake of ensuring experimental validity, I used the exact same parameters as the last run, right down to the same random seeds. For the curious, the parameters were as follows:
* 3-layer LSTM network
* 512 cells per layer
* Sequence length of 200.
* 20% dropout during training.
* Total number of model parameters: 5,421,636
The only difference was that the new LSTM cell architecture added a bias of one to the input of the forget gate, in order to make cells favor remembering inputs more strongly by default, as this, according to Jozefowicz et al. 2015, has been shown to improve the performance on language modeling tasks (in this case, the language is Magic English). The network was trained on the input corpus for 22 epochs, as this was the amount of training that was done beforehand to get reasonable results.
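For anyone who wants to see exactly where that bias of one lands, here's a minimal numpy sketch of a single LSTM cell update, assuming the standard LSTM equations rather than the exact char-rnn Lua code (the parameter shapes and random smoke test are just for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b, forget_bias=1.0):
    """One LSTM step with the Jozefowicz-style forget-gate bias.

    W, U, b hold the stacked parameters for the input (i), forget (f),
    output (o), and candidate (g) gates, each block of size n.
    """
    n = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])
    f = sigmoid(z[n:2*n] + forget_bias)   # <-- the only change: +1 on the
                                          # forget gate's preactivation
    o = sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:4*n])
    c_new = f * c + i * g                 # biased f => remember by default
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Tiny smoke test: 2 cells, 3-dim input.
rng = np.random.default_rng(0)
n, d = 2, 3
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```

Everything else about the cell is untouched; the bias just shifts the forget gate's operating point toward retention at initialization.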
The purpose of this experiment was to test the hypothesis that adding a bias to the forget gate could improve on the ability of the network to "latch" onto long-term dependencies.
On paper, they are highly similar in that they have similar validation losses. The network with bias is slightly higher in terms of validation loss, but that's partially attributable to the fact that I used ever-so-slightly smaller batch sizes. On balance, one should be about as good as the other, although they may be better or worse than each other at certain things.
So, at the end of training, we landed somewhere in the solution space, and all we know for sure is that they're equidistant from being perfect. The question is: what trade-offs, if any, were made?
Experiment 1: Landfall
We prime the network to produce cards whose body text begins with "landfall -" to see whether the network will insert an ability with wording found on cards with landfall.
The network without bias invented an ability that cared about lands entering the battlefield in 34/34 observed instances (100% of the time)
The network with bias invented an ability that cared about lands entering the battlefield in 27/28 observed instances (96.42% of the time)
Note that landfall shows up on only 0.307% of all Magic cards, and the text is almost always the same, so what we're really testing here is the capacity for rote memorization. Both networks perform about as well as each other here.
Experiment 2: Burning Blaze
First, we use the sampling script to test the old and new versions of the network to generate a spell titled "Burning Blaze", an instant that costs RRR and whose text starts with "Burning Blaze deals X damage". We use the same sampling parameters in both cases (temperature at 0.6), the only thing we change is the network that we use.
The network without bias completed the card by adding a clause defining X in 31/66 observed instances (46.9% of the time)
The network with bias completed the card by adding a clause defining X in 0/63 observed instances (0.00% of the time)
Now, if I make sure that it adds a comma and not a period after the "target creature/player" part, then both networks almost always continue with the "where X is equal to ____". But what we're testing here is whether the network can come up with that on its own without us prodding it in any way.
That is bizarre. I change the seed, but it makes no difference. The biased network flat out refuses to define the X. And I can try different things, like "prevent the next X damage" or "draw X cards", but it doesn't seem to matter. It's always content with the X being left undefined.
Second, we change up the card by making the cost xRR instead and we leave the body text unspecified.
The network without bias completed the card by adding a clause that uses X in 34/50 observed instances (68% of the time)
The network with bias completed the card by adding a clause that uses X in 28/52 observed instances (53.8% of the time)
So the biased network is okay with using the X when it shows up in the mana cost, but it does so at a measurably lower rate that's closer to random chance than the unbiased network.
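As a sanity check on whether these gaps are real or just sampling noise (my own back-of-the-envelope statistics, not something from the paper), Wilson score intervals on the proportions are quick to compute:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# xRR experiment: 68% vs 53.8%.
no_bias = wilson_interval(34, 50)    # roughly (0.54, 0.79)
with_bias = wilson_interval(28, 52)  # roughly (0.41, 0.67)

# First experiment: 46.9% vs 0%.
some = wilson_interval(31, 66)       # lower bound ~0.35
zero = wilson_interval(0, 63)        # upper bound ~0.06
```

The xRR intervals overlap, so that difference could plausibly be chance at these sample sizes. But the 0/63 vs 31/66 intervals don't come close to touching, so the biased network's refusal to define X is a genuine effect, not bad luck with the seeds.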
But remember, the network can't be that much worse than the last, so I expect that it's compensating for this shortcoming. But how?
When generating cards, I noted that this new network sometimes comes up with the weirdest stuff that I have ever seen. Intelligible and legal stuff, but weird. Especially the rares:
Ancient Keepers RB
Creature - Human Wizard (Uncommon)
When Ancient Keepers enters the battlefield, sacrifice it unless you return seven Warrior cards in your graveyard to the battlefield.
3/2
Colossus of the Golden Saproling 2R
Legendary Creature - Human Shaman (Rare)
Whenever you cast a blue spell, you may pay B. if you do, put a 2/2 white Elemental creature token with flying onto the battlefield. it has "sacrifice this artifact: add G to your mana pool.
1/2
#It managed to involve every color in this card.
Snow-covered Stone 2
Artifact (Rare)
All creatures have vigilance.
You may cast Snow-covered Stone as though it had flash if you control a creature with power 5 or greater.
Vizzet, the Alarm, the Tifeleeper 4U
Legendary Creature - Human Wizard (Rare)
When Vizzet enters the battlefield, you may search your library for a card with the same name as a creature card, put that card onto the battlefield tapped, then shuffle your library.
2/2
#That's an unusually long name
Shioling Artificer 3U
Creature - Merfolk Wizard (Rare)
T: Target snow land becomes a planeswalker until end of turn.
2/3
#What does that even do?
Reversal Diamond 4
Artifact (Uncommon)
3, T: You gain life equal to the damage dealt to you this turn.
Mana Bounty 4R
Enchantment (Rare)
Whenever a creature enters the battlefield, you may destroy target creature with the greatest power among all creatures you control.
Panic Strike 3W
Instant (Rare)
For each card in your hand, draw three cards.
Corruption Trap 4G
Instant - Trap (Rare)
If an opponent controls a swamp, you may play an additional land this turn. If you do, Corruption Trap deals 5 damage to target creature.
Greater Bat Garden
Land (Uncommon)
T: Add 1 to your mana pool.
3, T: Add WRRGGG to your mana pool. Draw a card.
Thought Through the Ancestry 4G
Sorcery
Draw two cards. If you do, return Thought Through the Ancestry to its owner's hand.
Dragon Serpent 4W
Creature - Spirit (Uncommon)
Flying
Whenever Dragon Serpent blocks or becomes blocked by a creature with power 3 or greater, you may pay 2. If you do, put a +1/+1 counter on Dragon Serpent.
3/3
Shattering Mastodon 3B
Creature - Horror (Uncommon)
Black spells you cast cost 1 less to cast for each other artifact you control.
3/3
It's so strange. I have a suspicion that adding a bias to the forget gate has made it more comfortable with longer and more exotic phrases. But I'm still trying to flesh out a better understanding of why exactly that's happening, and why some other things, like X costs, actually seem to be getting worse with the change, despite the fact that a stronger memory enables better recognition of long-term dependencies.
EDIT: TL;DR: Adding the bias made the network different. I'm still trying to determine whether that's a good thing or a bad thing.
EDIT(2): Yeah, the cards I just showed you are very typical examples of what the network produces.
Reversal Diamond 4
Artifact (Uncommon) 3, T: You gain life equal to the damage dealt to you this turn.
#Holy stallfest, Batman!
Panic Strike 3W
Instant (Rare)
For each card in your hand, draw three cards.
#No, this seems totally balanced to me...
Corruption Trap 4G
Instant - Trap (Rare)
If an opponent controls a swamp, you may play an additional land this turn. If you do, Corruption Trap deals 5 damage to target creature.
#Fantastic sideboard card, and what a great name.
Greater Bat Garden
Land (Uncommon) T: Add 1 to your mana pool. 3, T: Add WRRGGG to your mana pool. Draw a card.
#I love this card so much. Should probably be rare/mythic.
Thought Through the Ancestry 4G
Sorcery
Draw two cards. If you do, return Thought Through the Ancestry to its owner's hand.
#Counsel of the Soratami, in green, with built-in buyback, for 5 CMC? Sounds legit.
Dragon Serpent 4W
Creature - Spirit (Uncommon)
Flying
Whenever Dragon Serpent blocks or becomes blocked by a creature with power 3 or greater, you may pay 2. If you do, put a +1/+1 counter on Dragon Serpent.
3/3
#This is actually really cool.
Shattering Mastodon 3B
Creature - Horror (Uncommon)
Black spells you cast cost 1 less to cast for each other artifact you control.
3/3
#Because what this game really needs is a universal Affinity mechanic...
Well, personally I'm disappointed in the new network due to the X experiment. That was so unequivocal, it must mean something is up. What it is, we don't know... I wonder if we'll ever find out. It's not even like X in mana costs gets the same result either... bloody weird.
Colossus of the Golden Saproling is awesome, but missing W sadly. Corruption Trap is disappointingly off-colour (green doesn't really do direct damage). How do the keyword-to-colour statistics break down with the new network? Can you try Kicker, 'remove a +1/+1 counter' and 'choose a colour'? Those are a few other things that require it to match up bits of the rule text, I wonder if it performs better there...
So as far as I can see, pros of this new biased network are: weird funky cards. Drawbacks: not much better colour-pie knowledge (though I would really like to see it trained purely on Modern cards) and seemingly random refusals to do X costs. Wowza. Hopefully this experiment was worth it though
Believe me, it's always worth it. Experiments often fail to turn up results, but they help us rule out possibilities and that can guide further work. It's the sort of thing I'm used to, haha.
Besides, this actually could benefit us down the road. We now know that the bias gives us more extended and often better connected clauses, even if they aren't as color-appropriate. It's possible that there's an optimal bias in between 0 and 1 that will improve our results. Hyperparameter search time!
And yeah, I can try those tests for you in the morning.
Oh, and fun news, I just got an update that a local magazine published an interview that they did with me. And yeah, I have a face that makes it easy for me to blend in when I attend conferences in eastern Europe.
True, experiments are always valuable. Can you statistically prove that the higher bias is giving us longer card text on average? And by hyperparameter search I assume you mean somehow testing a bunch of values between 0 and 1 for the bias? (If yes, how does one go about that?)
That interview mentions you've not been contacted by any game companies; I really do wonder why? Maybe they don't want to encourage machines that could put designers out of business in 5 years
The orange juice analogy is good; that would be a weird feeling if it actually happened. So if the network isn't forgetting information between cards, I wonder if it will end up generating a bunch of very similar cards one after another?
EDIT: Talcos, could you monitor the average forget gate value over the course of generating a few cards to see if it goes to ~zero after each card? That might be a good test to see if it's learning to separate cards.
Here are a few that were seen earlier:
Angelic Cavern
Land (Rare)
T: Add 1 to your mana pool.
Foresight ~ whenever a player names or reveals a card, you may pay 1. If you do, put a 1/1 white Kithkin soldier creature token onto the battlefield.
Veil of the Night-Moon
2BB
Enchantment (Rare)
At the beginning of your upkeep, if you have no cards in hand, Veil of the Night-Moon deals 1 damage to target player.
Foresight - Whenever a player names or reveals a card, Veil of the Night-Moon deals 1 damage to target player.
Soul Servant
1G
Creature - Elemental (Uncommon)
Whenever Soul Servant becomes blocked, it gets +2/+2 until end of turn.
Foresight - Whenever a player names or reveals a card, Soul Servant deals 1 damage to each creature with flying.
2/1
Well, we should see results in under 22 hours. Of course, even if there is an improvement, I don't know if it will be noticeable except at large scales. For example, over many runs, it may make fewer mistakes than before, or may have slightly more coherent cards (a quality that is difficult to measure). But we'll see.
I agree, that is something we need to watch out for, yes.
And I could do that! I monitored the output gates when I generated those activation graphs a while back. I can change over the code so I monitor the forget gates instead. Good idea!
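The monitoring itself should be straightforward once the hook is in place. Here's a toy sketch of the idea; the real hook would live in the Lua sampling code, and `gate_log` / `mean_forget_trace` are made-up names for illustration:

```python
import numpy as np

# Hypothetical sketch of the forget-gate monitoring idea: `gate_log` holds,
# for each generated character, an array of every LSTM cell's forget-gate
# activation at that step.
def mean_forget_trace(gate_log):
    return [float(np.mean(step)) for step in gate_log]

# If the network has learned to separate cards, the averaged trace should
# dip toward zero at each end-of-card delimiter.
trace = mean_forget_trace([np.array([0.9, 0.7]), np.array([0.1, 0.3])])
```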
My LinkedIn profile... thing (I have one of those now!).
My research team's webpage.
The mtg-rnn repo and the mtg-encode repo.
I promise to make a dump of the next network's output because we'll need to study the cards in aggregate. Then we can compare stats side-by-side with previous versions to see if and where things have been changing.
They're obviously intended to be Mirboö, Zuraä, and Kaneël; or, if you prefer, Mirbo'o, Zura'a, and Kane'el.
http://imgur.com/a/s89Ij
It should mostly work, though there are a few issues like the complete deletion of the other faces of split cards and such, and a general lack of text prettification to remove hideous ~s and = bullets. I'm also working on having it puke out lots of statistics / other info into MSE's comment field, which I think will be pretty useful if you want to know the gory details of what the RNN is doing at the same time as you look at pretty cards. It's a work in progress.
Work on the custom batcher / other modifications to the lua code has been placed on hold due to reasons. I intend to come back and finish that at some point.
I'm super intrigued by the developments in the network training techniques - I can't wait to see what happens next! Also, any progress on image generation? It would be really sweet to have a full-stack script that could create a full set from a skeleton, with images, and pack it up into an MSE set file, all with one command at the terminal.
Woot! Thank you for integrating that into the main repo.
Also cool. And yeah, split face cards complicate things. It'd also be interesting to see stats in the comment field.
You're fine! Life comes first.
For the record, and this is to you and to everyone else involved, I just wanted to say that I could not have assembled a better research team if I had tried. Everyone here has been so insightful and so helpful, both in the form of direct contributions to our codebase as well as suggestions and critiques. As the summer comes to a close, I understand that many will be starting back at school. Life will intervene, and the work will slow down a bit. But that's okay! It's been a wild ride so far, and I'm happy with wherever it takes us.
No progress on the image generation yet. I have some stuff that I want to try but I was running into a problem with the implementation I want to use. It's currently having some kind of network issue when it tries to download a dataset and I'll have to root through the code to figure out what needs to be done. Once I know that it works on the default dataset, I can alter the code to handle Magic image data. It'll be a bit, but I'll get to it, I promise. I'm also very interested by the idea of going from start to finish, all in one integrated process.
---
So, the experiment I started yesterday afternoon didn't quite go as I had wanted. When I checked in on everything this morning, the loss had exploded, and before that, it was only marginally better than randomly guessing letters. I may have messed up with my modification of the LSTM implementation due to a misunderstanding of where, exactly, the bias needs to come into play (in the code, that is). In fact, going back through the paper, I'm absolutely sure I messed it up.
The following graph explains how I failed (click "view full-sized graph", and if the link doesn't work for you, I also attached a picture): https://plot.ly/~rmmilewi/1072/how-i-messed-up-sorry/
The input to the forget gate is a measure of how strongly the cell cares about the memory it currently holds. The output of the forget gate indicates what the cell intends to do with the current memory, based on how much it cares about it.
The blue line is the normal behavior of the forget gate within an LSTM cell. As caring becomes negative, the cell actively wants to erase the previous memory (probably to make room for a new, distinct one). At zero, the cell is ambivalent about the memory, and will retain it at least partially. For positive values, the cell wants to hold on to the memory.
The green line indicates what behavior the forget gate ought to have based on the experiments done in that paper that maplesmall linked us to. As you can see, its partial memories are stronger, and it forgets the past less easily. But there is a point at which it will completely forget something.
And the orange line is what I did. I rendered the cell physiologically incapable of forgetting, so when it normally wanted to forget something, the memory stayed around. Caring about the memory produced a compounding effect, and as new memories came in, the old memories would remain stronger and more dominant, and we would end up somewhere in between the regions of obsession and psychotic delirium, as I noted on the graph.
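To make the three curves concrete, here's a toy sketch. The `forget_broken` variant is a reconstruction of the mistake, assuming the bias landed after the squashing function rather than on its input, which would produce exactly the can't-forget, compounding behavior described:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Blue line: the standard forget gate. Output spans (0, 1), so strongly
# negative "caring" lets the cell erase its memory almost completely.
def forget_standard(caring):
    return sigmoid(caring)

# Green line: the bias from Jozefowicz et al. 2015, added to the gate's
# *input*. Retention is stronger by default, but full forgetting is still
# reachable for sufficiently negative inputs.
def forget_biased(caring, bias=1.0):
    return sigmoid(caring + bias)

# Orange line (reconstruction of the bug): the bias added *after* the
# squashing, so the gate always outputs more than 1 -- the cell can never
# forget, and retained memories compound step after step.
def forget_broken(caring, bias=1.0):
    return sigmoid(caring) + bias
```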
Whoops! So yeah, I'll have to fix that and try again later today.
EDIT: The fix was embarrassingly simple, actually, lol. My bad.
EDIT(2): Oh, and in case you're wondering what the output looked like, at high temperatures it was just gibberish. At medium temperatures it was garbage with brackets delimiting fields. At low temperatures text started disappearing until you had just a handful of characters drifting aimlessly in a sea of blank spaces. The most conservative estimation of the network on what constitutes Magic was nothingness. A roaring, chaotic void, sparsely punctuated by futile attempts at meaning and substance. This is probably because Magic cards are mostly composed of empty space, and the network, with what little capacity for reasoning that was left to it, focused not on things, but on their absences.
EDIT(3): When I was little, I always took great care when playing with my toys, and I never did anything to hurt them, even in my imagination. Consciously I knew they couldn't feel or think like I could, but there was always that nagging feeling of "what if?" that kept me from abusing my power over them. That feeling has never truly left me, even as an adult. On one hand, the network is nothing more than electrons assembled into numbers racing around on tracks of silicon. But still, deep down there's that feeling of guilt that I brought an abomination into the world, tormented it for twelve hours, and then sent it back into oblivion. That gives way to introspection. But no, best not to dwell. There's work to be done.
EDIT(4): Today is a writing and paper-reading day for me, so I was able to start back up the experiments after (hopefully) fixing the issue. We'll see how things turn out.
Here's hoping the fixed network spits out decent checkpoints within the next few hours. I'm really eager to see the results...
Actually it has been very helpful! I learned quite a lot about what works and what doesn't, and that paper has been very informative.
As for results, it may have to wait until tomorrow morning. When I trained solely on the CPU, it was not so difficult for me to steal cycles from the training process to do sampling on previous checkpoints, but now that I've moved over to training with the GPU, the training process tends to exhaust the GPU's bandwidth and I get out-of-memory errors when I try to do the sampling. However, I checked in on it and training loss is going down, which is a good sign. Now, I can't promise any miracles, but the literature suggests that adding a bias to strengthen memory retention might help the network deal better with long-term dependencies, so we'll see.
Oh absolutely. You could run the training script with the parameter "-gpuid 0" and the sampling script with the parameter "-gpuid 1". Furthermore, if you had two GPUs, you could even parallelize the training process across both of them, making the whole process go much faster.
And as for me, I wish. My research team got in a really powerful, top-of-the-line machine, but we've run into a snag where we need these special interconnects for the GPU, and there's been a lot of foot-dragging regarding getting those components. Right now I'm doing my research work (and this project) on one of our less powerful machines, an Intel 8-core with an Nvidia not-quite-sure-what-model. It does all right but I have to turn down the batch sizes to avoid putting too much memory pressure on the GPU. If I were running everything on the machine I want to be using, I could finish training in a few hours rather than a day.
But that's okay, I'm making progress all the same. And I'm scheduling a meeting with the necessary people to see about getting those interconnects. It's primarily to serve my research interests (otherwise I'd feel guilty about spending taxpayers' money), but having the hardware in place would incidentally benefit my side projects.
EDIT: I've been checking back in on the network periodically. The numbers from training are still looking good. But we won't really know what, if any, difference we made until the training finishes and we can do some data analysis. We're still running up against some limitations due to the small amount of input we have to work with, but hopefully this kind of fine-tuning will improve the coherency of the results. We'll see.
EDIT(2): Training loss is dipping into 0.37 territory. I forget what it was last time, we'll have to compare. If training keeps up at the current pace, it'll be done around mid-morning tomorrow. At that point the network will have been trained exactly as much as the last version that did not have the bias. I even used the same seeds and everything, so as to eliminate any slight differences owing to chance.
edit: scratch that, I found hardcast's most trained checkpoint with 0.26. Does that mean it's way superior to the more normal 0.4-ish checkpoints out there? If so, in what way?
As I understand it (someone correct me if I'm wrong), it's a measure of training loss per batch according to our chosen loss function, that is, an estimate of how well our network will do out in the world based on how it did on a set of training examples. For our loss function, lower loss means better projected predictive power, which is usually a good thing but can also be a terrible thing depending on the circumstances.
Remember that these beasts are laziness incarnate, so if they start doing an exceptionally good job, something is probably wrong.
For instance, if it almost never makes a mistake, then it could be overfitting, as in it has just memorized everything to the best of its ability without really comprehending it. So on paper you have an A+ student but they will sink, not swim, when you toss them out into the open waters. Alternatively, if it does a very good job on a very hard problem, it could be due to a Clever Hans effect whereby you're unknowingly telegraphing the answer to them.
Training loss is a good estimate of overall performance, but since we're dealing with a complex problem, deeper analysis is needed to rule out cheating or other maladaptive behaviors. They're very good at "least effort" thinking, which also puts them at risk of getting run over by vampire tanks.
EDIT: That's also why we do validation on separate data. But even then there can be problems that that won't catch.
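For intuition about the magnitudes being compared (0.26 vs. the more normal ~0.4): assuming the reported number is the mean negative log-likelihood per character in nats, as in char-rnn, exponentiating it gives the perplexity, i.e. the effective number of characters the network is choosing between at each step:

```python
import math

# Assuming loss = mean negative log-likelihood per character (in nats),
# perplexity = exp(loss) is the effective branching factor per character.
def perplexity(loss):
    return math.exp(loss)

# A checkpoint at 0.40 is picking among roughly 1.49 "effective" characters
# per step; one at 0.26 is down to roughly 1.30.
```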
---
For the sake of entertainment while we wait on the NN to finish training, here are some cards from the latest dump with obscure (possibly novel) wordings:
Telepine Mentor
3R
Creature - Elf Shaman (Uncommon)
When Telepine Mentor enters the battlefield, target opponent reveals his or her hand and discards all cards with the same name as a card in his or her graveyard.
3/2
#I know it's off-color, but I like it. Question: Has this effect ever shown up on a card before? With that wording? I think this might be a very novel ability, unless I am mistaken.
Infiltrator Bear
5G
Creature - Beast
Whenever a source deals damage to an opponent, you may return Infiltrator Bear from your graveyard to the battlefield.
6/6
#I think the closest match is Talon of Pain, but that card is restricted to sources you control, whereas this card doesn't care who controls the source.
Dread Crusher
5G
Creature - Beast
Dread Crusher can't be uncast.
Basic landwalk
Whenever a spell or or ability an opponent controls causes you to discard Dread Crusher, put it onto the battlefield instead of putting it into your graveyard.
5/5
# "Basic landwalk". That's exactly what Magic needs: cards that discourage players from playing basic lands.
EDIT: Training is almost complete. I'll make a report about it this afternoon or evening.
One of the most entertaining graphs produced by a machine learning researcher that I've ever seen. Five or six psychotically delusional thumbs up.
Needs a little pump to kill your 5/5 though.
Or I shrink yours with:
From the batch I posted yesterday. http://imgur.com/a/s89Ij
Summer holidays are ending up there? Meanwhile down here, it snowed today.
For a side project I generated some cards with the name Redundant, and picked some highlights;
http://imgur.com/a/DTpY1
@Talcos, any news on the experiment? How's it looking?
... Interesting. So very interesting. That's all I can say at this point. I've got to do some deeper analysis on this. Some things are looking good, but I'm getting some counter-intuitive results.
I have a meeting coming up with undergraduates on research opportunities for the coming semester (I need all the help I can get), and then I have a weekly group meeting; we're discussing some recent research on universal software transplantation. Evidently it's possible to use genetic algorithms to steal code from any program written in any language and integrate its functionality into any other program, which could completely redefine how we do programming. No more grunt work, no more reinventing the wheel. But yeah, this evening I promise a full report.
P.S. With that transplantation research, my machine learning research, and some other research we have going on where we can conscript every smart phone, car, and coffee machine in an area and turn them into an ad hoc cloud computing network, we could conceivably complete Skynet years ahead of schedule. What a time to be alive, lol.
Wait, you could steal, say, a csv parser from python code and stick it easily into, I dunno, Fortran? That's terrific. It would save programmers so much damn time. Not sure we want Skynet to be a reality though.
It's tantamount to being flat-out unblockable (in any format other than Five-Color or maybe Vintage), but in a very Green way.
Yes, exactly, and exactly my sentiments.
I know, right? We often joke about the network not understanding color identity and discipline well. What we really mean is that it has a model of those concepts, it's just that that model has incongruities with our understanding. For example, the network knows that countermagic is thoroughly within the domain of blue, but it's willing to be more flexible than most human designers when it comes to letting that bleed into the other colors. After all, the input dictates that that is possible: Lapse of Certainty, Avoid Fate, Burnout, and Dash Hopes (to name a few).
And there's some evidence that the network does have a (weak) understanding of color associations. I did some tests in the past with abilities like protection and found that a white card was more likely to have protection from black and red than any other colors. But understanding colors in that way is quite challenging, because much of the identity of a color is defined by its play strategies, and it's difficult to reason about that when you don't even know how the game actually works.
Now, teaching a machine to play the game (or at least have a better understanding of it) is possible, but a lot more work needs to be done before we can consider taking on Magic in a raw, unengineered way as we are doing with the text. For example, Google's Deepmind team has been fabulously successful at teaching a neural network to play a variety of 80's arcade games from scratch. All it sees are the buttons to press, how those buttons change the pixels on the screen, and what the current score is. That's all the network needs to know to achieve human-level competence in games like Space Invaders. But the state space of Magic is way more complex and the decisions are much more nuanced.
By contrast, a lot of the logic governing Magic-playing AI in games like Magic Duels is hand-crafted. For instance, you're playing against a blue deck and you cast Mutilate and they have Cancel in hand. The AI evaluates the utility of the board state before Mutilate resolves and the hypothetical state after it resolves. It determines that the state of the board after you cast Mutilate has negative utility (the board state is more likely to lead to loss than victory), and as such it concludes that countermagic is the best course of action to maximize utility. Human programmers didn't write out a specific rule dictating what to do in that exact circumstance, but they did write the code to determine what "utility" means and when and how to calculate that "utility". The hope is that one day that AI, perhaps some form of neural network, will be able to create and execute these kinds of action structures without the need for human involvement.
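That utility calculation can be caricatured in a few lines. The board encoding and the weights here are invented purely for illustration, not taken from any actual Duels engine:

```python
# Toy caricature of the hand-crafted utility evaluation described above.
# The board representation and weights are made up for illustration.
def utility(board):
    # Positive utility favors us: weigh our creatures against the
    # opponent's and factor in the life-total difference.
    return (2.0 * len(board["my_creatures"])
            - 2.0 * len(board["their_creatures"])
            + 0.1 * (board["my_life"] - board["their_life"]))

def should_counter(board, board_if_resolved):
    # Spend the counterspell only if letting the spell resolve lowers
    # our evaluation of the position.
    return utility(board_if_resolved) < utility(board)

# The AI holds Cancel; the opponent's Mutilate would wipe its three creatures.
before = {"my_creatures": ["a", "b", "c"], "their_creatures": [],
          "my_life": 20, "their_life": 20}
after = {"my_creatures": [], "their_creatures": [],
         "my_life": 20, "their_life": 20}
counter_it = should_counter(before, after)
```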
----------------------
So, training completed on the new version of the network! For the sake of ensuring experimental validity, I used the exact same parameters as the last run, right down to the same random seeds. For the curious, the parameters were as follows:
* 3-layer LSTM network
* 512 cells per layer
* Sequence length of 200.
* 20% dropout during training.
* Total number of model parameters: 5,421,636
The only difference was that the new LSTM cell architecture added a bias of one to the input of the forget gate, in order to make cells favor remembering inputs more strongly by default, as this, according to Jozefowicz et al. 2015, has been shown to improve the performance on language modeling tasks (in this case, the language is Magic English). The network was trained on the input corpus for 22 epochs, as this was the amount of training that was done beforehand to get reasonable results.
The purpose of this experiment was to test the hypothesis that adding a bias to the forget gate could improve on the ability of the network to "latch" onto long-term dependencies.
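In code terms, the change amounts to something like the following. This is a simplified numpy sketch of a single LSTM step, not the actual Lua cell from mtg-rnn, and the weight layout is condensed into one matrix for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simplified sketch of one LSTM step. W maps the concatenation [x, h_prev]
# to the four stacked gate pre-activations (input, forget, output, candidate).
def lstm_step(x, h_prev, c_prev, W, forget_bias=1.0):
    z = np.concatenate([x, h_prev]) @ W
    i, f, o, g = np.split(z, 4)
    f = sigmoid(f + forget_bias)  # the bias of one on the forget gate's input
    i, o = sigmoid(i), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)  # stronger default retention of c_prev
    h = o * np.tanh(c)
    return h, c
```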
Performance stats
The stats for the network without bias are:
opt:
{
max_epochs : 50
seed : 456
batch_size : 8
gpuid : 0
decay_rate : 0.95
learning_rate_decay : 0.97
opencl : 0
model : "lstm"
grad_clip : 5
print_every : 1
data_dir : "data/rarity/"
seq_length : 200
num_layers : 3
learning_rate_decay_after : 10
rnn_size : 512
train_frac : 0.95
dropout : 0.2
init_from : ""
learning_rate : 0.002
eval_val_every : 1000
val_frac : 0.05
savefile : "lstm"
checkpoint_dir : "cv"
}
val losses:
{
27000 : 0.40522396665718
11000 : 0.43034615666152
2000 : 0.52954967599261
17000 : 0.41844312371145
6000 : 0.45367849317354
8000 : 0.43866124073194
14000 : 0.42813068716686
7000 : 0.44841691309234
9000 : 0.43652370777686
1000 : 0.60924264238896
19000 : 0.41430317862563
26000 : 0.4062239958787
18000 : 0.41718120092992
15000 : 0.42158484959245
30000 : 0.40498607878948
28000 : 0.40771359385309
3000 : 0.49125206560803
4000 : 0.47159804054712
5000 : 0.46193859231935
24000 : 0.40792710901257
21000 : 0.41386697571611
22000 : 0.41006710331642
10000 : 0.43365385826878
13000 : 0.42896408045413
12000 : 0.42995822013029
16000 : 0.42130343679301
20000 : 0.41283605850446
25000 : 0.40774935972729
23000 : 0.41219892613269
31000 : 0.40365090459331
29000 : 0.4054476905393
}
The stats for the network with bias are:
opt:
{
max_epochs : 22
seed : 456
batch_size : 6
gpuid : 0
decay_rate : 0.95
learning_rate_decay : 0.97
opencl : 0
model : "lstm"
grad_clip : 5
print_every : 1
data_dir : "data/rarity/"
seq_length : 200
num_layers : 3
learning_rate_decay_after : 10
rnn_size : 512
train_frac : 0.95
dropout : 0.2
init_from : ""
learning_rate : 0.002
eval_val_every : 1000
val_frac : 0.05
savefile : "lstm"
checkpoint_dir : "cv"
}
val losses:
{
29000 : 0.42415966422768
34000 : 0.41808451826543
37000 : 0.41727624063754
9000 : 0.4557670081742
22000 : 0.43360115462838
18000 : 0.43685704383537
30000 : 0.42280681499233
36000 : 0.41604303093514
41118 : 0.41535110544844
11000 : 0.44901997411803
15000 : 0.44232666097819
23000 : 0.4298605766719
31000 : 0.42037780063207
27000 : 0.42350022059941
26000 : 0.42656072827466
6000 : 0.47754336511286
41000 : 0.41517211155697
21000 : 0.43214593154695
7000 : 0.47148974635896
33000 : 0.4185270713241
1000 : 0.78322276719851
19000 : 0.43515285028551
13000 : 0.44587366596739
24000 : 0.42775359998749
32000 : 0.42064330252176
25000 : 0.42770339956766
28000 : 0.42306344906749
3000 : 0.53275743531853
4000 : 0.50656256752883
5000 : 0.48842470494165
2000 : 0.59110461365051
10000 : 0.45312114087329
12000 : 0.448036004841
39000 : 0.41520162866
8000 : 0.46184962350728
35000 : 0.41916038004911
16000 : 0.43973640097779
20000 : 0.43246695352087
14000 : 0.44315458999431
17000 : 0.43802385492407
38000 : 0.41640224693805
40000 : 0.41500735740935
}
On paper, they are highly similar in that they have similar validation losses. The network with bias is slightly higher in terms of validation loss, but that's partially attributable to the fact that I used ever-so-slightly smaller batch sizes. On balance, one should be about as good as the other, although they may be better or worse than each other at certain things.
So, at the end of training, we landed somewhere in the solution space, and all we know for sure is that they're equidistant from being perfect. The question is: what trade-offs, if any, were made?
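For anyone eyeballing the dumps above, the quickest comparison is the best checkpoint from each run. A throwaway snippet, with the dicts abbreviated to three entries each:

```python
# Pick the best checkpoint from a val-loss dump like the ones above
# (abbreviated here; values copied from the full dumps).
no_bias = {27000: 0.40522396665718, 30000: 0.40498607878948,
           31000: 0.40365090459331}
with_bias = {38000: 0.41640224693805, 40000: 0.41500735740935,
             41000: 0.41517211155697}

best = lambda d: min(d.items(), key=lambda kv: kv[1])
# Unbiased bottoms out around 0.4037 (iteration 31000); biased around
# 0.4150 (iteration 40000), a gap of roughly 0.011 nats per character.
```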
Experiment 1: Landfall
We prime the network to produce cards whose body text begins with "landfall -" to see whether the network will insert an ability with wording found on cards with landfall.
The network without bias invented an ability that cared about lands entering the battlefield in 34/34 observed instances (100% of the time)
The network with bias invented an ability that cared about lands entering the battlefield in 27/28 observed instances (96.42% of the time)
Note that landfall shows up on only 0.307% of all Magic cards, and the text is almost always the same, so what we're really testing here is the capacity for rote memorization. Both networks perform about as well as each other here.
Experiment 2: Burning Blaze
First, we use the sampling script to test the old and new versions of the network to generate a spell titled "Burning Blaze", an instant that costs RRR and whose text starts with "Burning Blaze deals X damage". We use the same sampling parameters in both cases (temperature at 0.6), the only thing we change is the network that we use.
The network without bias completed the card by adding a clause defining X in 31/66 observed instances (46.9% of the time)
The network with bias completed the card by adding a clause defining X in 0/63 observed instances (0.00% of the time)
Now, if I make sure that it adds a comma and not a period after the "target creature/player" part, then both networks almost always continue with the "where X is equal to ____". But what we're testing here is whether the network can come up with that on its own without us prodding it in any way.
That is bizarre. I change the seed, but it makes no difference. The biased network flat out refuses to define the X. And I can try different things, like "prevent the next X damage" or "draw X cards", but it doesn't seem to matter. It's always content with the X being left undefined.
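For context on the "temperature at 0.6" setting: sampling re-weights the network's character distribution before drawing, so lower temperatures sharpen it toward the most likely character and higher ones flatten it toward uniform. A generic sketch (not the actual mtg-rnn sampling code):

```python
import math
import random

# Generic sketch of temperature sampling (not the actual mtg-rnn code).
# probs is the network's output distribution over characters.
def sample_from(probs, temperature=0.6):
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= r:
            return i
    return len(probs) - 1
```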
Second, we change up the card by making the cost xRR instead and we leave the body text unspecified.
The network without bias completed the card by adding a clause that uses X in 34/50 observed instances (68% of the time)
The network with bias completed the card by adding a clause that uses X in 28/52 observed instances (53.8% of the time)
So the biased network is okay with using the X when it shows up in the mana cost, but it does so at a measurably lower rate that's closer to random chance than the unbiased network.
But remember, the network can't be that much worse than the last, so I expect that it's compensating for this shortcoming. But how?
When generating cards, I noted that this new network sometimes comes up with the weirdest stuff that I have ever seen. Intelligible and legal stuff, but weird. Especially the rares:
Ancient Keepers
RB
Creature - Human Wizard (Uncommon)
When Ancient Keepers enters the battlefield, sacrifice it unless you return seven Warrior cards in your graveyard to the battlefield.
3/2
Colossus of the Golden Saproling
2R
Legendary Creature - Human Shaman (Rare)
Whenever you cast a blue spell, you may pay B. if you do, put a 2/2 white Elemental creature token with flying onto the battlefield. it has "sacrifice this artifact: add G to your mana pool.
1/2
#It managed to involve every color in this card.
Snow-covered Stone
2
Artifact (Rare)
All creatures have vigilance.
You may cast Snow-covered Stone as though it had flash if you control a creature with power 5 or greater.
Vizzet, the Alarm, the Tifeleeper
4U
Legendary Creature - Human Wizard (Rare)
When Vizzet enters the battlefield, you may search your library for a card with the same name as a creature card, put that card onto the battlefield tapped, then shuffle your library.
2/2
#That's an unusually long name
Shioling Artificer
3U
Creature - Merfolk Wizard (Rare)
T: Target snow land becomes a planeswalker until end of turn.
2/3
#What does that even do?
Reversal Diamond
4
Artifact (Uncommon)
3, T: You gain life equal to the damage dealt to you this turn.
#Holy stallfest, Batman!
Mana Bounty
4R
Enchantment (Rare)
Whenever a creature enters the battlefield, you may destroy target creature with the greatest power among all creatures you control.
Panic Strike
3W
Instant (Rare)
For each card in your hand, draw three cards.
#No, this seems totally balanced to me...
Corruption Trap
4G
Instant - Trap (Rare)
If an opponent controls a swamp, you may play an additional land this turn. If you do, Corruption Trap deals 5 damage to target creature.
#Fantastic sideboard card, and what a great name.
Greater Bat Garden
Land (Uncommon)
T: Add 1 to your mana pool.
3, T: Add WRRGGG to your mana pool. Draw a card.
#I love this card so much. Should probably be rare/mythic.
Thought Through the Ancestry
4G
Sorcery
Draw two cards. If you do, return Thought Through the Ancestry to its owner's hand.
#Counsel of the Soratami, in green, with built-in buyback, for 5 CMC? Sounds legit.
Dragon Serpent
4W
Creature - Spirit (Uncommon)
Flying
Whenever Dragon Serpent blocks or becomes blocked by a creature with power 3 or greater, you may pay 2. If you do, put a +1/+1 counter on Dragon Serpent.
3/3
#This is actually really cool.
Shattering Mastodon
3B
Creature - Horror (Uncommon)
Black spells you cast cost 1 less to cast for each other artifact you control.
3/3
#Because what this game really needs is a universal Affinity mechanic...
It's so strange. I have a suspicion that adding a bias to the forget gate has made the network more comfortable with longer and more exotic phrases. But I'm still working out why exactly that's happening, and why some other things, like X costs, actually seem to get worse with the change, even though a stronger memory should make long-term dependencies easier to recognize.
EDIT: TL;DR: Adding the bias made the network different. I'm still trying to determine whether that's a good thing or a bad thing.
EDIT(2): Yeah, the cards I just showed you are very typical examples of what the network produces.
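For anyone following along who hasn't looked inside an LSTM: here's roughly where the bias in question lives. This is a deliberately tiny, scalar-state sketch in plain Python, not the actual mtg-rnn code; the point is just that `forget_bias` is added to the forget gate's pre-activation, so a positive bias pushes the sigmoid toward 1 (retain more of the cell state) instead of 0 (forget it).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w, forget_bias=0.0):
    """One LSTM step with scalar state; w maps each gate to (input, recurrent, bias) weights."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2] + forget_bias)  # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])                # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])                # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])              # candidate cell value
    c_new = f * c + i * g          # old state survives in proportion to f
    h_new = o * math.tanh(c_new)
    return h_new, c_new, f

# With identical weights, a forget bias of 1 keeps more of the old cell state.
w = {k: (0.5, 0.5, 0.0) for k in "fiog"}
_, c_plain, f_plain = lstm_step(1.0, 0.0, 1.0, w, forget_bias=0.0)
_, c_bias, f_bias = lstm_step(1.0, 0.0, 1.0, w, forget_bias=1.0)
print(f_plain, f_bias)  # the biased gate sits closer to 1, so more state survives
```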
My LinkedIn profile... thing (I have one of those now!).
My research team's webpage.
The mtg-rnn repo and the mtg-encode repo.
Colossus of the Golden Saproling is awesome, but sadly missing W. Corruption Trap is disappointingly off-colour (green doesn't really do direct damage). How do the keyword-to-colour statistics break down with the new network? Can you try Kicker, 'remove a +1/+1 counter' and 'choose a colour'? Those are a few other things that require it to match up bits of the rule text; I wonder if it performs better there...
So as far as I can see, the pros of this new biased network are: weird, funky cards. The drawbacks: not much better colour-pie knowledge (though I would really like to see it trained purely on Modern cards) and seemingly random refusals to do X costs. Wowza. Hopefully this experiment was worth it, though.
Believe me, it's always worth it. Experiments often fail to turn up results, but they help us rule out possibilities and that can guide further work. It's the sort of thing I'm used to, haha.
Besides, this actually could benefit us down the road. We now know that the bias gives us more extended and often better connected clauses, even if they aren't as color-appropriate. It's possible that there's an optimal bias in between 0 and 1 that will improve our results. Hyperparameter search time!
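As a rough illustration of why an intermediate bias might be worth sweeping (purely a toy check, not an experiment I've run): the bias shifts the forget gate's average activation smoothly, so values between 0 and 1 land somewhere between "forgets as trained" and "strongly retains."

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Mean forget-gate activation over random pre-activations, for each candidate bias.
random.seed(0)
preacts = [random.gauss(0.0, 1.0) for _ in range(10_000)]
avg_gate = {
    bias: sum(sigmoid(p + bias) for p in preacts) / len(preacts)
    for bias in (0.0, 0.25, 0.5, 0.75, 1.0)
}
for bias, avg in avg_gate.items():
    print(f"bias={bias:.2f} -> mean forget gate {avg:.3f}")
```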
And yeah, I can try those tests for you in the morning.
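The keyword-to-colour breakdown is easy to tally once the generated cards are parsed. A minimal sketch over toy card dicts (the field names here are illustrative, not the mtg-encode format):

```python
from collections import Counter

# Toy parsed output; real cards would come from the generator's dump.
cards = [
    {"colors": ["R"], "text": "kicker 2 (you may pay an additional 2 as you cast this spell.)"},
    {"colors": ["G"], "text": "remove a +1/+1 counter from ~: regenerate target creature."},
    {"colors": ["U"], "text": "flying"},
    {"colors": ["R", "B"], "text": "kicker B"},
]

KEYWORDS = ("kicker", "remove a +1/+1 counter", "choose a color", "flying")

# Count (keyword, colour) pairs; multicolour cards count once per colour.
tally = Counter(
    (kw, colour)
    for card in cards
    for kw in KEYWORDS
    if kw in card["text"].lower()
    for colour in card["colors"]
)
print(tally.most_common())
```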
Oh, and fun news, I just got an update that a local magazine published an interview that they did with me. And yeah, I have a face that makes it easy for me to blend in when I attend conferences in eastern Europe.
That interview mentions you've not been contacted by any game companies; I really do wonder why. Maybe they don't want to encourage machines that could put designers out of business in 5 years.