By the way, I know we've been talking a lot about graphics lately, but I'm still interested in improving our card-generating capabilities. I've been looking into Neural Turing Machines (NTMs), but there are actually some simpler persistent-memory architectures that could deliver excellent results. I'm particularly interested in Stack-RNNs (paper here, github here). The idea is that you have persistent memory in the form of one or more stacks that the neural network can use to store and retrieve information on demand (see diagram).
Now, since we're dealing with simple stacks rather than a bidirectional array with a variable number of read/write heads (as the NTM has), it's less powerful in terms of what it can do. On the other hand, it's way easier to implement and way easier to train. From the paper I read, it looks like an RNN with an added stack does much better on memorization tasks. If you look at the graph, you can see how the predictive power of conventional RNN/LSTM networks gets weaker and weaker as the sequences get longer. Notice how the Stack-RNN does not have this issue.
There are many situations where a stack could come into play to enable the network to achieve better results. Examples:
* Observe an X in the mana cost, push the X onto the stack. At each subsequent timestep, the network can see the X, which is a reminder that the X needs to be used in the body. And if X shows up in the body first, it knows that it needs to define X later in the body.
* Observe the sequence "planeswalker" in the type line, push three or four "ability" symbols onto the stack. The network can pop these off one at a time as it gives the planeswalker an ability. When the stack is empty, it knows it's done. (A toy sketch of this bookkeeping follows the list.)
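To make the planeswalker example concrete, here's a toy, non-neural Python sketch of the bookkeeping we'd be hoping the network learns. The field names and symbols are invented for illustration.

```python
# Toy illustration only: a literal stack doing the planeswalker bookkeeping.
def emit_card_body(type_line, num_abilities=3):
    stack = []
    if "planeswalker" in type_line:
        stack.extend(["ability"] * num_abilities)  # push one marker per ability
    body = []
    while stack:                       # keep writing abilities...
        stack.pop()                    # ...consuming one marker each time
        body.append("<ability text>")  # placeholder for generated text
    return body                        # empty stack -> the card body is done

print(emit_card_body("legendary planeswalker"))  # three ability slots
```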
So this would definitely solve a lot of problems. Of course, it all comes down to whether we can train the network to do these things. However, it's nice to know that it's possible.
I'll probably end up implementing a Torch version of Facebook's Stack-RNN because it would be useful for my research (and I was running into problems training the NTMs). If it pans out, then I can apply that to Magic cards.
I've been following your thread since July, and I have to say you've truly sparked my interest in RNNs and machine learning. Are there any papers or projects you recommend to someone wanting to delve into the subject? Keep up the amazing work!
Well, I'm happy that you've enjoyed and been intrigued by the whole experience.
Now, as for recommendations, I have some, but not too many. My decade of experience with computer science has mostly centered around the clever application of hand-crafted formal logic and verification techniques to solve tough problems. I only began to investigate the possibility of teaching machines to emulate that expertise a handful of months ago. It's been a very eye-opening experience, but I'm still learning. Most of my successes have come about due to three things: intuition, patience, and the charisma/showmanship needed to recruit hearts and minds to my cause. Beyond that, it's just been a steady process of trial and error.
A fun place to start might be here: https://www.reddit.com/r/MachineLearning/wiki/index
The machine learning subreddit has a wealth of materials if you're just getting started. Other than that, I've just been looking up all the latest machine learning journals, printing out all the articles I like, and marking and annotating them, building up a body of knowledge.
EDIT: @Mustard_Fountain, still nothing yet, but I'll figure your problem out as soon as I can, lol. I haven't forgotten about you.
Hmmm, stack-rnns look pretty cool. I've got the whole long weekend to write some more code, so I'm thinking about pushing out a new rev of mtgencode, and trying to train my own MTG-oriented image recognition network with caffe.
How hard is it to implement this architecture on top of the existing char-rnn code? I'd try to do a port myself, but unfortunately I'm not that familiar with how these things actually work, so I'm not too confident in my ability to get it to work. I should read the papers more closely.
Also, has anyone used the ntm implementation here at all? If it's not too hard to set up, I'd be interested in training it up with the existing format just to see what it's capable of.
I'm actually investigating that right now. The communication between the network units and the stack is negotiated through some linear layers. The units propose a course of action (push/pop/peek), and the stack passes back a result that is then used in the subsequent computation. At this point, I'm trying to determine the best way to set everything up.
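Roughly, here's my mental model of a single continuous-stack timestep in numpy. Caveat: this is my reading of the paper's update rule, with my own names and shapes, not the actual Facebook code.

```python
import numpy as np

def soft_stack_step(s, h, W_action, W_value):
    # s: stack contents, shape (depth,); s[0] is the top
    # h: controller hidden state, shape (hidden,)
    # W_action: (3, hidden) weights -> logits for [push, pop, no-op]
    # W_value:  (hidden,) weights   -> scalar value to (softly) push
    logits = W_action @ h
    a = np.exp(logits - logits.max())
    push, pop, noop = a / a.sum()               # softmax over the 3 actions
    d = 1.0 / (1.0 + np.exp(-(W_value @ h)))    # sigmoid: the value to push

    down = np.concatenate(([0.0], s[:-1]))      # contents shifted one deeper
    up = np.concatenate((s[1:], [0.0]))         # contents shifted one shallower
    s_next = push * down + pop * up + noop * s  # blend of the three outcomes
    s_next[0] += push * d                       # the pushed value lands on top
    return s_next                               # s_next[0] is the "result" read back

s = np.zeros(8)                                 # an empty 8-deep stack
s = soft_stack_step(s, np.random.randn(16), np.random.randn(3, 16), np.random.randn(16))
```

Because every action is a differentiable blend rather than a hard choice, the whole thing trains with ordinary backpropagation.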
I attempted to adapt that NTM implementation for my research purposes, but I was running into issues integrating it with my little framework. I will figure it out eventually, but I put it on the back burner for the time being. Honestly, though, it shouldn't be overly difficult, because most of the novel stuff can be hidden from view. For our purposes, it can look like just a regular LSTM network, which should make it fairly easy to integrate.
EDIT: Emphasis on the word should. I haven't tried it yet. lol
EDIT(2): I saw that there's a list out for accepted papers at the NIPS 2015 conference. Some of them were released to the public before they actually got accepted, so some I've already seen. Relevant papers include:
* Bidirectional Recurrent Neural Networks as Generative Models: As in, we read/predict the inputs in both directions, both left and right. We actually discussed this possibility earlier in the thread, so it's nice to know that some well-tested research has come out in favor of the idea. It might help with color discipline, actually. Fun idea that this enables: bidirectional priming.
* Deep Convolutional Inverse Graphics Network: "I like the Magic art we just generated, but could you rotate the goblin about 25 degrees? That way his body is turned more towards the angel in the background."
* Skip-Thought Vectors: I've been hearing about this paper. Skip-thought vectors are like our word vectors, but on steroids. It'd let you do things like seeing the firebreathing ability and predicting the surrounding context for that ability, like a red Dragon creature. But we'll see how this one turns out. I'm waiting for them to release an implementation that I can play with.
* Teaching Machines to Read and Comprehend: Exactly what it sounds like. This is another paper that proposes a system that could pass the Magic judge rules tests by absorbing the comp rules.
And then there's a whole mess of papers about different kinds of tricks and techniques for improving performance that I'll have to look through next week.
The recent radio silence is because, fed up with trying to work through fiddly partial differential linear algebra... stuff, I have been trying to write an optimizing compiler for linear algebra + tanh and basic multivariable calculus, so I can just describe networks and have it automatically do stuff like generate gradients for recursive autoencoders.
Current status: adding is hard...
ETA: multiplying is harder...
ETA: And now I have to add in sigmoid (shouldn't actually be too hard), the various kinds of variables that can come up (mostly bureaucracy), and the functions I assumed the existence of to make addition and multiplication possible (sorry, future self). After that, I can try to write the evaluators for this.
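To give a flavor of what "generate gradients" means here, below is a toy sketch in Python. The node set (Const/Var/Add/Mul/Tanh) is invented and far smaller than what I'm actually building, but the shape of the thing is the same: every node knows how to produce the syntax tree of its own derivative.

```python
import math

class Node:
    def __add__(self, o): return Add(self, o)
    def __mul__(self, o): return Mul(self, o)

class Const(Node):
    def __init__(self, v): self.v = v
    def eval(self, env): return self.v
    def grad(self, x): return Const(0)

class Var(Node):
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]
    def grad(self, x): return Const(1 if x == self.name else 0)

class Add(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)
    def grad(self, x): return Add(self.a.grad(x), self.b.grad(x))

class Mul(Node):
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)
    def grad(self, x):  # product rule
        return Add(Mul(self.a.grad(x), self.b), Mul(self.a, self.b.grad(x)))

class Tanh(Node):
    def __init__(self, a): self.a = a
    def eval(self, env): return math.tanh(self.a.eval(env))
    def grad(self, x):  # d tanh(u) = (1 - tanh(u)^2) * du
        t = Tanh(self.a)
        return Mul(Add(Const(1), Mul(Const(-1), Mul(t, t))), self.a.grad(x))

w, v = Var("w"), Var("v")
y = Tanh(w * v + Const(1))
print(y.grad("w").eval({"w": 0.5, "v": 2.0}))  # dy/dw, evaluated numerically
```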
ETA: On reflection, I probably should have tried to write unit tests before I got 300 lines of code written.
I always tell myself to do that but I rarely follow through and end up regretting it, haha.
---
(I think) I'm making some progress with porting the Stack-RNN implementation over to Torch. There are, of course, some challenges.
For those of you who haven't touched any of the code before, Torch lets you build a graph that is a "blueprint" for the network you want to use. You say, "I'm going to have a layer of neurons, and they're going to connect to these layers downstream which do X, Y, and Z." It's actually quite elegant, because you don't have to worry about the low-level coding; the Torch libraries handle all of the wiring for you. However, sometimes you end up having to phrase things in very obtuse ways.
For example, I can't just say "output + 1" to add 1 to the output of a layer. Instead, I have to specify a network component on the blueprint that details a tiny machine whose sole purpose in life is to take an incoming number, increase it by one, and then spit out that number.
The Stack-RNN code that I'm using is written in C++ and is much closer to the metal, so I'm having to go through that code and "lift" the design into the realm of abstractions so that everything will work in Torch. It's coming along, and I'll get it figured out, but it takes time, haha.
EDIT: It almost looks like it'll work. Almost. I'm trying to figure out how to pass the state of the stacks across layers in the network, since they're shared.
EDIT(2): I think I figured out the stack stuff, but I'm getting some size mismatches when I try to run everything, which implies I made a typo somewhere. Looking into that.
EDIT(3): By the way, a useful tip for those of you working with Torch: the nngraph library lets you output modules to dot format, which can be rendered. Here's a graph of the current incarnation of the Stack-RNN with a single shared stack. I have it set up so that I can (in theory) increase the number of stacks as much as I want. But before I mess with that, I need to identify the when/what/where/why of that network component colored in red.
EDIT(4): Still getting some other input-related errors. But I'll get that figured out. On the bright side, I'm fairly certain I implemented the stack algorithm correctly. The problems I'm having are Torch-specific data handling issues at this point.
EDIT(5): After a good night's sleep, I think I found the core of my problem.
When I get the input for the current rnn_state, what I'm really getting is a matrix whose dimensions are batch_size by rnn_size. So every rnn_size-wide row belongs to a different sample in the batch.
Torch actually obscures this fact, because, as is noted in the documentation for linear layers, "If the input is a matrix, then each row is assumed to be an input sample of given batch."
And I coded my extension in such a way that it assumed that we were not working with matrices. But I can fix that, now that I know what I did wrong. That or I only ever use this code with a batch size of one, which would be ridiculous.
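In plain numpy terms, the mistake looked like this (shapes illustrative):

```python
import numpy as np

batch_size, rnn_size = 4, 8
rnn_state = np.random.randn(batch_size, rnn_size)  # what the layer hands over

# Each ROW is a separate sample of the batch, not a chunk of one sample:
sample_0 = rnn_state[0]                 # shape (rnn_size,): first sequence
assert sample_0.shape == (rnn_size,)

# So the stack state needs its own batch dimension too, e.g.
stacks = np.zeros((batch_size, 32))     # one 32-deep stack per batch sample
```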
EDIT(6): YES! Partial victory. I got it working with just one stack. I have to go back in and add the loops so I can have multiple stacks; I needed to know that everything was working for just one before I tried more of them. And it's possible that I've made a mistake somewhere, but what I can say is that it definitely completes every forward and backward pass without issue, and I get performance numbers out, so that's excellent. Hah. And it only took me 48+ hours, lol.
Congrats! You're officially a better Torch7 hacker than I am.
I'm currently working on some improvements to the format, along the lines of this post. I'm collaborating with a friend to try and do some statistical analysis on the output to see if we can guide training parameters to something approaching optimal. Currently using R with RStudio, but I can easily convert the same data to arff format for weka. I'll provide info as things progress.
Oh, cool! That would definitely be worth investigating. I don't have much experience with R as most of my analysis needs are met by Weka, but I'd love to see whatever you come up with.
As for me, I ran into a small snag when I tried to move from the one stack case to the two stack case. In the diagram below, you'll see there is a component highlighted in red. It's a linear unit, and there's a size mismatch issue. It looks like it's getting the wrong inputs, so I'll have to follow stuff back up the chain to figure out why that is happening. Probably messed up a calculation. The network isn't quite done yet either, as I think the paper has some extra recurrent matrix feature at the very end which I'll have to figure out. But it's almost working. As you can guess from the scale of the network, a lot of patient attention went into doing all the wiring, haha. If I can get it to work for the 2-stack case, it'll work for the N-stack case.
The stacks are actually a rather interesting feature. Each RNN layer makes changes to the stack in the order of depth, so layer 2 gets to see what layer 1 pushed on the stack and can vote to push more stuff on or pop stuff off. We'll see whether the network can be trained to make good use of this feature.
Ugh. Coverage is up from 23% at the start to 46%. This stuff can do basic math and a lot of the comparison code does what I expect. The missing 54% is, like, actually doing anything remotely interesting with this code. Maybe I should try to figure out how to use Hypothesis to generate some of this stuff. Or suck it up and just write out some polynomials.
I'm not sure there's actually much potential for this approach to be useful. If I want the end result to actually be performant, I need to either force the toolchain to generate efficient operations, or figure out how to link a library into this stuff.
I'm mostly interested in continuing because I'd like to learn more about interpreters, especially the optimizing kind. I don't suppose anyone's come up with some interesting DSLs related to all of this.
Hypothesis? Which tool is that? I think I've heard of it before, but if so I've never used it.
Not as of yet, no. But I'll let you know if anything interesting comes up or if I find anything in my research.
Up to 71%, but I've been fixing bugs along the way. I'm not totally sure what's wrong with this latest one. It's like it's creating a malformed... RRRRR, there was an extra comma...
76%, now, and some of the missing lines are reprs I put in to debug this when it freaks out. Actually still haven't touched the calculus and evaluator code yet.
Anyway, Hypothesis is a Python library for "property based testing". I'm working in Python, writing code that is probably RPython-compatible. This is why I'm not totally sure how useful this'll be, beyond having some kind of static analysis step that might cut down on runtime overhead afterward. I have no idea whether RPython natively creates anything "matrix-like", or if it's possible to push it in that direction. In any case, I'm really curious how the jit features might work out, with code that isn't for an esolang. (Ever tried to figure out how to hint a jit for a language with no syntactic support for looping? Not going back there.)
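For anyone who hasn't seen it, a minimal Hypothesis property test looks roughly like this; a generic example, not my actual test suite:

```python
from hypothesis import given
from hypothesis import strategies as st

# Property: reversing a list twice gives back the original list.
# Hypothesis generates the example lists; we only state the invariant.
@given(st.lists(st.integers()))
def test_double_reverse_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs

test_double_reverse_is_identity()  # callable directly, or run under pytest
```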
ETA: Tossed in some basic evaluation and derivative stuff. Currently at 84%, and it's probably going to plummet when I enable branch checking.
Oh, how fascinating! Most of my work deals with static analysis, property-based testing, and model checking. Model checking is the focus of my dissertation, and I aim to use combinations of static analysis and machine learning to predict property violations. In short, I use elaborate sorcery to cheat my way out of otherwise intractable situations.
I'll have to take a look at this Hypothesis library of yours. Sounds fun.
EDIT: For the record, I found the source of my I/O problem with the Stack-RNN. I would input things in a particular order but then I would output the results in reverse order. The problem is that the output gets fed back in as input, so after the first pass I would end up getting everything confused and errors resulted. Fixing that.
Not sure if it was clear, but Hypothesis is a third-party library that I heard about. What I'm doing right now is putting together code for manipulating syntax trees. The general idea is that I've implemented some very naive code to work out the syntax tree stuff, and I'm leaving the way open to, for example, converting from these somewhat-flexible trees to something like an optimized pipeline.
Potentially, I won't need RPython at all, and pypy will be sufficient, especially if the new vecopt stuff from this summer makes it in.
Anyway, I'm at 95%. Currently missing: comparisons between constant and non-constant sums, comparisons between sums and their negations, and comparisons between sums of different lengths. Actually, I might be able to just remove some of this code. Anyway, there's also evaluating 'negative' sums, coercing products with a negative coefficient to sums, comparing products that differ only in their coefficient, taking the derivative of products involving powers higher than 1 (both with just one component and with multiple components), coercing sigmoids to products, and comparing sigmoids that differ only in sign.
Let's see what removing some code does... 97%.
Depending on how much investment you want to put into this, I might recommend you use the ROSE compiler framework if you're having to do a lot of AST work. I'm throwing that out there because I've contributed to the development of the tool, and I can tell you that its syntax tree manipulation capabilities are pretty much the best out there (all things considered). With ROSE, I can take code written in a variety of languages and do any number of interesting and exotic transformations and analyses on it.
Just a thought.
Example: I can produce a ROSE tool that translates Matlab code to C++, optimizes the resulting code, and translates all the comments in the code from English to Esperanto (because why not?).
---
So, it looks like I can support a variable number of stacks now. I still need to make a pass over everything to make sure it's all working and that I've properly replicated what's described in the paper, but it's looking very promising. I'm running it just for fun, though I think it's not quite complete. I just want to see if the loss explodes at this point or if it settles.
EDIT: Well, the good news is that it doesn't cause loss to explode. The system does get more accurate with time. However, the convergence rate is slow, and that's probably because I'm not done implementing everything yet.
Right now the connection from the stacks to the network units is protected by a linear layer. This weights the transmissions from the stacks at each time step, so different units can turn some channels down and others up depending on what interests them, and it also means that you can insulate critical, sensitive components from outside chatter.
What's missing from my implementation is a recurrent matrix that would let you amplify or reduce signals from the stacks dynamically, and I've discovered that this becomes important as you increase the number of stacks. I noticed that after about 10 to 15 stacks I would start getting exploding losses. That's problematic, especially since the paper I'm reading shows good results when you have a handful of network units and 60+ stacks.
Why? Well, imagine that you're sitting in front of N radios, listening for interesting things. When there's only one radio, that's easy. When there are two, well, you can manage. But when there are, say, ten, and only one of them is broadcasting the information you want, it gets kind of hard to hear over the others. But you don't want to turn all the other radios down, because they too might deliver useful information. So at that point what you really want is some kind of selective attention mechanism (something that would grant a cocktail party effect, if you will).
I'll need to implement that bit as described in the paper. That would be helpful.
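What I have in mind is something like softmax gating over the stack tops. A sketch of the general idea in numpy (not the paper's exact recurrent matrix, and the names are mine):

```python
import numpy as np

def mix_stack_tops(h, tops, W_att):
    # h:     controller hidden state, shape (hidden,)
    # tops:  top cell of each stack ("radio"), shape (n_stacks,)
    # W_att: learned attention weights, shape (n_stacks, hidden)
    scores = W_att @ h                   # one relevance score per stack
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax: the volume knobs
    return w @ tops                      # weighted mix handed to the units

h, n_stacks = np.random.randn(16), 12
print(mix_stack_tops(h, np.random.randn(n_stacks), np.random.randn(n_stacks, 16)))
```

The point is that the total "volume" always sums to 1, so adding more stacks can't drown the units in an ever-louder wall of noise.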
99%, and... there's a bug in the remaining lines. I'm probably subtracting 1 when I shouldn't.
Or... passing a constant instead of the proper variable...
And, currently at 100%. I should add one more assert, just to be sure, and then I can enable branch checking.
And, three partial lines. I'm inclined to turn some of them into asserts stating that the conditions that aren't hit will never be hit.
Still one partial? Okay... Now I've got it. Now, I'm going to take a break, then figure out how to specify networks and error functions for this thing.
ETA: Things to try out, potentially:
* Exotic network architectures. This code can encode an arbitrary directed acyclic graph of algebraic operations and sigmoids, and it can calculate the gradient for them, given an error function.
* More memoization: the node operations are memoized, but evaluation is not. Key caching to state, and I can put together a lightweight result cache that means each node gets evaluated exactly once. (The only evaluation strategy I've implemented should be unacceptably slow for any interesting problem.)
* More constant folding; not too interesting, given that the only other thing I can really think to fold is sigmoid(0) = 0.
* More interesting evaluation strategies. What I have right now is a giant flat loop, because that's easy. It should be possible to recover the linear algebraic aspects of a given Sum node, and try to generate efficient matrix code for it. I don't really know how to manipulate the cache at this level, though. Not sure it's possible without customizing RPython.
* Instead of speed, accuracy. My ASTs are representation-agnostic; all constants are integers. It should be possible to develop evaluation strategies that use arbitrary-precision lazy values. The main problem with this is that it would encode the entire history of the network, which would make it quick to train, but slow to evaluate, and huge in memory.
I should probably focus on result memoization first. That's a generally useful thing.
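A sketch of what that result cache could look like; the node interface here is hypothetical, not my actual classes:

```python
class N:
    """Minimal DAG node: op is a function of the child values (hypothetical)."""
    def __init__(self, op, *children):
        self.op, self.children = op, children

def evaluate(node, cache):
    """Evaluate a DAG, computing each node exactly once via the cache."""
    if id(node) not in cache:
        args = [evaluate(c, cache) for c in node.children]
        cache[id(node)] = node.op(*args)
    return cache[id(node)]

# A shared subexpression: s feeds two parents but is evaluated only once.
s = N(lambda: 2.0)
root = N(lambda a, b: a + b, N(lambda x: x * x, s), s)
print(evaluate(root, {}))  # 6.0
```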
ETA: Done. Probably doesn't compile, but I'll figure that out later.
ETA: It's memoized out the wazoo, but there's still no good API. I guess that's next.
I made some forays into getting the rest of the Stack-RNN implementation working. Evidently I'll be running up against some memory issues on the machine I'm currently using if I want to scale everything up, but this won't be a problem on the 20-core machine with the Tesla GPU. I'm in the process of getting Torch set up for that environment, so I'm excited to see how fast I can get things to run.
Also, I did some tests, and the stacks seem to be working. When I trace their contents over time I'm definitely seeing pushing and popping of numbers going on, so that's nice.
Ran into some things, and had some ideas. I just found out about interactive activation and competition networks, and they sound interesting, but I've just gone over the very-high-level information about them.
No closer to having an API to create networks, but it seems I was more interested in coming up with more exotic numeric representations, efficiency be damned. Anyone know anything about representing algebraic numbers? I know of one means of doing so, but I'm sure there are others.
So when do you think you'll be able to train a Stack-RNN network on the MtG corpus? It sounds like it'll be a fascinating result.
Yes, that's the plan! Right now I'm running into issues with adding lots of stacks because unless you can dynamically control which stacks a unit pays attention to, they can end up inundated with more data than they want or need. Working on fixing that.
When I try to train the network as is, with, say, 4 stacks, I get convergence towards acceptable levels of accuracy, but the convergence is very slow and I don't think the end result will be as good as what we get with an LSTM already. The noise issue is a contributing factor. That and I would like to have lots of stacks.
EDIT: I should be able to do the training as soon as the 20-core machine has everything I want installed on it. I'm working out everything with the machine's keepers.
Do you mean in a purely symbolic way? Or like an obscenely long lazy computation?
I'd need to actually implement one of my ideas to have an idea of how it would perform, but the basic idea I have is that, using continued logarithms, it's possible to represent algebraic numbers as numbers that terminate with a repeating cycle. Not necessarily lazy, but probably quite slow and large. Therefore, not where I should be focusing my efforts if I want results. Let's see what happens if I just do this.
ETA: Well, the network immediately converged on garbage. There are several possible reasons for that, which I'll figure out and look into tomorrow. There are some promising things, though. Pypy is noticeably faster, which means that it might, might, work out well for larger data sets once I clear up the whole garbage thing. If I can make this work in regular Python, there are some speedups I can add to improve things, and possibly some cache tweaks that would make my code more flexible.
Of course, first priority is working out why my code is somewhere around square -2 in terms of result quality.
ETA: Two mistakes: a sign error in the factory function meant that this would have been an error maximizer... if a 'refactor' hadn't turned this from "gradient descent" into "gradient oops". Fix those problems, and I end up with perf of 1/4th the speed of the numpy version, and allegedly slightly better accuracy.
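For posterity, the sign convention that bit me, in one toy loop (minimizing f(w) = w**2; purely illustrative):

```python
def grad(w):            # f(w) = w**2, so f'(w) = 2*w
    return 2.0 * w

w, lr = 5.0, 0.1
for _ in range(100):
    w -= lr * grad(w)   # descent: step AGAINST the gradient
    # w += lr * grad(w) # flip that sign and you've built an error maximizer
print(w)                # ~0.0: converges to the minimum
```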
The failure mode is about the same. If I can figure that out (either by implementing Newton's method, which is now more achievable, relatively speaking, or by some other set of tweaks), then (and only then?) will it make sense for me to start messing with weirder architectures.
And, again, if I throw my lot in with Pypy instead of RPython (which is not generally a bad idea, anyway), I should be able to deal with some performance issues in all this.
ETA: WHAT. I just got better results by telling it to maximize the error, then negating the output. That... shouldn't make any sense.
ETA: Not a magic bullet. Changed the seed, got equivalently bad data. I guess the next step really is a twin-pronged approach of "make this code faster" and "make the algorithm better". But now the algorithm is coded abstractly. And if I commit to Pypy... oh, such things I can do, such techniques unavailable in RPython...
http://wdfw.wa.gov/wildwatch/owlcam/graphics/barn-owl-family.jpg
No, wait, that's a newly spoiled BFZ card!