Go & Neural Net
Essay by review • March 16, 2011
the networks
The authors tried a variety of networks. The paper diagrams one sample network, but many others were experimented with, and the sample network doesn't use all the techniques mentioned in the paper.
The networks were trained by the temporal difference algorithm TD(0) to predict not the overall result of the game but the owner of each point on the board at the end of the game. (The winner of a go game is the player who controls more points at the end of the game, after a small adjustment, known as komi, that compensates for the first player's advantage.) Predicting ownership point by point gives the networks more information to learn from in each game than a single win-or-loss signal would.
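To make the training signal concrete, here is a minimal sketch of TD(0) targets for per-point ownership. It assumes ownership is encoded as +1 for Black and -1 for White, with no discounting and no reward before the end of the game. The function names are illustrative and not taken from the paper. In a real network the error would be backpropagated through the weights rather than added to the prediction directly.

```python
from typing import List

Vector = List[float]  # one entry per board point: +1 = Black owns it, -1 = White


def td0_targets(predictions: List[Vector], final_owner: Vector) -> List[Vector]:
    """Target for position t is the prediction at t+1; the last target is the truth."""
    return [list(p) for p in predictions[1:]] + [list(final_owner)]


def td0_errors(predictions: List[Vector], final_owner: Vector) -> List[Vector]:
    """Per-point TD(0) errors: target minus current prediction, for every position."""
    targets = td0_targets(predictions, final_owner)
    return [[t - p for p, t in zip(pred, targ)]
            for pred, targ in zip(predictions, targets)]


def nudge(prediction: Vector, error: Vector, learning_rate: float) -> Vector:
    """Move a prediction a fraction of the way toward its target."""
    return [p + learning_rate * e for p, e in zip(prediction, error)]


# Two successive positions from a toy game on a 2-point "board".
game = [[0.0, 0.0], [0.5, -0.5]]
errors = td0_errors(game, final_owner=[1.0, -1.0])
```

Every point on the board contributes its own error at every position, which is where the extra information comes from.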
Features of a go position are approximately translation-invariant. In other words, a configuration of stones is about as valuable in one place as another, all else being equal, although it may depend on how close it is to the edge of the board. A network which does not take this into account will have to learn the value of each configuration at each location it can appear.
Therefore the networks learned feature detectors that were scanned over their inputs, so that each layer of a network was an explicit feature map. The sample network has two hidden layers, connected in parallel rather than in series, which were added at different points during the network's training.
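Scanning a feature detector over the board can be sketched as follows: one small set of weights is applied at every point, so a pattern learned in one place is recognized everywhere. This is only an illustration under assumptions of my own: the board is a square grid of numbers (say +1, -1 and 0), points off the board count as 0, and the output is squashed with tanh. The paper's networks may treat the edge and the nonlinearity differently.

```python
import math
from typing import List

Grid = List[List[float]]


def scan_detector(board: Grid, kernel: Grid, bias: float = 0.0) -> Grid:
    """Apply one shared-weight detector at every point, producing a feature map."""
    n = len(board)
    k = len(kernel)          # kernel is k x k, with k odd
    r = k // 2
    feature_map = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = bias
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    # Assumption: off-board points contribute nothing.
                    if 0 <= yy < n and 0 <= xx < n:
                        total += kernel[dy][dx] * board[yy][xx]
            feature_map[y][x] = math.tanh(total)
    return feature_map
```

Because the same kernel is used everywhere, moving a stone by one point moves its response in the feature map by one point. Only near the edge, where part of the kernel falls off the board, does the response change.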
Some networks (but not the sample network) were forced to obey the symmetry of the go board by being constrained to learn symmetrical features. The paper says, "Although this is clearly beneficial during the evaluation of the network against its opponents, it appears to impede the course of learning."
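One simple way to obtain a symmetrical feature detector, shown here only as an illustration, is to average a square kernel over the eight rotations and reflections of the board. This is an assumption about how such a constraint could be imposed, not the paper's own mechanism, and it ignores any symmetry between the two colors.

```python
from typing import List

Grid = List[List[float]]


def rotate90(k: Grid) -> Grid:
    """Rotate a square grid a quarter turn clockwise."""
    return [list(row) for row in zip(*k[::-1])]


def mirror(k: Grid) -> Grid:
    """Reflect a grid left to right."""
    return [row[::-1] for row in k]


def dihedral_images(k: Grid) -> List[Grid]:
    """All eight images of a square grid under rotation and reflection."""
    images = []
    current = [list(row) for row in k]
    for _ in range(4):
        images.append(current)
        images.append(mirror(current))
        current = rotate90(current)
    return images


def symmetrize(k: Grid) -> Grid:
    """Average a kernel over its eight images, giving a fully symmetrical kernel."""
    images = dihedral_images(k)
    n = len(k)
    return [[sum(img[y][x] for img in images) / 8 for x in range(n)]
            for y in range(n)]
```

A symmetrized kernel has fewer free parameters than an unconstrained one. The weight of a corner entry, for instance, is shared among all four corners.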
The networks played by making a one-ply search, evaluating every possible move. Nici Schraudolph wrote me that, although he used incremental techniques (not mentioned in the paper) to evaluate networks quickly, they were still too slow for full 19x19 go.
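A one-ply search is short to write down. In the sketch below, `legal_moves`, `play` and `evaluate` are hypothetical callables supplied by the caller. In the setting described here, `evaluate` would be the network's judgment of the resulting position, for instance the sum of its ownership predictions from the mover's point of view, which is my assumption.

```python
from typing import Callable, Dict, Hashable, List, Optional, TypeVar

Position = TypeVar("Position")
Move = Hashable


def one_ply_values(position: Position,
                   legal_moves: Callable[[Position], List[Move]],
                   play: Callable[[Position, Move], Position],
                   evaluate: Callable[[Position], float]) -> Dict[Move, float]:
    """Evaluate the position reached by each legal move, looking one move ahead."""
    return {move: evaluate(play(position, move))
            for move in legal_moves(position)}


def best_move(position: Position,
              legal_moves: Callable[[Position], List[Move]],
              play: Callable[[Position, Move], Position],
              evaluate: Callable[[Position], float]) -> Optional[Move]:
    """Return the move leading to the highest-valued position, or None if no move is legal."""
    values = one_ply_values(position, legal_moves, play, evaluate)
    if not values:
        return None
    return max(values, key=values.get)
```

The cost is one network evaluation per legal move, which is why evaluation speed mattered so much on larger boards.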
training
Training by self-play alone was found to be slow. Training against a skilled opponent was faster because the networks could learn from their opponents. The opponents used were a random player, useful to start off training; Wally, a weak public domain program; and Many Faces of Go, a comparatively strong commercial program.
Like any learner, the networks learned best from opponents not too far from their own strength. Networks started out knowing nothing, and so needed weak opponents. Wally was modified to play a certain proportion of random moves, and the proportion was reduced as the network improved. Against Many Faces, games were played with standard go handicaps.
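The idea of weakening an opponent with random moves can be sketched like this. The mixing rule follows the description above, but the schedule in `next_random_fraction` is purely hypothetical. The essay does not say how the proportion was reduced, and the threshold and step values are made up for illustration.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Move = TypeVar("Move")


def mixed_move(legal_moves: Sequence[Move],
               engine_choice: Callable[[Sequence[Move]], Move],
               random_fraction: float,
               rng: random.Random) -> Move:
    """Play a uniformly random legal move with the given probability, else the engine's move."""
    if rng.random() < random_fraction:
        return rng.choice(list(legal_moves))
    return engine_choice(legal_moves)


def next_random_fraction(fraction: float,
                         network_win_rate: float,
                         threshold: float = 0.75,
                         step: float = 0.25) -> float:
    """Hypothetical schedule: make the opponent less random once the network wins often enough."""
    if network_win_rate >= threshold:
        return max(0.0, fraction - step)
    return fraction
```

With `random_fraction` at 1.0 the opponent is the pure random player, and at 0.0 it is the unmodified program, so one parameter covers the whole range between them.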
Because go is a deterministic game, there was a risk that a network might never explore some options, because it falsely thought it "knew better". That would leave blind spots in its understanding. The problem was solved by introducing randomness into the network's play, using Gibbs sampling: moves were chosen at random, with better-valued moves more likely to be picked.
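Gibbs sampling over moves can be sketched as follows: each move is chosen with probability proportional to the exponential of its value divided by a temperature. The function names and the particular temperature values are illustrative assumptions. The essay does not give the settings used.

```python
import math
import random
from typing import List, Sequence, TypeVar

Move = TypeVar("Move")


def gibbs_probabilities(values: Sequence[float], temperature: float) -> List[float]:
    """Probability of each move, proportional to exp(value / temperature)."""
    highest = max(values)
    # Subtracting the maximum avoids overflow and leaves the probabilities unchanged.
    weights = [math.exp((v - highest) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_move(moves: Sequence[Move],
                values: Sequence[float],
                temperature: float,
                rng: random.Random) -> Move:
    """Pick a move at random according to the Gibbs probabilities."""
    probabilities = gibbs_probabilities(values, temperature)
    return rng.choices(list(moves), weights=probabilities, k=1)[0]
```

A high temperature makes play nearly uniform, so everything gets explored. A low temperature makes the network almost always play its favorite move.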
Networks that played too much against one opponent risked over-fitting to that opponent, hurting their results against other opponents.
results
The
...
...