I got sick of writing organized blog posts, so I’m going to compromise on clean blog posts and just dump thoughts.
DeepMind Go Self-Play Paper
Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.
\(f\) : Network
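\(f\): board state, history → (multinoulli over actions, value of state \(v\), an estimate of how likely the current player is to win from this state). The paper's value head ends in a tanh, so \(v \in [-1, 1]\) with the game outcome \(z \in \{-1, +1\}\) as its target, rather than a literal probability in \((0, 1)\).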
I love the term multinoulli.
Combine policy and value function into one network with two “heads”. I feel guilty about liking the term “heads” because “outputs” is a simpler term, but then I can’t say that a neural network is obviously just a smooth function without feeling uncomfortable, since functions don’t have multiple outputs. The lies I tell myself.
Really, still 2 functions that share a bunch of intermediate functions
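To make the "two functions sharing intermediate functions" reading concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the class name `TwoHeadedNet`, the layer sizes, the random weights, and the single tanh layer standing in for the trunk (the real network is a deep residual tower). The value head uses a sigmoid so that \(v\) lands in \((0, 1)\) like in my notes; swap in `math.tanh` if you want the \([-1, 1]\) range.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a multinoulli (categorical) distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TwoHeadedNet:
    """Hypothetical toy network: one shared trunk, two small heads."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W_trunk = init(n_hidden, n_inputs)    # shared intermediate function
        self.W_policy = init(n_actions, n_hidden)  # policy head
        self.W_value = init(1, n_hidden)           # value head

    def trunk(self, x):
        return [math.tanh(h) for h in matvec(self.W_trunk, x)]

    def policy(self, x):
        """Function 1: state -> distribution over actions."""
        return softmax(matvec(self.W_policy, self.trunk(x)))

    def value(self, x):
        """Function 2: state -> scalar in (0, 1)."""
        return 1.0 / (1.0 + math.exp(-matvec(self.W_value, self.trunk(x))[0]))

    def __call__(self, x):
        """Both heads at once, computing the shared trunk only one time."""
        h = self.trunk(x)
        p = softmax(matvec(self.W_policy, h))
        v = 1.0 / (1.0 + math.exp(-matvec(self.W_value, h)[0]))
        return p, v
```

Calling `net(x)` returns the pair `(p, v)`, while `net.policy(x)` and `net.value(x)` are the two honest single-output functions hiding inside it.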
View MCTS as policy improvement operator, since it can perform the \(n\)-step lookahead that policy improvement demands.
MCTS forms the improvement step of policy iteration: the move probabilities that come out of the search are the training data the network is then fit to.
The network is the evaluate step.
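A sketch of what "fit the network to the search" looks like as a loss, following the form given in the paper: squared error between the predicted value \(v\) and the game outcome \(z\), plus cross-entropy between the search probabilities \(\pi\) and the network's policy \(p\). The function name and the `eps` guard are my own, and I have left out the paper's L2 weight-regularization term to keep it short. It assumes \(z\) is on the same scale as \(v\).

```python
import math


def search_target_loss(p, v, pi, z, eps=1e-12):
    """Loss for one position (hypothetical helper, no weight regularization).

    p  : network policy over actions (list of probabilities)
    v  : network value for the position
    pi : move probabilities produced by the search (the improved policy)
    z  : final game outcome, on the same scale as v
    """
    value_loss = (z - v) ** 2
    # Cross-entropy pulls p toward the search distribution pi.
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss


pi = [0.7, 0.2, 0.1]                                          # what the search preferred
before = search_target_loss([0.2, 0.3, 0.5], 0.0, pi, 1.0)   # network disagrees
after = search_target_loss([0.7, 0.2, 0.1], 1.0, pi, 1.0)    # network matches search
```

Here `after` is smaller than `before`: the loss drops as the raw network output moves toward what the search found, which is the whole training signal.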
Human knowledge and prebuilt heuristics seem to have actually hurt AlphaGo Master: AlphaGo Zero, trained without human games or handcrafted features, beat it.
DON’T TINKER WITH THE COMPUTER. DON’T TWEAK SHIT. USE COMPUTE RATHER THAN BRAINS.
Even typing that made me feel ill.
MCTS: policy → better policy
Find fixed point w.r.t. MCTS. Once found, MCTS not necessary at play time.
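A toy picture of the fixed-point idea. To be loud about it: `improve` below is NOT MCTS. It is a made-up stand-in operator that reweights the current policy by `exp(q / temperature)` using fixed, known action values `q_values`, and "training" is idealized as the policy copying the operator's output exactly. The only point is the shape of the loop: apply the operator, move the policy to its output, stop when the operator no longer changes anything.

```python
import math


def improve(policy, q_values, temperature=1.0):
    """Hypothetical stand-in for search: sharpen the policy toward better actions."""
    weights = [p * math.exp(q / temperature) for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def iterate_to_fixed_point(policy, q_values, tol=1e-9, max_iters=10_000):
    """Repeat policy -> improve(policy) until the operator stops changing it."""
    for step in range(max_iters):
        better = improve(policy, q_values)
        gap = max(abs(a - b) for a, b in zip(policy, better))
        policy = better  # idealized training: the policy becomes the search output
        if gap < tol:
            return policy, step
    return policy, max_iters


q_values = [0.1, 0.5, 0.2]          # assumed action values for a single toy state
start = [1 / 3, 1 / 3, 1 / 3]       # uniform initial policy
final, steps = iterate_to_fixed_point(start, q_values)
best_raw = max(range(len(final)), key=lambda a: final[a])
best_searched = max(range(len(final)), key=lambda a: improve(final, q_values)[a])
```

At the fixed point the raw policy and the "searched" policy pick the same move, which is the sense in which the search stops being necessary at play time.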
I wonder how much heuristics will matter in the future. Is Gerd Gigerenzer right?
I wonder if the network has a probability at which it gives up.
Against My Better Instincts
When I rode the subway in New York, I would get off at Canal Street every morning. I would walk up the platform as the train departed, making a point of getting closer and closer to the train every day without actually touching it. It was a good way to get pumped for the day, but I doubt that’s why I did it.