Notes for Sat, Nov 11 2017

I got sick of writing organized blog posts, so I’m going to compromise on clean blog posts and just dump thoughts.

DeepMind Go Self-Play Paper

Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.

\(f\): the network

\(f\): (board state, history) → (multinoulli over actions, value \(v \in [-1, 1]\) of the state, i.e. the expected game outcome for the player to move, with +1 a win and -1 a loss)

I love the term multinoulli.

Combine the policy and value functions into one network with two “heads”. I feel guilty about liking the term “heads” because “outputs” is simpler, but then I can’t say that a neural network is obviously just a smooth function without feeling uncomfortable, since functions don’t have multiple outputs. The lies I tell myself.

Really, it’s still two functions that share a bunch of intermediate functions.
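To make the “two functions sharing intermediate functions” point concrete, here is a minimal sketch in plain Python. The sizes, random weights, and the single dense layer standing in for the shared part are all hypothetical toy choices. The paper’s network is a deep residual convolutional network over stacked board planes. What carries over is the shape: one shared computation, a softmax policy head, and a tanh value head, so \(v\) lands in \([-1, 1]\).

```python
import math
import random

# Hypothetical toy sizes: a 3x3 board flattened to 9 inputs,
# 9 moves plus a pass, and a small shared hidden layer.
N_INPUTS = 9
N_HIDDEN = 16
N_ACTIONS = 10


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in) and zero biases."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def dense(x, layer):
    """Affine map: one output per weight row."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def softmax(z):
    """Numerically stable softmax, giving a multinoulli over moves."""
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]


def make_net(seed=0):
    rng = random.Random(seed)
    return {
        "trunk": init_layer(N_INPUTS, N_HIDDEN, rng),    # shared by both heads
        "policy": init_layer(N_HIDDEN, N_ACTIONS, rng),  # policy head
        "value": init_layer(N_HIDDEN, 1, rng),           # value head
    }


def f(net, state):
    """f(state) -> (p, v): move probabilities and a scalar value in [-1, 1]."""
    h = [max(0.0, z) for z in dense(state, net["trunk"])]  # shared ReLU features
    p = softmax(dense(h, net["policy"]))
    v = math.tanh(dense(h, net["value"])[0])
    return p, v
```

With an all-zero input the shared features are all zero, so `f(make_net(), [0.0] * 9)` returns a uniform policy (0.1 per move) and \(v = 0\). Both heads read the same `h`, which is exactly the “shared intermediate functions” part.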

View MCTS as a policy improvement operator: its multi-step lookahead from the current position turns the network’s raw move probabilities into stronger ones.
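The lookahead is steered by the network. At each node the paper’s search picks the move maximizing \(Q(s,a) + U(s,a)\), where \(U(s,a) = c_{\text{puct}} P(s,a) \sqrt{\sum_b N(s,b)} / (1 + N(s,a))\) favors moves with a high prior and few visits. The sketch below covers only that selection rule, not expansion, backup, or the tree itself. The function name and the default `c_puct` are hypothetical choices.

```python
import math


def puct_select(priors, visits, q_values, c_puct=1.0):
    """Return the index of the child maximizing Q(s, a) + U(s, a).

    priors:   P(s, a), move probabilities from the policy head
    visits:   N(s, a), visit count of each child
    q_values: Q(s, a), mean value of each child's subtree
    c_puct:   exploration constant (the default here is arbitrary)
    """
    total = sum(visits)

    def score(a):
        # Exploration bonus: large prior and few visits -> large bonus.
        u = c_puct * priors[a] * math.sqrt(total) / (1 + visits[a])
        return q_values[a] + u

    return max(range(len(priors)), key=score)
```

For example, with priors `[0.6, 0.3, 0.1]`, visits `[10, 2, 0]`, and Q values `[0.1, 0.2, 0.0]`, the rule picks move 1: its higher Q and low visit count outweigh move 0’s larger prior.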

MCTS is the improvement step of policy iteration: its search probabilities are the training targets the network actually learns from.

The network is the evaluation step: its value head is trained on the outcomes of self-play games.
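Both halves meet in the paper’s training loss, \(l = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c\lVert\theta\rVert^2\). The MCTS probabilities \(\boldsymbol{\pi}\) are the policy target (improvement), and the self-play outcome \(z\) is the value target (evaluation). A minimal sketch for a single position follows. Flattening the parameters \(\theta\) into a plain list is a simplification, and the function name and default `c` are hypothetical.

```python
import math


def alphago_zero_loss(z, v, pi, p, weights, c=1e-4):
    """(z - v)^2 - pi . log p + c * ||theta||^2 for one training position.

    z:       game outcome for the player to move (+1 win, -1 loss)
    v:       value head output
    pi:      MCTS search probabilities (policy target)
    p:       policy head output
    weights: all network parameters flattened into one list
    c:       L2 regularization strength (the default here is arbitrary)
    """
    value_loss = (z - v) ** 2
    # Terms with pi_a == 0 contribute nothing; skipping them avoids log(0).
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2 = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2
```

With \(z = 1\), \(v = 0.5\), and \(\boldsymbol{\pi} = \mathbf{p} = (0.5, 0.5)\), the loss is \(0.25 + \ln 2\): the value error plus the entropy of the target, which is the floor for cross-entropy.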

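Human knowledge and prebuilt heuristics actually held AlphaGo back: AlphaGo Zero, trained with neither, beat AlphaGo Master, which was initialized from human data and used handcrafted features.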

Even typing that made me feel ill.

MCTS: policy → better policy
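Concretely, the input policy is the network’s \(\mathbf{p}\), and the “better policy” is read off the root visit counts, \(\pi_a \propto N(s,a)^{1/\tau}\), where the temperature \(\tau\) controls how greedy it is. A minimal sketch, with a hypothetical function name:

```python
def search_policy(visits, tau=1.0):
    """Turn root visit counts N(s, a) into pi(a | s) proportional to N(s, a)^(1/tau)."""
    if tau == 0:
        # The tau -> 0 limit: all probability on the most-visited move.
        best = max(range(len(visits)), key=lambda a: visits[a])
        return [1.0 if a == best else 0.0 for a in range(len(visits))]
    weights = [n ** (1.0 / tau) for n in visits]
    total = sum(weights)
    return [w / total for w in weights]
```

Visit counts `[30, 10, 0]` give \(\boldsymbol{\pi} = (0.75, 0.25, 0)\) at \(\tau = 1\). Lowering \(\tau\) sharpens it toward the most-visited move (\((0.9, 0.1, 0)\) at \(\tau = 0.5\)).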

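Find the fixed point w.r.t. MCTS. Once it is found, MCTS would in principle be unnecessary at play time, since the raw network would already play the policy the search produces. (In practice, the paper still runs MCTS at play time, and the raw network alone is much weaker.)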
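Here is a toy illustration of the fixed-point idea. `improve` is a hypothetical stand-in for MCTS, not the real search. On a three-armed bandit with known action values `q`, it moves the policy partway toward the greedy choice. If you iterate it until the policy stops changing, you get a policy the operator leaves alone. That is the sense in which, at the fixed point, running the operator adds nothing.

```python
def improve(pi, q, alpha=0.5):
    """Hypothetical stand-in for MCTS: mix pi toward the greedy policy for q.

    alpha (the step toward greedy) is an arbitrary toy choice.
    """
    best = max(range(len(q)), key=lambda a: q[a])
    greedy = [1.0 if a == best else 0.0 for a in range(len(q))]
    return [(1 - alpha) * p + alpha * g for p, g in zip(pi, greedy)]


def find_fixed_point(pi, q, tol=1e-9, max_iters=1000):
    """Apply the improvement operator until the policy stops changing."""
    for i in range(max_iters):
        new_pi = improve(pi, q)
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi, i + 1
        pi = new_pi
    return pi, max_iters
```

Starting from a uniform policy with `q = [0.1, 0.7, 0.2]`, the iteration converges to (almost exactly) always playing arm 1, and applying `improve` once more barely moves it.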

I wonder how much heuristics will matter in the future. Is Gerd Gigerenzer right?

I wonder if the network has a probability at which it gives up.

Against My Better Instincts

When I rode the subway in New York, I would get off at Canal Street every morning. I would walk up the platform as the train departed, making a point of getting closer and closer to the train every day without actually touching it. It was a good way to get pumped for the day, but I doubt that’s why I did it.
