Understanding the fundamentals in chapter 5, especially over- and underfitting, really pays off. This chapter is easy to read if you can even half-remember those earlier sections.
Their methodology is essentially good engineering practice applied to ML:
- Figure out what you want (your goal)
- Set up a metric to measure progress toward that goal
- Set up basic end-to-end pipeline ASAP. Doesn’t have to be good just yet, but it does have to be capable of measuring how well each piece is doing (so you can improve it).
- Run the pipeline and see where it fails
- Improve that
- Repeat till you get acceptable performance
You can’t optimize everything since there’s only so much time and money, so focus on what will give the most improvement for cheap. That’s problem-specific.
Setting Goals
If you’re in research, this is basically to beat the existing benchmark or do something cool.
If you’re in industry, it’s whatever keeps the wolf from the door.
Hey, what did you expect me to say? It’s not like I know what your task is.
A common metric is accuracy (equivalently, one minus the 0-1 loss). But in general, not all errors are equal. Consider the worst-named terms in the world: type I errors (false positives) and type II errors (false negatives). Often one matters far more than the other.
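A minimal sketch of that point, with made-up numbers for an imbalanced binary task: accuracy looks respectable while precision and recall expose the two error types separately.

```python
import numpy as np

# Made-up labels for an imbalanced binary task (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # type I errors (false alarms)
fn = np.sum((y_pred == 0) & (y_true == 1))  # type II errors (misses)
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)   # 0.7 here, looks okay-ish
precision = tp / (tp + fp)           # 0.5: half the alarms are false
recall = tp / (tp + fn)              # 0.33: we miss most positives
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```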
Your system doesn’t have to be perfect, just better/cheaper than humans. If it’s not confident about predictions, it can pass responsibility onto some poor human.
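A sketch of that hand-off, assuming you have predicted class probabilities; the 0.9 threshold is a hypothetical cutoff you would tune against whatever coverage/accuracy trade-off you need.

```python
import numpy as np

def predict_or_defer(probs, threshold=0.9):
    """Predict where the model is confident, return -1 (defer to a human) elsewhere.

    probs: (n_examples, n_classes) predicted class probabilities.
    threshold: hypothetical confidence cutoff, chosen from your coverage/accuracy target.
    """
    confident = probs.max(axis=1) >= threshold
    preds = np.where(confident, probs.argmax(axis=1), -1)
    coverage = confident.mean()   # fraction of cases handled without a human
    return preds, coverage

# Toy probabilities for three examples of a two-class problem.
probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]])
preds, coverage = predict_or_defer(probs)
print(preds, coverage)   # [0 -1 1] 0.666...
```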
Get A Minimal Pipeline Up and Running ASAP
Don’t start fancy. If your problem could be solved with linear regression, start there.
If you can copy an existing solution (like a network pretrained on ImageNet), try that.
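For a tabular classification task, a baseline pipeline can be this small (a sketch assuming scikit-learn; the built-in breast cancer dataset just stands in for your own data):

```python
# Minimal end-to-end baseline: load data, fit a simple model, measure something.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("baseline test accuracy:", baseline.score(X_test, y_test))
```

However bad the number is, you now have something to beat and a way to tell whether later changes help.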
Should I Gather More Data?
If it’s cheap and easy, yes.
They make a good point that if you can’t overfit on purpose to get near-perfect accuracy on the training set, your model is probably flawed, not the data (unless the data is so noisy that it contradicts itself or something).
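A common way to run that check, sketched in PyTorch with a placeholder model and a made-up 32-example batch: if the training loss won't go to roughly zero here, look at the model/loss/optimizer wiring before blaming the data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 20)            # one tiny, fixed batch (hypothetical 20-dim inputs)
y = torch.randint(0, 2, (32,))     # hypothetical binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Train on the same tiny batch over and over; we *want* to overfit it.
for step in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# If this doesn't drop close to 0, suspect the model/loss/optimizer, not the data.
print("training loss on the tiny batch:", loss.item())
```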
Once training-set performance is nailed down, see how you do on the test set. If generalization is poor (good performance on train, bad on test), more data helps.
If you’re strapped for data, you can do hyperparameter tuning and regularization.
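A minimal sketch of what that can look like in PyTorch (layer sizes and penalty strengths are placeholders): dropout in the model plus an L2 penalty via weight decay in the optimizer.

```python
import torch
import torch.nn as nn

# Hypothetical classifier for 20-dim inputs; all sizes are placeholders.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero half the activations during training
    nn.Linear(64, 2),
)
# weight_decay adds an L2 penalty on the weights, shrinking effective capacity.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```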
If all that fails and you can’t get good performance, time to send in your application for grad school, because you’ve got a research problem and you may as well be rewarded with a PhD for working on it.
Hyperparameter Tweaking
Manual
You need to understand what’s actually going on and some basic statistical learning theory for your tweaking to be anything more than “monkey changes number and sees what happens”.
Understanding over/underfitting and model capacity will get you surprisingly far. Surprising because you can predict the authors’ words before reading them.
You’re trying to get capacity into the sweet spot between underfitting and overfitting.
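One way to see the sweet spot on a toy problem: sweep capacity (polynomial degree here, standing in for model size) and compare training versus validation error. Both high usually means underfitting; a widening gap usually means overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)   # noisy toy target
x_tr, y_tr = x[:20], y[:20]                  # small training set on purpose
x_va, y_va = x[20:], y[20:]

for degree in [1, 3, 5, 9]:                  # degree = model capacity
    coeffs = np.polyfit(x_tr, y_tr, degree)
    tr_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    va_mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree}: train MSE {tr_mse:.3f}  val MSE {va_mse:.3f}")
```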
Learning rate is the first hyperparameter to tune, and often the most important.
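A toy illustration of why it matters so much, using plain gradient descent on f(w) = w^2: too small a learning rate barely moves, a mid-range one converges, and too large a one diverges.

```python
# Gradient descent on f(w) = w**2, whose gradient is 2*w.
def gd_final_w(lr, steps=20, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in [0.01, 0.1, 0.9, 1.1]:
    # 0.01 crawls, 0.1 and 0.9 converge toward 0, 1.1 blows up.
    print(f"lr={lr}: w after 20 steps = {gd_final_w(lr):.4g}")
```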
Automatic
Auto-Tune isn’t always bad, despite what The Boondocks says. See Blood on the Leaves, Heartless, and Dan Bilzerian.
Automatic tuning can be more computationally expensive, but it’s way less manual labor.
It can serve as a good starting point for further manual tuning.
Grid Search
This sucks. See the book for why, or imagine doing it over 10 hyperparameters: with just 5 values each, that's 5^10, roughly 10 million training runs.
Random Search
If good hyperparameters are dispersed throughout the hyperparameter space, randomly picking them is not a bad strategy, partly because you don't burn runs stepping through values of hyperparameters that turn out not to matter. Beats grid search anyway.
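A bare-bones sketch of random search in plain Python; `train_and_eval` is a placeholder for actually training a model and returning its validation score, and the sampling ranges are just plausible defaults.

```python
import random

random.seed(0)

def sample_config():
    # Sample scale-type hyperparameters log-uniformly, discrete ones uniformly.
    return {
        "lr": 10 ** random.uniform(-5, -1),
        "weight_decay": 10 ** random.uniform(-6, -2),
        "hidden_units": random.choice([64, 128, 256, 512]),
    }

def train_and_eval(config):
    # Placeholder: train a model with `config` and return its validation score.
    return random.random()

best_score, best_config = float("-inf"), None
for _ in range(50):
    config = sample_config()
    score = train_and_eval(config)
    if score > best_score:
        best_score, best_config = score, config

print("best validation score:", best_score)
print("best config:", best_config)
```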
Hyperparameter Tuning as YAOP (Yet Another Optimization Problem)
Just as the cost function can be viewed as a function of the weights, validation error can be viewed as a function of the hyperparameters and, in principle, optimized with respect to them. The catch is that this outer optimization is rarely differentiable.
This is not a common approach, yet. But it’s the one I’d bet on, if only because I need to believe there’s something better than manual tuning.
Debugging
ML models are pretty robust. That can be a curse: decent overall performance can mask the fact that one component is broken, so it's hard to tell anything is wrong at all.
Visualization is the main tool at the moment. Print the model’s outputs, and see if there’s something funny.
Also see what predictions the model is most and least certain about, and look for common threads.
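A sketch of that inspection step, with random numbers standing in for real predicted probabilities and example IDs: rank by confidence and eyeball both ends of the list.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=100)      # stand-in for real class probabilities
examples = [f"example_{i}" for i in range(100)]   # stand-in for real inputs / filenames

confidence = probs.max(axis=1)
order = np.argsort(confidence)                    # least confident first

print("least confident:", [examples[i] for i in order[:5]])
print("most confident: ", [examples[i] for i in order[-5:]])
```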
Read this section of the book, it’s worth it. And it’s about 3 pages.
Example: Google Street View
This glosses over the sheer difficulty of gathering as much data as Google did, but what did you expect?
Printing outputs fixed the main bug.