I just made the slugs for these URLs random since I’m sick of caring about them. Search is good enough that I don’t have to.
OpenAI paper on compute from this month (I couldn’t easily find the link, so I gave up looking).
The ‘info extractor’ metaphor for large nets nicely explains why lower-level data can be more useful than higher-level stuff: the low-level data subsumes the high-level data and contains at least as much information (e.g. the raw text contains everything a summary of it does, and more).
I wonder how starting with higher-level info and annealing towards lower-level representations of the same data over training would work. The data could be consistently formatted by encoding the high level in the low (like compiling to machine code).
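A rough sketch of what I mean, in Python. Everything here is hypothetical: `high_level` and `low_level` stand in for two encodings of the same example (say, a summary vs. the raw text), and the sampler just shifts probability mass from one to the other over training.

```python
import random

def anneal_prob(step, total_steps):
    # Fraction of examples served in the high-level encoding;
    # starts at 1.0 and decays linearly to 0.0 over training.
    return max(0.0, 1.0 - step / total_steps)

def sample_encoding(example, step, total_steps):
    # `example` is assumed to carry both encodings of the same data,
    # e.g. {"high_level": summary_tokens, "low_level": raw_tokens}.
    if random.random() < anneal_prob(step, total_steps):
        return example["high_level"]
    return example["low_level"]

# Early in training the model mostly sees the high-level encoding;
# late in training it sees almost entirely the low-level one.
```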
The compute budget stuff is cool too but I’m not going over that here.
It does seem to imply that the transformer/attention is a better primitive than conv because of what looks like better scaling behavior. ‘An Image is Worth 16x16 Words’ seems to lend evidence to that: worse perf than conv initially, but a higher ceiling if you have the compute for it. Dunno if it holds for bigger stuff.
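For context, the ‘16x16 words’ bit is literal: ViT chops the image into fixed-size patches and feeds them to a transformer as a token sequence. A rough numpy sketch of that patchify step (the projection matrix here is a random stand-in for the learned embedding):

```python
import numpy as np

def patchify(image, patch=16):
    # image: (H, W, C) array; H and W assumed divisible by `patch`.
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch * patch * C)    # (num_tokens, p*p*C)

image = np.random.rand(224, 224, 3)
tokens = patchify(image)            # (196, 768): a 14x14 grid of "words"
embed = np.random.rand(768, 768)    # stand-in for the learned linear projection
embedded_tokens = tokens @ embed    # what actually goes into the transformer
```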
This really drives home the trade-off: specialization (faster training, more cleverness required, less compute) versus generality (higher performance, more compute).
Panjabi MC still satisfies. Ashok Gill’s voice — goddamn.