A transformer gives you a function \(F: [X] \to [Y]\) from sequences to sequences. Anything that can be totally ordered into a sequence can be learned, so it's pretty general.
To me, this paper is more evidence for the transformer as a universal architecture. Some other (famous) examples:
- AlphaFold
- GPT
- BERT
## The Idea
Take an image (and its target image, during training), cut it into patches, embed each patch, then run a transformer over the patch sequence. It's that simple.
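Here's a minimal sketch of that recipe in PyTorch, just to make it concrete. The names, sizes, and the strided-convolution patchify trick are my own illustrative choices, not the paper's code, and I've left off the task head so it stays agnostic about what the target is.

```python
# Minimal sketch: patchify -> embed -> transformer encoder. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class PatchTransformer(nn.Module):
    def __init__(self, in_channels=3, patch_size=16, dim=192, depth=4, heads=3, max_patches=196):
        super().__init__()
        # Cutting the image into patches and linearly embedding each one,
        # done in a single strided convolution.
        self.patch_embed = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                   # images: (B, C, H, W)
        x = self.patch_embed(images)             # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim): left-to-right raster order
        x = x + self.pos_embed[:, : x.size(1)]   # learned position embeddings
        return self.encoder(x)                   # per-patch features; put a task head on top

features = PatchTransformer()(torch.randn(2, 3, 224, 224))  # -> (2, 196, 192)
```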
I wonder whether different sequentialization strategies would work better or worse (e.g., rasterizing top-to-bottom instead of left-to-right). Would doing it randomly give noticeably different performance?
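Testing the random-order case seems cheap: the permutation is just an index shuffle on the patch sequence before the encoder. A hypothetical snippet (shapes match the sketch above):

```python
# Hypothetical experiment: re-sequentialize the patches in a random order
# and check whether downstream performance moves.
import torch

patches = torch.randn(2, 196, 192)        # (B, N, dim), originally in raster order
perm = torch.randperm(patches.size(1))    # one fixed random ordering of the patches
shuffled = patches[:, perm]               # same patches, different sequentialization
```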
## Bitter Lesson
The paper says it outright: "We find that large scale training trumps inductive bias."
I gotta get more compute.