An Image is Worth 16x16 Words


The transformer gives you a function \(F: [X] \to [Y]\), from sequences over \(X\) to sequences over \(Y\). Anything that can be totally ordered into a sequence can be learned. So it’s pretty general.

To me, this paper is more evidence for the transformer as a universal architecture. Some other (famous) examples:

  • AlphaFold
  • GPT
  • BERT

The Idea

Take an image (and its target label in training), cut it into patches, linearly embed each patch, add position embeddings, then run a transformer on that. It’s that simple.
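
A minimal sketch of the idea in PyTorch, assuming a standard classification setup; the class name `TinyViT`, the dimensions, and the use of a strided conv as the patch embedding are illustrative choices, not the paper's exact config.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    # Illustrative sizes, not the paper's: 224x224 images, 16x16 patches.
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # "Embed each patch": a strided conv is equivalent to a linear map on flattened patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.patch_embed(x)               # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim): the "sequence of words"
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])             # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```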

I wonder whether different sequentialization strategies would work better or worse (rasterizing top-to-bottom instead of left-to-right, say). Would ordering the patches randomly give noticeably different performance? A toy sketch of the question is below.
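
A toy sketch of those orderings, assuming the `(B, num_patches, dim)` patch sequence from the model above; the function name and grid size are hypothetical.

```python
import torch

def reorder_patches(patches, grid, order="row"):
    """patches: (B, grid*grid, dim), laid out row-major (left-to-right rasterization)."""
    idx = torch.arange(grid * grid).reshape(grid, grid)
    if order == "column":          # rasterize top-to-bottom instead of left-to-right
        idx = idx.t().reshape(-1)
    elif order == "random":        # a fixed random permutation of the patches
        idx = idx.reshape(-1)[torch.randperm(grid * grid)]
    else:                          # keep the default row-major order
        idx = idx.reshape(-1)
    return patches[:, idx, :]

patches = torch.randn(2, 14 * 14, 192)
col_major = reorder_patches(patches, grid=14, order="column")
```

Worth noting: self-attention itself is permutation-equivariant, so a fixed reordering like this only reaches the model through the position embeddings.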

Bitter Lesson

As the paper puts it: "large scale training trumps inductive bias."

I gotta get more compute.
