Generating a lot of language data with a theorem prover


I spent the last hour looking and failing to find the paper that made this specific claim:

Language models can be trained mostly on synthetic, hierarchical data and it works fine. A little bit of natural language data used for fine tuning allows it to translate language it hasn’t seen before just fine.

Closest I could find is a paper about synthetic language data

Using deep mind optimal compute estimates, you need ~50,000x more flops and ~1300x more tokens than PaLM in order to train a 10T param alien mind. Assuming similar util rates to PaLM, and given you discover like ~50x price improvements, I expect it would cost abt $10B to train rn

bayeslord

I think MetaMath 0 could be used to generate a lot of those tokens. Any theorem prover could work, but mm0 is really fast, which is of special value here.

By the argument about language models learning hierarchy, this would provide training data for them, since formal language is a subset of natural language1.

In one sense, theorem provers are “realer than real”: whatever hierarchy they learn is true, basically because it’s math. Bob Osserman comes to mind.

There has been a tendency in recent years to take the notion of proof down from its pedestal. Critics point out that standards of rigor change from century to century. New gray areas appear all the time. Is a proof by computer an acceptable proof? Is a proof that is spread over many journals and thousands of pages, that is too long for any one person to master, a proof? And of course, venerable Euclid is full of flaws, some filled in by Hilbert, others possibly still lurking.

Clearly it is worth examining closely and critically the most basic notion of mathematics, that of proof. On the other hand, it is important to bear in mind that all distinctions and niceties about what precisely constitutes a proof are mere quibbles compared to the enormous gap between any generally accepted version of a proof and the notion of a convincing argument. Compare Euclid, with all his flaws to the most eminent of the ancient exponents of the convincing argument – Aristotle. Much of Aristotle’s reasoning was brilliant, and he certainly convinced most thoughtful people for over a thousand years. In some cases his analyses were exactly right, but in others, such as heavy objects falling faster than light ones, they turned out to be totally wrong. In contrast, there is not to my knowledge a single theorem stated in Euclid’s Elements that in the course of two thousand years turned out to be false.

– Bob Osserman

Exploring the space of mathematical language is hard but offers the potential for infinite data. There’s lots of bootstrapping issues, but at some point soonish we’re gonna run out of novel human text to feed the model.

Simulators also offer this, though from what I’ve heard they tend to hit a gap where the model doesn’t get much better, so we go out into the real world looking for more data to capture all the weird high frequency information.

In an even more speculative vein, I often wonder about embedding language inside a simulator. We do it already. Our speech exists on 2 levels: as some abstract language and as raw audio. I can imagine a very fancy version of Socratic Models that communicate even more like humans do.

At some point for increased capabilities, I think models are going to have to make up and solve their own problems, much like bright children do.

  1. See Tarski for way more on this. 

Related Posts

Minimal Surfaces

November 2, 2023

NTK reparametrization

Kate from Vancouver, please email me

ChatGPT Session: Emotions, Etymology, Hyperfiniteness

Some ChatGPT Sessions

2016 ML thoughts

My biggest takeaway from Redwood Research REMIX

finite, actual infinity, potential infinity

Actions and Flows