806 Proofs, 44 Annotations: What grind Can and Can't Do


Three weeks ago, Leo de Moura posted a challenge: can AI learn to annotate theorems for Lean 4’s grind tactic? The idea is that AI should press buttons on existing automation, not synthesize proof terms from scratch.

I pointed Claude at Mathlib and let it run. After 25+ experiments, the agent found 44 annotations that let grind close 806 proofs across 20+ mathematical domains — group theory, ring arithmetic, power laws, set operations, list combinatorics, logic, ordering, lattices, quantifiers, modular arithmetic, function composition, and more. 100% pass rate across four test suites.

The breakdown:

  • 170 synthetic benchmarks (Bench.lean) — all sorry replaced by grind
  • 193 real Mathlib proofs (MathlibTest.lean) — from Algebra.Group.Basic, Algebra.Ring.Basic, Data.Set.*, Logic.Basic, plus newly discovered domains
  • 100 minimal self-contained proofs (Minimal.lean) — across 14 domains
  • 343 native grind proofs (NewDomains.lean) — requiring zero annotations: CommRing, Lattice, quantifiers, modular arithmetic, Finset, Array, Set distributivity, List functor laws, function composition, Prod/Sum/Sigma types, and more

What the annotations look like

The 44 annotations are unremarkable: mul_inv_cancel, pow_succ, div_eq_mul_inv, etc. What matters is the selection. Mathlib has 200,000+ theorems. The agent found a minimal set that covers a wide goal space without triggering instantiation cascades.
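Concretely, an annotation is just an attribute attached to an existing Mathlib lemma. A minimal sketch (exact attribute spellings and lemma names vary across Lean/Mathlib versions):

```lean
import Mathlib

-- Mark existing Mathlib lemmas as E-matching rules for grind.
-- `[grind =]` uses an equation left-to-right as a rewrite pattern;
-- `[grind →]` fires a rule forward whenever its antecedents match.
attribute [grind =] pow_succ div_eq_mul_inv
attribute [grind =] mul_inv_cancel
```

The selection problem is choosing which handful of the 200,000+ lemmas to mark so that patterns fire on real goals without feeding each other indefinitely.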

Only one annotation causes blowup: Nat.mul_le_mul marked [grind ->] generates 100+ instantiation chains. This is inherent: any forward-reasoning rule with two <= antecedents blows up O(n^2) in the number of <= facts. This is exactly the constraint-design problem Leo described as Level 3.
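To see why, look at the rule's shape (a sketch; the statement mirrors Mathlib's Nat.mul_le_mul, and the attribute syntax is an assumption):

```lean
-- Nat.mul_le_mul : a ≤ b → c ≤ d → a * c ≤ b * d
--
-- With [grind →], every ordered pair of ≤ hypotheses can instantiate
-- the rule, so n facts of the form `x ≤ y` yield O(n²) derived facts
-- of the form `a * c ≤ b * d` — each of which is itself a ≤ fact
-- that can feed further instantiations.
attribute [grind →] Nat.mul_le_mul
```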

The real finding: where grind works and where it doesn’t

I tested grind on production code (tinygrad Lean 4 formalization). Result: 1/15 pass. The failures aren’t annotation problems — they’re paradigm mismatches:

  Pattern                      Count   Fixable via annotations?
  Needs structural induction   8       No — grind has no induction
  Custom defs not reduced      4       No — grind uses E-matching, not simp-style reduction
  Needs helper lemmas          2       Maybe

Grind excels at algebraic reasoning (groups, rings, modules) where proofs are equational rewrites. For computational libraries (compilers, data structures, ML frameworks) where proofs are primarily definitional unfolding + induction, simp remains the right tool.
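The contrast in one file (a hedged sketch; the grind example assumes the relevant group lemmas are annotated, and the induction proof is illustrative, not from the test suites):

```lean
import Mathlib

-- Equational algebra: grind's wheelhouse once lemmas are annotated.
example {G : Type*} [Group G] (a b : G) : a * b * b⁻¹ = a := by grind

-- Structural recursion: grind has no induction, so goals like this
-- still need explicit induction plus simp on the recursive equations.
theorem reverse_reverse' {α : Type*} (l : List α) :
    l.reverse.reverse = l := by
  induction l with
  | nil => rfl
  | cons x xs ih => simp [ih]
```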

Surprise: what grind handles natively

The biggest finding is that 343 proofs need zero annotations. Grind handles these domains out of the box:

  • CommRing: distributivity, negation, squaring, cube expansion, difference of squares
  • Bool/Prop: De Morgan, double negation, excluded middle, S/K combinators
  • Set algebra: union, inter, comm, assoc, distributivity, empty, univ, subset
  • Nat/Int arithmetic: ring identities, ordering, min/max, concrete computation
  • Constructor theory: Option (some != none, map, bind, getD), Prod (eta, swap, map), Sum (injectivity, disjointness, elim)
  • Type wrappers: PLift, ULift, Subtype, Sigma, PSigma, Unit, PUnit
  • List: length, head?, tail, append, zip, map functor laws, cons/nil reasoning
  • Finset: subset, union, intersection, membership, singleton
  • Array: size, push
  • Function: id, const, composition laws
  • Lattice: idempotence, commutativity, associativity of sup/inf
  • Ordering: constructor disjointness
  • Quantifiers: forall/exists manipulation, specialization

Grind’s built-in Boolean algebra reasoning covers Set operations comprehensively. Its ring normalization handles CommRing identities including cube expansions. Constructor injectivity and disjointness are fully automatic.
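Representative zero-annotation goals, sketching the kind of statements in NewDomains.lean (not verbatim from the repo):

```lean
import Mathlib

-- Ring normalization: polynomial identities over a CommRing.
example (a b : Int) : (a + b)^2 = a^2 + 2*a*b + b^2 := by grind

-- Boolean algebra on sets.
example (s t : Set Nat) : s ∩ t ⊆ s := by grind

-- Constructor disjointness.
example : (some 3 : Option Nat) ≠ none := by grind

-- Propositional reasoning: De Morgan.
example (p q : Prop) : ¬(p ∨ q) ↔ ¬p ∧ ¬q := by grind
```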

What grind can’t do (yet)

Some domains I tested that fail:

  • Nat bit ops (land/lor/xor): needs native_decide or decide
  • Symbolic divisibility (a ∣ b -> b ∣ c -> a ∣ c): non-linear constraint
  • Lattice order to join/meet (a <= b -> a ⊔ b = b): doesn’t connect LE to sup/inf
  • List.head (requires nonemptiness proof): type mismatch
  • Nat.factorial: not available without extra imports
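For the bit-operation cases, concrete instances still fall out of a decision procedure rather than grind (a sketch; values chosen for illustration):

```lean
-- grind doesn't reason about Nat bitwise ops symbolically,
-- but ground instances are decidable:
example : 12 &&& 10 = 8 := by decide   -- 1100 AND 1010 = 1000
example : 12 ||| 10 = 14 := by decide  -- 1100 OR  1010 = 1110
```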

The two unsolved goals

bench_int_sq_nonneg (0 <= a * a for integers) needs mul_self_nonneg, but grind’s proof-by-contradiction negates the goal before the lemma can pattern-match. bench_group_pow_neg has a normalization gap where -(n:Int) normalizes to -1 * n. Both are Level 4 challenges — the research frontier.
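As statements, the two gaps look roughly like this (reconstructed from the benchmark names, not verbatim; the manual proof is shown only to mark the lemma grind fails to reach):

```lean
import Mathlib

-- bench_int_sq_nonneg: grind negates the goal to `a * a < 0`, and
-- mul_self_nonneg's pattern `0 ≤ ?a * ?a` never matches the negated
-- form. The lemma closes the goal instantly when applied by hand:
example (a : Int) : 0 ≤ a * a := mul_self_nonneg a

-- bench_group_pow_neg: `-(n : Int)` normalizes to `-1 * n`, so
-- zpow lemmas whose patterns mention `-n` fail to fire.
```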

Code: autoresearch-grind
