Visual programming languages have not taken off in the mainstream. Languages like Scratch and Logo have mostly been known as educational tools and not for “serious” programming.

Understanding the large and small structure of code visually is something I’ve found helpful. The large-scale structure is often a call graph showing how the whole program fits together. The small-scale structure shows how individual functions are plumbed together, and what each one can do.

Switching between visual and symbolic representations of code is handy for the same reason that switching between pictures and equations is handy. The geometric picture is declarative. Everything is in there. Maybe too much, which is where the symbolic view is useful. Abstract equations strip out extraneous details.

Purely functional languages are in an interesting spot with respect to this idea. Their purity lets you treat every function as an input/output black box, and that property is preserved under function composition. I imagine writing pure functions graphically, with more complicated combinations of them visualized cleanly by “zooming out”.
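A minimal sketch of the black-box idea, in Python rather than a purely functional language. The names compose, double, and increment are hypothetical, made up for illustration. The point is only that a composition of pure functions is itself a pure function, so it can be drawn as a single box when you zoom out.

```python
from functools import reduce

def compose(*fs):
    # compose(f, g, h)(x) == f(g(h(x))); with no arguments it is the identity.
    return reduce(lambda f, g: lambda x: f(g(x)), fs, lambda x: x)

# Two small pure functions: output depends only on input, no side effects.
double = lambda x: 2 * x
increment = lambda x: x + 1

# "Zoomed out", the pipeline is just another input/output box.
pipeline = compose(double, increment)
print(pipeline(3))  # double(increment(3))
```

Because pipeline has no hidden state, it can be dropped into a larger composition exactly like double or increment.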

OpenAI Codex/GitHub Copilot has had an interesting effect on my programming. It suggests code to me, one block at a time. Often this block is 80% of what I wanted. I then fix up the block, move on to the next, and repeat.

Example that I just tested:

from functools import reduce

def foldr(xs, op):
    return reduce(op, xs[::-1])  # folds from the right end; op is called as op(acc, x)


I wrote def foldr(xs,op) and autocomplete did the rest. I thought about the candidate completion for a second, realized that the backwards indexing is what makes it foldr instead of foldl, and accepted it.
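One caveat worth checking: reversing the list changes the traversal order, but not the order of the arguments passed to op. Here is a minimal sketch of the difference, assuming the Haskell convention where op takes (element, accumulator). The names foldr_completed and foldr_strict and the explicit init parameter are my own illustration, not part of the completion.

```python
from functools import reduce

def foldr_completed(xs, op):
    # The autocompleted version, reproduced for comparison.
    # For [1, 2, 3] it computes op(op(3, 2), 1).
    return reduce(op, xs[::-1])

def foldr_strict(xs, op, init):
    # Right fold in the Haskell sense: op(x1, op(x2, ... op(xn, init))).
    return reduce(lambda acc, x: op(x, acc), reversed(xs), init)

sub = lambda a, b: a - b
print(foldr_completed([1, 2, 3], sub))   # (3 - 2) - 1
print(foldr_strict([1, 2, 3], sub, 0))   # 1 - (2 - (3 - 0))
```

For an operation like addition, where grouping and argument order don't matter, the two agree (given a neutral init such as 0). They only diverge for operations like subtraction, where those things do matter.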

It makes the bottom-up, concrete implementation of a more top-down, abstract vision easier.

I think something similar but for visual languages would be useful. I could write functions one at a time, graphically or symbolically, and get computer aid.

For that reason, I wonder how useful it would be to co-train models on visual representations of language. Symbols/language describe the real world but live in abstract relation to it. At some point, to define a word, you end up pointing at an object in the real world and saying “it’s like that” (describing color is a natural example). The visual grounds the symbolic.

For that reason, I think super multi-modal training is a major step towards intelligence.

Objects exist, and have all these properties, which each mode helps you understand a bit better, one query at a time.

By Yoneda-style reasoning, any object is not necessarily equal to the set of all its descriptions, but it is equivalent in all observable ways, which is practically as good. Equivalence is context-dependent equality.

Example: with respect to the property of “being 1 thing”, you (taken as an abstract totality) are equivalent to a rug, a bee, and a single hair on your left shoulder. If “how many are there” is the only query you’re allowed to make, none of those objects can be distinguished from one another, and they may as well be the same.
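A toy sketch of that idea, assuming objects are modeled as plain dictionaries and queries as functions on them. The helper indistinguishable and the count/legs fields are hypothetical, chosen just to make the example concrete.

```python
def indistinguishable(a, b, queries):
    # Two objects are equivalent relative to a set of queries
    # if every query returns the same answer for both.
    return all(q(a) == q(b) for q in queries)

# Hypothetical objects with made-up properties.
rug = {"name": "rug", "count": 1, "legs": 0}
bee = {"name": "bee", "count": 1, "legs": 6}

how_many = lambda obj: obj["count"]
how_many_legs = lambda obj: obj["legs"]

print(indistinguishable(rug, bee, [how_many]))                 # same under one query
print(indistinguishable(rug, bee, [how_many, how_many_legs]))  # a second query separates them
```

With only how_many available the rug and the bee collapse into the same thing; adding how_many_legs is enough to tell them apart.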

More queries, more understanding.