Little known in my community, Fabrice Bellard is someone I aspire to be a third as productive as.
Just look at this section:
4 The NC library
In order to ensure that the model is exactly the same in the encoder and in the decoder, we developed a custom C library to implement the various operations needed by the models. It only depends on basic IEEE 754-2008 32 bit floating point operations (fused multiply-add, division, square root) with no reliance on external libraries to ensure we get deterministic results on every CPU and operating system.
He built an entire custom system to evaluate on. In 2019. Rather than go with any of the many frameworks available, including at least one written in C (Darknet).
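To get a feel for what that floating-point discipline means in practice, here's a tiny sketch of my own (not NC's code; the function names are mine): if every arithmetic step is one of the binary32 operations that IEEE 754-2008 specifies exactly (fused multiply-add, division, square root), the encoder and decoder compute bit-identical results on any conforming CPU, as long as the compiler isn't allowed to reassociate the math.

```c
/* My sketch, not NC's API: keep every arithmetic step to exactly-rounded
 * IEEE 754 binary32 operations so results are reproducible bit for bit
 * across CPUs (assuming no -ffast-math style reassociation). */
#include <math.h>
#include <stddef.h>

/* Dot product accumulated purely with fmaf(): each step is one correctly
 * rounded binary32 fused multiply-add. */
static float dot_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = fmaf(a[i], b[i], acc);
    return acc;
}

/* L2 norm using only FMA and sqrtf, both exactly specified by the standard. */
static float norm_f32(const float *a, size_t n)
{
    return sqrtf(dot_f32(a, a, n));
}
```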
On x86 CPUs, the AVX2 + FMA instructions are mandatory in order to get correct performance. The basic matrix operations were optimized to be fast with small matrices. It happens in our case because we use small batch sizes. General purpose BLAS routines are much slower in these cases.
He wrote something to beat BLAS.
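For the curious, a kernel in that spirit looks roughly like this (my sketch, not NC's code; compile with -mavx2 -mfma): with batch size 1 the matrices are tiny, so a straight AVX2 + FMA loop with no packing or blocking avoids the overhead a general-purpose BLAS pays to be fast on large matrices.

```c
/* Rough illustration of a small-matrix GEMM kernel on AVX2 + FMA.
 * C[m x n] += A[m x k] * B[k x n], row-major; n is assumed to be a
 * multiple of 8 for brevity. */
#include <immintrin.h>
#include <stddef.h>

static void sgemm_small(float *C, const float *A, const float *B,
                        size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j += 8) {
            __m256 acc = _mm256_loadu_ps(&C[i * n + j]);
            for (size_t p = 0; p < k; p++) {
                /* broadcast one element of A, multiply-accumulate a row
                 * chunk of B into the accumulator */
                __m256 a = _mm256_broadcast_ss(&A[i * k + p]);
                __m256 b = _mm256_loadu_ps(&B[p * n + j]);
                acc = _mm256_fmadd_ps(a, b, acc);
            }
            _mm256_storeu_ps(&C[i * n + j], acc);
        }
    }
}
```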
The library represents the forward evaluation of the model as a bytecode. The bytecode to compute the gradient using backward propagation is automatically computed from this bytecode. Liveness analysis and dead code removal is done to reduce the memory usage and improve cache hits.
Why not add some liveness analysis to my custom system?
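A toy version of that forward/backward bytecode idea might look like this (again my sketch; NC's actual representation isn't published in this form): record the forward pass as a flat list of ops over tensor slots, derive the gradient bytecode by walking the list in reverse, and run a last-use scan so buffers can be reused as soon as nothing reads them again.

```c
/* Toy sketch of forward bytecode, reverse-derived backward bytecode, and
 * straight-line liveness. By convention, the gradient of slot i lives in a
 * parallel gradient slot i; each backward op reads grad(dst) and
 * accumulates into grad(src0) / grad(src1). Unary ops set src1 = -1. */
#include <stddef.h>

typedef enum { OP_MATMUL, OP_ADD, OP_TANH } op_kind;
typedef struct { op_kind kind; int dst, src0, src1; } op;

typedef enum { G_MATMUL_BWD, G_ADD_BWD, G_TANH_BWD } grad_kind;
typedef struct { grad_kind kind; int dst, src0, src1; } grad_op;

/* Emit backward ops by scanning the forward bytecode in reverse order. */
static size_t build_backward(const op *fwd, size_t n, grad_op *bwd)
{
    size_t out = 0;
    for (size_t i = n; i-- > 0; ) {
        switch (fwd[i].kind) {
        case OP_MATMUL: bwd[out].kind = G_MATMUL_BWD; break;
        case OP_ADD:    bwd[out].kind = G_ADD_BWD;    break;
        case OP_TANH:   bwd[out].kind = G_TANH_BWD;   break;
        }
        bwd[out].dst  = fwd[i].dst;
        bwd[out].src0 = fwd[i].src0;
        bwd[out].src1 = fwd[i].src1;
        out++;
    }
    return out;
}

/* Straight-line liveness: record, for each slot, the last instruction that
 * reads it; after that point its buffer can be reused. */
static void last_use(const op *fwd, size_t n, int *last, size_t nslots)
{
    for (size_t s = 0; s < nslots; s++)
        last[s] = -1;
    for (size_t i = 0; i < n; i++) {
        if (fwd[i].src0 >= 0) last[fwd[i].src0] = (int)i;
        if (fwd[i].src1 >= 0) last[fwd[i].src1] = (int)i;
    }
}
```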
Since the decoder needs to evaluate the model incrementally at each timestep (but still does the training on a whole segment as the encoder), the library automatically converts the per-segment forward evaluation bytecode into a per-timestep bytecode. The resulting per-timestep forward evaluation bytecode usually runs slower because smaller matrix multiplications are involved.
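Here is roughly what that incremental evaluation amounts to (my illustration, written by hand rather than generated from bytecode the way NC does it): the decoder persists the recurrent state and runs the same cell function one timestep at a time, so each step is a batch-of-1 pass with correspondingly smaller matrix multiplies.

```c
/* Per-segment vs per-timestep evaluation of a toy recurrent cell.
 * (libm's tanhf is used only for brevity; a fully deterministic build
 * would construct the nonlinearity from the exactly-rounded primitives.) */
#include <math.h>
#include <stddef.h>

/* tanh RNN cell: h_next = tanh(W*h + U*x + b), row-major weights. */
static void rnn_cell(float *h_next, const float *h, const float *x,
                     const float *W, const float *U, const float *b,
                     size_t hdim, size_t xdim)
{
    for (size_t i = 0; i < hdim; i++) {
        float acc = b[i];
        for (size_t j = 0; j < hdim; j++)
            acc = fmaf(W[i * hdim + j], h[j], acc);
        for (size_t j = 0; j < xdim; j++)
            acc = fmaf(U[i * xdim + j], x[j], acc);
        h_next[i] = tanhf(acc);
    }
}

/* Per-segment evaluation (encoder/training side): sweep the whole segment. */
static void run_segment(float *h, const float *xs, size_t T,
                        const float *W, const float *U, const float *b,
                        size_t hdim, size_t xdim, float *tmp)
{
    for (size_t t = 0; t < T; t++) {
        rnn_cell(tmp, h, &xs[t * xdim], W, U, b, hdim, xdim);
        for (size_t i = 0; i < hdim; i++)
            h[i] = tmp[i];
    }
}

/* Per-timestep evaluation (decoder side): the same cell, called once per
 * decoded symbol with the persisted state h. */
static void decode_step(float *h, const float *x,
                        const float *W, const float *U, const float *b,
                        size_t hdim, size_t xdim, float *tmp)
{
    rnn_cell(tmp, h, x, W, U, b, hdim, xdim);
    for (size_t i = 0; i < hdim; i++)
        h[i] = tmp[i];
}
```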
And this bit in Section 7:
This experiment would not have been possible without our NC library which ensures deterministic evaluation and training with good performances on standard CPUs.
Wow. That’s dedication to reproducibility, and a lot of ambition.