It’s useful to look at image data before you do any ML with it, and usually I want to look at random samples. So I wrote a tool to do it. Actually, I wrote the same tool three times this afternoon, improving something different each time.

## First try: Bash

Normally I’d run something like

    icat $(find dir1 dir2 -name '*.png' -or -name '*.jpg' -or -name '*.jpeg' | shuf -n 100)


where icat is part of my terminal emulator, kitty.

I tried spinning that off into a function, but gave up in disgust, since parsing positional and optional arguments in Bash isn’t worth the effort of learning.

It also has the standard Bash issues from spaces in path names, and I’m not about to use the weird array syntax for this.

## Next: Python

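    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    import argparse
    import random
    import subprocess
    from pathlib import Path
    from typing import List

    parser = argparse.ArgumentParser()
    # Reconstructed from the uses of args.dirs and args.num_samples below;
    # the flag spelling and the default of 100 are guesses.
    parser.add_argument("dirs", nargs="+")
    parser.add_argument("-n", "--num-samples", type=int, default=100)

    args = parser.parse_args()

    imgs: List[Path] = []
    exts = ["png", "jpg", "jpeg", "gif"]

    for directory in args.dirs:
        for ext in exts:
            # rglob already searches recursively, so no "**/" prefix is needed.
            imgs.extend(Path(directory).expanduser().resolve().rglob(f"*.{ext}"))

    # random.sample raises ValueError if asked for more items than exist.
    samples = random.sample(imgs, k=min(args.num_samples, len(imgs)))

    subprocess.call(["icat"] + [str(path) for path in samples])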


This improves on argument parsing. However, it still has (at least) one problem: the sampling is memory intensive since it collects all the paths into a list before sampling from it.

## Last?: RIIR

(Code)

Rust’s trait¹ system makes it easy to swap in your own sampling strategy. I imagine shuf uses Fisher-Yates. I have no idea what random.sample uses, but it doesn’t work on generic iterators, only on sequences such as lists.
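To make the point about random.sample concrete, here is a minimal sketch of what happens when it is handed a generator instead of a list. The helper `lazy_numbers` is a made-up stand-in for a lazy stream of paths.

```python
import random


def lazy_numbers():
    """A generator: it has no len() and cannot be indexed."""
    yield from range(1000)


try:
    random.sample(lazy_numbers(), k=5)
except TypeError as err:
    outcome = f"rejected: {type(err).__name__}"
else:
    outcome = "accepted"

print(outcome)

# Materializing the generator first works, but it holds every item in memory.
picked = random.sample(list(lazy_numbers()), k=5)
print(picked)
```

The second call is effectively what the Python script above does with its list of paths.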

I used the default one for iterators in the rand crate: reservoir sampling. This is perfect for my use case.
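For anyone who hasn't met reservoir sampling, the idea fits in a few lines. The sketch below is the textbook "Algorithm R" written in Python for comparison with the earlier script. It is not the rand crate's code, and the function name `reservoir_sample` and its `rng` parameter are invented for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from any iterable."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / seen
            # by overwriting a randomly chosen slot.
            slot = rng.randrange(seen)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Works on a generator, which is never held in memory as a whole.
sample = reservoir_sample((n * n for n in range(10_000)), k=3)
print(sample)
```

Only the k-element reservoir is kept, so memory use does not grow with the length of the input.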

### Benefits

• Uses constant memory beyond storing the samples themselves
• Directory walking is as fast as find, and much faster than Python’s os.walk
• Much faster startup time than Python
• Installing and distributing binaries is truly cross-platform

Unfortunately, actually drawing the images takes a while, but at least picking which images to draw is fast even for big directories.
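The constant-memory benefit depends on walking the directories lazily, so that paths are handed to the sampler one at a time instead of being collected first. A rough Python sketch of that half, assuming the same extensions as the script above, could look like this. The names `iter_images` and `IMAGE_EXTS` are hypothetical, and this is not how the Rust tool is written.

```python
import os
from typing import Iterable, Iterator

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif")


def iter_images(roots: Iterable[str]) -> Iterator[str]:
    """Yield image paths under the given directories one at a time."""
    stack = list(roots)  # directories still to visit
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.lower().endswith(IMAGE_EXTS):
                    yield entry.path


# Nothing is read from disk until the generator is consumed.
for path in iter_images(["."]):
    print(path)
    break
```

Only the stack of pending directories is kept in memory, so the output can be fed straight into a streaming sampler.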

1. interfaces