It’s useful to look at image data before you do any ML with it. Usually I want to look at random samples, so I wrote a tool to do it. Actually, I wrote the same tool 3 times (this afternoon), improving something different each time.
First try: Bash
Normally I’d run something like
icat $(find dir1 dir2 -name '*.png' -or -name '*.jpg' -or -name '*.jpeg' | shuf -n 100)
where icat is part of my terminal emulator, kitty.
I tried spinning that off into a function, but gave up in disgust, since positional and optional argument parsing in Bash isn’t worth the effort of learning.
It also has the standard Bash issues from spaces in path names, and I’m not about to use the weird array syntax for this.
Next: Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import random
import subprocess
from pathlib import Path
from typing import List

parser = argparse.ArgumentParser()
parser.add_argument("dirs", type=str, nargs="*", default=["."])
parser.add_argument("-n", "--num-samples", type=int, default=100)
args = parser.parse_args()

# Collect every matching image path under each directory.
imgs: List[Path] = []
exts = ["png", "jpg", "jpeg", "gif"]
for dir in args.dirs:
    for ext in exts:
        # rglob already searches recursively, so a "**/" prefix is redundant.
        imgs.extend(Path(dir).expanduser().resolve().rglob(f"*.{ext}"))

# random.sample raises ValueError if k exceeds the population size.
samples = random.sample(imgs, k=min(args.num_samples, len(imgs)))
subprocess.call(["icat"] + [str(p) for p in samples])
This improves on argument parsing. However, it still has (at least) one problem: the sampling is memory-intensive, since it collects every path into a list before sampling from it, so memory use grows with the number of images.
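To make the memory point concrete, here is a minimal sketch (not part of the script above) of what a lazy version of the directory walk could look like. The helper names iter_images and sample_needs_sequence are made up for illustration. The second function shows why this alone doesn't solve the problem: random.sample wants a sequence and raises TypeError when handed a generator.

```python
import random
from pathlib import Path
from typing import Iterable, Iterator

EXTS = {".png", ".jpg", ".jpeg", ".gif"}


def iter_images(dirs: Iterable[str]) -> Iterator[Path]:
    """Lazily yield image paths one at a time (hypothetical helper)."""
    for d in dirs:
        for path in Path(d).expanduser().rglob("*"):
            # Compare suffixes case-insensitively and skip directories.
            if path.suffix.lower() in EXTS and path.is_file():
                yield path


def sample_needs_sequence() -> bool:
    """Return True if random.sample rejects a generator."""
    try:
        random.sample((i for i in range(10)), k=3)
    except TypeError:
        return True
    return False
```

So with the standard library alone, the choice is between building the full list or writing a sampler that works on a stream.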
Last?: RIIR
(Code)
Rust’s trait[1] system makes it easy to swap in your own sampling strategy. I imagine shuf uses Fisher-Yates. I have no idea what random.sample uses, but it doesn’t work on generic iterators, only on sequences such as lists.
I used the default one for iterators in the rand crate: reservoir sampling. This is perfect for my use case.
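For anyone who hasn't met reservoir sampling, here is a minimal Python sketch of the classic "Algorithm R" form of it. This is an illustration of the general technique, not the rand crate's actual implementation, and the function name reservoir_sample is made up. It takes one pass over any iterable and only ever stores k items.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick up to k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Works on a generator, which random.sample would reject.
picked = reservoir_sample((n for n in range(1_000_000)), k=5)
```

If the stream has fewer than k items, this sketch simply returns all of them.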
Benefits
- Uses constant memory beyond storing the samples themselves
- Plus the directory walking is as fast as find, and much faster than Python’s os.walk
- Much faster startup time than Python
- Installing and distributing binaries is truly cross-platform
Unfortunately, actually drawing the images takes a while, but at least picking which images to draw is fast even for big directories.
[1] interfaces