
Perceptual Image Matching

Image fingerprinting pipelines that deduplicate, match, and search images at scale. The same technique that finds a near-duplicate in a 10,000-image catalog also groups variants from a photo shoot, audits migrations, and surfaces visual look-alikes across libraries.

What It Does

A perceptual hashing pipeline takes any set of images and gives you a way to compare them by what they look like, not by file bytes or filenames. Each image becomes a short fingerprint that stays stable across resize, recompression, format change, and small crops.

64-bit fingerprint per image, comparable with one XOR and one popcount
Pairwise distance between any two images, so you know how visually close they are
All-pairs match passes that comfortably handle tens of millions of comparisons per run
Configurable thresholds for exact duplicate, near match, or loose visual similarity
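As a rough illustration of how such thresholds might be wired up, here is a minimal sketch that turns a distance between two 64-bit fingerprints into a label. The function name `classify` and the cutoffs `near_max=4` and `loose_max=10` are placeholders chosen for the example, not values taken from a real deployment; in practice they would be tuned per catalog.

```python
def classify(distance, near_max=4, loose_max=10):
    """Map a fingerprint distance (0..64) to a coarse similarity label.

    near_max and loose_max are illustrative cutoffs, not recommended values.
    """
    if not 0 <= distance <= 64:
        raise ValueError("distance between 64-bit fingerprints must be 0..64")
    if distance == 0:
        return "exact"      # identical fingerprints
    if distance <= near_max:
        return "near"       # e.g. recompressed or lightly cropped copies
    if distance <= loose_max:
        return "loose"      # visually similar, worth a human look
    return "different"


if __name__ == "__main__":
    for d in (0, 2, 7, 30):
        print(d, classify(d))
```

Tightening `near_max` trades recall for fewer false positives, so it is usually worth checking a sample of borderline pairs by eye before fixing the cutoffs.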

How It Works

Every image is reduced to an 8 × 8 grid of relative brightness gradients, then packed into a 64-bit integer. Comparing two fingerprints is one XOR plus one popcount. Two copies of the same photo hash to nearly the same value, even after a round trip through a different host.
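One common way to build a gradient fingerprint like this is a difference hash (dHash): compare each pixel with its right-hand neighbour and store one bit per comparison. The sketch below assumes the image has already been converted to grayscale and downscaled to 8 rows of 9 pixels by an imaging library (that step is not shown). The name `dhash_from_gray` and the 9-wide layout are assumptions for illustration, not necessarily what a given pipeline uses.

```python
def dhash_from_gray(pixels):
    """Pack left-vs-right brightness comparisons into a 64-bit integer.

    pixels: 8 rows, each with 9 grayscale values (0..255).
    Each row yields 8 comparisons, so 8 rows give 64 bits.
    """
    if len(pixels) != 8 or any(len(row) != 9 for row in pixels):
        raise ValueError("expected 8 rows of 9 grayscale values")
    fp = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # bit is 1 where brightness falls from left to right
            fp = (fp << 1) | (1 if left > right else 0)
    return fp


if __name__ == "__main__":
    ramp = [[col * 10 for col in range(9)] for _ in range(8)]
    brighter = [[v + 20 for v in row] for row in ramp]
    # A uniform brightness shift leaves every comparison unchanged.
    print(dhash_from_gray(ramp) == dhash_from_gray(brighter))
```

Because only the sign of each neighbour difference is kept, uniform brightness changes and mild recompression tend to flip few bits, which is what keeps the fingerprint stable.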

matcher — pseudocode

stage 1 │ fingerprint every image on both sides
        │   library  →  fp[]   (8×8 grayscale gradient → 64-bit hash)
        │   inputs   →  fp[]

stage 2 │ all-pairs match
        │   for each input s ∈ inputs:
        │     best  ←  argmin over library of  hamming(fp[s], fp[L])
        │     if hamming(fp[s], fp[best]) ≤ THRESHOLD:
        │        record (s → best, distance)

hot path│ hamming(a, b) = popcount(a XOR b)   one XOR + one popcount per pair
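The pseudocode above translates fairly directly into Python. This is a minimal sketch that assumes fingerprints are already computed and held in dictionaries mapping a name to a 64-bit integer; `match_all` and its argument names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def match_all(inputs, library, threshold=10):
    """For each input, find the closest library fingerprint.

    inputs, library: dicts of name -> 64-bit int fingerprint.
    Returns a list of (input_name, library_name, distance) tuples
    for matches at or under the threshold.
    """
    if not library:
        return []
    matches = []
    for name, fp in inputs.items():
        # argmin over the library by Hamming distance
        best = min(library, key=lambda lib_name: hamming(fp, library[lib_name]))
        dist = hamming(fp, library[best])
        if dist <= threshold:
            matches.append((name, best, dist))
    return matches


if __name__ == "__main__":
    library = {"a": 0x0, "b": 0xFF}
    inputs = {"x": 0x1, "y": 0xFFFF000000000000}
    print(match_all(inputs, library))
```

On Python 3.10 and later, `(a ^ b).bit_count()` does the same job as `bin(a ^ b).count("1")` and is typically faster.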

Figure: a perceptual image fingerprint shown as an 8 × 8 grid.

distance = 0: identical fingerprints, identical image

What That Unlocks

Deduplication. Find every duplicate or near-duplicate in a catalog regardless of size, format, or compression.
Visual search. Look up a library by what images look like rather than by the tags someone assigned.
Variant grouping. Cluster near-identical shots from a photo shoot, product set, or stock library automatically.
Migration audit. Verify every image survived the move from one platform to another, intact and in the right place.
Repost detection. Spot when your visuals are reused elsewhere, even after they have been recompressed or re-cropped.
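Variant grouping, for instance, can be sketched as clustering: link any two images whose fingerprints are within a threshold, then read off the connected groups. The example below uses a small union-find for that. The function name `group_variants` and the default threshold are assumptions for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def group_variants(fps, threshold=4):
    """Group names whose fingerprints are linked by distance <= threshold.

    fps: dict of name -> 64-bit int fingerprint.
    Returns a list of groups (each a sorted list of names) with 2+ members.
    """
    names = sorted(fps)
    parent = {n: n for n in names}

    def find(n):
        # Walk to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(fps[a], fps[b]) <= threshold:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[rb] = ra

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    shots = {"a": 0b0000, "b": 0b0001, "c": 0b0011, "z": 0xFFFFFFFF}
    print(group_variants(shots, threshold=1))
```

Note that this is single-linkage grouping: "a" and "c" end up together because both are close to "b", even though they are two bits apart. A stricter scheme would require every pair in a group to be within the threshold.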

matcher — sample run

$ matcher --threshold 10
[ok]    fingerprinted 8,742 library images
[ok]    fingerprinted 2,488 input images
[run]   21,750,096 pair comparisons in 47s
[match] dist=0   library/img-04201   →  input/q217-04
[match] dist=0   library/img-09112   →  input/q094-01
[match] dist=2   library/img-02877   →  input/q156-03
[match] dist=4   library/img-11403   →  input/q043-02
...
[done]  2,317 matches under threshold

At Scale

Twenty million pair comparisons sounds expensive, but each one is a 64-bit XOR and a popcount. The bottleneck is downloading thumbnails, not the matching itself. A typical match pass finishes in under a minute once the images are cached locally.

Hash Grid: 8 × 8

Fingerprint Per Image: 64-bit

Comparisons Per Pass: 20M+

Match Pass Runtime: < 1 min

What I Build

A complete pipeline usually includes:

Source ingestion: targeted scrapers, public APIs, file system walkers, or webhook listeners
Image normalization and fingerprint generation, with a resumable cache so re-runs stay cheap
All-pairs or index-based matching with configurable thresholds
Output: CSVs, dashboards, queue events, or webhook calls for downstream review
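For the index-based matching option, one standard approach is to split each fingerprint into bands and look up exact band matches. If two 64-bit fingerprints differ in at most 3 bits, at least one of their four 16-bit bands must be identical, so only candidates sharing a band need a full comparison. The sketch below assumes that band layout; `build_index`, `query`, and the constants are hypothetical names, and this is not necessarily the index a given pipeline would use.

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16
BAND_MASK = (1 << BAND_BITS) - 1


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def bands(fp):
    """Split a 64-bit fingerprint into (band_position, band_value) keys."""
    return [(i, (fp >> (i * BAND_BITS)) & BAND_MASK) for i in range(BANDS)]


def build_index(library):
    """Map every band key to the library names that contain it."""
    index = defaultdict(list)
    for name, fp in library.items():
        for key in bands(fp):
            index[key].append(name)
    return index


def query(index, library, fp, threshold=3):
    """Return sorted (distance, name) pairs within the threshold.

    The pigeonhole guarantee only holds while threshold < BANDS.
    """
    if threshold >= BANDS:
        raise ValueError("threshold must be smaller than the number of bands")
    candidates = set()
    for key in bands(fp):
        candidates.update(index.get(key, ()))
    hits = [(hamming(fp, library[name]), name) for name in candidates]
    return sorted(hit for hit in hits if hit[0] <= threshold)


if __name__ == "__main__":
    library = {"a": 0x0123456789ABCDEF, "b": 0xFFFFFFFFFFFFFFFF}
    index = build_index(library)
    print(query(index, library, 0x0123456789ABCDEF ^ 0b111))
```

The trade-off is that looser thresholds need more, narrower bands, which produce more candidate collisions. For small catalogs or loose thresholds, the plain all-pairs pass is often simpler and fast enough.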

Need an Image Pipeline?

Dedupe, match, search, audit. I build pipelines that turn a pile of images into something you can query.

Book a Call

Response within 24 hours