Tax Assist, Receipt Matching and Triage Pipeline

Bank data tells you that money moved. Your inbox often holds why. Joining the two reliably, across a full Australian financial year, is the difference between an evening of receipt-hunting and a polished expense schedule. The work is not glamorous. It is the kind of personal admin that quietly compounds into days lost every year, and it is precisely the shape of problem that yields to a well-built pipeline.

The problem

Preparing deductible expense evidence means matching hundreds of transactions to PDF receipts scattered across Gmail. The naive approach is to search for vendor names, download every PDF that comes back, and reconcile by hand. It does not scale. Worse, broad Gmail queries pull noise: OAuth security notices, login codes, pre-order updates, and promotional mail, all sitting beside the genuine invoices and, to a date-bounded keyword search, looking much the same.

A pipeline built on keyword queries alone would either miss real receipts when the queries are tight (low recall) or flood the triage queue with junk when they are broad (low precision). The system needed three things: structured matching across bank and email data, scored email evidence so the obvious candidates rise to the top, and a human review surface before anything is bundled into accountant-facing deliverables.

The approach

The solution is a multi-stage pipeline in which each stage has a clear job and a clear output.

Snapshot. Copy the personal-budget SQLite database into a working folder for the financial year window. The snapshot is read-only, used as a join source only, so the budget platform stays untouched.
Fetch. Gmail recipes pull candidate messages using vendor-pinned searches and date-bounded queries. Metadata lands in a local SQLite store. The fetch is parameterised per vendor, so high-volume senders use tight queries and low-volume vendors get broader ones.
Match. Align transactions to messages by vendor canonical name and date window. Rows bucket into matched, email_only, and transaction_only. The buckets are the unit of triage work.
Score. Composite heuristic scoring in email_score.py ranks receipt likelihood using subject patterns, sender domains, attachment hints, and label signals. Additive weights are transparent, thresholds are tunable, and the scoring is explainable per row rather than opaque per batch.
Triage. FastAPI plus HTMX webapp: dashboard, vendor grouping, bulk actions, per-row keep, skip, or review decisions. The interface is built for speed at the keyboard rather than mouse-heavy clicking.
Validate. Dry-run mode compares matcher output against a lodged prior year to tune precision and recall before trusting a new financial year. The validation step is the safety net that keeps the pipeline honest.
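
The Match stage above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual matcher: the Transaction and Message types, the bucket function, and the three-day default window are all hypothetical, and vendor names are assumed to be canonicalised already.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Transaction:  # hypothetical shape of a bank row
    txn_id: str
    vendor: str     # canonical vendor name, assumed already normalised
    posted: date


@dataclass(frozen=True)
class Message:      # hypothetical shape of fetched Gmail metadata
    msg_id: str
    vendor: str
    received: date


def bucket(transactions, messages, window_days=3):
    """Pair each transaction with the closest unused same-vendor message
    inside the date window, then report the three triage buckets."""
    window = timedelta(days=window_days)
    unused = list(messages)
    matched, transaction_only = [], []
    for txn in sorted(transactions, key=lambda t: t.posted):
        candidates = [
            m for m in unused
            if m.vendor == txn.vendor and abs(m.received - txn.posted) <= window
        ]
        if candidates:
            best = min(candidates, key=lambda m: abs(m.received - txn.posted))
            unused.remove(best)  # each message backs at most one transaction
            matched.append((txn, best))
        else:
            transaction_only.append(txn)
    return {
        "matched": matched,
        "email_only": unused,
        "transaction_only": transaction_only,
    }
```

Greedy nearest-date pairing is the simplest choice that keeps each message tied to one transaction. A vendor that bills several times inside one window would need a stricter tie-break, such as amount, which this sketch leaves out.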

Tax-assist pipeline from budget snapshot through Gmail fetch, match, score, and triage
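
For the Fetch stage, a vendor-pinned, date-bounded query might be assembled as follows. The function names and the example domain are hypothetical. The search operators (from:, after:, before:, has:attachment, filename:) are standard Gmail search syntax, but how Gmail treats the boundary dates and time zones is an assumption here and worth checking against a known message.

```python
from datetime import date


def fy_window(start_year):
    """Australian financial year: 1 July to 30 June.
    Returns (start, end) with end exclusive, i.e. the following 1 July."""
    return date(start_year, 7, 1), date(start_year + 1, 7, 1)


def build_query(sender_domain, fy_start_year, require_pdf=False):
    """Compose a Gmail search string pinned to one sender domain and one FY."""
    start, end = fy_window(fy_start_year)
    parts = [
        f"from:{sender_domain}",
        f"after:{start:%Y/%m/%d}",
        f"before:{end:%Y/%m/%d}",
    ]
    if require_pdf:
        # Tight recipe for high-volume senders: only mail carrying a PDF.
        parts += ["has:attachment", "filename:pdf"]
    return " ".join(parts)
```

A high-volume sender would use require_pdf=True, while a low-volume vendor can keep the broader form and let scoring and triage sort out the remainder.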

What the FY25–26 dry-run found

Research on a recent financial year quantified the noise problem in detail. Several hundred fetched messages included dozens of security and marketing subjects that initially scored as receipt candidates when vendor domains were pinned too broadly. The fix was tighter fetch recipes, sharper score thresholds, and the realisation that vendors fell into roughly three behaviour patterns:

Predictable senders. Same domain, same subject pattern, attachment usually present. Score high, low triage cost.
Mixed senders. Same domain, mixed subject lines (receipts, statements, promotions). Need subject-pattern weighting in the score.
Inconsistent senders. Multiple sub-domains, varying subject conventions, occasional attachment. The triage queue absorbs the ambiguity, and the scorer does not pretend it knows.

The taxonomy informs the score, the score informs the queue, and the queue is where the human spends their twenty minutes once a year.
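
To make the additive, per-row explainable scoring concrete, here is a small sketch. It is not the contents of email_score.py: the weights, regular expressions, thresholds, and function names are invented for illustration, and label signals are left out for brevity.

```python
import re

# Hypothetical weights; the real values would be tuned during the dry run.
WEIGHTS = {
    "receipt_subject": 3,
    "noise_subject": -4,
    "pinned_sender": 2,
    "pdf_attachment": 2,
}
RECEIPT_SUBJECT = re.compile(
    r"\b(receipt|tax invoice|invoice|payment confirmation)\b", re.I
)
NOISE_SUBJECT = re.compile(
    r"\b(security alert|verification code|login code|pre-order|newsletter)\b", re.I
)


def score_email(subject, sender_domain, attachment_names, pinned_domains):
    """Return (score, reasons) so every row can explain its own rank."""
    reasons = []
    if RECEIPT_SUBJECT.search(subject):
        reasons.append("receipt_subject")
    if NOISE_SUBJECT.search(subject):
        reasons.append("noise_subject")
    if sender_domain.lower() in pinned_domains:
        reasons.append("pinned_sender")
    if any(name.lower().endswith(".pdf") for name in attachment_names):
        reasons.append("pdf_attachment")
    return sum(WEIGHTS[r] for r in reasons), reasons


def suggest(score, keep_at=5, review_at=2):
    """Suggested default only; the human decision in triage is final."""
    if score >= keep_at:
        return "keep"
    if score >= review_at:
        return "review"
    return "skip"
```

With these made-up weights, a security notice from a pinned vendor domain scores 2 - 4 = -2 and defaults to skip, which is the behaviour the subject-pattern weighting for mixed senders is meant to produce.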

Conceptual triage interface with keep, skip, review decisions on anonymised rows

Evidence

707 Gmail candidates against ~1,900 transaction-only rows in one financial year, illustrating the asymmetric coverage the matcher has to handle
Human-in-the-loop triage before PDF download or evidence filing
Vendor to ATO category mapping via YAML configuration, kept separate from the scoring logic
Dry-run validation against a lodged prior year delivers measurable precision and recall before the new year is trusted
Accountant deliverables (expense schedules) remain a deterministic export step. The assistant reduces the search and match labour upstream without ever replacing professional tax advice
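
The dry-run validation reduces to a set comparison between the matcher's pairs and the pairs actually used in the lodged year. The sketch below assumes both are available as (transaction id, message id) tuples; the function name and that representation are assumptions, not the pipeline's own.

```python
def precision_recall(predicted_pairs, lodged_pairs):
    """Compare matcher output with the lodged prior year's known pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all lodged pairs
    """
    predicted, lodged = set(predicted_pairs), set(lodged_pairs)
    true_positives = len(predicted & lodged)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(lodged) if lodged else 0.0
    return precision, recall
```

Tightening a fetch recipe or raising a score threshold should lift precision, and running this before and after each change shows how much recall the change costs.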

This is the same pattern I apply in enterprise analytics: messy multi-source joins, explicit scoring, triage queues, validation against a gold standard. Same shape, smaller stakes, strict privacy throughout: no published TFNs, amounts, or real receipt content.