Client File Processor, Eight Days Collapsed to Twenty-Eight Minutes
The team I worked on did not have a data warehouse. We had Excel and Python and a recurring monthly task that involved hundreds of extracts, each running to hundreds of megabytes, all of which had to be reshaped, validated, and dropped into Power BI within a fixed reporting window. It used to take four to eight business days. After the file processor landed, the same work ran in twenty-eight minutes.
The problem
Operating in a People and Culture function inside a regulated industry meant working outside the traditional data-engineering surface. Access to the proper extract-transform-load toolchain belonged to other teams. Our team owned a reporting obligation that required pulling together hundreds of monthly extracts, each subject to its own header conventions, its own date formats, its own quality quirks.
The manual loop looked like this: open each file, check for required headers, verify date formats, eyeball for data quality, save with a specific naming convention, repeat. We were not doing incremental refreshes either. Each cycle reprocessed an entire financial year of data. The work consumed half a working week per cycle, scaled poorly with new report types, and produced exactly the kind of brittle output that any single missed header would invalidate. The visible failure mode was the missed deadline. The invisible one was the slow erosion of trust: when the pipeline takes a week and one mistake breaks the report, leadership starts caching their own copies.
The approach
A configuration-driven Python pipeline that knew nothing about any specific file type and everything about how to learn one from a config.
Detection. The processor scans incoming files, fingerprints their structure (column count, header patterns, date format hints), and matches them against the configuration library. When a new file shape arrives, adding a JSON config is enough. No code change.
Cleaning. Header normalisation handles invisible characters, casing inconsistencies, and the tail of historical formatting decisions that accumulated as the source systems evolved. Date columns are standardised from whatever input format they arrived in to a single canonical representation.
Validation. Built-in quality checks catch null dates, malformed numerics, and structural deviations from the expected shape. A file that fails validation is moved to a quarantine path with a structured error summary, not silently dropped or quietly corrupted.
Naming. Output files inherit a deterministic naming convention based on the date range they cover. The "file_final_final_v2.csv" pattern stops being a problem because the filename now describes the data.
Reporting. Each batch produces a processing summary: counts in, counts out, validation failures, time taken. The summary is the thing that goes into the deploy log, the thing the team checks each cycle, and the thing that surfaces drift before drift becomes a missed deadline.
What the architecture earns
The point of the configuration-driven shape is not elegance, it is durability. New report types arrive frequently. Source systems change column orders, drop fields, add fields. A hardcoded script handles one of those changes and breaks on the next. A config-driven processor absorbs the change in a config file edit, runs the same code path, and produces the same output shape downstream.
The processor is not the most sophisticated piece of infrastructure I have ever built. It is the one I have learned the most from about the value of constraints. Without access to "proper" tooling, the work forced honest decisions: what does this code need to do, what does it not need to do, and what is the simplest design that survives next month's surprise.
Evidence
- Over 60 million rows processed in a single quarter on validated runs
- Eight full business days of manual work reduced to twenty-eight minutes per cycle
- Zero manual intervention required for standard files; quarantine path handles exceptions
- New file types onboarded by adding a JSON config, no code changes required
- Clean validated output feeds directly into Power BI dashboards, with processing summaries archived per cycle
← kipjordan.com