Indigenous Land Management and Educational Outcomes, A Data Wrangling Study
The point of a data wrangling project is rarely the analytical conclusion. The point is the journey from two unrelated, untidy sources to a single tidy dataset that a downstream analyst can pick up without ceremony. This project did that across two public datasets with very different shapes, with attention paid to the documentation choices that would let the work be picked up by a community partner or another analyst later.
The problem
Two datasets, both publicly available, that have not historically been brought together at the analytical level. The Indigenous Land and Forest Estate dataset (2024) documents the connection between Indigenous peoples and land across Australia, broken down by state, ownership and management status, and forest classification. The 2021 Australian Bureau of Statistics Census provides educational attendance demographics for Indigenous Australians, organised by state and gender across six education levels (preschool through tertiary plus "other" and "not stated").
The first dataset is structured and CSV-shaped. The second is an Excel sheet with metadata headers, multi-row formatting, embedded gender labels, and wide columns. Joining them requires reshaping the second into the first's grammar, then thinking carefully about what the joined dataset can and cannot say. Land management practice is one of many factors that bear on educational outcomes, and the project was scrupulous about not claiming causal links the data does not support.
The approach
A staged pipeline with the wrangling moves named and the decisions surfaced.
Standardise state names. The land dataset used abbreviations (NT, NSW, VIC). The census used full names. A mapping dictionary aligned them, lowercased throughout, and dropped records with missing or invalid state values.
Reshape the demographics data. Strip the metadata headers (rows 1 through 4). Rename the columns from the dataframe-default placeholders. Extract the gender label from the "preschool" column where it had been embedded (males / females / persons). Convert the six education-level columns to numeric. Filter out the aggregated "persons" rows, keeping the disaggregated male and female counts.
Pivot to long format. pivot_longer() collapsed the six education-level columns into two: education_level and attendance_count. The dataset expanded from 1,912 rows to 11,472 (1,912 × 6 education levels), each row now naming a single observation rather than a row carrying six observations across its columns.
Deduplicate columns. A programmatic comparison function walked the column pairs and flagged any with greater than 99% value match. The check identified FOR_CATEGORY and FOR_CAT as duplicates carrying the same content under different types. The duplicate was removed, preserving information without preserving redundancy.
Derive enrichment variables. attendance_proportion for each education level as a share of state total. management_intensity as a composite of ownership, management, and co-management indicators (range 0 to 3). attendance_category bucketed by quartile (low / medium / high). State-level baselines computed for proportional comparison. Each derived variable opens a different angle of analysis without committing the dataset to any one.
Resolve missing data. A single variable, IND_FDES (Indigenous Forest Description), held missing values across 9.52% of records. Rather than drop those rows, missing values were replaced with "unspecified forest description." The choice preserves the analytical population, makes the missingness legible to future analysts, and avoids inflating the apparent completeness of the dataset.
Handle outliers and skew. Attendance counts were right-skewed, with extreme high values distorting visualisations. IQR-method outlier detection identified about 13.7% of attendance values as extreme. Median imputation pulled them back toward plausible ranges. A log(x+1) transformation reduced the skewness of the attendance distribution from 1.32 to -0.77, ready for parametric methods downstream.
What the dataset enables
The end state is a tidy, long-format, state-disaggregated dataset where each row carries one observation and the schema is friendly to grouping, summarising, and visualising. The project did not attempt to claim a causal relationship between land management practice and educational participation, and the dataset is shaped so that any downstream analyst making such a claim has to do their own framing work, not inherit ours.
Evidence
- 970 land records and 59 demographic rows integrated into a single tidy dataset of 11,472 rows
- Two-source join on standardised state names, with full lineage preserved
- Programmatic duplicate-column detection function (~80 lines) flagged the FOR_CAT/FOR_CATEGORY redundancy that manual inspection would have missed
- 9.52% missing values resolved through documented placeholder, preserving the analytical population
- Outlier handling reduced attendance distribution skewness from 1.32 to -0.77 after log transformation
- Reproducible from raw source files with a single R script, dependencies declared upfront
← kipjordan.com