Analysing the Relationship Between Indigenous Land Management and Educational Outcomes in Australia

Fair warning: These posts are super dry because they are academic papers. Sorry!

Data Science/Data Wrangling project.

Kip Jordan

A statistical analysis conducted as part of the Data Science Graduate Certificate at RMIT University, Melbourne, Australia.

Data Description

This analysis investigates the relationship between Indigenous land management practices and educational outcomes across Australia by integrating two key datasets:

Indigenous Land and Forest Estate Dataset (2024)

  • Contains 970 records documenting Indigenous peoples' connection to and management of land throughout Australia
  • Captures:
  • Geographical information (STATE)
  • Indigenous land status indicators (IND_OWN, IND_MNG, IND_COMNG)
  • Forest classifications (FOR_CATEGORY, FOR_TYPE)
  • CSV format with numeric identifiers, binary indicators, and descriptive fields

2021 Australian Bureau of Statistics Census Dataset

  • Provides Indigenous Educational Demographics across eight states/territories
  • Contains:
  • Attendance statistics for six educational categories (preschool through tertiary)
  • Gender-disaggregated data
  • Excel format requiring preprocessing due to complex structure with metadata headers

By combining these datasets, we aim to understand how varying approaches to Indigenous land management correlate with educational participation rates in Indigenous communities.

Key Variables

Indigenous Land Dataset

Identifier Variables:
- Rowid, VALUE: Numeric identifiers
- COUNT: Numeric area measurements

Geographic Variables:
- STATE: Character (8 territories)

Indigenous Management Indicators (Binary 0/1):
- IND_OWN: Indigenous ownership
- IND_MNG: Indigenous management
- IND_COMNG: Indigenous co-management
- IND_OSR: Other special rights
- IND_EST: Indigenous estate
- OVERLAP: Overlap indicator

Descriptive Variables (Character):
- IND_DESC: Full management descriptions
- IND_FDES: Forest-specific descriptions
- FOR_CATEGORY/FOR_CAT: Forest classifications
- FOR_TYPE: Detailed forest types

Symbolic Representations (Character):
- SYM_IO: Indigenous ownership symbols
- SYM_IMCM: Management status symbols
- SYM_OSR: Special rights symbols
- SYM_IE: Indigenous estate symbols

Classification:
- IND_CODE: Numeric classification code

Educational Demographics Dataset (59 × 8)

Initial Structure (Pre-cleaning):
- All columns initially character type
- Contains metadata headers in first rows
- Educational statistics organised in wide format

Key Variables (Post-cleaning):
- State identifiers (character)
- Educational attendance counts (numeric):
- Preschool
- Primary
- Secondary
- Tertiary
- Other
- Not stated
- Gender categories (character):
- Males
- Females
- Persons (later filtered)

The success of our transformation is validated by the final dataset structure: the original 1,912 records expanded to 11,472 rows, representing six education levels per original observation. This tidy format provides a robust foundation for investigating how Indigenous land management practices might influence educational participation patterns across different regions and contexts.

=== Structure of Indigenous Land Data ===

spc_tbl_ [970 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Rowid       : num [1:970] 0 1 2 3 4 5 6 7 8 9 ...
 $ VALUE       : num [1:970] 1 2 3 4 5 6 7 8 9 10 ...
 $ COUNT       : num [1:970] 19501614 163784 49413444 4175153 4072599 ...
 $ STATE       : chr [1:970] "NT" "NT" "NT" "NT" ...
 $ IND_OWN     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_MNG     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_COMNG   : num [1:970] 0 0 0 0 0 0 0 0 0 0 ...
 $ IND_OSR     : num [1:970] 0 0 0 0 0 0 0 0 0 0 ...
 $ IND_EST     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ OVERLAP     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_DESC    : chr [1:970] "Non-Indigenous" "Indigenous owned and Indigenous managed (but not subject to other special rights)" "Indigenous owned and Indigenous managed (but not subject to other special rights)" "Indigenous owned and Indigenous managed (but not subject to other special rights)" ...
 $ IND_FDES    : chr [1:970] NA "Indigenous owned and Indigenous managed (but not subject to other special rights)" NA "Indigenous owned and Indigenous managed (but not subject to other special rights)" ...
 $ FOR_CATEGORY: chr [1:970] "Non-forest" "Native forest" "Non-forest" "Native forest" ...
 $ FOR_CAT     : chr [1:970] "Non-forest" "Native forest" "Non-forest" "Native forest" ...
 $ FOR_TYPE    : chr [1:970] "Non-forest" "Rainforest" "Non-forest" "Eucalypt Medium Open" ...
 $ SYM_IO      : chr [1:970] "Non-forest land that is not Indigenous owned" "Forest that is Indigenous owned" "Non-forest land that is Indigenous owned" "Forest that is Indigenous owned" ...
 $ SYM_IMCM    : chr [1:970] "Non-forest land that is not Indigenous managed and not Indigenous co-managed" "Forest that is Indigenous managed" "Non-forest land that is Indigenous managed" "Forest that is Indigenous managed" ...
 $ SYM_OSR     : chr [1:970] "Non-forest land that is not subject to other special rights" "Forest that is not subject to other special rights" "Non-forest land that is not subject to other special rights" "Forest that is not subject to other special rights" ...
 $ SYM_IE      : chr [1:970] "Non-forest land that is not in the total Indigenous estate" "Forest that is in the total Indigenous estate" "Non-forest land that is in the total Indigenous estate" "Forest that is in the total Indigenous estate" ...
 $ IND_CODE    : num [1:970] 0 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   Rowid = col_double(),
  ..   VALUE = col_double(),
  ..   COUNT = col_double(),
  ..   STATE = col_character(),
  ..   IND_OWN = col_double(),
  ..   IND_MNG = col_double(),
  ..   IND_COMNG = col_double(),
  ..   IND_OSR = col_double(),
  ..   IND_EST = col_double(),
  ..   OVERLAP = col_double(),
  ..   IND_DESC = col_character(),
  ..   IND_FDES = col_character(),
  ..   FOR_CATEGORY = col_character(),
  ..   FOR_CAT = col_character(),
  ..   FOR_TYPE = col_character(),
  ..   SYM_IO = col_character(),
  ..   SYM_IMCM = col_character(),
  ..   SYM_OSR = col_character(),
  ..   SYM_IE = col_character(),
  ..   IND_CODE = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
=== Sample of Indigenous Land Data ===

  Rowid VALUE    COUNT STATE IND_OWN IND_MNG IND_COMNG IND_OSR IND_EST OVERLAP
  <dbl> <dbl>    <dbl> <chr>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
1     0     1 19501614 NT          0       0         0       0       0       0
2     1     2   163784 NT          1       1         0       0       1       1
3     2     3 49413444 NT          1       1         0       0       1       1
4     3     4  4175153 NT          1       1         0       0       1       1
5     4     5  4072599 NT          1       1         0       0       1       1
6     5     6    29238 NT          1       1         0       0       1       1
=== Structure of Demographics Data ===

tibble [59 × 8] (S3: tbl_df/tbl/data.frame)
 $ ...1: chr [1:59] "Australian Bureau of Statistics" "Census of Population and Housing: Aboriginal and Torres Strait Islander people data summary, 2021" "Released at 10.00am (Canberra time) 28 June 2022" NA ...
 $ ...2: chr [1:59] NA NA NA NA ...
 $ ...3: chr [1:59] NA NA NA NA ...
 $ ...4: chr [1:59] NA NA NA NA ...
 $ ...5: chr [1:59] NA NA NA NA ...
 $ ...6: chr [1:59] NA NA NA NA ...
 $ ...7: chr [1:59] NA NA NA NA ...
 $ ...8: chr [1:59] NA NA "Contents" "Find out more:" ...
=== Header of Demographics Data ===

  ...1                                 ...2  ...3  ...4  ...5  ...6  ...7  ...8
  <chr>                                <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Australian Bureau of Statistics      <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2 Census of Population and Housing: A… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
3 Released at 10.00am (Canberra time)… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  Cont…
4 <NA>                                 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  Find…
5 TABLE 4. TYPE OF EDUCATIONAL INSTIT… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  Type…
6 Count of Aboriginal and/or Torres S… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  Indi…

Understanding the Data

Our data understanding and preparation process employs a methodical approach to ensure data quality and meaningful relationships in our final dataset.

Indigenous Land Dataset Validation

Initial examination revealed several key characteristics requiring attention:

Baseline: 970 records encompassing 20 variables
Data Quality Issues: Inconsistent state abbreviations (e.g., "NT" vs "NSW") Missing state information in 14 records Mixed case text in character fields
Variable Types: Binary indicators (0/1) for six management classifications Character fields for descriptive and symbolic data Multiple overlapping forest categorisation schemes

Cleaning Implementation:

Standardised state names through mapping dictionary: Created
consistent state naming convention, applied lowercase transformation for
uniformity, and removed records with missing/invalid states.

Converted forest categories to factors: FOR_CATEGORY and FOR_TYPE
converted to factor type, preserved original levels for analysis
integrity.

Validated binary indicators: Confirmed 0/1 values across all
management fields, checked for logical consistency in overlapping
fields.

Demographics Dataset Validation

The raw data presented several structural challenges:

Initial State:

All columns imported as character type Embedded metadata in first 4 rows Non-standardised column names (…1, …2, etc.) Gender information embedded within data values Mixed data types requiring standardisation

Processing Implementation:

Header Processing: Extracted relevant column headers, removed
metadata and explanatory rows, and standardised column naming
convention.

Data Type Conversion: Identified numeric attendance columns,
converted character to numeric values, and handled NA values
appropriately.

Gender Categorisation: Extracted gender information from data,
created explicit gender column, and removed aggregated 'PERSONS'
records.

Data Validation: Verified numeric conversion success, confirmed
gender category completeness, and validated attendance totals
match.

Merge Process and Data Integrity

The integration process required careful attention to maintain data integrity:

Pre-Merge Preparation: We standardised state names across datasets, verified matching state categories existed between them, and confirmed there were no duplicate state-gender combinations that could cause issues during merging.

Merge Implementation: A left join preserved all land management records while matching on standardised state names and retaining all relevant columns from both source datasets.

Post-Merge Validation: The record counts aligned perfectly with expectations, showing 956 cleaned land records expanded to 1,912 final records when combined with 16 demographic records (2 genders × 8 states). Quality checks confirmed data integrity by verifying no unexpected NAs were introduced, binary indicators were preserved, educational totals remained consistent, and state-level relationships stayed intact.

This systematic approach to data understanding and preparation ensures our analysis can reliably examine the relationships between land management practices and educational participation whilst maintaining the integrity of both original datasets.

=== Indigenous Land Data Structure ===

spc_tbl_ [970 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Rowid       : num [1:970] 0 1 2 3 4 5 6 7 8 9 ...
 $ VALUE       : num [1:970] 1 2 3 4 5 6 7 8 9 10 ...
 $ COUNT       : num [1:970] 19501614 163784 49413444 4175153 4072599 ...
 $ STATE       : chr [1:970] "NT" "NT" "NT" "NT" ...
 $ IND_OWN     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_MNG     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_COMNG   : num [1:970] 0 0 0 0 0 0 0 0 0 0 ...
 $ IND_OSR     : num [1:970] 0 0 0 0 0 0 0 0 0 0 ...
 $ IND_EST     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ OVERLAP     : num [1:970] 0 1 1 1 1 1 1 1 1 1 ...
 $ IND_DESC    : chr [1:970] "Non-Indigenous" "Indigenous owned and Indigenous managed (but not subject to other special rights)" "Indigenous owned and Indigenous managed (but not subject to other special rights)" "Indigenous owned and Indigenous managed (but not subject to other special rights)" ...
 $ IND_FDES    : chr [1:970] NA "Indigenous owned and Indigenous managed (but not subject to other special rights)" NA "Indigenous owned and Indigenous managed (but not subject to other special rights)" ...
 $ FOR_CATEGORY: chr [1:970] "Non-forest" "Native forest" "Non-forest" "Native forest" ...
 $ FOR_CAT     : chr [1:970] "Non-forest" "Native forest" "Non-forest" "Native forest" ...
 $ FOR_TYPE    : chr [1:970] "Non-forest" "Rainforest" "Non-forest" "Eucalypt Medium Open" ...
 $ SYM_IO      : chr [1:970] "Non-forest land that is not Indigenous owned" "Forest that is Indigenous owned" "Non-forest land that is Indigenous owned" "Forest that is Indigenous owned" ...
 $ SYM_IMCM    : chr [1:970] "Non-forest land that is not Indigenous managed and not Indigenous co-managed" "Forest that is Indigenous managed" "Non-forest land that is Indigenous managed" "Forest that is Indigenous managed" ...
 $ SYM_OSR     : chr [1:970] "Non-forest land that is not subject to other special rights" "Forest that is not subject to other special rights" "Non-forest land that is not subject to other special rights" "Forest that is not subject to other special rights" ...
 $ SYM_IE      : chr [1:970] "Non-forest land that is not in the total Indigenous estate" "Forest that is in the total Indigenous estate" "Non-forest land that is in the total Indigenous estate" "Forest that is in the total Indigenous estate" ...
 $ IND_CODE    : num [1:970] 0 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   Rowid = col_double(),
  ..   VALUE = col_double(),
  ..   COUNT = col_double(),
  ..   STATE = col_character(),
  ..   IND_OWN = col_double(),
  ..   IND_MNG = col_double(),
  ..   IND_COMNG = col_double(),
  ..   IND_OSR = col_double(),
  ..   IND_EST = col_double(),
  ..   OVERLAP = col_double(),
  ..   IND_DESC = col_character(),
  ..   IND_FDES = col_character(),
  ..   FOR_CATEGORY = col_character(),
  ..   FOR_CAT = col_character(),
  ..   FOR_TYPE = col_character(),
  ..   SYM_IO = col_character(),
  ..   SYM_IMCM = col_character(),
  ..   SYM_OSR = col_character(),
  ..   SYM_IE = col_character(),
  ..   IND_CODE = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
=== Summary of Indigenous Land Data ===

     Rowid           VALUE           COUNT              STATE
 Min.   :  0.0   Min.   :  1.0   Min.   :        1   Length:970
 1st Qu.:242.2   1st Qu.:243.2   1st Qu.:       58   Class :character
 Median :484.5   Median :485.5   Median :     1969   Mode  :character
 Mean   :484.5   Mean   :485.5   Mean   :   792696
 3rd Qu.:726.8   3rd Qu.:727.8   3rd Qu.:    43011
 Max.   :969.0   Max.   :970.0   Max.   :139186321
    IND_OWN          IND_MNG         IND_COMNG         IND_OSR
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000
 Mean   :0.3732   Mean   :0.3103   Mean   :0.4258   Mean   :0.4546
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
    IND_EST          OVERLAP         IND_DESC           IND_FDES
 Min.   :0.0000   Min.   :0.0000   Length:970         Length:970
 1st Qu.:1.0000   1st Qu.:0.0000   Class :character   Class :character
 Median :1.0000   Median :1.0000   Mode  :character   Mode  :character
 Mean   :0.8567   Mean   :0.5381
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000
 FOR_CATEGORY         FOR_CAT            FOR_TYPE            SYM_IO
 Length:970         Length:970         Length:970         Length:970
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character



   SYM_IMCM           SYM_OSR             SYM_IE             IND_CODE
 Length:970         Length:970         Length:970         Min.   :0.000
 Class :character   Class :character   Class :character   1st Qu.:2.000
 Mode  :character   Mode  :character   Mode  :character   Median :4.000
                                                          Mean   :4.527
                                                          3rd Qu.:8.000
                                                          Max.   :9.000
Unique states after mapping:

[1] "northern territory"           "queensland"
[3] "western australia"            "south australia"
[5] "new south wales"              "victoria"
[7] "australian capital territory" "tasmania"
Land data factor validation:

FOR_CATEGORY     FOR_TYPE
    "factor"     "factor"
=== Demographics Data Structure ===

tibble [59 × 8] (S3: tbl_df/tbl/data.frame)
 $ ...1: chr [1:59] "Australian Bureau of Statistics" "Census of Population and Housing: Aboriginal and Torres Strait Islander people data summary, 2021" "Released at 10.00am (Canberra time) 28 June 2022" NA ...
 $ ...2: chr [1:59] NA NA NA NA ...
 $ ...3: chr [1:59] NA NA NA NA ...
 $ ...4: chr [1:59] NA NA NA NA ...
 $ ...5: chr [1:59] NA NA NA NA ...
 $ ...6: chr [1:59] NA NA NA NA ...
 $ ...7: chr [1:59] NA NA NA NA ...
 $ ...8: chr [1:59] NA NA "Contents" "Find out more:" ...
=== Summary of Demographics Data ===

     ...1               ...2               ...3               ...4
 Length:59          Length:59          Length:59          Length:59
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
     ...5               ...6               ...7               ...8
 Length:59          Length:59          Length:59          Length:59
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
Demographics data validation:

female   male
     8      8
Numeric columns check:

 preschool    primary  secondary   tertiary      other not_stated      total
 "numeric"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"
=== Final Dataset Summary ===

[1] "Land data rows:"     "956"                 "\n"
[4] "Demographics rows:"  "16"                  "\n"
[7] "Combined data rows:" "1912"                "\n"
=== Key Statistics for Educational Demographics ===

  gender preschool_mean preschool_sd primary_mean primary_sd secondary_mean
  <chr>           <dbl>        <dbl>        <dbl>      <dbl>          <dbl>
1 female          1511.        1663.        7353.      7363.          5236.
2 male            1674.        1868.        7751.      7797.          5217.
=== Sample of Final Combined Dataset ===

  Rowid VALUE    COUNT STATE   IND_OWN IND_MNG IND_COMNG IND_OSR IND_EST OVERLAP
  <dbl> <dbl>    <dbl> <chr>     <dbl>   <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
1     0     1 19501614 northe…       0       0         0       0       0       0
2     0     1 19501614 northe…       0       0         0       0       0       0
3     1     2   163784 northe…       1       1         0       0       1       1
4     1     2   163784 northe…       1       1         0       0       1       1
5     2     3 49413444 northe…       1       1         0       0       1       1
6     2     3 49413444 northe…       1       1         0       0       1       1

Tidy & Manipulate Data I

Our combined dataset required significant restructuring to align with tidy data principles and support our analytical objectives. The initial structure presented several challenges that would have hindered our ability to analyse relationships between land management practices and educational outcomes. Most notably, the wide format of the education data, with six separate columns from preschool to not stated, made it difficult to analyse trends across education levels. Additional data quality issues included redundant categorical variables, mixed data types, and inconsistent NA handling.

To address these challenges, we implemented a comprehensive transformation strategy centered around three key operations:

First, we employed pivot_longer() to convert the data from wide to long format, creating a single 'education_level' variable and standardising attendance counts in one column. This pivotal transformation enabled direct comparisons across educational levels. The operation was implemented by pivoting the six education columns (preschool, primary, secondary, tertiary, other, and not_stated) into two new columns: education_level and attendance_count.

Our second focus was column deduplication, systematically comparing column contents to identify and remove 100% matches through programmatic means. This process significantly reduced data complexity while maintaining data integrity and improving analysis efficiency. Finally, we standardised the data formats using mutate(across()) operations to ensure consistent factor levels and standardised text case throughout the dataset.

The success of our transformation strategy is clearly demonstrated in the validation results. Our original dataset of 1,912 records was transformed into a tidy format of 11,472 rows, with each row representing one education level (calculated as 1,912 × 6 education levels). Importantly, we maintained complete data lineage and achieved our transformation goals with no information loss.

This new tidy structure provides significant analytical advantages, simplifying statistical analysis, enhancing visualisation capabilities, and improving our ability to identify patterns and model relationships between land management and education. The clean, consistent structure of our final dataset maintains all original information while providing a more robust foundation for our analytical objectives.

=== Column Comparison Analysis ===

* Duplicate columns found: 'FOR_CATEGORY' and 'FOR_CAT'
  - Data types: factor and character
  - Match rate: 100.00%
  - Exact matches: 1912 out of 1912 rows
  - NA matches: 0

  - Sample comparison (first 5 rows):
   FOR_CATEGORY       FOR_CAT
1    non-forest    non-forest
2    non-forest    non-forest
3 native forest native forest
4 native forest native forest
5    non-forest    non-forest

Columns identified for removal: FOR_CAT
=== Transformation Summary ===

• Removed Column:  FOR_CAT
 • Remaining Columns:  Rowid, VALUE, COUNT, STATE, IND_OWN, IND_MNG, IND_COMNG, IND_OSR, IND_EST, OVERLAP, IND_DESC, IND_FDES, FOR_CATEGORY, FOR_TYPE, SYM_IO, SYM_IMCM, SYM_OSR, SYM_IE, IND_CODE, total, gender, education_level, attendance_count
=== Data Distribution ===

• Education Level Counts:

 preschool    primary  secondary   tertiary      other not_stated
      1912       1912       1912       1912       1912       1912
=== Row Count Validation ===

• Original Rows:  1912
 • Final Tidy Rows:  11472
 • Expected Rows:  11472  ( 1912  × 6 education levels)
=== Sample Output ===

  STATE              education_level attendance_count
  <chr>              <fct>                      <dbl>
1 northern territory preschool                    672
2 northern territory primary                     4131
3 northern territory secondary                   2328

Tidy & Manipulate Data II

To deepen our understanding of the relationship between Indigenous land management and educational outcomes, we created several derived variables that will enable more nuanced analysis. These transformations were designed to normalize attendance data across different state populations, create meaningful categorisations of management approaches, and enable both categorical and continuous analysis options.

For attendance-based metrics, we developed the attendance_proportion variable to represent each education level's share of total attendance within a state. This enables fair comparison between states of different population sizes and helps identify which education levels are over or under-represented. We also created attendance_category, which classifies attendance counts into meaningful groups (no attendance, low, medium, high) based on quartile analysis. This facilitates categorical analysis and visualization while helping identify patterns in extreme cases.

To better understand Indigenous management practices, we introduced two key indicators. The high_indigenous_control variable serves as a binary indicator for strong Indigenous management presence by combining ownership (IND_OWN) and management (IND_MNG) indicators. This allows comparison of educational outcomes between high and low Indigenous control areas. Additionally, we created management_intensity as a composite score (0-3) that incorporates ownership, management, and co-management status. This provides a continuous measure of Indigenous involvement that enables correlation analysis with educational outcomes.

For state-level analysis, we calculated state_total_attendance to serve as a baseline for proportional calculations and enable identification of state-level patterns. We also computed avg_attendance_by_level to determine mean attendance for each education level within states, helping identify state-specific educational patterns and compare educational access across regions.

These derived variables support our subsequent analysis by enabling both broad categorical comparisons and detailed numerical analysis, while accounting for population differences between states. They provide multiple perspectives on Indigenous management involvement and support investigation of relationships between land management approaches and educational outcomes. The distribution summaries show varying levels of Indigenous involvement across different regions and reveal patterns in educational participation across different levels of schooling.

=== New Variable Summaries ===
 attendance_proportion management_intensity    attendance_category
 Min.   :0.0000362     Min.   :0.000        no attendance:   0
 1st Qu.:0.0001945     1st Qu.:0.000        low          :2813
 Median :0.0004601     Median :1.000        medium       :5677
 Mean   :0.0006974     Mean   :1.107        high         :2982
 3rd Qu.:0.0009473     3rd Qu.:2.000
 Max.   :0.0086731     Max.   :2.000
=== Management Intensity Distribution ===

   0    1    2
3048 4152 4272
=== Attendance Categories by Education Level ===

             no attendance  low medium high
  preschool              0  660   1070  182
  primary                0   42    824 1046
  secondary              0   42   1084  786
  tertiary               0  481    856  575
  other                  0 1308    604    0
  not_stated             0  280   1239  393

Initial Data Scan

Our data quality assessment employed a systematic approach focused on identifying three key types of potential issues: missing values, data inconsistencies, and obvious errors. This thorough examination ensures we understand and address any data quality concerns before proceeding with our analysis.

Initial Scan Results

The initial scan revealed a remarkably clean dataset, with missing data limited to a single variable: IND_FDES (Indigenous Forest Description). This affected 1,092 records, representing 9.52% of the dataset. All other variables were complete. Our integrity checks provided further confidence in the data quality, confirming that:
- Attendance counts contained no negative values
- All attendance proportions fell within the valid range of 0-1
- Management intensity scores remained within their expected bounds of 0-3

Missing Data Strategy

When confronted with the missing forest descriptions, we carefully considered our options and ultimately chose to replace NA values with "unspecified forest description" rather than removing the affected rows. This decision emerged from a thoughtful analysis of several key factors.

First, the relatively low missing data rate of less than 10% suggested that complete case deletion would be unnecessarily aggressive. Additionally, we recognised that IND_FDES serves as a supplementary descriptive field rather than a critical analysis variable, making imputation a reasonable approach. The preservation of our full dataset was particularly important given the valuable educational attendance data contained in these rows.

Our chosen strategy prioritised documentation clarity by using a replacement value that explicitly indicates the original missing status while maintaining data integrity. This approach allows future analysts to easily identify which records contained missing descriptions while still enabling those records to be included in analyses.

Strategy Success Metrics

The implementation of our missing data strategy proved highly effective:
- Achieved complete resolution of missing values while preserving our full dataset
- Quality of data remained intact, with all attendance data and management information preserved
- Clear documentation of previous gaps through our chosen replacement value ensures transparency
- Logical integration with existing descriptions maintains the dataset's analytical utility

This conservative approach to handling missing data exemplifies our commitment to maintaining maximum analytical power while ensuring transparent documentation of data limitations. By carefully balancing these competing demands, we've created a robust foundation for our subsequent analyses.

=== Data Quality Analysis ===

Missing Values:
         Variable Missing_Count Missing_Percent
IND_FDES IND_FDES          1092            9.52

Data Integrity Issues:
- Negative attendance counts: 0
- Invalid proportions: 0
- Invalid management scores: 0
=== Missing Value Resolution ===

Remaining missing values in IND_FDES: 0

Secondary Data Scan

Our outlier detection and handling strategy evolved through several iterations to find the most effective approach. The final methodology focused on univariate outlier detection using the Interquartile Range (IQR) method, combined with median imputation for outlier resolution.

Initial Approaches Considered

Our initial attempt using the Z-score method (±3 standard deviations) proved problematic, as it was overly sensitive to extreme values in our educational data and would have inappropriately removed legitimate but unusual attendance patterns. We then explored Winsorisation with group-level bounds, grouping by education level, but this approach introduced new challenges - the complex implementation led to inconsistent results and made it difficult to maintain important relationships within the data groups.

Final Methodology Selected

We ultimately chose a simpler, more robust approach combining IQR detection with median imputation. This method proved more robust to extreme values compared to z-scores and was particularly well-suited for our skewed educational attendance data. By replacing outliers with median values, our resolution strategy preserved the overall data distribution while effectively removing extreme influences. Most importantly, this approach maintained the critical relationships between variables while providing a straightforward and reliable way to handle outliers.

Results of Implementation

The outlier detection and resolution process yielded meaningful improvements while preserving underlying patterns in our educational attendance data. Our attendance count data saw significant refinement, with modifications to approximately 13.7% of values bringing the upper range down from over 21,000 to a more representative 10,902, while preserving important lower attendance patterns. The attendance proportions required more subtle handling, with adjustments to 8.9% of values ensuring all proportions remained within valid bounds. Average attendance levels showed similar patterns to the raw counts, with modifications bringing the upper range from about 21,000 down to 8,691. The consistent modification rate across related variables (13.7% for both attendance counts and averages) suggests the method effectively identified and handled true outliers without over-processing the data, while maintaining consistency between metrics.

=== Outlier Resolution Summary ===

Variable: attendance_count_clean
Original range: 66.00 to 21731.00
Cleaned range: 66.00 to 10902.00
Values modified: 1572 (13.7%)

Variable: attendance_proportion_clean
Original range: 0.00 to 0.01
Cleaned range: 0.00 to 0.00
Values modified: 1022 (8.9%)

Variable: avg_attendance_by_level_clean
Original range: 70.50 to 21005.00
Cleaned range: 70.50 to 8691.00
Values modified: 1572 (13.7%)

Transform

To improve the analytical utility of our attendance data, we applied a natural logarithmic transformation to the attendance_count_clean variable. This transformation was chosen primarily to address the significant right-skewness present in the original data (skewness: 1.321) and to create a more normally distributed variable.

The effectiveness of this transformation is clearly demonstrated through several key distribution metrics. Most notably, we observed a substantial reduction in skewness, with the original data's right-skewness of 1.321 being transformed to a more symmetrical distribution with a skewness of -0.774. This shift towards symmetry indicates a much closer approximation to a normal distribution.

Further evidence of the transformation's success can be seen in the dramatic decrease of the Q3/Q1 ratio from 4.001 to 1.204, demonstrating that our transformed data now exhibits a more balanced distribution of values. Additionally, the standard deviation showed remarkable improvement, reducing from 2,164.388 in the original data to a much more manageable 1.035 in the transformed data, making the variable's spread significantly more interpretable and suitable for analysis.

We specifically chose the log(x + 1) transformation for several important reasons. This approach effectively handles the wide range of attendance counts (from 66 to 10,902) whilst preserving zero values through the addition of 1 before transformation. Perhaps most importantly, it creates a more intuitive scale for comparing relative differences in attendance numbers.

This transformed variable now provides a more reliable foundation for subsequent statistical analyses, particularly those that assume normality or require more balanced distributions. The improved characteristics of our transformed data will enable more robust and reliable statistical inference in our further analyses.

=== Distribution Analysis ===

Original Attendance Distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     66     882    2476    2650    3529   10902
Skewness: 1.321
Standard Deviation: 2164.388
Q3/Q1 Ratio: 4.001

Log Transformed Distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.205   6.783   7.815   7.471   8.169   9.297
Skewness: -0.774
Standard Deviation: 1.035
Q3/Q1 Ratio: 1.204

Reflection

This data wrangling assignment has been both challenging and rewarding, offering valuable opportunities to enhance my understanding of data preparation, transformation, and integration—skills critical for any aspiring data scientist.

The Indigenous Land and Forest Estate dataset was straightforward in its structured format, but the education demographics dataset required considerable effort to clean and make analysis-ready. Dealing with metadata embedded in the rows and the wide format of the data reinforced the importance of tidying data for usability. The process of transforming these datasets sharpened my ability to work methodically and adapt tools like pivot_longer() and mutate() for specific tasks.

A particularly rewarding aspect was developing the unique column-checking function. During initial exploratory analysis, several columns appeared to duplicate information. Rather than relying on manual inspection, I implemented a programmatic solution to compare column contents systematically. The function iteratively checked for near-perfect matches between column pairs, accounting for both exact and NA value matches. This approach not only reduced the risk of manual oversight but also improved efficiency in identifying redundant columns.

Handling missing values and outliers presented significant challenges. Instead of omitting records with missing values, which would have diminished analytical power, I opted to use placeholders. For outliers, after exploring methods like Z-scores, I adopted the IQR method combined with median imputation to balance dataset refinement while maintaining original patterns.

Throughout this assignment, I developed several technical skills in data tidying, validation, and problem-solving. The project highlighted the importance of visualisation and validation as tools to confirm that data transformations support analytical objectives. Creating derived metrics like attendance proportions and management intensity added depth to the analysis, enabling more meaningful comparisons and insights.

While the assignment successfully met its objectives, I see opportunities for further exploration, such as integrating spatial data to analyse regional trends. This project has strengthened my confidence in tackling messy datasets and turning them into reliable foundations for analysis.

References

Australian Bureau of Statistics 2021, Aboriginal and Torres Strait Islander People: Census , Australian Bureau of Statistics, viewed 15 March 2024, https://www.abs.gov.au/statistics/people/aboriginal-and-torres-strait-islander-peoples/aboriginal-and-torres-strait-islander-people-census/2021 .

Department of Agriculture, Fisheries and Forestry 2024, Australia’s Indigenous Land and Forest Estate 2024 , Australian Government, viewed 15 March 2024, https://data.gov.au/data/dataset/australia-s-indigenous-land-and-forest-estate-2024 .

library(readr)
  library(dplyr)
  library(readxl)
  library(tidyr)
  library(moments)
indigenous_land_raw <- read_csv("C:/Users/kipjo/OneDrive/Documents/GitHub/rmit_gradcert_data_science/data_wrangling/assignment3/data/ind_for24_attributes.csv")

cat("=== Structure of Indigenous Land Data ===\n", capture.output(str(indigenous_land_raw)), sep="\n")
cat("=== Sample of Indigenous Land Data ===\n", capture.output(head(indigenous_land_raw)), sep="\n")
raw_demographics <- read_excel(
  path = "C:/Users/kipjo/OneDrive/Documents/GitHub/rmit_gradcert_data_science/data_wrangling/assignment3/data/Aboriginal and Torres Strait Islander people data summary.xlsx",
  sheet = "Table 4",
  col_names = FALSE
)

cat("=== Structure of Demographics Data ===\n", capture.output(str(raw_demographics)), sep="\n")
cat("=== Header of Demographics Data ===\n", capture.output(head(raw_demographics)), sep="\n")
numeric_columns <- c("preschool", "primary", "secondary", "tertiary", "other", "not_stated", "total")

state_mappings <- setNames(
  c("northern territory", "new south wales", "victoria", "queensland",
    "south australia", "western australia", "tasmania", "australian capital territory"),
  c("NT", "NSW", "VIC", "QLD", "SA", "WA", "TAS", "ACT")
)

cat("=== Indigenous Land Data Structure ===\n", capture.output(str(indigenous_land_raw)), sep="\n")
cat("=== Summary of Indigenous Land Data ===\n", capture.output(summary(indigenous_land_raw)), sep="\n")
indigenous_land_clean <- indigenous_land_raw %>%
  filter(!is.na(STATE), STATE != "N/A") %>%
  mutate(
    STATE = state_mappings[STATE],
    across(where(is.character), tolower),
    FOR_CATEGORY = factor(FOR_CATEGORY),
    FOR_TYPE = factor(FOR_TYPE)
  )

cat("Unique states after mapping:\n", capture.output(unique(indigenous_land_clean$STATE)), sep="\n")
cat("Land data factor validation:\n", capture.output(sapply(indigenous_land_clean[c("FOR_CATEGORY", "FOR_TYPE")], class)), sep="\n")
cat("=== Demographics Data Structure ===\n", capture.output(str(raw_demographics)), sep="\n")
cat("=== Summary of Demographics Data ===\n", capture.output(summary(raw_demographics)), sep="\n")
indigenous_edu_demographics <- raw_demographics %>%
  select(
    state = `...1`,
    preschool = `...2`,
    primary = `...3`,
    secondary = `...4`,
    tertiary = `...5`,
    other = `...6`,
    not_stated = `...7`,
    total = `...8`
  ) %>%
  mutate(
    gender = case_when(
      preschool == "MALES" ~ "male",
      preschool == "FEMALES" ~ "female",
      preschool == "PERSONS" ~ "persons",
      TRUE ~ NA_character_
    )
  ) %>%
  fill(gender) %>%
  mutate(across(where(is.character), tolower)) %>%
  filter(
    state %in% tolower(unname(state_mappings)),
    !if_all(everything(), is.na)
  ) %>%
  mutate(across(all_of(numeric_columns), as.numeric)) %>%
  filter(gender != "persons")

cat("Demographics data validation:\n", capture.output(table(indigenous_edu_demographics$gender)), sep="\n")
cat("Numeric columns check:\n", capture.output(sapply(indigenous_edu_demographics[numeric_columns], class)), sep="\n")
indigenous_combined <- indigenous_land_clean %>%
  left_join(indigenous_edu_demographics, by = c("STATE" = "state"))

cat("=== Final Dataset Summary ===\n", capture.output(c("Land data rows:", nrow(indigenous_land_clean), "\n", "Demographics rows:", nrow(indigenous_edu_demographics), "\n", "Combined data rows:", nrow(indigenous_combined), "\n")), sep="\n")
cat("=== Key Statistics for Educational Demographics ===\n", capture.output(indigenous_edu_demographics %>%
  group_by(gender) %>%
  summarise(
    across(all_of(numeric_columns),
           list(
             mean = ~mean(., na.rm = TRUE),
             sd = ~sd(., na.rm = TRUE)
           ))
  )), sep="\n")
cat("=== Sample of Final Combined Dataset ===\n", capture.output(head(indigenous_combined)), sep="\n")
compare_columns_values <- function(df) {
  cat("=== Column Comparison Analysis ===\n")
  cols <- names(df)
  column_status <- character(length(cols))
  names(column_status) <- cols

  safe_compare <- function(x, y) {
    x_char <- as.character(x)
    y_char <- as.character(y)
    matches <- sum(x_char == y_char, na.rm = TRUE)
    na_matches <- sum(is.na(x) & is.na(y))
    total_matches <- matches + na_matches
    return(list(
      matches = matches,
      na_matches = na_matches,
      total_matches = total_matches,
      total_rows = length(x)
    ))
  }

  duplicates <- list()

  for(i in 1:(length(cols)-1)) {
    col1_name <- cols[i]

    for(j in (i+1):length(cols)) {
      col2_name <- cols[j]
      comparison <- safe_compare(df[[col1_name]], df[[col2_name]])
      match_percent <- (comparison$total_matches / comparison$total_rows) * 100

      if(match_percent > 99) {
        duplicates[[length(duplicates) + 1]] <- col2_name
        column_status[col1_name] <- sprintf("Duplicate match with %s", col2_name)
        column_status[col2_name] <- sprintf("Will be removed (duplicate of %s)", col1_name)

        cat(sprintf("\n* Duplicate columns found: '%s' and '%s'\n", col1_name, col2_name))
        cat(sprintf("  - Data types: %s and %s\n", class(df[[col1_name]])[1], class(df[[col2_name]])[1]))
        cat(sprintf("  - Match rate: %.2f%%\n", match_percent))
        cat(sprintf("  - Exact matches: %d out of %d rows\n",
            comparison$matches, comparison$total_rows))
        cat(sprintf("  - NA matches: %d\n", comparison$na_matches))

        cat("\n  - Sample comparison (first 5 rows):\n")
        sample_df <- head(data.frame(
          col1 = as.character(df[[col1_name]]),
          col2 = as.character(df[[col2_name]])
        ), 5)
        names(sample_df) <- c(col1_name, col2_name)
        print(sample_df)
        cat("\n")
      }
    }
  }

  if(length(duplicates) > 0) {
    cat("\nColumns identified for removal:", paste(unique(unlist(duplicates)), collapse = ", "), "\n")
  }

  return(invisible(duplicates))
}

duplicate_cols <- compare_columns_values(indigenous_combined)
indigenous_tidy <- indigenous_combined %>%
  pivot_longer(
    cols = c(preschool, primary, secondary, tertiary, other, not_stated),
    names_to = "education_level",
    values_to = "attendance_count"
  ) %>%
  select(-all_of(unique(unlist(duplicate_cols)))) %>%
  mutate(
    education_level = factor(
      education_level,
      levels = c("preschool", "primary", "secondary", "tertiary", "other", "not_stated")
    ),
    across(where(is.character), tolower)
  )

cat("=== Transformation Summary ===\n",
    capture.output(
      cat("• Removed Column: ", paste(unique(unlist(duplicate_cols))), "\n",
          "• Remaining Columns: ", paste(names(indigenous_tidy), collapse = ", "), "\n")
    ), sep="\n")
cat("=== Data Distribution ===\n",
    capture.output(
      cat("• Education Level Counts:\n"),
      print(table(indigenous_tidy$education_level))
    ), sep="\n")
cat("=== Row Count Validation ===\n",
    capture.output(
      cat("• Original Rows: ", nrow(indigenous_combined), "\n",
          "• Final Tidy Rows: ", nrow(indigenous_tidy), "\n",
          "• Expected Rows: ", nrow(indigenous_combined) * 6,
          " (", nrow(indigenous_combined), " × 6 education levels)\n")
    ), sep="\n")
cat("=== Sample Output ===\n",
    capture.output(
      print(head(indigenous_tidy[c("STATE", "education_level", "attendance_count")], 3))
    ), sep="\n")
indigenous_tidy_enhanced <- indigenous_tidy %>%
  group_by(STATE) %>%
  mutate(
    state_total_attendance = sum(attendance_count, na.rm = TRUE)
  ) %>%
  mutate(
    attendance_proportion = attendance_count / state_total_attendance,
    high_indigenous_control = (IND_OWN == 1 | IND_MNG == 1),

    management_intensity = IND_OWN + IND_MNG + IND_COMNG
  ) %>%
  group_by(STATE, education_level) %>%
  mutate(
    avg_attendance_by_level = mean(attendance_count, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    attendance_category = case_when(
      attendance_count == 0 ~ "no attendance",
      attendance_count < quantile(attendance_count, 0.25, na.rm = TRUE) ~ "low",
      attendance_count < quantile(attendance_count, 0.75, na.rm = TRUE) ~ "medium",
      TRUE ~ "high"
    ),
    attendance_category = factor(attendance_category,
                               levels = c("no attendance", "low", "medium", "high"))
  )

cat("=== New Variable Summaries ===",
    capture.output(
      summary(indigenous_tidy_enhanced[c("attendance_proportion", "management_intensity",
                                       "attendance_category")])
    ), sep="\n")
cat("=== Management Intensity Distribution ===",
    capture.output(
      print(table(indigenous_tidy_enhanced$management_intensity))
    ), sep="\n")
cat("=== Attendance Categories by Education Level ===",
    capture.output(
      print(table(indigenous_tidy_enhanced$education_level,
                 indigenous_tidy_enhanced$attendance_category))
    ), sep="\n")
missing_value_analysis <- function(df) {
  missing_stats <- sapply(df, function(x) sum(is.na(x)))
  missing_percent <- round((missing_stats / nrow(df)) * 100, 2)

  missing_df <- data.frame(
    Variable = names(missing_stats),
    Missing_Count = missing_stats,
    Missing_Percent = missing_percent
  )

  missing_df <- missing_df[order(-missing_df$Missing_Count), ]
  return(missing_df[missing_df$Missing_Count > 0, ])
}

cat("=== Data Quality Analysis ===",
    capture.output({
      cat("\nMissing Values:\n")
      print(missing_value_analysis(indigenous_tidy_enhanced))
      cat("\nData Integrity Issues:\n")
      neg_counts <- sum(indigenous_tidy_enhanced$attendance_count < 0, na.rm = TRUE)
      invalid_props <- sum(indigenous_tidy_enhanced$attendance_proportion > 1 |
                         indigenous_tidy_enhanced$attendance_proportion < 0, na.rm = TRUE)
      invalid_mgmt <- sum(indigenous_tidy_enhanced$management_intensity > 3 |
                        indigenous_tidy_enhanced$management_intensity < 0, na.rm = TRUE)

      cat("- Negative attendance counts:", neg_counts, "\n")
      cat("- Invalid proportions:", invalid_props, "\n")
      cat("- Invalid management scores:", invalid_mgmt, "\n")
    }), sep="\n")
indigenous_tidy_clean <- indigenous_tidy_enhanced %>%
  mutate(
    IND_FDES = case_when(
      is.na(IND_FDES) ~ "unspecified forest description",
      TRUE ~ IND_FDES
    )
  )

cat("=== Missing Value Resolution ===\n",
    capture.output({
      remaining_missing <- sum(is.na(indigenous_tidy_clean$IND_FDES))
      cat("Remaining missing values in IND_FDES:", remaining_missing, "\n")
    }), sep="\n")
indigenous_tidy_clean <- indigenous_tidy_clean %>%
  mutate(
    attendance_count_clean = ifelse(
      attendance_count > quantile(attendance_count, 0.75) + 1.5*IQR(attendance_count) |
      attendance_count < quantile(attendance_count, 0.25) - 1.5*IQR(attendance_count),
      median(attendance_count),
      attendance_count
    ),

    avg_attendance_by_level_clean = ifelse(
      avg_attendance_by_level > quantile(avg_attendance_by_level, 0.75) + 1.5*IQR(avg_attendance_by_level) |
      avg_attendance_by_level < quantile(avg_attendance_by_level, 0.25) - 1.5*IQR(avg_attendance_by_level),
      median(avg_attendance_by_level),
      avg_attendance_by_level
    ),

    attendance_proportion_clean = ifelse(
      attendance_proportion > quantile(attendance_proportion, 0.95) |
      attendance_proportion < quantile(attendance_proportion, 0.05),
      median(attendance_proportion),
      attendance_proportion
    )
  )

cat("=== Outlier Resolution Summary ===",
    capture.output({
      for(col_pair in list(
        c("attendance_count", "attendance_count_clean"),
        c("attendance_proportion", "attendance_proportion_clean"),
        c("avg_attendance_by_level", "avg_attendance_by_level_clean")
      )) {
        orig_col <- col_pair[1]
        clean_col <- col_pair[2]

        cat("\nVariable:", clean_col, "\n")
        cat("Original range:",
            sprintf("%.2f to %.2f\n",
                   min(indigenous_tidy_clean[[orig_col]], na.rm = TRUE),
                   max(indigenous_tidy_clean[[orig_col]], na.rm = TRUE)))
        cat("Cleaned range:",
            sprintf("%.2f to %.2f\n",
                   min(indigenous_tidy_clean[[clean_col]], na.rm = TRUE),
                   max(indigenous_tidy_clean[[clean_col]], na.rm = TRUE)))

        modified <- sum(indigenous_tidy_clean[[orig_col]] !=
                       indigenous_tidy_clean[[clean_col]], na.rm = TRUE)
        cat("Values modified:", modified,
            sprintf("(%.1f%%)\n", modified/nrow(indigenous_tidy_clean)*100))
      }
    }), sep="\n")
indigenous_transformed <- indigenous_tidy_clean %>%
  mutate(
    attendance_count_log = log(attendance_count_clean + 1)
  )
cat("=== Distribution Analysis ===",
    capture.output({
      cat("\nOriginal Attendance Distribution:\n")
      print(summary(indigenous_transformed$attendance_count_clean))
      cat("Skewness:", round(moments::skewness(indigenous_transformed$attendance_count_clean), 3), "\n")
      cat("Standard Deviation:", round(sd(indigenous_transformed$attendance_count_clean), 3), "\n")

      q_orig <- quantile(indigenous_transformed$attendance_count_clean,
                        probs = c(0.25, 0.75))
      cat("Q3/Q1 Ratio:", round(q_orig[2]/q_orig[1], 3), "\n")

      cat("\nLog Transformed Distribution:\n")
      print(summary(indigenous_transformed$attendance_count_log))
      cat("Skewness:", round(moments::skewness(indigenous_transformed$attendance_count_log), 3), "\n")
      cat("Standard Deviation:", round(sd(indigenous_transformed$attendance_count_log), 3), "\n")

      q_trans <- quantile(indigenous_transformed$attendance_count_log,
                         probs = c(0.25, 0.75))
      cat("Q3/Q1 Ratio:", round(q_trans[2]/q_trans[1], 3), "\n")
    }), sep="\n")