Synthetic Data Generation

Fair warning: These posts are super dry because they are academic papers. Sorry!

A Synthetic Analysis of Ride-Sharing Service Data

Kip Jordan

A statistical analysis conducted as part of the Data Science Graduate Certificate at RMIT University, Melbourne, Australia

Project Overview

This project demonstrates techniques for generating and working with synthetic data in R, using a ride-sharing service scenario as an example. The aim was to create realistic, interconnected datasets that could be used to practise data wrangling and analysis techniques.

Key aspects of this project include:

  • Generation of synthetic data with realistic properties and relationships
  • Creation of multiple related datasets that can be merged and analysed
  • Introduction of common data quality issues like missing values and outliers
  • Implementation of data preprocessing and cleaning techniques

The project serves as a practical example of how to create test data that maintains logical consistency whilst incorporating real-world data characteristics. This synthetic data generation approach is valuable for:

  • Testing data processing pipelines
  • Developing and validating analysis methods
  • Teaching and learning data science concepts
  • Prototyping solutions without using sensitive production data

Data Description

I created synthetic data to simulate Uber's ride-sharing service operations, generating two primary datasets that represent typical ride-sharing data structures:

Driver Dataset (50 records):
- driver_id: Unique identifier for each driver
- driver_name: Randomly generated driver names
- vehicle_type: Vehicle categories (Sedan, SUV, Premium)
- driver_rating: Rating scores between 3.5 and 5.0
- join_date: Dates throughout 2023 when drivers joined the platform

Ride Details Dataset (75 records):
- ride_id: Unique identifier for each ride
- driver_id: Links to the driver dataset
- trip_distance: Distance in kilometres (correlated with fare)
- fare_amount: Trip cost (based on distance plus base rate)
- payment_method: Payment type (Card, Cash, Digital Wallet)
- ride_date: Dates in early 2024 when rides occurred

The data includes realistic features such as:
- Correlated variables (trip distance and fare amount)
- Missing values (3 in each dataset)
- Outliers in numeric variables (ratings, fares, and distances)
- A mix of data types (numeric, categorical, dates)
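As a sketch of how such a correlated fare can be generated (the $5 base rate and $2/km rate below match the generation script at the end of this post; the seed here is arbitrary):

```r
set.seed(42)  # arbitrary seed for this illustration
n_rides <- 75
trip_distance <- runif(n_rides, 2, 20)        # uniform distances between 2 and 20 km
fare_amount <- 5 + 2 * trip_distance +        # base rate plus per-km rate
  rnorm(n_rides, mean = 0, sd = 2)            # Gaussian noise so the fit isn't perfect
cor(trip_distance, fare_amount)               # strongly positive by construction
```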

Sample rows from Driver Dataset:

driver_id driver_name vehicle_type driver_rating join_date
1 S9700 SUV 3.76 2023-06-24
2 J3582 Premium 4.27 2023-11-17
3 Y9938 Premium 3.95 2023-08-26
4 Q8007 SUV 4.20 2023-05-17
5 R1410 SUV 3.58 2023-01-19
6 R9173 SUV 3.97 2023-12-31

Sample rows from Ride Details Dataset:

ride_id driver_id trip_distance fare_amount payment_method ride_date
1 46 6.35 16.94 Cash 2024-01-18
2 36 18.42 36.87 Card 2024-02-08
3 45 18.97 43.86 Card 2024-02-24
4 49 15.69 39.18 Cash 2024-01-16
5 15 9.06 20.09 Cash 2024-01-30
6 47 4.70 12.66 Card 2024-01-11

Merge

I merged the two datasets using the common driver_id field as the key, performing a left join so that all rides were retained while being matched with their corresponding driver information. This is the logical choice for a ride-sharing service, where every ride must have an associated driver.
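A minimal sketch of this join using base R's merge() (the full script later uses the equivalent dplyr::left_join()); the toy names and fares here are illustrative only:

```r
drivers <- data.frame(driver_id   = 1:3,
                      driver_name = c("A100", "B200", "C300"))
rides   <- data.frame(ride_id   = 1:4,
                      driver_id = c(1, 1, 3, 2),
                      fare      = c(12, 9, 20, 15))
# all.x = TRUE keeps every ride even if a driver were missing: a left join
merged <- merge(rides, drivers, by = "driver_id", all.x = TRUE)
nrow(merged)  # 4 rows: one per ride, each with its driver's details attached
```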

The columns were then reordered to improve readability, placing:
- Identifying information first (ride_id, driver_id, driver_name)
- Driver characteristics (vehicle_type, rating)
- Ride-specific details (distance, fare, payment_method, and dates)

The merged dataset contains 75 rows (matching our ride records) and 10 columns, combining all relevant information from both source datasets. This merged dataset was exported to Excel for documentation purposes.

Input dataset dimensions:
- Rides dataset: 75 rows by 6 columns
- Drivers dataset: 50 rows by 5 columns

Output dataset dimensions:
- Merged dataset: 75 rows by 10 columns

Sample rows from the merged dataset:

ride_id driver_id driver_name vehicle_type driver_rating trip_distance fare_amount payment_method ride_date join_date
1 46 M9325 Premium 4.92 6.35 16.94 Cash 2024-01-18 2023-08-17
2 36 O3996 SUV NA 18.42 36.87 Card 2024-02-08 2023-06-06
3 45 E3108 Sedan 3.56 18.97 43.86 Card 2024-02-24 2023-05-18
4 49 I8972 Sedan 4.34 15.69 39.18 Cash 2024-01-16 2023-09-23
5 15 P1728 Sedan 3.79 9.06 20.09 Cash 2024-01-30 2023-05-07
6 47 K2411 SUV 4.38 4.70 12.66 Card 2024-01-11 2023-10-28

Understand

I examined the merged dataset to understand its structure and confirm that appropriate data types were assigned. Initial inspection revealed a mix of character and numeric variables, so I converted the categorical variables (vehicle_type and payment_method) to factors to improve memory efficiency and ensure proper categorical handling in subsequent analyses.

Date fields were verified to confirm proper formatting and logical ranges, with ride dates falling in early 2024 and join dates throughout 2023. The data structure inspection confirmed 10 variables with appropriate types for their content.

Variable summary:

- ride_id (integer): Min 1.0, 1st Qu. 19.5, Median 38.0, Mean 38.0, 3rd Qu. 56.5, Max 75.0
- driver_id (integer): Min 2.00, 1st Qu. 15.00, Median 23.00, Mean 25.36, 3rd Qu. 37.50, Max 50.00
- driver_name (character): 75 values
- vehicle_type (factor, 3 levels): "Premium", "Sedan", "SUV"
- driver_rating (numeric): Min 3.560, 1st Qu. 3.825, Median 4.310, Mean 4.281, 3rd Qu. 4.718, Max 4.980, 9 NA's
- trip_distance (numeric): Min 2.500, 1st Qu. 6.965, Median 11.470, Mean 11.901, 3rd Qu. 16.025, Max 50.000
- fare_amount (numeric): Min 8.90, 1st Qu. 17.20, Median 28.43, Mean 29.24, 3rd Qu. 37.57, Max 150.00, 3 NA's
- payment_method (factor, 3 levels): "Card", "Cash", "Digital Wallet"
- ride_date (Date): Min 2024-01-01, 1st Qu. 2024-01-15, Median 2024-01-29, Mean 2024-01-28, 3rd Qu. 2024-02-09, Max 2024-03-01
- join_date (Date): Min 2023-01-24, 1st Qu. 2023-05-08, Median 2023-06-26, Mean 2023-07-13, 3rd Qu. 2023-08-19, Max 2023-12-31

Manipulate Data

To support further analysis, I created several derived variables from the existing data. I calculated a driver experience metric as the number of days between each ride_date and the driver's join_date, providing insight into driver tenure at the time of each ride.

I also analysed fare efficiency through a fare-per-kilometre calculation, and computed estimated trip durations assuming an average urban speed of 30 km/h.
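These derivations can be sketched as below, using the first two sample rides from the merged data shown earlier:

```r
rides <- data.frame(
  ride_date     = as.Date(c("2024-01-18", "2024-02-08")),
  join_date     = as.Date(c("2023-08-17", "2023-06-06")),
  fare_amount   = c(16.94, 36.87),
  trip_distance = c(6.35, 18.42)
)
# days between joining the platform and the ride
rides$driver_experience_days  <- as.numeric(rides$ride_date - rides$join_date)
# fare efficiency in dollars per kilometre
rides$fare_per_km             <- round(rides$fare_amount / rides$trip_distance, 2)
# duration assuming a 30 km/h average urban speed
rides$estimated_duration_mins <- round(rides$trip_distance / 30 * 60)
rides$driver_experience_days  # 154 and 247 days
```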

Here's a sample of the enhanced dataset with the derived variables:

ride_id driver_id driver_name vehicle_type driver_rating trip_distance fare_amount payment_method ride_date join_date driver_experience_days fare_per_km estimated_duration_mins
1 46 M9325 Premium 4.92 6.35 16.94 Cash 2024-01-18 2023-08-17 154 2.67 13
2 36 O3996 SUV NA 18.42 36.87 Card 2024-02-08 2023-06-06 247 2.00 37
3 45 E3108 Sedan 3.56 18.97 43.86 Card 2024-02-24 2023-05-18 282 2.31 38
4 49 I8972 Sedan 4.34 15.69 39.18 Cash 2024-01-16 2023-09-23 115 2.50 31
5 15 P1728 Sedan 3.79 9.06 20.09 Cash 2024-01-30 2023-05-07 268 2.22 18
6 47 K2411 SUV 4.38 4.70 12.66 Card 2024-01-11 2023-10-28 75 2.69 9

Summary statistics for the key derived variables:

Statistic driver_experience_days fare_per_km estimated_duration_mins
Min. 20.0 0.450 5.00
1st Qu. 160.5 2.288 14.00
Median 210.0 2.420 23.00
Mean 199.8 2.688 23.83
3rd Qu. 264.5 2.703 32.00
Max. 382.0 7.570 100.00
NA's 0 3 0

Scan I

I systematically examined the dataset for missing values, revealing gaps in both driver ratings and fare amounts. Missing fares accounted for just 4% of rides; missing ratings affected 12% of rows after the merge, because each of the three drivers with a missing rating appears in multiple ride records.

To maintain data integrity, I replaced missing numeric values with their respective column medians, a common approach that preserves the overall distribution of the data.
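A minimal sketch of median imputation on a toy fare vector (values illustrative):

```r
fares <- c(8.9, 17.2, NA, 28.4, 37.6)
med <- median(fares, na.rm = TRUE)               # median of the observed values: 22.8
fares_imputed <- ifelse(is.na(fares), med, fares)
anyNA(fares_imputed)                             # FALSE: the gap is filled
```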

Missing Values Analysis
Column Number of Missing Values Percentage Missing
driver_rating 9 12%
fare_amount 3 4%
fare_per_km 3 4%
Rows with Missing Values:
ride_id driver_id driver_name vehicle_type driver_rating trip_distance fare_amount payment_method ride_date join_date driver_experience_days fare_per_km estimated_duration_mins
2 36 O3996 SUV NA 18.42 36.87 Card 2024-02-08 2023-06-06 247 2.00 37
13 28 S6114 Sedan NA 50.00 22.72 Card 2024-02-16 2023-09-27 142 0.45 100
14 8 Z6948 Sedan NA 8.89 18.13 Card 2024-01-14 2023-12-07 38 2.04 18
17 9 X1450 SUV 4.91 9.92 NA Card 2024-01-29 2023-08-08 174 NA 20
19 8 Z6948 Sedan NA 19.55 45.63 Cash 2024-01-28 2023-12-07 52 2.33 39
22 8 Z6948 Sedan NA 19.65 42.91 Cash 2024-02-01 2023-12-07 56 2.18 39
37 24 Q8500 Sedan 4.73 8.92 NA Cash 2024-01-08 2023-05-08 245 NA 18
43 8 Z6948 Sedan NA 7.59 16.52 Card 2024-01-10 2023-12-07 34 2.18 15
50 8 Z6948 Sedan NA 14.73 36.20 Digital Wallet 2024-01-11 2023-12-07 35 2.46 29
58 36 O3996 SUV NA 17.26 39.95 Digital Wallet 2024-01-27 2023-06-06 235 2.31 35
60 28 S6114 Sedan NA 3.37 15.98 Digital Wallet 2024-02-08 2023-09-27 134 4.74 7
65 14 W3657 Sedan 3.64 13.32 NA Digital Wallet 2024-01-17 2023-04-21 271 NA 27
After Cleaning - Remaining Missing Values:
Column Missing Values
ride_id 0
driver_id 0
driver_name 0
vehicle_type 0
driver_rating 0
trip_distance 0
fare_amount 0
payment_method 0
ride_date 0
join_date 0
driver_experience_days 0
fare_per_km 0
estimated_duration_mins 0

Scan II

I performed outlier detection using multiple methodologies, including the Interquartile Range (IQR) method and z-scores, and generated box plots for visual inspection of all numeric variables. I identified extreme values in fare amounts and trip distances, with some fares exceeding three standard deviations from the mean. These outliers were handled by capping values at three standard deviations from the mean, balancing the need to maintain data integrity with minimising the impact of extreme values.
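The capping rule can be expressed as a small helper, equivalent to the scale()-based mutate in the full script (the toy fare values are illustrative):

```r
cap_at_3sd <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  pmin(pmax(x, m - 3 * s), m + 3 * s)  # clamp both tails at mean +/- 3 SD
}
fares  <- c(rep(25, 20), 150)  # typical fares plus one extreme value
capped <- cap_at_3sd(fares)
max(capped)                    # the $150 fare is pulled back toward the bulk
```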

The boxplot visualisations reveal several important patterns in our ride-sharing data:

Driver Ratings (3.5-4.8):
- Most drivers maintain ratings between 4.0 and 4.7
- The median rating is approximately 4.3
- There are no significant low-rating outliers, suggesting consistent service quality

Trip Distance (2-50km):
- Most trips fall within 5-15km range
- There's a notable outlier at around 50km, representing unusually long journeys
- The distribution is right-skewed, typical for urban ride-sharing services

Fare Amount:
- The core fare range clusters between $20-40
- There's one extreme outlier at $150; notably, it belongs to a mid-length trip, while the 50km journey actually has an unusually low fare
- The distribution shows the expected correlation with trip distance

Fare per Kilometre:
- Most fares cluster tightly around $2.50 per km
- Several high outliers between $4-7 per km suggest premium pricing scenarios
- One low outlier below $1 per km (the 50km trip) might indicate a promotional fare

Estimated Duration:
- Typical rides last between 15-30 minutes
- The outlier at 100 minutes corresponds to the long-distance journey
- The distribution aligns with expected urban travel patterns

These patterns informed our decision to cap values at three standard deviations from the mean, preserving the overall data structure while managing extreme values appropriately.

Figure 5

Detection Method Variable Number of Outliers
IQR Method driver_rating 0
trip_distance 1
fare_amount 1
fare_per_km 12
estimated_duration_mins 1
Z-Score > 3 driver_rating 0
trip_distance 1
fare_amount 1
fare_per_km 1
estimated_duration_mins 1
Rows with extreme fare_amount or trip_distance:
ride_id trip_distance fare_amount fare_per_km
13 50.00 22.72 0.45
68 19.82 150.00 7.57
Outliers after handling (IQR method):

Transform

I applied log transformations to the fare-related and distance variables to address their right-skewed distributions, choosing log(x + 1) to handle potential zero values while improving normality. Comparing the original and transformed distributions confirmed improved symmetry and reduced skewness, particularly for the fare and distance variables.
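A quick sketch of how log(x + 1) tames a skewed fare vector (toy values):

```r
fares <- c(8.9, 17.2, 28.4, 37.6, 150)  # right-skewed: one extreme fare
log_fares <- log(fares + 1)             # the + 1 keeps a zero fare finite
# the extreme value sits ~5x above the median on the raw scale,
# but only ~1.5x above it after the transformation
round(log_fares, 2)
```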

The histogram comparisons between original and log-transformed distributions reveal significant improvements in data normality:

Fare Distribution:
- Original: Shows right-skewed distribution with fares concentrated between $10-40, and sparse high-value outliers up to $80
- Log-transformed: Achieves more symmetric distribution centred around 3.5, making patterns more interpretable
- The transformation particularly helps in managing the long right tail of high-value fares

Distance Distribution:
- Original: Exhibits positive skew with most trips under 20km and rare long-distance outliers
- Log-transformed: Creates more balanced distribution centred around 2.5, suggesting log-normal trip patterns
- Better represents the natural distribution of urban journey distances

Fare per KM Distribution:
- Original: Highly concentrated around $2-3/km with several high-value outliers
- Log-transformed: Produces more normal distribution around 1.0-1.5
- Helps identify pricing anomalies more effectively

The log transformations successfully:
- Reduce the impact of extreme values without removing them
- Create more symmetrical distributions suitable for statistical analysis
- Maintain the underlying relationships in the data while improving interpretability

Figure 6

Statistic Original Fare Log Fare Original Distance Log Distance
Min. 8.90 2.293 2.500 1.253
1st Qu. 17.71 2.929 6.965 2.075
Median 28.43 3.382 11.470 2.523
Mean 28.31 3.285 11.674 2.413
3rd Qu. 37.13 3.641 16.025 2.835
Max. 82.77 4.428 33.008 3.527

Summary Statistics

I generated comprehensive summary statistics to understand key business metrics across different segments of the ride-sharing service. The breakdown by vehicle type reveals differences in average fares and trip distances, while the analysis of payment methods shows clear preferences among payment options. Overall statistics provide insights into total revenue, average fares, and driver performance metrics.
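The segment summaries can be sketched with base R's tapply() (the full script uses dplyr's group_by()/summarise(); the toy fares below are illustrative):

```r
rides <- data.frame(
  vehicle_type = c("Sedan", "Sedan", "SUV", "Premium"),
  fare_amount  = c(20, 30, 35, 60)
)
avg_fare <- tapply(rides$fare_amount, rides$vehicle_type, mean)  # per-segment mean
revenue  <- tapply(rides$fare_amount, rides$vehicle_type, sum)   # per-segment total
avg_fare[["Sedan"]]  # 25
```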

The visualisations highlight the distribution of fares across vehicle types and the relative frequency of different payment methods, offering clear insights into operational patterns.

The visualisations reveal several key insights about our ride-sharing service:

Fare Distribution by Vehicle Type:
- Premium vehicles command higher median fares, reflecting their luxury positioning
- SUVs show moderate fare levels with less variability than Premium vehicles
- Sedan fares are most consistent, clustering around the lower price range
- All vehicle types show some outliers for longer or premium-priced journeys

Payment Method Preferences:
- Digital Wallet emerges as the preferred payment method, suggesting strong digital adoption
- Card payments maintain a strong second position, indicating customer preference for cashless transactions
- Cash payments, while less common, are still significant, showing the importance of maintaining payment flexibility

These patterns suggest a well-segmented service with clear price differentiation across vehicle types and strong digital payment adoption among users.

Figure 7

Figure 8

Summary by Vehicle Type:

Vehicle Type Total Rides Avg Fare Avg Distance Avg Rating Total Revenue
Sedan 43 28.7 12.0 4.19 1,232.00
SUV 19 29.8 12.1 4.29 566.00
Premium 13 24.9 9.83 4.58 324.00

Summary by Payment Method:

Payment Method Total Rides Avg Fare Total Revenue Percentage
Digital Wallet 26 28.9 750.00 34.7
Card 25 29.1 727.00 33.3
Cash 24 26.9 646.00 32.0

Overall Statistics:

Metric Value
Total Rides 75.00
Total Drivers 35.00
Total Revenue 2,123.03
Average Fare 28.31
Average Distance 11.67
Average Rating 4.28

References

CRAN (2024) The Comprehensive R Archive Network, viewed November 2024, https://cran.r-project.org/.

R Core Team (2024) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, viewed November 2024, https://www.R-project.org/.

Wickham, H, François, R, Henry, L & Müller, K (2024) dplyr: A Grammar of Data Manipulation, R package version 1.1.4, viewed November 2024, https://CRAN.R-project.org/package=dplyr.

Wickham, H (2016) ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, viewed November 2024, https://ggplot2.tidyverse.org.

Wickham, H & Henry, L (2024) tidyr: Tidy Messy Data, R package version 1.3.0, viewed November 2024, https://CRAN.R-project.org/package=tidyr.

Grolemund, G & Wickham, H (2011) lubridate: Make Dealing with Dates a Little Easier, R package version 1.9.3, viewed November 2024, https://CRAN.R-project.org/package=lubridate.

setwd(paste0("C:/Users/kipjo/OneDrive/Documents/GitHub/",
             "rmit_gradcert_data_science/data_wrangling/assignment2"))

library(dplyr)      # Data manipulation
library(tidyr)      # Data tidying
library(lubridate)  # Date handling
library(ggplot2)    # Visualisations
library(readr)      # Data reading/writing
library(openxlsx)   # Excel file writing
set.seed(3229642)

n_drivers <- 50
drivers_df <- data.frame(
  driver_id = 1:n_drivers,
  driver_name = paste0(sample(LETTERS, n_drivers, replace=TRUE),
                      sample(1000:9999, n_drivers)),
  vehicle_type = sample(c("Sedan", "SUV", "Premium"), n_drivers, replace=TRUE),
  driver_rating = round(runif(n_drivers, 3.5, 5), 2),
  join_date = as.Date("2023-01-01") + sample(0:365, n_drivers, replace=TRUE),
  stringsAsFactors = FALSE
)

n_rides <- 75

base_distance <- runif(n_rides, 2, 20)

base_rate <- 5
per_km_rate <- 2
fare_noise <- rnorm(n_rides, mean=0, sd=2)
fare_amount <- base_rate + (base_distance * per_km_rate) + fare_noise

rides_df <- data.frame(
  ride_id = 1:n_rides,
  driver_id = sample(1:n_drivers, n_rides, replace=TRUE),
  trip_distance = round(base_distance, 2),
  fare_amount = round(fare_amount, 2),
  payment_method = sample(c("Card", "Cash", "Digital Wallet"), n_rides, replace=TRUE),
  ride_date = as.Date("2024-01-01") + sample(0:60, n_rides, replace=TRUE)
)

drivers_df$driver_rating[sample(1:n_drivers, 3)] <- NA
rides_df$fare_amount[sample(1:n_rides, 3)] <- NA

drivers_df$driver_rating[sample(1:n_drivers, 1)] <- 1.5

rides_df$fare_amount[sample(1:n_rides, 1)] <- 150
rides_df$trip_distance[sample(1:n_rides, 1)] <- 50

head(drivers_df)
head(rides_df)

dir.create("outputs", showWarnings = FALSE)

write.xlsx(drivers_df, "outputs/drivers_dataset.xlsx")
write.xlsx(rides_df, "outputs/rides_dataset.xlsx")

cat("Files written successfully:\n",
    "- Drivers dataset: outputs/drivers_dataset.xlsx\n",
    "- Rides dataset: outputs/rides_dataset.xlsx\n")
cat("Input dataset dimensions:\n",
    "- Rides dataset:", dim(rides_df)[1], "rows by", dim(rides_df)[2], "columns\n",
    "- Drivers dataset:", dim(drivers_df)[1], "rows by", dim(drivers_df)[2], "columns\n\n")

merged_df <- rides_df %>%
  left_join(drivers_df, by = "driver_id")

merged_df <- merged_df %>%
  select(ride_id, driver_id, driver_name, vehicle_type, driver_rating,
         trip_distance, fare_amount, payment_method, ride_date, join_date)

cat("Output dataset dimensions:\n",
    "- Merged dataset:", dim(merged_df)[1], "rows by", dim(merged_df)[2], "columns\n")

head(merged_df)

write.xlsx(merged_df, "outputs/merged_dataset.xlsx")
str(merged_df)

summary(merged_df)

cat("\nUnique values in vehicle_type:\n")
unique(merged_df$vehicle_type)

cat("\nUnique values in payment_method:\n")
unique(merged_df$payment_method)

merged_df <- merged_df %>%
  mutate(
    vehicle_type = as.factor(vehicle_type),
    payment_method = as.factor(payment_method)
  )

cat("\nDate ranges:\n")
# format() is needed here: cat() would otherwise print Dates as day counts
cat("Ride dates:", format(min(merged_df$ride_date)), "to", format(max(merged_df$ride_date)), "\n")
cat("Join dates:", format(min(merged_df$join_date)), "to", format(max(merged_df$join_date)), "\n")

cat("\nUpdated structure after conversions:\n")
str(merged_df)
merged_df <- merged_df %>%
  mutate(
    driver_experience_days = as.numeric(ride_date - join_date),
    fare_per_km = round(fare_amount / trip_distance, 2),
    estimated_duration_mins = round(trip_distance / 30 * 60, 0)
  )

head(merged_df)

summary(merged_df[c("driver_experience_days", "fare_per_km", "estimated_duration_mins")])
missing_summary <- sapply(merged_df, function(x) sum(is.na(x)))
cat("Number of missing values in each column:\n")
print(missing_summary[missing_summary > 0])

missing_percentage <- sapply(merged_df, function(x) mean(is.na(x)) * 100)
cat("\nPercentage of missing values in each column:\n")
print(missing_percentage[missing_percentage > 0])

rows_with_na <- merged_df[!complete.cases(merged_df), ]
cat("\nRows containing missing values:\n")
print(rows_with_na)

merged_df_clean <- merged_df %>%
  mutate(
    driver_rating = ifelse(is.na(driver_rating),
                          median(driver_rating, na.rm = TRUE),
                          driver_rating),
    fare_amount = ifelse(is.na(fare_amount),
                        median(fare_amount, na.rm = TRUE),
                        fare_amount),
    fare_per_km = ifelse(is.na(fare_per_km),
                        median(fare_per_km, na.rm = TRUE),
                        fare_per_km)
  )

cat("\nRemaining missing values after cleaning:\n")
print(sapply(merged_df_clean, function(x) sum(is.na(x))))
numeric_cols <- c("driver_rating", "trip_distance", "fare_amount",
                 "fare_per_km", "estimated_duration_mins")

identify_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  outliers <- x < lower_bound | x > upper_bound
  return(sum(outliers, na.rm = TRUE))
}

cat("Number of outliers in each numeric column (IQR method):\n")
sapply(merged_df_clean[numeric_cols], identify_outliers)

par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))  # Adjusted margins for larger titles
for(col in numeric_cols) {
  boxplot(merged_df_clean[[col]],
          main = paste("Distribution of", gsub("_", " ", col)),
          ylab = col,
          sub = "Boxplot for outlier detection",
          col = "grey80")  # Grey fill for boxes
}

par(mfrow = c(1, 1))
z_scores <- sapply(merged_df_clean[numeric_cols], scale)
colnames(z_scores) <- paste0(numeric_cols, "_zscore")

extreme_values <- apply(abs(z_scores) > 3, 2, sum, na.rm = TRUE)
cat("\nNumber of extreme values (|z| > 3) in each numeric column:\n")
print(extreme_values)

cat("\nRows with extreme fare_amount or trip_distance:\n")
merged_df_clean[abs(z_scores[, "fare_amount_zscore"]) > 3 |
                abs(z_scores[, "trip_distance_zscore"]) > 3,
                c("ride_id", "trip_distance", "fare_amount", "fare_per_km")]

merged_df_clean <- merged_df_clean %>%
  mutate(across(all_of(numeric_cols),
                ~ ifelse(abs(scale(.)) > 3,
                        sign(scale(.)) * 3 * sd(., na.rm = TRUE) + mean(., na.rm = TRUE),
                        .)))

cat("\nOutliers after handling (IQR method):\n")
sapply(merged_df_clean[numeric_cols], identify_outliers)
merged_df_clean <- merged_df_clean %>%
  mutate(
    log_fare = log(fare_amount + 1),

    log_distance = log(trip_distance + 1),

    log_fare_per_km = log(fare_per_km + 1)
  )

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))  # Adjust margins for better fit

hist(merged_df_clean$fare_amount,
     main = "Original Fare Distribution",
     xlab = "Fare Amount",
     col = "grey80")
hist(merged_df_clean$trip_distance,
     main = "Original Distance Distribution",
     xlab = "Trip Distance",
     col = "grey80")
hist(merged_df_clean$fare_per_km,
     main = "Original Fare per KM Distribution",
     xlab = "Fare per KM",
     col = "grey80")

hist(merged_df_clean$log_fare,
     main = "Log-Transformed Fare Distribution",
     xlab = "Log Fare",
     col = "grey80")
hist(merged_df_clean$log_distance,
     main = "Log-Transformed Distance Distribution",
     xlab = "Log Distance",
     col = "grey80")
hist(merged_df_clean$log_fare_per_km,
     main = "Log-Transformed Fare per KM Distribution",
     xlab = "Log Fare per KM",
     col = "grey80")
par(mfrow = c(1, 1))

summary_stats <- data.frame(
  Original_Fare = summary(merged_df_clean$fare_amount),
  Log_Fare = summary(merged_df_clean$log_fare),
  Original_Distance = summary(merged_df_clean$trip_distance),
  Log_Distance = summary(merged_df_clean$log_distance)
)

print(summary_stats)
vehicle_summary <- merged_df_clean %>%
  group_by(vehicle_type) %>%
  summarise(
    total_rides = n(),
    avg_fare = mean(fare_amount, na.rm = TRUE),
    avg_distance = mean(trip_distance, na.rm = TRUE),
    avg_rating = mean(driver_rating, na.rm = TRUE),
    total_revenue = sum(fare_amount, na.rm = TRUE)
  ) %>%
  arrange(desc(total_rides))

payment_summary <- merged_df_clean %>%
  group_by(payment_method) %>%
  summarise(
    total_rides = n(),
    avg_fare = mean(fare_amount, na.rm = TRUE),
    total_revenue = sum(fare_amount, na.rm = TRUE),
    percentage = n() / nrow(merged_df_clean) * 100
  ) %>%
  arrange(desc(total_rides))

overall_stats <- list(
  total_rides = nrow(merged_df_clean),
  total_drivers = n_distinct(merged_df_clean$driver_id),
  total_revenue = sum(merged_df_clean$fare_amount, na.rm = TRUE),
  avg_fare = mean(merged_df_clean$fare_amount, na.rm = TRUE),
  avg_distance = mean(merged_df_clean$trip_distance, na.rm = TRUE),
  avg_rating = mean(merged_df_clean$driver_rating, na.rm = TRUE)
)

cat("Summary by Vehicle Type:\n")
print(vehicle_summary)

cat("\nSummary by Payment Method:\n")
print(payment_summary)

cat("\nOverall Statistics:\n")
print(data.frame(Metric = names(overall_stats),
                Value = unlist(overall_stats)))

ggplot(merged_df_clean, aes(x = vehicle_type, y = fare_amount)) +
  geom_boxplot(fill = "grey80") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(title = "Fare Distribution by Vehicle Type",
       subtitle = "Premium vehicles show higher median fares with greater variability",
       x = "Vehicle Type",
       y = "Fare Amount ($)")
ggplot(payment_summary, aes(x = payment_method, y = total_rides)) +
  geom_bar(stat = "identity", fill = "grey60") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(title = "Number of Rides by Payment Method",
       subtitle = "Digital Wallet payments are most frequent, followed by Card and Cash",
       x = "Payment Method",
       y = "Total Rides")