Synthetic Data Generation
Fair warning: These posts are super dry because they are academic papers. Sorry!
A Synthetic Analysis of Ride-Sharing Service Data
Kip Jordan
A statistical analysis conducted as part of the Data Science Graduate Certificate at RMIT University, Melbourne, Australia
Project Overview
This project demonstrates techniques for generating and working with synthetic data in R, using a ride-sharing service scenario as an example. The aim was to create realistic, interconnected datasets that could be used to practise data wrangling and analysis techniques.
Key aspects of this project include:
- Generation of synthetic data with realistic properties and relationships
- Creation of multiple related datasets that can be merged and analysed
- Introduction of common data quality issues like missing values and outliers
- Implementation of data preprocessing and cleaning techniques
The project serves as a practical example of how to create test data that maintains logical consistency whilst incorporating real-world data characteristics. This synthetic data generation approach is valuable for:
- Testing data processing pipelines
- Developing and validating analysis methods
- Teaching and learning data science concepts
- Prototyping solutions without using sensitive production data
Data Description
I created synthetic data to simulate the operations of a ride-sharing service such as Uber, generating two primary datasets to represent typical ride-sharing data structures:
Driver Dataset (50 records):
- driver_id: Unique identifier for each driver
- driver_name: Randomly generated driver names
- vehicle_type: Vehicle categories (Sedan, SUV, Premium)
- driver_rating: Rating scores between 3.5 and 5.0
- join_date: Dates throughout 2023 when drivers joined the platform
Ride Details Dataset (75 records):
- ride_id: Unique identifier for each ride
- driver_id: Links to the driver dataset
- trip_distance: Distance in kilometres (correlated with fare)
- fare_amount: Trip cost (based on distance plus base rate)
- payment_method: Payment type (Card, Cash, Digital Wallet)
- ride_date: Dates in early 2024 when rides occurred
The data includes realistic features such as:
- Correlated variables (trip distance and fare amount; see the sketch after this list)
- Missing values (3 in each dataset)
- Outliers in numeric variables (ratings, fares, and distances)
- A mix of data types (numeric, categorical, dates)
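The correlation between distance and fare is built in at generation time: each fare is a base rate plus a per-kilometre charge plus Gaussian noise. The snippet below condenses that logic from the full script in the appendix (run standalone, it won't reproduce the appendix's exact values, since the full script draws the driver data first):

```r
# Condensed from the appendix: fare is a linear function of distance plus
# noise, so trip_distance and fare_amount are correlated by construction
set.seed(3229642)
n_rides <- 75
base_distance <- runif(n_rides, 2, 20)      # trip lengths in km
fare_amount <- 5 + base_distance * 2 +      # $5 base rate + $2 per km
  rnorm(n_rides, mean = 0, sd = 2)          # random noise
cor(base_distance, fare_amount)             # close to 1 by construction
```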
Sample rows from Driver Dataset:
| driver_id | driver_name | vehicle_type | driver_rating | join_date |
|---|---|---|---|---|
| 1 | S9700 | SUV | 3.76 | 2023-06-24 |
| 2 | J3582 | Premium | 4.27 | 2023-11-17 |
| 3 | Y9938 | Premium | 3.95 | 2023-08-26 |
| 4 | Q8007 | SUV | 4.20 | 2023-05-17 |
| 5 | R1410 | SUV | 3.58 | 2023-01-19 |
| 6 | R9173 | SUV | 3.97 | 2023-12-31 |
Sample rows from Ride Details Dataset:
| ride_id | driver_id | trip_distance | fare_amount | payment_method | ride_date |
|---|---|---|---|---|---|
| 1 | 46 | 6.35 | 16.94 | Cash | 2024-01-18 |
| 2 | 36 | 18.42 | 36.87 | Card | 2024-02-08 |
| 3 | 45 | 18.97 | 43.86 | Card | 2024-02-24 |
| 4 | 49 | 15.69 | 39.18 | Cash | 2024-01-16 |
| 5 | 15 | 9.06 | 20.09 | Cash | 2024-01-30 |
| 6 | 47 | 4.70 | 12.66 | Card | 2024-01-11 |
Merge
I merged the two datasets on the common driver_id key, using a left join so that every ride was retained and matched with its driver's information. This is the logical choice for a ride-sharing service, where every ride must have an associated driver.
The columns were then reordered to improve readability, placing:
- Identifying information first (ride_id, driver_id, driver_name)
- Driver characteristics (vehicle_type, rating)
- Ride-specific details (distance, fare, payment_method, and dates)
The merged dataset contains 75 rows (matching our ride records) and 10 columns, combining all relevant information from both source datasets. This merged dataset was exported to Excel for documentation purposes.
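In dplyr the merge and reorder are a single pipeline; this excerpt condenses the relevant lines from the appendix:

```r
library(dplyr)

# Left join keeps all 75 rides and attaches each ride's driver attributes
merged_df <- rides_df %>%
  left_join(drivers_df, by = "driver_id") %>%
  select(ride_id, driver_id, driver_name, vehicle_type, driver_rating,
         trip_distance, fare_amount, payment_method, ride_date, join_date)

dim(merged_df)  # 75 rows, 10 columns
```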
Input dataset dimensions:
- Rides dataset: 75 rows by 6 columns
- Drivers dataset: 50 rows by 5 columns
Output dataset dimensions:
- Merged dataset: 75 rows by 10 columns
| ride_id | driver_id | driver_name | vehicle_type | driver_rating | trip_distance | fare_amount | payment_method | ride_date | join_date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 46 | M9325 | Premium | 4.92 | 6.35 | 16.94 | Cash | 2024-01-18 | 2023-08-17 |
| 2 | 36 | O3996 | SUV | NA | 18.42 | 36.87 | Card | 2024-02-08 | 2023-06-06 |
| 3 | 45 | E3108 | Sedan | 3.56 | 18.97 | 43.86 | Card | 2024-02-24 | 2023-05-18 |
| 4 | 49 | I8972 | Sedan | 4.34 | 15.69 | 39.18 | Cash | 2024-01-16 | 2023-09-23 |
| 5 | 15 | P1728 | Sedan | 3.79 | 9.06 | 20.09 | Cash | 2024-01-30 | 2023-05-07 |
| 6 | 47 | K2411 | SUV | 4.38 | 4.70 | 12.66 | Card | 2024-01-11 | 2023-10-28 |
Understand
I examined the merged dataset to understand its structure and ensure appropriate data types were assigned. The initial inspection revealed a mix of character and numeric variables, so I converted the categorical variables (vehicle_type and payment_method) to factors to improve memory efficiency and ensure they are handled correctly in analyses.
Date fields were verified to confirm proper formatting and logical ranges, with ride dates falling in early 2024 and join dates throughout 2023. The data structure inspection confirmed 10 variables with appropriate types for their content.
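Both steps are short; this condenses the appendix code, with format() added so that cat() prints the dates themselves rather than their underlying day counts:

```r
# Convert the two categorical columns to factors
merged_df <- merged_df %>%
  mutate(
    vehicle_type   = as.factor(vehicle_type),
    payment_method = as.factor(payment_method)
  )

# Verify the date ranges; format() stops cat() printing Dates as numbers
cat("Ride dates:", format(min(merged_df$ride_date)), "to",
    format(max(merged_df$ride_date)), "\n")
```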
| Variable | Type | Summary Statistics |
|---|---|---|
| ride_id | Integer | Min 1.0; 1st Qu 19.5; Median 38.0; Mean 38.0; 3rd Qu 56.5; Max 75.0 |
| driver_id | Integer | Min 2.00; 1st Qu 15.00; Median 23.00; Mean 25.36; 3rd Qu 37.50; Max 50.00 |
| driver_name | Character | Length 75 |
| vehicle_type | Factor (3 levels) | Levels: "Premium", "Sedan", "SUV" |
| driver_rating | Numeric | Min 3.560; 1st Qu 3.825; Median 4.310; Mean 4.281; 3rd Qu 4.718; Max 4.980; NA's 9 |
| trip_distance | Numeric (km) | Min 2.500; 1st Qu 6.965; Median 11.470; Mean 11.901; 3rd Qu 16.025; Max 50.000 |
| fare_amount | Numeric ($) | Min 8.90; 1st Qu 17.20; Median 28.43; Mean 29.24; 3rd Qu 37.57; Max 150.00; NA's 3 |
| payment_method | Factor (3 levels) | Levels: "Card", "Cash", "Digital Wallet" |
| ride_date | Date | Min 2024-01-01; 1st Qu 2024-01-15; Median 2024-01-29; Mean 2024-01-28; 3rd Qu 2024-02-09; Max 2024-03-01 |
| join_date | Date | Min 2023-01-24; 1st Qu 2023-05-08; Median 2023-06-26; Mean 2023-07-13; 3rd Qu 2023-08-19; Max 2023-12-31 |
Manipulate Data
To support richer analysis, I created several derived variables from the existing data. I calculated a driver experience metric as the number of days between each ride_date and the driver's join_date, giving a measure of driver tenure at the time of the ride.
I also computed fare efficiency as fare per kilometre, and estimated trip durations assuming an average urban speed of 30 km/h.
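All three derived variables come from a single mutate() call, as in the appendix:

```r
merged_df <- merged_df %>%
  mutate(
    # Days between joining the platform and the ride (tenure at ride time)
    driver_experience_days  = as.numeric(ride_date - join_date),
    # Fare efficiency in dollars per kilometre
    fare_per_km             = round(fare_amount / trip_distance, 2),
    # Duration estimate assuming an average urban speed of 30 km/h
    estimated_duration_mins = round(trip_distance / 30 * 60, 0)
  )
```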
Here's a sample of the enhanced dataset with the derived variables:
| ride_id | driver_id | driver_name | vehicle_type | driver_rating | trip_distance | fare_amount | payment_method | ride_date | join_date | driver_experience_days | fare_per_km | estimated_duration_mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 46 | M9325 | Premium | 4.92 | 6.35 | 16.94 | Cash | 2024-01-18 | 2023-08-17 | 154 | 2.67 | 13 |
| 2 | 36 | O3996 | SUV | NA | 18.42 | 36.87 | Card | 2024-02-08 | 2023-06-06 | 247 | 2.00 | 37 |
| 3 | 45 | E3108 | Sedan | 3.56 | 18.97 | 43.86 | Card | 2024-02-24 | 2023-05-18 | 282 | 2.31 | 38 |
| 4 | 49 | I8972 | Sedan | 4.34 | 15.69 | 39.18 | Cash | 2024-01-16 | 2023-09-23 | 115 | 2.50 | 31 |
| 5 | 15 | P1728 | Sedan | 3.79 | 9.06 | 20.09 | Cash | 2024-01-30 | 2023-05-07 | 268 | 2.22 | 18 |
| 6 | 47 | K2411 | SUV | 4.38 | 4.70 | 12.66 | Card | 2024-01-11 | 2023-10-28 | 75 | 2.69 | 9 |
Summary statistics for the key derived variables:
| Statistic | driver_experience_days | fare_per_km | estimated_duration_mins |
|---|---|---|---|
| Min. | 20.0 | 0.450 | 5.00 |
| 1st Qu. | 160.5 | 2.288 | 14.00 |
| Median | 210.0 | 2.420 | 23.00 |
| Mean | 199.8 | 2.688 | 23.83 |
| 3rd Qu. | 264.5 | 2.703 | 32.00 |
| Max. | 382.0 | 7.570 | 100.00 |
| NA's | - | 3 | - |

(The three NA values sit in fare_per_km, inherited from the three missing fares; the date-derived experience variable has no gaps.)
Scan I
I systematically examined the dataset for missing values, revealing gaps in both driver ratings and fare amounts. The fare-related gaps sit below 5% of observations; driver ratings are missing in 12% of rows, because a single NA rating at the driver level is repeated across every ride by that driver after the merge.
To maintain data integrity, I replaced missing numeric values with their respective column medians, a simple approach that is robust to outliers and preserves each variable's central tendency (at the cost of slightly compressing its variance).
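The appendix does this with three explicit ifelse() calls; the across() version below is an equivalent, more compact sketch of the same median imputation:

```r
# Replace NAs in the affected numeric columns with each column's median
impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)

merged_df_clean <- merged_df %>%
  mutate(across(c(driver_rating, fare_amount, fare_per_km), impute_median))
```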
Missing Values Analysis
| Column | Number of Missing Values | Percentage Missing |
|---|---|---|
| driver_rating | 9 | 12% |
| fare_amount | 3 | 4% |
| fare_per_km | 3 | 4% |
Rows with Missing Values:
| ride_id | driver_id | driver_name | vehicle_type | driver_rating | trip_distance | fare_amount | payment_method | ride_date | join_date | driver_experience_days | fare_per_km | estimated_duration_mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 36 | O3996 | SUV | NA | 18.42 | 36.87 | Card | 2024-02-08 | 2023-06-06 | 247 | 2.00 | 37 |
| 13 | 28 | S6114 | Sedan | NA | 50.00 | 22.72 | Card | 2024-02-16 | 2023-09-27 | 142 | 0.45 | 100 |
| 14 | 8 | Z6948 | Sedan | NA | 8.89 | 18.13 | Card | 2024-01-14 | 2023-12-07 | 38 | 2.04 | 18 |
| 17 | 9 | X1450 | SUV | 4.91 | 9.92 | NA | Card | 2024-01-29 | 2023-08-08 | 174 | NA | 20 |
| 19 | 8 | Z6948 | Sedan | NA | 19.55 | 45.63 | Cash | 2024-01-28 | 2023-12-07 | 52 | 2.33 | 39 |
| 22 | 8 | Z6948 | Sedan | NA | 19.65 | 42.91 | Cash | 2024-02-01 | 2023-12-07 | 56 | 2.18 | 39 |
| 37 | 24 | Q8500 | Sedan | 4.73 | 8.92 | NA | Cash | 2024-01-08 | 2023-05-08 | 245 | NA | 18 |
| 43 | 8 | Z6948 | Sedan | NA | 7.59 | 16.52 | Card | 2024-01-10 | 2023-12-07 | 34 | 2.18 | 15 |
| 50 | 8 | Z6948 | Sedan | NA | 14.73 | 36.20 | Digital Wallet | 2024-01-11 | 2023-12-07 | 35 | 2.46 | 29 |
| 58 | 36 | O3996 | SUV | NA | 17.26 | 39.95 | Digital Wallet | 2024-01-27 | 2023-06-06 | 235 | 2.31 | 35 |
| 60 | 28 | S6114 | Sedan | NA | 3.37 | 15.98 | Digital Wallet | 2024-02-08 | 2023-09-27 | 134 | 4.74 | 7 |
| 65 | 14 | W3657 | Sedan | 3.64 | 13.32 | NA | Digital Wallet | 2024-01-17 | 2023-04-21 | 271 | NA | 27 |
After Cleaning - Remaining Missing Values:
| Column | Missing Values |
|---|---|
| ride_id | 0 |
| driver_id | 0 |
| driver_name | 0 |
| vehicle_type | 0 |
| driver_rating | 0 |
| trip_distance | 0 |
| fare_amount | 0 |
| payment_method | 0 |
| ride_date | 0 |
| join_date | 0 |
| driver_experience_days | 0 |
| fare_per_km | 0 |
| estimated_duration_mins | 0 |
Scan II
I performed outlier detection using multiple methodologies, including the Interquartile Range (IQR) method and z-scores, and generated box plots for visual inspection of all numeric variables. I identified extreme values in fare amounts and trip distances, with some fares exceeding three standard deviations from the mean. These outliers were handled by capping values at three standard deviations from the mean, balancing data integrity against the influence of extreme values.
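Both detectors are only a few lines. The IQR counter below is the helper defined in the appendix; the z-score check follows the same pattern:

```r
# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
identify_outliers <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  sum(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, na.rm = TRUE)
}

# z-score method: count values more than 3 SDs from the mean
sum(abs(scale(merged_df_clean$fare_amount)) > 3, na.rm = TRUE)
```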
The boxplot visualisations reveal several important patterns in our ride-sharing data:
Driver Ratings (3.56-4.98):
- Most drivers maintain ratings between 4.0 and 4.7
- The median rating is approximately 4.3
- There are no significant low-rating outliers, suggesting consistent service quality
Trip Distance (2-50km):
- Most trips fall within 5-15km range
- There's a notable outlier at around 50km, representing unusually long journeys
- The distribution is right-skewed, typical for urban ride-sharing services
Fare Amount:
- The core fare range clusters between $20-40
- There's an extreme outlier at $150; notably, this is not the long-distance trip but a 19.8 km ride priced at $7.57 per km
- The distribution shows expected correlation with trip distance
Fare per Kilometer:
- Most fares cluster tightly around $2.5 per km
- Several high outliers between $4-7 per km suggest premium pricing scenarios
- The single low outlier at $0.45 per km is the 50 km trip, whose fare was not scaled up when the distance outlier was injected
Estimated Duration:
- Typical rides last between 15-30 minutes
- The outlier at 100 minutes corresponds to the long-distance journey
- The distribution aligns with expected urban travel patterns
These patterns informed our decision to cap values at three standard deviations from the mean, preserving the overall data structure while managing extreme values appropriately.
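The capping step is the appendix's across() call, refactored here into a named helper for readability; the behaviour is intended to be identical:

```r
# Winsorise: pull any value beyond 3 SDs back to mean +/- 3 SD
cap_at_3sd <- function(x) {
  z <- as.vector(scale(x))
  ifelse(abs(z) > 3,
         sign(z) * 3 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE),
         x)
}

merged_df_clean <- merged_df_clean %>%
  mutate(across(all_of(numeric_cols), cap_at_3sd))
```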
| Detection Method | Variable | Number of Outliers |
|---|---|---|
| IQR Method | driver_rating | 0 |
| IQR Method | trip_distance | 1 |
| IQR Method | fare_amount | 1 |
| IQR Method | fare_per_km | 12 |
| IQR Method | estimated_duration_mins | 1 |
| Z-Score > 3 | driver_rating | 0 |
| Z-Score > 3 | trip_distance | 1 |
| Z-Score > 3 | fare_amount | 1 |
| Z-Score > 3 | fare_per_km | 1 |
| Z-Score > 3 | estimated_duration_mins | 1 |
Rows with extreme fare_amount or trip_distance:
| ride_id | trip_distance | fare_amount | fare_per_km |
|---|---|---|---|
| 13 | 50.00 | 22.72 | 0.45 |
| 68 | 19.82 | 150.00 | 7.57 |
Transform
I applied log transformations to the fare-related and distance variables to address their right-skewed distributions, choosing log(x + 1) to handle potential zero values while improving normality. Comparing the original and transformed distributions showed improved symmetry and reduced skewness, particularly for the fare and distance variables.
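In code the transformation is one mutate(); base R's log1p() is a numerically stable equivalent of the log(x + 1) used in the appendix:

```r
merged_df_clean <- merged_df_clean %>%
  mutate(
    log_fare        = log1p(fare_amount),    # same as log(fare_amount + 1)
    log_distance    = log1p(trip_distance),
    log_fare_per_km = log1p(fare_per_km)
  )
```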
The histogram comparisons between original and log-transformed distributions reveal significant improvements in data normality:
Fare Distribution:
- Original: Shows right-skewed distribution with fares concentrated between $10-40, and sparse high-value outliers up to $80
- Log-transformed: Achieves more symmetric distribution centred around 3.5, making patterns more interpretable
- The transformation particularly helps in managing the long right tail of high-value fares
Distance Distribution:
- Original: Exhibits positive skew with most trips under 20km and rare long-distance outliers
- Log-transformed: Creates more balanced distribution centred around 2.5, suggesting log-normal trip patterns
- Better represents the natural distribution of urban journey distances
Fare per KM Distribution:
- Original: Highly concentrated around $2-3/km with several high-value outliers
- Log-transformed: Produces more normal distribution around 1.0-1.5
- Helps identify pricing anomalies more effectively
The log transformations successfully:
- Reduce the impact of extreme values without removing them
- Create more symmetrical distributions suitable for statistical analysis
- Maintain the underlying relationships in the data while improving interpretability
| Statistic | Original Fare ($) | Log Fare | Original Distance (km) | Log Distance |
|---|---|---|---|---|
| Min. | 8.90 | 2.293 | 2.500 | 1.253 |
| 1st Qu. | 17.71 | 2.929 | 6.965 | 2.075 |
| Median | 28.43 | 3.382 | 11.470 | 2.523 |
| Mean | 28.31 | 3.285 | 11.674 | 2.413 |
| 3rd Qu. | 37.13 | 3.641 | 16.025 | 2.835 |
| Max. | 82.77 | 4.428 | 33.008 | 3.527 |
Summary Statistics
I generated summary statistics to understand key business metrics across segments of the ride-sharing service: fares, distances, and ratings by vehicle type; ride counts and revenue by payment method; and overall totals for revenue, average fare, and driver performance.
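Each segment summary is a group_by()/summarise() pipeline; the vehicle-type version, condensed from the appendix:

```r
vehicle_summary <- merged_df_clean %>%
  group_by(vehicle_type) %>%
  summarise(
    total_rides   = n(),
    avg_fare      = mean(fare_amount, na.rm = TRUE),
    avg_distance  = mean(trip_distance, na.rm = TRUE),
    total_revenue = sum(fare_amount, na.rm = TRUE)
  ) %>%
  arrange(desc(total_rides))
```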
The visualisations highlight the distribution of fares across vehicle types and the relative frequency of different payment methods, offering clear insights into operational patterns.
The visualisations reveal several key insights about our ride-sharing service:
Fare Distribution by Vehicle Type:
- Premium vehicles show the greatest fare variability, SUV fares are more moderate, and Sedan fares are the most consistent
- Despite their luxury positioning, Premium vehicles record the lowest average fare ($24.90) in the summary table below, reflecting their shorter average trips (9.83 km)
- All vehicle types show some outliers for longer or premium-priced journeys
Payment Method Preferences:
- Digital Wallet is the most used method (34.7% of rides), though only by a narrow margin
- Card payments follow closely (33.3%), so roughly two-thirds of rides are cashless
- Cash, at 32.0%, remains significant, underlining the importance of maintaining payment flexibility
These patterns suggest distinct vehicle segments and a customer base spread almost evenly across payment options.
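The figures themselves are straightforward ggplot2 calls; this is a condensed version of the fare boxplot from the appendix:

```r
library(ggplot2)

ggplot(merged_df_clean, aes(x = vehicle_type, y = fare_amount)) +
  geom_boxplot(fill = "grey80") +
  theme_minimal() +
  labs(title = "Fare Distribution by Vehicle Type",
       x = "Vehicle Type", y = "Fare Amount ($)")
```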
Summary by Vehicle Type:
| Vehicle Type | Total Rides | Avg Fare ($) | Avg Distance (km) | Avg Rating | Total Revenue ($) |
|---|---|---|---|---|---|
| Sedan | 43 | 28.7 | 12.0 | 4.19 | 1,232.00 |
| SUV | 19 | 29.8 | 12.1 | 4.29 | 566.00 |
| Premium | 13 | 24.9 | 9.83 | 4.58 | 324.00 |
Summary by Payment Method:
| Payment Method | Total Rides | Avg Fare ($) | Total Revenue ($) | Percentage (%) |
|---|---|---|---|---|
| Digital Wallet | 26 | 28.9 | 750.00 | 34.7 |
| Card | 25 | 29.1 | 727.00 | 33.3 |
| Cash | 24 | 26.9 | 646.00 | 32.0 |
Overall Statistics:
| Metric | Value |
|---|---|
| Total Rides | 75 |
| Total Drivers | 35 |
| Total Revenue ($) | 2,123.03 |
| Average Fare ($) | 28.31 |
| Average Distance (km) | 11.67 |
| Average Rating | 4.28 |
References
CRAN (2024) The Comprehensive R Archive Network, viewed November 2024, https://cran.r-project.org/.
Grolemund, G & Wickham, H (2011) lubridate: Make Dealing with Dates a Little Easier, R package version 1.9.3, viewed November 2024, https://CRAN.R-project.org/package=lubridate.
R Core Team (2024) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, viewed November 2024, https://www.R-project.org/.
Wickham, H (2016) ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, viewed November 2024, https://ggplot2.tidyverse.org.
Wickham, H, François, R, Henry, L & Müller, K (2024) dplyr: A Grammar of Data Manipulation, R package version 1.1.4, viewed November 2024, https://CRAN.R-project.org/package=dplyr.
Wickham, H & Henry, L (2024) tidyr: Tidy Messy Data, R package version 1.3.0, viewed November 2024, https://CRAN.R-project.org/package=tidyr.
# Set working directory (author's local path)
setwd(paste0("C:/Users/kipjo/OneDrive/Documents/GitHub/",
             "rmit_gradcert_data_science/data_wrangling/assignment2"))
library(dplyr) # For my data manipulation
library(tidyr) # For my data tidying
library(lubridate) # For my date handling
library(ggplot2) # For my visualisations
library(readr) # For my data reading/writing
library(openxlsx) # For my Excel file writing
set.seed(3229642) # fix the RNG so the synthetic data is reproducible
n_drivers <- 50
drivers_df <- data.frame(
driver_id = 1:n_drivers,
driver_name = paste0(sample(LETTERS, n_drivers, replace=TRUE),
sample(1000:9999, n_drivers)),
vehicle_type = sample(c("Sedan", "SUV", "Premium"), n_drivers, replace=TRUE),
driver_rating = round(runif(n_drivers, 3.5, 5), 2),
join_date = as.Date("2023-01-01") + sample(0:365, n_drivers, replace=TRUE),
stringsAsFactors = FALSE
)
n_rides <- 75
# Fares are base rate + per-km charge + Gaussian noise, so fare_amount
# is correlated with trip_distance by construction
base_distance <- runif(n_rides, 2, 20)
base_rate <- 5
per_km_rate <- 2
fare_noise <- rnorm(n_rides, mean=0, sd=2)
fare_amount <- base_rate + (base_distance * per_km_rate) + fare_noise
rides_df <- data.frame(
ride_id = 1:n_rides,
driver_id = sample(1:n_drivers, n_rides, replace=TRUE),
trip_distance = round(base_distance, 2),
fare_amount = round(fare_amount, 2),
payment_method = sample(c("Card", "Cash", "Digital Wallet"), n_rides, replace=TRUE),
ride_date = as.Date("2024-01-01") + sample(0:60, n_rides, replace=TRUE)
)
# Inject missing values and outliers to mimic real-world data quality issues
drivers_df$driver_rating[sample(1:n_drivers, 3)] <- NA
rides_df$fare_amount[sample(1:n_rides, 3)] <- NA
drivers_df$driver_rating[sample(1:n_drivers, 1)] <- 1.5  # low-rating outlier
rides_df$fare_amount[sample(1:n_rides, 1)] <- 150        # extreme fare
rides_df$trip_distance[sample(1:n_rides, 1)] <- 50       # extreme distance
head(drivers_df)
head(rides_df)
dir.create("outputs", showWarnings = FALSE)
write.xlsx(drivers_df, "outputs/drivers_dataset.xlsx")
write.xlsx(rides_df, "outputs/rides_dataset.xlsx")
cat("Files written successfully:\n",
"- Drivers dataset: outputs/drivers_dataset.xlsx\n",
"- Rides dataset: outputs/rides_dataset.xlsx\n")
cat("Input dataset dimensions:\n",
"- Rides dataset:", dim(rides_df)[1], "rows by", dim(rides_df)[2], "columns\n",
"- Drivers dataset:", dim(drivers_df)[1], "rows by", dim(drivers_df)[2], "columns\n\n")
merged_df <- rides_df %>%
left_join(drivers_df, by = "driver_id")
merged_df <- merged_df %>%
select(ride_id, driver_id, driver_name, vehicle_type, driver_rating,
trip_distance, fare_amount, payment_method, ride_date, join_date)
cat("Output dataset dimensions:\n",
"- Merged dataset:", dim(merged_df)[1], "rows by", dim(merged_df)[2], "columns\n")
head(merged_df)
write.xlsx(merged_df, "outputs/merged_dataset.xlsx")
str(merged_df)
summary(merged_df)
cat("\nUnique values in vehicle_type:\n")
unique(merged_df$vehicle_type)
cat("\nUnique values in payment_method:\n")
unique(merged_df$payment_method)
merged_df <- merged_df %>%
mutate(
vehicle_type = as.factor(vehicle_type),
payment_method = as.factor(payment_method)
)
cat("\nDate ranges:\n")
cat("Ride dates:", min(merged_df$ride_date), "to", max(merged_df$ride_date), "\n")
cat("Join dates:", min(merged_df$join_date), "to", max(merged_df$join_date), "\n")
cat("\nUpdated structure after conversions:\n")
str(merged_df)
# Derived variables: tenure at ride time, fare efficiency, and a duration
# estimate assuming an average urban speed of 30 km/h
merged_df <- merged_df %>%
  mutate(
    driver_experience_days = as.numeric(ride_date - join_date),
    fare_per_km = round(fare_amount / trip_distance, 2),
    estimated_duration_mins = round(trip_distance / 30 * 60, 0)
  )
head(merged_df)
summary(merged_df[c("driver_experience_days", "fare_per_km", "estimated_duration_mins")])
missing_summary <- sapply(merged_df, function(x) sum(is.na(x)))
cat("Number of missing values in each column:\n")
print(missing_summary[missing_summary > 0])
missing_percentage <- sapply(merged_df, function(x) mean(is.na(x)) * 100)
cat("\nPercentage of missing values in each column:\n")
print(missing_percentage[missing_percentage > 0])
rows_with_na <- merged_df[!complete.cases(merged_df), ]
cat("\nRows containing missing values:\n")
print(rows_with_na)
merged_df_clean <- merged_df %>%
mutate(
driver_rating = ifelse(is.na(driver_rating),
median(driver_rating, na.rm = TRUE),
driver_rating),
fare_amount = ifelse(is.na(fare_amount),
median(fare_amount, na.rm = TRUE),
fare_amount),
fare_per_km = ifelse(is.na(fare_per_km),
median(fare_per_km, na.rm = TRUE),
fare_per_km)
)
cat("\nRemaining missing values after cleaning:\n")
print(sapply(merged_df_clean, function(x) sum(is.na(x))))
numeric_cols <- c("driver_rating", "trip_distance", "fare_amount",
"fare_per_km", "estimated_duration_mins")
identify_outliers <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
outliers <- x < lower_bound | x > upper_bound
return(sum(outliers, na.rm = TRUE))
}
cat("Number of outliers in each numeric column (IQR method):\n")
sapply(merged_df_clean[numeric_cols], identify_outliers)
par(mfrow = c(2, 3), mar = c(4, 4, 3, 1)) # Adjusted margins for larger titles
for(col in numeric_cols) {
boxplot(merged_df_clean[[col]],
main = paste("Distribution of", gsub("_", " ", col)),
ylab = col,
sub = "Boxplot for outlier detection",
col = "grey80") # Grey fill for boxes
}
par(mfrow = c(1, 1))
z_scores <- sapply(merged_df_clean[numeric_cols], scale)
colnames(z_scores) <- paste0(numeric_cols, "_zscore")
extreme_values <- apply(abs(z_scores) > 3, 2, sum, na.rm = TRUE)
cat("\nNumber of extreme values (|z| > 3) in each numeric column:\n")
print(extreme_values)
cat("\nRows with extreme fare_amount or trip_distance:\n")
merged_df_clean[abs(z_scores[, "fare_amount_zscore"]) > 3 |
abs(z_scores[, "trip_distance_zscore"]) > 3,
c("ride_id", "trip_distance", "fare_amount", "fare_per_km")]
# Winsorise: cap values beyond 3 SDs at mean +/- 3 SD, column by column
merged_df_clean <- merged_df_clean %>%
  mutate(across(all_of(numeric_cols),
                ~ ifelse(abs(scale(.)) > 3,
                         sign(scale(.)) * 3 * sd(., na.rm = TRUE) + mean(., na.rm = TRUE),
                         .)))
cat("\nOutliers after handling (IQR method):\n")
sapply(merged_df_clean[numeric_cols], identify_outliers)
# log(x + 1) transforms to reduce right skew; the +1 guards against zeros
merged_df_clean <- merged_df_clean %>%
  mutate(
    log_fare = log(fare_amount + 1),
    log_distance = log(trip_distance + 1),
    log_fare_per_km = log(fare_per_km + 1)
  )
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1)) # Adjust margins for better fit
hist(merged_df_clean$fare_amount,
main = "Original Fare Distribution",
xlab = "Fare Amount",
col = "grey80")
hist(merged_df_clean$trip_distance,
main = "Original Distance Distribution",
xlab = "Trip Distance",
col = "grey80")
hist(merged_df_clean$fare_per_km,
main = "Original Fare per KM Distribution",
xlab = "Fare per KM",
col = "grey80")
hist(merged_df_clean$log_fare,
main = "Log-Transformed Fare Distribution",
xlab = "Log Fare",
col = "grey80")
hist(merged_df_clean$log_distance,
main = "Log-Transformed Distance Distribution",
xlab = "Log Distance",
col = "grey80")
hist(merged_df_clean$log_fare_per_km,
main = "Log-Transformed Fare per KM Distribution",
xlab = "Log Fare per KM",
col = "grey80")
par(mfrow = c(1, 1))
summary_stats <- data.frame(
Original_Fare = summary(merged_df_clean$fare_amount),
Log_Fare = summary(merged_df_clean$log_fare),
Original_Distance = summary(merged_df_clean$trip_distance),
Log_Distance = summary(merged_df_clean$log_distance)
)
print(summary_stats)
vehicle_summary <- merged_df_clean %>%
group_by(vehicle_type) %>%
summarise(
total_rides = n(),
avg_fare = mean(fare_amount, na.rm = TRUE),
avg_distance = mean(trip_distance, na.rm = TRUE),
avg_rating = mean(driver_rating, na.rm = TRUE),
total_revenue = sum(fare_amount, na.rm = TRUE)
) %>%
arrange(desc(total_rides))
payment_summary <- merged_df_clean %>%
group_by(payment_method) %>%
summarise(
total_rides = n(),
avg_fare = mean(fare_amount, na.rm = TRUE),
total_revenue = sum(fare_amount, na.rm = TRUE),
percentage = n() / nrow(merged_df_clean) * 100
) %>%
arrange(desc(total_rides))
overall_stats <- list(
total_rides = nrow(merged_df_clean),
total_drivers = n_distinct(merged_df_clean$driver_id),
total_revenue = sum(merged_df_clean$fare_amount, na.rm = TRUE),
avg_fare = mean(merged_df_clean$fare_amount, na.rm = TRUE),
avg_distance = mean(merged_df_clean$trip_distance, na.rm = TRUE),
avg_rating = mean(merged_df_clean$driver_rating, na.rm = TRUE)
)
cat("Summary by Vehicle Type:\n")
print(vehicle_summary)
cat("\nSummary by Payment Method:\n")
print(payment_summary)
cat("\nOverall Statistics:\n")
print(data.frame(Metric = names(overall_stats),
Value = unlist(overall_stats)))
ggplot(merged_df_clean, aes(x = vehicle_type, y = fare_amount)) +
geom_boxplot(fill = "grey80") +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(title = "Fare Distribution by Vehicle Type",
subtitle = "Premium vehicles show higher median fares with greater variability",
x = "Vehicle Type",
y = "Fare Amount ($)")
ggplot(payment_summary, aes(x = payment_method, y = total_rides)) +
geom_bar(stat = "identity", fill = "grey60") +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(title = "Number of Rides by Payment Method",
subtitle = "Digital Wallet payments are most frequent, followed by Card and Cash",
x = "Payment Method",
y = "Total Rides")