- Project Description
- Technologies and Tools
- Data Understanding
- Business Problem and Goals
- Expected Benefits
- Conclusions
- Proposed Solutions
- Key Business Insights
- Model
- Results
- Contact
This project was developed as part of the Google Advanced Data Analytics Professional Certificate, applying concepts and techniques covered in the program.
The New York City Taxi and Limousine Commission (TLC) wanted to use its operational data to predict taxi fares accurately before a trip begins.
In collaboration with Automatidata, a regression-based predictive model was developed to estimate trip costs based on variables such as location, time, distance, service type, and other contextual factors.
This initiative promotes passenger transparency, urban transportation optimization, and data-driven decision-making.
- Python: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn, XGBoost
- Environments: Jupyter Notebook, Anaconda (terminal), Visual Studio Code
- Techniques: Machine learning algorithms (classification and regression), statistical analysis
The dataset, provided by the TLC, contains trip records from over 200,000 taxi and limousine licenses. It includes:
- Trip duration and distance
- Pickup and drop-off locations
- Fare amounts
- Payment methods
Exploratory data analysis identified relevant patterns and correlations, supporting the understanding of the main factors influencing fare amounts.
Note: This dataset was created for educational purposes and may not reflect the actual behavior of New York City taxi drivers.
Urban mobility in New York faces significant challenges, especially regarding fare pricing, where lack of predictability can lead to dissatisfaction and disputes between passengers and drivers.
Initial data analysis revealed anomalies, such as rides with a reported fare but zero distance. These inconsistencies reduce prediction accuracy and must be addressed either by adjusting the algorithm or removing such entries.
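One way to handle such entries is to flag and drop them before training. The snippet below is a minimal sketch on a toy DataFrame; the column names `trip_distance` and `fare_amount` are assumptions and may not match the actual dataset schema.

```python
import pandas as pd

# Toy data with hypothetical column names; the real schema may differ.
trips = pd.DataFrame({
    "trip_distance": [2.5, 0.0, 1.2, 0.0],
    "fare_amount": [12.0, 8.5, 7.0, 0.0],
})

# Flag rides that report a positive fare but no distance travelled.
is_anomaly = (trips["trip_distance"] == 0) & (trips["fare_amount"] > 0)

anomalies = trips[is_anomaly]                           # rows to inspect
clean_trips = trips[~is_anomaly].reset_index(drop=True)  # rows kept for modeling

print(f"Removed {len(anomalies)} anomalous ride(s); {len(clean_trips)} remain.")
```

Whether to drop these rows or treat them separately (for example, as cancelled rides with a minimum charge) is a modeling decision; this sketch shows only the removal option.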
This project aims to build a reliable predictive system that anticipates fares before a ride starts, providing clarity and trust for passengers, drivers, and regulators.
Urban mobility is a constant challenge in large cities such as New York, where taxi services play a crucial role in the daily transportation of passengers.
The New York City Taxi and Limousine Commission (TLC) collects vast amounts of trip data, including location, distance, time, and fare information.
However, a lack of transparency in fare pricing and frequent disputes between passengers and drivers remain recurring issues.
In this scenario, Automatidata, a consultancy specialized in transforming data into practical solutions, was hired to develop a regression-based predictive model capable of estimating trip fares even before the ride begins.
- Extract strategic insights from the data
- Estimate fares before the ride begins
- Identify passengers likely to tip over 20%
- Build a machine learning model for fare prediction
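The tipping goal above implies a binary target for a classifier. A minimal sketch of how such a label could be derived is shown below; the column names, the label name `generous`, and the choice of `fare_amount` as the base for the percentage are all assumptions, not details confirmed by the project.

```python
import pandas as pd

# Toy data with hypothetical column names.
trips = pd.DataFrame({
    "fare_amount": [10.0, 20.0, 15.0, 8.0],
    "tip_amount": [2.5, 2.0, 4.0, 0.0],
})

# Tip as a fraction of the fare (assumed base amount).
trips["tip_fraction"] = trips["tip_amount"] / trips["fare_amount"]

# 1 if the passenger tipped more than 20%, else 0.
trips["generous"] = (trips["tip_fraction"] > 0.20).astype(int)

print(trips[["tip_fraction", "generous"]])
```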
- Transparency and trust: More precise fare estimates to reduce disputes and improve relationships
- Regulatory efficiency: Align fares with actual demand, adjusting prices based on time, region, and seasonality
- Data-driven strategic planning: Support evidence-based public policies and business strategies
- Enhanced user experience: Fairer and more predictable trip costs for both passengers and drivers
- Scalability: Adaptable to other cities and transportation modes
The model demonstrated strong potential for accurate fare predictions, identifying key fare determinants through machine learning and statistical methods.
Next Steps:
- Refine the model with additional data
- Deploy in real-world scenarios
- Gather feedback from stakeholders
- Encourage credit card payments to potentially increase driver revenue
- Integrate the model into mobile applications for real-time fare estimation
- Conduct beta testing to refine predictions
- Expand dataset and engineer additional features
- Ride duration was the most influential factor in fare prediction
- Credit card payments are positively associated with higher revenues
- The model performed reliably in multiple scenarios
This project implemented and compared several regression models, including multiple linear regression and more advanced regression techniques.
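As an illustration of the multiple linear regression step, the sketch below fits scikit-learn's `LinearRegression` on synthetic data. The two features, the fare rule, and all numbers are invented for the example and are not taken from the project's notebooks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features (hypothetical): distance in miles, duration in minutes.
distance = rng.uniform(0.5, 10.0, size=50)
duration = rng.uniform(3.0, 45.0, size=50)
X = np.column_stack([distance, duration])

# Made-up, noise-free fare rule, used only to show that the fit recovers it.
y = 2.5 + 3.0 * distance + 0.5 * duration

model = LinearRegression().fit(X, y)

# Estimate the fare for a 2-mile, 10-minute trip.
predicted_fare = model.predict(np.array([[2.0, 10.0]]))[0]

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("predicted fare:", predicted_fare)
```

On real trip data the relationship is noisy, so the fitted coefficients would only approximate the underlying pricing rules.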
Evaluation Metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R²)
- Mean Absolute Error (MAE)
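The four metrics listed above can be computed with scikit-learn as sketched below. The true and predicted fares are toy values chosen for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy fares in dollars (illustrative only).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the fare
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE and MAE are expressed in fare units, which makes them easier to communicate to stakeholders than MSE.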
Additional Analysis:
- Hypothesis testing to evaluate variable significance
- Feature engineering to improve prediction accuracy
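The source does not say which hypothesis test was used. One plausible example, given the payment-method insight, is a two-sample Welch t-test comparing fares by payment type. The sketch below uses SciPy (not listed among the project's tools, so an assumption) and made-up fare samples.

```python
import numpy as np
from scipy import stats

# Made-up fare samples in dollars, grouped by payment method.
card_fares = np.array([14.0, 15.5, 13.0, 16.0, 15.0])
cash_fares = np.array([11.0, 12.5, 10.0, 12.0, 11.5])

# Welch's t-test: does not assume equal variances between the groups.
result = stats.ttest_ind(card_fares, cash_fares, equal_var=False)

alpha = 0.05  # conventional significance level, chosen for illustration
significant = result.pvalue < alpha

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}, significant: {significant}")
```

A significant difference in such a test shows association only; it would not by itself establish that paying by card causes higher fares.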
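Classification metrics for the Random Forest (RF) and XGBoost (XGB) models, on cross-validation (CV) and held-out test data:

| Model | Precision | Recall | F1-Score | Accuracy |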
|---|---|---|---|---|
| RF (CV) | 0.675 | 0.757 | 0.714 | 0.680 |
| RF (Test) | 0.675 | 0.779 | 0.723 | 0.687 |
| XGB (CV) | 0.673 | 0.724 | 0.698 | 0.670 |
| XGB (Test) | 0.676 | 0.748 | 0.710 | 0.678 |
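As a quick consistency check, the F1 scores in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The values below are copied from the table; the helper name `f1_from` is only illustrative.

```python
# (precision, recall, reported F1) copied from the results table.
reported = {
    "RF (CV)": (0.675, 0.757, 0.714),
    "RF (Test)": (0.675, 0.779, 0.723),
    "XGB (CV)": (0.673, 0.724, 0.698),
    "XGB (Test)": (0.676, 0.748, 0.710),
}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

computed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print(f"{name}: computed F1 = {computed[name]:.3f}, reported = {f1:.3f}")
```

The recomputed values agree with the table to three decimal places.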
📌 Visualizations, charts, and detailed analysis can be found in the Jupyter notebooks.
🔎 What does this mean in practice?
These results suggest that a system of this kind, deployed in a real-world scenario, could:
- Reduce fare disputes between passengers and drivers.
- Increase customer satisfaction with clear fare predictions before the ride.
- Support fairer public policies and dynamic pricing strategies.
LinkedIn: www.linkedin.com/in/lucaspimentabarretto
GitHub: https://github.com/LucasPBar
Gmail: lucaspimenta1805@gmail.com