A comprehensive real-time analytics platform for tracking Caltrain performance, featuring automated data collection, processing, and interactive visualizations.
Project Overview
The Caltrain Performance Tracker addresses a real need for transparent, accessible transit performance data in the San Francisco Bay Area. While Caltrain provides basic scheduling information, there was no comprehensive public platform for analyzing historical performance trends, delay patterns, and reliability metrics.
This platform collects real-time train location data every minute from the 511.org GTFS-RT API, processes it to estimate actual arrival times, and generates analytics on on-time performance, delay patterns, and service reliability.
Key Features
Real-time Data Collection
An automated system collects train position data every minute from the 511.org API and has stored over 200 MB of historical location and timing data.
Performance Analytics
Analysis of on-time performance and of delay patterns by time of day, day of week, and station, presented through interactive visualizations.
Web Dashboard
Clean, responsive web interface with interactive Plotly charts for exploring transit performance data and trends over time.
REST API
FastAPI-based REST API provides programmatic access to all performance data and metrics for integration with other applications.
Technical Architecture
The system is designed for reliability and scalability, with clear separation between data collection, processing, and presentation layers.
Data Layer
- PostgreSQL database
- SQLAlchemy ORM
- Alembic migrations
- GTFS static data
Backend Services
- FastAPI web framework
- Prefect workflow orchestration
- Automated data collection
- Real-time processing
Frontend & Deployment
- Responsive web interface
- Interactive Plotly charts
- Docker containerization
- Self-hosted infrastructure
Methodology
Data Collection
The system fetches real-time train location data from the GTFS-RT Vehicle Monitoring API every minute. Each data point includes the train ID, current position (latitude/longitude), destination station, and timestamp, and is stored in a PostgreSQL database for analysis.
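As a rough illustration of one collection cycle, the sketch below parses raw records into typed data points and hands them to a storage callback. The record keys (train_id, lat, lon, destination, timestamp), the TrainPosition class, and the fetch/store callables are all hypothetical stand-ins chosen for this example; they are not the actual 511.org response format or the project's database code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class TrainPosition:
    """One observed train position (field names are illustrative)."""
    train_id: str
    latitude: float
    longitude: float
    destination: str
    recorded_at: datetime


def parse_position(raw: dict) -> TrainPosition:
    """Convert a raw record into a TrainPosition.

    Assumes hypothetical keys and a Unix-epoch timestamp in seconds.
    """
    return TrainPosition(
        train_id=str(raw["train_id"]),
        latitude=float(raw["lat"]),
        longitude=float(raw["lon"]),
        destination=str(raw["destination"]),
        recorded_at=datetime.fromtimestamp(float(raw["timestamp"]), tz=timezone.utc),
    )


def collect_once(
    fetch: Callable[[], Iterable[dict]],
    store: Callable[[TrainPosition], None],
) -> int:
    """Run one collection cycle and return the number of records stored.

    Malformed records are skipped so one bad entry does not abort the cycle.
    """
    stored = 0
    for raw in fetch():
        try:
            position = parse_position(raw)
        except (KeyError, TypeError, ValueError):
            continue
        store(position)
        stored += 1
    return stored
```

In a deployment like the one described, a scheduler would call something like collect_once every minute, with fetch wrapping the API request and store wrapping the database insert.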
Arrival Detection
Since the raw data contains only train positions, the system estimates each arrival time as the timestamp at which the train is closest to its destination station, using the Haversine formula to compute the geographic distance between the two points.
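The minimum-distance approach can be sketched as follows. This is a simplified illustration rather than the project's actual code: the function names, the (timestamp, latitude, longitude) tuple layout, and the mean Earth radius of 6371 km are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def estimate_arrival(positions, station_lat, station_lon):
    """Return (timestamp, distance_km) for the observation closest to the station.

    positions is an iterable of (timestamp, lat, lon) tuples for one train.
    Returns None if there are no observations. When several observations tie
    (for example while the train dwells at the platform), the earliest wins.
    """
    best = None
    for timestamp, lat, lon in positions:
        distance = haversine_km(lat, lon, station_lat, station_lon)
        if best is None or distance < best[1]:
            best = (timestamp, distance)
    return best
```

With one-minute sampling, the estimate can be off by up to about a minute, and a practical version would probably also require the minimum distance to fall below some threshold before counting it as an arrival.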
Performance Metrics
On-time performance is calculated per train and station, with each arrival categorized as on-time (0-4 minutes late), a minor delay (5-14 minutes), or a major delay (15+ minutes). The system also tracks commute time patterns for morning (6-9 AM) and evening (3:30-7:30 PM) periods.
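The categories and commute windows above translate into a few small functions. This is a minimal sketch under stated assumptions: the function names are illustrative, early arrivals (negative delay) are treated as on-time, fractional delays are bucketed by the 5- and 15-minute thresholds, and each commute window includes its start time and excludes its end time.

```python
from datetime import datetime, time
from typing import Iterable, Optional


def classify_delay(delay_minutes: float) -> str:
    """Bucket a delay in minutes into on_time, minor_delay, or major_delay."""
    if delay_minutes < 5:
        return "on_time"  # includes early arrivals (assumption)
    if delay_minutes < 15:
        return "minor_delay"
    return "major_delay"


def commute_period(timestamp: datetime) -> Optional[str]:
    """Return 'morning' or 'evening' for commute windows, else None."""
    t = timestamp.time()
    if time(6, 0) <= t < time(9, 0):
        return "morning"
    if time(15, 30) <= t < time(19, 30):
        return "evening"
    return None


def on_time_rate(delays_minutes: Iterable[float]) -> Optional[float]:
    """Fraction of arrivals classified as on-time, or None if there are none."""
    delays = list(delays_minutes)
    if not delays:
        return None
    on_time = sum(1 for d in delays if classify_delay(d) == "on_time")
    return on_time / len(delays)
```

Grouping delays by train, station, or commute_period before calling on_time_rate would give the per-train and per-station figures described above.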
Live Demo Integration
The Caltrain tracker is self-hosted and integrated with this website. You can view live performance data and historical trends through the embedded dashboard below, or visit the full application for detailed analytics.
Impact & Insights
This project demonstrates practical application of data engineering principles to solve real-world transit challenges. Key insights from the data include:
- Peak hour delays are significantly higher during evening commute periods
- Weather conditions have measurable impact on system performance
- Certain stations consistently experience longer delays due to infrastructure constraints
- Weekend service reliability differs substantially from weekday patterns
The platform has been running continuously since early 2024, collecting valuable transit performance data that could inform infrastructure improvements and service optimization.