Analyzed over 300K data points from Strava cycling sessions using Python and Tableau to map routes, track performance, and uncover seasonal insights.
Analyzing Historical Data
- Extracted and aggregated 300K+ data points from GPX files including coordinates, timestamps, speed, elevation, heart rate, and distance metrics using Python and Pandas; ensured data integrity for analysis across all activities.
- Designed and implemented interactive Tableau dashboards that track routes and performance metrics for 100+ cycling sessions and reveal seasonal riding habits.
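The GPX extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It uses only the standard library and assumes GPX 1.1 files with heart rate stored in Garmin's TrackPointExtension. The function name `parse_gpx`, the output field names, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces: the GPX 1.1 core schema and Garmin's TrackPointExtension,
# which is where heart rate is assumed to live in these files.
NS = {
    "gpx": "http://www.topografix.com/GPX/1/1",
    "tpx": "http://www.garmin.com/xmlschemas/TrackPointExtension/v1",
}

# Hypothetical two-point sample; the second point has no elevation or heart rate.
SAMPLE_GPX = """<gpx xmlns="http://www.topografix.com/GPX/1/1"
     xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
     version="1.1" creator="example">
  <trk><trkseg>
    <trkpt lat="51.5000" lon="-0.1200">
      <ele>12.5</ele>
      <time>2024-05-01T07:00:00Z</time>
      <extensions>
        <gpxtpx:TrackPointExtension><gpxtpx:hr>132</gpxtpx:hr></gpxtpx:TrackPointExtension>
      </extensions>
    </trkpt>
    <trkpt lat="51.5010" lon="-0.1210">
      <time>2024-05-01T07:00:05Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>"""


def parse_gpx(xml_text):
    """Return one dict per track point; fields absent from the file become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for pt in root.iterfind(".//gpx:trkpt", NS):
        ele = pt.findtext("gpx:ele", default=None, namespaces=NS)
        hr = pt.findtext(".//tpx:hr", default=None, namespaces=NS)
        rows.append({
            "lat": float(pt.get("lat")),
            "lon": float(pt.get("lon")),
            "elevation_m": float(ele) if ele is not None else None,
            "time": pt.findtext("gpx:time", default=None, namespaces=NS),
            "heart_rate": int(hr) if hr is not None else None,
        })
    return rows


if __name__ == "__main__":
    for row in parse_gpx(SAMPLE_GPX):
        print(row)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` for aggregation across activities.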
Currently Working On
Goal
The goal of this project is to implement a robust ELT pipeline. Things to consider:
- version control
- development flow
- project file structure
- unit testing
- logging
- documentation
- virtual environments/dependency management
- orchestration
- general best practices for data engineering
- containerization
- supporting downstream analytics / reporting / ML
Overall pipeline / Data Flow
Strava API -> Python / Lambda -> S3 -> Redshift + DBT -> Web Application (Gradio) or Tableau Dashboard
Architecture Components
Data Ingestion Layer
- Strava API as the data source
- AWS Lambda function triggered weekly by an Airflow schedule
- Lambda extracts the data and stores it in the S3 raw zone
- Lambda uses AWS Secrets Manager for Strava API credentials
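A rough sketch of the ingestion logic is shown below, under several assumptions. The S3 key layout, the function names (`raw_key`, `ingest`, `make_strava_fetcher`), and the page size are hypothetical. The HTTP call and the S3 write are passed in as functions so the paging logic can be exercised without network access or AWS credentials. In the real Lambda, `put_object` would likely wrap the boto3 S3 client's `put_object(Bucket=..., Key=..., Body=...)`, and the access token would come from Secrets Manager.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def raw_key(run_time, page):
    """Hypothetical raw-zone key layout, partitioned by extraction date."""
    return f"raw/strava/activities/dt={run_time:%Y-%m-%d}/page_{page:04d}.json"


def make_strava_fetcher(access_token):
    """Build a fetch_page(page, per_page) function for the athlete activities endpoint."""
    def fetch_page(page, per_page):
        query = urllib.parse.urlencode({"page": page, "per_page": per_page})
        req = urllib.request.Request(
            f"https://www.strava.com/api/v3/athlete/activities?{query}",
            headers={"Authorization": f"Bearer {access_token}"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)
    return fetch_page


def ingest(fetch_page, put_object, run_time, per_page=100):
    """Page through activities until an empty page comes back.

    Each non-empty page is stored unmodified (as JSON bytes) under its own key,
    so transformation is deferred to the warehouse, in keeping with ELT.
    Returns the list of keys written.
    """
    keys = []
    page = 1
    while True:
        activities = fetch_page(page, per_page)
        if not activities:
            break
        key = raw_key(run_time, page)
        put_object(key, json.dumps(activities).encode("utf-8"))
        keys.append(key)
        page += 1
    return keys


if __name__ == "__main__":
    # Dry run with fake pages and an in-memory "bucket" instead of Strava and S3.
    fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    bucket = {}
    written = ingest(
        lambda page, per_page: fake_pages.get(page, []),
        bucket.__setitem__,
        datetime.now(timezone.utc),
    )
    print(written)
```

Injecting the fetch and store functions is a design choice for testability. The same `ingest` function can be unit tested with fakes and then wired to the real clients inside the Lambda handler.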
Storage Layer
- S3 buckets: raw data for staging
- Redshift Serverless for Data Warehousing
- AWS Secrets Manager for Redshift Credentials
Orchestration & Transformation Layer
- Apache Airflow on ECS
  - Amazon RDS for PostgreSQL database as a metadata store
  - Amazon ElastiCache for Redis as a Celery backend
  - Docker images for Airflow Scheduler, Worker and Web Server are stored in Amazon ECR
- DBT Core
  - Runs data transformations in Redshift
  - Manages data models and documentation
  - Version controlled transformations
  - Incremental data processing with macros
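In dbt itself, incremental processing is written in SQL and Jinja. A common pattern is to guard a filter with the `is_incremental()` macro so that only rows newer than the latest one already in the target are processed. The sketch below is not dbt code. It illustrates the same high-watermark idea in plain Python, with hypothetical column names (`id`, `updated_at`) and a hypothetical function name.

```python
def incremental_merge(target, source, key="id", cursor="updated_at"):
    """Upsert into target only the source rows newer than target's high watermark.

    target, source: lists of dicts. Rows are matched on `key`; `cursor` must be
    comparable (for example ISO-8601 date strings). Returns a new list.
    """
    # The watermark is the newest cursor value already loaded; None on first run.
    watermark = max((row[cursor] for row in target), default=None)

    # On the first run (empty target) every source row counts as new.
    new_rows = [r for r in source if watermark is None or r[cursor] > watermark]

    # Upsert: existing keys are overwritten, unseen keys are appended.
    merged = {row[key]: row for row in target}
    for row in new_rows:
        merged[row[key]] = row
    return list(merged.values())


if __name__ == "__main__":
    target = [
        {"id": 1, "updated_at": "2024-05-01"},
        {"id": 2, "updated_at": "2024-05-03"},
    ]
    source = [
        {"id": 1, "updated_at": "2024-05-01"},  # unchanged, skipped
        {"id": 2, "updated_at": "2024-05-04"},  # updated, replaces the old row
        {"id": 3, "updated_at": "2024-05-05"},  # new, appended
    ]
    print(incremental_merge(target, source))
```

One limitation of a strict "greater than watermark" filter is that rows arriving late with an older cursor value are missed. Whether that matters depends on how the source reports updates.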
Application Layer
- Gradio Web Application on AWS Fargate (ECS)
  - Containerized Web Application
  - Connects to Redshift for data queries
- Alternatively, dashboards can be built in Amazon QuickSight or Tableau
Networking & Security
- VPC with private and public subnets
- Security groups for service isolation
- AWS IAM roles for service permissions
- Application Load Balancer in public subnet
- Services running in private subnets
- Infrastructure managed with AWS CDK
Logs & Monitoring
- Amazon CloudWatch for logs and metrics
- Amazon CloudWatch Container Insights for Fargate monitoring