Bike Ride Insights


Analyzed over 300K data points from Strava cycling sessions using Python and Tableau to map routes, track performance, and uncover seasonal insights.


Analyzing Historical Data


  • Extracted and aggregated 300K+ data points from GPX files including coordinates, timestamps, speed, elevation, heart rate, and distance metrics using Python and Pandas; ensured data integrity for analysis across all activities.
  • Designed and implemented interactive Tableau dashboards that track routes and performance metrics for 100+ cycling sessions and surface insights into seasonal riding habits.

    NYC - Cycling routes
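
As a rough illustration of the extraction step described above, the sketch below reads track points from a GPX 1.1 document using only the standard library. The function name `parse_gpx_points` and the output field names are assumptions made for this example, not the project's actual code. Heart rate is left out because it usually sits in exporter-specific extension elements. The resulting list of dicts could be handed to `pandas.DataFrame` for aggregation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace used by GPX 1.1 documents.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

SAMPLE_GPX = """
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="40.7128" lon="-74.0060">
      <ele>10.5</ele><time>2023-05-01T12:00:00Z</time>
    </trkpt>
    <trkpt lat="40.7130" lon="-74.0055">
      <ele>11.0</ele><time>2023-05-01T12:00:05Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>
"""


def parse_gpx_points(gpx_text):
    """Return one dict per <trkpt> with coordinates, elevation and timestamp."""
    root = ET.fromstring(gpx_text.strip())
    rows = []
    for point in root.iterfind(".//gpx:trkpt", GPX_NS):
        ele = point.findtext("gpx:ele", default=None, namespaces=GPX_NS)
        time = point.findtext("gpx:time", default=None, namespaces=GPX_NS)
        rows.append({
            "lat": float(point.get("lat")),
            "lon": float(point.get("lon")),
            # Elevation and time are optional in GPX, so missing values stay None.
            "elevation_m": float(ele) if ele is not None else None,
            "time": datetime.fromisoformat(time.replace("Z", "+00:00")) if time else None,
        })
    return rows


if __name__ == "__main__":
    for row in parse_gpx_points(SAMPLE_GPX):
        print(row)
```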


Currently Working On


Goal


The goal of this project is to implement a robust ELT pipeline. Things to consider:

  • version control
  • development flow
  • project file structure
  • unit testing
  • logging
  • documentation
  • virtual environments/dependency management
  • orchestration
  • general best practices for data engineering
  • containerization
  • supporting downstream analytics / reporting / ML

Overall pipeline / Data Flow


Strava API -> Python / Lambda -> S3 -> Redshift + dbt -> Web Application (Gradio) or Tableau Dashboard


Architecture Components


Data Ingestion Layer

  • Strava API as the data source
  • AWS Lambda function triggered weekly on a schedule by Airflow
  • Lambda extracts the data and stores it in the S3 raw zone
  • Lambda uses AWS Secrets Manager for Strava API credentials
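
The ingestion steps above could be sketched roughly as follows. This is a minimal outline under several assumptions: the event keys `secret_id` and `bucket`, the secret's `access_token` field, the `raw/strava/activities` key prefix, and all function names are hypothetical. It also assumes the stored token is still valid; a real setup would need to handle Strava's OAuth token refresh. Clients are injectable so the handler can be unit tested without AWS access.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

STRAVA_ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"


def raw_key(run_time, prefix="raw/strava/activities"):
    """Build a date-partitioned S3 key for one weekly extract."""
    return f"{prefix}/dt={run_time:%Y-%m-%d}/activities.jsonl"


def fetch_activities(access_token, after_epoch, opener=urllib.request.urlopen):
    """Page through the activities endpoint until an empty page comes back."""
    activities, page = [], 1
    while True:
        query = urllib.parse.urlencode(
            {"after": after_epoch, "per_page": 100, "page": page}
        )
        request = urllib.request.Request(
            f"{STRAVA_ACTIVITIES_URL}?{query}",
            headers={"Authorization": f"Bearer {access_token}"},
        )
        with opener(request) as response:
            batch = json.load(response)
        if not batch:
            return activities
        activities.extend(batch)
        page += 1


def handler(event, context, secrets=None, s3=None, fetch=fetch_activities, now=None):
    """Fetch the last seven days of activities and write them to the raw zone."""
    if secrets is None or s3 is None:
        import boto3  # only needed when real AWS clients are not injected

        secrets = secrets or boto3.client("secretsmanager")
        s3 = s3 or boto3.client("s3")
    now = now or datetime.now(timezone.utc)

    secret = json.loads(
        secrets.get_secret_value(SecretId=event["secret_id"])["SecretString"]
    )
    after = int((now - timedelta(days=7)).timestamp())
    activities = fetch(secret["access_token"], after)

    # One JSON object per line keeps the file easy to load downstream.
    body = "\n".join(json.dumps(activity) for activity in activities)
    key = raw_key(now)
    s3.put_object(Bucket=event["bucket"], Key=key, Body=body.encode("utf-8"))
    return {"key": key, "count": len(activities)}
```

Writing one file per run date under a `dt=` prefix is one possible layout; it keeps reruns for the same date idempotent because the object is simply overwritten.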

Storage Layer

  • S3 buckets: raw data for staging
  • Redshift Serverless for Data Warehousing
  • AWS Secrets Manager for Redshift Credentials
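
One way the staged files might reach the warehouse is a Redshift `COPY` submitted through the Redshift Data API, authenticating with the secret held in Secrets Manager. The sketch below is an assumption about how that could look, not the project's implementation: the helper names, the table name, and the choice of newline-delimited JSON with `JSON 'auto'` are all illustrative.

```python
import re

# Accepts "table" or "schema.table" made of plain identifiers only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a COPY statement that loads JSON objects from S3 into a table."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_uri.startswith("s3://") or "'" in s3_uri or "'" in iam_role_arn:
        raise ValueError("unexpected S3 URI or role ARN")
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS JSON 'auto' "
        "TIMEFORMAT 'auto';"
    )


def run_copy(client, workgroup, database, secret_arn, sql):
    """Submit the statement via a Redshift Data API client and return its id.

    `client` is expected to behave like boto3.client("redshift-data").
    """
    response = client.execute_statement(
        WorkgroupName=workgroup,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )
    return response["Id"]
```

The Data API call is asynchronous, so a caller would still need to poll the statement status before treating the load as complete.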

Orchestration & Transformation Layer

  • Apache Airflow on ECS
    • Amazon RDS for PostgreSQL database as a metadata store
    • Amazon ElastiCache for Redis as a Celery backend
    • Docker images for Airflow Scheduler, Worker and Web Server are stored in Amazon ECR
  • dbt Core
    • Runs data transformations in Redshift
    • Manages data models and documentation
    • Version controlled transformations
    • Incremental data processing with Macros
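
The idea behind incremental processing can be illustrated in plain Python: only rows newer than the latest row already loaded are processed on each run. In dbt this filter would normally live in the model's SQL (typically guarded by `is_incremental()` and compared against `{{ this }}`), so the snippet below is only a conceptual sketch. The function name and the `start_date` cursor field are assumptions.

```python
from datetime import datetime


def select_new_rows(staged_rows, loaded_rows, cursor_field="start_date"):
    """Return staged rows that are newer than the latest row already loaded."""
    if not loaded_rows:
        # First run: nothing has been loaded yet, so take everything.
        return list(staged_rows)
    watermark = max(row[cursor_field] for row in loaded_rows)
    return [row for row in staged_rows if row[cursor_field] > watermark]


if __name__ == "__main__":
    loaded = [{"id": 1, "start_date": datetime(2024, 1, 1)}]
    staged = [
        {"id": 1, "start_date": datetime(2024, 1, 1)},
        {"id": 2, "start_date": datetime(2024, 1, 8)},
    ]
    print(select_new_rows(staged, loaded))
```

A strict `>` comparison skips rows that share the watermark timestamp, which avoids reloading the last ride but would miss a second ride recorded with exactly the same start time.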

Application Layer

  • Gradio Web Application on AWS Fargate (ECS)
    • Containerized Web Application
    • Connects to Redshift for data queries
  • Alternatively, dashboards could be built in Amazon QuickSight or Tableau
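
A minimal sketch of what the Gradio application might look like is given below. It assumes a hypothetical `monthly_ride_summary` table produced by the transformation layer and a `connect` callable that returns a DB-API 2.0 connection (for example from a Redshift driver); neither is confirmed by the project. Gradio is imported lazily so the query helper can be tested without the UI dependency.

```python
SUMMARY_SQL = (
    "SELECT ride_month, ride_count, total_distance_km "
    "FROM monthly_ride_summary ORDER BY ride_month"
)


def fetch_monthly_summary(connection):
    """Run the summary query on a DB-API 2.0 connection and return rows as lists."""
    cursor = connection.cursor()
    try:
        cursor.execute(SUMMARY_SQL)
        return [list(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


def build_app(connect):
    """Wrap the query in a Gradio interface; `connect` returns a new connection."""
    import gradio as gr  # imported here so the helper above has no UI dependency

    def load():
        connection = connect()
        try:
            return fetch_monthly_summary(connection)
        finally:
            connection.close()

    return gr.Interface(
        fn=load,
        inputs=None,
        outputs="dataframe",
        title="Bike Ride Insights",
    )
```

Inside a container the app would typically be started with something like `build_app(connect).launch(server_name="0.0.0.0", server_port=7860)` so that the load balancer can reach it.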

Networking & Security

  • VPC with private and public subnets
  • Security groups for service isolation
  • AWS IAM roles for service permissions
  • Application Load Balancer in public subnet
  • Services running in private subnets
  • Infra managed with AWS CDK

Logs & Monitoring

  • Amazon CloudWatch for logs and metrics
  • CloudWatch Container Insights for Fargate monitoring