Automate end-to-end data workflow, integrating Python, PySpark, SparkML, Airflow, and BigQuery to streamline the analysis of NYC restaurant data using a dynamic Streamlit dashboard to deliver actionable insights, visualizing restaurant performance, violation trends, and geospatial patterns across boroughs, cuisines, and seasons.
Tech Stack: Python, Spark, SQL, GCP - GCS, BigQuery, Dataproc, Composer, Cloud Functions.
Analyzing NYC Restaurant Inspections: Patterns and Predictions
Why?
Open portal that can foster collaboration between inspection officers, restaurant owners, regulators to improve diner’s experience and ultimately protect the public health.
Objectives
- Understand how restaurant violation results have changed over the years and across seasons.
- Violation codes and their relationship with factors like cuisine type, borough, seasons.
- Discover patterns across a restaurant chains.
Dataset
DOHMH New York City Restaurant Inspection Results:
- Inspection results for restaurants across NYC
- Source: NYC Open Data
- ~210K rows, 27 columns
- API endpoint available
Keys points to note:
- Conditional temporal analysis is allowed.
- Violation codes, scores & grades(may) are assigned after inspection.
- API record limit of 50K per request (Requires Pagination)
Architecture
Google Cloud Platform
Airflow DAGs
Analysis
- Geospatial analysis of the most recent grade given to a restaurant.
- Find the most recent inspection results for each restaurant based and filter using instructions defined by NYC Open Data (such as score value, type of instruction and grade allowed).
- Using this ‘most_recent_letter_grade’ dataframe build a map using ‘folium’
-
‘Most_recent_letter_grade’ is used as the reference dataframe for all future analysis.
- Restaurant chain analysis
- Chain-Specific Data Collection
- Compiling data for each restaurant chain (DBA) to analyze inspection patterns across multiple locations.
-
Violation Code Frequency per Chain Assessing which violation codes were inspected and how frequently they occurred across various locations within each restaurant chain
- Ranking top 5 violation codes across different cuisines
- Quantifying occurrences for each specific violation code across different cuisine types, providing insights into the frequency of particular violations.
-
Ranking violations for cach cuisine to identify most common health code infractions for cuisine type
- Tracking trends across years for violation code
- Grouping data by violation code and inspection year
-
Calculating the total number of occurrences for each violation code in each year
- Seasonal trends across violations
- Categorizing months to seasons - Winter, Spring, Summer, and Autumn.
- Categorization is applied to through a User Defined Function (UDF), adding a “SEASON” column to the dataset, which maps each inspection month to its corresponding season.
-
Finally group by season for the entered violation code.
- Identifying top five violations across Boroughs
- Grouping by violation code and borough.
-
Window function to rank the top five most common violations partitioned by borough.
Future enhancements
- Build fault tolerance into the pipeline.
- Consider using different file formats (Parquet?) for archiving data.
- Continued scrapping will allow us to build a unified historical data warehouse.
- Address stop words within DBA and handle cuisine misclassification in source data such as ‘Pizza’ instead of ‘Italian’.
Contributors
- Amey Kolhe - apk9563@nyu.edu
- Mihir Chhatre - mc9164@nyu.edu
- Nachiket Khare - nk3559@nyu.edu