
This data engineering project involves processing and analyzing NYC parking violations data, which consists of approximately 50 million records. Here’s a brief overview of the workflow:
- Data Source:
    - The data originates from NYC OpenData, which publishes a large dataset of parking violations issued in New York City.
- Data Ingestion and Storage:
    - Apache Airflow orchestrates the data pipeline, automating the extraction of the parking violations data from the NYC OpenData portal (see the DAG sketch after this list).
    - The extracted data is stored in Amazon S3 (Simple Storage Service), which serves as the staging area for the raw files.
- Data Warehousing:
    - From Amazon S3, the data is loaded into Snowflake, a cloud-based data warehouse, where it is organized into a structured format suitable for analysis (see the COPY INTO sketch after this list).
- Data Transformation:
    - dbt (data build tool) transforms the raw data within Snowflake into refined, analysis-ready models.
- Data Visualization and Reporting:
    - Once the data is transformed in Snowflake, Tableau is used to build visualizations, dashboards, and reports, enabling stakeholders to derive insights from the parking violations data.
- Programming Languages and Tools:
    - Python and SQL are the primary languages: Python for scripting and automation, SQL for querying and managing the data within Snowflake.
    - The entire workflow runs in Docker containers, ensuring a consistent environment for all components of the project.
    - The project runs on Linux, providing a robust and scalable platform for the data pipeline.
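As a rough illustration of the extract-and-load step, here is a minimal Airflow task sketch that downloads one year's parking violations file from the NYC OpenData portal and stages it in S3. It is a sketch only: the DAG id, bucket name, and S3 key layout are assumptions, while the two Variable keys match the checklist further down.

```python
# Minimal sketch of the Extract & Load DAG (assumed names, not the exact project code).
import requests
from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from pendulum import datetime


@dag(dag_id="parking_violations_el", start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def parking_violations_el():

    @task
    def extract_to_s3() -> str:
        # URL and file name come from the Airflow Variables described in the checklist below.
        url = Variable.get("parking_violations_url")
        file_name = Variable.get("parking_violations_file_name")

        # Stream the yearly CSV from NYC OpenData to a local temp file.
        local_path = f"/tmp/{file_name}"
        with requests.get(url, stream=True, timeout=300) as resp:
            resp.raise_for_status()
            with open(local_path, "wb") as f:
                for chunk in resp.iter_content(chunk_size=8 * 1024 * 1024):
                    f.write(chunk)

        # Stage the raw file in S3; bucket name and key prefix are placeholders.
        s3_key = f"raw/{file_name}"
        S3Hook(aws_conn_id="aws_default").load_file(
            filename=local_path, key=s3_key, bucket_name="parking-violations-raw", replace=True
        )
        return s3_key

    extract_to_s3()


parking_violations_el()
```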
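Once the raw files are staged, the warehouse load is essentially a Snowflake `COPY INTO` from an external stage. Below is a minimal sketch using the Snowflake Airflow provider; the stage and target table names are placeholders, while the connection, warehouse, database, schema, and role come from the checklist below.

```python
# Sketch of the S3 -> Snowflake load step (assumed stage and table names); defined inside the EL DAG.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

copy_into_raw = SnowflakeOperator(
    task_id="copy_into_raw",
    snowflake_conn_id="snowflake_default",
    warehouse="PARKING_WAREHOUSE",
    database="PARKING_DB",
    schema="PARKING_SCHEMA",
    role="ACCOUNTADMIN",
    sql="""
        COPY INTO PARKING_DB.PARKING_SCHEMA.PARKING_VIOLATIONS_RAW
        FROM @PARKING_DB.PARKING_SCHEMA.PARKING_S3_STAGE/raw/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
        ON_ERROR = 'CONTINUE';
    """,
)
```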
After the Docker containers are up and Airflow is reachable:
Checklist
<aside>
💡 Only once, when Docker is first started:
Add Admin Variables.
Add Connections:
- [x] c (not required since it is already set in the env, but good practice)
    - AWS_ACCESS_KEY_ID: ############
    - AWS_SECRET_ACCESS_KEY: ############
- [x] snowflake_default
    - Schema: PARKING_SCHEMA
    - ###########
    - ###########
    - Warehouse: PARKING_WAREHOUSE
    - Database: PARKING_DB
    - Role: ACCOUNTADMIN
After every successful Extract & Load run, update the variables with the next year's info.
Add Variables (see the sketch below this checklist):
- parking_violations_file_name: value from the notebook for the year
- parking_violations_url: value from the notebook for the year
</aside>
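The two variables above can be set from the Airflow UI (Admin → Variables) or programmatically; here is a minimal sketch using the Airflow Python API, with placeholder values (the real ones come from the notebook for the target year):

```python
# Sketch: setting the per-year Variables programmatically (placeholder values).
from airflow.models import Variable

Variable.set("parking_violations_file_name", "<file name for the year>")
Variable.set("parking_violations_url", "<NYC OpenData download URL for the year>")
```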
Only the first time: for the Snowflake→AWS integration, the AWS_External_Id must be added to the IAM role's trust policy.
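The external ID Snowflake expects in that trust policy can be read off the storage integration. A minimal sketch with the Snowflake Python connector, assuming a hypothetical integration name and placeholder credentials:

```python
# Sketch: fetch the values needed for the IAM role's trust policy (assumed integration name).
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",  # placeholder credentials
    user="<user>",
    password="<password>",
    role="ACCOUNTADMIN",
)
try:
    cur = conn.cursor()
    cur.execute("DESC INTEGRATION PARKING_S3_INTEGRATION")  # assumed integration name
    for prop, prop_type, value, default in cur:
        if prop in ("STORAGE_AWS_IAM_USER_ARN", "STORAGE_AWS_EXTERNAL_ID"):
            # STORAGE_AWS_EXTERNAL_ID goes into the IAM role's trust policy condition.
            print(prop, "=", value)
finally:
    conn.close()
```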
After the EL pipeline has run successfully for all the years, run the dbt DAG (compile, run, test, docs generate).
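A minimal sketch of that dbt DAG, assuming the dbt project lives at a fixed path inside the Airflow container and chaining the four steps with BashOperator:

```python
# Sketch of the dbt DAG: compile -> run -> test -> docs generate (assumed project path).
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from pendulum import datetime

DBT_DIR = "/opt/airflow/dbt/parking_violations"  # assumed location inside the container


@dag(dag_id="parking_violations_dbt", start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def parking_violations_dbt():
    dbt_compile = BashOperator(task_id="dbt_compile", bash_command=f"cd {DBT_DIR} && dbt compile")
    dbt_run = BashOperator(task_id="dbt_run", bash_command=f"cd {DBT_DIR} && dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command=f"cd {DBT_DIR} && dbt test")
    dbt_docs = BashOperator(task_id="dbt_docs_generate", bash_command=f"cd {DBT_DIR} && dbt docs generate")
    dbt_compile >> dbt_run >> dbt_test >> dbt_docs


parking_violations_dbt()
```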
Some screenshots of the project:
Airflow DAGs (check dags)