Technologies Used in the Project:
- HTTP Source (Git): To fetch data from the source repository.
- Azure Data Factory (ADF): For data movement and pipeline orchestration.
- Microsoft Azure: As the foundational cloud platform.
- Databricks & PySpark: For data transformation and advanced processing.
- Azure Data Lake Storage (ADLS): To store raw and transformed data.
- Azure Synapse Analytics: To build and manage a robust data warehouse.
- Power BI: For creating interactive dashboards and visualisations.
Other Key Concepts & Learnings:
- Databricks File System (DBFS): Databricks' file system abstraction over cloud object storage.
- Databricks Utilities: To streamline operations and manage resources.
- Delta Tables: CRUD operations and internals.
- Delta Table Optimization: Applied Z-Ordering and VACUUM to improve query performance and remove stale data files.
- Versioning & Time Travel: Explored historical data states for insights and debugging.
- Incremental Loading with Auto Loader: Ingested both streaming and batch data incrementally.
- Workflow Design: Designed scalable workflows for job orchestration.
- Databricks Jobs: Scheduled and managed automated jobs.
- Unity Catalog: Centralized governance and access control for data assets.
- Azure Fundamentals: Resource groups, storage accounts, containers, Microsoft Entra ID (service principals), IAM roles, managed identities, compute creation, Hive managed and external table creation, and dynamic file loading using iteration and loops in ADF.
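The Delta table concepts listed above (CRUD, optimization, vacuuming, time travel) can be sketched in Spark SQL on Databricks. This is a minimal illustration only: the table and column names (`sales_delta`, `customer_id`, and so on) are hypothetical placeholders, not the project's actual schema.

```sql
-- Create a Delta table (names here are illustrative).
CREATE TABLE sales_delta (
  order_id INT,
  customer_id INT,
  amount DOUBLE,
  order_date DATE
) USING DELTA;

-- CRUD: unlike plain Parquet, Delta supports in-place UPDATE and DELETE.
UPDATE sales_delta SET amount = amount * 1.1 WHERE order_date < '2020-01-01';
DELETE FROM sales_delta WHERE amount IS NULL;

-- Optimization: compact small files and co-locate rows by a common filter column.
OPTIMIZE sales_delta ZORDER BY (customer_id);

-- Remove stale data files older than the retention window (168 hours = 7 days, the default).
VACUUM sales_delta RETAIN 168 HOURS;

-- Versioning and time travel: query an earlier snapshot and inspect table history.
SELECT * FROM sales_delta VERSION AS OF 3;
DESCRIBE HISTORY sales_delta;
```

Z-Ordering helps only on columns that queries actually filter by, and VACUUM limits how far back time travel can reach, so the retention period is a trade-off between storage cost and recoverability.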
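Incremental loading with Auto Loader can be sketched as below. This assumes a Databricks runtime (the `cloudFiles` source and the notebook-provided `spark` session are Databricks-specific); the ADLS paths, storage account placeholder, and target table name are hypothetical, not the project's real locations.

```python
# Auto Loader sketch: paths, checkpoint location, and table name are placeholders.
raw_path = "abfss://raw@<storage_account>.dfs.core.windows.net/sales/"
checkpoint = "abfss://meta@<storage_account>.dfs.core.windows.net/_checkpoints/sales/"

stream = (
    spark.readStream
         .format("cloudFiles")                             # Auto Loader source
         .option("cloudFiles.format", "csv")               # format of incoming files
         .option("cloudFiles.schemaLocation", checkpoint)  # where inferred schema is tracked
         .load(raw_path)
)

(
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", checkpoint)  # progress tracking for exactly-once ingestion
          .trigger(availableNow=True)                # process all pending files, then stop (batch-style run)
          .toTable("bronze.sales")
)
```

The checkpoint is what makes the load incremental: on each run, Auto Loader picks up only files it has not already ingested, which covers both the streaming and the scheduled-batch patterns mentioned above.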
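The dynamic file loading in ADF mentioned above typically uses a ForEach activity iterating over a parameter or lookup result. A trimmed JSON sketch of such an activity follows; the activity names, parameter name, and dataset references are hypothetical, and dataset definitions are omitted.

```json
{
  "name": "ForEachSourceFile",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.fileList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyFileToRaw",
        "type": "Copy",
        "inputs": [ { "referenceName": "HttpSourceFile", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "RawLakeFile", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": {
            "type": "BinarySource",
            "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
          },
          "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
          }
        }
      }
    ]
  }
}
```

Each iteration receives one element of `fileList` via `@item()`, which the datasets can use to parameterize the source URL and sink path, so one pipeline handles any number of files.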
Original Dataset Link: https://www.kaggle.com/datasets/ukveteran/adventure-works