The ADF Universal Framework is an open-source project designed to provide a comprehensive and flexible solution for building scalable and efficient data integration workflows using Azure Data Factory (ADF).
Whether you are dealing with data ingestion, transformation, or loading, this framework aims to streamline your ETL processes and empower data engineers and developers with a set of powerful capabilities.
It integrates various solutions, optimized and tuned for the best outcome. We appreciate the contributions from the open-source community.
This project primarily encompasses the following aspects:
- ADF Universal Orchestrator Framework
- ADF Universal Task Solution
- CI/CD Solution For ADF Universal Solution
- DataOps For The Modern Data Warehouse
The solution uses these components:
Component | Link |
---|---|
Azure Data Factory (ADF) | Azure Data Factory |
Azure Databricks | Azure Databricks |
Azure Data Lake Storage (ADLS) | Azure Data Lake Storage |
Azure Synapse Analytics | Azure Synapse Analytics |
Azure Key Vault | Azure Key Vault |
Azure DevOps | Azure DevOps |
Power BI | Power BI |
Azure SQL Database | Azure SQL Database |
Microsoft Purview | Microsoft Purview |
Self-Hosted IR | Self-Hosted IR |
Self-Hosted Agent | Self-Hosted Agent |
To get started with the ADF Universal Framework, please refer to the documentation for detailed instructions, examples, and best practices.
The ADF master framework is the main portal that controls the workflow and dependencies for all task pipelines.
- Metadata Management:
  - Offer metadata storage and management to trace the sources, processing, and destinations of data.
  - Support data lineage and impact analysis to help understand and manage data workflows.
- Task Scheduling and Execution:
  - Feature a robust task scheduling engine capable of executing data flow tasks according to a defined schedule.
  - Provide monitoring and logging capabilities to track task execution status and performance (see the sketch after this list).
- Parameterization and Configuration:
  - Allow parameterization of tasks and data flows to enhance reusability and flexibility.
  - Provide configuration options for dynamic adjustments based on environment and requirements.
- Error Handling and Fault Tolerance:
  - Have a robust error-handling mechanism to capture and manage errors occurring in data flows.
  - Support fault tolerance mechanisms, allowing for task retries and recovery after failures.
- Security and Authentication:
  - Integrate authentication and authorization mechanisms to ensure data security.
  - Support encryption, access control, and protection of sensitive information.
- Monitoring and Alerting:
  - Provide real-time monitoring and alerting capabilities to track task performance and runtime status.
  - Integrate logging and auditing features to assist in issue troubleshooting and compliance requirements.
- Scalability and Customization:
  - Demonstrate good scalability, integrating with third-party tools and services.
  - Provide custom activity and plugin mechanisms to adapt to diverse business requirements.
- Version Control and Collaboration:
  - Support version control for managing and tracking changes in data workflows.
  - Provide collaboration and team development features to facilitate collaborative work among multiple team members.
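As a concrete illustration of the scheduling and monitoring capabilities above, here is a minimal sketch of triggering a task pipeline and polling its run status with the Azure SDK for Python (azure-identity and azure-mgmt-datafactory). The subscription, resource group, factory, pipeline, and parameter names are placeholders, not part of the framework.

```python
# Minimal sketch: trigger a task pipeline run and poll its status.
# All resource names below are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-platform"        # placeholder
FACTORY_NAME = "adf-orchestrator-dev"      # placeholder
PIPELINE_NAME = "pl_task_ingest_sales"     # placeholder task pipeline

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off the task pipeline with runtime parameters (e.g. taken from the control table).
run = client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME,
    parameters={"SourceSystem": "sales", "LoadDate": "2024-06-30"},
)

# Poll the run until it reaches a terminal state, logging the status each time.
while True:
    pipeline_run = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
    print(f"{PIPELINE_NAME} run {run.run_id}: {pipeline_run.status}")
    if pipeline_run.status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
```

Within the framework the orchestrator pipeline itself typically drives this trigger-and-poll pattern through its own activities; the SDK form above is shown only because it is easy to read and test standalone.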
Back to Top ⬆
The ADF task framework aims to provide common, reusable pipelines that developers can use easily by configuring metadata.
These pipelines should support different kinds of ingestion and data processing:
- Data Connection and Source/Destination Adapters:
  - Ability to connect to various data stores and source systems, including relational databases, NoSQL databases, and cloud storage.
  - Provide a wide range of data source and destination adapters to support different data formats and protocols.
- Data Flow Processing:
  - Support data transformation, cleansing, and processing to meet business requirements.
  - Offer a rich set of data processing activities such as data splitting, merging, aggregation, filtering, and more.
  - Support multiple compute engines, such as Azure Synapse and Azure Databricks.
- Parameterization and Configuration:
  - Allow parameterization of tasks and data flows to enhance reusability and flexibility (see the control-table sketch after this list).
  - Provide configuration options for dynamic adjustments based on environment and requirements.
- Metadata Management:
  - Offer metadata storage and management to trace the sources, processing, and destinations of data.
  - Support data lineage and impact analysis to help understand and manage data workflows.
- Version Control and Collaboration:
  - Support version control for managing and tracking changes in data workflows.
  - Provide collaboration and team development features to facilitate collaborative work among multiple team members.
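To make the metadata-driven idea concrete, here is an illustrative sketch of a single control-table entry and how it might be mapped onto parameters of the common task pipeline. The column names, parameter names, and helper function are assumptions for illustration, not the framework's actual schema.

```python
# Illustrative only: one hypothetical control-table entry and how it could be
# translated into parameters for a generic, metadata-driven ingestion pipeline.
control_table_entry = {
    "TaskId": 101,
    "SourceType": "SqlServer",            # e.g. SqlServer, Oracle, REST, ADLS
    "SourceObject": "dbo.SalesOrders",
    "SinkType": "ADLS",
    "SinkPath": "raw/sales/salesorders/",
    "WatermarkColumn": "ModifiedDate",    # enables incremental loads
    "IsActive": True,
}

def to_pipeline_parameters(entry: dict) -> dict:
    """Translate a control-table row into parameters for the common task pipeline."""
    return {
        "pSourceType": entry["SourceType"],
        "pSourceObject": entry["SourceObject"],
        "pSinkPath": entry["SinkPath"],
        "pWatermarkColumn": entry["WatermarkColumn"],
    }

print(to_pipeline_parameters(control_table_entry))
```

The point of the pattern is that adding a new ingestion task means inserting a row into the control table rather than authoring a new pipeline.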
- A development data factory is created and configured with Azure Repos Git. All developers should have permission to author Data Factory resources like pipelines and datasets.
- A developer creates a feature branch to make a change. They debug their pipeline runs with their most recent changes.
- After a developer is satisfied with their changes, they create a pull request from their feature branch to the main or collaboration branch to get their changes reviewed by peers.
- After a pull request is approved and changes are merged in the main branch, the changes get published to the development factory.
- When the team is ready to deploy the changes to a test or UAT (User Acceptance Testing) factory, the team goes to their Azure Pipelines release and deploys the desired version of the development factory to UAT.
  This deployment takes place as part of an Azure Pipelines task and uses Resource Manager template parameters to apply the appropriate configuration.
- After the changes have been verified in the test factory, deploy to the production factory by using the next task of the pipelines release.
Note:
Only the development factory is associated with a git repository.
The test and production factories shouldn't have a git repository associated with them and should only be updated via an Azure DevOps pipeline or a Resource Manager template.
- Each user makes changes in their private branches.
- Push to master isn't allowed. Users must create a pull request to make changes.
- The Azure DevOps pipeline build is triggered every time a new commit is made to master. It validates the resources and generates an ARM template as an artifact if validation succeeds.
- The DevOps Release pipeline is configured to create a new release and deploy the ARM template each time a new build is available.
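The deployment step of that release can be pictured with the sketch below, which applies the ARM template artifact produced by the CI build to a target (UAT or production) factory using azure-mgmt-resource. The subscription, resource group, and parameter-file names are placeholders, and in the framework this logic runs from an Azure Pipelines release task rather than a standalone script.

```python
# Sketch of the release step that deploys the exported ARM template to a
# target (UAT or production) factory. Paths and names are placeholders.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
TARGET_RESOURCE_GROUP = "rg-data-platform-uat"     # placeholder

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# ARM template and environment-specific parameter file produced by the CI build.
with open("ARMTemplateForFactory.json") as f:
    template = json.load(f)
with open("ARMTemplateParametersForFactory.uat.json") as f:   # placeholder name
    parameters = json.load(f)["parameters"]

deployment = client.deployments.begin_create_or_update(
    TARGET_RESOURCE_GROUP,
    "adf-release-deployment",
    {
        "properties": {
            "mode": "Incremental",
            "template": template,
            "parameters": parameters,
        }
    },
)
deployment.wait()  # block until the deployment completes
print(deployment.result().properties.provisioning_state)
```

As recommended in the ADF CI/CD documentation, a pre- and post-deployment script that stops and restarts triggers usually wraps this step.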
We follow the release workflow below; for more details, please read this documentation.
Back to Top ⬆
Contributions to the project are welcome! If you have ideas for improvements, feature requests, or bug reports, feel free to open an issue or submit a pull request.
Let's collaborate to make data integration with Azure Data Factory more efficient and scalable!
Back to Top ⬆
ADF Universal Framework version life cycle:
Version | Current Patch/Minor | State | First Release | Limited Support | EOL/Terminated |
---|---|---|---|---|---|
2 | 2.1.0 | Supported | Jun 30, 2024 | TBD | TBD |
1.4 | 1.4.3 | EOL | May 31, 2024 | Dec 31, 2024 | Dec 31, 2024 |
1.3 | 1.3.0 | EOL | Apr 30, 2024 | Dec 31, 2024 | Dec 31, 2024 |
1.2 | 1.2.5 | EOL | Mar 31, 2024 | Dec 31, 2024 | Dec 31, 2024 |
1.1 | 1.1.1 | EOL | Feb 28, 2024 | Dec 31, 2024 | Dec 31, 2024 |
- CI/CD lifecycle - Continuous integration and delivery in Azure Data Factory
- How to set up self-hosted Windows agents
- Register an agent using a personal access token (PAT)
- Run the agent - interactively
- Run the agent - service
- CI/CD flow - Continuous deployment improvements
- Walkthrough of CICD in Azure Data Factory (ADF)
- DataOps for the modern data warehouse
- metadata
Actual difficulties encountered:
- Debugging Azure copy activity delimiters
- Skip the use of row count
- IR configuration
- Configuration and usage of Key Vault (see the sketch after this list)
- Use the Get Metadata activity to determine the existence and last-modified datetime of a file
- Parameterized universal pipeline
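For the Key Vault item above, a minimal sketch of reading a connection secret with azure-identity and azure-keyvault-secrets is shown below; the vault URL and secret name are placeholders. Inside ADF, linked services normally reference the Key Vault secret directly, so code like this is mainly useful in supporting scripts and tests.

```python
# Minimal sketch: read a connection secret from Azure Key Vault.
# Vault URL and secret name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://kv-data-platform.vault.azure.net"   # placeholder
SECRET_NAME = "sql-source-connection-string"             # placeholder

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
secret = client.get_secret(SECRET_NAME)

# Use the secret value (e.g. as a connection string) without printing it in full.
print(f"Retrieved secret '{SECRET_NAME}' (length {len(secret.value)}).")
```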
Expected functionality:
- Make the pipeline more universal, with all configurable items defined in the control table
- Customize the content of notification emails
- Complete all pipeline invocations through a single main (orchestrator) call
- Record the running status of each pipeline in a log table
- Monitor the running status of pipelines that need to be executed in real time
- Implement a pipeline error-rerun mechanism (see the sketch below)
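One possible shape for the error-rerun mechanism is sketched here: query recent failed runs and restart each one in recovery mode from the failed activity, using azure-mgmt-datafactory. The resource names are placeholders, and this is only an assumed implementation of the expected feature, not the framework's actual code.

```python
# Sketch of an error-rerun mechanism: find failed runs from the last 24 hours
# and restart each one from the failed activity. Names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-platform"      # placeholder
FACTORY_NAME = "adf-orchestrator-dev"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Query pipeline runs from the last 24 hours that ended in "Failed".
now = datetime.now(timezone.utc)
failed_runs = client.pipeline_runs.query_by_factory(
    RESOURCE_GROUP,
    FACTORY_NAME,
    RunFilterParameters(
        last_updated_after=now - timedelta(hours=24),
        last_updated_before=now,
        filters=[RunQueryFilter(operand="Status", operator="Equals", values=["Failed"])],
    ),
)

for run in failed_runs.value:
    # Recovery mode reruns the pipeline starting from the activity that failed.
    rerun = client.pipelines.create_run(
        RESOURCE_GROUP,
        FACTORY_NAME,
        run.pipeline_name,
        reference_pipeline_run_id=run.run_id,
        is_recovery=True,
        start_from_failure=True,
    )
    print(f"Rerun of {run.pipeline_name} ({run.run_id}) started as {rerun.run_id}")
```

The same query results could also be written to the log table mentioned above to record run status for monitoring.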