Description
Objective
Define, measure, and improve the reliability and fault tolerance of an example workflow based on Prefect.
Requirements
- Implement tests that quantify the reliability and fault tolerance of the example Prefect workflow
- Create a test system and associated CI capabilities
- Create simulated failures to exercise the test system and confirm that it detects them
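
One way to approach the simulated-failure requirement is sketched below in plain Python, without calling the Prefect API. Everything named here (`SimulatedFailure`, `make_flaky`, `run_with_retries`, `extract`) is hypothetical and for illustration only; in the real workflow the wrapped callable would be a Prefect task and retries would likely be handled by Prefect itself.

```python
import random


class SimulatedFailure(RuntimeError):
    """Raised by the injector to mimic a task crash (hypothetical type)."""


def make_flaky(task, failure_rate, rng):
    """Wrap `task` so each call fails with probability `failure_rate`."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise SimulatedFailure(f"injected failure in {task.__name__}")
        return task(*args, **kwargs)
    wrapper.__name__ = task.__name__
    return wrapper


def run_with_retries(task, retries, *args, **kwargs):
    """Return (result, attempts_used); re-raise once all attempts have failed."""
    for attempt in range(1, retries + 2):
        try:
            return task(*args, **kwargs), attempt
        except SimulatedFailure:
            if attempt == retries + 1:
                raise


def extract():
    """Stand-in for a real pipeline step (assumption: returns a small list)."""
    return [1, 2, 3]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the simulation is reproducible
    flaky_extract = make_flaky(extract, failure_rate=0.3, rng=rng)
    print(run_with_retries(flaky_extract, 3))
```

Seeding the random generator keeps injected failures reproducible in CI, so a failing test can be replayed exactly.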
Prerequisites
- Select and complete Prefect Tutorial #11
- Develop an example Prefect workflow based on a typical pipeline #13
Definition of Done
- The team has implemented tests that quantify the reliability and fault tolerance of the example Prefect workflow
- The team has simulated failures in the operation of the example Prefect workflow to demonstrate the usefulness of the tests
Key Decision Points
- How to measure reliability and fault tolerance?
- What is the appropriate resolution for the measured quantities?
- What are realistic simulations of failure?
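
As a starting point for these decisions, the sketch below shows one possible pair of definitions, both of which are assumptions rather than agreed team definitions: reliability as the fraction of all runs that succeed, and fault tolerance as the fraction of runs that still succeed after at least one injected failure. The names `RunRecord`, `simulate_run`, `reliability`, and `fault_tolerance` are hypothetical, and the simulation is plain Python rather than a Prefect flow.

```python
import random
from dataclasses import dataclass


@dataclass
class RunRecord:
    succeeded: bool          # did the whole run finish?
    failures_injected: int   # how many task attempts were made to fail


def simulate_run(rng, n_tasks, failure_rate, retries):
    """Simulate one workflow run of `n_tasks` tasks with per-task retries."""
    failures = 0
    for _ in range(n_tasks):
        for _attempt in range(retries + 1):
            if rng.random() < failure_rate:
                failures += 1      # this attempt failed; try again if allowed
            else:
                break              # task succeeded; move on to the next task
        else:
            # Every attempt for this task failed, so the run fails.
            return RunRecord(False, failures)
    return RunRecord(True, failures)


def reliability(records):
    """Fraction of all runs that succeeded (assumed definition)."""
    return sum(r.succeeded for r in records) / len(records)


def fault_tolerance(records):
    """Fraction of runs with at least one fault that still succeeded.

    Returns None when no run experienced a fault (assumed definition).
    """
    faulted = [r for r in records if r.failures_injected > 0]
    if not faulted:
        return None
    return sum(r.succeeded for r in faulted) / len(faulted)


if __name__ == "__main__":
    rng = random.Random(42)
    runs = [simulate_run(rng, n_tasks=5, failure_rate=0.2, retries=2)
            for _ in range(1000)]
    print("reliability:", reliability(runs))
    print("fault tolerance:", fault_tolerance(runs))
```

Separating the two numbers matters because a workflow can look reliable simply because faults are rare, while still recovering poorly when one does occur.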
Artifacts
- Initial definitions of reliability and fault tolerance against which to implement tests for monitoring.
- Test system and associated CI capabilities
- Passing tests that function as the basis for a monitoring system of workflows based on Prefect.
Success Criteria
There are established definitions and initial measurements that quantify the reliability and fault tolerance of workflows based on Prefect.
Potential Challenges
- Commonly used monitoring signals (latency, traffic, errors, saturation, time-to-recovery) might be difficult to quantify using workflows that only mock the behavior of domain applications, e.g., sleep functions in place of actual workloads.
- The appropriate measurement resolution remains undefined without knowing the details of integration with other services (such as user interfaces, resource pools, and user demand).
- Without a well understood model of real incidents that might occur in a future working system, simulated failures might provide unrealistic constraints on the development of example workflows.
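
Even with mocked workloads, some of the signals listed above can be computed from a simple log of task outcomes. The sketch below assumes a hypothetical `Event` record (timestamp, success flag, duration) and shows one possible way to derive error rate, mean latency, and time-to-recovery from it; the record layout and function names are illustrative assumptions, not part of Prefect.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since the start of observation
    ok: bool          # True if the task run succeeded
    duration: float   # seconds the task run took


def error_rate(events):
    """Fraction of task runs that failed."""
    return sum(not e.ok for e in events) / len(events)


def mean_latency(events):
    """Mean duration of successful task runs, or None if there were none."""
    durations = [e.duration for e in events if e.ok]
    if not durations:
        return None
    return sum(durations) / len(durations)


def recovery_times(events):
    """Seconds from the first failure of each outage to the next success."""
    times = []
    outage_start = None
    for e in sorted(events, key=lambda e: e.timestamp):
        if not e.ok and outage_start is None:
            outage_start = e.timestamp           # an outage begins
        elif e.ok and outage_start is not None:
            times.append(e.timestamp - outage_start)
            outage_start = None                  # the outage has ended
    return times


if __name__ == "__main__":
    log = [
        Event(0, True, 1.0),
        Event(10, False, 0.5),
        Event(20, False, 0.5),
        Event(30, True, 3.0),
        Event(40, True, 2.0),
    ]
    print(error_rate(log), mean_latency(log), recovery_times(log))
```

The gap between consecutive timestamps bounds how precisely time-to-recovery can be measured, which ties this sketch back to the open question about measurement resolution.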