ECFault

ECFault is a distributed, virtualization-based fault injection framework for stress testing erasure coding (EC) implementations in open-source distributed storage systems (DSS) such as Ceph, HDFS, and DAOS.

Introduction to ECFault

ECFault includes four major components:

ECFault Coordinator: The Coordinator manages the erasure coding configurations of the DSS and sends control requests to the ECFault Workers for EC-oriented DSS manipulation. A submodule named EC Manager controls the EC-related configurations in the DSS. In the case of Ceph, for example, the EC Manager can precisely create an erasure-coded pool with the desired specifications, including the EC plugin (e.g., Jerasure), the EC parameters (e.g., k and m), the chunk size, etc. Besides EC-specific configurations, it also controls other relevant system features that may affect the EC operations, such as the number of placement groups in the erasure-coded pool (i.e., pg_num).
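
For illustration, the snippet below sketches what one EC Manager action might look like for Ceph, driving the stock ceph CLI from Python. The ceph subcommands are standard; the helper function, profile name, and parameter defaults are placeholders, not ECFault's actual API.

    import subprocess

    def create_ec_pool(pool, k, m, plugin="jerasure", pg_num=128):
        """Hypothetical helper: create an erasure-coded Ceph pool."""
        profile = f"{pool}-profile"
        # Define an erasure-code profile with the desired k/m and plugin.
        subprocess.run(["ceph", "osd", "erasure-code-profile", "set", profile,
                        f"k={k}", f"m={m}", f"plugin={plugin}"], check=True)
        # Create an erasure-coded pool backed by that profile.
        subprocess.run(["ceph", "osd", "pool", "create", pool, str(pg_num),
                        str(pg_num), "erasure", profile], check=True)

    create_ec_pool("ecpool", k=4, m=2)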

ECFault Worker: Workers listen for requests from the Coordinator and carry out two major jobs: (1) virtual disk provisioning for the DSS storage servers, which decouples the storage devices from the target DSS servers to allow easy control of storage states; and (2) DSS manipulation, which uses a set of submodules to inject a variety of faults that trigger the EC operations in the target DSS under different workloads and configurations. The ECFault Worker currently supports the following failure types (a sketch of one such injection follows the list):

  • Node failure
  • Device failure
  • Block failure
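
As a rough sketch of how a device failure could be injected under this design (the function name is illustrative, not the Worker's actual interface): because each storage server's disk is a virtual NVMe-oF device (see the setup steps below), severing the fabric connection makes the disk vanish from the server.

    import subprocess

    def inject_device_failure(nqn="nvmet-0"):
        # Disconnect the NVMe-over-TCP subsystem; the backing disk
        # disappears from the host, which the DSS observes as a device
        # failure and answers with EC recovery.
        subprocess.run(["nvme", "disconnect", "-n", nqn], check=True)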

Monitor: The Monitor is co-located with the metadata node, which holds the cluster's system information (e.g., system topology, object map, erasure code parameters). It collects disk I/O and network traffic statistics and sends them to the Coordinator through Kafka for analysis of erasure coding performance.
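
A minimal sketch of the Kafka leg, assuming the kafka-python and iostat-tool dependencies listed below; the broker address, topic name, and host label are placeholders:

    import json
    import subprocess
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="coordinator-host:9092",  # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Collect one extended iostat device sample and publish it.
    sample = subprocess.run(["iostat", "-dx", "1", "1"],
                            capture_output=True, text=True).stdout
    producer.send("ecfault-monitor", {"host": "storage-node-1", "iostat": sample})
    producer.flush()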

Workload: Workload provides a series of configurable I/O workloads for three Ceph interfaces (an example invocation follows the list):

  • RADOS
  • RBD
  • CephFS
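
For instance, a simple write workload against the erasure-coded pool could be driven through the RADOS interface with the stock rados bench tool (illustrative; ECFault's workload module may generate I/O differently):

    import subprocess

    def rados_write_workload(pool="ecpool", seconds=60):
        # Default rados bench write: sequential 4 MiB object writes
        # to the (erasure-coded) pool for the given duration.
        subprocess.run(["rados", "bench", "-p", pool, str(seconds), "write"],
                       check=True)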

Getting Started with ECFault

Steps to set up and run the tool:

  1. Ensure you can run sudo commands (several of the steps below require root privileges)

  2. Install dependencies:

    sudo apt install nvme-cli configshell-fb nvmetcli
    sudo apt install -y protobuf-compiler
    pip install kafka-python
    pip install iostat-tool
    pip install grpcio
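
     Roughly speaking, nvme-cli, configshell-fb, and nvmetcli support the virtual NVMe device setup below; kafka-python and iostat-tool serve the Monitor's statistics pipeline; and protobuf-compiler and grpcio presumably back the RPC between the Coordinator and the Workers.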
    
  3. Create a virtual NVMe device:

    ./nvmebk_create.sh
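
     The script's contents are not reproduced here. As background, the sketch below shows roughly how an NVMe-over-TCP target is typically assembled through the kernel's nvmet configfs tree (run as root after `modprobe nvmet nvmet-tcp`); the NQN and port match step 4, while the backing device and IP address are assumptions:

    import os

    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    NVMET = "/sys/kernel/config/nvmet"
    subsys = f"{NVMET}/subsystems/nvmet-0"

    # Define the subsystem and back namespace 1 with a block device.
    os.makedirs(f"{subsys}/namespaces/1")
    write(f"{subsys}/attr_allow_any_host", "1")
    write(f"{subsys}/namespaces/1/device_path", "/dev/loop0")  # assumed backing device
    write(f"{subsys}/namespaces/1/enable", "1")

    # Define a TCP port on 4420 and expose the subsystem on it.
    port = f"{NVMET}/ports/1"
    os.makedirs(port)  # the kernel auto-populates the port's attributes
    for attr, val in [("addr_trtype", "tcp"), ("addr_adrfam", "ipv4"),
                      ("addr_traddr", "10.0.0.1"), ("addr_trsvcid", "4420")]:
        write(f"{port}/{attr}", val)
    os.symlink(subsys, f"{port}/subsystems/nvmet-0")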
    
  4. Connect to the virtual NVMe device on the target operating system:

    modprobe nvme-fabrics
    nvme discover -t tcp -a <ip_address> -s 4420
    nvme connect -t tcp -n nvmet-0 -a <ip_address> -s 4420
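
     If the discovery and connect succeed, the virtual disk appears on the target machine as a regular NVMe namespace, which `nvme list` (from the nvme-cli package installed in step 2) will show.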
    
  5. Create a DSS cluster on top of the virtual disks as usual

  6. Inject a fault into the DSS with the ECFault Worker:

    python /src/worker.py
    
  7. Observe the erasure coding recovery process in the DSS
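
     In the case of Ceph, for example, the standard `ceph -s` and `ceph -w` commands show degraded placement groups and the progress of their recovery; other DSSs provide equivalent status tools.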

  8. Clean up virtual NVMe devices:

    ./nvmebk_remove.sh
    
