ClarAVy

ClarAVy (pronounced like 'clarify') is a command-line application that summarizes the antivirus detections for one or more malicious files. ClarAVy now supports labeling malware by family and outputs a confidence score for each of its family predictions. It also tags malware according to category/behavior (e.g. ransomware, downloader, autorun), file properties (e.g. elf, pdf, java), exploited vulnerability (e.g. cve_2017_0144, ms08_067), packer (e.g. upx, themida), and threat group (e.g. lazarusgroup, apt1).

Installation:

ClarAVy can be installed as a pip package. Python 3.10+ is required. After installation, ClarAVy can be run from the command line using the claravy command.

pip install git+https://github.com/FutureComputing4AI/ClarAVy

Usage:

ClarAVy accepts .jsonl files containing antivirus scan results in the VirusTotal v2 or v3 API format as input. One or more .jsonl files can be given using the -f flag. Use the -d flag to pass a directory of .jsonl files instead.

claravy -f examples/v3_scan.jsonl

claravy -f examples/v2_scan.jsonl -f examples/v3_scan.jsonl

claravy -d examples/
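
If your scan reports are stored as separate JSON documents, they need to be combined into JSON Lines form (one report per line) before they can be passed with -f. The following is a minimal Python sketch of that step. It assumes each report is already a dict in VirusTotal v2 or v3 format; write_jsonl and read_jsonl are hypothetical helper names and are not part of ClarAVy.

```python
import json
from pathlib import Path


def write_jsonl(reports, out_path):
    """Write one JSON object per line (JSON Lines)."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as fh:
        for report in reports:
            # json.dumps escapes newlines inside strings, so each
            # report is guaranteed to occupy exactly one line.
            fh.write(json.dumps(report) + "\n")
    return out_path


def read_jsonl(path):
    """Yield one parsed report per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

The resulting file can then be given to claravy with -f, or placed in a directory that is passed with -d.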

By default, ClarAVy writes results to stdout. The -o flag causes the results to be written to a file instead. ClarAVy uses nearly the same output format as AVClass2:

claravy -f examples/v3_scan.jsonl -o out_file.txt

$ cat out_file.txt
cb327e327196d5f49e711a4d8df07dbc        60/72   FAM:wannacry|88.20%,BEH:ransom|15,BEH:exploit|7,VULN:cve_2017_0147|5

Each line of output has the file's hash, the AV detection ratio, a family label, and tags. The family label is followed by ClarAVy's confidence in its prediction, and each tag is followed by the number of AV products which support that tag. Like AVClass, ClarAVy outputs 'SINGLETON' if a file's AV detections are not informative enough to predict a family.

Customizing ClarAVy Preferences

ClarAVy's default configuration files are located in the claravy/data/ directory. You can edit these files directly or supply your own with the flags described in the following sections.

Changing the preferred set of AV products

The file claravy/data/default_avs.txt lists the set of 103 antivirus products that ClarAVy supports by default. This file also records whether each AV product is known to be associated with any other AVs (due to sub-licensing another AV's engine, being owned by the same company, or having a sharing partnership). If multiple AV products with known associations agree on a tag, they are counted as a single vote. Use the -av flag if you want to use a different set of supported AV products.

claravy -f examples/v3_scan.jsonl -av my_av_file.txt
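
The single-vote rule for associated AV products can be illustrated with a short Python sketch. This is not ClarAVy's implementation: count_votes, the group ids, and the AV names in the usage note are all hypothetical, and the sketch assumes each AV belongs to at most one association group.

```python
from collections import defaultdict


def count_votes(detections, group_of):
    """Count supporting votes per tag, one vote per group of associated AVs.

    detections: dict mapping AV name -> set of tags that AV supports
    group_of:   dict mapping AV name -> group id; unlisted AVs vote alone
    """
    voters = defaultdict(set)
    for av, tags in detections.items():
        group = group_of.get(av, av)  # an unassociated AV is its own group
        for tag in tags:
            voters[tag].add(group)
    # The vote total for a tag is the number of distinct groups behind it.
    return {tag: len(groups) for tag, groups in voters.items()}
```

For example, if the made-up products EngineA and EngineA_OEM share a group and both report BEH:ransom, while an unrelated EngineB reports it too, the tag receives two votes rather than three.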

Customizing the token taxonomy

ClarAVy automatically tokenizes and parses each AV detection using a set of over 900 parsing rules. When ClarAVy encounters an unfamiliar token, it uses its parsing results to determine whether it is a malware family, category, file property, etc. A list of known tokens can be found in claravy/data/default_taxonomy.txt. It was generated by running ClarAVy on approximately 40 million VirusTotal reports. Then, we manually reviewed and edited the taxonomy to our own preferences. Use the -tax flag if you want to define your own custom token taxonomy:

claravy -f examples/v3_scan.jsonl -tax my_taxonomy_file.txt

Customizing alias preferences

ClarAVy also automatically identifies aliases, which are tokens that have different spellings but identical meanings (e.g. ransom and ransomware). We generated the alias mapping in claravy/data/default_aliases.txt using the same method as the token taxonomy. You can use the -al flag to define your own alias preferences:

claravy -f examples/v3_scan.jsonl -al my_alias_file.txt
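
Conceptually, an alias mapping rewrites every spelling of a token to one preferred spelling before tags are counted. The Python sketch below shows that idea with an in-memory dict. It does not read or reproduce the format of default_aliases.txt, canonicalize is a hypothetical name, and treating ransom as the preferred spelling is an assumption based on the example output earlier in this README.

```python
def canonicalize(tokens, aliases):
    """Map each token to its preferred spelling, dropping duplicates.

    tokens:  iterable of lowercase tokens from one AV detection
    aliases: dict mapping an alternate spelling -> preferred spelling
    """
    seen = []
    for token in tokens:
        preferred = aliases.get(token, token)  # unknown tokens pass through
        if preferred not in seen:              # keep first-seen order
            seen.append(preferred)
    return seen
```

With aliases = {"ransomware": "ransom"}, the tokens ransomware, ransom, and wannacry collapse to ransom and wannacry, so one AV cannot vote twice for the same meaning.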

Adjusting tagging thresholds

You can choose custom thresholds for how many antivirus products must agree in order to output a tag. By default, this is 5 for behavior and file tags, and 1 for vulnerability, packer, and threat group tags. Raising these thresholds increases accuracy but may also result in missed tags. The -bt, -ft, -vt, -pt, and -gt flags set the voting thresholds for behavior, file, vulnerability, packer, and threat group tags respectively.

claravy -f examples/v3_scan.jsonl -bt 10 -ft 10 -vt 3 -pt 3 -gt 3
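
The effect of these thresholds can be sketched in Python as a filter over vote counts. This is an illustration rather than ClarAVy's code: filter_tags is a hypothetical name, the comparison is assumed to be inclusive (a tag with exactly the threshold number of votes is kept), and only the BEH and VULN prefixes appear in the example output above, so the FILE, PACK, and GRP prefixes are assumptions.

```python
# Default thresholds as described above: 5 for behavior and file tags,
# 1 for vulnerability, packer, and threat group tags.
# The FILE, PACK, and GRP prefix names are assumed for illustration.
DEFAULT_THRESHOLDS = {"BEH": 5, "FILE": 5, "VULN": 1, "PACK": 1, "GRP": 1}


def filter_tags(votes, thresholds=DEFAULT_THRESHOLDS):
    """Keep tags whose vote count meets the threshold for their category.

    votes: dict mapping "CATEGORY:tag" -> number of supporting AV votes
    """
    kept = {}
    for tag, count in votes.items():
        category = tag.split(":", 1)[0]
        # Tags in categories without a configured threshold are dropped.
        if category in thresholds and count >= thresholds[category]:
            kept[tag] = count
    return kept
```

Passing thresholds equivalent to the command above (10, 10, 3, 3, 3) would drop a behavior tag with 7 votes that the defaults would keep, which is the trade-off between stricter agreement and missed tags.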

Processing Lots of Data

ClarAVy supports multiprocessing and can handle tens of millions of scan reports. The --num-processes flag sets the number of workers for parsing antivirus scans in parallel, and the --batch-size flag sets the number of scans that each worker processes at a time. Increasing the number of workers and the batch size reduces runtime for large sets of scan reports, but it also consumes more memory and increases I/O.

claravy -f examples/v3_scan.jsonl --num-processes 8 --batch-size 4096
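
The relationship between workers and batch size can be pictured with a generic Python sketch of batched parallel parsing. It is not how ClarAVy is implemented internally: batched, parse_batch, and parse_file are hypothetical names, and the worker here only decodes JSON rather than parsing AV labels.

```python
import json
from itertools import islice
from multiprocessing import Pool


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def parse_batch(lines):
    """Worker: decode one batch of JSON Lines records."""
    return [json.loads(line) for line in lines if line.strip()]


def parse_file(path, num_processes=8, batch_size=4096):
    """Decode a .jsonl file in parallel, one batch per task."""
    reports = []
    with open(path, encoding="utf-8") as fh, Pool(num_processes) as pool:
        # imap yields results in batch order. Larger batches mean fewer
        # tasks but more data held in memory per worker at a time.
        for parsed in pool.imap(parse_batch, batched(fh, batch_size)):
            reports.extend(parsed)
    return reports
```

On platforms that start worker processes with spawn (Windows and macOS by default), a call to parse_file should sit under an `if __name__ == "__main__":` guard.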

How is ClarAVy Different From AVClass?

AVClass is a similar tool which also uses antivirus scan data to tag malware. ClarAVy distinguishes itself from AVClass2 with its comprehensive antivirus label parsing. Antivirus products output labels in many different types of formats, and certain types of tokens tend to appear in predictable locations within those formats. ClarAVy uses the format of an antivirus label to select an appropriate parsing function, which then applies basic pattern matching to determine the type of each token in the label. ClarAVy supports 103 common antivirus products and can parse over 900 different antivirus label formats. ClarAVy's parsers have coverage for 99.5% of the 1.1 billion antivirus labels we tested the tool with. When ClarAVy encounters a rare antivirus label format it does not support, it is able to make inferences about tokens it has parsed elsewhere.
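
The idea of choosing a parser from a label's format can be illustrated with a toy Python sketch. Everything in it is hypothetical: the format signature scheme, the two rules in FORMAT_RULES, and the token type names are invented for illustration and do not reproduce any of ClarAVy's 900+ parsing rules.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")


def label_format(label):
    """Reduce a label to its shape: each token becomes 'T', delimiters stay."""
    return TOKEN.sub("T", label)


# Hypothetical rules: format signature -> token type at each position.
FORMAT_RULES = {
    "T.T.T": ("category", "file", "family"),
    "T:T/T": ("category", "file", "family"),
}


def parse_label(label):
    """Return (token, type) pairs, or None if the format has no rule."""
    rule = FORMAT_RULES.get(label_format(label))
    if rule is None:
        return None  # unsupported format in this toy rule set
    tokens = [t.lower() for t in TOKEN.findall(label)]
    return list(zip(tokens, rule))
```

With these made-up rules, a label shaped like Ransom:Win32/WannaCrypt has the signature T:T/T, so its three tokens are typed by position, while a label with an unknown shape returns None.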

ClarAVy also uses different strategies for identifying token aliases and for ranking tags produced by antivirus products with known correlations between them. It uses a Variational Bayesian approach for intelligently inferring the most likely malware family. To learn more about how ClarAVy predicts family labels, please refer to our paper: https://arxiv.org/abs/2502.02759. If you use ClarAVy for your research, please cite it using:

@inproceedings{joyce2025claravy,
      title={ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling},
      author={Robert J. Joyce and Derek Everett and Maya Fuchs and Edward Raff and James Holt},
      year={2025},
      booktitle={Companion Proceedings of the ACM on Web Conference 2025 (WWW Companion '25)},
}

ClarAVy was used to label 5.5 million files in the MalDICT dataset. If you plan to use ClarAVy to label large malware datasets, please refer to and cite our other paper: https://arxiv.org/abs/2310.11706

@inproceedings{joyce2023maldict,
      title={MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers},
      author={Robert J. Joyce and Edward Raff and Charles Nicholas and James Holt},
      year={2023},
      booktitle={Proceedings of the Conference on Applied Machine Learning in Information Security},
      pages={105-121},
}
