This project fine-tunes small language models on Ethereum smart contract vulnerabilities to evaluate how well they comprehend Solidity syntax. We compare several small code-generation models to determine which performs best at understanding Solidity for vulnerability detection.
We wanted a project that involved:
- Determining whether smart contract code contains vulnerabilities
- Benchmarking a shortlist of competing lightweight AI models to find which best understand Solidity syntax given Solidity code
- Models with low GPU usage
- Small models with fewer than 500 million parameters
| Model Name | Size |
|---|---|
| GraphCodeBERT | 125M |
| CodeBERT | 125M |
| CodeT5 | 60M-220M |
| PolyCoder | 160M |
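The parameter counts above can be sanity-checked from the architecture. As a sketch, the following computes the approximate size of a RoBERTa-base-style encoder, the architecture underlying CodeBERT and GraphCodeBERT; the hyperparameter values are the published roberta-base configuration, assumed here rather than taken from this project.

```python
# Rough parameter count for a RoBERTa-base-style encoder (the architecture
# behind CodeBERT / GraphCodeBERT). Hyperparameters are the published
# roberta-base values, assumed for illustration.
VOCAB, HIDDEN, LAYERS, FFN, MAX_POS = 50_265, 768, 12, 3_072, 514

def encoder_params(vocab=VOCAB, hidden=HIDDEN, layers=LAYERS,
                   ffn=FFN, max_pos=MAX_POS):
    # Embeddings: token + position + token-type tables, plus a LayerNorm.
    emb = (vocab + max_pos + 1) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections (weights + biases).
    attn = 4 * (hidden * hidden + hidden)
    # Feed-forward block: up- and down-projection (weights + biases).
    ff = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per transformer layer (scale + shift each).
    norms = 2 * 2 * hidden
    return emb + layers * (attn + ff + norms)

print(f"~{encoder_params() / 1e6:.0f}M parameters")  # ~124M, matching the table
```

This lands on the order of 125M, consistent with the sizes listed for CodeBERT and GraphCodeBERT.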
- Smart Contract Code Vulnerability Detection
- Verified Smart Contracts by Storhaug et al., licensed under the MIT License
- Slither Audited Smart Contracts by Martina Rossini, licensed under the MIT License
The 'small-plain-text' subset was used and was filtered from 14k+ rows down to 5k+ rows by merging it with the address-license subset of the Verified Smart Contracts dataset. As before, only source code under the Unlicense was kept. A total of ~5k rows was used for fine-tuning.
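The merge-and-filter step can be sketched as follows. The column names (`address`, `license`, `source_code`), the `Unlicense` label value, and the tiny inline records are hypothetical illustrations standing in for the real dataset schema, which this sketch does not assume to match exactly.

```python
# Sketch of the filtering step: join 'small-plain-text' rows against the
# Verified Smart Contracts 'address-license' subset on contract address,
# then keep only Unlicense-licensed source code. Field names and the
# sample records are hypothetical, not the real dataset schema.

def filter_unlicense(plain_text_rows, address_license_rows):
    # Map contract address -> license from the address-license subset.
    license_by_addr = {r["address"]: r["license"] for r in address_license_rows}
    # Keep only rows whose address resolves to an Unlicense license.
    return [
        r for r in plain_text_rows
        if license_by_addr.get(r["address"]) == "Unlicense"
    ]

plain_text = [
    {"address": "0xaaa", "source_code": "contract A { ... }"},
    {"address": "0xbbb", "source_code": "contract B { ... }"},
]
licenses = [
    {"address": "0xaaa", "license": "Unlicense"},
    {"address": "0xbbb", "license": "MIT"},
]
kept = filter_unlicense(plain_text, licenses)
print(len(kept))  # only the Unlicense contract survives -> 1
```

In the actual pipeline this same join reduced the subset from 14k+ rows to the ~5k rows used for fine-tuning.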