x-CK-x/Clean-Tags-Utility

Clean-Tags-Utility

Contains a series of hardcoded processes and LLM tag-pruning features that prepare data for a final manual review by the user before training. Its purpose is to process messy data webscraped from "any" website, together with data from other sources. This pipeline contains all the pieces needed to fully automate data curation for the user.

Use case:

- The user wants to use messy, unformatted data from various webscraped sites, possibly in combination with their own carefully curated data, or with data from the https://github.com/x-CK-x/Joy-Captioner-Inference or https://github.com/x-CK-x/Model-Builder-DCT tools.
- The user may want to merge the aforementioned data in a way that makes sense.
- The user may want to prune the merged data based on a set of rules specific to the model they are training.
- The user may want the data in the format that loads into the data curation tool for final review: https://github.com/x-CK-x/Dataset-Curation-Tool
- The user may want the data in the exact format needed for LoRA training (except for the trigger "instance token/prompt").

This tool implements solutions for all of these use cases.
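As a rough illustration of the merge, prune, and trigger steps listed above, here is a minimal sketch. It is not this tool's actual code: the function names (`merge_tags`, `prune_tags`, `to_caption`), the comma-separated tag format, and the blocklist-style rule are all assumptions made for the example.

```python
def merge_tags(*tag_strings):
    """Merge comma-separated tag strings (hypothetical format).

    Tags are lowercased and stripped; duplicates are dropped while
    keeping first-seen order.
    """
    seen = set()
    merged = []
    for tags in tag_strings:
        for tag in tags.split(","):
            tag = tag.strip().lower()
            if tag and tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged


def prune_tags(tags, blocked):
    """Drop every tag listed in `blocked` (a stand-in for model-specific rules)."""
    return [tag for tag in tags if tag not in blocked]


def to_caption(tags, trigger=None):
    """Join tags into one caption line, optionally prefixed with a trigger token."""
    parts = ([trigger] if trigger else []) + list(tags)
    return ", ".join(parts)


scraped = "cat, Sitting, hat"
curated = "sitting, outdoors,"
merged = merge_tags(scraped, curated)
pruned = prune_tags(merged, blocked={"hat"})
caption = to_caption(pruned, trigger="mytoken")
print(caption)
```

The trigger token is kept as a separate, optional argument because the use cases above treat it as the one piece the user supplies at training time.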

(IMPORTANT) LLM usage with Hugging Face models goes through API tokens: you need to request access to the gated models on HF, then create a token in the API tokens section of your account settings.

The only LLM that is not gated behind special access is the Phi model.
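How this tool reads the token is not documented here. As a hedged sketch, one common pattern is to keep the token in the `HF_TOKEN` environment variable and send it as a bearer token with each request to Hugging Face. The helper name `hf_auth_headers` is hypothetical and is not part of this repository.

```python
import os


def hf_auth_headers(env=None):
    """Build an HTTP Authorization header from a Hugging Face token.

    Assumes the token is stored in the HF_TOKEN environment variable;
    `env` can be any mapping, which makes the helper easy to test.
    """
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; create a token in your Hugging Face "
            "account settings and request access to the gated model first."
        )
    return {"Authorization": f"Bearer {token}"}


# Example with a placeholder value (not a real token).
headers = hf_auth_headers({"HF_TOKEN": "hf_example"})
print(headers["Authorization"])
```

Keeping the token in an environment variable rather than in a script or config file avoids committing it to version control by accident.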
