Contains a series of hardcoded processes and LLM tag pruning features that prepare data for a final manual review by the user before training. Its purpose is to process messy data webscraped from "any" website, along with data from other sources. This pipeline contains all the pieces needed to completely automate data curation for the user.
The user wants to use messy, unformatted data from various webscraped sites, possibly in combination with their own carefully curated data or with data from the https://github.com/x-CK-x/Joy-Captioner-Inference or https://github.com/x-CK-x/Model-Builder-DCT tools. The user may want to merge the aforementioned data in a way that makes sense. The user may want to prune the merged data based on a set of rules specific to the model they are training. The user may want the data in the format that loads into the data curation tool for final review: https://github.com/x-CK-x/Dataset-Curation-Tool. The user may want the data in the exact format needed for LoRA training (except for the trigger "instance token/prompt").
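As a rough illustration of the merge-then-prune idea described above, here is a minimal sketch. It is not the pipeline's actual code: the function names `merge_tags` and `prune_tags`, the dict-of-tag-lists layout, and the `banned` / `min_tags` rules are all hypothetical and chosen only for the example.

```python
def merge_tags(*sources):
    """Merge per-image tag lists from several sources.

    Each source is assumed (hypothetically) to be a dict mapping an image id
    to a list of tag strings. Tags are lowercased and stripped, duplicates
    are dropped, and first-seen order is kept.
    """
    merged = {}
    for source in sources:
        for image_id, tags in source.items():
            bucket = merged.setdefault(image_id, [])
            for tag in tags:
                norm = tag.strip().lower()
                if norm and norm not in bucket:
                    bucket.append(norm)
    return merged


def prune_tags(dataset, banned=frozenset(), min_tags=1):
    """Drop banned tags, then drop images left with fewer than min_tags tags."""
    pruned = {}
    for image_id, tags in dataset.items():
        kept = [t for t in tags if t not in banned]
        if len(kept) >= min_tags:
            pruned[image_id] = kept
    return pruned


# Example: messy scraped tags combined with hand-curated tags.
scraped = {"img1": ["Cat ", "outdoors"], "img2": ["watermark"]}
curated = {"img1": ["cat", "sitting"]}

merged = merge_tags(scraped, curated)
cleaned = prune_tags(merged, banned={"watermark"})
```

In this sketch the pruning rules are plain Python values, so a different rule set can be passed in for each model being trained.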
(IMPORTANT) LLM usage with HuggingFace models is via API token(s), i.e. you need to request access to the gated models on HF and then create a token under the API tokens section of your account settings.
The only LLM that is not gated behind special access is the Phi model.
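One common way to supply the token without hardcoding it is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper name `resolve_hf_token` is hypothetical, and how this pipeline actually reads the token may differ.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the HuggingFace API token from the environment.

    `env` defaults to os.environ; passing a plain dict makes the helper easy
    to test. Raises RuntimeError when the variable is missing or blank.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; create a token in your HuggingFace "
            "account settings and request access to the gated models first."
        )
    return token
```

The returned string can then be handed to the HuggingFace client, for example `huggingface_hub.login(token=resolve_hf_token())`.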