Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs
Major Updates
- 💥 Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actors and the BTS algorithm to complete equivalence class merging. #502
- 💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to better support various dataset formats (meta, stats). #514 #518
- Provide several format conversion tools for converting to and from the Data-Juicer format. #514
- 🚀 Add 10 more post-tuning OPs to better process post-tuning datasets. They are listed in detail in the New OPs section below. #513
- 🚀 Support Ray Actor mode for GPU-based OPs. #511
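To illustrate the equivalence class merging mentioned for the MinHashLSH deduplicator, here is a minimal single-process Union-Find sketch. It is not the Ray Actor and BTS based implementation from #502; the class and function names (`UnionFind`, `duplicate_clusters`) and the input shape (pairs of sample IDs flagged as near-duplicates) are assumptions made for illustration only.

```python
from collections import defaultdict


class UnionFind:
    """Minimal single-process disjoint-set structure (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own root.
        self.parent.setdefault(x, x)
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Keep the smaller ID as the representative of the merged set.
        if rb < ra:
            ra, rb = rb, ra
        self.parent[rb] = ra


def duplicate_clusters(pairs):
    """Merge candidate duplicate pairs into equivalence classes."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    clusters = defaultdict(list)
    for item in list(uf.parent):
        clusters[uf.find(item)].append(item)
    return {root: sorted(members) for root, members in clusters.items()}


# Hypothetical candidate pairs, e.g. samples that collided in an LSH band.
clusters = duplicate_clusters([(1, 2), (2, 3), (5, 6)])
print(clusters)
```

In a deduplication setting, one sample per class (here, the representative with the smallest ID) would typically be kept and the rest dropped. A distributed variant has to shard this structure across workers and reconcile roots between them, which this sketch does not attempt.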
New OPs
Post-tuning OPs for fine-grained analysis of dialog data. #513
Mapper
- dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
- dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
- dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in the default prompt) in feed back dialog data.
- dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
- query_intent_detection_mapper: Mapper to predict user's intent label in a query.
- query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- query_topic_detection_mapper: Mapper to predict user's topic label in a query.
Aggregator
- meta_tags_aggregator: Merge similar meta tags into one tag.
Selector
- tags_specified_field_selector: Select samples based on the tags of a specified field.
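As a rough sketch of what tag-based selection on a specified field can look like, the snippet below keeps samples whose value at a dotted field path is one of the target tags. The function name `select_by_tags`, its parameters, and the sample layout are hypothetical and are not the OP's actual interface.

```python
def select_by_tags(samples, field, target_tags):
    """Keep samples whose value at `field` is one of `target_tags`.

    `field` is a dotted path such as "meta.sentiment" (an assumed convention).
    Tag values are assumed to be hashable, e.g. strings.
    """
    targets = set(target_tags)
    selected = []
    for sample in samples:
        value = sample
        # Walk the dotted path; a missing key yields None and is not selected.
        for key in field.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None and value in targets:
            selected.append(sample)
    return selected


# Hypothetical samples carrying a sentiment tag under a "meta" field.
samples = [
    {"query": "Great tool!", "meta": {"sentiment": "positive"}},
    {"query": "It crashed again.", "meta": {"sentiment": "negative"}},
    {"query": "How do I install it?", "meta": {}},
]
kept = select_by_tags(samples, "meta.sentiment", ["positive", "neutral"])
print([s["query"] for s in kept])
```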
Grouper
- naive_reverse_grouper: Split a batched sample into individual samples.
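The idea of splitting a batched sample can be sketched as turning a dict of equal-length lists into a list of dicts. This is a minimal illustration under that assumed layout; the helper name `unbatch` is hypothetical, and the actual OP may handle extra fields or metadata differently.

```python
def unbatch(batched):
    """Split a dict of equal-length lists into a list of per-sample dicts."""
    keys = list(batched)
    lengths = {len(batched[k]) for k in keys}
    if len(lengths) > 1:
        raise ValueError("all fields in a batched sample must have the same length")
    n = lengths.pop() if lengths else 0
    # Row i of the output takes the i-th entry of every field.
    return [{k: batched[k][i] for k in keys} for i in range(n)]


# Hypothetical batched sample with two fields.
batched_sample = {"text": ["a", "b"], "id": [1, 2]}
rows = unbatch(batched_sample)
print(rows)
```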
Bug Fixes
- Fix the wrong argument passing in generate_qa_from_example_mapper. #517
- Update the out-of-date Dingding QR code on the main page. #513
Acknowledgement
- @jackylee-ch made their first contribution to help fix several invalid links in the document. #521
Full Changelog: v1.0.2...v1.0.3