Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

@HYLcool HYLcool released this 03 Jan 10:59
87efd5e

Major Updates

  • 💥 Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actors and the BTS algorithm to merge equivalence classes. #502
  • 💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
    • Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
    • Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
    • Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
  • 🚀 Add 10 more post-tuning OPs to better process post-tuning datasets. They are listed in detail in the New OPs section below. #513
  • 🚀 Support Ray Actor mode for GPU-based OPs. #511
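
To illustrate the equivalence-class merging step mentioned for the MinHashLSH deduplicator, here is a minimal single-process Union-Find sketch. It is not the Ray Actor based implementation from #502 and does not implement the BTS algorithm; the names `UnionFind` and `representatives`, and the idea of keeping the smallest sample id per class, are assumptions made for this example only.

```python
class UnionFind:
    """Minimal single-process Union-Find (disjoint set) over hashable ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen ids as their own root.
        self.parent.setdefault(x, x)
        # Walk up to the root of x's class.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Assumption for this sketch: the smaller id becomes the representative.
        if root_a < root_b:
            self.parent[root_b] = root_a
        else:
            self.parent[root_a] = root_b


def representatives(candidate_pairs):
    """Merge near-duplicate pairs and return one representative id per class."""
    uf = UnionFind()
    for a, b in candidate_pairs:
        uf.union(a, b)
    return sorted({uf.find(x) for x in list(uf.parent)})


# Hypothetical candidate pairs, e.g. sample ids that collided in an LSH band.
pairs = [(1, 2), (2, 3), (5, 6)]
kept = representatives(pairs)
```

In this sketch, samples 1, 2 and 3 end up in one class and samples 5 and 6 in another, so only one id per class would be kept. A distributed version has to partition this state across workers, which is the part the release delegates to Ray Actors.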

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

  • dialog_intent_detection_mapper: Mapper to generate the user's intent labels in feedback dialog data.
  • dialog_sentiment_detection_mapper: Mapper to generate the user's sentiment labels in feedback dialog data.
  • dialog_sentiment_intensity_mapper: Mapper to predict the user's sentiment intensity (from -5 to 5 in the
    default prompt) in feedback dialog data.
  • dialog_topic_detection_mapper: Mapper to generate the user's topic labels in feedback dialog data.
  • query_intent_detection_mapper: Mapper to predict the user's intent label in a query.
  • query_sentiment_detection_mapper: Mapper to predict the user's sentiment label ('negative', 'neutral' and
    'positive') in a query.
  • query_topic_detection_mapper: Mapper to predict the user's topic label in a query.

Aggregator

  • meta_tags_aggregator: Merge similar meta tags into one tag.

Selector

  • tags_specified_field_selector: Select samples based on the tags of the specified field.
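
As a rough illustration of what tag-based selection means, the sketch below keeps only the samples whose value at a given field is one of the wanted tags. It is not the OP's actual code; the function names, the dotted-key lookup, and the sample layout are all assumptions for this example.

```python
def get_nested(sample, dotted_key):
    """Look up a value by a dotted key such as 'meta.sentiment' (hypothetical layout)."""
    value = sample
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def select_by_tags(samples, field_key, target_tags):
    """Return the samples whose tag at field_key is in target_tags."""
    wanted = set(target_tags)
    return [s for s in samples if get_nested(s, field_key) in wanted]


# Hypothetical samples carrying a sentiment tag in their meta field.
samples = [
    {"query": "This is great!", "meta": {"sentiment": "positive"}},
    {"query": "It broke again.", "meta": {"sentiment": "negative"}},
    {"query": "What time is it?", "meta": {"sentiment": "neutral"}},
]
selected = select_by_tags(samples, "meta.sentiment", ["positive", "negative"])
```

Here `selected` contains the first two samples; a sample that lacks the field is simply dropped.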

Grouper

  • naive_reverse_grouper: Split a batched sample into individual samples.
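
The idea of splitting a batched sample can be sketched as follows, assuming a batched sample is a dict that maps each field name to a list of per-sample values. The function name `reverse_group` and this layout are assumptions for illustration, not the OP's actual interface.

```python
def reverse_group(batched):
    """Turn a dict of equal-length lists into a list of per-sample dicts."""
    lengths = {len(values) for values in batched.values()}
    if len(lengths) > 1:
        raise ValueError("all fields of a batched sample must have the same length")
    # An empty batch (no fields at all) yields no samples.
    size = lengths.pop() if lengths else 0
    return [{key: values[i] for key, values in batched.items()} for i in range(size)]


# Hypothetical batched sample with two fields.
batched = {"text": ["first", "second"], "id": [1, 2]}
samples = reverse_group(batched)
```

With the input above, `samples` holds two dicts, one per original sample, each with its own `text` and `id`.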

Bugs Fixed

  • Fix the wrong argument passing in generate_qa_from_example_mapper. #517
  • Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

  • @jackylee-ch made their first contribution by helping fix several invalid links in the documentation. #521

Full Changelog: v1.0.2...v1.0.3