Error when registering a custom dataset with custom.py #896
Comments
We recommend registering the dataset with --custom_dataset_info instead: https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.md
```python
# custom.py
from typing import Optional, Tuple

from datasets import Dataset as HfDataset
from modelscope import MsDataset

from swift.llm import get_dataset, register_dataset, get_dataset_from_repo
from swift.utils import get_logger

logger = get_logger()


class CustomDatasetName:
    stsb_en = 'stsb-en'


def _preprocess_stsb(dataset: HfDataset) -> HfDataset:
    # Turn each sentence pair into a query prompt and a one-decimal score response.
    prompt = """Task: Based on the given two sentences, provide a similarity score between 0.0 and 5.0.
Sentence 1: {text1}
Sentence 2: {text2}
Similarity score: """
    query = []
    response = []
    for d in dataset:
        query.append(prompt.format(text1=d['text1'], text2=d['text2']))
        response.append(f"{d['label']:.1f}")
    return HfDataset.from_dict({'query': query, 'response': response})


register_dataset(CustomDatasetName.stsb_en, 'huangjintao/stsb', None, _preprocess_stsb, get_dataset_from_repo)

if __name__ == '__main__':
    # test dataset
    train_dataset, val_dataset = get_dataset([CustomDatasetName.stsb_en],
                                             check_dataset_strategy='warning')
    print(f'train_dataset: {train_dataset}')
    print(f'val_dataset: {val_dataset}')
```
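To rule out the preprocessing step as the source of the error, the prompt-building logic can be checked on its own, without swift or a dataset download. The snippet below is only a sketch: `build_pairs` and `sample_rows` are hypothetical names, the rows are made-up values, and it assumes each record has the `text1`, `text2` and `label` fields that `_preprocess_stsb` reads.

```python
# Standalone check of the prompt formatting used in custom.py.
# build_pairs and sample_rows are hypothetical; the rows are invented for illustration.
PROMPT = """Task: Based on the given two sentences, provide a similarity score between 0.0 and 5.0.
Sentence 1: {text1}
Sentence 2: {text2}
Similarity score: """


def build_pairs(rows):
    """Return query/response lists built the same way as _preprocess_stsb."""
    query, response = [], []
    for d in rows:
        query.append(PROMPT.format(text1=d['text1'], text2=d['text2']))
        response.append(f"{d['label']:.1f}")
    return {'query': query, 'response': response}


sample_rows = [
    {'text1': 'A man is playing a guitar.', 'text2': 'A person plays a guitar.', 'label': 4.2},
    {'text1': 'A dog runs.', 'text2': 'The stock market fell.', 'label': 0},
]

pairs = build_pairs(sample_rows)
print(pairs['query'][0])
print(pairs['response'])
```

If this runs cleanly, the formatting itself is probably fine and the failure more likely comes from registration or dataset loading.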