Creating your own dataset load_dataset issue #692
Comments
When I split the file into 1k-line files and run load_dataset on each, it all works fine! To make this easier to reproduce, here is my poison payload, zipped up.
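The 1k-line splitting described above can be reproduced with a small helper. This is a sketch, not the commenter's actual script; the function name, output directory, and chunk size are all assumptions:

```python
import os


def split_jsonl(path, chunk_size=1000, out_dir="chunks"):
    """Split a JSON Lines file into pieces of at most chunk_size lines
    (hypothetical helper; returns the list of chunk file paths)."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, part, paths = [], 0, []

    def flush():
        nonlocal chunk, part
        out = os.path.join(out_dir, f"part-{part:05d}.jsonl")
        with open(out, "w") as g:
            g.writelines(chunk)
        paths.append(out)
        chunk, part = [], part + 1

    with open(path, "r") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                flush()
    if chunk:  # write any remaining partial chunk
        flush()
    return paths
```

Each resulting part can then be passed to load_dataset individually, which is the workaround that was observed to succeed.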
Also, if I remove pull_requests from the JSON, the filtered JSONL loads just fine too, e.g.

```python
import json

filtered_lines = []
with open("datasets-issues.jsonl", "r") as f:
    for line in f:
        data = json.loads(line.strip())  # parse each line as JSON
        if not data.get("pull_request"):  # keep rows where "pull_request" is missing or null
            filtered_lines.append(line)

# Write the filtered lines to a new file
with open("filtered_jsonl.jsonl", "w") as f:
    f.writelines(filtered_lines)
```
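That the file loads once pull_request rows are removed is consistent with schema inference tripping over a field that is sometimes a nested object and sometimes null or absent. A minimal sketch for surveying which JSON types a field takes across the file (the helper name and default field are assumptions, not part of the original report):

```python
import json
from collections import Counter


def survey_field_types(path, field="pull_request"):
    """Count the Python types a given field takes across a JSON Lines file
    (hypothetical diagnostic helper)."""
    counts = Counter()
    with open(path, "r") as f:
        for line in f:
            data = json.loads(line)
            # Missing keys and explicit nulls both surface as NoneType here
            counts[type(data.get(field)).__name__] += 1
    return counts
```

If the counter reports more than one type name (e.g. both dict and NoneType), that mix is a plausible culprit for the load failure.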
https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt
https://discuss.huggingface.co/t/chapter-5-questions/11744/83?u=fancellu
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
barfs with an error.
Someone else reported the same issue in September 2023.