Creating your own dataset load_dataset issue #692
Comments
When I split the file into 1k-line files and run load_dataset on each, it all works fine! To make this easier to reproduce, here is my poison payload, zipped up.
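The 1k-line splitting described above can be reproduced with a small helper. This is a sketch, not the commenter's actual script; the function name, output directory, and chunk size are all assumptions:

```python
import os


def split_jsonl(path, chunk_size=1000, out_dir="chunks"):
    """Split a JSON Lines file into pieces of at most chunk_size lines
    (hypothetical helper; returns the list of chunk file paths)."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, part, paths = [], 0, []

    def flush():
        nonlocal chunk, part
        out = os.path.join(out_dir, f"part-{part:05d}.jsonl")
        with open(out, "w") as g:
            g.writelines(chunk)
        paths.append(out)
        chunk, part = [], part + 1

    with open(path, "r") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                flush()
    if chunk:  # write any remaining partial chunk
        flush()
    return paths
```

Each resulting part can then be passed to load_dataset individually, which is the workaround that was observed to succeed.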
Also, if I remove pull_requests from the JSON, the filtered JSONL loads just fine too, e.g.

```python
import json

filtered_lines = []
with open("datasets-issues.jsonl", "r") as f:
    for line in f:
        data = json.loads(line.strip())  # parse each line as JSON
        if not data.get("pull_request"):  # keep rows where "pull_request" is missing or null
            filtered_lines.append(line)

# Write the filtered lines to a new file
with open("filtered_jsonl.jsonl", "w") as f:
    f.writelines(filtered_lines)
```
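That the file loads once pull_request rows are removed is consistent with schema inference tripping over a field that is sometimes a nested object and sometimes null or absent. A minimal sketch for surveying which JSON types a field takes across the file (the helper name and default field are assumptions, not part of the original report):

```python
import json
from collections import Counter


def survey_field_types(path, field="pull_request"):
    """Count the Python types a given field takes across a JSON Lines file
    (hypothetical diagnostic helper)."""
    counts = Counter()
    with open(path, "r") as f:
        for line in f:
            data = json.loads(line)
            # Missing keys and explicit nulls both surface as NoneType here
            counts[type(data.get(field)).__name__] += 1
    return counts
```

If the counter reports more than one type name (e.g. both dict and NoneType), that mix is a plausible culprit for the load failure.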
https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt
https://discuss.huggingface.co/t/chapter-5-questions/11744/83?u=fancellu
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
barfs with an error.
Someone else reported the same issue in September 2023.