Detect character encoding of datasets #17

brandomr · 2023-08-21T14:52:52Z

@YohannParis reports that profiling this dataset fails because the file is not UTF-8 encoded:
us-counties-2023.csv

The service errors with:

{
  "id": "extraction-97066793-d536-4dd1-92bc-600a11415aa7",
  "status": "failed",
  "result": {
    "created_at": "2023-08-21T13:35:52.195837",
    "enqueued_at": "2023-08-21T13:35:52.195896",
    "started_at": "2023-08-21T13:35:52.216881",
    "job_result": null,
    "job_error": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.10/site-packages/rq/worker.py\", line 1428, in perform_job\n    rv = job.perform()\n  File \"/usr/local/lib/python3.10/site-packages/rq/job.py\", line 1278, in perform\n    self._result = self._execute()\n  File \"/usr/local/lib/python3.10/site-packages/rq/job.py\", line 1315, in _execute\n    result = self.func(*self.args, **self.kwargs)\n  File \"/workers/./operations.py\", line 249, in data_card\n    dataset_response, dataset_dataframe, dataset_csv_string = get_dataset_from_tds(\n  File \"/workers/./utils.py\", line 148, in get_dataset_from_tds\n    dataframe = pandas.read_csv(dataset_file)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 912, in read_csv\n    return _read(filepath_or_buffer, kwds)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 577, in _read\n    parser = TextFileReader(filepath_or_buffer, **kwds)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 1407, in __init__\n    self._engine = self._make_engine(f, self.engine)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 1679, in _make_engine\n    return mapping[engine](f, **self.options)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py\", line 93, in __init__\n    self._reader = parsers.TextReader(src, **kwds)\n  File \"pandas/_libs/parsers.pyx\", line 550, in pandas._libs.parsers.TextReader.__cinit__\n  File \"pandas/_libs/parsers.pyx\", line 639, in pandas._libs.parsers.TextReader._get_header\n  File \"pandas/_libs/parsers.pyx\", line 850, in pandas._libs.parsers.TextReader._tokenize_rows\n  File \"pandas/_libs/parsers.pyx\", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status\n  File \"pandas/_libs/parsers.pyx\", line 2021, in pandas._libs.parsers.raise_parser_error\nUnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 87107: invalid continuation byte\n"
  }
}

This can be addressed by dynamically detecting the encoding prior to reading the CSV in pandas. See this notebook for reference on how to do this with chardet.
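As a rough sketch of that approach (not the contents of the referenced notebook), the snippet below guesses the encoding from the raw bytes and then passes it to `pandas.read_csv`. The helper names `detect_encoding` and `read_csv_with_detected_encoding` are hypothetical, and it assumes `dataset_file` is a binary file-like object, which may not match what `get_dataset_from_tds` actually receives. It uses `chardet.detect` when chardet is installed and otherwise falls back to trying UTF-8 and then Latin-1.

```python
import io
from typing import BinaryIO, Sequence


def detect_encoding(raw: bytes, fallbacks: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Return an encoding name that is able to decode `raw` (hypothetical helper)."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None if chardet has no guess
        if guess:
            try:
                raw.decode(guess)
                return guess
            except (UnicodeDecodeError, LookupError):
                pass  # the guess was unusable; fall through to the fallbacks

    # Try each fallback in order; latin-1 accepts any byte sequence.
    for encoding in fallbacks:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


def read_csv_with_detected_encoding(dataset_file: BinaryIO):
    """Read a CSV whose encoding is unknown (assumes a binary file-like object)."""
    import pandas  # imported lazily so detect_encoding works without pandas

    raw = dataset_file.read()
    encoding = detect_encoding(raw)
    return pandas.read_csv(io.BytesIO(raw), encoding=encoding)
```

For example, `"Doña Ana".encode("latin-1")` contains the byte 0xf1 followed by an ASCII letter, which is not valid UTF-8 and produces the same kind of "invalid continuation byte" error as above. Whether the reported file is actually Latin-1 is an assumption; detection is a heuristic and can guess wrong, particularly on short inputs.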


YohannParis · 2023-08-21T15:21:45Z

Yes. Would it be helpful to get this information as part of a Dataset's metadata on TDS, instead of testing for it on ta1-service?

brandomr · 2023-08-21T16:32:34Z

> Yes. Would it be helpful to get this information as part of a Dataset's metadata on TDS, instead of testing for it on ta1-service?

This is our perennial issue: since TDS never "touches" the data, it has no way to detect the encoding and store it. That would have to be done by the TA1 service or the HMI server.

brandomr self-assigned this Aug 21, 2023

brandomr added the integration label Aug 21, 2023
