Detect character encoding of datasets #17

Open
brandomr opened this issue Aug 21, 2023 · 2 comments
@brandomr
Contributor

@YohannParis reports that profiling this dataset fails because the file is not UTF-8 encoded:
us-counties-2023.csv

The service errors with:

{
  "id": "extraction-97066793-d536-4dd1-92bc-600a11415aa7",
  "status": "failed",
  "result": {
    "created_at": "2023-08-21T13:35:52.195837",
    "enqueued_at": "2023-08-21T13:35:52.195896",
    "started_at": "2023-08-21T13:35:52.216881",
    "job_result": null,
    "job_error": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.10/site-packages/rq/worker.py\", line 1428, in perform_job\n    rv = job.perform()\n  File \"/usr/local/lib/python3.10/site-packages/rq/job.py\", line 1278, in perform\n    self._result = self._execute()\n  File \"/usr/local/lib/python3.10/site-packages/rq/job.py\", line 1315, in _execute\n    result = self.func(*self.args, **self.kwargs)\n  File \"/workers/./operations.py\", line 249, in data_card\n    dataset_response, dataset_dataframe, dataset_csv_string = get_dataset_from_tds(\n  File \"/workers/./utils.py\", line 148, in get_dataset_from_tds\n    dataframe = pandas.read_csv(dataset_file)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 912, in read_csv\n    return _read(filepath_or_buffer, kwds)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 577, in _read\n    parser = TextFileReader(filepath_or_buffer, **kwds)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 1407, in __init__\n    self._engine = self._make_engine(f, self.engine)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 1679, in _make_engine\n    return mapping[engine](f, **self.options)\n  File \"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py\", line 93, in __init__\n    self._reader = parsers.TextReader(src, **kwds)\n  File \"pandas/_libs/parsers.pyx\", line 550, in pandas._libs.parsers.TextReader.__cinit__\n  File \"pandas/_libs/parsers.pyx\", line 639, in pandas._libs.parsers.TextReader._get_header\n  File \"pandas/_libs/parsers.pyx\", line 850, in pandas._libs.parsers.TextReader._tokenize_rows\n  File \"pandas/_libs/parsers.pyx\", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status\n  File \"pandas/_libs/parsers.pyx\", line 2021, in pandas._libs.parsers.raise_parser_error\nUnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 87107: invalid continuation byte\n"
  }
}

This can be addressed by dynamically detecting the encoding prior to reading the CSV in pandas. See this notebook for reference on how to do this with chardet.
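
For illustration, here is a minimal sketch of what that detection step could look like. It is not the code from the linked notebook, and `detect_encoding` and `read_csv_bytes` are hypothetical helper names, not functions in the service. The sketch assumes the raw file bytes are available in memory, tries strict UTF-8 first, then asks `chardet` for a guess if it is installed, and otherwise falls back to Latin-1 (where byte 0xf1 is "ñ"), which can decode any byte sequence.

```python
import io


def detect_encoding(raw: bytes, fallback: str = "latin-1") -> str:
    """Return the name of an encoding that can decode `raw` (hypothetical helper)."""
    # Strict UTF-8 decoding is unambiguous, so try it before guessing.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Not valid UTF-8: ask chardet for a guess, if it is installed.
    try:
        import chardet
    except ImportError:
        return fallback

    guess = chardet.detect(raw)["encoding"]
    if guess is None:
        return fallback

    # Only trust the guess if Python can actually decode the data with it.
    try:
        raw.decode(guess)
    except (UnicodeDecodeError, LookupError):
        return fallback
    return guess


def read_csv_bytes(raw: bytes):
    """Parse CSV bytes with pandas, using the detected encoding."""
    import pandas  # imported lazily so detect_encoding works without pandas

    return pandas.read_csv(io.BytesIO(raw), encoding=detect_encoding(raw))
```

A detector like `chardet` returns a statistical guess, so the result can be wrong for short or ambiguous files. Trying UTF-8 first keeps the common case deterministic and limits guessing to files that are provably not UTF-8.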

brandomr self-assigned this Aug 21, 2023
@YohannParis
Member

Yes. Would it be helpful to get this information as part of a Dataset's metadata on TDS instead of testing on ta1-service?

@brandomr
Contributor Author

> Yes. Would it be helpful to get this information as part of a Dataset's metadata on TDS instead of testing on ta1-service?

This is our perennial issue: since TDS never "touches" the data, it has no way to detect the encoding and store it. The detection would have to happen in the TA1 service or the HMI server.

Projects
Status: Todo