Skip errors when reading a file fails #231

HWoidt · 2023-09-20T19:24:19Z

When reading a file fails, e.g. because the file is not valid utf-8, it should be skipped instead of aborting the whole indexing run.

This addresses #227
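
As a minimal sketch of the behaviour described above (not the code in this PR), the loop below skips files that cannot be read or decoded and carries on with the rest. The function name `read_readable_files` and its return shape are assumptions made for illustration only.

```python
from pathlib import Path


def read_readable_files(paths):
    """Return (contents, skipped): decoded text per path, plus the paths that failed."""
    contents = {}
    skipped = []
    for path in paths:
        try:
            # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
            contents[path] = Path(path).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip only read/decode failures; any other exception still propagates.
            skipped.append(path)
    return contents, skipped
```

Catching only `UnicodeDecodeError` and `OSError` keeps unrelated bugs visible instead of silently swallowing them.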


kantord · 2023-09-21T08:10:59Z

seagoat/engine.py

+ if chunk.chunk_id not in self.cache.data["chunks_already_analyzed"]:
+ chunks_to_process.append(chunk)
+ self.cache.data["chunks_not_yet_analyzed"].add(chunk.chunk_id)
+ except Exception as e:


This is a bit generic; I think it might be a bad idea to skip every error.

It's probably very annoying to crash the server for a repo with hundreds or thousands of files just because one or two files cannot be read. However, most errors are likely to apply to all files, or to the majority of them, in which case the best option is probably to crash the server and let the user open an issue on GitHub so it can be fixed.

I am not sure how to design this well. Maybe there could be a counter, so that it skips the first few errors but crashes on the 5th?

HWoidt · 2023-09-21

Since we know the total number of files, maybe it's better to use a relative cut-off, e.g. abort when more than 1% of all files fail. Alternatively, abort when more than x% of the files processed so far are erroneous. This should nicely catch the case where something is fundamentally wrong and all files are failing.
In larger repos the probability that there are no "weird" files tends to be very small ;) It would be good if the server were somewhat robust with regard to file ingestion.
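
A rough sketch of the "x% of files processed so far" variant, assuming hypothetical names (`FailureBudget`, `TooManyFailures`) and a 1% default. The `min_processed` warm-up is an addition that was not discussed in this thread; without it, a single early failure would exceed any small percentage.

```python
class TooManyFailures(RuntimeError):
    """Raised when the share of failed files exceeds the allowed ratio."""


class FailureBudget:
    def __init__(self, max_failure_ratio=0.01, min_processed=20):
        self.max_failure_ratio = max_failure_ratio
        self.min_processed = min_processed  # warm-up before the ratio is enforced
        self.processed = 0
        self.failed = 0

    def record(self, ok):
        """Count one file; abort once too many of the files seen so far have failed."""
        self.processed += 1
        if not ok:
            self.failed += 1
        if (
            self.processed >= self.min_processed
            and self.failed > self.max_failure_ratio * self.processed
        ):
            raise TooManyFailures(
                f"{self.failed} of {self.processed} files failed"
            )
```

If every file fails, this aborts as soon as `min_processed` files have been seen. For the "1% of all files" variant, compare `self.failed` against the ratio times the known total instead of `self.processed`.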

The pre-commit check complains about print(): Shall we just use logging.error() in the server or do you have something else in mind for log messages?

kantord · 2023-09-21

Yeah, making it a % makes sense to me!

The pre-commit check complains about print(): Shall we just use logging.error() in the server or do you have something else in mind for log messages?

Yeah, I think it would make sense to use logging.error()
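
For illustration, replacing `print()` with `logging.error()` could look roughly like the following. The helper name `report_skipped_file` and the message format are assumptions, not code from this PR.

```python
import logging

# Module-level logger; handlers and levels are left to the application's logging setup.
logger = logging.getLogger(__name__)


def report_skipped_file(path, error):
    """Log a skipped file at ERROR level instead of printing it."""
    # Lazy %-style arguments: the message is only formatted if it is emitted.
    logger.error("Skipping %s: %s", path, error)
```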

Skip errors when reading a file fails

45b399a


kantord reviewed Sep 21, 2023


danipozo mentioned this pull request Sep 22, 2023

Try to ignore binary files and detect proper encoding #240

Merged

cori mentioned this pull request Sep 25, 2023

"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81" #250

Closed

kantord force-pushed the main branch 2 times, most recently from 89aa53c to 12b4145 Compare September 27, 2023 16:59
