Description
Clarification and motivation
Problems
This is an issue that I kept in mind for a long time, and since xrefcheck is becoming a serious tool, I think we should have this issue handled.
It is a known minor issue of readFile
that until the full produced lazy bytestring/text is processed, the file descriptor remains open.
An open file descriptor is a resource, and OSes tend to have a limit on the number of file descriptors that can be opened, and maybe a separate limit per process too. Meaning that we should be cautious to not exceed these limits.
In the markdown scanner, we use readFile
and seemingly do not enforce the computation (we apply parsing, but its result is not forced), meaning that the file descriptor won't be closed before the result of scanning being requested, and in case of large repositories this may be a severe issue.
Another issue similar to that: our markdown parser works on a strict Text
, meaning that during parsing the entire file's content will be kept in memory, and memory can be treated as a resource too.
How much severe this issue is? I'm not sure, documentation usually does not take much space, probably even if it is auto-generated. Documentation size is fundamentally limited by how much a human being can read, so we do not expect dozen of gigabytes here, let's further throw this concern away.
Proposed solution
Conceptually, we want two things:
- Minimize the time between
readFile
start and full parsing, as this is the time when we hold the resource (first the opened descriptor, then file content in memory). - Make parsing of various files as parallel as possible, in theory.
So from a ScanAction
object we probably expect that it reads the file as quickly as possible, maybe putting its full content in the memory, but is free to return a not fully evaluated result ((FileInfo, [ScanError])
pair). Forcing it is the caller's responsibility.
What should be done in code
First, let's document the expectations from ScanAction
, add them to haddock of this type.
Next, we should make sure that after readFile
call finishes, the file is fully read.
Since we do not parallelize file reads (in case of I/O against the filesystem this AFAIU does not make much sense), this will automatically resolve our issue with too many opened file descriptors.
Acceptance criteria
- Documentation of
ScanAction
is updated to include the mentioned comments about laziness of each inner action (file read and parsing). - The current implementation of
ScanAction
is updated respectively.