
Avoid resource leaking in scanners #220

Open
@Martoon-00

Description


Clarification and motivation

Problems

This is an issue I have kept in mind for a long time, and since xrefcheck is becoming a serious tool, I think we should get it handled.

It is a known minor issue with lazy readFile that the file descriptor remains open until the produced lazy bytestring/text has been fully processed.
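To illustrate (a minimal self-contained sketch assuming only the `text` package; the file name is made up): lazy readFile hands back text whose backing descriptor stays open until the text is fully demanded, while the strict variant closes the descriptor before returning.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  writeFile "example.md" "# Title\n"
  lazy <- TLIO.readFile "example.md"
  -- The descriptor behind `lazy` stays open until the whole text is
  -- demanded; forcing it (here via TL.length) consumes it and closes
  -- the underlying handle.
  print (TL.length lazy)
  -- Strict readFile reads the entire file and closes the descriptor
  -- before returning, so no resource outlives this call.
  strict <- TIO.readFile "example.md"
  print strict
```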

An open file descriptor is a resource, and OSes tend to limit the number of file descriptors that can be open at once, sometimes with a separate per-process limit as well. This means we should be careful not to exceed these limits.

In the markdown scanner we use readFile and apparently do not force the computation (we apply parsing, but its result is not forced), meaning the file descriptor won't be closed until the result of scanning is demanded; for large repositories this may be a severe issue.


Another, similar issue: our markdown parser works on a strict Text, meaning the entire file's content is kept in memory during parsing, and memory can be treated as a resource too.

How severe is this issue? I'm not sure; documentation usually does not take much space, probably even when auto-generated. Documentation size is fundamentally limited by how much a human being can read, so we do not expect dozens of gigabytes here; let's set this concern aside.

Proposed solution

Conceptually, we want two things:

  1. Minimize the time between the start of readFile and the completion of parsing, as this is the window during which we hold the resource (first the open descriptor, then the file content in memory).
  2. Make parsing of different files as parallel as possible, at least in theory.

So from a ScanAction object we probably expect that it reads the file as quickly as possible, perhaps keeping its full content in memory, but is free to return a not-fully-evaluated result (the (FileInfo, [ScanError]) pair). Forcing it is the caller's responsibility.
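That contract might be sketched as follows. This is only an illustration: FileInfo and ScanError are hypothetical stand-ins (not xrefcheck's real types), and the parallel driver uses forConcurrently from the async package.

```haskell
import Control.Concurrent.Async (forConcurrently)
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Hypothetical stand-ins for xrefcheck's types:
type FileInfo   = String
type ScanError  = String
type ScanAction = FilePath -> IO (FileInfo, [ScanError])

-- The scan action must fully read the file (closing its descriptor)
-- before its IO completes, but may return a lazy pair; the caller
-- forces each result as soon as the scan returns, in parallel.
scanAll :: ScanAction -> [FilePath] -> IO [(FileInfo, [ScanError])]
scanAll scan paths =
  forConcurrently paths $ \p -> evaluate . force =<< scan p

main :: IO ()
main = do
  let scan p = pure (p, [])  -- trivial ScanAction for demonstration
  rs <- scanAll scan ["a.md", "b.md"]
  print rs
```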

What should be done in code

First, let's document the expectations for ScanAction by adding them to the Haddock of this type.

Next, we should make sure that once the readFile call finishes, the file has been fully read (and its descriptor closed).

Since we do not parallelize file reads (for I/O against the filesystem this, AFAIU, does not make much sense), this will automatically resolve our issue of too many open file descriptors.
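Concretely, the read-then-force shape could look like this. A sketch only: `parse` is a trivial stand-in for the real markdown parser, and forcing relies on NFData instances for the result.

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Trivial stand-in for the real markdown parser.
parse :: T.Text -> [Int]
parse = map T.length . T.lines

main :: IO ()
main = do
  writeFile "doc.md" "# A\nsome text\n"
  -- Strict read: the descriptor is closed by the time readFile returns.
  content <- TIO.readFile "doc.md"
  -- Fully evaluate the parse result before handing it back, so no
  -- thunk keeps the file content alive longer than necessary.
  result <- evaluate (force (parse content))
  print result
```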

Acceptance criteria

  • Documentation of ScanAction is updated to include the above notes on the laziness of each inner step (file read and parsing).
  • The current implementation of ScanAction is updated accordingly.
