Rewrite file_source and file_parser #102

Pennycook · 2024-08-26T10:44:59Z

Feature/behavior summary

These two files are large, complicated, and poorly documented. The desired functionality is actually very simple:

"Clean" source files, by stripping comments and combining whitespace.
Provide an iterator over some representation of the cleaned lines.

Request attributes

Would this be a refactor of existing code?
Does this proposal require new package dependencies?
Would this change break backwards compatibility?

Related issues

No response

Solution description

Replace file_source with a simpler approach, probably by re-using existing lexers.
Rewrite file_parser to use a simpler representation of the line_info and LineGroup objects.

Additional notes

We should explore implementing this functionality with pygments; it should be possible to iterate through tokens and discard the ones we don't want. This would also open up the opportunity to rewrite other parts of Code Base Investigator to use pygments tokens.

If we do use pygments, we'd be introducing a new dependency.
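As a rough illustration of the token-filtering idea, here is a minimal sketch that assumes pygments is installed and uses its `CLexer`. The function name `clean_lines` is hypothetical and not part of Code Base Investigator. It relies on the C lexer reporting preprocessor directives under `Comment.Preproc` and `Comment.PreprocFile`, so those token types are kept explicitly. Other lexers may classify tokens differently, so the filter would need checking against real inputs.

```python
from typing import Iterator

from pygments.lexer import Lexer
from pygments.lexers import CLexer
from pygments.token import Comment


def clean_lines(source: str, lexer: Lexer) -> Iterator[str]:
    """Yield non-empty lines with comments stripped and whitespace combined.

    Hypothetical helper for illustration only.
    """
    kept = []
    for ttype, value in lexer.get_tokens(source):
        is_comment = ttype in Comment
        # The C lexer tags preprocessor directives as Comment.Preproc or
        # Comment.PreprocFile; we want to keep those.
        is_directive = ttype in Comment.Preproc or ttype in Comment.PreprocFile
        if is_comment and not is_directive:
            # Replace the comment with a space, but keep any newlines it
            # contained so that following lines are not merged together.
            kept.append(" " + "\n" * value.count("\n"))
        else:
            kept.append(value)

    for line in "".join(kept).splitlines():
        combined = " ".join(line.split())  # collapse runs of whitespace
        if combined:
            yield combined


SAMPLE = """\
int   x = 1;  // trailing comment
/* block
   comment */
#ifdef FOO
int\ty   =   2;
#endif
"""

if __name__ == "__main__":
    for cleaned in clean_lines(SAMPLE, CLexer()):
        print(cleaned)
```

Keeping the newlines inside discarded comments is a design choice made here so that line boundaries survive; a real implementation might instead track line numbers per token.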


Pennycook · 2024-10-11T08:56:44Z

After #122 is merged, parse_file accounts for ~20% of execution time in my offline stress test.

Pennycook · 2024-12-19T13:26:39Z

Recent experience (see #144) suggests that it may in fact be preferable to merge the cleaning step into the preprocessor. Our current two-step approach destroys physical line information, because tokenization happens after line continuation is handled. When a source line spans two (or more) physical lines, all tokens are assigned the line number of the first physical line.
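To make the line-number problem concrete, here is a small standard-library sketch of one way to handle line continuation while remembering which physical lines each logical line came from. The names `LogicalLine` and `join_continuations` are hypothetical, and this is not how Code Base Investigator currently works; it only assumes that a trailing backslash marks a continuation.

```python
from typing import Iterator, NamedTuple


class LogicalLine(NamedTuple):
    text: str
    first: int  # 1-based physical line where the logical line starts
    last: int   # 1-based physical line where the logical line ends


def join_continuations(source: str) -> Iterator[LogicalLine]:
    """Join backslash-continued lines, recording the physical line span."""
    parts = []
    first = 0
    number = 0
    for number, physical in enumerate(source.splitlines(), start=1):
        if not parts:
            first = number  # a new logical line starts here
        if physical.endswith("\\"):
            parts.append(physical[:-1])  # drop the backslash, keep going
            continue
        parts.append(physical)
        yield LogicalLine("".join(parts), first, number)
        parts = []
    if parts:
        # The source ended with a dangling continuation.
        yield LogicalLine("".join(parts), first, number)


SOURCE = "#define ADD(a, b) \\\n    ((a) + \\\n     (b))\nint x;\n"

if __name__ == "__main__":
    for logical in join_continuations(SOURCE):
        print(logical.first, logical.last, repr(logical.text))
```

With a span like this attached to each logical line, a later tokenization step could in principle map tokens back to their physical lines instead of assigning everything to the first one.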

Pennycook added the documentation, enhancement, and help wanted labels Aug 26, 2024

Pennycook added this to the 2.0.0 milestone Aug 26, 2024

Pennycook mentioned this issue Nov 15, 2024

Replace is_whitespace function with str.isspace #130

Merged

Pennycook mentioned this issue Dec 18, 2024

Refactor Node using dataclass #144

Merged

Pennycook removed this from the 2.0.0 milestone Jan 15, 2025

Pennycook removed the documentation label Jan 20, 2025
