Skip to content

Speedup PDB parsing by ~40%#133

Merged
douweschulte merged 15 commits intodouweschulte:masterfrom
maxall41:master
Aug 5, 2025
Merged

Speedup PDB parsing by ~40%#133
douweschulte merged 15 commits intodouweschulte:masterfrom
maxall41:master

Conversation

@maxall41
Copy link
Contributor

@maxall41 maxall41 commented Dec 16, 2024

Significantly speeds up PDB parsing.

Optimizations:

  • Use byte slices instead of Vec
  • Implemented custom mildly unsafe functions for string parsing: fast_parse_u64_from_string, and fast_trim

Benchmarks:

Benchmarks performed on a 2024 Macbook Air, M3, 24GB Ram

E Coli Bench

Loading entire predicted E. coli protein structure database. Database available here: https://alphafold.ebi.ac.uk/download

Speedup: 38.55%

Before:

❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 12.375 s ± 0.961 s [User: 11.347 s, System: 0.288 s]
Range (min … max): 11.313 s … 13.776 s 10 runs

After:

❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 7.605 s ± 0.210 s [User: 7.190 s, System: 0.259 s]
Range (min … max): 7.332 s … 7.901 s 10 runs

Standard Bench

Task | Speedup (%)
Open PDB - small 45.95%
Open PDB - medium 36.57%
Open PDB - big 26.44%
Open PDB - huge 40.83%

Copy link
Owner

@douweschulte douweschulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, improving the performance of the parser is always nice. I added some comments inline. If any of these comments where already on your radar for changing I am sorry. Let me know if/when you can use another set of eyes on the changes.

i += 1;
}

result
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a check for trailing non-whitespace in the text?

}

// Parse digits
while i < len && bytes[i].is_ascii_digit() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be rewritten as an iterator? bytes.skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit()). Only if you think the code will improve from that change.

@douweschulte douweschulte linked an issue Dec 17, 2024 that may be closed by this pull request
@maxall41 maxall41 marked this pull request as ready for review July 11, 2025 00:40
@maxall41 maxall41 changed the title Faster PDB parsing Speedup PDB parsing by ~40 Jul 13, 2025
@maxall41 maxall41 changed the title Speedup PDB parsing by ~40 Speedup PDB parsing by ~40% Jul 13, 2025
let alternate_location = parse_char(linenumber, chars, 16, &mut errors);
let residue_name = parse(linenumber, line, 17..20, &mut errors);
let chain_id = String::from(parse_char(linenumber, line, 21, &mut errors));
let residue_serial_number = parse(linenumber, line, 22..26, &mut errors);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed contain negative numbers. So I reverted the changes for now. A new fast parse could be made that trims, checks for a sign, then uses the current fast_parse_u64 to get the number and then return the value as isize.

Copy link
Owner

@douweschulte douweschulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Thanks for the great work.

I have changed a couple of small things. Mostly ignoring the 1htq pdb file where it crashed tests. Additionally I changed converting bytes back to a string into slicing into the original string slice. This removes the need for the unsafe without introducing UTF8 validity checks.

I will merge for now. As it looks nice and faster parsing is a win anyways. Feel free to improve on the parsing even more by adding a fast parse for negative numbers in a separate PR. This way we both also get smaller PRs that should be easier to fully grasp.

@douweschulte douweschulte merged commit 9aad4b9 into douweschulte:master Aug 5, 2025
@douweschulte douweschulte added this to the v0.13 milestone Aug 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prefer Vec<u8>/[u8] over Vec<char>/[char]

2 participants