Speedup PDB parsing by ~40% #133
douweschulte
left a comment
Nice work, improving the performance of the parser is always welcome. I added some comments inline. If any of these comments were already on your radar for changing, I am sorry. Let me know if/when you can use another set of eyes on the changes.
src/read/pdb/utils.rs
Outdated
        i += 1;
    }

    result
Should there be a check for trailing non-whitespace in the text?
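One way such a check could look, as a sketch only: the name parse_u64_strict is hypothetical and this is not the PR's code. It accepts ASCII whitespace around a single run of digits and rejects everything else, including trailing non-whitespace characters.

```rust
/// Hypothetical strict variant (not the PR's implementation).
/// Returns None for blank input, non-digit characters, or u64 overflow.
fn parse_u64_strict(text: &str) -> Option<u64> {
    // Drop ASCII whitespace on both sides; whatever is left must be digits only.
    let digits = text.trim_matches(|c: char| c.is_ascii_whitespace());
    if digits.is_empty() {
        return None;
    }
    digits.bytes().try_fold(0u64, |acc, b| {
        if !b.is_ascii_digit() {
            return None; // trailing or embedded non-digit, e.g. "12A"
        }
        acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
    })
}
```

With this variant a field such as "12A" is reported as invalid instead of being read as 12. Whether that strictness is worth the extra branch in the hot loop is a judgment call for the parser.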
    }

    // Parse digits
    while i < len && bytes[i].is_ascii_digit() {
Could this be rewritten as an iterator? bytes.iter().skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit()). Only if you think the code will improve from that change.
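A minimal sketch of what the iterator form might look like. The function name parse_u64_iter is made up for illustration, and the overflow handling via checked arithmetic is an assumption, not something the PR is known to do.

```rust
/// Hypothetical iterator-based variant of the digit loop (not the PR's code).
/// Skips leading ASCII whitespace, then folds the following run of ASCII
/// digits into a u64. Returns None if there are no digits or on overflow.
fn parse_u64_iter(text: &str) -> Option<u64> {
    let mut digits = text
        .bytes()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .peekable();

    // An empty digit run (blank or non-numeric field) is not a number.
    digits.peek()?;

    digits.try_fold(0u64, |acc, b| {
        acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
    })
}
```

Note that take_while simply stops at the first non-digit, so this form silently accepts input such as "12abc" as 12. It would need a separate check on the remainder if trailing characters should be an error.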
    let alternate_location = parse_char(linenumber, chars, 16, &mut errors);
    let residue_name = parse(linenumber, line, 17..20, &mut errors);
    let chain_id = String::from(parse_char(linenumber, line, 21, &mut errors));
    let residue_serial_number = parse(linenumber, line, 22..26, &mut errors);
This does indeed contain negative numbers, so I reverted the changes for now. A new fast parse could be made that trims, checks for a sign, then uses the current fast_parse_u64 to get the number, and then returns the value as isize.
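The steps described above (trim, check for a sign, parse the digits, return an isize) could be sketched roughly as follows. Both function names are hypothetical: parse_digits_u64 is only a stand-in for the PR's fast_parse_u64, whose exact signature is not shown here.

```rust
/// Stand-in for the PR's `fast_parse_u64` (assumed behaviour: digits only).
fn parse_digits_u64(bytes: &[u8]) -> Option<u64> {
    if bytes.is_empty() {
        return None;
    }
    bytes.iter().try_fold(0u64, |acc, &b| {
        if !b.is_ascii_digit() {
            return None;
        }
        acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
    })
}

/// Hypothetical signed parse: trim, read an optional sign, parse the
/// magnitude as u64, then convert to isize with a range check.
fn parse_isize(text: &str) -> Option<isize> {
    let trimmed = text.trim();
    let (negative, digits) = match trimmed.as_bytes() {
        [b'-', rest @ ..] => (true, rest),
        [b'+', rest @ ..] => (false, rest),
        all => (false, all),
    };
    let magnitude = i128::from(parse_digits_u64(digits)?);
    // Going through i128 keeps the negation safe for every u64 magnitude.
    let value = if negative { -magnitude } else { magnitude };
    isize::try_from(value).ok()
}
```

For a four-column residue serial number field such as "  -5" this returns Some(-5), while a lone "-" or an embedded sign such as "1-2" yields None.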
douweschulte
left a comment
Looks nice! Thanks for the great work.
I have changed a couple of small things, mostly ignoring the 1htq PDB file where it crashed tests. Additionally, I changed converting bytes back to a string into slicing the original string slice. This removes the need for the unsafe code without introducing UTF-8 validity checks.
I will merge for now, as it looks nice and faster parsing is a win anyway. Feel free to improve on the parsing even more by adding a fast parse for negative numbers in a separate PR. This way we both also get smaller PRs that should be easier to fully grasp.
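The slicing change mentioned in this comment can be illustrated with a small sketch. The name trim_ascii_slice is hypothetical and the body is an assumption about the general idea, not the merged code: scan the bytes for the trim bounds, then index into the original &str instead of rebuilding a string from bytes with std::str::from_utf8_unchecked.

```rust
/// Hypothetical trim that scans bytes but returns a slice of the input str.
fn trim_ascii_slice(text: &str) -> &str {
    let bytes = text.as_bytes();
    // Index of the first byte that is not ASCII whitespace (or the end).
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    // One past the last byte that is not ASCII whitespace.
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    // Both bounds sit next to ASCII bytes or the string ends, so they are
    // always char boundaries and this indexing cannot panic.
    &text[start..end]
}
```

Indexing a &str only checks that the two bounds are char boundaries, which is a constant-time test, so no unsafe block and no full UTF-8 validation pass is involved.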
Significantly speeds up PDB parsing.
Optimizations:
fast_parse_u64_from_string, and fast_trim
Benchmarks:
Benchmarks performed on a 2024 MacBook Air (M3, 24 GB RAM)
E. coli Bench
Loading entire predicted E. coli protein structure database. Database available here: https://alphafold.ebi.ac.uk/download
Speedup: 38.55% (reduction in mean run time, from 12.375 s to 7.605 s)
Before:
❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 12.375 s ± 0.961 s [User: 11.347 s, System: 0.288 s]
Range (min … max): 11.313 s … 13.776 s 10 runs
After:
❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 7.605 s ± 0.210 s [User: 7.190 s, System: 0.259 s]
Range (min … max): 7.332 s … 7.901 s 10 runs
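For reference, the 38.55% figure matches the relative reduction in mean time between the two hyperfine runs above. The helper names below are made up for this illustration; the second function shows the same result expressed as a throughput factor.

```rust
/// Percentage by which the run time dropped: (before - after) / before.
fn time_reduction_percent(before_s: f64, after_s: f64) -> f64 {
    (before_s - after_s) / before_s * 100.0
}

/// The same comparison as a ratio: how many times faster the new run is.
fn speedup_factor(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}
```

With the mean times reported above, time_reduction_percent(12.375, 7.605) is about 38.55 and speedup_factor(12.375, 7.605) is about 1.63.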
Standard Bench
| Task | Speedup (%) |
| --- | --- |
| Open PDB - small | 45.95 |
| Open PDB - medium | 36.57 |
| Open PDB - big | 26.44 |
| Open PDB - huge | 40.83 |