Speedup PDB parsing by ~40% by maxall41 · Pull Request #133 · douweschulte/pdbtbx

maxall41 · 2024-12-16T21:21:35Z

Significantly speeds up PDB parsing.

Optimizations:

Use byte slices instead of Vec<char>
Implement custom, mildly unsafe functions for string parsing: fast_parse_u64_from_string and fast_trim
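
To illustrate the first point, here is a minimal sketch (not the code from this PR) of reading a fixed-column field straight from the bytes of a line, without first collecting the line into a Vec<char>. The helper names field_bytes and trim_ascii_bytes and the sample ATOM record are hypothetical, and the trim shown here is a safe stand-in rather than the PR's fast_trim.

```rust
use std::ops::Range;

/// Hypothetical helper: borrow a fixed-column field from a line as bytes.
/// No allocation and no per-character decoding is needed.
fn field_bytes(line: &str, cols: Range<usize>) -> Option<&[u8]> {
    line.as_bytes().get(cols)
}

/// Hypothetical safe stand-in for a byte-level trim:
/// strips ASCII whitespace from both ends of the slice.
fn trim_ascii_bytes(mut bytes: &[u8]) -> &[u8] {
    while let [first, rest @ ..] = bytes {
        if first.is_ascii_whitespace() { bytes = rest; } else { break; }
    }
    while let [rest @ .., last] = bytes {
        if last.is_ascii_whitespace() { bytes = rest; } else { break; }
    }
    bytes
}

fn main() {
    // Build a sample record with explicit column widths
    // (record name, serial, atom name, alt loc, residue name, chain, residue number).
    let line = format!(
        "{:<6}{:>5} {:<4}{}{:>3} {}{:>4}",
        "ATOM", 1, " N", ' ', "MET", 'A', 1
    );
    let residue_name = field_bytes(&line, 17..20).map(trim_ascii_bytes);
    assert_eq!(residue_name, Some(&b"MET"[..]));
}
```

Because PDB records are fixed-width ASCII columns, byte offsets and column numbers coincide, which is what makes the byte-slice approach attractive.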

Benchmarks:

Benchmarks performed on a 2024 MacBook Air (M3, 24 GB RAM).

E. coli Bench

Loads the entire predicted E. coli protein structure database. The database is available here: https://alphafold.ebi.ac.uk/download

Speedup: 38.55%

Before:

❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 12.375 s ± 0.961 s [User: 11.347 s, System: 0.288 s]
Range (min … max): 11.313 s … 13.776 s 10 runs

After:

❯ hyperfine --warmup 3 'cargo bench --bench load_e_coli'
Benchmark 1: cargo bench --bench load_e_coli
Time (mean ± σ): 7.605 s ± 0.210 s [User: 7.190 s, System: 0.259 s]
Range (min … max): 7.332 s … 7.901 s 10 runs
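
The 38.55% figure is consistent with the two hyperfine means above when read as a relative reduction in mean run time. A small sketch of that arithmetic (the function name is illustrative):

```rust
/// Relative reduction in run time, as a percentage of the "before" time.
fn time_reduction_percent(before_s: f64, after_s: f64) -> f64 {
    (before_s - after_s) / before_s * 100.0
}

fn main() {
    // Mean times taken from the hyperfine output above.
    let pct = time_reduction_percent(12.375, 7.605);
    println!("{pct:.2}%"); // prints 38.55%
}
```

Expressed as a ratio of the means instead, 12.375 s / 7.605 s is roughly 1.63x.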

Standard Bench

Task | Speedup (%)
Open PDB - small | 45.95%
Open PDB - medium | 36.57%
Open PDB - big | 26.44%
Open PDB - huge | 40.83%

douweschulte

Nice work, improving the performance of the parser is always nice. I added some comments inline. If any of these comments were already on your radar for changing, I am sorry. Let me know if/when you can use another set of eyes on the changes.


douweschulte · 2024-12-17T13:45:03Z

src/read/pdb/utils.rs

+        i += 1;
+    }
+
+    result


Should there be a check for trailing non-whitespace in the text?
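
One way such a check could look, as a sketch only: the body of fast_parse_u64_from_string is not shown in this thread, so parse_u64_strict below is a hypothetical safe variant that follows the same loop shape and additionally rejects anything other than whitespace after the digits.

```rust
/// Hypothetical strict variant: parses an unsigned integer surrounded by
/// optional ASCII whitespace and rejects any other trailing characters.
fn parse_u64_strict(text: &str) -> Option<u64> {
    let bytes = text.as_bytes();
    let len = bytes.len();
    let mut i = 0;

    // Skip leading whitespace
    while i < len && bytes[i].is_ascii_whitespace() {
        i += 1;
    }

    // Parse digits, failing on overflow instead of wrapping
    let start = i;
    let mut result: u64 = 0;
    while i < len && bytes[i].is_ascii_digit() {
        result = result
            .checked_mul(10)?
            .checked_add(u64::from(bytes[i] - b'0'))?;
        i += 1;
    }
    if i == start {
        return None; // no digits found
    }

    // The check in question: only whitespace may follow the digits
    if bytes[i..].iter().all(u8::is_ascii_whitespace) {
        Some(result)
    } else {
        None
    }
}

fn main() {
    assert_eq!(parse_u64_strict("  42 "), Some(42));
    assert_eq!(parse_u64_strict("42a"), None);
}
```

The extra pass over the tail costs little for short fixed-width fields, but whether it is worth it in a hot path is a judgment call for the benchmark to settle.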

douweschulte · 2024-12-17T13:47:05Z

src/read/pdb/utils.rs

+    }
+
+    // Parse digits
+    while i < len && bytes[i].is_ascii_digit() {


Could this be rewritten as an iterator? bytes.skip_while(|b| b.is_ascii_whitespace()).take_while(|b| b.is_ascii_digit()). Only if you think the code will improve from that change.
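
A sketch of what the iterator form might look like. Note that a slice has no skip_while of its own, so this goes through an iterator first (here str::bytes). The function name parse_u64_iter is hypothetical, and this version makes no attempt at overflow or trailing-character checks.

```rust
/// Hypothetical iterator-based digit parser: skips leading ASCII whitespace,
/// then folds the following run of ASCII digits into a u64.
/// Returns 0 when there are no digits; wraps silently on overflow.
fn parse_u64_iter(text: &str) -> u64 {
    text.bytes()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .fold(0u64, |acc, b| {
            acc.wrapping_mul(10).wrapping_add(u64::from(b - b'0'))
        })
}

fn main() {
    assert_eq!(parse_u64_iter("  123 "), 123);
    assert_eq!(parse_u64_iter("7x"), 7);
}
```

Whether this is as fast as the index-based loop would need to be measured; the iterator form mainly trades explicit index bookkeeping for readability.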


douweschulte · 2025-08-05T09:27:33Z

src/read/pdb/lexer.rs

+    let alternate_location = parse_char(linenumber, chars, 16, &mut errors);
    let residue_name = parse(linenumber, line, 17..20, &mut errors);
-    let chain_id = String::from(parse_char(linenumber, line, 21, &mut errors));
-    let residue_serial_number = parse(linenumber, line, 22..26, &mut errors);


This field does indeed contain negative numbers, so I reverted the changes for now. A new fast parse could be made that trims, checks for a sign, uses the current fast_parse_u64 to get the number, and then returns the value as isize.
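
The steps described above (trim, check for a sign, parse the magnitude, convert to isize) could be sketched as follows. This is an assumption-laden illustration: parse_u64_digits is a safe stand-in for the PR's unsigned fast parser, whose real body is not shown here, and parse_isize_signed is a hypothetical name.

```rust
/// Safe stand-in for the unsigned fast parser: accepts a non-empty run of
/// ASCII digits and nothing else.
fn parse_u64_digits(bytes: &[u8]) -> Option<u64> {
    if bytes.is_empty() {
        return None;
    }
    bytes.iter().try_fold(0u64, |acc, &b| {
        if b.is_ascii_digit() {
            acc.checked_mul(10)?.checked_add(u64::from(b - b'0'))
        } else {
            None
        }
    })
}

/// Hypothetical signed parser: trim, check for a sign, parse the magnitude
/// as unsigned, then return the value as isize.
fn parse_isize_signed(text: &str) -> Option<isize> {
    let trimmed = text.trim();
    let (negative, digits) = match trimmed.as_bytes() {
        [b'-', rest @ ..] => (true, rest),
        [b'+', rest @ ..] => (false, rest),
        rest => (false, rest),
    };
    let magnitude = parse_u64_digits(digits)?;
    // Magnitudes above isize::MAX are rejected, so isize::MIN itself is not
    // representable here; that is irrelevant for 4-column PDB fields.
    let value = isize::try_from(magnitude).ok()?;
    Some(if negative { -value } else { value })
}

fn main() {
    assert_eq!(parse_isize_signed("  -5"), Some(-5));
    assert_eq!(parse_isize_signed(" 12 "), Some(12));
}
```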

douweschulte

Looks nice! Thanks for the great work.

I have changed a couple of small things, mostly ignoring the 1htq PDB file where it crashed tests. Additionally, instead of converting bytes back to a string, the code now slices into the original string slice. This removes the need for unsafe without introducing UTF-8 validity checks.
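
A minimal sketch of the slicing idea, not the merged code: given the original line as a &str and a byte range, str::get returns the sub-slice directly. It only checks that the range is in bounds and that both ends fall on character boundaries, so no full UTF-8 validation pass is needed and no unsafe conversion (presumably something like std::str::from_utf8_unchecked, which is an assumption here) is required. The helper name field is hypothetical.

```rust
use std::ops::Range;

/// Hypothetical helper: borrow a field from the original line by byte range.
/// Returns None if the range is out of bounds or would split a multi-byte
/// character.
fn field(line: &str, cols: Range<usize>) -> Option<&str> {
    line.get(cols)
}

fn main() {
    assert_eq!(field("ABCDEF", 1..3), Some("BC"));
    // 'é' occupies two bytes, so a range ending inside it is rejected
    // instead of producing invalid UTF-8.
    assert_eq!(field("aé", 1..2), None);
    assert_eq!(field("abc", 2..9), None);
}
```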

I will merge for now, as it looks nice and faster parsing is a win anyway. Feel free to improve on the parsing even more by adding a fast parse for negative numbers in a separate PR. This way we both also get smaller PRs that should be easier to fully grasp.

maxall41 and others added 7 commits September 30, 2024 16:44

Update parser.rs

6724d7b

Update parser.rs

47c514b

Update parser.rs

ffa0a37

Update parser.rs

fc181ba

Benchmark

a2b9749

Benchmark

4d1ed15

Faster!!!

c2826c3

douweschulte reviewed Dec 17, 2024


douweschulte linked an issue Dec 17, 2024 that may be closed by this pull request

Prefer Vec<u8>/[u8] over Vec<char>/[char] #15

Closed

maxall41 and others added 6 commits July 9, 2025 18:06

Merge branch 'douweschulte:master' into master

f316bbb

Error handling

0585c62

Remove core::str import

f63b433

Remove .python_version

110aa09

Implement miri testing

3635c79

Implement E coli bench & Moved ATOM & HETATM to optimize loading

b590f92

maxall41 marked this pull request as ready for review July 11, 2025 00:40

Fix some lint issues

c845e0a

maxall41 changed the title ~~Faster PDB parsing~~ Speedup PDB parsing by ~40 Jul 13, 2025

maxall41 changed the title ~~Speedup PDB parsing by ~40~~ Speedup PDB parsing by ~40% Jul 13, 2025

Small fixes and ignored 1htq in some places because it fail with the current parser

e57be90

douweschulte approved these changes Aug 5, 2025


douweschulte merged commit 9aad4b9 into douweschulte:master Aug 5, 2025

douweschulte added this to the v0.13 milestone Aug 5, 2025

douweschulte merged 15 commits into douweschulte:master from maxall41:master
