Aggregate a plain non-synthetic dataset for Bio sequences #91

ashvardanian · 2024-02-13T21:17:27Z

For fair benchmarks of Needleman-Wunsch scoring algorithms, we should find a real-world protein bank and, ideally, export it into a whitespace- or newline-delimited .txt file that is easy to parse not only in Python but also in C++. Community contributions are more than welcome 🤗
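To make the target format concrete, here is a minimal sketch of the export step. It assumes the protein bank ships as FASTA (header lines starting with `>`, sequences wrapped across several lines), which this issue does not specify. The function names `fasta_to_lines` and `export_fasta` are hypothetical helpers, not part of this library.

```python
from typing import Iterable, Iterator


def fasta_to_lines(fasta_lines: Iterable[str]) -> Iterator[str]:
    """Yield one sequence per FASTA record: headers dropped, wrapped lines joined."""
    chunks = []
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A header starts a new record, so flush the previous one.
            if chunks:
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)  # flush the last record


def export_fasta(src_path: str, dst_path: str) -> int:
    """Write one sequence per line to dst_path and return the number of sequences."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for sequence in fasta_to_lines(src):
            dst.write(sequence + "\n")
            count += 1
    return count
```

With one sequence per line and no headers, the output can be read with `std::getline` in C++ or by iterating over the file object in Python, with no format-specific parser on either side.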



# [3.1.0](v3.0.0...v3.1.0) (2024-02-15)

### Add

* `sz_isascii` and UTF8 Levenshtein distance ([a0962fb](a0962fb))
* 32-bit support with CPython ([253a3c1](253a3c1))
* Big-endian support ([b126fab](b126fab))
* Levenshtein & NW score for Rust (#89) ([663a633](663a633)), closes [#89](#89)
* Macro SZ_NULL_CHAR, Clang-CL instrinsics. (#88) ([dee90bb](dee90bb)), closes [#88](#88)
* serial clz/ctz for Win32 ([c968337](c968337))

### Docs

* sectioning contribution guide ([cf6ced0](cf6ced0)), closes [#91](#91)

### Fix

* Clamping bounded Levenshtein ([69892fb](69892fb))
* Memory leak in macro ([c88a72a](c88a72a))

### Improve

* Port to `arm32v7` 32-bit arch ([4acf3b7](4acf3b7))

### Make

* `cibuildwheel.overrides` over custom scripts ([6d8c586](6d8c586))
* Clear root directory ([7497c96](7497c96))
* Constrain workflow names ([079f111](079f111))
* Disable a;; CI versioning ([a55d227](a55d227))
* Drop NumPy dependency ([c56239e](c56239e))
* Fix implicit `malloc` declaration ([f7761be](f7761be))
* Infer big-endian in CMake/setup.py ([72453c6](72453c6))
* Keywords for crates.io ([8d237a6](8d237a6))
* Overwrite packs with same name ([0642318](0642318))
* Packing CIBuildWheels for all archs ([49bee70](49bee70))
* Parallel wheels compilation ([0f5a946](0f5a946))
* Upgrade GitHub CI ([cd424ca](cd424ca))
* Upgrade Python CI ([4f1bf43](4f1bf43))
* Use QEMU for Linux wheels ([ac4556a](ac4556a))


ashvardanian added the "good first issue" label Feb 13, 2024

ashvardanian added a commit that referenced this issue Feb 13, 2024

Docs: sectioning contribution guide

cf6ced0

Requesting more dataset contributions #91

ashvardanian mentioned this issue Feb 27, 2024

Better Heuristics for Substring Search #72

Open

ashvardanian changed the title ~~Aggregate a plain non-synthetic dataset for protein sequences~~ Aggregate a plain non-synthetic dataset for Bio sequences Apr 27, 2024

ashvardanian pushed a commit that referenced this issue Apr 28, 2024

Docs: Human omics dataset (#150)

b4d6eba

Closes #91
