Commit 384fc1f

Initial implementation of DNAString and DNAStringSet (#1)
1 parent c7091e1 commit 384fc1f

File tree

19 files changed

+1084
-209
lines changed

.github/workflows/run-tests.yml

Lines changed: 2 additions & 2 deletions
@@ -28,11 +28,11 @@ jobs:
   test:
     strategy:
       matrix:
-        python: ["3.9", "3.10", "3.11", "3.12", "3.13"]
+        python: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
         platform:
           - ubuntu-latest
           - macos-latest
-          - windows-latest
+          # - windows-latest
     runs-on: ${{ matrix.platform }}
     name: Python ${{ matrix.python }}, ${{ matrix.platform }}
     steps:

CHANGELOG.md

Lines changed: 2 additions & 4 deletions
@@ -1,7 +1,5 @@
 # Changelog

-## Version 0.1 (development)
+## Version 0.0.1

-- Feature A added
-- FIX: nasty bug #1729 fixed
-- add your changes here!
+- Initial implementation, added the DNAString and DNAStringSet classes.

README.md

Lines changed: 57 additions & 3 deletions
@@ -1,11 +1,15 @@
 [![PyPI-Server](https://img.shields.io/pypi/v/biostrings.svg)](https://pypi.org/project/biostrings/)
-![Unit tests](https://github.com/YOUR_ORG_OR_USERNAME/biostrings/actions/workflows/run-tests.yml/badge.svg)
+![Unit tests](https://github.com/biocpy/biostrings/actions/workflows/run-tests.yml/badge.svg)

 # biostrings

-> representations for dna strings
+Efficient manipulation of genomic sequences in Python, inspired by the design of Bioconductor's [Biostrings](https://bioconductor.org/packages/Biostrings) package.

-A longer description of your project goes here...
+The core design relies on a **"pool and ranges"** memory model:
+
+- **DNAStringSet** stores all sequences in a single contiguous block of memory (the pool).
+- Individual sequences are defined by `start` and `width` coordinates (the ranges).
+- Slicing a `DNAStringSet` returns a **view** (a new set of ranges pointing to the same pool), so subsetting copies no sequence data and stays fast regardless of the data size.

 ## Install

@@ -15,6 +19,56 @@ To get started, install the package from [PyPI](https://pypi.org/project/biostri
 pip install biostrings
 ```

+## Quick Start
+
+### Working with Single Sequences
+
+The `DNAString` class represents a single DNA sequence. It enforces the IUPAC DNA alphabet and supports efficient byte-level operations.
+
+```py
+from biostrings import DNAString
+
+# Create a DNA string
+dna = DNAString("TTGAAAA-CTC-N")
+print(dna)
+# Output: TTGAAAA-CTC-N
+
+# Basic operations
+print(len(dna)) # 13
+print(dna[0:3]) # DNAString(length=3, sequence='TTG')
+
+# Reverse Complement
+# Handles IUPAC ambiguity codes correctly (e.g., N -> N, M -> K)
+rc = dna.reverse_complement()
+print(rc)
+# Output: N-GAG-TTTTCAA
+```
+
+### Working with Sets of Sequences
+
+The `DNAStringSet` is the primary container for handling collections of sequences (e.g., reads from a FASTA file).
+
+```py
+from biostrings import DNAStringSet
+
+# Efficiently create a set from a list of strings
+seqs = [
+    "ACGT",
+    "GATTACA",
+    "TTGAAAA-CTC-N",
+    "ACGTACGT"
+]
+dss = DNAStringSet(seqs, names=["s1", "s2", "s3", "s4"])
+
+print(dss)
+# Output:
+# <DNAStringSet of length 4>
+# [ 1] 4 ACGT s1
+# [ 2] 7 GATTACA s2
+# [ 3] 13 TTGAAAA-CTC-N s3
+# [ 4] 8 ACGTACGT s4
+```
+
 <!-- biocsetup-notes -->

 ## Note
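
The README's reverse-complement example can be checked by hand with a small pure-Python sketch. This is not the package's implementation, which this commit does not show in full. The `reverse_complement` helper and the translation table below are illustrative assumptions, built from the standard IUPAC complement pairs over the same 16-character alphabet that `stringsetpool.cpp` validates against.

```python
# Hypothetical stand-alone helper, not part of the biostrings API.
# Alphabet order matches the validation string used in stringsetpool.cpp.
_IUPAC = "ACGTRYSWKMBDHVN-"
# Standard IUPAC complements: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H;
# S, W, N and the gap character map to themselves.
_COMPLEMENT = "TGCAYRSWMKVHDBN-"
_TABLE = str.maketrans(_IUPAC, _COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.upper().translate(_TABLE)[::-1]


print(reverse_complement("TTGAAAA-CTC-N"))  # N-GAG-TTTTCAA
print(reverse_complement("M"))              # K
```

A translation table keeps the operation to a single pass over the sequence, which is presumably the spirit of the "byte-level operations" the README mentions.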

docs/conf.py

Lines changed: 1 addition & 0 deletions
@@ -299,6 +299,7 @@
     "scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
     "setuptools": ("https://setuptools.pypa.io/en/stable/", None),
     "pyscaffold": ("https://pyscaffold.org/en/stable", None),
+    "iranges": ("https://biocpy.github.io/IRanges", None),
 }

 print(f"loading configurations for {project} {version} ...", file=sys.stderr)

docs/index.md

Lines changed: 9 additions & 13 deletions
@@ -1,18 +1,14 @@
 # biostrings

-representations for dna strings
+Efficient manipulation of genomic sequences in Python, inspired by the design of Bioconductor's [Biostrings](https://bioconductor.org/packages/Biostrings) package.

+## Install

-## Note
-
-> This is the main page of your project's [Sphinx] documentation. It is
-> formatted in [Markdown]. Add additional pages by creating md-files in
-> `docs` or rst-files (formatted in [reStructuredText]) and adding links to
-> them in the `Contents` section below.
->
-> Please check [Sphinx] and [MyST] for more information
-> about how to document your project and how to configure your preferences.
+To get started, install the package from [PyPI](https://pypi.org/project/biostrings/)

+```bash
+pip install biostrings
+```

 ## Contents

@@ -29,9 +25,9 @@ Module Reference <api/modules>

 ## Indices and tables

-* {ref}`genindex`
-* {ref}`modindex`
-* {ref}`search`
+- {ref}`genindex`
+- {ref}`modindex`
+- {ref}`search`

 [Sphinx]: http://www.sphinx-doc.org/
 [Markdown]: https://daringfireball.net/projects/markdown/

lib/CMakeLists.txt

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+cmake_minimum_required(VERSION 3.24)
+
+project(biostring
+    VERSION 1.0.0
+    DESCRIPTION "Building the biostrings shared library"
+    LANGUAGES CXX)
+
+find_package(pybind11 CONFIG)
+
+# pybind11 method:
+pybind11_add_module(biostring
+    src/stringsetpool.cpp
+    src/init.cpp
+)
+
+set_property(TARGET biostring PROPERTY CXX_STANDARD 17)
+
+target_link_libraries(biostring PRIVATE pybind11::pybind11)
+
+set_target_properties(biostring PROPERTIES
+    OUTPUT_NAME lib_biostrings
+    PREFIX ""
+)

lib/src/init.cpp

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+#include "pybind11/pybind11.h"
+
+namespace py = pybind11;
+
+void init_stringsetpool(pybind11::module &);
+
+// The module name must match OUTPUT_NAME (lib_biostrings) in lib/CMakeLists.txt,
+// otherwise Python cannot find the PyInit_ entry point on import.
+PYBIND11_MODULE(lib_biostrings, m) {
+    m.doc() = "cpp implementations";
+
+    init_stringsetpool(m);
+}

lib/src/stringsetpool.cpp

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+#include <pybind11/numpy.h>
+#include <string>
+#include <vector>
+#include <sstream>
+#include <cctype>
+#include <stdexcept>
+
+namespace py = pybind11;
+
+// Equivalent to R's new_XStringSet_from_CHARACTER C-function
+// Returns: (pool (bytes), starts (numpy array), widths (numpy array))
+py::tuple create_dnastringset_pool(py::list py_seqs) {
+    size_t n = py_seqs.size();
+
+    py::array_t<int32_t> np_starts(n);
+    py::array_t<int32_t> np_widths(n);
+
+    int32_t* starts_ptr = np_starts.mutable_data();
+    int32_t* widths_ptr = np_widths.mutable_data();
+
+    std::stringstream pool_stream;
+    int32_t current_start = 0;
+    const std::string valid_chars = "ACGTRYSWKMBDHVN-";
+
+    for (size_t i = 0; i < n; ++i) {
+        std::string s = py_seqs[i].cast<std::string>();
+        int32_t current_width = static_cast<int32_t>(s.length());
+        starts_ptr[i] = current_start;
+        widths_ptr[i] = current_width;
+
+        for (char &c : s) {
+            // Cast through unsigned char: std::toupper is undefined for negative values.
+            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
+            if (valid_chars.find(c) == std::string::npos) {
+                throw std::invalid_argument(
+                    "Sequence " + std::to_string(i) + " contains invalid DNA character: " + c
+                );
+            }
+        }
+
+        pool_stream.write(s.c_str(), current_width);
+        current_start += current_width;
+    }
+
+    py::bytes pool = py::bytes(pool_stream.str());
+    return py::make_tuple(pool, np_starts, np_widths);
+}
+
+void init_stringsetpool(pybind11::module &m) {
+    m.doc() = "C++ extensions for biostrings";
+    m.def(
+        "create_dnastringset_pool",
+        &create_dnastringset_pool,
+        "Efficiently create the pool and ranges for a DNAStringSet from a list of strings."
+    );
+}
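
For readers who want to follow the "pool and ranges" layout without building the extension, here is a rough pure-Python sketch of what `create_dnastringset_pool` computes. It is an illustration under stated assumptions, not the package's code: `build_pool` and `extract` are hypothetical names, and plain lists stand in for the NumPy `int32` arrays that the C++ function returns.

```python
# Hypothetical pure-Python analogue of create_dnastringset_pool.
VALID_CHARS = frozenset("ACGTRYSWKMBDHVN-")


def build_pool(seqs):
    """Concatenate sequences into one bytes pool and record (start, width) per sequence."""
    starts, widths, chunks = [], [], []
    offset = 0
    for i, seq in enumerate(seqs):
        seq = seq.upper()
        invalid = set(seq) - VALID_CHARS
        if invalid:
            raise ValueError(
                f"Sequence {i} contains invalid DNA character(s): {sorted(invalid)}"
            )
        starts.append(offset)
        widths.append(len(seq))
        chunks.append(seq.encode("ascii"))
        offset += len(seq)
    return b"".join(chunks), starts, widths


def extract(pool, starts, widths, i):
    """Read sequence i back out of the pool using its range."""
    return pool[starts[i]:starts[i] + widths[i]].decode("ascii")


pool, starts, widths = build_pool(["acgt", "GATTACA", "TTGAAAA-CTC-N"])
print(pool)    # b'ACGTGATTACATTGAAAA-CTC-N'
print(starts)  # [0, 4, 11]
print(widths)  # [4, 7, 13]

# A "view" of sequences 1..2 only needs new ranges; the pool is shared untouched.
view_starts, view_widths = starts[1:3], widths[1:3]
print(extract(pool, view_starts, view_widths, 0))  # GATTACA
```

Subsetting touches only the two range lists, which is why a slice of the set can share the original pool instead of copying sequence bytes.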

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 [build-system]
 # AVOID CHANGING REQUIRES: IT WILL BE UPDATED BY PYSCAFFOLD!
-requires = ["setuptools>=46.1.0", "setuptools_scm[toml]>=5"]
+requires = ["setuptools>=46.1.0", "setuptools_scm[toml]>=5", "cmake", "pybind11", "numpy"]
 build-backend = "setuptools.build_meta"

 [tool.setuptools_scm]
@@ -11,7 +11,7 @@ version_scheme = "no-guess-dev"
 [tool.ruff]
 line-length = 120
 src = ["src"]
-exclude = ["tests"]
+# exclude = ["tests"]
 lint.extend-ignore = ["F821"]

 [tool.ruff.lint.pydocstyle]

setup.cfg

Lines changed: 6 additions & 4 deletions
@@ -5,7 +5,7 @@

 [metadata]
 name = biostrings
-description = representations for dna strings
+description = Efficient manipulation of genomic sequences
 author = Jayaram Kancherla
 author_email = [email protected]
 license = MIT
@@ -15,8 +15,8 @@ long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
 url = https://github.com/pyscaffold/pyscaffold/
 # Add here related links, for example:
 project_urls =
-    Documentation = https://pyscaffold.org/
-# Source = https://github.com/pyscaffold/pyscaffold/
+    Documentation = https://github.com/BiocPy/biostrings
+    Source = https://github.com/BiocPy/biostrings
 # Changelog = https://pyscaffold.org/en/latest/changelog.html
 # Tracker = https://github.com/pyscaffold/pyscaffold/issues
 # Conda-Forge = https://anaconda.org/conda-forge/pyscaffold
@@ -41,14 +41,16 @@ package_dir =
     =src

 # Require a min/specific Python version (comma-separated conditions)
-# python_requires = >=3.8
+python_requires = >=3.9

 # Add here dependencies of your project (line-separated), e.g. requests>=2.2,<3.0.
 # Version specifiers like >=2.2,<3.0 avoid problems due to API changes in
 # new major versions. This works if the required packages follow Semantic Versioning.
 # For more information, check out https://semver.org/.
 install_requires =
     importlib-metadata; python_version<"3.8"
+    iranges
+    numpy


 [options.packages.find]
