Merge branch 'master' into development
Stefano Cossu committed Mar 28, 2019
2 parents f84c99a + b4dfa0e commit 8d9b863
Showing 60 changed files with 4,211 additions and 1,919 deletions.
17 changes: 16 additions & 1 deletion .gitignore
@@ -107,9 +107,24 @@ venv.bak/
.pytest_cache/

# Default Lakesuperior data directories
/data
lakesuperior/data/ldprs_store
lakesuperior/data/ldpnr_store

# Cython business.
/cython_debug
/lakesuperior/store/*.c
/lakesuperior/store/*.html
/lakesuperior/store/ldp_rs/*.c
/lakesuperior/store/ldp_rs/*.html
/lakesuperior/model/*.c
/lakesuperior/model/*/*.html
/lakesuperior/model/*/*.c
/lakesuperior/model/*.html
/lakesuperior/util/*.c
/lakesuperior/util/*.html
!ext/lib

# Vim CTags file.
tags

!.keep
5 changes: 3 additions & 2 deletions .gitmodules
@@ -1,11 +1,12 @@
[submodule "ext/lmdb"]
path = ext/lmdb
url = https://github.com/LMDB/lmdb.git
branch = stable
[submodule "ext/tpl"]
path = ext/tpl
url = https://github.com/troydhanson/tpl.git
branch = stable
[submodule "ext/spookyhash"]
path = ext/spookyhash
url = https://github.com/centaurean/spookyhash.git
[submodule "ext/collections-c"]
path = ext/collections-c
url = https://github.com/srdja/Collections-C.git
4 changes: 3 additions & 1 deletion .travis.yml
@@ -3,12 +3,14 @@ language: python
matrix:
include:
- python: 3.6
dist: xenial
sudo: true
- python: 3.7
dist: xenial
sudo: true

install:
- pip install Cython==0.29
- pip install Cython==0.29.6 cymem
- pip install -e .
script:
- python setup.py test
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -5,10 +5,13 @@ include ext/lmdb/libraries/liblmdb/mdb.c
include ext/lmdb/libraries/liblmdb/lmdb.h
include ext/lmdb/libraries/liblmdb/midl.c
include ext/lmdb/libraries/liblmdb/midl.h
include ext/collections-c/src/*.c
include ext/collections-c/src/include/*.h
include ext/tpl/src/tpl.c
include ext/tpl/src/tpl.h
include ext/spookyhash/src/*.c
include ext/spookyhash/src/*.h

graft lakesuperior/data/bootstrap
graft lakesuperior/endpoints/templates
graft lakesuperior/etc.defaults
45 changes: 23 additions & 22 deletions README.rst
@@ -3,43 +3,44 @@ Lakesuperior

|build status| |docs| |pypi| |codecov|

Lakesuperior is an alternative `Fedora
Repository <http://fedorarepository.org>`__ implementation.
Lakesuperior is a Linked Data repository software. It is capable of storing and
managing large volumes of files and their metadata regardless of their
format, size, ethnicity, gender identity or expression.

Fedora is a mature repository software system historically adopted by
major cultural heritage institutions. It exposes an
`LDP <https://www.w3.org/TR/ldp-primer/>`__ endpoint to manage
any type of binary files and their metadata in Linked Data format.
Lakesuperior is an alternative `Fedora Repository
<http://fedorarepository.org>`__ implementation. Fedora is a mature repository
software system historically adopted by major cultural heritage institutions
which extends the `Linked Data Platform <https://www.w3.org/TR/ldp-primer/>`__
protocol.

Guiding Principles
------------------

Lakesuperior aims at being an uncomplicated, efficient Fedora 4
implementation.
Lakesuperior aims at being a reliable and efficient Fedora 4 implementation.

Its main goals are:

- **Reliability:** Based on solid technologies with stability in mind.
- **Efficiency:** Small memory and CPU footprint, high scalability.
- **Ease of management:** Tools to perform monitoring and maintenance
included.
- **Ease of management:** Tools to perform migration, monitoring and
maintenance included.
- **Simplicity of design:** Straight-forward architecture, robustness
over features.

Key features
------------

- Drop-in replacement for Fedora4
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container, or having one resource link to many URIs
- Extensible provenance metadata tracking
- Multi-modal access: HTTP (REST), command line interface and native Python
API.
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
- Stores binary files and RDF metadata in one repository.
- Multi-modal access: REST/LDP, command line and native Python API.
- (`almost <fcrepo4_deltas>`_) Drop-in replacement for Fedora4
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container, or having one resource link to many URIs
- Extensible provenance metadata tracking
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.

Installation & Documentation
----------------------------
@@ -50,7 +51,7 @@ With Docker::
cd lakesuperior
docker-compose up

With pip (assuming you are familiar with it)::
With pip (requires a C compiler to be installed)::

pip install lakesuperior

2 changes: 1 addition & 1 deletion docs/api.rst
@@ -10,7 +10,7 @@ The Lakesuperior API modules of most interest for a client are:
- :mod:`lakesuperior.api.query`
- :mod:`lakesuperior.api.admin`

:mod:`lakesuperior.model.ldpr` is used to manipulate resources.
:mod:`lakesuperior.model.ldp.ldpr` is used to manipulate resources.

The full API docs are listed below.

8 changes: 4 additions & 4 deletions docs/apidoc/lakesuperior.model.rst
@@ -7,31 +7,31 @@ Submodules
lakesuperior\.model\.ldp\_factory module
----------------------------------------

.. automodule:: lakesuperior.model.ldp_factory
.. automodule:: lakesuperior.model.ldp.ldp_factory
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldp\_nr module
-----------------------------------

.. automodule:: lakesuperior.model.ldp_nr
.. automodule:: lakesuperior.model.ldp.ldp_nr
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldp\_rs module
-----------------------------------

.. automodule:: lakesuperior.model.ldp_rs
.. automodule:: lakesuperior.model.ldp.ldp_rs
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldpr module
--------------------------------

.. automodule:: lakesuperior.model.ldpr
.. automodule:: lakesuperior.model.ldp.ldpr
:members:
:undoc-members:
:show-inheritance:
99 changes: 99 additions & 0 deletions docs/structures.rst
@@ -0,0 +1,99 @@
Data Structure Internals
========================

**(Draft)**

Lakesuperior has its own methods for handling in-memory graphs. These methods
rely on C data structures and are therefore much faster than Python/RDFLib
objects.

The graph data model modules are in :py:mod:`lakesuperior.model.graph`.

The Graph Data Model
--------------------

Triples are stored in a C hash set. Each triple is represented by a pointer to
a ``BufferTriple`` structure stored in a temporary memory pool. This pool is
tied to the life cycle of the ``SimpleGraph`` object it belongs to.

A triple structure contains three pointers to ``Buffer`` structures, each of
which contains a serialized version of an RDF term. These structures are stored
in the ``SimpleGraph`` memory pool as well.

Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples``
property. These are C hash sets holding addresses of unique terms and
triples inserted in the graph. If the same term is entered more than once,
in any position in any triple, the first one entered is used and is pointed to
by the triple. This makes the graph data structure very compact.

In summary, the pointers can be represented this way::

<serialized term data in mem pool (x3)>
^ ^ ^
| | |
<Term structures in mem pool (x3)>
^ ^ ^
| | |
<Term struct addresses in _terms set (x3)>
^ ^ ^
| | |
<Triple structure in mem pool>
^
|
<address of triple in _triples set>

Let's say we insert the following triples in a ``SimpleGraph``::

<urn:s:0> <urn:p:0> <urn:o:0>
<urn:s:0> <urn:p:1> <urn:o:1>
<urn:s:0> <urn:p:1> <urn:o:2>
<urn:s:0> <urn:p:0> <urn:o:0>

The memory pool contains the following byte arrays of raw data, displayed in
the following list with their relative addresses (simplified to 8-bit
addresses and fixed-length byte strings for readability)::

0x00 <urn:s:0>
0x09 <urn:p:0>
0x12 <urn:o:0>

0x1b <urn:s:0>
0x24 <urn:p:1>
0x2d <urn:o:1>

0x36 <urn:s:0>
0x3f <urn:p:1>
0x48 <urn:o:2>

0x51 <urn:s:0>
0x5a <urn:p:0>
0x63 <urn:o:0>

However, the ``_terms`` set contains only ``Buffer`` structures pointing to
unique addresses::

0x00
0x09
0x12
0x24
0x2d
0x48

The other terms are simply left unused. They will be deallocated en masse when
the ``SimpleGraph`` object is garbage collected.

The ``_triples`` set would then contain 3 unique entries pointing to the unique
term addresses::

0x00 0x09 0x12
0x00 0x24 0x2d
0x00 0x24 0x48

(The actual addresses would belong to the structures pointing to the raw data,
but this is just an illustrative example.)

The advantage of this approach is that the memory pool is contiguous and
append-only (until it gets purged), so adding to it is cheap. The sets, which
must maintain uniqueness and are the target of most operations (lookup,
adding, removing, slicing, copying, etc.), contain much less data and are
therefore faster.
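
The deduplication described above can be approximated in pure Python. This is
only a toy model under stated assumptions, not the actual Cython
implementation: the ``ToyGraph`` class and its attribute names are
hypothetical, a ``bytearray`` stands in for the memory pool, and byte offsets
stand in for pointers. Every term is appended to the pool, but the sets only
record the address of the first occurrence of each term.

```python
class ToyGraph:
    """Toy model of an append-only pool plus sets of unique addresses.

    Hypothetical illustration only; names do not come from Lakesuperior.
    """

    def __init__(self):
        self.pool = bytearray()   # append-only stand-in for the memory pool
        self._first_addr = {}     # serialized term -> address of first copy
        self.terms = set()        # unique term addresses
        self.triples = set()      # unique (s, p, o) address tuples

    def _store_term(self, term):
        data = term.encode("utf-8")
        addr = len(self.pool)
        # The term is always appended, even if an identical one exists.
        self.pool += data
        # Only the first address seen for this term is ever referenced.
        first = self._first_addr.setdefault(data, addr)
        self.terms.add(first)
        return first

    def add(self, s, p, o):
        self.triples.add(
            (self._store_term(s), self._store_term(p), self._store_term(o))
        )


g = ToyGraph()
g.add("<urn:s:0>", "<urn:p:0>", "<urn:o:0>")
g.add("<urn:s:0>", "<urn:p:1>", "<urn:o:1>")
g.add("<urn:s:0>", "<urn:p:1>", "<urn:o:2>")
g.add("<urn:s:0>", "<urn:p:0>", "<urn:o:0>")

print(sorted(hex(a) for a in g.terms))
print(sorted(g.triples))
```

With the four example triples, the pool holds twelve serialized terms, while
the two sets hold six unique term addresses and three unique triples, matching
the listings above.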
1 change: 1 addition & 0 deletions ext/collections-c
Submodule collections-c added at 719fd8
17 changes: 9 additions & 8 deletions lakesuperior/api/admin.py
Original file line number Diff line number Diff line change
Expand Up @@ -77,18 +77,19 @@ def fixity_check(uid):
resource is not an LDP-NR.
"""
from lakesuperior.api import resource as rsrc_api
from lakesuperior.model.ldp_factory import LDP_NR_TYPE
from lakesuperior.model.ldp.ldp_factory import LDP_NR_TYPE

rsrc = rsrc_api.get(uid)
if LDP_NR_TYPE not in rsrc.ldp_types:
raise IncompatibleLdpTypeError()
with env.app_globals.rdf_store.txn_ctx():
if LDP_NR_TYPE not in rsrc.ldp_types:
raise IncompatibleLdpTypeError()

ref_digest_term = rsrc.metadata.value(nsc['premis'].hasMessageDigest)
ref_digest_parts = ref_digest_term.split(':')
ref_cksum = ref_digest_parts[-1]
ref_cksum_algo = ref_digest_parts[-2]
ref_digest_term = rsrc.metadata.value(nsc['premis'].hasMessageDigest)
ref_digest_parts = ref_digest_term.split(':')
ref_cksum = ref_digest_parts[-1]
ref_cksum_algo = ref_digest_parts[-2]

calc_cksum = hashlib.new(ref_cksum_algo, rsrc.content.read()).hexdigest()
calc_cksum = hashlib.new(ref_cksum_algo, rsrc.content.read()).hexdigest()

if calc_cksum != ref_cksum:
raise ChecksumValidationError(uid, ref_cksum, calc_cksum)
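
The checksum comparison in this hunk can be sketched in isolation. This is a
minimal illustration, assuming the digest term is a colon-separated string
whose last two parts are the algorithm name and the hex checksum (as the
``split(':')`` indexing above suggests). The helper names
``split_digest_term`` and ``checksum_matches`` and the ``urn:sha1:...`` sample
value are hypothetical and not part of Lakesuperior.

```python
import hashlib


def split_digest_term(digest_term):
    """Return (algorithm, checksum) from a colon-separated digest term."""
    parts = digest_term.split(":")
    return parts[-2], parts[-1]


def checksum_matches(digest_term, content):
    """Recompute the digest of ``content`` and compare it to the stored one."""
    algo, ref_cksum = split_digest_term(digest_term)
    calc_cksum = hashlib.new(algo, content).hexdigest()
    return calc_cksum == ref_cksum


# Hypothetical sample data: build a digest term for some known content.
content = b"hello"
digest_term = "urn:sha1:" + hashlib.sha1(content).hexdigest()

print(checksum_matches(digest_term, content))
print(checksum_matches(digest_term, b"tampered"))
```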