Merge branch 'master' into development

scossu · Mar 28, 2019 · 8d9b863
2 parents f84c99a + b4dfa0e
commit 8d9b863

Showing 60 changed files with 4,211 additions and 1,919 deletions.
diff --git a/.gitignore b/.gitignore
@@ -107,9 +107,24 @@ venv.bak/
 .pytest_cache/
 
 # Default Lakesuperior data directories
-/data
+lakesuperior/data/ldprs_store
+lakesuperior/data/ldpnr_store
 
 # Cython business.
+/cython_debug
 /lakesuperior/store/*.c
+/lakesuperior/store/*.html
 /lakesuperior/store/ldp_rs/*.c
+/lakesuperior/store/ldp_rs/*.html
+/lakesuperior/model/*.c
+/lakesuperior/model/*/*.html
+/lakesuperior/model/*/*.c
+/lakesuperior/model/*.html
+/lakesuperior/util/*.c
+/lakesuperior/util/*.html
 !ext/lib
+
+# Vim CTags file.
+tags
+
+!.keep
diff --git a/.gitmodules b/.gitmodules
@@ -1,11 +1,12 @@
 [submodule "ext/lmdb"]
     path = ext/lmdb
     url = https://github.com/LMDB/lmdb.git
-    branch = stable
 [submodule "ext/tpl"]
     path = ext/tpl
     url = https://github.com/troydhanson/tpl.git
-    branch = stable
 [submodule "ext/spookyhash"]
     path = ext/spookyhash
     url = https://github.com/centaurean/spookyhash.git
+[submodule "ext/collections-c"]
+    path = ext/collections-c
+    url = https://github.com/srdja/Collections-C.git
diff --git a/.travis.yml b/.travis.yml
@@ -3,12 +3,14 @@ language: python
 matrix:
     include:
     - python: 3.6
+      dist: xenial
+      sudo: true
     - python: 3.7
       dist: xenial
       sudo: true
 
 install:
-  - pip install Cython==0.29
+  - pip install Cython==0.29.6 cymem
   - pip install -e .
 script:
   - python setup.py test

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -5,10 +5,13 @@ include ext/lmdb/libraries/liblmdb/mdb.c
 include ext/lmdb/libraries/liblmdb/lmdb.h
 include ext/lmdb/libraries/liblmdb/midl.c
 include ext/lmdb/libraries/liblmdb/midl.h
+include ext/collections-c/src/*.c
+include ext/collections-c/src/include/*.h
 include ext/tpl/src/tpl.c
 include ext/tpl/src/tpl.h
 include ext/spookyhash/src/*.c
 include ext/spookyhash/src/*.h
+
 graft lakesuperior/data/bootstrap
 graft lakesuperior/endpoints/templates
 graft lakesuperior/etc.defaults
diff --git a/README.rst b/README.rst
@@ -3,43 +3,44 @@ Lakesuperior
 
 |build status| |docs| |pypi| |codecov|
 
-Lakesuperior is an alternative `Fedora
-Repository <http://fedorarepository.org>`__ implementation.
+Lakesuperior is Linked Data repository software. It is capable of storing and
+managing large volumes of files and their metadata regardless of their
+format, size, ethnicity, gender identity or expression.
 
-Fedora is a mature repository software system historically adopted by
-major cultural heritage institutions. It exposes an
-`LDP <https://www.w3.org/TR/ldp-primer/>`__ endpoint to manage
-any type of binary files and their metadata in Linked Data format.
+Lakesuperior is an alternative `Fedora Repository
+<http://fedorarepository.org>`__ implementation. Fedora is a mature repository
+software system historically adopted by major cultural heritage institutions
+which extends the `Linked Data Platform <https://www.w3.org/TR/ldp-primer/>`__
+protocol.
 
 Guiding Principles
 ------------------
 
-Lakesuperior aims at being an uncomplicated, efficient Fedora 4
-implementation.
+Lakesuperior aims at being a reliable and efficient Fedora 4 implementation.
 
 Its main goals are:
 
 -  **Reliability:** Based on solid technologies with stability in mind.
 -  **Efficiency:** Small memory and CPU footprint, high scalability.
--  **Ease of management:** Tools to perform monitoring and maintenance
-   included.
+-  **Ease of management:** Tools to perform migration, monitoring and
+   maintenance included.
 -  **Simplicity of design:** Straight-forward architecture, robustness
    over features.
 
 Key features
 ------------
 
--  Drop-in replacement for Fedora4
--  Very stable persistence layer based on
-   `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
-   ACID-compliant writes guarantee consistency of data.
--  Term-based search and SPARQL Query API + UI
--  No performance penalty for storing many resources under the same
-   container, or having one resource link to many URIs
--  Extensible provenance metadata tracking
--  Multi-modal access: HTTP (REST), command line interface and native Python
-   API.
--  Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
+- Stores binary files and RDF metadata in one repository.
+- Multi-modal access: REST/LDP, command line and native Python API.
+- (`almost <fcrepo4_deltas>`_) Drop-in replacement for Fedora4
+- Very stable persistence layer based on
+  `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
+  ACID-compliant writes guarantee consistency of data.
+- Term-based search and SPARQL Query API + UI
+- No performance penalty for storing many resources under the same
+  container, or having one resource link to many URIs
+- Extensible provenance metadata tracking
+- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
 
 Installation & Documentation
 ----------------------------
@@ -50,7 +51,7 @@ With Docker::
     cd lakesuperior
     docker-compose up
 
-With pip (assuming you are familiar with it)::
+With pip (requires a C compiler to be installed)::
 
     pip install lakesuperior
 

diff --git a/docs/api.rst b/docs/api.rst
@@ -10,7 +10,7 @@ The Lakesuperior API modules of most interest for a client are:
 - :mod:`lakesuperior.api.query`
 - :mod:`lakesuperior.api.admin`
 
-:mod:`lakesuperior.model.ldpr` is used to manipulate resources.
+:mod:`lakesuperior.model.ldp.ldpr` is used to manipulate resources.
 
 The full API docs are listed below.
 

diff --git a/docs/apidoc/lakesuperior.model.rst b/docs/apidoc/lakesuperior.model.rst
@@ -7,31 +7,31 @@ Submodules
 lakesuperior\.model\.ldp\_factory module
 ----------------------------------------
 
-.. automodule:: lakesuperior.model.ldp_factory
+.. automodule:: lakesuperior.model.ldp.ldp_factory
     :members:
     :undoc-members:
     :show-inheritance:
 
 lakesuperior\.model\.ldp\_nr module
 -----------------------------------
 
-.. automodule:: lakesuperior.model.ldp_nr
+.. automodule:: lakesuperior.model.ldp.ldp_nr
     :members:
     :undoc-members:
     :show-inheritance:
 
 lakesuperior\.model\.ldp\_rs module
 -----------------------------------
 
-.. automodule:: lakesuperior.model.ldp_rs
+.. automodule:: lakesuperior.model.ldp.ldp_rs
     :members:
     :undoc-members:
     :show-inheritance:
 
 lakesuperior\.model\.ldpr module
 --------------------------------
 
-.. automodule:: lakesuperior.model.ldpr
+.. automodule:: lakesuperior.model.ldp.ldpr
     :members:
     :undoc-members:
     :show-inheritance:

diff --git a/docs/structures.rst b/docs/structures.rst
@@ -0,0 +1,99 @@
+Data Structure Internals
+========================
+
+**(Draft)**
+
+Lakesuperior has its own methods for handling in-memory graphs. These methods
+rely on C data structures and are therefore much faster than Python/RDFLib
+objects.
+
+The graph data model modules are in :py:mod:`lakesuperior.model.graph`.
+
+The Graph Data Model
+--------------------
+
+Triples are stored in a C hash set. Each triple is represented by a pointer to
+a ``BufferTriple`` structure stored in a temporary memory pool. This pool is
+tied to the life cycle of the ``SimpleGraph`` object it belongs to.
+
+A triple structure contains three pointers to ``Buffer`` structures, which
+contain a serialized version of an RDF term. These structures are stored in the
+``SimpleGraph`` memory pool as well.
+
+Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples``
+property. These are C hash sets holding addresses of unique terms and
+triples inserted in the graph. If the same term is entered more than once,
+in any position in any triple, the first one entered is used and is pointed to
+by the triple. This makes the graph data structure very compact.
+
+In summary, the pointers can be represented this way::
+
+   <serialized term data in mem pool (x3)>
+         ^      ^      ^
+         |      |      |
+   <Term structures in mem pool (x3)>
+         ^      ^      ^
+         |      |      |
+   <Term struct addresses in _terms set (x3)>
+         ^      ^      ^
+         |      |      |
+   <Triple structure in mem pool>
+         ^
+         |
+   <address of triple in _triples set>
+
+Let's say we insert the following triples in a ``SimpleGraph``::
+
+   <urn:s:0> <urn:p:0> <urn:o:0>
+   <urn:s:0> <urn:p:1> <urn:o:1>
+   <urn:s:0> <urn:p:1> <urn:o:2>
+   <urn:s:0> <urn:p:0> <urn:o:0>
+
+The memory pool contains the following byte arrays of raw data, listed here
+with their relative addresses (simplified to 8-bit addresses and fixed-length
+byte strings for readability)::
+
+   0x00     <urn:s:0>
+   0x09     <urn:p:0>
+   0x12     <urn:o:0>
+
+   0x1b     <urn:s:0>
+   0x24     <urn:p:1>
+   0x2d     <urn:o:1>
+
+   0x36     <urn:s:0>
+   0x3f     <urn:p:1>
+   0x48     <urn:o:2>
+
+   0x51     <urn:s:0>
+   0x5a     <urn:p:0>
+   0x63     <urn:o:0>
+
+However, the ``_terms`` set contains only ``Buffer`` structures pointing to
+unique addresses::
+
+   0x00
+   0x09
+   0x12
+   0x24
+   0x2d
+   0x48
+
+The remaining duplicate copies are simply left unused. They are deallocated
+en masse when the ``SimpleGraph`` object is garbage collected.
+
+The ``_triples`` set would then contain 3 unique entries pointing to the unique
+term addresses::
+
+   0x00  0x09  0x12
+   0x00  0x24  0x2d
+   0x00  0x24  0x48
+
+(In practice the addresses would belong to the structures pointing to the
+raw data; the values above are only an illustrative example.)
+
+The advantage of this approach is that the memory pool is contiguous and
+append-only (until it gets purged), so adding to it is cheap. The sets, which
+must maintain uniqueness and are the target of most operations (lookup,
+adding, removing, slicing, copying, etc.), contain much less data and are
+therefore faster.
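
The deduplication walk-through in docs/structures.rst above can be mimicked with a toy Python model. This is only a sketch of the idea, not the C/Cython implementation: ``ToyGraph`` and its ``pool`` attribute are hypothetical names, offsets into a ``bytearray`` stand in for memory addresses, and the ``Buffer`` / ``BufferTriple`` indirection is left out.

```python
class ToyGraph:
    """Toy model of the pool / _terms / _triples layout (not the real SimpleGraph)."""

    def __init__(self):
        self.pool = bytearray()  # append-only stand-in for the memory pool
        self._terms = {}         # serialized term -> offset of its first copy
        self._triples = set()    # unique (s_offset, p_offset, o_offset) tuples

    def _store(self, term):
        # Every term is appended to the pool, even if it is a duplicate...
        offset = len(self.pool)
        self.pool += term
        # ...but only the offset of the first copy is ever handed out.
        return self._terms.setdefault(term, offset)

    def add(self, s, p, o):
        self._triples.add(tuple(self._store(t) for t in (s, p, o)))


g = ToyGraph()
for s, p, o in [
    (b"<urn:s:0>", b"<urn:p:0>", b"<urn:o:0>"),
    (b"<urn:s:0>", b"<urn:p:1>", b"<urn:o:1>"),
    (b"<urn:s:0>", b"<urn:p:1>", b"<urn:o:2>"),
    (b"<urn:s:0>", b"<urn:p:0>", b"<urn:o:0>"),
]:
    g.add(s, p, o)

term_offsets = sorted(g._terms.values())
triple_offsets = sorted(g._triples)
print([hex(x) for x in term_offsets])
print(triple_offsets)
```

With 9-byte terms the pool grows to twelve entries, while the two sets keep six term offsets and three triples, matching the listings in the document.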
diff --git a/ext/collections-c b/ext/collections-c
diff --git a/lakesuperior/api/admin.py b/lakesuperior/api/admin.py
@@ -77,18 +77,19 @@ def fixity_check(uid):
         resource is not an LDP-NR.
     """
     from lakesuperior.api import resource as rsrc_api
-    from lakesuperior.model.ldp_factory import LDP_NR_TYPE
+    from lakesuperior.model.ldp.ldp_factory import LDP_NR_TYPE
 
     rsrc = rsrc_api.get(uid)
-    if LDP_NR_TYPE not in rsrc.ldp_types:
-        raise IncompatibleLdpTypeError()
+    with env.app_globals.rdf_store.txn_ctx():
+        if LDP_NR_TYPE not in rsrc.ldp_types:
+            raise IncompatibleLdpTypeError()
 
-    ref_digest_term = rsrc.metadata.value(nsc['premis'].hasMessageDigest)
-    ref_digest_parts = ref_digest_term.split(':')
-    ref_cksum = ref_digest_parts[-1]
-    ref_cksum_algo = ref_digest_parts[-2]
+        ref_digest_term = rsrc.metadata.value(nsc['premis'].hasMessageDigest)
+        ref_digest_parts = ref_digest_term.split(':')
+        ref_cksum = ref_digest_parts[-1]
+        ref_cksum_algo = ref_digest_parts[-2]
 
-    calc_cksum = hashlib.new(ref_cksum_algo, rsrc.content.read()).hexdigest()
+        calc_cksum = hashlib.new(ref_cksum_algo, rsrc.content.read()).hexdigest()
 
     if calc_cksum != ref_cksum:
         raise ChecksumValidationError(uid, ref_cksum, calc_cksum)
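
The digest parsing in ``fixity_check`` above can be exercised in isolation. The following is a minimal sketch, assuming the ``premis:hasMessageDigest`` value has the form ``urn:<algorithm>:<hex digest>``; that shape is inferred from the ``split(':')`` logic and is not shown in the diff. ``verify_digest`` is a hypothetical helper, not part of Lakesuperior.

```python
import hashlib


def verify_digest(digest_term, content):
    """Return (matches, calculated_checksum) for a 'urn:<algo>:<hex>' digest term."""
    parts = digest_term.split(":")
    ref_cksum = parts[-1]        # last segment: reference checksum
    ref_cksum_algo = parts[-2]   # second to last: hash algorithm name
    calc_cksum = hashlib.new(ref_cksum_algo, content).hexdigest()
    return calc_cksum == ref_cksum, calc_cksum


payload = b"hello"
# Assumed digest term format; built here from the payload itself.
term = "urn:sha1:" + hashlib.sha1(payload).hexdigest()

ok, _ = verify_digest(term, payload)
tampered_ok, _ = verify_digest(term, b"tampered")
print(ok, tampered_ok)
```

In the merged code the lookup and read happen inside ``txn_ctx()``, while the comparison and the ``ChecksumValidationError`` stay outside it, so the transaction is held only while repository data is being read.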