Andrea Gazzarini edited this page Apr 8, 2015 · 3 revisions

Tuple representation

The SolRDF schema [1] is deliberately simple. The underlying idea is that each Solr document represents a triple or a quad:

<field name="s" type="string" indexed="true" stored="true" required="true" />
<field name="p" type="string" indexed="true" stored="true" required="true" />
<field name="o" type="string" indexed="false" stored="true" required="true" />
<field name="c" type="string" indexed="true" stored="true" required="false" />

As you can see, the o(bject) is the only field that is not indexed. That is because it holds the "representation" of the object, stored verbatim as it was received. On top of that, at index time, the object member of the incoming tuple is also copied into one of the following fields, depending on its type (i.e. URI, blank node, literal) and, in case it is a literal, its datatype:

<!-- URI, string or untyped literals -->  
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<field name="o_s" type="string" indexed="true" stored="false"/>

<!-- boolean literals -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<field name="o_b" type="boolean" indexed="true" stored="false" />

<!-- Date / datetime literals --> 
<fieldType name="datetime" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<field name="o_d" type="datetime" indexed="true" stored="false" />

<!-- numeric literals -->
<fieldType name="numeric" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>	
<field name="o_n" type="numeric" indexed="true" stored="false" />

Additionally, if (and only if) the object is a string literal with a language tag, that value (the language) is indexed in the following field:

<!-- (optional) language attribute for string literals -->
<field name="o_lang" type="string" indexed="true" stored="false" />
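The routing described above can be sketched in a few lines of Python. This is only an illustration, not the SolRDF implementation: the `RdfObject` class, the `typed_object_fields` function and the exact set of XSD datatypes in each group are assumptions made for the example, and blank nodes are assumed to fall into the string field.

```python
from dataclasses import dataclass
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed datatype groupings; the set actually handled by SolRDF may differ.
BOOLEAN_TYPES = {XSD + "boolean"}
DATE_TYPES = {XSD + "date", XSD + "dateTime"}
NUMERIC_TYPES = {
    XSD + name
    for name in ("integer", "int", "long", "decimal", "float", "double")
}


@dataclass(frozen=True)
class RdfObject:
    """Hypothetical, simplified view of the object member of a tuple."""
    kind: str                       # "uri", "bnode" or "literal"
    value: str                      # lexical form, e.g. "42"
    datatype: Optional[str] = None  # datatype URI (literals only)
    lang: Optional[str] = None      # language tag (string literals only)


def typed_object_fields(obj: RdfObject) -> dict:
    """Return the additional indexed fields for the given object."""
    fields = {}
    if obj.kind == "literal" and obj.datatype in BOOLEAN_TYPES:
        fields["o_b"] = obj.value
    elif obj.kind == "literal" and obj.datatype in DATE_TYPES:
        fields["o_d"] = obj.value
    elif obj.kind == "literal" and obj.datatype in NUMERIC_TYPES:
        fields["o_n"] = obj.value
    else:
        # URIs, string and untyped literals share the string field
        # (blank nodes are assumed to land here too).
        fields["o_s"] = obj.value
        if obj.kind == "literal" and obj.lang:
            fields["o_lang"] = obj.lang
    return fields
```

For instance, `typed_object_fields(RdfObject("literal", "ciao", lang="it"))` yields both an `o_s` and an `o_lang` entry, while a literal typed as `xsd:integer` yields only `o_n`.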

Identity

A tuple identity is a unique identifier that is assigned to each incoming tuple. It is computed at index time by a SignatureUpdateProcessor [2] that has been placed in the update chain:

<updateRequestProcessorChain name="dedupe"> 
   <processor class="solr.processor.SignatureUpdateProcessorFactory"> 
      <bool name="enabled">true</bool> 
      <str name="signatureField">id</str> 
      <bool name="overwriteDupes">false</bool> 
      <str name="fields">s,p,o,c</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str> 
   </processor> 
   <processor class="solr.LogUpdateProcessorFactory" /> 
   <processor class="solr.RunUpdateProcessorFactory" /> 
</updateRequestProcessorChain>

The identity field, which is called, not surprisingly, "id", is computed from all the members that make up the tuple (s,p,o,c). This ensures that two tuples representing the same assertion get the same identifier and therefore won't result in a duplicate statement.
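The effect of a signature-based identity can be illustrated with a short Python sketch. Note the assumptions: SHA-1 is used here purely as a stand-in for the Lookup3Signature configured in the chain above, the `tuple_id` function is hypothetical, and a plain dictionary plays the role of the index.

```python
import hashlib
from typing import Optional


def tuple_id(s: str, p: str, o: str, c: Optional[str] = None) -> str:
    """Derive a deterministic identifier from the members of a tuple.

    SHA-1 is a stand-in; the update chain above uses Lookup3Signature.
    """
    digest = hashlib.sha1()
    for member in (s, p, o, c or ""):
        encoded = member.encode("utf-8")
        # Length-prefix each member so that ("ab", "c") and ("a", "bc")
        # cannot produce the same byte stream.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


# A dict keyed by id stands in for the index: adding the same
# assertion twice overwrites the entry instead of duplicating it.
index = {}
incoming = [
    ("http://example.org/s", "http://example.org/p", '"hello"@en'),
    ("http://example.org/s", "http://example.org/p", '"hello"@en'),
    ("http://example.org/s", "http://example.org/p", '"hello"@en',
     "http://example.org/g"),
]
for t in incoming:
    index[tuple_id(*t)] = t
```

Here the first two tuples collapse into one entry, while the third, which carries a different c(ontext), gets its own identifier.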


[1] https://github.com/agazzarini/SolRDF/wiki/Schema
[2] https://wiki.apache.org/solr/Deduplication