.. Copyright (C) 2011 Parth Gupta
.. Copyright (C) 2016 Ayush Tomar


=======================
Xapian Learning-to-Rank
=======================

.. contents:: Table of Contents


Introduction
============

Learning-to-Rank (LTR) can be viewed as a weighting scheme which involves machine learning. The main idea behind LTR is to use machine learning models to promote relevant documents which probabilistic techniques such as BM25 have ranked too low. A model is trained from the relevance judgements provided by a user for a set of queries and a corpus of documents. This model is then used to re-rank the matchset so that more relevant documents are placed higher in the ranking. Learning-to-Rank has recently gained immense popularity and attention among researchers, and Xapian is the first project to have Learning-to-Rank functionality added to it.

LTR can broadly be seen as two stages: learning the model, and ranking. Learning the model takes a training file as input and produces a model. Given this learnt model, when a new query comes in, scores can be assigned to the documents matching it.

Preparing the Training file
---------------------------

Currently the ranking models supported by LTR are supervised learning models. A supervised learning model requires labelled training data as input. To learn a model using LTR you need to provide the training data in the following format.

.. code-block:: none

    0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 ... 18:0.750000 19:1.000000 #docid = 1123323
    1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 ... 18:0.500000 19:0.023400 #docid = 4222333

Here each row represents a document for the specified query. The first column is the relevance label, which can take non-negative values. The second column is the query id and the last column is the docid; the columns in between hold the feature values, written as feature_index:value pairs.

As mentioned before, this process requires a training file in the above format. The xapian-letor API enables you to generate such a training file, but for that you have to supply the following information:

1. Query file: This file contains the queries to be used for learning,
   along with their ids. It should be formatted in this way::

       2010001 'landslide,malaysia'
       2010002 'search,engine'
       2010003 'Monuments,of,India'
       2010004 'Indian,food'

   where 2010xxx is the query-id, followed by a comma-separated query in
   single quotes.

2. Qrel file: This is the file containing the relevance judgements. It should
   be formatted in this way::

       2010003 Q0 19243417 1
       2010003 Q0 3256433 1
       2010003 Q0 275014 1
       2010003 Q0 298021 0
       2010003 Q0 1456811 0

   where the first column is the query-id, the third column is the
   document-id, and the fourth column is the relevance label, which is 0 for
   irrelevant and 1 for relevant. The second column is often referred to as
   'iter', but it is not important for us. All the fields are whitespace
   delimited. This is the standard format of almost all relevance judgement
   files; if yours is in a slightly different format, you can easily convert
   it with a basic 'awk' command.

3. Collection Index: Here you supply the path to the index of the corpus. If
   the collection has 'title' information marked up with an XML/HTML tag or
   similar, then when indexing add (a fuller indexing sketch follows this
   list)::

       indexer.index_text(title, 1, "S");

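For illustration, here is a minimal indexing sketch showing where that call fits. This is not the exact code from the "A practical example" section; the ``index_document`` helper and the ``title``/``body`` strings are assumptions made for this example:

.. code-block:: c++

    #include <xapian.h>

    #include <string>

    // Minimal sketch: index a document so that xapian-letor can compute the
    // 'title only' variants of its features.  Terms from the title get the
    // "S" prefix; terms from the body are indexed without a prefix.
    void index_document(Xapian::WritableDatabase& db,
                        const std::string& title,
                        const std::string& body)
    {
        Xapian::Document doc;
        doc.set_data(title + "\n" + body);

        Xapian::TermGenerator indexer;
        indexer.set_stemmer(Xapian::Stem("en"));
        indexer.set_document(doc);

        // Title terms carry the "S" prefix.
        indexer.index_text(title, 1, "S");
        // Leave a positional gap so phrases can't span title and body.
        indexer.increase_termpos();
        indexer.index_text(body);

        db.add_document(doc);
    }
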
You can refer to the "Indexing" section under the "A practical example" heading for building the collection index. The database created in that practical example will be used as the collection index here. In particular, we are going to use all the documents which contain the term "watch", and "watch" will be the query for the examples.

Provided with this information, the API can create a training file in the format described above, which can then be used for learning a model.

To prepare a training file, run the following command from the top-level directory. This example assumes that you have created the db from the first example in the "Indexing" section under the "A practical example" heading, and that you have installed xapian-letor.

.. code-block:: none

    $ xapian-prepare-trainingfile --db=db data/query.txt data/qrel.txt training_data.txt

xapian-prepare-trainingfile is a utility installed as part of xapian-letor. This command should create a training_data.txt with values similar to those in data/training_data.txt.

The source code for xapian-prepare-trainingfile is at `xapian/xapian-letor/bin/xapian-prepare-trainingfile.cc <https://github.com/xapian/xapian/blob/master/xapian-letor/bin/xapian-prepare-trainingfile.cc>`_.

Learning the Model
------------------

In xapian-letor we support the following learning algorithms:

1. `ListNET <http://dl.acm.org/citation.cfm?id=1273513>`_
2. `Ranking-SVM <http://dl.acm.org/citation.cfm?id=775067>`_
3. `ListMLE <http://icml2008.cs.helsinki.fi/papers/167.pdf>`_

You can use any one of these rankers to learn a model. The command-line tool xapian-train uses ListNET as the ranker for learning. To learn a model, run the following command from the top-level directory.

.. code-block:: none

    $ xapian-train --db=db data/training_data.txt "ListNET_Ranker"

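If you would rather drive training from C++ than from the command-line tool, the same step can be performed through the xapian-letor API. The following is a minimal sketch rather than a drop-in replacement for xapian-train: the header name and the file paths are assumptions made for this example, and the calls used are the ones shown in the scoring section below:

.. code-block:: c++

    #include <xapian-letor.h>

    #include <string>

    int main() {
        // Paths mirroring the command-line example above (adjust as needed).
        std::string db_path = "db";
        std::string training_file = "data/training_data.txt";
        std::string model_key = "ListNET_Ranker";

        // ListNET is the same ranker the xapian-train tool uses.
        Xapian::ListNETRanker ranker;
        ranker.set_database_path(db_path);

        // Learn a model from the training file and store it under model_key.
        ranker.train_model(training_file, model_key);
    }
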
Ranking
-------

After we have built a model, it is quite straightforward to get a real-valued score for a particular document for a given query. We supply the initially retrieved ranked list to the ranking function, which converts each document to a feature vector of the same dimensions used in training and assigns it a new score. The list is then re-ranked according to the new scores.

Here's the significant part of the example code to implement ranking.

.. xapianexample:: search_letor

A full copy of this code is available in :xapian-code-example:`^`

You can run this code as follows to re-rank the list of documents retrieved from the db for the term "watch", placing them in the order of relevance given in data/qrel.

.. xapianrunexample:: search_letor
    :cleanfirst: db
    :args: "db" "ListNET_Ranker" "watch"
    :letor:

Features
========

Features play a major role in the learning. In LTR, features are mainly of three types: query dependent, document dependent (PageRank, inlink/outlink count, number of child pages, etc.) and query-document pair dependent (TF-IDF score, BM25 score, etc.).

Currently we have incorporated 19 features, which are described below. These features are statistically tested in `Nallapati2004 <http://dl.acm.org/citation.cfm?id=1009006>`_.

    Here c(w,D) is the count of term w in document D, and C represents the whole collection. 'n' is the total number of terms in the query.
    :math:`|.|` is the size-of function and idf(.) is the inverse document frequency.


    1. :math:`\sum_{q_i \in Q \cap D} \log{\left( c(q_i,D) \right)}`

    2. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\right)}`

    3. :math:`\sum_{q_i \in Q \cap D} \log{\left(idf(q_i) \right) }`

    4. :math:`\sum_{q_i \in Q \cap D} \log{\left( \frac{|C|}{c(q_i,C)} \right)}`

    5. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}idf(q_i)\right)}`

    6. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\frac{|C|}{c(q_i,C)}\right)}`

All of the above 6 features are calculated for the 'title only', 'body only' and 'whole' document, so they make 6*3=18 features in total. The 19th feature is the Xapian weighting scheme score assigned to the document (by default this is BM25). The API gives you the choice of selecting which specific features to use; by default, all 19 features defined above are used.

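As a quick illustration (with made-up counts, not taken from any real collection), the 'title only' version of feature 1 for a two-term query, where the first query term occurs twice and the second occurs five times in the document's title, would be:

.. math::

   \sum_{q_i \in Q \cap D} \log{\left( c(q_i,D_{title}) \right)} = \log(2) + \log(5) \approx 2.30
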
One thing that should be noted is that all the feature values are `normalized at Query-Level <https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm>`_. That means that the values of a particular feature for a particular query are divided by their maximum value across that query's documents, and hence all the feature values will be between 0 and 1. This normalization helps to keep the learning unbiased.

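Written as a formula (our notation, not taken from the cited paper), the stored value of feature :math:`f_j` for document :math:`D` and query :math:`Q` is

.. math::

   \hat{f_j}(D,Q) = \frac{f_j(D,Q)}{\max_{D' \in M(Q)} f_j(D',Q)}

where :math:`M(Q)` is the set of documents retrieved for query :math:`Q`.
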
.. [Nallapati2004] Nallapati, R. Discriminative models for information retrieval. Proceedings of SIGIR 2004 (pp. 64-71).

Checking quality of ranking
---------------------------

xapian-letor has support for scorer metrics to check the ranking quality of an LTR model. The ranking quality score is calculated from the relevance labels of the ranked documents, as obtained from the Qrel file. Currently we support the following quality metric:

1. `Normalised Discounted Cumulative Gain (NDCG) measure <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`_

To score your model using the xapian-letor API, you need to make sure that you use the same Ranker that you used to train the model, the same set of features that were used to generate the training file and rank the documents, and the same model key that was used to train the model. By default the "NDCG" scorer is used as the scorer_type, and since we currently have only one scorer, that is the only valid string allowed as scorer_type. By default all features are used for scoring:

.. code-block:: c++

    Xapian::ListNETRanker ranker;
    ranker.set_database_path(db_path);
    ranker.set_query(query);
    // The model key is optional: ranker.train_model(trainingfile) uses a
    // default key.
    ranker.train_model(trainingfile, model_key);
    ranker.rank(mset, model_key, flist);
    ranker.score(query, qrel, model_key, outputfile_path, msetsize, scorer_type, flist);

Make sure that you use the same LTR algorithm (Ranker) and the same set of features (via Xapian::FeatureList) that were used while preparing the model you are evaluating, otherwise an exception will be thrown. The Ranker::score() method reports the score for each query in the query file and an average score over all the queries. The results are saved at <outputfile_path>.