
Commit 28aaf3a

Update docsprint to reflect the new xapian-letor.

1 parent a707ae5

File tree

8 files changed
+282 -5 lines changed

advanced/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ Advanced features
    postingsource
    unigramlm
    custom_weighting
+   learning_to_rank
    admin_notes
    scalability
    replication

advanced/learning_to_rank.rst

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
.. Copyright (C) 2011 Parth Gupta
.. Copyright (C) 2016 Ayush Tomar

=======================
Xapian Learning-to-Rank
=======================

.. contents:: Table of Contents

Introduction
============

Learning-to-Rank (LTR) can be viewed as a weighting scheme which involves machine learning. The main idea behind LTR is to use machine learning models to promote relevant documents which are given a low ranking by probabilistic techniques such as BM25. A model is trained by learning from the relevance judgements provided by a user for a set of queries and a corpus of documents. This model is then used to re-rank the matchset so that more relevant documents rank higher. Learning-to-Rank has recently gained immense popularity and attention among researchers. Xapian is the first project to ship with Learning-to-Rank functionality.

LTR can be broadly seen as two stages: learning the model, and ranking. Learning the model takes a training file as input and produces a model. Given this learnt model, when a new query comes in, scores can be assigned to the documents associated with it.
Preparing the Training file
---------------------------

Currently the ranking models supported by LTR are supervised learning models. A supervised learning model requires labelled training data as input. To learn a model using LTR you need to provide the training data in the following format.

.. code-block:: none

    0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 ... 18:0.750000 19:1.000000 #docid = 1123323
    1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 ... 18:0.500000 19:0.023400 #docid = 4222333

Here each row represents a document for the specified query. The first column is the relevance label, which can take non-negative values. The second column is the query id, and the last column is the docid. The remaining columns hold the feature values.
As mentioned before, this process requires a training file in the above format. The xapian-letor API can generate such a training file for you, but you have to supply some information, namely:

1. Query file: This file holds the queries to be used for learning, along
   with their ids. It should be formatted like this::

       2010001 'landslide,malaysia'
       2010002 'search,engine'
       2010003 'Monuments,of,India'
       2010004 'Indian,food'

   where 2010xxx is the query-id, followed by a comma-separated query in
   single-quotes.

2. Qrel file: This is the file containing the relevance judgements. It should
   be formatted like this::

       2010003 Q0 19243417 1
       2010003 Q0 3256433 1
       2010003 Q0 275014 1
       2010003 Q0 298021 0
       2010003 Q0 1456811 0

   where the first column is the query-id, the third column is the document-id and the fourth column is the relevance label, which is 0 for irrelevant and 1 for relevant. The second column is often referred to as 'iter', but is not important here. All the fields are whitespace delimited. This is the standard format of almost all relevance judgement files; if yours is formatted slightly differently you can easily convert it with a basic 'awk' command.
3. Collection Index: Here you supply the path to the index of the corpus. If
   you have 'title' information in the collection marked with some xml/html
   tag or similar, then add::

       indexer.index(title, 1, "S");

You can refer to the "Indexing" section under the "A practical example" heading for the Collection Index. The database created in the practical example will be used as the collection index for the examples here. In particular we are going to be using all the documents which contain the term "watch", which will be used as the query for the examples.

Provided this information, the API can create a training file in the format described above, ready to be used for learning a model.

To prepare a training file, run the following command from the top level directory. This example assumes that you have created the db from the first example in the "Indexing" section under the "A practical example" header, and that you have installed xapian-letor.

.. code-block:: none

    $ xapian-prepare-trainingfile --db=db data/query.txt data/qrel.txt training_data.txt

xapian-prepare-trainingfile is a utility installed as part of xapian-letor. This should create a training_data.txt with values similar to those in data/training_data.txt.

The source code for xapian-prepare-trainingfile is at `xapian/xapian-letor/bin/xapian-prepare-trainingfile.cc <https://github.com/xapian/xapian/blob/master/xapian-letor/bin/xapian-prepare-trainingfile.cc>`_.
Learning the Model
------------------

In xapian-letor we support the following learning algorithms:

1. `ListNET <http://dl.acm.org/citation.cfm?id=1273513>`_
2. `Ranking-SVM <http://dl.acm.org/citation.cfm?id=775067>`_
3. `ListMLE <http://icml2008.cs.helsinki.fi/papers/167.pdf>`_

You can use any one of these rankers to learn the model. The command line tool xapian-train uses ListNET as the ranker for learning. To learn a model, run the following command from the top level directory.

.. code-block:: none

    $ xapian-train --db=db data/training_data.txt "ListNET_Ranker"

Ranking
-------

After we have built a model, it is quite straightforward to get a real score for a particular document for a given query. Here we supply the initially retrieved ranked list to the ranking function, which converts each document to a feature vector of the same dimensions used in training and assigns it a new score. The list is then re-ranked according to the new scores.

Here's the significant part of the example code to implement ranking.

.. xapianexample:: search_letor

A full copy of this code is available in :xapian-code-example:`^`

You can run this code as follows to re-rank the list of documents retrieved from the db containing the term "watch", in the order of relevance given in data/qrel.txt.

.. xapianrunexample:: search_letor
    :cleanfirst: db
    :args: "db" "ListNET_Ranker" "watch"
    :letor:
Features
========

Features play a major role in the learning. In LTR, features are mainly of three types: query dependent, document dependent (pagerank, inLink/outLink number, number of children, etc.) and query-document pair dependent (TF-IDF score, BM25 score, etc.).

Currently we have incorporated 19 features, which are described below. These features are statistically tested in `Nallapati2004 <http://dl.acm.org/citation.cfm?id=1009006>`_.

Here c(w,D) denotes the count of term w in document D. C represents the collection. 'n' is the total number of terms in the query.
:math:`|.|` is the size-of function and idf(.) is the inverse document frequency.

1. :math:`\sum_{q_i \in Q \cap D} \log{\left( c(q_i,D) \right)}`

2. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\right)}`

3. :math:`\sum_{q_i \in Q \cap D} \log{\left(idf(q_i) \right) }`

4. :math:`\sum_{q_i \in Q \cap D} \log{\left( \frac{|C|}{c(q_i,C)} \right)}`

5. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}idf(q_i)\right)}`

6. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\frac{|C|}{c(q_i,C)}\right)}`

All six features above are calculated for the 'title only', 'body only' and 'whole' document, so they make 6*3=18 features in total. The 19th feature is the Xapian weighting scheme score assigned to the document (by default this is BM25). The API gives you a choice of which specific features to use; by default, all 19 features defined above are used.
One thing to note is that all the feature values are `normalized at Query-Level <https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm>`_. That means that the values of a particular feature for a particular query are divided by its query-level maximum value, so all the feature values lie between 0 and 1. This normalization helps to keep the learning unbiased.
.. [Nallapati2004] Nallapati, R. Discriminative models for information retrieval. Proceedings of SIGIR 2004 (pp. 64-71).

Checking quality of ranking
---------------------------

xapian-letor has support for Scorer metrics to check the ranking quality of an LTR model. The ranking quality score is calculated from the relevance labels of the ranked documents, as given in the Qrel file. Currently we support the following quality metric:

1. `Normalised Discounted Cumulative Gain (NDCG) measure <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`_

To score your model using the xapian-letor API you need to use the same Ranker that you used to train the model, the same set of features that were used to generate the training file and rank the documents, and the same model key that was used to train the model. By default the "NDCG" scorer is used as the score_type, and since we have only one scorer it is the only valid string allowed as scorer_type. By default all features are used for scoring:
.. code-block:: c++

    Xapian::ListNETRanker ranker;
    ranker.set_database_path(db_path);
    ranker.set_query(query);
    ranker.train_model(trainingfile, model_key); // or: ranker.train_model(trainingfile);
    ranker.rank(mset, model_key, flist);
    ranker.score(query, qrel, model_key, outputfile_path, msetsize, scorer_type, flist);

Make sure that you use the same LTR algorithm (Ranker) and the same set of features (via Xapian::FeatureList) that were used while preparing the model you are evaluating, otherwise an exception will be thrown. The Ranker::score() method will return the model score for each query in the query file and an average score over all the queries. The results are saved at <outputfile_path>.

code/c++/search_letor.cc

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
#include <xapian-letor.h>

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

static void show_usage()
{
    cout << "Usage: rank_letor --db=DIRECTORY MODEL_METADATA_KEY QUERY\n";
}

// Start of example code.
// Stopwords:
static const char * sw[] = {
    "a", "about", "an", "and", "are", "as", "at",
    "be", "by",
    "en",
    "for", "from",
    "how",
    "i", "in", "is", "it",
    "of", "on", "or",
    "that", "the", "this", "to",
    "was", "what", "when", "where", "which", "who", "why", "will", "with"
};

void rank_letor(string db_path, string model_key, string query_)
{
    Xapian::SimpleStopper mystopper(sw, sw + sizeof(sw) / sizeof(sw[0]));
    Xapian::Stem stemmer("english");
    Xapian::doccount msize = 10;
    Xapian::QueryParser parser;
    parser.add_prefix("title", "S");
    parser.add_prefix("subject", "S");
    Xapian::Database db(db_path);
    parser.set_database(db);
    parser.set_default_op(Xapian::Query::OP_OR);
    parser.set_stemmer(stemmer);
    parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
    parser.set_stopper(&mystopper);
    Xapian::Query query_no_prefix = parser.parse_query(query_,
                                                       parser.FLAG_DEFAULT|
                                                       parser.FLAG_SPELLING_CORRECTION);
    // query with title as default prefix
    Xapian::Query query_default_prefix = parser.parse_query(query_,
                                                            parser.FLAG_DEFAULT|
                                                            parser.FLAG_SPELLING_CORRECTION,
                                                            "S");
    // Combine queries
    Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, query_no_prefix,
                                        query_default_prefix);
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet mset = enquire.get_mset(0, msize);

    cout << "Docids before re-ranking by LTR model:" << endl;
    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
        Xapian::Document doc = i.get_document();
        string data = doc.get_data();
        cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
    }

    // Initialise Ranker object with ListNETRanker instance, db path and query.
    // See Ranker documentation for available Ranker subclass options.
    Xapian::ListNETRanker ranker;
    ranker.set_database_path(db_path);
    ranker.set_query(query);

    // Re-rank the mset using the LTR model.
    ranker.rank(mset, model_key);

    cout << "Docids after re-ranking by LTR model:\n" << endl;

    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
        Xapian::Document doc = i.get_document();
        string data = doc.get_data();
        cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
    }
}
// End of example code.

int main(int argc, char** argv)
{
    if (argc != 4) {
        show_usage();
        return 0;
    }
    string db_path = argv[1];
    string model_key = argv[2];
    string query = argv[3];
    rank_letor(db_path, model_key, query);
    return 0;
}

code/expected.out/search_letor.out

Whitespace-only changes.

conf.py

Lines changed: 13 additions & 5 deletions
@@ -52,7 +52,7 @@

 # Add any Sphinx extension module names here, as strings. They can be extensions
 # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
-extensions = ['sphinx.ext.todo',]
+extensions = ['sphinx.ext.todo', 'sphinx.ext.mathjax']

 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
@@ -303,7 +303,7 @@ def xapian_code_example_filename(ex):
     return "code/%s/%s%s" % (highlight_language, ex, ext)

 # Return the command to show in the generated docs.
-def xapian_code_example_command(ex):
+def xapian_code_example_command(ex, option):
     if highlight_language == 'lua':
         return "lua %s" % xapian_code_example_filename(ex)
     elif highlight_language == 'perl':
@@ -319,8 +319,12 @@ def xapian_code_example_command(ex):
     elif highlight_language == 'tcl':
         return "tclsh %s" % xapian_code_example_filename(ex)
     elif highlight_language == 'c++':
-        return "g++ `xapian-config --cxxflags` %s code/c++/support.cc -o %s `xapian-config --libs`\n./%s" \
-            % (xapian_code_example_filename(ex), ex, ex)
+        if option == 0:
+            return "g++ `xapian-config --cxxflags` %s code/c++/support.cc -o %s `xapian-config --libs`\n./%s" \
+                % (xapian_code_example_filename(ex), ex, ex)
+        else:
+            return "g++ `xapian-config --cxxflags` %s -lxapianletor -o %s `xapian-config --libs`\n./%s" \
+                % (xapian_code_example_filename(ex), ex, ex)
     elif highlight_language == 'csharp':
         return "cli-csc -unsafe -target:exe -out:%s.exe %s -r:XapianSharp.dll\n./%s.exe" \
             % (ex, xapian_code_example_filename(ex), ex)
@@ -448,6 +452,7 @@ class XapianRunExample(LiteralInclude):
         'cleanfirst': directives.unchanged,
         'shouldfail': directives.unchanged,
         'silent': directives.flag,
+        'letor': directives.flag,
     }

     def run(self):
@@ -466,7 +471,10 @@ def run(self):
             examples_missing.append(last_example)
             return [nodes.literal(text = 'No version of example %s in language %s - patches welcome!'
                 % (last_example, highlight_language))]
-        command = xapian_code_example_command(ex)
+        option = 0
+        if 'letor' in self.options:
+            option = 1
+        command = xapian_code_example_command(ex, option)

         cleanfirst = ''
         if 'cleanfirst' in self.options:

data/qrel.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
20001 Q0 4 1
20001 Q0 13 2
20001 Q0 15 3
20001 Q0 18 4
20001 Q0 33 5
20001 Q0 36 6
20001 Q0 46 7

data/query.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
20001 'watch'

data/training_data.txt

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1 qid:20001 1:1 2:0.792481 3:0.861654 4:0.845488 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:0.847108 14:1 15:1 16:0.940478 17:1 18:1 19:1 #docid=4
2 qid:20001 1:1 2:0.792481 3:0.861654 4:1 5:0.474269 6:0.511136 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:0.474904 15:0.511628 16:1 17:0.799299 18:0.819967 19:0.935227 #docid=13
5 qid:20001 1:1 2:0.792481 3:0.861654 4:0.406469 5:0.330285 6:0.337768 7:1 8:1 9:1 10:1 11:1 12:1 13:0.409563 14:0.33085 15:0.338209 16:0.709374 17:0.706023 18:0.712966 19:0.858446 #docid=33
4 qid:20001 1:1 2:1 3:1 4:0.577883 5:0.341465 6:0.327575 7:1 8:1 9:1 10:1 11:1 12:1 13:0.58097 14:0.342039 15:0.32658 16:0.815522 17:0.713763 18:0.684041 19:0.857419 #docid=18
6 qid:20001 1:1 2:0.5 3:0.666667 4:0.732395 5:0.258682 6:0.359799 7:1 8:1 9:1 10:1 11:1 12:1 13:0.734846 14:0.259173 15:0.36334 16:0.891876 17:0.646325 18:0.745886 19:0.821365 #docid=36
3 qid:20001 1:1 2:0.792481 3:0.861654 4:0.522759 5:0.195721 6:0.211422 7:1 8:1 9:1 10:1 11:1 12:1 13:0.52593 14:0.196124 15:0.211752 16:0.784466 17:0.576678 18:0.596577 19:0.750731 #docid=15
7 qid:20001 1:0 2:0.5 3:0.333333 4:0 5:0.204372 6:0.137361 7:1 8:1 9:1 10:1 11:1 12:1 13:0 14:0.204788 15:0.135186 16:0 17:0.588316 18:0.291052 19:0.374552 #docid=46
