.. Copyright (C) 2011 Parth Gupta
.. Copyright (C) 2016 Ayush Tomar


=======================
Xapian Learning-to-Rank
=======================

.. contents:: Table of Contents


Introduction
============

Learning-to-Rank (LTR) can be viewed as a weighting scheme which involves machine learning. The main idea behind LTR is to use machine learning models to promote relevant documents which probabilistic techniques such as BM25 have ranked too low. A model is trained from the relevance judgements provided by a user for a set of queries and a corpus of documents. This model is then used to re-rank the matchset so that more relevant documents are placed higher in the ranking. Learning-to-Rank has recently gained immense popularity and attention among researchers, and Xapian is the first project to have Learning-to-Rank functionality added to it.

LTR can broadly be seen as two stages: learning the model, and ranking. Learning the model takes a training file as input and produces a model. Given this learnt model, when a new query comes in, scores can be assigned to the documents matching it.

Preparing the Training file
---------------------------

Currently the ranking models supported by LTR are supervised learning models. A supervised learning model requires labelled training data as input. To learn a model using LTR you need to provide the training data in the following format.

.. code-block:: none

    0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 ... 18:0.750000 19:1.000000 #docid = 1123323
    1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 ... 18:0.500000 19:0.023400 #docid = 4222333

Here each row represents a document for the specified query. The first column is the relevance label, which can take non-negative values. The second column is the query id and the last column is the docid; the columns in between hold the feature values, written as feature_index:value pairs.

As mentioned before, this process requires a training file in the above format. The xapian-letor API enables you to generate such a training file, but for that you have to supply the following information:

1. Query file: This file contains the queries to be used for learning,
   along with their ids. It should be formatted in this way::

       2010001 'landslide,malaysia'
       2010002 'search,engine'
       2010003 'Monuments,of,India'
       2010004 'Indian,food'

   where 2010xxx is the query-id, followed by a comma-separated query in
   single quotes.

2. Qrel file: This is the file containing the relevance judgements. It should
   be formatted in this way::

       2010003 Q0 19243417 1
       2010003 Q0 3256433 1
       2010003 Q0 275014 1
       2010003 Q0 298021 0
       2010003 Q0 1456811 0

   where the first column is the query-id, the third column is the
   document-id, and the fourth column is the relevance label, which is 0 for
   irrelevant and 1 for relevant. The second column is often referred to as
   'iter', but it is not important for us. All the fields are whitespace
   delimited. This is the standard format of almost all relevance judgement
   files; if yours is in a slightly different format, you can easily convert
   it with a basic 'awk' command.

3. Collection Index: Here you supply the path to the index of the corpus. If
   the collection has 'title' information marked up with an XML/HTML tag or
   similar, then when indexing add (a fuller indexing sketch follows this
   list)::

       indexer.index_text(title, 1, "S");

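For illustration, here is a minimal indexing sketch showing where that call fits. This is not the exact code from the "A practical example" section; the ``index_document`` helper and the ``title``/``body`` strings are assumptions made for this example:

.. code-block:: c++

    #include <xapian.h>

    #include <string>

    // Minimal sketch: index a document so that xapian-letor can compute the
    // 'title only' variants of its features.  Terms from the title get the
    // "S" prefix; terms from the body are indexed without a prefix.
    void index_document(Xapian::WritableDatabase& db,
                        const std::string& title,
                        const std::string& body)
    {
        Xapian::Document doc;
        doc.set_data(title + "\n" + body);

        Xapian::TermGenerator indexer;
        indexer.set_stemmer(Xapian::Stem("en"));
        indexer.set_document(doc);

        // Title terms carry the "S" prefix.
        indexer.index_text(title, 1, "S");
        // Leave a positional gap so phrases can't span title and body.
        indexer.increase_termpos();
        indexer.index_text(body);

        db.add_document(doc);
    }
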
You can refer to the "Indexing" section under the "A practical example" heading for building the collection index. The database created in that practical example will be used as the collection index here. In particular, we are going to use all the documents which contain the term "watch", and "watch" will be the query for the examples.

Provided with this information, the API can create a training file in the format described above, which can then be used for learning a model.

To prepare a training file, run the following command from the top-level directory. This example assumes that you have created the db from the first example in the "Indexing" section under the "A practical example" heading, and that you have installed xapian-letor.

.. code-block:: none

    $ xapian-prepare-trainingfile --db=db data/query.txt data/qrel.txt training_data.txt

xapian-prepare-trainingfile is a utility installed as part of xapian-letor. This command should create a training_data.txt with values similar to those in data/training_data.txt.

The source code for xapian-prepare-trainingfile is at `xapian/xapian-letor/bin/xapian-prepare-trainingfile.cc <https://github.com/xapian/xapian/blob/master/xapian-letor/bin/xapian-prepare-trainingfile.cc>`_.

Learning the Model
------------------

In xapian-letor we support the following learning algorithms:

1. `ListNET <http://dl.acm.org/citation.cfm?id=1273513>`_
2. `Ranking-SVM <http://dl.acm.org/citation.cfm?id=775067>`_
3. `ListMLE <http://icml2008.cs.helsinki.fi/papers/167.pdf>`_

You can use any one of these rankers to learn a model. The command-line tool xapian-train uses ListNET as the ranker for learning. To learn a model, run the following command from the top-level directory.

.. code-block:: none

    $ xapian-train --db=db data/training_data.txt "ListNET_Ranker"

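If you would rather drive training from C++ than from the command-line tool, the same step can be performed through the xapian-letor API. The following is a minimal sketch rather than a drop-in replacement for xapian-train: the header name and the file paths are assumptions made for this example, and the calls used are the ones shown in the scoring section below:

.. code-block:: c++

    #include <xapian-letor.h>

    #include <string>

    int main() {
        // Paths mirroring the command-line example above (adjust as needed).
        std::string db_path = "db";
        std::string training_file = "data/training_data.txt";
        std::string model_key = "ListNET_Ranker";

        // ListNET is the same ranker the xapian-train tool uses.
        Xapian::ListNETRanker ranker;
        ranker.set_database_path(db_path);

        // Learn a model from the training file and store it under model_key.
        ranker.train_model(training_file, model_key);
    }
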
Ranking
-------

After we have built a model, it is quite straightforward to get a real-valued score for a particular document for a given query. We supply the initially retrieved ranked list to the ranking function, which converts each document to a feature vector of the same dimensions used in training and assigns it a new score. The list is then re-ranked according to the new scores.

Here's the significant part of the example code to implement ranking.

.. xapianexample:: search_letor

A full copy of this code is available in :xapian-code-example:`^`

You can run this code as follows to re-rank the list of documents retrieved from the db for the term "watch", placing them in the order of relevance given in data/qrel.

.. xapianrunexample:: search_letor
    :cleanfirst: db
    :args: "db" "ListNET_Ranker" "watch"
    :letor:

Features
========

Features play a major role in the learning. In LTR, features are mainly of three types: query dependent, document dependent (PageRank, inlink/outlink count, number of child pages, etc.) and query-document pair dependent (TF-IDF score, BM25 score, etc.).

Currently we have incorporated 19 features, which are described below. These features are statistically tested in `Nallapati2004 <http://dl.acm.org/citation.cfm?id=1009006>`_.

    Here c(w,D) is the count of term w in document D, and C represents the whole collection. 'n' is the total number of terms in the query.
    :math:`|.|` is the size-of function and idf(.) is the inverse document frequency.


    1. :math:`\sum_{q_i \in Q \cap D} \log{\left( c(q_i,D) \right)}`

    2. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\right)}`

    3. :math:`\sum_{q_i \in Q \cap D} \log{\left(idf(q_i) \right) }`

    4. :math:`\sum_{q_i \in Q \cap D} \log{\left( \frac{|C|}{c(q_i,C)} \right)}`

    5. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}idf(q_i)\right)}`

    6. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\frac{|C|}{c(q_i,C)}\right)}`

All of the above 6 features are calculated for the 'title only', 'body only' and 'whole' document, so they make 6*3=18 features in total. The 19th feature is the Xapian weighting scheme score assigned to the document (by default this is BM25). The API gives you the choice of selecting which specific features to use; by default, all 19 features defined above are used.

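As a quick illustration (with made-up counts, not taken from any real collection), the 'title only' version of feature 1 for a two-term query, where the first query term occurs twice and the second occurs five times in the document's title, would be:

.. math::

   \sum_{q_i \in Q \cap D} \log{\left( c(q_i,D_{title}) \right)} = \log(2) + \log(5) \approx 2.30
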
One thing that should be noted is that all the feature values are `normalized at Query-Level <https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm>`_. That means that the values of a particular feature for a particular query are divided by their maximum value across that query's documents, and hence all the feature values will be between 0 and 1. This normalization helps to keep the learning unbiased.

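Written as a formula (our notation, not taken from the cited paper), the stored value of feature :math:`f_j` for document :math:`D` and query :math:`Q` is

.. math::

   \hat{f_j}(D,Q) = \frac{f_j(D,Q)}{\max_{D' \in M(Q)} f_j(D',Q)}

where :math:`M(Q)` is the set of documents retrieved for query :math:`Q`.
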
.. [Nallapati2004] Nallapati, R. Discriminative models for information retrieval. Proceedings of SIGIR 2004 (pp. 64-71).

Checking quality of ranking
---------------------------

xapian-letor has support for scorer metrics to check the ranking quality of an LTR model. The ranking quality score is calculated from the relevance labels of the ranked documents, as obtained from the Qrel file. Currently we support the following quality metric:

1. `Normalised Discounted Cumulative Gain (NDCG) measure <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`_

To score your model using the xapian-letor API, you need to make sure that you use the same Ranker that you used to train the model, the same set of features that were used to generate the training file and rank the documents, and the same model key that was used to train the model. By default the "NDCG" scorer is used as the scorer_type, and since we currently have only one scorer, that is the only valid string allowed as scorer_type. By default all features are used for scoring:

.. code-block:: c++

    Xapian::ListNETRanker ranker;
    ranker.set_database_path(db_path);
    ranker.set_query(query);
    // The model key is optional: ranker.train_model(trainingfile) uses a
    // default key.
    ranker.train_model(trainingfile, model_key);
    ranker.rank(mset, model_key, flist);
    ranker.score(query, qrel, model_key, outputfile_path, msetsize, scorer_type, flist);

Make sure that you use the same LTR algorithm (Ranker) and the same set of features (via Xapian::FeatureList) that were used while preparing the model you are evaluating, otherwise an exception will be thrown. The Ranker::score() method reports the score for each query in the query file and an average score over all the queries. The results are saved at <outputfile_path>.