sltr_demo

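This is a demo of the elasticsearch-learning-to-rank plugin provided by o19s.

The original demo logs two features, which makes the final result harder to read, so this version slightly modifies the original demo to show the plugin's effect more directly.

Remember to adjust the file paths in the source code.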

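Background

Stock es uses only tokens as features and ranks results through the inverted index, which has some inherent limitations. In the example below, dress is an adjective in 2 but a noun in 3 and 4; the two usages belong to different scenarios, and tokens alone cannot tell them apart. Hence the need to bring machine learning into es.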

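How it works

How LTR works: LTR (Learning to Rank), also called MLR (Machine-Learned Ranking), is a branch of machine learning that applies ranking models to information retrieval systems.
How the LTR es plugin works: the plugin combines a ranking model (RankLib (https://sourceforge.net/p/lemur/wiki/RankLib/) or XGBoost) with Elasticsearch. The ranking model takes a judgment file as input and outputs a model in a human-readable format; training can be done programmatically or from the command line. Documents returned by the inverted index are then passed through the model to produce the final results, as shown in the figure below.
How the LTR es plugin is implemented: inside es, the plugin essentially uses rescoring to linearly combine the model score with the original query score. The query_weight and rescore_query_weight options of rescoring control the weight of each; see Rescoring.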
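To make the rescoring step concrete, here is a minimal sketch of a search body that rescores the top hits with an `sltr` query, together with the arithmetic of the linear combination. The model name `my_model`, the `keywords` parameter name, the window size and the weight values are assumptions chosen for illustration, and the arithmetic assumes the rescorer's default `score_mode` of `total`.

```python
def build_rescore_body(keywords, model_name="my_model",
                       query_weight=1.0, rescore_query_weight=2.0,
                       window_size=100):
    """Build a search body whose top hits are rescored by an LTR model.

    model_name, window_size and the weights are illustrative assumptions.
    """
    return {
        # First pass: the ordinary query served from the inverted index.
        "query": {"multi_match": {"query": keywords, "fields": ["overview"]}},
        # Second pass: rescore the top `window_size` hits with the model.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                },
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            },
        },
    }


def combined_score(query_score, model_score,
                   query_weight=1.0, rescore_query_weight=2.0):
    """Linear combination of the two scores (score_mode "total")."""
    return query_weight * query_score + rescore_query_weight * model_score
```

Setting `query_weight` to 0 would let the model score alone decide the order inside the rescore window.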

Key plugin concepts

feature store: A feature store corresponds to an independent LTR system: features, feature sets, models backed by a single index and cache. Generally, one feature store corresponds to one search problem, and usually to one application; for example, wikipedia and wiktionary would each have their own feature store.
feature set: a group of features.
feature: Elasticsearch LTR features correspond to Elasticsearch queries. The score of an Elasticsearch query, when run using the user's search terms (and other parameters), are the values you use in your training set. Put simply, the score of any es query can serve as an es ltr feature.
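As an illustration of "a feature is a query", the sketch below builds a feature set body containing a single mustache-templated `match` query. The feature name, the `overview` field and the `keywords` parameter are assumptions for this example and not necessarily what the demo's train.py defines.

```python
import json


def build_featureset(name, field="overview"):
    """Return a feature set body with one feature: a match query on `field`.

    The score of this query for a given `keywords` value is the feature value.
    """
    return {
        "featureset": {
            "name": name,
            "features": [
                {
                    "name": f"{field}_query",      # hypothetical feature name
                    "params": ["keywords"],        # filled in at query time
                    "template_language": "mustache",
                    "template": {"match": {field: "{{keywords}}"}},
                }
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_featureset("movie_features"), indent=2))
```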

Plugin workflow

Create the feature store: PUT /_ltr
Create a feature set: POST /_ltr/_featureset/movie_features
Run feature logging, i.e. record the score of each feature
Upload the trained model to es: POST _ltr/_featureset/movie_features/_createmodel
Search with an sltr query
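The REST calls in the steps above can be sketched as a list of (method, URL, body) tuples. This is only an outline under assumptions: the base URL, the model name `my_model` and the `model/ranklib` model type are illustrative, and the feature set body and model definition are passed in by the caller.

```python
def ltr_requests(featureset_body, model_definition,
                 base_url="http://localhost:9200",
                 featureset="movie_features",
                 model_name="my_model"):
    """List the REST calls of the workflow as (method, url, body) tuples."""
    model_body = {
        "model": {
            "name": model_name,  # hypothetical model name
            "model": {
                "type": "model/ranklib",         # assumes a RankLib model
                "definition": model_definition,  # contents of the model file
            },
        }
    }
    return [
        # 1. Create the default feature store.
        ("PUT", f"{base_url}/_ltr", None),
        # 2. Create the feature set.
        ("POST", f"{base_url}/_ltr/_featureset/{featureset}", featureset_body),
        # 3. Upload the trained model, bound to that feature set.
        ("POST", f"{base_url}/_ltr/_featureset/{featureset}/_createmodel",
         model_body),
    ]
```

Each tuple could then be sent with any HTTP client, for example `requests.request(method, url, json=body)`. Feature logging and training happen between steps 2 and 3.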

Model testing

Local environment: es 6.1.2 + ltr-1.0.0-es6.1.2.zip
Project location on hpc1: es /data/home/li****an/es/elasticsearch-6.1.2
To show the plugin's effect more directly, I modified the original demo.

`demo` usage steps

prepare.py

Downloads RankLib.jar and tmdb.json; the former is used to train the model, the latter is the dataset.

create_insert.py

Creates the index and inserts tmdb.json into it.
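
train.py

Creates the feature store with PUT http://localhost:9200/_ltr, where _ltr is the feature store name.
Creates the feature set with POST http://localhost:9200/_ltr/_featureset/movie_features, where movie_features is the feature set name.
Logs features against the feature set created above and combines them with the judgment data in sample_judgments.txt to produce the final training data, sample_judgments_with_score.txt.
Trains a model from sample_judgments_with_score.txt, writes it to model.txt, and uploads the model to es.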
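The logging step can be sketched roughly as follows. The first function builds a feature logging query in the form documented for the plugin (an `sltr` filter plus an `ltr_log` extension); the second formats one RankLib-style training line from a judgment and its logged feature values. Whether train.py builds exactly these requests is an assumption, as are the log spec name `main` and the `keywords` parameter.

```python
def build_logging_query(doc_ids, keywords, featureset="movie_features"):
    """Build a query that logs feature values for the judged documents."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict the search to the documents that were judged.
                    {"terms": {"_id": doc_ids}},
                    # Run the feature set; as a filter it does not affect scores.
                    {"sltr": {
                        "_name": "logged_featureset",
                        "featureset": featureset,
                        "params": {"keywords": keywords},
                    }},
                ]
            }
        },
        "ext": {
            "ltr_log": {
                "log_specs": {
                    "name": "main",  # hypothetical log entry name
                    "named_query": "logged_featureset",
                }
            }
        },
    }


def training_line(grade, qid, feature_values, doc_id, title):
    """Format one training line: grade, qid, 1-based features, then a comment."""
    features = "\t".join(
        f"{i}:{value}" for i, value in enumerate(feature_values, start=1)
    )
    return f"{grade}\tqid:{qid}\t{features} # {doc_id}\t{title}"
```

With a single logged feature, `training_line(4, 1, [12.5], "61410", "Spud")` yields a line with grade 4, `qid:1`, the feature `1:12.5`, and the document id and title after the `#`.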

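search.py

Searches with an sltr query.

`demo` results

Goal: use the plugin to reverse the original ranking.
Original search results, querying the tmdb dataset with rambo as the keyword: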

query = {
    "query": {
        "multi_match": {
            "query": "rambo",
            "fields": ["overview"]
        }
    }
}

The results are:

Rambo III 1370
First Blood 1368
Rambo: First Blood Part II 1369
Rambo 7555
In the Line of Duty: The F.B.I. Murders 31362
Son of Rambow 13258
Spud 61410

Building the judgment data

# `7555`, originally highly relevant, is graded 0 (irrelevant);
# `61410`, originally irrelevant, is graded 4 (highly relevant);
0	qid:1 #	7555	Rambo
1	qid:1 #	1370	Rambo III
1	qid:1 #	1369	Rambo: First Blood Part II
2	qid:1 #	1368	First Blood
3	qid:1 #	31362	In the Line of Duty: The F.B.I. Murders
4	qid:1 #	13258	Son of Rambow
4	qid:1 #	61410	Spud
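For reference, judgment lines in the layout shown above can be read with a small parser like the following sketch. The `Judgment` type and `parse_judgment` function are hypothetical helpers written for this illustration, not part of the demo scripts.

```python
import re
from typing import NamedTuple, Optional


class Judgment(NamedTuple):
    grade: int    # 0 (irrelevant) to 4 (highly relevant)
    qid: int      # query id
    doc_id: str   # tmdb document id
    title: str    # movie title, kept for readability only


# grade, "qid:N", "#", doc id, then the rest of the line as the title.
_LINE = re.compile(r"^(\d+)\s+qid:(\d+)\s+#\s+(\S+)\s+(.*)$")


def parse_judgment(line: str) -> Optional[Judgment]:
    """Parse one judgment line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized judgment line: {line!r}")
    grade, qid, doc_id, title = match.groups()
    return Judgment(int(grade), int(qid), doc_id, title)
```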

Search results after uploading the trained model to es

# `Son of Rambow`, originally irrelevant, now ranks first
Son of Rambow 13258
Spud 61410
In the Line of Duty: The F.B.I. Murders 31362
First Blood 1368
Rambo III 1370
Rambo: First Blood Part II 1369
Rambo 7555

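Difficulties encountered with the plugin: it does not support `es 6.0.0`, and it cannot run on the same `es` version as the existing `jieba` plugin

The es versions supported by this plugin are shown below: The es versions used by the jieba plugin are shown below: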

Pitfalls

Versions

python train.py command is throwing error · Issue #123 · o19s/elasticsearch-learning-to-rank · GitHub

ranklib

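The first time I ran python train.py it failed with an error mentioning pool-1-thread-1, the same as in this issue; later I happened to run it again and it succeeded.

Finally

After es 7.0, support for mapping types will be removed, so o19s will probably have their hands full again.

Embrace open source :)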
