Skip to content

RelTextRank: An Open Source Framework for Building Relational Syntactic-Semantic Text Pair Representation

Notifications You must be signed in to change notification settings

ktymoshenko/RelTextRank

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Installation

Prerequisites

The tool requires the following prerequisites:

  • Java 1.8+
  • Apache Maven > 3.3.9. Refer to http://maven.apache.org/install.html for the installation instructions
  • Python 2.7.x (required by the evaluation script and for reproducing the experiments on cross-pair kernels)

Installing

git clone https://github.com/iKernels/RelTextRank.git

Run the following command sequence to install the software and set the Java classpath.

cd ./RelTextRank

export JAVA_HOME=<path to your JDK distribution>

sh scripts/install/install.sh 

rm target/dependency/google-collections-1.0.jar 

export CLASSPATH=${CLASSPATH}:bin/:target/dependency/*:target/classes

mkdir logs

Setting up DKPRO

Additional DKPro resources for computing semantic Wikipedia and WordNet-based DKPro similarity featuress, if required. You do not need to do this, if you are going to generate the structures only following examples in this readme. Refer to https://dkpro.github.io/dkpro-similarity/settinguptheresources/ for the instructions on how to setup the following DKPro resources (Please, remember to set up the DKPRO_HOME environment variable as described in the installation instructions web-page):

  • WordNet Lexical Semantic Resource index. Follow all the official installation instructions, but substitute the original wordnet_properties.xml file supplied within WordNet resource graph archive, with the following file instead: https://raw.githubusercontent.com/dkpro/dkpro-lsr/master/de.tudarmstadt.ukp.dkpro.lexsemresource.wordnet-asl/src/main/resources/resource/WordNet_3/wordnet_properties.xml.

    We do not employ Wiktionary in the pipeline. Therefore, you need to remove the following lines from the resources.xml file (or, alternatively, you may download and install the Wiktionary resources as described in the DKPro installation instructions):

    <bean id="wiktionary-en" lazy-init="true" class="de.tudarmstadt.ukp.dkpro.lexsemresource.wiktionary.WiktionaryResource">
    <constructor-arg value="ENGLISH"/>
    <constructor-arg value="${DKPRO_HOME}/LexSemResources/Wiktionary/jwktl_0.15.2_en20100403"/>
    </bean>

    Addtionaly:

    • make sure to set up the correct paths to your WordNet resources in ${DKPRO_HOME}/LexSemResources/WordNet_3x/wordnet_properties.xml (check line 51).
    • check that that the WordNet entry in ${DKPRO_HOME}/de.tudarmstadt.ukp.dkpro.lexsemresource.core.ResourceFactory/resources.xml looks as follows:
      <bean id="wordnet3-en" lazy-init="true"
             class="de.tudarmstadt.ukp.dkpro.lexsemresource.wordnet.WordNetResource">
             <constructor-arg value="${DKPRO_HOME}/LexSemResources/WordNet_3x/wordnet_properties.xml"/>
      </bean>
  • Wikipedia Explicit Semantic Analysis index. If you want to be able to access to the full range of features available in this pipeline, please, download the precompiled the Wikipedia Explicit Semantic Analysis index (see the Explicit Semantic Analysis: Vector Indexes section of the DKPro installation instructions).

Reproducing the EMNLP 2018 results

If you want to reproduce the results from our most recent publication

please install the tool as above and follow the instructions at the wiki page.

Note that you need to set up the WordNet Lexical Semantic Resource index and the DKPRO_HOME environment variable if you wish to reproduce the EMNLP 18 results.

Running the experiments with structural representations

You can use the tool to build the structures employed in the following papers:

(Note, that the results might slightly differ from the results reported in the above works due to some minor changes in the code).

We show how to run the experiments on the example of the WikiQA dataset.

Downloading the WikiQA dataset

First, you need to download the WikiQA data from https://www.microsoft.com/en-us/download/details.aspx?id=52419. Then run the following commands from the root of the RelTextRank distribution.

export wikiqa_location=<folder to which you unpacked the WikiQa distribution>
mkdir data/wikiQA
python scripts/converters/wikiqa_convert.py ${wikiqa_location}/WikiQA-train.tsv data/wikiQA/WikiQA-train.questions.txt  data/wikiQA/WikiQA-train.tsv.resultset
python scripts/converters/wikiqa_convert.py ${wikiqa_location}/WikiQA-test.tsv data/wikiQA/WikiQA-test.questions.txt  data/wikiQA/WikiQA-test.tsv.resultset
python scripts/converters/wikiqa_convert.py ${wikiqa_location}/WikiQA-dev.tsv data/wikiQA/WikiQA-dev.questions.txt  data/wikiQA/WikiQA-dev.tsv.resultset

It may take a long time to train the pipeline on the full-scale data on a single machine. If you wish to train on the subset of data, run the following command to prepare the input file with the

input data:

python scripts/converters/extract_trainset_subset.py -i data/wikiQA/WikiQA-train.questions.txt -o  data/wikiQA/WikiQA-train.questions.toy.txt -p 0.3

Running experiments with a conveniency script

We provide a conveniency python script which generates a shell script which runs the end-to-end experiment in a specific configuration:

python scripts/experiment_launchers/experiment_launcher.py  

Below you may find examples of the commands to generate the shell scripts which will run end-to-end experiments with different structural representations and the Partial Tree Kernel SVM on WikiQA.

First, you need to set the corpus_name environment variable.

To train on full-scale data (will take time):

export corpus_name=wikiqa

OR

To train on toy data (should be fast but less accurate):

export corpus_name=wikiqa_toy

After you have set the variable, run one of the commands below.

  • CH, shallow chunk-pos based tree

    • bash python scripts/experiment_launchers/experiment_launcher.py -l ${corpus_name} -o scripts/generated_scripts -c CH -p "-t 5 -F 3 -C T -m 1000" -e it.unitn.nlpir.experiment.fqa.CHExperiment -suf T -s it.unitn.nlpir.system.core.ClassTextPairConversion -ate " -skipAllSame" -ade " -skipAllSame"
    • Here, T1 and T2 are both represented as shallow tree structures with lemmas as leaves, their POS-tags as their parent nodes. The POS- tag nodes are further grouped under chunk and sentence nodes. CH excludes punctuation marks and words not included into any chunk

  • DT1, a dependency-based tree represntation

    • bash python scripts/experiment_launchers/experiment_launcher.py -l ${corpus_name} -o scripts/generated_scripts -c DT1 -p "-t 5 -F 3 -C T -m 1000" -e it.unitn.nlpir.experiment.fqa.DT1Experiment -suf T -s it.unitn.nlpir.system.core.ClassTextPairConversion -ate " -skipAllSame" -ade " -skipAllSame"
    • A dependency tree in which grammatical relations become nodes and lemmas are located at the leaf level

  • DT2, a dependency-based tree representation

    • bash python scripts/experiment_launchers/experiment_launcher.py -l ${corpus_name} -o scripts/generated_scripts -c DT2 -p "-t 5 -F 3 -C T -m 1000" -e it.unitn.nlpir.experiment.fqa.DT2Experiment -suf T -s it.unitn.nlpir.system.core.ClassTextPairConversion -ate " -skipAllSame" -ade " -skipAllSame"
    • DT1 modified to include the chunking information, and lemmas in the same chunk are grouped under the same chunk node.

  • LCTQ-DT2A, a dependency-based tree representation

    • bash python scripts/experiment_launchers/experiment_launcher.py -l ${corpus_name} -o scripts/generated_scripts -c DT3q_DT2a -p "-t 5 -F 3 -C T -m 1000" -e it.unitn.nlpir.experiment.fqa.LCTqDT2aExperiment -suf T -s it.unitn.nlpir.system.core.ClassTextPairConversion -ate " -skipAllSame" -ade " -skipAllSame"
    • T2 is represented as DT2. T1 is represented as a lexical-centered dependency tree with the grammatical relation REL(head,child) represented as (head (child HEAD GR-REL POS-pos(head)). Here REL is a grammatical relation, head and child are the head and child lemmas in the relation, respectively, and pos(head) is the POS-tag of the head lemma. GR- and POS- tag in the node name indicates that the node is grammar relation or part-of-speech node, respectively.

  • CONST, a constituency-based tree representation

    • bash python scripts/experiment_launchers/experiment_launcher.py -l ${corpus_name} -o scripts/generated_scripts -c CONST -p "-t 5 -F 3 -C T -m 1000" -e it.unitn.nlpir.experiment.fqa.ConstExperiment -suf T -s it.unitn.nlpir.system.core.ClassTextPairConversion -ate " -skipAllSame" -ade " -skipAllSame"
    • Constituency tree

Below we provide the bash commands to generate shell scripts which will run the commands end-to-end:

The above commands will output something like the following: nohup sh scripts/generated_scripts/<experiment_script_name>.sh > logs/<experiment_script_name>.log 2>&1 & Launch it to run an end-to-end experiment. In order to see perfomance simply do:

tail -11 logs/<experiment_script_name>.log

First table reports performance on the development data, and the second on the test data. REF_FILE is the upper bound of performance. SVM is the performance of your system.

You may see the expected performance of the below scrips in this google spreadsheet.

Using the performance evaluation script

Use the following script to evaluate the peformance:

python scripts/eval/ev.py --ignore_noanswer --ignore_allanswer data/examples/<experimental_folder>/svm.relevancy data/examples/<experimental_folder>/<predictions_file>

Note that svm.relevancy and <predictions_file> should be aligned line-by-line.

Visualizing the structural representations demo

If you want to build and visualize structural representations for two input texts, run the following:

java -Xmx4G it.unitn.nlpir.system.demo.TextPairRepresentationDemo -expClassName it.unitn.nlpir.experiment.fqa.<structure_generation_class_name>

For example, if you want to see a CONST structure, you may run:

java -Xmx4G it.unitn.nlpir.system.demo.TextPairRepresentationDemo -expClassName it.unitn.nlpir.experiment.fqa.ConstExperiment

After initialization, the interactive prompt with ask you to enter question and answer delimited by a tab and press Enter. The demo will then generate the pseudo-code for the structural representations of your input pairs.

For example, if you enter:

What is the capital of Italy?   Rome is the capital.

the demo will output:

[main] INFO it.unitn.nlpir.experiment.fqa.TrecQAWithQCExperiment - [ROOT [ROOT [SBARQ [WHNP [WP [what::w]]] [SQ [VBZ [be::v]] [NP [REL-FOCUS-LOC-NP [DT [the::d]] [REL-NN [capital::n]]] [PP [IN [of::i]] [NP [NNP [italy::n]]]]]] [. [?::.]]]]]       [ROOT [ROOT [S [REL-FOCUS-LOC-NP [NNP [rome::n]]] [VP [VBZ [be::v]] [REL-NP [DT [the::d]] [REL-NN [capital::n]]]] [. [.::.]]]]]
[main] INFO it.unitn.nlpir.system.demo.TextPairRepresentationDemo - Text1: (ROOT (ROOT (SBARQ (WHNP (WP (what::w))) (SQ (VBZ (be::v)) (NP (REL-FOCUS-LOC-NP (DT (the::d)) (REL-NN (capital::n))) (PP (IN (of::i)) (NP (NNP (italy::n)))))) (. (?::.)))))
[main] INFO it.unitn.nlpir.system.demo.TextPairRepresentationDemo - Text2: (ROOT (ROOT (S (REL-FOCUS-LOC-NP (NNP (rome::n))) (VP (VBZ (be::v)) (REL-NP (DT (the::d)) (REL-NN (capital::n)))) (. (.::.)))))

Copy-paste the first line, namely

[ROOT [ROOT [SBARQ [WHNP [WP [what::w]]] [SQ [VBZ [be::v]] [NP [REL-FOCUS-LOC-NP [DT [the::d]] [REL-NN [capital::n]]] [PP [IN [of::i]] [NP [NNP [italy::n]]]]]] [. [?::.]]]]]       [ROOT [ROOT [S [REL-FOCUS-LOC-NP [NNP [rome::n]]] [VP [VBZ [be::v]] [REL-NP [DT [the::d]] [REL-NN [capital::n]]]] [. [.::.]]]]]

to http://ironcreek.net/phpsyntaxtree/? and you will see the visualization of your question and answer trees.

License

This software is licensed under Apache 2.0 license.

About

RelTextRank: An Open Source Framework for Building Relational Syntactic-Semantic Text Pair Representation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 54.8%
  • C 32.6%
  • Python 11.2%
  • Other 1.4%