Commit 95d3460 by zhigwang, committed Jan 29, 2018 (1 parent: 7dddc92).
Showing 11 changed files with 960 additions and 1,521 deletions.
# BiMPM: Bilateral Multi-Perspective Matching for Natural Language Sentences

## Updates (Jan 28, 2018)
* This repository has been updated to TensorFlow 1.4.
* The training process is 15+ times faster without losing accuracy.
* All code has been restructured for better readability and adaptability.

## Description
This repository includes the source code for natural language sentence matching.
Basically, the program takes two sentences as input and predicts a label for them.
You can use this program for tasks such as [paraphrase identification](https://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_%28State_of_the_art%29), [natural language inference](http://nlp.stanford.edu/projects/snli/), and [duplicate question identification](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs). More details about the underlying model can be found in our [paper](https://arxiv.org/pdf/1702.03814.pdf) published at IJCAI 2017. Please cite our paper when you use this program! :heart_eyes:

## Requirements
* python 2.7
* tensorflow 1.4

## Data format
Both the train and test sets require a tab-separated format.
Each line in the train (or test) file corresponds to an instance, and it should be arranged as

> label	sentence#1	sentence#2	instanceID

For more details about the data format, you can download the [SNLI](https://drive.google.com/file/d/1CxjKsaM6YgZPRKmJhNn7WcIC3gISehcS/view?usp=sharing) and [Quora Question Pair](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) datasets used in our [paper](https://arxiv.org/pdf/1702.03814.pdf).
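The four-field, tab-separated layout above is simple enough to produce and parse with a few lines of Python. This is a minimal sketch; the helper names `make_line` and `parse_line` are illustrative and not part of this repository:

```python
def make_line(label, sent1, sent2, instance_id):
    """Serialize one instance as: label <TAB> sentence#1 <TAB> sentence#2 <TAB> instanceID."""
    return "\t".join([label, sent1, sent2, instance_id])


def parse_line(line):
    """Split one instance line back into its four tab-separated fields."""
    label, sent1, sent2, instance_id = line.rstrip("\n").split("\t")
    return {"label": label, "sent1": sent1, "sent2": sent2, "id": instance_id}


line = make_line("1", "How can I learn Python ?",
                 "What is the best way to learn Python ?", "q1")
parsed = parse_line(line)
```

Note that sentences containing literal tab characters would break this format, so they should be normalized to spaces before serialization.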

## Training
You can find the training script at BiMPM/src/SentenceMatchTrainer.py

First, edit the configuration file at ${workspace}/BiMPM/configs/snli.sample.config (or ${workspace}/BiMPM/configs/quora.sample.config). You need to change "train\_path", "dev\_path", "word\_vec\_path", "model\_dir", and "suffix" to your own settings.

Second, launch the job with the following command line:

> python ${workspace}/BiMPM/src/SentenceMatchTrainer.py --config\_path ${workspace}/BiMPM/configs/snli.sample.config
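Editing the sample config programmatically can help avoid path typos when you run on several datasets. A hedged sketch, assuming only the JSON key names that appear in this commit's sample config files (`apply_overrides` and `customize_config` are hypothetical helpers, not part of this repository):

```python
import json

# The path/identity fields a user typically needs to change,
# per the README instructions above.
PATH_KEYS = ("train_path", "dev_path", "word_vec_path", "model_dir", "suffix")


def apply_overrides(config, overrides):
    """Return a copy of config with the user-specific fields replaced."""
    updated = dict(config)
    for key in PATH_KEYS:
        if key in overrides:
            updated[key] = overrides[key]
    return updated


def customize_config(sample_path, out_path, overrides):
    """Load a sample config, override its paths, and save a personal copy."""
    with open(sample_path) as f:
        config = json.load(f)
    config = apply_overrides(config, overrides)
    with open(out_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

All other hyperparameters (batch size, LSTM dimensions, matching options) are left at the sample values, which is a reasonable starting point before tuning.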
## Testing
You can find the testing script at BiMPM/src/SentenceMatchDecoder.py

> python ${workspace}/BiMPM/src/SentenceMatchDecoder.py --in\_path ${your\_path\_to}/dev.tsv --word\_vec\_path ${your\_path\_to}/wordvec.txt --out\_path ${your\_path\_to}/result.json --model\_prefix ${model\_dir}/SentenceMatch.${suffix}

where "model\_dir" and "suffix" are the variables set in your configuration file.

The output file is a JSON file with the following format.
```javascript
[
  {
    "ID": "instanceID",
    "truth": label,
    "sent1": sentence1,
    "sent2": sentence2,
    "prediction": prediction,
    "probs": probs_for_all_possible_labels
  },
  {
    "ID": "instanceID",
    "truth": label,
    "sent1": sentence1,
    "sent2": sentence2,
    "prediction": prediction,
    "probs": probs_for_all_possible_labels
  }
]
```

SentenceMatchDecoder.py can run in two modes:
* prediction: predict the label for each sentence pair
* probs: output the probabilities of all labels for each sentence pair

## Reporting issues
Please let [me](https://zhiguowang.github.io/) know if you encounter any problems.
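Because the decoder writes its predictions to a JSON file, scoring a run reduces to comparing each record's "truth" field against its "prediction" field. A sketch, assuming the records have been parsed into a list of dicts in the output format shown above (the `accuracy` helper is illustrative, not part of this repository):

```python
def accuracy(records):
    """Fraction of records whose predicted label matches the gold label.

    Each record is a dict with "truth" and "prediction" keys, as in the
    decoder's JSON output format.
    """
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["truth"] == r["prediction"])
    return correct / float(len(records))


# Example with two toy records: one correct, one wrong.
records = [
    {"ID": "q1", "truth": "1", "prediction": "1"},
    {"ID": "q2", "truth": "0", "prediction": "1"},
]
score = accuracy(records)  # 0.5
```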
New file: configs/quora.sample.config (50 lines)
```json
{
  "train_path": "/u/zhigwang/zhigwang1/sentence_match/quora/data/train.tsv",
  "dev_path": "/u/zhigwang/zhigwang1/sentence_match/quora/data/dev.tsv",
  "word_vec_path": "/u/zhigwang/zhigwang1/sentence_match/quora/wordvec.txt",
  "model_dir": "/u/zhigwang/zhigwang1/sentence_match/quora/logs",
  "suffix": "quora",
  "fix_word_vec": true,
  "isLower": true,
  "max_sent_length": 50,
  "max_char_per_word": 10,

  "with_char": true,
  "char_emb_dim": 20,
  "char_lstm_dim": 40,

  "batch_size": 60,
  "max_epochs": 20,
  "dropout_rate": 0.1,
  "learning_rate": 0.0005,
  "optimize_type": "adam",
  "lambda_l2": 0.0,
  "grad_clipper": 10.0,

  "context_layer_num": 1,
  "context_lstm_dim": 100,
  "aggregation_layer_num": 1,
  "aggregation_lstm_dim": 100,

  "with_full_match": true,
  "with_maxpool_match": false,
  "with_max_attentive_match": false,
  "with_attentive_match": true,

  "with_cosine": true,
  "with_mp_cosine": true,
  "cosine_MP_dim": 5,

  "att_dim": 50,
  "att_type": "symmetric",

  "highway_layer_num": 1,
  "with_highway": true,
  "with_match_highway": true,
  "with_aggregation_highway": true,

  "use_cudnn": true,

  "with_moving_average": false
}
```
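The "cosine_MP_dim" entry above sets the number of perspectives used by the multi-perspective cosine matching option ("with_mp_cosine"): each perspective reweights the two vectors element-wise with a learned weight vector before taking a cosine similarity. A pure-Python sketch of that operation, for illustration only (the repository's actual implementation is in TensorFlow, and the weights here are fixed rather than learned):

```python
import math


def mp_cosine(v1, v2, perspectives):
    """Multi-perspective cosine matching between two vectors.

    For each perspective weight vector W_k, reweight v1 and v2
    element-wise by W_k and compute their cosine similarity,
    yielding one matching score per perspective.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na > 0 and nb > 0 else 0.0

    return [cosine([w * x for w, x in zip(wk, v1)],
                   [w * y for w, y in zip(wk, v2)]) for wk in perspectives]


# Two perspectives (cosine_MP_dim = 2) over 3-dimensional vectors:
# the first keeps all dimensions, the second keeps only the first.
scores = mp_cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0],
                   [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
```

In training, each perspective's weight vector is a learned parameter, so different perspectives can attend to different dimensions of the contextual embeddings.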
New file: configs/snli.sample.config (49 lines)
```json
{
  "train_path": "/u/zhigwang/zhigwang1/sentence_match/snli/train.tsv",
  "dev_path": "/u/zhigwang/zhigwang1/sentence_match/snli/dev.tsv",
  "word_vec_path": "/u/zhigwang/zhigwang1/sentence_match/snli/wordvec.txt",
  "model_dir": "/u/zhigwang/zhigwang1/sentence_match/snli/logs",
  "suffix": "snli",
  "fix_word_vec": true,
  "isLower": true,
  "max_sent_length": 100,
  "max_char_per_word": 10,

  "with_char": true,
  "char_emb_dim": 20,
  "char_lstm_dim": 40,

  "batch_size": 100,
  "max_epochs": 10,
  "dropout_rate": 0.2,
  "learning_rate": 0.001,
  "optimize_type": "adam",
  "lambda_l2": 0.0,
  "grad_clipper": 10.0,

  "context_layer_num": 1,
  "context_lstm_dim": 100,
  "aggregation_layer_num": 1,
  "aggregation_lstm_dim": 100,

  "with_full_match": true,
  "with_maxpool_match": false,
  "with_max_attentive_match": false,
  "with_attentive_match": true,

  "with_cosine": true,
  "with_mp_cosine": true,
  "cosine_MP_dim": 5,

  "att_dim": 50,
  "att_type": "symmetric",

  "highway_layer_num": 1,
  "with_highway": true,
  "with_match_highway": true,
  "with_aggregation_highway": true,

  "use_cudnn": true,

  "with_moving_average": false
}
```
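Both sample configs share the same schema, so a small sanity check before launching a job can catch a missing key or an out-of-range value early. A hedged sketch; the required-key list and the specific checks are assumptions drawn from the two sample files above, not a validator shipped with this repository:

```python
# Keys every config is expected to carry, per the sample files above.
REQUIRED_KEYS = ("train_path", "dev_path", "word_vec_path", "model_dir", "suffix")


def check_config(config):
    """Return a list of problems found in a config dict; empty means OK.

    The checks are illustrative and based only on the sample configs.
    """
    problems = ["missing key: %s" % k for k in REQUIRED_KEYS if k not in config]
    rate = config.get("dropout_rate", 0.0)
    if not 0.0 <= rate < 1.0:
        problems.append("dropout_rate out of range: %r" % rate)
    if config.get("learning_rate", 0.001) <= 0:
        problems.append("learning_rate must be positive")
    return problems
```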