Commit 95d3460 by zhigwang, committed Jan 29, 2018 (1 parent: 7dddc92).
Showing 11 changed files with 960 additions and 1,521 deletions.
# BiMPM: Bilateral Multi-Perspective Matching for Natural Language Sentences

## Updates (Jan 28, 2018)
* This repository has been updated to TensorFlow 1.4.
* The training process is 15+ times faster without losing accuracy.
* All code has been restructured for better readability and adaptability.

## Description
This repository includes the source code for natural language sentence matching.
Basically, the program takes two sentences as input and predicts a label for them.
You can use this program for tasks such as [paraphrase identification](https://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_%28State_of_the_art%29), [natural language inference](http://nlp.stanford.edu/projects/snli/), and [duplicate question identification](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs). More details about the underlying model can be found in our [paper](https://arxiv.org/pdf/1702.03814.pdf) published at IJCAI 2017. Please cite our paper when you use this program! :heart_eyes:

## Requirements
* python 2.7
* tensorflow 1.4

## Data format
Both the train and test sets require a tab-separated format.
Each line in the train (or test) file corresponds to an instance, and it should be arranged as

> label	sentence#1	sentence#2	instanceID

For more details about the data format, you can download the [SNLI](https://drive.google.com/file/d/1CxjKsaM6YgZPRKmJhNn7WcIC3gISehcS/view?usp=sharing) and [Quora Question Pair](https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing) datasets used in our [paper](https://arxiv.org/pdf/1702.03814.pdf).
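The four-field, tab-separated layout above is simple enough to produce and parse with a few lines of Python. This is a minimal sketch; the helper names `make_line` and `parse_line` are illustrative and not part of this repository:

```python
def make_line(label, sent1, sent2, instance_id):
    """Serialize one instance as: label <TAB> sentence#1 <TAB> sentence#2 <TAB> instanceID."""
    return "\t".join([label, sent1, sent2, instance_id])


def parse_line(line):
    """Split one instance line back into its four tab-separated fields."""
    label, sent1, sent2, instance_id = line.rstrip("\n").split("\t")
    return {"label": label, "sent1": sent1, "sent2": sent2, "id": instance_id}


line = make_line("1", "How can I learn Python ?",
                 "What is the best way to learn Python ?", "q1")
parsed = parse_line(line)
```

Note that sentences containing literal tab characters would break this format, so they should be normalized to spaces before serialization.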

## Training
You can find the training script at BiMPM/src/SentenceMatchTrainer.py

First, edit the configuration file at ${workspace}/BiMPM/configs/snli.sample.config (or ${workspace}/BiMPM/configs/quora.sample.config). You need to change "train\_path", "dev\_path", "word\_vec\_path", "model\_dir", and "suffix" to your own settings.

Second, launch the job with the following command line:

> python ${workspace}/BiMPM/src/SentenceMatchTrainer.py --config\_path ${workspace}/BiMPM/configs/snli.sample.config
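Editing the sample config programmatically can help avoid path typos when you run on several datasets. A hedged sketch, assuming only the JSON key names that appear in this commit's sample config files (`apply_overrides` and `customize_config` are hypothetical helpers, not part of this repository):

```python
import json

# The path/identity fields a user typically needs to change,
# per the README instructions above.
PATH_KEYS = ("train_path", "dev_path", "word_vec_path", "model_dir", "suffix")


def apply_overrides(config, overrides):
    """Return a copy of config with the user-specific fields replaced."""
    updated = dict(config)
    for key in PATH_KEYS:
        if key in overrides:
            updated[key] = overrides[key]
    return updated


def customize_config(sample_path, out_path, overrides):
    """Load a sample config, override its paths, and save a personal copy."""
    with open(sample_path) as f:
        config = json.load(f)
    config = apply_overrides(config, overrides)
    with open(out_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

All other hyperparameters (batch size, LSTM dimensions, matching options) are left at the sample values, which is a reasonable starting point before tuning.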
## Testing
You can find the testing script at BiMPM/src/SentenceMatchDecoder.py

> python ${workspace}/BiMPM/src/SentenceMatchDecoder.py --in\_path ${your\_path\_to}/dev.tsv --word\_vec\_path ${your\_path\_to}/wordvec.txt --out\_path ${your\_path\_to}/result.json --model\_prefix ${model\_dir}/SentenceMatch.${suffix}

where "model\_dir" and "suffix" are the variables set in your configuration file.

The output file is a JSON file with the following format.
```javascript
[
  {
    "ID": "instanceID",
    "truth": label,
    "sent1": sentence1,
    "sent2": sentence2,
    "prediction": prediction,
    "probs": probs_for_all_possible_labels
  },
  {
    "ID": "instanceID",
    "truth": label,
    "sent1": sentence1,
    "sent2": sentence2,
    "prediction": prediction,
    "probs": probs_for_all_possible_labels
  }
]
```

SentenceMatchDecoder.py can run in two modes:
* prediction: predict the label for each sentence pair
* probs: output the probabilities of all labels for each sentence pair

## Reporting issues
Please let [me](https://zhiguowang.github.io/) know if you encounter any problems.
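Because the decoder writes its predictions to a JSON file, scoring a run reduces to comparing each record's "truth" field against its "prediction" field. A sketch, assuming the records have been parsed into a list of dicts in the output format shown above (the `accuracy` helper is illustrative, not part of this repository):

```python
def accuracy(records):
    """Fraction of records whose predicted label matches the gold label.

    Each record is a dict with "truth" and "prediction" keys, as in the
    decoder's JSON output format.
    """
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["truth"] == r["prediction"])
    return correct / float(len(records))


# Example with two toy records: one correct, one wrong.
records = [
    {"ID": "q1", "truth": "1", "prediction": "1"},
    {"ID": "q2", "truth": "0", "prediction": "1"},
]
score = accuracy(records)  # 0.5
```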
New file: configs/quora.sample.config (50 lines)
```json
{
  "train_path": "/u/zhigwang/zhigwang1/sentence_match/quora/data/train.tsv",
  "dev_path": "/u/zhigwang/zhigwang1/sentence_match/quora/data/dev.tsv",
  "word_vec_path": "/u/zhigwang/zhigwang1/sentence_match/quora/wordvec.txt",
  "model_dir": "/u/zhigwang/zhigwang1/sentence_match/quora/logs",
  "suffix": "quora",
  "fix_word_vec": true,
  "isLower": true,
  "max_sent_length": 50,
  "max_char_per_word": 10,

  "with_char": true,
  "char_emb_dim": 20,
  "char_lstm_dim": 40,

  "batch_size": 60,
  "max_epochs": 20,
  "dropout_rate": 0.1,
  "learning_rate": 0.0005,
  "optimize_type": "adam",
  "lambda_l2": 0.0,
  "grad_clipper": 10.0,

  "context_layer_num": 1,
  "context_lstm_dim": 100,
  "aggregation_layer_num": 1,
  "aggregation_lstm_dim": 100,

  "with_full_match": true,
  "with_maxpool_match": false,
  "with_max_attentive_match": false,
  "with_attentive_match": true,

  "with_cosine": true,
  "with_mp_cosine": true,
  "cosine_MP_dim": 5,

  "att_dim": 50,
  "att_type": "symmetric",

  "highway_layer_num": 1,
  "with_highway": true,
  "with_match_highway": true,
  "with_aggregation_highway": true,

  "use_cudnn": true,

  "with_moving_average": false
}
```
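The "cosine_MP_dim" entry above sets the number of perspectives used by the multi-perspective cosine matching option ("with_mp_cosine"): each perspective reweights the two vectors element-wise with a learned weight vector before taking a cosine similarity. A pure-Python sketch of that operation, for illustration only (the repository's actual implementation is in TensorFlow, and the weights here are fixed rather than learned):

```python
import math


def mp_cosine(v1, v2, perspectives):
    """Multi-perspective cosine matching between two vectors.

    For each perspective weight vector W_k, reweight v1 and v2
    element-wise by W_k and compute their cosine similarity,
    yielding one matching score per perspective.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na > 0 and nb > 0 else 0.0

    return [cosine([w * x for w, x in zip(wk, v1)],
                   [w * y for w, y in zip(wk, v2)]) for wk in perspectives]


# Two perspectives (cosine_MP_dim = 2) over 3-dimensional vectors:
# the first keeps all dimensions, the second keeps only the first.
scores = mp_cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0],
                   [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
```

In training, each perspective's weight vector is a learned parameter, so different perspectives can attend to different dimensions of the contextual embeddings.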
New file: configs/snli.sample.config (49 lines)
```json
{
  "train_path": "/u/zhigwang/zhigwang1/sentence_match/snli/train.tsv",
  "dev_path": "/u/zhigwang/zhigwang1/sentence_match/snli/dev.tsv",
  "word_vec_path": "/u/zhigwang/zhigwang1/sentence_match/snli/wordvec.txt",
  "model_dir": "/u/zhigwang/zhigwang1/sentence_match/snli/logs",
  "suffix": "snli",
  "fix_word_vec": true,
  "isLower": true,
  "max_sent_length": 100,
  "max_char_per_word": 10,

  "with_char": true,
  "char_emb_dim": 20,
  "char_lstm_dim": 40,

  "batch_size": 100,
  "max_epochs": 10,
  "dropout_rate": 0.2,
  "learning_rate": 0.001,
  "optimize_type": "adam",
  "lambda_l2": 0.0,
  "grad_clipper": 10.0,

  "context_layer_num": 1,
  "context_lstm_dim": 100,
  "aggregation_layer_num": 1,
  "aggregation_lstm_dim": 100,

  "with_full_match": true,
  "with_maxpool_match": false,
  "with_max_attentive_match": false,
  "with_attentive_match": true,

  "with_cosine": true,
  "with_mp_cosine": true,
  "cosine_MP_dim": 5,

  "att_dim": 50,
  "att_type": "symmetric",

  "highway_layer_num": 1,
  "with_highway": true,
  "with_match_highway": true,
  "with_aggregation_highway": true,

  "use_cudnn": true,

  "with_moving_average": false
}
```
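Both sample configs share the same schema, so a small sanity check before launching a job can catch a missing key or an out-of-range value early. A hedged sketch; the required-key list and the specific checks are assumptions drawn from the two sample files above, not a validator shipped with this repository:

```python
# Keys every config is expected to carry, per the sample files above.
REQUIRED_KEYS = ("train_path", "dev_path", "word_vec_path", "model_dir", "suffix")


def check_config(config):
    """Return a list of problems found in a config dict; empty means OK.

    The checks are illustrative and based only on the sample configs.
    """
    problems = ["missing key: %s" % k for k in REQUIRED_KEYS if k not in config]
    rate = config.get("dropout_rate", 0.0)
    if not 0.0 <= rate < 1.0:
        problems.append("dropout_rate out of range: %r" % rate)
    if config.get("learning_rate", 0.001) <= 0:
        problems.append("learning_rate must be positive")
    return problems
```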