The KnowAlpha: Automatically Recommending Useful Information to Programmers through Semantic Understanding
This document provides instructions that will guide you to use the source code in this project to build KnowAlpha, and then deploy it in practice.
Instruction for building the KnowAlpha recommender system from source code and executing experiments
- Prepare system environment
- Models
- Data Pipeline
- Build Models
- Deploy System
- Evaluation Results
Minimum configuration of machines
- RAM: 512G
- CPU: 56 logical cores
- Disk: 1TB+
- GPU: 4X Tesla V100 (32G X 4)
Install python environment
We developed the whole system in python, so we recommend installing an Anaconda virtual python3.6 environment; Anaconda is available at https://www.anaconda.com/
Install MongoDB Database
Install the MongoDB database on a machine running linux, and configure the db ip and port according to the instructions at https://www.mongodb.com/. To enable fast retrieval of the data, install Elasticsearch according to the instructions at https://www.elastic.co/.
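Once both services are installed, a minimal connectivity check can be run from python. The host names and ports below are assumptions; replace them with the values you configured during installation:

```python
# Minimal connectivity check for MongoDB and Elasticsearch.
# The localhost hosts/ports are assumptions -- use your own configuration.
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient("mongodb://localhost:27017/")
mongo.admin.command("ping")  # raises an exception if MongoDB is unreachable
print("MongoDB OK")

es = Elasticsearch(["http://localhost:9200"])
print("Elasticsearch OK" if es.ping() else "Elasticsearch unreachable")
```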
Required python packages
machine learning:
scikit-learn, tensorflow, openNMT, texar, pytorch, networkx, sumeval, sumy, TextBlob, bert-as-service
data preprocessing:
pymongo, numpy, pandas
Project Check
- Prepare Data
- The mentioned neural network models are in ProgrammingAlpha/programmingalpha/models.
- Run the scripts in ProgrammingAlpha/test/db_test/ folder to prepare training data.
- Run the scripts in ProgrammingAlpha/test/retriever_test/ folder to build the model mentioned in KnowAlpha.
Download the data dump from archive.org. Our training data currently comes from 4 online Q&A forums: Stack Overflow, Artificial Intelligence, Cross Validated and Data Science.
- Build a MongoDB cluster and load all the needed data into the database. Then deploy the Elasticsearch engine on top of your database cluster.
- Make the dirs listed in ProgrammingAlpha/programmingalpha/__init__.py.
As the project is heavily based on several openly released pretrained models, we at least need to prepare the BERT models according to the instructions at https://github.com/google-research/bert (tensorflow version) and https://github.com/huggingface/pytorch-pretrained-BERT (pytorch version). Store the pretrained model weights and auxiliary data of the BERT model in the dirs BertBasePath or BertLargePath mentioned in ProgrammingAlpha/programmingalpha/__init__.py.
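As a quick sanity check that the weights were stored correctly, a minimal loading sketch with the pytorch-pretrained-BERT package; the path placeholder below stands for whatever you configured as BertBasePath:

```python
# Sanity check: load the stored BERT weights with pytorch-pretrained-BERT.
# BERT_BASE_PATH is a placeholder -- point it at the dir you configured
# as BertBasePath in ProgrammingAlpha/programmingalpha/__init__.py.
from pytorch_pretrained_bert import BertTokenizer, BertModel

BERT_BASE_PATH = "/path/to/BertBasePath"
tokenizer = BertTokenizer.from_pretrained(BERT_BASE_PATH)
model = BertModel.from_pretrained(BERT_BASE_PATH)
print(tokenizer.tokenize("how to fine-tune bert for text classification?"))
```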
Data Analysis and Link Analysis
- Run ProgrammingAlpha/test/associationAlg_test/seedSearchForTags.py to analyze the AI-related tags and use association mining to find all required posts data.
- Run ProgrammingAlpha/test/graphLinke_test/build_link_path.py to build the posts link graph. If you have a spark cluster, you can speed up the computation by running ProgrammingAlpha/test/graphLinke_test/spark-graph.py; or you can run ProgrammingAlpha/test/graphLinke_test/extract_link_semi_path.py to build an incomplete graph for a quick test.
- Extract link-distance post pairs: run ProgrammingAlpha/test/graphLinke_test/build_label_pair.py to generate "link distance + posts ids(1+2)" data records, which are later used to generate the inference task data (a sketch of the link-distance notion follows this list).
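For illustration, the link distance between two posts is their shortest-path length in the posts link graph; a minimal networkx sketch (the edge list and the distance cap below are hypothetical, not the project's actual settings):

```python
# Illustration of link distance: the shortest-path length between two
# posts in the link graph. The edges are hypothetical post-id pairs.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(101, 102), (102, 103), (103, 104), (101, 105)])

def link_distance(g, a, b, cap=3):
    """Shortest-path distance between posts a and b, capped at `cap`
    for distant or unreachable pairs (the cap value is an assumption)."""
    try:
        return min(nx.shortest_path_length(g, a, b), cap)
    except nx.NetworkXNoPath:
        return cap

print(link_distance(G, 101, 104))  # 3
print(link_distance(G, 104, 105))  # 3 (actual distance 4, capped)
```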
Training Data for KnowAlpha
- Run ProgrammingAlpha/test/db_test/gen_corpus_inference.py and push the generated corpus to the MongoDB cluster.
- Run ProgrammingAlpha/test/db_test/gen_samples.py with task parameter as 'inference' to sample training and validating data.
- Preprocess the generated samples by running ProgrammingAlpha/test/tokenizer_test/tokenize_corpus.py.
Build Local Knowledge Base
- Run ProgrammingAlpha/test/db_test/buildQAIndexer.py first to gather all answers to each question.
- Run ProgrammingAlpha/test/db_test/gen_kwnowledge_unit.py to generate the knowledge unit data used by KnowAlpha.
- Push the knowledge unit data to the MongoDB cluster (a minimal pymongo sketch follows below).
After running all the above scripts, the system is ready for model training.
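As an illustration of the final push step, a minimal pymongo sketch; the database, collection, and field names here are hypothetical, so adapt them to whatever gen_kwnowledge_unit.py actually emits:

```python
# Hypothetical sketch of pushing knowledge units into MongoDB.
# Database/collection/field names are assumptions, not the project's
# actual schema -- adapt them to the output of gen_kwnowledge_unit.py.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["knowledgebase"]["knowledge_units"]

units = [
    {"question_id": 101, "question": "How to fine-tune BERT?", "answers": ["..."]},
    {"question_id": 102, "question": "What is dropout?", "answers": ["..."]},
]
collection.insert_many(units)
print(collection.count_documents({}))
```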
Document Search Engine
- The document search engine is in the ProgrammingAlpha/programmingalpha/retrievers/SearchEngine folder.
- Follow requirement.txt and see run.py for how to use the doc search engine (a minimal query sketch follows this list).
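For reference, a minimal Elasticsearch query of the kind such a doc search engine issues. The index name and field below are assumptions; check run.py for the project's actual usage:

```python
# Hypothetical full-text query against the Elasticsearch index.
# The index name "posts" and field "Body" are assumptions -- see
# run.py in the SearchEngine folder for the real names.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
resp = es.search(
    index="posts",
    body={"query": {"match": {"Body": "how to prevent overfitting"}}, "size": 10},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```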
Build Knowledge Inference Model
- Run ProgrammingAlpha/test/retriever_test/build_linkprediction_model.py to train the Knowledge Inference Net (a minimal classifier sketch follows this list).
- Other Inference Networks are available at https://github.com/asyml/texar/tree/master/examples/sentence_classifier and https://github.com/zhangzhenyu13/ATEC_NLP.
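Conceptually, the Knowledge Inference Net treats link-distance prediction as sentence-pair classification over two posts. A minimal sketch with pytorch-pretrained-BERT; the 4-class label set is an assumption about how link distances are bucketed, so see build_linkprediction_model.py for the actual training code:

```python
# Minimal sketch of a BERT sentence-pair classifier for link-distance
# prediction. num_labels=4 (four distance buckets) is an assumption.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
model.eval()

q1, q2 = "how to tune learning rate?", "choosing learning rate for adam"
t1, t2 = tokenizer.tokenize(q1), tokenizer.tokenize(q2)
tokens = ["[CLS]"] + t1 + ["[SEP]"] + t2 + ["[SEP]"]
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
seg = torch.tensor([[0] * (len(t1) + 2) + [1] * (len(t2) + 1)])

with torch.no_grad():
    logits = model(ids, token_type_ids=seg)  # [1, 4] link-distance scores
print(logits.argmax(dim=1).item())
```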
Evaluate KnowAlpha
- Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task inference" to generate the test samples.
- Run ProgrammingAlpha/test/retriever_test/run_model.sh --do_eval to predict the link distances directly; this measures the model's performance on the test samples.
- Run ProgrammingAlpha/test/retriever_test/interactive.py with the input stream redirected from a file containing the post ids of the test samples; this evaluates KnowAlpha end to end.
- 1) Use the sklearn metrics toolkit to evaluate the model performance of the Inference Net (a short sketch follows this list); 2) refer to https://github.com/microsoft/recommenders for evaluating the retrieved results of KnowAlpha.
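A short sketch of the sklearn evaluation mentioned in 1), assuming the predicted and true link-distance labels have been collected into two lists:

```python
# Evaluating the Inference Net predictions with scikit-learn.
# y_true / y_pred are placeholder labels; in practice they come
# from the --do_eval run over the 2000 test samples.
from sklearn.metrics import accuracy_score, f1_score, classification_report

y_true = [0, 1, 2, 3, 1, 0, 2]
y_pred = [0, 1, 2, 2, 1, 0, 3]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))
```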
..........................................................
Instruction for building the AnsAlpha answer generation system from source code and executing experiments
The system environment, MongoDB/Elasticsearch setup, required python packages, data dump download, and pretrained BERT preparation are the same as described for KnowAlpha above. Additional project checks:
- The evaluation metric tool APIs are in ProgrammingAlpha/programmingalpha/Utility/metrics.py.
- Run the scripts in ProgrammingAlpha/test/text_generation_test/ to build the model mentioned in AnsAlpha.
- After downloading the java crawler maven project, use IntelliJ IDEA (https://www.jetbrains.com/idea/) to build and deploy the crawler jar package on your machine.
Training Data for AnsAlpha
- Run ProgrammingAlpha/test/db_test/gen_corpus_seq2seq.py and push the generated corpus to the MongoDB cluster.
- Run ProgrammingAlpha/test/db_test/gen_samples.py with task parameter as 'seq2seq' to sample training and validating data.
- Leverage the code snippets in the OpenNMT package to generate training data. Instructions can be found at http://opennmt.net/OpenNMT-py/options/preprocess.html (a file-preparation sketch follows this list).
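Before running the OpenNMT preprocessing, the sampled question/answer pairs must be written as line-aligned parallel text files. A minimal hypothetical sketch; the record field names and output file names are assumptions, not the project's actual schema:

```python
# Hypothetical sketch: write sampled Q/A pairs as the line-aligned
# src/tgt text files that OpenNMT-py preprocessing expects.
# The field names ("question", "answer") are assumptions.
samples = [
    {"question": "how to freeze bert layers during fine-tuning?",
     "answer": "set requires_grad to False for the parameters you want frozen."},
]

with open("train.src.txt", "w") as src, open("train.tgt.txt", "w") as tgt:
    for s in samples:
        # one example per line; strip newlines so line alignment is preserved
        src.write(s["question"].replace("\n", " ") + "\n")
        tgt.write(s["answer"].replace("\n", " ") + "\n")
```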
Build Text Generation Models (e.g. AnswerNet)
- Run ProgrammingAlpha/test/text_generation_test/build_copy_transformer.py to begin teacher forcing training of AnswerNet.
- Run ProgrammingAlpha/test/text_generation_test/build_rl_transformer.py to start training AnswerNet using reinforcement learning.
- To train a text generation model with other networks, a quick start guide is available at http://opennmt.net/OpenNMT-py/options/train.html.
- Other optional networks for text generation are also available at https://github.com/asyml/texar.
Evaluate AnsAlpha
- Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task seq2seq", or sample unsolved questions via ProgrammingAlpha/test/db_test/unsolved_seq2seq.py. Alternatively, you can directly invoke the Google Custom Search Engine after including the 4 online forums mentioned before.
- After training AnswerNet and the other text generation models, use ProgrammingAlpha/test/text_generation_test/run_inference.sh or ProgrammingAlpha/test/text_generation_test/transformerinference.py to generate answers to the sampled questions.
- Run ProgrammingAlpha/test/utilities_test/computeScore.py true_answers.file generated_answers.file to get the evaluation BLEU/ROUGE-2 scores (a sumeval scoring sketch follows the table below).
- We also conducted a simple user survey via an online questionnaire at https://wj.qq.com/s2/3597786/b668/. The counts of user ratings (on a scale from +2 down to -2) for each question are listed below.
Id | +2 | +1 | 0 | -1 | -2 | mean | std. dev. |
---|---|---|---|---|---|---|---|
1 | 17 | 8 | 1 | 0 | 1 | 1.481 | 0.768 |
2 | 3 | 11 | 9 | 1 | 3 | 0.37 | 1.196 |
3 | 0 | 7 | 10 | 5 | 5 | -0.296 | 1.097 |
4 | 1 | 5 | 7 | 10 | 4 | -0.407 | 1.13 |
5 | 24 | 3 | 0 | 0 | 0 | 1.888 | 0.098 |
6 | 13 | 11 | 1 | 1 | 1 | 1.259 | 0.932 |
7 | 13 | 10 | 4 | 0 | 0 | 1.333 | 0.518 |
8 | 6 | 8 | 7 | 4 | 2 | 0.444 | 1.432 |
9 | 19 | 6 | 1 | 0 | 1 | 1.555 | 0.765 |
10 | 3 | 14 | 7 | 2 | 1 | 0.592 | 0.834 |
total | 99 | 83 | 47 | 23 | 18 | 0.822 | 0.877 |
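For reference, the BLEU/ROUGE-2 scores mentioned above can also be computed with the sumeval package from the required-packages list. A minimal sketch; the answer strings below are placeholders:

```python
# Minimal BLEU / ROUGE-2 scoring sketch with sumeval (one of the
# required packages); the answer strings are placeholders.
from sumeval.metrics.bleu import BLEUCalculator
from sumeval.metrics.rouge import RougeCalculator

generated = "you can use dropout to reduce overfitting"
reference = "dropout is a common way to reduce overfitting"

bleu = BLEUCalculator().bleu(generated, reference)
rouge2 = RougeCalculator().rouge_n(summary=generated, references=[reference], n=2)
print("BLEU:", bleu, "ROUGE-2:", rouge2)
```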
..........................................................
User Interface
- We currently implement a simple web page, which is available at https://github.com/zhangzhenyu13/ProgrammingAlpha/tree/master/alphaservices/AlphaWeb.
- To start the web page service: cd /path/to/repo/programmingalpha/alphaservices/AlphaWeb, then run start_server_web.sh with bash.
- To start the backend services: run /path/to/repo/test/servers_test.sh with bash.
- Make sure the service config files are set properly so that each service can be routed properly by the portal backend service. These config files are contained in /path/to/repo/ConfigData.
- After finishing the above steps, you can access the service online at ip:port/webServices/alpha-QA, where ip and port are set via the config files (a programmatic example follows this list).
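Once the services are up, a hedged sketch of calling the portal programmatically; the ip/port placeholders and the JSON payload shape are assumptions, so inspect the AlphaWeb page for the real form fields:

```python
# Hypothetical programmatic call to the deployed portal. The ip/port
# come from your ConfigData files; the payload shape is an assumption.
import requests

IP, PORT = "127.0.0.1", 5000  # placeholders; use your configured values
url = f"http://{IP}:{PORT}/webServices/alpha-QA"
resp = requests.post(url, json={"question": "how to tune the learning rate?"})
print(resp.status_code, resp.text[:200])
```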
Citation
@INPROCEEDINGS{knowAlpha,
  author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
  title={The KnowAlpha: Finding Useful Information to Programmers through Semantic Understanding},
  year={2019},
  url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}
@INPROCEEDINGS{ansAlpha,
  author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
  title={AnsAlpha: Towards Automatic Answering of Developers’ Questions through Comprehension and Generation},
  year={2019},
  url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}