The KnowAlpha: Automatically Recommending Useful Information to Programmers through Semantic Understanding
This document provides instructions that will guide you to use the source code in this project to build KnowAlpha, and then deploy it in practice.
Instruction for building the KnowAlpha recommender system from source code and executing experiments
- Prepare system environment
- Models
- Data Pipeline
- Build Models
- Deploy System
- Evaluation Results
Minimum configuration of machines
- RAM: 512G
- CPU: 56 logical cores
- Disk: 1TB+
- GPU: 4X Tesla V100 (32G X 4)
Install python environment
We developed the whole system in python, so we recommend installing an Anaconda virtual python3.6 environment; Anaconda is available at https://www.anaconda.com/
Install MongoDB Database
Install the MongoDB database on a machine running linux, and configure the db ip and port according to the instructions at https://www.mongodb.com/. To enable fast retrieval of the data, install Elasticsearch according to the instructions at https://www.elastic.co/.
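Once both services are installed, a minimal connectivity check can be run from python. The host names and ports below are assumptions; replace them with the values you configured during installation:

```python
# Minimal connectivity check for MongoDB and Elasticsearch.
# The localhost hosts/ports are assumptions -- use your own configuration.
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient("mongodb://localhost:27017/")
mongo.admin.command("ping")  # raises an exception if MongoDB is unreachable
print("MongoDB OK")

es = Elasticsearch(["http://localhost:9200"])
print("Elasticsearch OK" if es.ping() else "Elasticsearch unreachable")
```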
Required python packages
machine learning:
scikit-learn, tensorflow, openNMT, texar, pytorch, networkx, sumeval, sumy, TextBlob, bert-as-service
data preprocessing:
pymongo, numpy, pandas
Project Check
- Prepare Data
- The mentioned neural network models are in ProgrammingAlpha/programmingalpha/models.
- Run the scripts in ProgrammingAlpha/test/db_test/ folder to prepare training data.
- Run the scripts in ProgrammingAlpha/test/retriever_test/ folder to build the model mentioned in KnowAlpha.
Download the data dump from archive.org. Our training data currently comes from 4 online Q&A forums: Stack Overflow, Artificial Intelligence, Cross Validated and Data Science.
- Build a MongoDB cluster and load all the needed data into the database. Then deploy the Elasticsearch engine on top of your database cluster.
- Make the dirs listed in ProgrammingAlpha/programmingalpha/__init__.py.
As the project is heavily based on several openly released pretrained models, we at least need to prepare the BERT models according to the instructions at https://github.com/google-research/bert (tensorflow version) and https://github.com/huggingface/pytorch-pretrained-BERT (pytorch version). Store the pretrained model weights and auxiliary data of the BERT model in the dirs BertBasePath or BertLargePath mentioned in ProgrammingAlpha/programmingalpha/__init__.py.
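As a quick sanity check that the weights were stored correctly, a minimal loading sketch with the pytorch-pretrained-BERT package; the path placeholder below stands for whatever you configured as BertBasePath:

```python
# Sanity check: load the stored BERT weights with pytorch-pretrained-BERT.
# BERT_BASE_PATH is a placeholder -- point it at the dir you configured
# as BertBasePath in ProgrammingAlpha/programmingalpha/__init__.py.
from pytorch_pretrained_bert import BertTokenizer, BertModel

BERT_BASE_PATH = "/path/to/BertBasePath"
tokenizer = BertTokenizer.from_pretrained(BERT_BASE_PATH)
model = BertModel.from_pretrained(BERT_BASE_PATH)
print(tokenizer.tokenize("how to fine-tune bert for text classification?"))
```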
Data Analysis and Link Analysis
- Run ProgrammingAlpha/test/associationAlg_test/seedSearchForTags.py to analyze the AI-related tags and use association mining to find all required posts data.
- Run ProgrammingAlpha/test/graphLinke_test/build_link_path.py to build the posts link graph. If you have a spark cluster, you can speed up the computation by running ProgrammingAlpha/test/graphLinke_test/spark-graph.py; or you can run ProgrammingAlpha/test/graphLinke_test/extract_link_semi_path.py to build an incomplete graph for a quick test.
- Extract link-distance post pairs: run ProgrammingAlpha/test/graphLinke_test/build_label_pair.py to generate "link distance + posts ids(1+2)" data records, which are later used to generate the inference task data (a sketch of the link-distance notion follows this list).
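For illustration, the link distance between two posts is their shortest-path length in the posts link graph; a minimal networkx sketch (the edge list and the distance cap below are hypothetical, not the project's actual settings):

```python
# Illustration of link distance: the shortest-path length between two
# posts in the link graph. The edges are hypothetical post-id pairs.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(101, 102), (102, 103), (103, 104), (101, 105)])

def link_distance(g, a, b, cap=3):
    """Shortest-path distance between posts a and b, capped at `cap`
    for distant or unreachable pairs (the cap value is an assumption)."""
    try:
        return min(nx.shortest_path_length(g, a, b), cap)
    except nx.NetworkXNoPath:
        return cap

print(link_distance(G, 101, 104))  # 3
print(link_distance(G, 104, 105))  # 3 (actual distance 4, capped)
```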
Training Data for KnowAlpha
- Run ProgrammingAlpha/test/db_test/gen_corpus_inference.py and push the generated corpus to the MongoDB cluster.
- Run ProgrammingAlpha/test/db_test/gen_samples.py with task parameter as 'inference' to sample training and validating data.
- Preprocess the generated samples by running ProgrammingAlpha/test/tokenizer_test/tokenize_corpus.py.
Build Local Knowledge Base
- Run ProgrammingAlpha/test/db_test/buildQAIndexer.py first to gather all answers to each question.
- Run ProgrammingAlpha/test/db_test/gen_kwnowledge_unit.py to generate the knowledge unit data used by KnowAlpha.
- Push the knowledge unit data to the MongoDB cluster (a minimal pymongo sketch follows below).
After running all the above scripts, the system is ready for model training.
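As an illustration of the final push step, a minimal pymongo sketch; the database, collection, and field names here are hypothetical, so adapt them to whatever gen_kwnowledge_unit.py actually emits:

```python
# Hypothetical sketch of pushing knowledge units into MongoDB.
# Database/collection/field names are assumptions, not the project's
# actual schema -- adapt them to the output of gen_kwnowledge_unit.py.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["knowledgebase"]["knowledge_units"]

units = [
    {"question_id": 101, "question": "How to fine-tune BERT?", "answers": ["..."]},
    {"question_id": 102, "question": "What is dropout?", "answers": ["..."]},
]
collection.insert_many(units)
print(collection.count_documents({}))
```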
Document Search Engine
- The document search engine is in the ProgrammingAlpha/programmingalpha/retrievers/SearchEngine folder.
- Follow requirement.txt and see run.py for how to use the doc search engine (a minimal query sketch follows this list).
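For reference, a minimal Elasticsearch query of the kind such a doc search engine issues. The index name and field below are assumptions; check run.py for the project's actual usage:

```python
# Hypothetical full-text query against the Elasticsearch index.
# The index name "posts" and field "Body" are assumptions -- see
# run.py in the SearchEngine folder for the real names.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
resp = es.search(
    index="posts",
    body={"query": {"match": {"Body": "how to prevent overfitting"}}, "size": 10},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```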
Build Knowledge Inference Model
- Run ProgrammingAlpha/test/retriever_test/build_linkprediction_model.py to train the Knowledge Inference Net (a minimal classifier sketch follows this list).
- Other Inference Networks are available at https://github.com/asyml/texar/tree/master/examples/sentence_classifier and https://github.com/zhangzhenyu13/ATEC_NLP.
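Conceptually, the Knowledge Inference Net treats link-distance prediction as sentence-pair classification over two posts. A minimal sketch with pytorch-pretrained-BERT; the 4-class label set is an assumption about how link distances are bucketed, so see build_linkprediction_model.py for the actual training code:

```python
# Minimal sketch of a BERT sentence-pair classifier for link-distance
# prediction. num_labels=4 (four distance buckets) is an assumption.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
model.eval()

q1, q2 = "how to tune learning rate?", "choosing learning rate for adam"
t1, t2 = tokenizer.tokenize(q1), tokenizer.tokenize(q2)
tokens = ["[CLS]"] + t1 + ["[SEP]"] + t2 + ["[SEP]"]
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
seg = torch.tensor([[0] * (len(t1) + 2) + [1] * (len(t2) + 1)])

with torch.no_grad():
    logits = model(ids, token_type_ids=seg)  # [1, 4] link-distance scores
print(logits.argmax(dim=1).item())
```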
Evaluate KnowAlpha
- Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task inference" to generate the test samples.
- Run ProgrammingAlpha/test/retriever_test/run_model.sh --do_eval to predict the link distances directly; this measures the model's performance on the test samples.
- Run ProgrammingAlpha/test/retriever_test/interactive.py with the input stream redirected from a file containing the post ids of the test samples; this evaluates KnowAlpha end to end.
- 1) Use the sklearn metrics toolkit to evaluate the model performance of the Inference Net (a short sketch follows this list); 2) refer to https://github.com/microsoft/recommenders for evaluating the retrieved results of KnowAlpha.
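A short sketch of the sklearn evaluation mentioned in 1), assuming the predicted and true link-distance labels have been collected into two lists:

```python
# Evaluating the Inference Net predictions with scikit-learn.
# y_true / y_pred are placeholder labels; in practice they come
# from the --do_eval run over the 2000 test samples.
from sklearn.metrics import accuracy_score, f1_score, classification_report

y_true = [0, 1, 2, 3, 1, 0, 2]
y_pred = [0, 1, 2, 2, 1, 0, 3]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))
```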
..........................................................
Instruction for building the AnsAlpha answer generation system from source code and executing experiments
The system environment, MongoDB/Elasticsearch setup, required python packages, data dump download, and pretrained BERT preparation are the same as described for KnowAlpha above. Additional project checks:
- The evaluation metric tool APIs are in ProgrammingAlpha/programmingalpha/Utility/metrics.py.
- Run the scripts in ProgrammingAlpha/test/text_generation_test/ to build the model mentioned in AnsAlpha.
- After downloading the java crawler maven project, use IntelliJ IDEA (https://www.jetbrains.com/idea/) to build and deploy the crawler jar package on your machine.
Training Data for AnsAlpha
- Run ProgrammingAlpha/test/db_test/gen_corpus_seq2seq.py and push the generated corpus to the MongoDB cluster.
- Run ProgrammingAlpha/test/db_test/gen_samples.py with task parameter as 'seq2seq' to sample training and validating data.
- Leverage the code snippets in the OpenNMT package to generate training data. Instructions can be found at http://opennmt.net/OpenNMT-py/options/preprocess.html (a file-preparation sketch follows this list).
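Before running the OpenNMT preprocessing, the sampled question/answer pairs must be written as line-aligned parallel text files. A minimal hypothetical sketch; the record field names and output file names are assumptions, not the project's actual schema:

```python
# Hypothetical sketch: write sampled Q/A pairs as the line-aligned
# src/tgt text files that OpenNMT-py preprocessing expects.
# The field names ("question", "answer") are assumptions.
samples = [
    {"question": "how to freeze bert layers during fine-tuning?",
     "answer": "set requires_grad to False for the parameters you want frozen."},
]

with open("train.src.txt", "w") as src, open("train.tgt.txt", "w") as tgt:
    for s in samples:
        # one example per line; strip newlines so line alignment is preserved
        src.write(s["question"].replace("\n", " ") + "\n")
        tgt.write(s["answer"].replace("\n", " ") + "\n")
```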
Build Text Generation Models (e.g. AnswerNet)
- Run ProgrammingAlpha/test/text_generation_test/build_copy_transformer.py to begin teacher forcing training of AnswerNet.
- Run ProgrammingAlpha/test/text_generation_test/build_rl_transformer.py to start training AnswerNet using reinforcement learning.
- To train a text generation model with other networks, a quick start guide is available at http://opennmt.net/OpenNMT-py/options/train.html.
- Other optional networks for text generation are also available at https://github.com/asyml/texar.
Evaluate AnsAlpha
- Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task seq2seq", or sample unsolved questions via ProgrammingAlpha/test/db_test/unsolved_seq2seq.py. Alternatively, you can directly invoke the Google Custom Search Engine after including the 4 online forums mentioned before.
- After training AnswerNet and the other text generation models, use ProgrammingAlpha/test/text_generation_test/run_inference.sh or ProgrammingAlpha/test/text_generation_test/transformerinference.py to generate answers to the sampled questions.
- Run ProgrammingAlpha/test/utilities_test/computeScore.py true_answers.file generated_answers.file to get the evaluation BLEU/ROUGE-2 scores (a sumeval scoring sketch follows the table below).
- We also conducted a simple user survey via an online questionnaire at https://wj.qq.com/s2/3597786/b668/. The counts of user ratings (on a scale from +2 down to -2) for each question are listed below.
Id | +2 | +1 | 0 | -1 | -2 | mean | std. dev. |
---|---|---|---|---|---|---|---|
1 | 17 | 8 | 1 | 0 | 1 | 1.481 | 0.768 |
2 | 3 | 11 | 9 | 1 | 3 | 0.37 | 1.196 |
3 | 0 | 7 | 10 | 5 | 5 | -0.296 | 1.097 |
4 | 1 | 5 | 7 | 10 | 4 | -0.407 | 1.13 |
5 | 24 | 3 | 0 | 0 | 0 | 1.888 | 0.098 |
6 | 13 | 11 | 1 | 1 | 1 | 1.259 | 0.932 |
7 | 13 | 10 | 4 | 0 | 0 | 1.333 | 0.518 |
8 | 6 | 8 | 7 | 4 | 2 | 0.444 | 1.432 |
9 | 19 | 6 | 1 | 0 | 1 | 1.555 | 0.765 |
10 | 3 | 14 | 7 | 2 | 1 | 0.592 | 0.834 |
total | 99 | 83 | 47 | 23 | 18 | 0.822 | 0.877 |
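For reference, the BLEU/ROUGE-2 scores mentioned above can also be computed with the sumeval package from the required-packages list. A minimal sketch; the answer strings below are placeholders:

```python
# Minimal BLEU / ROUGE-2 scoring sketch with sumeval (one of the
# required packages); the answer strings are placeholders.
from sumeval.metrics.bleu import BLEUCalculator
from sumeval.metrics.rouge import RougeCalculator

generated = "you can use dropout to reduce overfitting"
reference = "dropout is a common way to reduce overfitting"

bleu = BLEUCalculator().bleu(generated, reference)
rouge2 = RougeCalculator().rouge_n(summary=generated, references=[reference], n=2)
print("BLEU:", bleu, "ROUGE-2:", rouge2)
```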
..........................................................
User Interface
- We currently implement a simple web page, which is available at https://github.com/zhangzhenyu13/ProgrammingAlpha/tree/master/alphaservices/AlphaWeb.
- To start the web page service: cd /path/to/repo/programmingalpha/alphaservices/AlphaWeb, then run start_server_web.sh with bash.
- To start the backend services: run /path/to/repo/test/servers_test.sh with bash.
- Make sure the service config files are set properly so that each service can be routed properly by the portal backend service. These config files are contained in /path/to/repo/ConfigData.
- After finishing the above steps, you can access the service online at ip:port/webServices/alpha-QA, where ip and port are set via the config files (a programmatic example follows this list).
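Once the services are up, a hedged sketch of calling the portal programmatically; the ip/port placeholders and the JSON payload shape are assumptions, so inspect the AlphaWeb page for the real form fields:

```python
# Hypothetical programmatic call to the deployed portal. The ip/port
# come from your ConfigData files; the payload shape is an assumption.
import requests

IP, PORT = "127.0.0.1", 5000  # placeholders; use your configured values
url = f"http://{IP}:{PORT}/webServices/alpha-QA"
resp = requests.post(url, json={"question": "how to tune the learning rate?"})
print(resp.status_code, resp.text[:200])
```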
Citation
@INPROCEEDINGS{knowAlpha,
  author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
  title={The KnowAlpha: Finding Useful Information to Programmers through Semantic Understanding},
  year={2019},
  url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}
@INPROCEEDINGS{ansAlpha,
  author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
  title={AnsAlpha: Towards Automatic Answering of Developers’ Questions through Comprehension and Generation},
  year={2019},
  url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}