ProgrammingAlpha: Releasing Programmers from Searching Stack Overflow

The KnowAlpha: Automatically Recommending Useful Information to Programmers through Semantic Understanding


These instructions guide you through using the source code in this project to build KnowAlpha and then deploy it in practice.

Instructions for building the KnowAlpha recommender system from source code and running the experiments

  • Prepare system environment
  • Models
  • Data Pipeline
  • Build Models
  • Deploy System
  • Evaluation Results

Prepare system environment

Minimum configuration of machines

  • RAM: 512 GB
  • CPU: 56 logical cores
  • Disk: 1 TB+
  • GPU: 4 × Tesla V100 (32 GB each)

Install python environment

The whole system is developed in Python, so we recommend installing Anaconda (https://www.anaconda.com/) and creating a Python 3.6 virtual environment.

Install MongoDB Database

Install MongoDB on a Linux machine and configure the database IP and port according to the instructions at https://www.mongodb.com/. To enable fast retrieval of the data, install an Elasticsearch engine according to the instructions at https://www.elastic.co/.
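Once both services are installed, a quick reachability check can save debugging time later. The following is a minimal sketch using only the standard library. It assumes the services run on the local machine with their usual default ports (27017 for MongoDB, 9200 for Elasticsearch); the host and port values are assumptions and should be replaced with whatever you configured.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Assumed defaults: adjust to the ip and port you configured.
SERVICES = {
    "MongoDB": ("127.0.0.1", 27017),
    "Elasticsearch": ("127.0.0.1", 9200),
}


def check_services(services=SERVICES):
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}
```

Calling check_services() returns a dictionary of booleans. A False value only means the port did not accept a TCP connection; it does not say why.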

Required python packages

  • machine learning: scikit-learn, tensorflow, openNMT, texar, pytorch, networkx, sumeval, summy, TextBlob, bert-as-service
  • data preprocessing: pymongo, numpy, pandas
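To see which of these packages are still missing from your environment, something like the sketch below can help. The import names are assumptions where they differ from the package names (scikit-learn imports as sklearn, pytorch as torch). Packages whose import name is less obvious are left out of the list and can be added by hand.

```python
import importlib.util

# Assumed import names for a subset of the packages listed above.
REQUIRED_MODULES = [
    "sklearn", "tensorflow", "torch", "networkx",
    "textblob", "pymongo", "numpy", "pandas",
]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Running find_missing(REQUIRED_MODULES) inside the Anaconda environment lists what still needs to be installed; an empty list means every listed module was found.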

Project Check

  • Prepare Data
  • The mentioned neural network models are in ProgrammingAlpha/programmingalpha/models.
  • Run the scripts in ProgrammingAlpha/test/db_test/ folder to prepare training data.
  • Run the scripts in ProgrammingAlpha/test/retriver_test/ folder to build the model mentioned in KnowAlpha.

Prepare Data

Download the data dump from archive.org. Our training data currently comes from 4 online Q&A forums: Stack Overflow, Artificial Intelligence, Cross Validated and Data Science.

  • Build a MongoDB cluster and load all the required data into the database. Then deploy the Elasticsearch engine on top of your database cluster.
  • Create the directories listed in ProgrammingAlpha/programmingalpha/__init__.py.
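Creating the directories can be scripted. This is a minimal sketch: the directory names in EXAMPLE_DIRS are hypothetical placeholders, and the real list should be taken from ProgrammingAlpha/programmingalpha/__init__.py.

```python
import os


def make_dirs(base, relative_dirs):
    """Create each directory under base, leaving existing ones untouched."""
    created = []
    for rel in relative_dirs:
        path = os.path.join(base, rel)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        created.append(path)
    return created


# Hypothetical names for illustration only.
EXAMPLE_DIRS = ["data/corpus", "data/models", "data/cache"]
```

Because exist_ok=True is set, the function is safe to run repeatedly.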

Download the BERT Model

The project relies heavily on several publicly released pretrained models, so you need at least the BERT models, prepared according to the instructions at https://github.com/google-research/bert (TensorFlow version) and https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch version). Store the pretrained model weights and the auxiliary BERT data in the directories BertBasePath or BertLargePath defined in ProgrammingAlpha/programmingalpha/__init__.py.

Prepare the training Data

Data Analysis and Link Analysis

  • Run ProgrammingAlpha/test/associationAlg_test/seedSearchForTags.py to analyze the AI-related tags and use association mining to find all the required post data.
  • Run ProgrammingAlpha/test/graphLinke_test/build_link_path.py to build the post link graph. If you have a Spark cluster, you can scale up the computation by running ProgrammingAlpha/test/graphLinke_test/spark-graph.py; alternatively, run ProgrammingAlpha/test/graphLinke_test/extract_link_semi_path.py to build an incomplete graph for a quick test.
  • Extract link-distance post pairs: run ProgrammingAlpha/test/graphLinke_test/build_label_pair.py to generate "link distance + post ids (1+2)" data records, which are later used to generate the inference task data.
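To make the "link distance + post ids" record concrete, here is a small sketch of how such records could be derived from a link graph with a breadth-first search. It is an illustration only and not the code in build_label_pair.py: treating links as undirected, the max_distance cutoff, and the function names are all assumptions.

```python
from collections import deque


def link_distances(edges, source, max_distance):
    """Breadth-first search over an undirected post link graph.

    Returns {post_id: distance} for posts within max_distance of source.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_distance:
            continue  # do not expand beyond the cutoff
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def label_pairs(edges, source, max_distance=3):
    """Emit (distance, source_id, other_id) records, nearest first."""
    dist = link_distances(edges, source, max_distance)
    return sorted((d, source, pid) for pid, d in dist.items() if pid != source)
```

For a chain of links 1-2-3-4-5, label_pairs(edges, 1) yields records for posts 2, 3 and 4 with distances 1, 2 and 3, and omits post 5 because it lies beyond the cutoff.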

Training Data for KnowAlpha

  • Run ProgrammingAlpha/test/db_test/gen_corpus+_inference.py and push the generated corpus to the MongoDB cluster.
  • Run ProgrammingAlpha/test/db_test/gen_samples.py with the task parameter set to 'inference' to sample training and validation data.
  • Preprocess the generated samples by running ProgrammingAlpha/test/tokenizer_test/tokenize_corpus.py.
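As a rough picture of what sampling training and validation data involves, the sketch below shuffles records reproducibly and splits them. It does not reproduce gen_samples.py; the valid_fraction and seed parameters are assumptions chosen for illustration.

```python
import random


def split_samples(samples, valid_fraction=0.1, seed=42):
    """Shuffle samples reproducibly and split them into (train, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # private RNG keeps the global state untouched
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Fixing the seed means the same split is produced on every run, which keeps validation results comparable across experiments.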

Build Local Knowledge Base

  • First run ProgrammingAlpha/test/db_test/buildQAIndexer.py to gather all the answers to each question.
  • Run ProgrammingAlpha/test/db_test/gen_kwnowledge_unit.py to generate the knowledge unit data used by KnowAlpha.
  • Push the knowledge unit data to the MongoDB cluster.
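Gathering the answers to each question amounts to grouping answer posts by their parent. The sketch below illustrates that idea with the field names used in the Stack Exchange data dump (Id, PostTypeId, ParentId). It assumes the values were already parsed to integers, and the output keys question_id and answer_ids are hypothetical, not the format produced by the project scripts.

```python
from collections import defaultdict


def build_qa_index(posts):
    """Group answer ids under their parent question id.

    PostTypeId 1 marks a question and 2 marks an answer.
    """
    index = defaultdict(list)
    for post in posts:
        if post["PostTypeId"] == 2:
            index[post["ParentId"]].append(post["Id"])
    return dict(index)


def knowledge_units(posts):
    """Pair every answered question with the ids of its answers."""
    index = build_qa_index(posts)
    return [
        {"question_id": p["Id"], "answer_ids": index[p["Id"]]}
        for p in posts
        if p["PostTypeId"] == 1 and p["Id"] in index
    ]
```

Questions without any answer are skipped, since they contribute nothing to a knowledge base of solved problems.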

After all the above scripts have finished running, the system is ready for model training.

Build Models

Document Search Engine

  • The document search engine is in the KnowAlpha/programmingalpha/retrievers/SearchEngine folder.
  • Install the dependencies in requirement.txt and see run.py for how to use the document search engine.

Build Knowledge Inference Model

Evaluate the Model Performance

Evaluate KnowAlpha

  • Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task inference" to generate the test samples.
  • Run ProgrammingAlpha/test/retriever_test/run_model.sh --do_eval to predict the link distances directly; these predictions measure model performance on the test samples.
  • Run ProgrammingAlpha/test/retriever_test/interactive.py with its input stream redirected from a file containing the post ids of the test samples; this is the evaluation of KnowAlpha.
  • 1) Use the sklearn metrics toolkit to evaluate the model performance of the Inference Net; 2) refer to https://github.com/microsoft/recommenders for evaluating the results retrieved by KnowAlpha.
  • Other inference networks can be found at https://github.com/asyml/texar/tree/master/examples/sentence_classifier and https://github.com/zhangzhenyu13/ATEC_NLP.
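For a quick sanity check of retrieved results before turning to a full toolkit, two common ranking metrics can be computed by hand. The sketch below is plain Python and is not the API of sklearn or microsoft/recommenders; which metrics were used in the original experiments is not stated here, so precision at k and reciprocal rank are assumptions picked as typical examples.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved[:k]
    return sum(1 for pid in top if pid in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(retrieved, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0
```

Here retrieved is the ranked list of post ids returned for one query and relevant is the set of ids judged correct. Averaging reciprocal_rank over all test queries gives the mean reciprocal rank.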

..........................................................

AnsAlpha: Towards Automatic Answering of Developers’ Questions through Comprehension and Generation

Instructions for building the Q&A system from source code and running the experiments

  • Prepare system environment
  • Models
  • Data Pipeline
  • Build Models
  • Deploy System
  • Evaluation Results

Prepare system environment

Minimum configuration of machines

  • RAM: 512 GB
  • CPU: 56 logical cores
  • Disk: 1 TB+
  • GPU: 4 × Tesla V100 (32 GB each)

Install python environment

The whole system is developed in Python, so we recommend installing Anaconda (https://www.anaconda.com/) and creating a Python 3.6 virtual environment.

Install MongoDB Database

Install MongoDB on a Linux machine and configure the database IP and port according to the instructions at https://www.mongodb.com/. To enable fast retrieval of the data, install an Elasticsearch engine according to the instructions at https://www.elastic.co/.

Required python packages

  • machine learning: scikit-learn, tensorflow, openNMT, texar, pytorch, networkx, sumeval, summy, TextBlob, bert-as-service
  • data preprocessing: pymongo, numpy, pandas

Project Check

  • Prepare Data
  • The mentioned neural network models are in ProgrammingAlpha/programmingalpha/models.
  • The evaluation metric tool APIs are in ProgrammingAlpha/programmingalpha/Utility/metrics.py.
  • Run the scripts in ProgrammingAlpha/test/db_test/ folder to prepare training data.
  • Run the scripts in ProgrammingAlpha/test/retriver_test/ folder to build the model mentioned in KnowAlpha.
  • Run the scripts in ProgrammingAlpha/test/text_generation_test/ to build the model mentioned in AnsAlpha.

Prepare Data

Download the data dump from archive.org. Our training data currently comes from 4 online Q&A forums: Stack Overflow, Artificial Intelligence, Cross Validated and Data Science.

  • Build a MongoDB cluster and load all the required data into the database. Then deploy the Elasticsearch engine on top of your database cluster.
  • After downloading the Java crawler Maven project, use IntelliJ IDEA (https://www.jetbrains.com/idea/) to deploy the crawler jar package on your machine.
  • Create the directories listed in ProgrammingAlpha/programmingalpha/__init__.py.

Download the BERT Model

The project relies heavily on several publicly released pretrained models, so you need at least the BERT models, prepared according to the instructions at https://github.com/google-research/bert (TensorFlow version) and https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch version). Store the pretrained model weights and the auxiliary BERT data in the directories BertBasePath or BertLargePath defined in ProgrammingAlpha/programmingalpha/__init__.py.

Prepare the training Data

Training Data for AnsAlpha

  • Run ProgrammingAlpha/test/db_test/gen_corpus_seq2seq.py and push the generated corpus to the MongoDB cluster.
  • Run ProgrammingAlpha/test/db_test/gen_samples.py with the task parameter set to 'seq2seq' to sample training and validation data.
  • Use the preprocessing scripts in the OpenNMT package to generate the training data. Instructions can be found at http://opennmt.net/OpenNMT-py/options/preprocess.html.
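OpenNMT preprocessing reads parallel text files in which line i of the source file corresponds to line i of the target file. The sketch below writes question and answer pairs in that layout. The .src and .tgt file suffixes and the function name are assumptions for illustration; check the linked options page for the exact flags that consume these files.

```python
import os


def write_parallel_files(pairs, out_dir, prefix="train"):
    """Write (question, answer) pairs as line-aligned source and target files."""
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, prefix + ".src")
    tgt_path = os.path.join(out_dir, prefix + ".tgt")
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for question, answer in pairs:
            # One example per line, so collapse internal newlines and repeated spaces.
            src.write(" ".join(question.split()) + "\n")
            tgt.write(" ".join(answer.split()) + "\n")
    return src_path, tgt_path
```

Collapsing whitespace matters because a stray newline inside a post would shift every later line and silently misalign questions with answers.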

Build Local Knowledge Base

  • First run ProgrammingAlpha/test/db_test/buildQAIndexer.py to gather all the answers to each question.
  • Run ProgrammingAlpha/test/db_test/gen_kwnowledge_unit.py to generate the knowledge unit data used by KnowAlpha.
  • Push the knowledge unit data to the MongoDB cluster.

After all the above scripts have finished running, the system is ready for model training.

Train Neural Network Models

Build Text Generation Models (e.g. AnswerNet)

  • Run ProgrammingAlpha/test/text_generation_test/build_copy_transformer.py to begin teacher forcing training of AnswerNet.
  • Run ProgrammingAlpha/test/text_generation_test/build_rl_transformer.py to start training AnswerNet using reinforcement learning.
  • To train a text generation model with other networks, follow the quick start at http://opennmt.net/OpenNMT-py/options/train.html.
  • Other optional networks for text generation are also available at https://github.com/asyml/texar.

Evaluate the Model Performance

Evaluate AnsAlpha

  • Sample 2000 solved questions by running "ProgrammingAlpha/test/db_test/gen_samples.py --maxSize 2000 --task seq2seq", or sample unsolved questions with ProgrammingAlpha/test/db_test/unsolved_seq2seq.py. Alternatively, invoke the Google Custom Search Engine directly after adding the 4 online forums mentioned above.
  • After training AnswerNet and the other text generation models, use ProgrammingAlpha/test/text_generation_test/run_inference.sh or ProgrammingAlpha/test/text_generation_test/transformerinference.py to generate answers to the sampled questions.
  • Run "ProgrammingAlpha/test/utilities_test/computeScore.py true_answers.file generated_answers.file" to get the BLEU/ROUGE-2 evaluation scores.
  • We also conducted a simple online user survey at https://wj.qq.com/s2/3597786/b668/. The results are listed in the User Survey table below.
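For readers unfamiliar with ROUGE-2, the score in the computeScore.py step is based on bigram overlap between a reference answer and a generated answer. The sketch below shows the core idea with whitespace tokenization. It is a simplified illustration and an assumption about the general technique, not the implementation in computeScore.py or in the sumeval package, which may tokenize and normalize differently.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(reference, candidate):
    """Return (recall, precision, f1) from clipped bigram overlap."""
    ref = bigrams(reference.split())
    cand = bigrams(candidate.split())
    overlap = sum((ref & cand).values())  # & keeps the minimum count per bigram
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1
```

With the reference "the cat sat on the mat" and the candidate "the cat sat", two of the five reference bigrams are matched, so recall is 0.4 and precision is 1.0.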

User Survey

Id         2     1     0    -1    -2     mean   std.dev.
1         17     8     1     0     1    1.481      0.768
2          3    11     9     1     3    0.370      1.196
3          0     7    10     5     5   -0.296      1.097
4          1     5     7    10     4   -0.407      1.130
5         24     3     0     0     0    1.888      0.098
6         13    11     1     1     1    1.259      0.932
7         13    10     4     0     0    1.333      0.518
8          6     8     7     4     2    0.444      1.432
9         19     6     1     0     1    1.555      0.765
10         3    14     7     2     1    0.592      0.834
total     99    83    47    23    18    0.822      0.877
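The per-row statistics can be recomputed from the response counts. In the sketch below, each row is read as the number of respondents who chose each rating from 2 down to -2; that reading is an assumption, although it reproduces the reported means. Recomputing also suggests that the values in the last column match the population variance of the ratings, so the standard deviation would be the square root of the listed value (roughly 0.88 for row 1).

```python
def rating_stats(counts, ratings=(2, 1, 0, -1, -2)):
    """Return (mean, population variance) of ratings weighted by response counts."""
    n = sum(counts)
    mean = sum(r * c for r, c in zip(ratings, counts)) / n
    variance = sum(c * (r - mean) ** 2 for r, c in zip(ratings, counts)) / n
    return mean, variance


# Counts taken from row 1 of the table.
mean, variance = rating_stats([17, 8, 1, 0, 1])
```

The table values appear to be truncated rather than rounded to three decimals, so small differences in the last digit are expected.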

..........................................................

Deploying The ProgrammingAlpha System

User Interface

  • We have implemented a simple web page, available at https://github.com/zhangzhenyu13/ProgrammingAlpha/tree/master/alphaservices/AlphaWeb.
  • To start the web page service, cd to /path/to/repo/programmingalpha/alphaservices/AlphaWeb and run start_server_web.sh with bash.
  • To start the backend service, run /path/to/repo/test/servers_test.sh with bash.
  • Make sure the service config files are set correctly so that the portal backend service can route each service. The config files are in /path/to/repo/ConfigData.
  • After completing the above steps, you can access the service online at ip:port/webServices/alpha-QA, where ip and port are set in the config files.

Please cite our work if you use this project elsewhere.

@INPROCEEDINGS{knowAlpha,
author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
title={The KnowAlpha: Finding Useful Information to Programmers through Semantic Understanding},
year={2019},
url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}
@INPROCEEDINGS{ansAlpha,
author={Zhenyu Zhang and Hailong Sun and Hongyu Zhang and Pengbo Cai},
title={AnsAlpha: Towards Automatic Answering of Developers’ Questions through Comprehension and Generation},
year={2019},
url={https://github.com/zhangzhenyu13/ProgrammingAlpha}
}
