Issue 22 first exploration of questions #23
base: dev
Conversation
Hey @helloaidank,
- bertopic==0.16.0 and kaleido are missing from the requirements. Not sure if you just forgot to commit the requirements file, but without these two requirements, your BERTopic_first_analysis.py script doesn't work.
- I think it would make more sense to apply the EDA to the whole of the BuildHub data, instead of each sub-forum, as each sub-forum is quite small; hence why you don't get many repetitions (see the pooling sketch after this list).
- What I learnt by looking at the data & investigating a bit further:
  - A lot of the questions we are getting are made up of 3 or fewer words (see plot below), so perhaps something is not working well in your code to remove these;
  - The above also tells us that we're sometimes missing the context (as we expected)!
  - If we take a good look at the questions data, most of them are not questions. For the Green and Ethical Money Saving sub-forum, we get 184143 questions without a question mark and 40219 with a question mark. That's a lot of potential false positives! Before we go any further, we need to improve our method for identifying questions. A few thoughts:
    - either remove some of the starting keywords such as "is", OR just focus on sentences ending with a question mark (see the filter sketch after this list);
    - only focus on the original posts for now, not replies (unsure as to whether this will help);
    - we should for sure use the titles to identify questions, in addition to using the text as we already do; I am guessing a lot of titles are posed as questions, so they should help;
    - add the sentences coming before the question, to have some context.
  - We don't seem to have a lot of questions (or at least repeated questions) posed as "don't knows", which is good: ~200 "don't knows" in the Green and Ethical Money Saving category.
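As a rough illustration of the pooling point above, combining the per-sub-forum question files before running the EDA could look something like the sketch below; the glob pattern and file layout are assumptions for illustration, not the repo's actual structure.

```python
import glob
import pandas as pd

# Hypothetical file layout: one questions CSV per sub-forum (pattern is an assumption).
paths = glob.glob(
    "outputs/data/extracted_questions/buildhub/forum_*/questions_*_all.csv"
)

# Pool every sub-forum into a single frame so repeated questions can surface forum-wide.
all_questions = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

And a minimal sketch of a stricter question filter along the lines suggested above, keeping only sentences that end with a question mark and contain more than three words (the "Question" column name follows the CSVs discussed in this PR; the helper itself is hypothetical):

```python
def filter_questions(df: pd.DataFrame, column: str = "Question") -> pd.DataFrame:
    """Keep only rows that look like genuine questions."""
    text = df[column].astype(str).str.strip()
    ends_with_qm = text.str.endswith("?")  # drop sentences without a question mark
    long_enough = text.str.split().str.len() > 3  # drop 3-word-or-less fragments
    return df[ends_with_qm & long_enough]
```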
Happy to discuss more in our stand up.
""" | ||
python BERTopic_first_analysis.py | ||
|
||
This script performs topic modeling on a set of questions using the BERTopic library. |
Suggested change:
```diff
- This script performs topic modeling on a set of questions using the BERTopic library.
+ This script clusters questions together to identify groups of similar questions. To do that, we apply the BERTopic topic model to a set of questions.
```
```python
def load_data(file_path: str) -> List[str]:
    """
    Loads extracted questions from a CSV file into a list.

    Args:
        file_path (str): The path to the CSV file containing the questions.

    Returns:
        List[str]: A list of questions.
    """
    return pd.read_csv(file_path)["Question"].tolist()
```
We should create getters for questions' data in the getters folder, and then call the getter here instead of creating this function in this script. Doesn't need to happen now.
Agreed, this was very much a temporary measure.
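For what it's worth, a getter could look roughly like the sketch below; the function name, file-name pattern, and PROJECT_DIR import are assumptions based on the paths used elsewhere in this review, not the repo's actual API.

```python
import os
import pandas as pd

from asf_public_discourse_home_decarbonisation import PROJECT_DIR  # assumed import path

def get_extracted_questions(forum: str, category: str) -> pd.DataFrame:
    """Load the extracted questions for a given forum and category."""
    path = os.path.join(
        PROJECT_DIR,
        # Hypothetical file-name pattern, mirroring the idk_phrases path used later in this review.
        f"outputs/data/extracted_questions/{forum}/forum_{category}/questions_{category}_all.csv",
    )
    return pd.read_csv(path)
```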
```python
def plot_topic_distribution(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Plots and saves the distribution of the top topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    # Row 0 of get_topic_info() is the outlier topic (-1), so plot topics 1-16.
    topic_counts = topic_model.get_topic_info()["Count"][1:17]
    topic_labels = topic_model.get_topic_info()["Name"][1:17].str.replace("_", " ")
    plt.figure(figsize=(14, 8))
    plt.barh(topic_labels, topic_counts, color=NESTA_COLOURS[0])
    plt.ylabel("Topics")
    plt.xlabel("Count")
    plt.title("Topic Distribution")
    plt.tight_layout()
    plt.savefig(
        figure_file_path + "topic_distribution.png", dpi=300, bbox_inches="tight"
    )
```
Once I merge my BERTopic code you can use the utils I created and call them here.
This script can be moved to the pipeline faqs_identification folder after a few future changes. Doesn't need to happen in this PR.
After we do the evaluation of different models on different datasets, we can set the params for each questions dataset and fix a random seed. These params can then be read by this file.
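On fixing a random seed: BERTopic itself takes no random_state argument, so reproducibility typically comes from seeding the UMAP model it uses for dimensionality reduction. A minimal sketch, with illustrative (not evaluated) parameter values:

```python
from bertopic import BERTopic
from umap import UMAP

# BERTopic's run-to-run variation mostly comes from UMAP, so fix its seed.
umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42
)
topic_model = BERTopic(umap_model=umap_model)
```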
```python
def visualise_topics(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig = topic_model.visualize_topics()
    fig.write_image(figure_file_path + "topic_visualisation.png")


def visualise_barchart(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a barchart visualisation of the top n topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig_barchart = topic_model.visualize_barchart(top_n_topics=16, n_words=10)
    fig_barchart.write_image(figure_file_path + "topic_visualisation_barchart.png")


def visualise_hierarchy(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a hierarchical visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig_hierarchy = topic_model.visualize_hierarchy()
    fig_hierarchy.write_image(figure_file_path + "topic_visualisation_hierarchy.png")
```
If you find these helpful, let's move them to the topic modelling utils file I created; it's a better place for them to live. This needs to be done in a PR after we both merge our PRs.
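For reference, a sketch of how these helpers would be called after fitting; it assumes questions is the list returned by load_data, and figure_file_path is an existing folder path ending in a separator (since the helpers concatenate it with the file name):

```python
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(questions)  # fit on the question texts

plot_topic_distribution(topic_model, figure_file_path)
visualise_topics(topic_model, figure_file_path)
visualise_barchart(topic_model, figure_file_path)
visualise_hierarchy(topic_model, figure_file_path)
```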
```python
extracted_questions_df = load_data(input_data)
question_counts = get_question_counts(extracted_questions_df)
plot_question_counts(question_counts, figure_path=output_figures_path)
```
A few additional things I did/checked (the code is not necessarily the best, just initial thoughts from me!):
Suggested change (replacing the three lines above):
```python
os.makedirs(output_figures_path, exist_ok=True)
extracted_questions_df = load_data(input_data)
extracted_questions_df["contains_qm"] = extracted_questions_df["Question"].apply(
    lambda x: "?" in x
)
print(
    "Contains question mark?\n",
    extracted_questions_df["contains_qm"].value_counts(),
)
extracted_questions_df["Question"] = extracted_questions_df["Question"].str.lower()
question_counts = get_question_counts(extracted_questions_df)
plot_question_counts(question_counts, figure_path=output_figures_path, top_n=15)
dont_knows_path = os.path.join(
    PROJECT_DIR,
    f"outputs/data/extracted_questions/{forum}/forum_{category}/idk_phrases_{category}_all.csv",
)
dont_knows_df = load_data(dont_knows_path)
dont_knows_df["sentences_without_inclusion"] = dont_knows_df[
    "sentences_without_inclusion"
].str.lower()
dk_counts = get_question_counts(dont_knows_df, column_name="sentences_without_inclusion")
dk_counts = dk_counts[dk_counts > 1]
if len(dk_counts) > 0:
    print(dk_counts)
else:
    print('No frequent "dont know" expressions found')
```
Hi @sofiapinto, I have added some of the modifications you suggested and also included the LLM label titling. The code should still run fine but please do check!
Description
This PR introduces two Python scripts designed to perform our first iteration of a question analysis: BERTopic_first_analysis.py and questions_eda_analysis.py. The outputs from these scripts are solely figures.

Instructions for Reviewer
Hi @sofiapinto, thanks a lot for reviewing my code!
Setup
- You will need data from the 119_air_source_heat_pumps_ashp forum from buildhub, please.
- Clone the repo: [email protected]:nestauk/asf_public_discourse_home_decarbonisation.git
- Check out the branch issue-22-first_exploration_of_questions
- Run make install
- Run direnv allow
- Run conda activate asf_public_discourse_home_decarbonisation
Review
- Run python BERTopic_first_analysis.py
- Run python questions_eda_analysis.py
- The output figures are saved in asf_public_discourse_home_decarbonisation/outputs/figures/extracted_questions/
Scripts
If you could review the following scripts:
- asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/BERTopic_first_analysis.py: Implements topic modeling using the BERTopic library, providing insights into the thematic structure of the question data.
- asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/questions_eda_analysis.py: Facilitates EDA on question datasets, plotting the most frequent questions from the extracted questions data.

Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have updated the READMEs