Issue 22 first exploration of questions #23
base: dev
Conversation
Hey @helloaidank,
- bertopic==0.16.0 and kaleido are missing from the requirements. Not sure if you just forgot to commit the requirements file, but without these two requirements, your BERTopic_first_analysis.py script doesn't work.
- I think it would make more sense to apply the EDA to the whole of the BuildHub data, instead of each sub-forum, as each sub-forum is quite small; hence why you don't get many repetitions (see the pooling sketch after this list).
- What I learnt by looking at the data & investigating a bit further:
  - A lot of the questions we are getting are made up of 3 or fewer words (see plot below), so perhaps something is not working well in your code to remove these;
  - The above also tells us that we're sometimes missing the context (as we expected)!
  - If we take a good look at the questions data, most of them are not questions. For the Green and Ethical Money Saving sub-forum, we get 184143 questions without a question mark and 40219 with a question mark. That's a lot of potential false positives! Before we go any further, we need to improve our method for identifying questions. A few thoughts:
    - either remove some of the starting keywords such as "is", OR just focus on sentences ending with a question mark (see the filter sketch after this list);
    - only focus on the original posts for now, not replies (unsure as to whether this will help);
    - we should for sure use the titles to identify questions, in addition to using the text as we already do; I am guessing a lot of titles are posed as questions, so they should help;
    - add the sentences coming before the question, to have some context.
  - We don't seem to have a lot of questions (or at least repeated questions) posed as "don't knows", which is good: ~200 "don't knows" in the Green and Ethical Money Saving category.
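As a rough illustration of the pooling point above, combining the per-sub-forum question files before running the EDA could look something like the sketch below; the glob pattern and file layout are assumptions for illustration, not the repo's actual structure.

```python
import glob
import pandas as pd

# Hypothetical file layout: one questions CSV per sub-forum (pattern is an assumption).
paths = glob.glob(
    "outputs/data/extracted_questions/buildhub/forum_*/questions_*_all.csv"
)

# Pool every sub-forum into a single frame so repeated questions can surface forum-wide.
all_questions = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

And a minimal sketch of a stricter question filter along the lines suggested above, keeping only sentences that end with a question mark and contain more than three words (the "Question" column name follows the CSVs discussed in this PR; the helper itself is hypothetical):

```python
def filter_questions(df: pd.DataFrame, column: str = "Question") -> pd.DataFrame:
    """Keep only rows that look like genuine questions."""
    text = df[column].astype(str).str.strip()
    ends_with_qm = text.str.endswith("?")  # drop sentences without a question mark
    long_enough = text.str.split().str.len() > 3  # drop 3-word-or-less fragments
    return df[ends_with_qm & long_enough]
```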
Happy to discuss more in our stand up.
""" | ||
python BERTopic_first_analysis.py | ||
|
||
This script performs topic modeling on a set of questions using the BERTopic library. |
Suggested change:
```diff
- This script performs topic modeling on a set of questions using the BERTopic library.
+ This script clusters questions together to identify groups of similar questions. To do that, we apply the BERTopic topic model to a set of questions.
```
```python
def load_data(file_path: str) -> List[str]:
    """
    Loads extracted questions from a CSV file into a list.

    Args:
        file_path (str): The path to the CSV file containing the questions.

    Returns:
        List[str]: A list of questions.
    """
    return pd.read_csv(file_path)["Question"].tolist()
```
We should create getters for questions' data in the getters folder, and then call the getter here instead of creating this function in this script. Doesn't need to happen now.
Agreed, this was very much a temporary measure.
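For what it's worth, a getter could look roughly like the sketch below; the function name, file-name pattern, and PROJECT_DIR import are assumptions based on the paths used elsewhere in this review, not the repo's actual API.

```python
import os
import pandas as pd

from asf_public_discourse_home_decarbonisation import PROJECT_DIR  # assumed import path

def get_extracted_questions(forum: str, category: str) -> pd.DataFrame:
    """Load the extracted questions for a given forum and category."""
    path = os.path.join(
        PROJECT_DIR,
        # Hypothetical file-name pattern, mirroring the idk_phrases path used later in this review.
        f"outputs/data/extracted_questions/{forum}/forum_{category}/questions_{category}_all.csv",
    )
    return pd.read_csv(path)
```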
```python
def plot_topic_distribution(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Plots and saves the distribution of the top topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    # Row 0 of get_topic_info() is the outlier topic (-1), so plot topics 1-16.
    topic_counts = topic_model.get_topic_info()["Count"][1:17]
    topic_labels = topic_model.get_topic_info()["Name"][1:17].str.replace("_", " ")
    plt.figure(figsize=(14, 8))
    plt.barh(topic_labels, topic_counts, color=NESTA_COLOURS[0])
    plt.ylabel("Topics")
    plt.xlabel("Count")
    plt.title("Topic Distribution")
    plt.tight_layout()
    plt.savefig(
        figure_file_path + "topic_distribution.png", dpi=300, bbox_inches="tight"
    )
```
Once I merge my BERTopic code you can use the utils I created and call them here.
This script can be moved to the pipeline faqs_identification folder after a few future changes. Doesn't need to happen in this PR.
After we do the evaluation of different models on different datasets, we can set the params for each questions dataset and fix a random seed. These params can then be read by this file.
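On fixing a random seed: BERTopic itself takes no random_state argument, so reproducibility typically comes from seeding the UMAP model it uses for dimensionality reduction. A minimal sketch, with illustrative (not evaluated) parameter values:

```python
from bertopic import BERTopic
from umap import UMAP

# BERTopic's run-to-run variation mostly comes from UMAP, so fix its seed.
umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42
)
topic_model = BERTopic(umap_model=umap_model)
```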
```python
def visualise_topics(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig = topic_model.visualize_topics()
    fig.write_image(figure_file_path + "topic_visualisation.png")


def visualise_barchart(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a barchart visualisation of the top n topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig_barchart = topic_model.visualize_barchart(top_n_topics=16, n_words=10)
    fig_barchart.write_image(figure_file_path + "topic_visualisation_barchart.png")


def visualise_hierarchy(topic_model: BERTopic, figure_file_path: str) -> None:
    """
    Generates and saves a hierarchical visualisation of topics identified by the BERTopic model.

    Args:
        topic_model (BERTopic): The BERTopic model after fitting to data.
        figure_file_path (str): Path to the folder where the figure is saved.
    """
    fig_hierarchy = topic_model.visualize_hierarchy()
    fig_hierarchy.write_image(figure_file_path + "topic_visualisation_hierarchy.png")
```
If you find these helpful, let's move them to the topic modelling utils file I created; it's a better place for them to live. This needs to be done in a PR after we both merge our PRs.
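For reference, a sketch of how these helpers would be called after fitting; it assumes questions is the list returned by load_data, and figure_file_path is an existing folder path ending in a separator (since the helpers concatenate it with the file name):

```python
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(questions)  # fit on the question texts

plot_topic_distribution(topic_model, figure_file_path)
visualise_topics(topic_model, figure_file_path)
visualise_barchart(topic_model, figure_file_path)
visualise_hierarchy(topic_model, figure_file_path)
```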
```python
extracted_questions_df = load_data(input_data)
question_counts = get_question_counts(extracted_questions_df)
plot_question_counts(question_counts, figure_path=output_figures_path)
```
A few additional things I did/checked (the code is not necessarily the best, just initial thoughts from me!):
Suggested change (replacing the three lines above):
```python
os.makedirs(output_figures_path, exist_ok=True)
extracted_questions_df = load_data(input_data)
extracted_questions_df["contains_qm"] = extracted_questions_df["Question"].apply(
    lambda x: "?" in x
)
print(
    "Contains question mark?\n",
    extracted_questions_df["contains_qm"].value_counts(),
)
extracted_questions_df["Question"] = extracted_questions_df["Question"].str.lower()
question_counts = get_question_counts(extracted_questions_df)
plot_question_counts(question_counts, figure_path=output_figures_path, top_n=15)
dont_knows_path = os.path.join(
    PROJECT_DIR,
    f"outputs/data/extracted_questions/{forum}/forum_{category}/idk_phrases_{category}_all.csv",
)
dont_knows_df = load_data(dont_knows_path)
dont_knows_df["sentences_without_inclusion"] = dont_knows_df[
    "sentences_without_inclusion"
].str.lower()
dk_counts = get_question_counts(dont_knows_df, column_name="sentences_without_inclusion")
dk_counts = dk_counts[dk_counts > 1]
if len(dk_counts) > 0:
    print(dk_counts)
else:
    print('No frequent "dont know" expressions found')
```
Hi @sofiapinto, I have added some of the modifications you suggested and also included the LLM label titling. The code should still run fine but please do check!
Description
This PR introduces two Python scripts designed to perform our first iteration of a question analysis: BERTopic_first_analysis.py and questions_eda_analysis.py. The outputs from these scripts are solely figures.

Instructions for Reviewer
Hi @sofiapinto, thanks a lot for reviewing my code!
Setup
- You will need data from the 119_air_source_heat_pumps_ashp forum from buildhub, please.
- Clone the repo: [email protected]:nestauk/asf_public_discourse_home_decarbonisation.git
- Check out the branch issue-22-first_exploration_of_questions
- Run make install
- Run direnv allow
- Run conda activate asf_public_discourse_home_decarbonisation
Review
- Run python BERTopic_first_analysis.py
- Run python questions_eda_analysis.py
- The output figures are saved in asf_public_discourse_home_decarbonisation/outputs/figures/extracted_questions/
Scripts
If you could review the following scripts:
- asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/BERTopic_first_analysis.py: Implements topic modeling using the BERTopic library, providing insights into the thematic structure of the question data.
- asf_public_discourse_home_decarbonisation/analysis/FAQ_analysis/questions_eda_analysis.py: Facilitates EDA on question datasets, plotting the most frequent questions from the extracted questions data.

Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have updated the READMEs