Replication and feature extension of a recommender system (Liu et al., 2024) utilising two LLMs at pre-processing and content encoding.

lisannbecker/ONCE-extension


ONCE: Boosting Content-based Recommendation with Open- and Closed-source LLMs


A. Ibrahimi, L. Becker, R. van Emmerik


This project reproduces and extends the paper "ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models" by Liu et al. (2024). Please request our report for detailed information on the implementation of our reproduction and extension of the ONCE architecture.

Installation

Requirements

  • Linux or Windows with Python 3.11 (Python 3.8 for the GENRE environment)

Environment Setup

For DIRE and ONCE:

conda create --name once python=3.11 
conda activate once

pip install -r requirements_once.txt

Or for the GENRE implementation:

conda create --name genre python=3.8 
conda activate genre

pip install -r requirements_genre.txt

Model and Data Download

Navigate to the Ekstra Bladet website to download the small and/or large EB-NeRD dataset. Move the data to src/data/.

Run the following scripts to download the LLaMA 7b model and its embeddings to src/data/:

python src/scripts/scripts_dire/1_download_llama_vocab_model.py

python src/scripts/scripts_dire/2_download_llama_embeddings.py

Reproduction

GENRE Implementation: Prompting Closed-source LLMs for Content-based Recommendation

conda activate genre

Move to the GENRE repository:

cd src/lib/GENRE/

Move or duplicate the EB-NeRD data to src/lib/GENRE/data/eb-nerd/eb-nerd-data/ebnerd_small for the GENRE implementation. To convert the EB-NeRD articles to the ONCE data format, run:

python src/lib/GENRE/scripts_genre/articles_process.py

Overview

We use OpenAI's GPT-3.5-turbo API as the closed-source LLM. All request scripts are available in this repository.
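To show the general shape of such a request, here is a minimal sketch of a chat-completions call for the content summarizer. The prompt wording and function names are our illustrative assumptions; the actual requests are defined in news_summarizer.py and user_profiler.py.

```python
# Illustrative GPT-3.5-turbo summarizer request (hypothetical prompt wording;
# see news_summarizer.py for the actual implementation).

def build_summarizer_messages(title, abstract):
    """Build a chat-completions message list asking for a short news summary."""
    system = "You are a news content summarizer. Return one concise sentence."
    user = f"Title: {title}\nAbstract: {abstract}\nSummarize the article."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def summarize(client, title, abstract):
    """Send the request via an OpenAI client (requires an API key)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_summarizer_messages(title, abstract),
    )
    return response.choices[0].message.content

messages = build_summarizer_messages("Storm hits Copenhagen", "Heavy winds ...")
```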

Scripts and Corresponding Generated Data

| Dataset | Scheme | Request Code | Generated Data |
| --- | --- | --- | --- |
| EB-NeRD | Content Summarizer | news_summarizer.py | data/eb-nerd/eb-nerd-outputs/news_summarizer.log |
| EB-NeRD | User Profiler (train) | user_profiler.py | data/eb-nerd/eb-nerd-outputs/user_profiler.log |
| EB-NeRD | User Profiler (val) | user_profiler.py | data/eb-nerd/eb-nerd-outputs/user_profiler_val.log |

To update the datasets with the generated data, run these scripts:

python src/lib/GENRE/scripts_genre/update_articles.py
python src/lib/GENRE/scripts_genre/update_history.py
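Conceptually, these update scripts fold the generated text back into the dataset records. The sketch below shows the idea with plain dicts; the tab-separated "article_id, summary" log layout is an assumption for illustration only (the real format is whatever the request scripts write).

```python
# Illustrative merge of generated summaries into article records.
# Assumes one "article_id\tsummary" pair per log line (hypothetical format;
# the actual layout is defined by news_summarizer.py).

def parse_summary_log(lines):
    """Map article_id -> generated summary."""
    summaries = {}
    for line in lines:
        article_id, summary = line.rstrip("\n").split("\t", 1)
        summaries[article_id] = summary
    return summaries

def update_articles(articles, summaries):
    """Attach each generated summary to its article record."""
    for article in articles:
        if article["id"] in summaries:
            article["summary"] = summaries[article["id"]]
    return articles

log_lines = ["42\tA short LLM-generated summary."]
articles = [{"id": "42", "title": "Example"}]
updated = update_articles(articles, parse_summary_log(log_lines))
```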

To obtain statistics of the augmented data, run the following script:

python src/lib/GENRE/scripts_genre/statistics_eb_nerd.py
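As one example of the kind of statistic this step can report, the sketch below compares average word counts before and after augmentation. The field contents are toy data; the actual metrics are defined in statistics_eb_nerd.py.

```python
# Toy before/after statistic for augmented articles (illustrative only).

def avg_word_count(texts):
    """Mean number of whitespace-separated tokens per text."""
    return sum(len(t.split()) for t in texts) / len(texts)

original = ["Storm hits Copenhagen hard", "Election results announced"]
augmented = [
    "Storm hits Copenhagen hard causing damage",
    "Election results announced nationwide today",
]

before = avg_word_count(original)
after = avg_word_count(augmented)
```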

DIRE Implementation: Finetuning Open-source LLMs for Content-based Recommendation

conda activate once

Tokenize the EB-NeRD data located in src/data/ with the basic tokenizer:

python src/scripts/scripts_dire/3_processor_dire.py
python src/scripts/scripts_dire/4_processor_dire_llama.py
python src/scripts/scripts_dire/5_ebnerd_fusion_dire.py

Move to the Legommenders repository and pre-train. Replace {yourModel} in the command with the recommender model you want to use: 'fastformer', 'naml', or 'nrms'. You will also need to update the config files to point to src/data/:

cd src/lib/Legommenders/

python worker.py --embed config/embed/llama-token.yaml --model config/model/llm/llama-{yourModel}.yaml \
    --exp config/exp/llama-split.yaml --data config/data/eb-nerd.yaml --version small --llm_ver 7b \
    --hidden_size 64 --layer 0 --lora 0  --fast_eval 0 --embed_hidden_size 4096 --page_size 32 --cuda -1

Training and testing with default parameters:

cd src/lib/Legommenders/

python  worker.py --data config/data/eb-nerd.yaml --embed config/embed/llama-token.yaml \
     --model config/model/llm/llama-fastformer.yaml --exp config/exp/tt-llm.yaml --embed_hidden_size 4096 \
     --llm_ver 7b --layer 31 --version small --lr 0.0001   --item_lr 0.00001 --batch_size 64 --acc_batch 1 --epoch_batch -4
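Our reading of the Legommenders flags is that --acc_batch is a gradient-accumulation factor, so the optimizer-step batch size is batch_size × acc_batch (confirm in worker.py). Under that assumption, the DIRE command above and the ONCE/sentiment commands below use the same effective batch size:

```python
# Effective batch size under gradient accumulation (assumed flag semantics).

def effective_batch_size(batch_size, acc_batch):
    """Optimizer-step batch size: per-step batch times accumulation factor."""
    return batch_size * acc_batch

# DIRE: --batch_size 64 --acc_batch 1; ONCE/sentiment: --batch_size 32 --acc_batch 2.
assert effective_batch_size(64, 1) == effective_batch_size(32, 2) == 64
```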

ONCE

conda activate once

Tokenize the EB-NeRD data located in src/data/ with the tokenizer for the basic data plus topics and region:

python src/scripts/scripts_dire/6_processor_once.py
python src/scripts/scripts_dire/7_processor_llama_once.py
python src/scripts/scripts_dire/8_ebnerd_fusion_once.py

Move to the Legommenders repository and pre-train. Replace {yourModel} in the command with the recommender model you want to use: 'fastformer', 'naml', or 'nrms'. You will also need to update the config files to point to src/data/:

cd src/lib/Legommenders/

python worker.py --embed config/embed/llama-token.yaml --model config/model/llm/llama-{yourModel}-once.yaml --exp config/exp/llama-split-once.yaml --data config/data/eb-nerd-once.yaml --version small --llm_ver 7b --hidden_size 64 --layer 0 --lora 0 --fast_eval 0 --embed_hidden_size 4096 --page_size 8

Training and testing with default parameters:

python worker.py  --data config/data/eb-nerd-once.yaml --embed config/embed/llama-token.yaml  --model config/model/llm/llama-{yourModel}-once.yaml --exp config/exp/tt-llm.yaml --embed_hidden_size 4096 --llm_ver 7b --layer 31 --version small --lr 0.0001 --item_lr 0.00001 --batch_size 32 --acc_batch 2 --epoch_batch -4  

Extension

DIRE with Sentiment

conda activate once

Tokenize the EB-NeRD data located in src/data/ with the tokenizer for the basic data plus sentiment:

python src/scripts/scripts_dire/9_processor_sentiment.py
python src/scripts/scripts_dire/10_processor_llama_sentiment.py
python src/scripts/scripts_dire/11_ebnerd_fusion-sentiment.py

Prepare the script: open src/lib/Legommenders/model/inputer/llm_concat_inputer.py and uncomment line 38. If you are curious about what is behind the list, see /src/scripts/scripts_dire/0_get_col_prompts_additions.py.
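Conceptually, a concat inputer prepends a short natural-language prefix to each column's tokens and concatenates the columns into one sequence. The sketch below is illustrative only: the tokenizer is a whitespace stub and the prompt strings are hypothetical (the real ones come from llm_concat_inputer.py and 0_get_col_prompts_additions.py).

```python
# Toy illustration of a concat inputer: each column gets a natural-language
# prefix, then all columns are concatenated into one token sequence.
# Tokenization here is a whitespace stub, not the LLaMA tokenizer.

def stub_tokenize(text):
    return text.lower().split()

COL_PROMPTS = {  # hypothetical prompt strings
    "title": "news title:",
    "sentiment": "sentiment:",
}

def concat_inputer(item):
    sequence = []
    for col, value in item.items():
        sequence += stub_tokenize(COL_PROMPTS[col]) + stub_tokenize(value)
    return sequence

seq = concat_inputer({"title": "Storm hits Copenhagen", "sentiment": "Negative"})
```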

Move to the Legommenders repository and pre-train. Replace {yourModel} in the command with the recommender model you want to use: 'fastformer', 'naml', or 'nrms'. You will also need to update the config files to point to src/data/:

cd src/lib/Legommenders/

python worker.py --embed config/embed/llama-token.yaml --model config/model/llm/llama-{yourModel}-sentiment.yaml --exp config/exp/llama-split-sentiment.yaml --data config/data/eb-nerd-sentiment.yaml --version small --llm_ver 7b --hidden_size 64 --layer 0 --lora 0 --fast_eval 0 --embed_hidden_size 4096 --page_size 8 --cuda -1

Training and testing with default parameters:

python worker.py  --data config/data/eb-nerd-sentiment.yaml --embed config/embed/llama-token.yaml  --model config/model/llm/llama-{yourModel}-sentiment.yaml --exp config/exp/tt-llm.yaml --embed_hidden_size 4096 --llm_ver 7b --layer 31 --version small --lr 0.0001 --item_lr 0.00001 --batch_size 32 --acc_batch 2 --epoch_batch -4 
