This project is designed to evaluate the effectiveness of different prompt hacking techniques in extracting sensitive information (secret keys) from Large Language Models (LLLMs). The goal is to use attack prompts to successfully extrack the secret keys in the system prompts. More information in Instructions.md. This is a solution to lab02 from the LLM Agents MOOC
- Design System Prompts
system_prompt_1
is designed to be naive.system_prompt_1_1
is a variation.system_prompt_2
is designed to be more robust and include some defense mechanisms.system_prompt_2_1
is a variation.
- Design Attack Prompts
attack_1
is an attack forsystem_prompt_1
(andsystem_prompt_1_1
).attack_2
is an attack forsystem_prompt_2
(andsystem_prompt_2_1
).
I found useful to have two attack files when starting (attack_1_1
), to quickly test some changes side by side, but it is optional. I used attack_1
and attack_2
for the lab submission.
- Test the Attacks
test_attacks
generates N random secret keys and tests each attack against each system prompt N times. At the time of submission,attack-2
fails 1 test in 15 withsystem_prompt_2
and passes all tests withsystem_prompt_2_1
- Clone the Repository:
git clone <repository_url>
cd <repository_name>
- Create and Activate a Virtual Environment:
python3 -m venv env
source env/bin/activate # For macOS/Linux
.\env\Scripts\activate # For Windows
- Install Dependencies:
pip install -r requirements.txt
- Set Up OpenAI API Key:
- Create an OpenAI API key here.
- Store it as an environment variable
export OPENAI_API_KEY=your_api_key # For macOS/Linux
set OPENAI_API_KEY=your_api_key # For Windows
- Run the Main Script:
python test_attacks.py