LLM Security, Writing Attack Prompts

Overview

This project is designed to evaluate the effectiveness of different prompt hacking techniques in extracting sensitive information (secret keys) from Large Language Models (LLLMs). The goal is to use attack prompts to successfully extrack the secret keys in the system prompts. More information in Instructions.md. This is a solution to lab02 from the LLM Agents MOOC

How It Works

Design System Prompts

system_prompt_1 is designed to be naive. system_prompt_1_1 is a variation.
system_prompt_2 is designed to be more robust and include some defense mechanisms. system_prompt_2_1 is a variation.

Design Attack Prompts

attack_1 is an attack for system_prompt_1 (and system_prompt_1_1).
attack_2 is an attack for system_prompt_2 (and system_prompt_2_1).

I found useful to have two attack files when starting (attack_1_1), to quickly test some changes side by side, but it is optional. I used attack_1 and attack_2 for the lab submission.

Test the Attacks

test_attacks generates N random secret keys and tests each attack against each system prompt N times. At the time of submission, attack-2 fails 1 test in 15 with system_prompt_2 and passes all tests with system_prompt_2_1

Setup and Installation

Clone the Repository:

git clone <repository_url>
cd <repository_name>

Create and Activate a Virtual Environment:

python3 -m venv env
source env/bin/activate  # For macOS/Linux
.\env\Scripts\activate   # For Windows

Install Dependencies:

pip install -r requirements.txt

Set Up OpenAI API Key:

Create an OpenAI API key here.
Store it as an environment variable

export OPENAI_API_KEY=your_api_key  # For macOS/Linux
set OPENAI_API_KEY=your_api_key    # For Windows

Running the Project

Run the Main Script:

python test_attacks.py

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
system_prompts		system_prompts
.gitignore		.gitignore
Instructions.md		Instructions.md
README.md		README.md
attack-1.txt		attack-1.txt
attack-1_1.txt		attack-1_1.txt
attack-2.txt		attack-2.txt
openai-playground-mock-environment.png		openai-playground-mock-environment.png
requirements.txt		requirements.txt
test_attacks.py		test_attacks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Security, Writing Attack Prompts

Overview

How It Works

Setup and Installation

Running the Project

About

Releases

Packages

Languages

anaoaktree/llmagentsmooc-lab02

Folders and files

Latest commit

History

Repository files navigation

LLM Security, Writing Attack Prompts

Overview

How It Works

Setup and Installation

Running the Project

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages