A robust system for generating synthetic psychological conversations using Large Language Models (LLMs) via LM Studio. This project creates realistic multi-turn conversations based on demographic profiles, beliefs, and cognitive biases.
This project generates synthetic conversations between personas and AI assistants, where each persona is defined by:
- Geographic location
- Demographics (age, gender, education, etc.)
- Personal beliefs and values
- Cognitive biases
The system processes CSV data containing persona profiles and uses LLMs to generate contextually appropriate conversations that reflect the persona's characteristics.
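Conceptually, each CSV row maps onto a persona record shaped like the following (a minimal sketch; the field names mirror the CSV columns documented below):

```python
from dataclasses import dataclass

@dataclass
class Persona:
    """One row of the input CSV (names mirror the documented columns)."""
    place: str          # geographic location
    demographics: str   # age, gender, education, occupation, ...
    beliefs: str        # personal beliefs and values
    bias: str           # cognitive biases
```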
- Parallel Processing: Configurable concurrent request handling with rate limiting
- Crash Recovery: Automatic resume from last checkpoint after interruptions
- Progress Tracking: Detailed logging and real-time progress monitoring
- Retry Logic: Exponential backoff for failed requests (see the sketch after this list)
- Flexible Output: Individual JSON files per persona or consolidated output
- Data Validation: Robust JSON parsing with error handling
- Rate Limiting: Configurable requests per minute to prevent API overload
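As a rough illustration of the retry behavior (a sketch, not the repository's exact code), failed requests are retried with exponentially growing delays plus jitter:

```python
import asyncio
import random

async def with_retries(make_request, max_retries: int = 3):
    """Retry an async request factory with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return await make_request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error
            # Delays of roughly 1s, 2s, 4s, ... plus jitter
            await asyncio.sleep(2 ** attempt + random.random())
```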
- Python 3.8+
- LM Studio installed and running
- A compatible LLM loaded in LM Studio (e.g., GPT-OSS-20B)
- Clone the repository:

  ```bash
  git clone https://github.com/jithinAB/nudge-genai.git
  cd nudge-genai
  ```

- Set up a virtual environment:

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows: env\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install aiohttp aiofiles
  ```

- Set up LM Studio:
  - Download and install LM Studio
  - Load your preferred model (e.g., `openai/gpt-oss-20b`)
  - Start the local server (default: `http://localhost:1234`)
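To confirm the server is reachable before running the pipeline, you can query LM Studio's OpenAI-compatible `/v1/models` endpoint (a stdlib-only sketch; `quick_check.py` is a hypothetical helper, not part of this repo):

```python
# quick_check.py: list the models currently loaded in LM Studio
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models", timeout=10) as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model["id"])  # e.g. "openai/gpt-oss-20b"
```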
```
nudge-genai/
├── data/
│   └── data/
│       └── scenario.csv              # Input CSV with persona profiles
├── scripts/
│   ├── lm_studio_processor.py        # Main processing script
│   ├── test_lm_studio.py             # LM Studio connection test
│   ├── test_simple.py                # Simple API test
│   └── synthetic_data_output/        # Generated conversations
│       ├── individual_results/       # Per-persona JSON files
│       ├── consolidated_results.json
│       ├── synthetic_conversations_final.json
│       ├── processing_summary.json
│       └── failed_rows.json
├── .gitignore
└── README.md
```
Edit the configuration section in `scripts/lm_studio_processor.py`:

```python
# API Configuration
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"
MODEL_NAME = "openai/gpt-oss-20b"

# Processing configuration
MAX_CONCURRENT_REQUESTS = 1   # Number of parallel requests
REQUESTS_PER_MINUTE = 10      # Rate limit
MAX_RETRIES = 3               # Maximum retry attempts
REQUEST_TIMEOUT = 180         # Timeout in seconds
SAVE_INTERVAL = 5             # Save checkpoint every N rows

# Model parameters
TEMPERATURE = 0.7
MAX_TOKENS = 2000
```
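For orientation, here is a minimal sketch (not the script's exact code) of how these settings map onto a single OpenAI-compatible chat completion request with `aiohttp`:

```python
import asyncio
import aiohttp

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"
MODEL_NAME = "openai/gpt-oss-20b"
TEMPERATURE, MAX_TOKENS, REQUEST_TIMEOUT = 0.7, 2000, 180

async def generate(prompt: str) -> str:
    """Send a single chat completion request to the local server."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
    }
    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(LM_STUDIO_URL, json=payload) as resp:
            resp.raise_for_status()
            data = await resp.json()
    return data["choices"][0]["message"]["content"]

print(asyncio.run(generate("Reply with one short sentence.")))
```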
The CSV file should contain the following columns:

- `place`: Geographic location
- `demographics`: Age, gender, education, occupation, etc.
- `beliefs`: Personal beliefs and values
- `bias`: Cognitive biases
Example:

```csv
place,demographics,beliefs,bias
New York,"35, Male, MBA, Marketing Manager","Values work-life balance, Believes in sustainable living","Confirmation bias, Anchoring bias"
```
Test the LM Studio connection first:

```bash
cd scripts
python test_lm_studio.py
```

Start fresh processing:

```bash
python lm_studio_processor.py
```

Resume from checkpoint (after interruption):

```bash
python lm_studio_processor.py --resume
```

The script provides real-time progress updates:

```
[INFO] Processing row 10/100 (10.0%) | Row ID: User_10
[INFO] Successfully processed row 10 in 3.45s
[INFO] Progress: 10/100 (10.0%) | Success rate: 90.0%
```
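One plausible way the `--resume` checkpointing can work (a hypothetical sketch; the actual file-naming scheme may differ) is to skip any row that already has a result file in `individual_results/`:

```python
from pathlib import Path

RESULTS_DIR = Path("synthetic_data_output/individual_results")

def already_done(row_number: int) -> bool:
    """Hypothetical check: one JSON file per processed row."""
    return (RESULTS_DIR / f"row_{row_number}.json").exists()

rows_to_process = [r for r in range(1, 101) if not already_done(r)]
```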
Each persona generates a file in `synthetic_data_output/individual_results/`:

```json
{
  "row_number": 1,
  "row_id": "User_01",
  "status": "success",
  "processing_time": 3.45,
  "timestamp": "2025-01-16T10:30:00",
  "input_data": {
    "place": "New York",
    "demographics": "35, Male, MBA",
    "beliefs": "Values work-life balance",
    "bias": "Confirmation bias"
  },
  "output_data": {
    "Conversations": {
      "career_advice": [
        {"role": "person", "message": "..."},
        {"role": "AI", "message": "..."}
      ]
    }
  }
}
```

`synthetic_conversations_final.json` contains all successful conversations in a single file.
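Given the record layout above, downstream consumers can walk the per-persona files like this (a minimal sketch):

```python
import json
from pathlib import Path

for path in Path("synthetic_data_output/individual_results").glob("*.json"):
    record = json.loads(path.read_text())
    if record.get("status") != "success":
        continue  # skip failed rows
    for topic, turns in record["output_data"]["Conversations"].items():
        for turn in turns:
            print(f"[{topic}] {turn['role']}: {turn['message']}")
```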
`processing_summary.json` provides statistics:

```json
{
  "total_rows": 100,
  "successful": 95,
  "failed": 5,
  "success_rate": 95.0,
  "total_time": 450.5,
  "average_time_per_row": 4.5
}
```

If the script cannot connect to LM Studio:

- Ensure LM Studio is running and the server is started
- Check that the URL matches your LM Studio settings (default: `http://localhost:1234`)
- Verify the model name matches the loaded model
If requests fail due to memory or load issues:

- Reduce `MAX_CONCURRENT_REQUESTS` to 1
- Decrease `MAX_TOKENS` if responses are too large
- Process data in smaller batches
If responses fail JSON parsing:

- Check the debug files in `synthetic_data_output/debug_*.txt`
- Review the prompt template to ensure it requests valid JSON
- Increase `MAX_TOKENS` if responses are being truncated
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with LM Studio for local LLM inference
- Uses an OpenAI-compatible API for maximum flexibility
- Inspired by research in synthetic data generation for AI training
For questions or support, please open an issue on GitHub or contact the maintainers.
Note: This tool is designed for research and development purposes. Ensure you comply with all applicable data protection and privacy regulations when generating synthetic data based on real demographic profiles.