Commit 0f671be

cleanup
1 parent d099177 commit 0f671be

623 files changed

Lines changed: 1644 additions & 2041 deletions


.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
3.13

CLAUDE.md

Lines changed: 26 additions & 45 deletions
@@ -15,13 +15,11 @@ PDF → Text Extraction (pdftotext) → LLM Parsing → JSON → Database Loadin
1515

1616
**Python Version:** >=3.12 (specified in pyproject.toml)
1717

18-
**Package Management:** Uses `uv` for modern Python dependency management. The project has both legacy Pipfile and modern pyproject.toml.
18+
**Package Management:** Uses `uv`. Dependencies are declared in `pyproject.toml`.
1919

2020
**Install dependencies:**
2121
```bash
2222
uv sync
23-
# Or legacy approach:
24-
pip install -r requirements.txt
2523
```
2624

2725
## Core Components & Architecture
@@ -32,29 +30,24 @@ pip install -r requirements.txt
3230
- **Input:** PDF files in `input/` directory
3331
- **Output:** Text files in `output/` directory with date-based filenames
3432

35-
### 2. LLM-based Parsers
33+
### 2. LLM-based Parser
3634

37-
**Parser 1** (`parser.py`):
38-
- Legacy implementation using Gemini 1.5 Flash
39-
- Processes entire text files at once
40-
- Basic approach, kept for reference
35+
**`parser.py`** uses the `llm` Python API with `gemini-2.5-pro`:
36+
- Splits each bulletin's text into chunks at MEM-XXX-YY job-ID boundaries
37+
- Processes each chunk independently for accuracy
38+
- UTF-8 normalization, structured JSON output
39+
- Processes both Member and Internship bulletins
40+
- **Output:** JSON files in `json/` directory (one per bulletin)
4141

42-
**Parser 2** (`parser2.py`) - RECOMMENDED:
43-
- Improved implementation using Gemini 2.0 Flash
44-
- **Key Innovation:** Splits text into chunks based on job ID patterns (MEM-xxx-yy)
45-
- Processes each job listing separately for better accuracy
46-
- Better error handling and UTF-8 character normalization
47-
- **Output:** JSON files in `json_gemini_flash/` directory
48-
49-
**Run Parser:**
42+
**Run:**
5043
```bash
51-
python parser2.py
44+
python parser.py
5245
```
5346

5447
**Important Notes:**
55-
- Parsers use the `llm` library with subprocess calls
56-
- Rate limiting built in (7-8 second delays between chunks)
57-
- Requires Gemini API access configured in `llm` tool
48+
- Requires Gemini API access configured via `llm keys set gemini`
49+
- Rate limiting built in (8s between chunks, 5s between files)
50+
- Skips bulletins already present in `json/`
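
The skip behavior in the last note can be sketched as below. This is a minimal illustration, not the code in `parser.py`: the function name `pending_bulletins`, the `.txt` extension for extracted text, and the rule that each JSON file shares its bulletin's filename stem are all assumptions.

```python
from pathlib import Path


def pending_bulletins(text_dir: Path, json_dir: Path) -> list[Path]:
    """Return text files in text_dir that have no same-stem .json file in json_dir.

    Hypothetical helper: assumes output/foo.txt is parsed into json/foo.json.
    """
    # Collect the stems of bulletins that already have parsed output.
    done = {p.stem for p in json_dir.glob("*.json")}
    # Keep only the text files whose stem has not been parsed yet.
    return sorted(p for p in text_dir.glob("*.txt") if p.stem not in done)
```

Under these assumptions, calling `pending_bulletins(Path("output"), Path("json"))` would list only the bulletins that still need an LLM pass, so a rerun does not spend API calls on files that are already parsed.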
5851

5952
### 3. Job Classification System
6053

@@ -126,7 +119,7 @@ python init_database.py
126119
python db_loader.py --load-file path/to/new_jobs.json
127120

128121
# Load entire directory
129-
python db_loader.py --load-dir json_gemini_flash/
122+
python db_loader.py --load-dir json/
130123

131124
# View statistics
132125
python db_loader.py --stats
@@ -174,9 +167,6 @@ python web_interface.py
174167
```bash
175168
# Test classifier on sample data
176169
python test_classifier.py
177-
178-
# Test validation
179-
python test.py
180170
```
181171

182172
### Processing New PDF Files
@@ -188,39 +178,35 @@ python test.py
188178
```
189179
3. Parse with LLM:
190180
```bash
191-
python parser2.py
181+
python parser.py
192182
```
193183
4. Optionally classify jobs:
194184
```bash
195185
uv run python job_classifier.py
196186
```
197187
5. Load into database:
198188
```bash
199-
python db_loader.py --load-dir json_gemini_flash/
189+
python db_loader.py --load-dir json/
200190
```
201191

202192
### Validating JSON Output
203193
```bash
204-
python validate.py
205-
```
206-
207-
### Creating CSV Export
208-
```bash
209-
python make_csv.py
194+
python validate.py json/ # report-only
195+
python validate.py json/ --delete # destructive
210196
```
211197

212198
## Important Implementation Details
213199

214200
### UTF-8 and Character Normalization
215-
The PDFs often contain smart quotes, em-dashes, and other non-ASCII characters. Both parsers include system prompts to normalize these to UTF-8 equivalents. Pay attention to this when modifying parser prompts.
201+
The PDFs often contain smart quotes, em-dashes, and other non-ASCII characters. The parser system prompt normalizes these to UTF-8 equivalents — preserve those instructions when modifying the prompt.
216202

217203
### Job ID Pattern
218204
House job listings use the pattern `MEM-XXX-YY` where:
219205
- `MEM` = Member office
220206
- `XXX` = Sequential number
221207
- `YY` = Two-digit year
222208

223-
This pattern is used in `parser2.py` to split text into chunks: `re.split(r'(?=MEM-)', text)`
209+
This pattern is used in `parser.py` to split text into chunks: `re.split(r'(?=MEM-)', text)`
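
A short sketch of how that lookahead split behaves may help here. Only the `re.split(r'(?=MEM-)', text)` call comes from the text above. The helper name `split_into_chunks`, the stricter `JOB_ID` pattern, and the choice to drop preamble text are assumptions for illustration.

```python
import re

# Assumed shape of a full job ID: MEM-<digits>-<two-digit year>.
JOB_ID = re.compile(r"MEM-\d+-\d{2}")


def split_into_chunks(text: str) -> list[str]:
    """Split bulletin text so that each chunk starts at a job ID."""
    # The zero-width lookahead keeps "MEM-" at the start of each piece
    # (splitting on empty matches requires Python 3.7+).
    parts = re.split(r"(?=MEM-)", text)
    # The first piece is whatever precedes the first ID (header text or an
    # empty string); drop anything that does not begin with a full job ID.
    return [p.strip() for p in parts if JOB_ID.match(p)]


sample = (
    "House Bulletin\n"
    "MEM-001-25 Staff Assistant\nDetails\n"
    "MEM-002-25 Legislative Aide\n"
)
chunks = split_into_chunks(sample)
```

Because the split pattern is just `MEM-`, a listing that mentions another job ID in its body would also be split at that point. Whether `parser.py` guards against this is not stated here.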
224210

225211
### Date Extraction from Filenames
226212
The `db_loader.py` includes logic to extract dates from various filename formats:
@@ -239,9 +225,8 @@ Office name matching in `db_loader.py` uses:
239225
- Committee detection to avoid false matches
240226

241227
### Rate Limiting
242-
Both parsers and the classifier include sleep() calls to respect API rate limits:
243-
- `parser.py`: 7 seconds between files
244-
- `parser2.py`: 8 seconds between chunks, 5 seconds between files
228+
The parser and classifier include sleep() calls to respect API rate limits:
229+
- `parser.py`: 8 seconds between chunks, 5 seconds between files
245230
- `job_classifier.py`: 2 seconds between jobs, 5 seconds between files
246231

247232
Adjust these if you hit rate limits or want faster processing.
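
The delay pattern described above could be structured roughly as follows. This is a sketch under assumptions: the constant names, `process_files`, and the injectable `sleep` parameter are hypothetical and do not reflect the actual layout of `parser.py`. Only the 8-second and 5-second values come from the list above.

```python
import time
from typing import Callable

CHUNK_DELAY_S = 8  # pause after each chunk (value taken from the list above)
FILE_DELAY_S = 5   # pause after each file (value taken from the list above)


def process_files(
    files: dict[str, list[str]],
    handle_chunk: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, list[str]]:
    """Run handle_chunk over every chunk, pausing between API calls."""
    results: dict[str, list[str]] = {}
    for name, chunks in files.items():
        parsed = []
        for chunk in chunks:
            parsed.append(handle_chunk(chunk))
            sleep(CHUNK_DELAY_S)  # stay under the per-request rate limit
        results[name] = parsed
        sleep(FILE_DELAY_S)  # extra pause before the next file
    return results
```

Keeping the delays as named constants puts the tuning described above in one place, and passing `sleep` as a parameter lets tests substitute a no-op instead of waiting.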
@@ -252,13 +237,11 @@ Adjust these if you hit rate limits or want faster processing.
252237
house-jobs/
253238
├── input/ # PDF files (tracked in git)
254239
├── output/ # Extracted text files (tracked in git)
255-
├── json_gemini_flash/ # Parsed JSON output (tracked in git)
256-
├── json_gemini_pro/ # Alternative parser output
257-
├── json_classified/ # Classified job output
240+
├── json/ # Parsed JSON output (tracked in git)
258241
259-
├── parser.py # Legacy parser (Gemini 1.5)
260-
├── parser2.py # Recommended parser (Gemini 2.0)
242+
├── parser.py # Bulletin parser (Gemini 2.5 Pro, llm Python API)
261243
├── job_classifier.py # Job categorization
244+
├── config.py # Shared path/db constants
262245
263246
├── schema.sql # Database schema
264247
├── init_database.py # Initial database setup
@@ -269,9 +252,7 @@ house-jobs/
269252
├── metadata.yml # Datasette configuration
270253
├── templates/ # Flask templates
271254
272-
├── validate.py # JSON validation
273-
├── make_csv.py # CSV export utility
274-
├── test.py # Tests
255+
├── validate.py # JSON validation (report-only by default)
275256
├── test_classifier.py # Classifier tests
276257
└── analyze_classifications.py # Classification analysis
277258
```
