-The PDFs often contain smart quotes, em-dashes, and other non-ASCII characters. Both parsers include system prompts to normalize these to UTF-8 equivalents. Pay attention to this when modifying parser prompts.
+The PDFs often contain smart quotes, em-dashes, and other non-ASCII characters. The parser system prompt normalizes these to UTF-8 equivalents; preserve those instructions when modifying the prompt.

### Job ID Pattern
House job listings use the pattern `MEM-XXX-YY` where:
- `MEM` = Member office
- `XXX` = Sequential number
- `YY` = Two-digit year

-This pattern is used in `parser2.py` to split text into chunks: `re.split(r'(?=MEM-)', text)`
+This pattern is used in `parser.py` to split text into chunks: `re.split(r'(?=MEM-)', text)`
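
A minimal sketch of how that split behaves, assuming the extracted text is one string with a job ID at the start of each listing. The sample text and the helper name `split_into_chunks` are hypothetical and not taken from `parser.py`; only the `re.split(r'(?=MEM-)', text)` call comes from the README.

```python
import re

# Hypothetical sample; real input would be the text extracted from a PDF.
sample_text = (
    "House Vacancy Bulletin\n"
    "MEM-101-24 Legislative Assistant ...\n"
    "MEM-102-24 Staff Assistant ...\n"
)


def split_into_chunks(text: str) -> list[str]:
    """Split before every 'MEM-' so each chunk keeps its own job ID."""
    # The lookahead (?=MEM-) matches a zero-width position, so the
    # 'MEM-' prefix is not consumed by the split (needs Python 3.7+).
    parts = re.split(r'(?=MEM-)', text)
    # Text that starts with 'MEM-' yields an empty first piece; drop it.
    return [part for part in parts if part.strip()]


chunks = split_into_chunks(sample_text)
```

Any text before the first `MEM-` (a bulletin header, for example) comes back as its own chunk, so a caller would probably want to skip chunks that do not start with a job ID.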
### Date Extraction from Filenames
`db_loader.py` includes logic to extract dates from various filename formats:
@@ -239,9 +225,8 @@ Office name matching in `db_loader.py` uses:
- Committee detection to avoid false matches

### Rate Limiting
-Both parsers and the classifier include sleep() calls to respect API rate limits:
-- `parser.py`: 7 seconds between files
-- `parser2.py`: 8 seconds between chunks, 5 seconds between files
+The parser and classifier include sleep() calls to respect API rate limits:
+- `parser.py`: 8 seconds between chunks, 5 seconds between files
- `job_classifier.py`: 2 seconds between jobs, 5 seconds between files

Adjust these if you hit rate limits or want faster processing.
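
As a rough illustration of where those delays sit, the sketch below pauses after each chunk and again after each file. It assumes a simple nested loop; the constant names, `process_files`, and the `process_chunk` callback are hypothetical and do not come from `parser.py` or `job_classifier.py`. Only the 8-second and 5-second values are taken from the list above.

```python
import time
from typing import Callable, Iterable

# Values from the README; the constant names are illustrative only.
CHUNK_DELAY_SECONDS = 8
FILE_DELAY_SECONDS = 5


def process_files(
    files: Iterable[Iterable[str]],
    process_chunk: Callable[[str], str],
    chunk_delay: float = CHUNK_DELAY_SECONDS,
    file_delay: float = FILE_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Run process_chunk on every chunk, pausing between API calls."""
    results = []
    for chunks in files:
        for chunk in chunks:
            results.append(process_chunk(chunk))
            sleep(chunk_delay)  # pause after each chunk
        sleep(file_delay)  # extra pause after each file
    return results
```

Keeping the delays as parameters makes it easy to raise them after a rate-limit error or lower them for faster runs, and passing `sleep` in lets the loop be exercised without actually waiting.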
@@ -252,13 +237,11 @@ Adjust these if you hit rate limits or want faster processing.
house-jobs/
├── input/ # PDF files (tracked in git)
├── output/ # Extracted text files (tracked in git)
-├── json_gemini_flash/ # Parsed JSON output (tracked in git)