- Primary goal: Extract email addresses and LinkedIn profiles from websites with correct behavior and passing tests
- Secondary goals: Maintain clean architecture with pluggable browsers, extractors, and filters; keep public APIs stable
- Install deps: `uv sync --all-extras`
- Start dev: N/A (library, not a service)
- Lint: `make lint` (runs `ruff check` and `mypy`)
- Format: `make format` (runs `ruff check --select I --fix` and `ruff format`)
- Typecheck: `make lint` (includes mypy)
- Test (all): `make test-all` (runs all tests including slow ones)
- Test (single): `make test` (excludes slow tests) or `uv run pytest tests/test_email_extractor.py::test_specific_test`
- Build: `uv build`
- Publish: `make publish` (builds and publishes to PyPI)
- Docs serve: `make docs-serve` (runs `mkdocs serve`)
- Docs publish: `make docs-publish` (deploys to GitHub Pages)
- `extract_emails/` — main package source code
  - `browsers/` — browser implementations (ChromiumBrowser, HttpxBrowser)
  - `data_extractors/` — extractors for emails and LinkedIn profiles
  - `data_savers/` — data saving implementations (CSV)
  - `link_filters/` — link filtering logic (ContactInfoLinkFilter, DefaultLinkFilter)
  - `models/` — data models (PageData)
  - `utils/` — utility functions (email filtering, TLD validation)
  - `workers/` — main worker orchestration (DefaultWorker)
  - `console/` — CLI application entry point
- `tests/` — pytest test suite (mirrors source structure)
- `docs/` — mkdocs documentation
- `mkdocs.yml` — documentation configuration
- Generated / do-not-edit: `.venv/`, `dist/`, `build/`, `*.egg-info/`, `site/` (docs build output)
- Data flow: `DefaultWorker` → `Browser` (`PageSourceGetter`) → `LinkFilter` → `DataExtractor` → `PageData`
- Key entrypoints:
  - Library: `extract_emails.DefaultWorker`
  - CLI: `extract_emails.console.application:main` (via the `extract-emails` command)
- Key configs: `pyproject.toml`, `pytest.ini`, `mkdocs.yml`, `Makefile`
- Components:
- Workers: Orchestrate extraction with depth-limited crawling
- Browsers: Abstract page source fetching (ChromiumBrowser for JS-rendered pages, HttpxBrowser for static)
- Link Filters: Determine which links to follow (ContactInfoLinkFilter focuses on contact/about pages)
- Data Extractors: Extract specific data types (EmailExtractor, LinkedinExtractor)
- Models: PageData aggregates extracted data per page
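The depth-limited crawl that the worker performs can be sketched with the standard library alone. Everything below (the `crawl` function, the `contact_links_first` helper, the `get_links` callback) is a hypothetical illustration of the pattern, not the library's actual API:

```python
from collections import deque
from typing import Callable

def crawl(start_url: str, get_links: Callable[[str], list[str]], max_depth: int = 2) -> list[str]:
    """Visit pages breadth-first, following links up to max_depth hops from the start."""
    visited: set[str] = set()
    order: list[str] = []
    queue: deque[tuple[str, int]] = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            # Only expand links while we are above the depth limit.
            for link in get_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return order

def contact_links_first(links: list[str]) -> list[str]:
    """Order contact/about-looking links first, roughly what ContactInfoLinkFilter aims for."""
    keywords = ("contact", "about")
    return sorted(links, key=lambda u: not any(k in u.lower() for k in keywords))
```

In the real library the link filter and browser are injected objects rather than callables, but the control flow (fetch, filter links, recurse to a bounded depth) is the same shape.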
- Required versions: Python >=3.10,<3.15, uv (package manager)
- Required services: None (standalone library)
- Optional dependencies:
  - `playwright` for ChromiumBrowser (requires `playwright install chromium --with-deps`)
  - `httpx` for HttpxBrowser
- Env vars: None required
- Migrations/seed: N/A
- Branching: Feature branches from main (conventional commits)
- PR expectations:
  - All tests must pass (`make test-all`)
  - Lint and typecheck must pass (`make lint`)
  - Update docs when behavior changes
  - Use Google-style docstrings for new code
- Commit format: `type: title` (e.g., `feat: add new feature`, `fix: bug fix`, `docs: update README`)
- Release/versioning: Semantic versioning; version in `extract_emails/__init__.py` and `pyproject.toml`
- Formatting: `ruff format` (with `ruff check --select I --fix` for imports)
- Type checking: `mypy` (strict mode)
- Conventions:
  - Snake_case for modules and functions
  - Clear separation: browsers, extractors, filters, workers
  - Abstract base classes for extensibility (PageSourceGetter, DataExtractor, LinkFilterBase)
  - Support both sync and async APIs
- Error handling:
  - Log errors via the `loguru` logger
  - Continue processing on individual page failures
  - Raise exceptions for configuration errors
- Documentation: Google-style docstrings (required for mkdocs)
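The abstract-base-class convention can be illustrated with a minimal, stdlib-only sketch. The class and method names below mirror the library's extractor naming, but the signatures are assumptions for illustration, not the real interface:

```python
import re
from abc import ABC, abstractmethod

class DataExtractor(ABC):
    """Minimal stand-in for an extractor base class (signature is illustrative)."""

    @abstractmethod
    def get_data(self, page_source: str) -> set[str]:
        """Return every match of this extractor's data type found in the page source."""

class EmailExtractor(DataExtractor):
    """Naive email extraction via regex; the real extractor also applies
    email filtering and TLD validation (see utils/)."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def get_data(self, page_source: str) -> set[str]:
        # A set deduplicates addresses that appear multiple times on one page.
        return set(self.EMAIL_RE.findall(page_source))
```

New extractors plug in by subclassing the base and implementing the one abstract method, which is what keeps browsers, filters, and extractors independently swappable.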
- Example (good):

  ```python
  from pathlib import Path

  from extract_emails import DefaultWorker
  from extract_emails.browsers import ChromiumBrowser
  from extract_emails.models import PageData

  with ChromiumBrowser() as browser:
      worker = DefaultWorker("https://example.com", browser)
      data = worker.get_data()

  PageData.to_csv(data, Path("output.csv"))
  ```
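For illustration, the final `PageData`-to-CSV step can be mimicked with the stdlib `csv` module; the `Page` dataclass and `pages_to_csv` function below are hypothetical stand-ins for `PageData` and `PageData.to_csv`, not the library's API:

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Page:
    """Illustrative stand-in for PageData: one crawled page plus what was extracted from it."""
    url: str
    emails: list[str] = field(default_factory=list)

def pages_to_csv(pages: list[Page], path: Path) -> None:
    """Write one row per (page, email) pair, flattening the per-page aggregation."""
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", "email"])
        for page in pages:
            for email in page.emails:
                writer.writerow([page.url, email])
```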