Welcome! This guide helps you onboard any dataset using modular, copy-paste prompts.
Your data onboarding system now has reusable prompt patterns you can use for any dataset:
| File | Purpose | When to Use |
|---|---|---|
| prompts/00-setup-environment.md | 🏗️ First-time AWS setup | Run ONCE after cloning repo into new AWS account |
| prompts/ (01-route through 06-govern) | 📋 Copy-paste templates | Quick lookup for prompt structure |
| prompts/examples.md | 📝 Filled-out examples | See real-world usage with details |
| prompts/regulation/ | 🔒 Regulation-specific controls | When GDPR, CCPA, HIPAA, SOX, or PCI DSS compliance is required |
| SKILLS.md (bottom section) | 📖 Full documentation | Deep dive into each pattern |
| CLAUDE.md | 🏗️ Architecture reference | Understand system design |
| deploy_to_aws.py | 🚀 Deployment script | Deploy workload to AWS (Glue, MWAA, QuickSight) |
If this is a fresh clone into a new AWS account, run the setup prompt first:
Setup AWS environment for the Agentic Data Onboarding platform.
Account details:
- AWS Region: us-east-1
- Project name: data-onboarding
- Environment: dev
What I need created:
- [x] IAM roles
- [x] S3 data lake bucket
- [x] KMS encryption keys
- [x] Glue databases
- [x] Lake Formation LF-Tags
- [x] Lake Formation TBAC grants
Existing resources: none
This creates all AWS prerequisites (IAM roles, S3 bucket, KMS keys, Glue databases, LF-Tags) interactively. See prompts/00-setup-environment.md for full details.
Multi-account deployment: The setup defaults to single-account. If you need the Glue catalog + Lake Formation in one account ("Account A") and Glue jobs + MWAA + S3 in a consumer account ("Account B"), see multi-account-deployment.md — the setup prompt will ask a single-vs-multi question and wire catalog_account_id + sts:AssumeRole across generated artifacts.
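If you want to confirm the setup landed, a quick boto3 spot-check works. This is a minimal sketch, assuming default credentials for the target account; it only lists what the setup prompt should have created:

```python
import boto3

REGION = "us-east-1"
glue = boto3.client("glue", region_name=REGION)
lakeformation = boto3.client("lakeformation", region_name=REGION)

# Glue databases the setup prompt should have created
for db in glue.get_databases()["DatabaseList"]:
    print("database:", db["Name"])

# LF-Tags registered for the TBAC grants
for tag in lakeformation.list_lf_tags()["LFTags"]:
    print("lf-tag:", tag["TagKey"], "=", tag["TagValues"])
```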
Check if data from [YOUR_DATA_DESCRIPTION] has already been onboarded.
Source details:
- Location: [S3_PATH or DATABASE.TABLE]
- Format: [CSV/JSON/Parquet]
- Description: [Brief description]
Report: existing workload status or confirm new data.
What happens: Claude searches all existing workloads and tells you if this data is already onboarded.
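Under the hood this is essentially a lookup across workload configs. A rough illustration of the check, assuming each workload records its source location in `config/source.yaml` (the exact layout may differ):

```python
from pathlib import Path

import yaml  # pip install pyyaml

target = "s3://bucket/crm/customers.csv"  # your [S3_PATH or DATABASE.TABLE]

# Scan every workload's source config for a matching location (assumed layout)
for cfg in Path("workloads").glob("*/config/source.yaml"):
    source = yaml.safe_load(cfg.read_text())
    if source.get("location") == target:
        print(f"Already onboarded: {cfg.parent.parent.name}")
        break
else:
    print("Not found: safe to onboard as a new workload")
```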
If Step 1 says "not found", use this prompt:
Onboard new dataset: [DATASET_NAME]
Source:
- Type: [S3/Database/API]
- Location: [FULL_PATH]
- Format: [CSV/JSON/Parquet]
- Frequency: [Daily/Hourly/One-time]
- Credentials: [AWS Secrets Manager ARN]
Schema:
- column1: type, description, role (measure/dimension/identifier)
- column2: type, description, role
Bronze:
- Keep raw format: YES
- Retention: [DAYS]
Silver:
- Cleaning: [Dedupe on KEY, handle nulls, type casting]
- PII masking: [COLUMNS]
- Format: Apache Iceberg
Gold:
- Use case: [Reporting/Analytics/ML]
- Format: [Star Schema/Flat]
- Quality threshold: 95%
Quality Rules:
- Completeness: [Required columns non-null]
- Uniqueness: [Key must be unique]
Schedule:
- Frequency: [cron expression]
- SLA: [minutes]
Build complete pipeline with tests.
What happens: Claude creates:
- Complete folder structure (`workloads/[name]/`)
- Config files (source, semantic, transformations, quality, schedule)
- Transformation scripts (Bronze→Silver→Gold)
- Airflow DAG
- Comprehensive tests (50+ tests)
- README documentation
```bash
# Run tests
cd workloads/[YOUR_DATASET_NAME]
pytest tests/ -v
# Should see: 50+ tests passing ✓

# Check DAG
python3 dags/[your_dataset]_dag.py
# Should print: "DAG loaded successfully"
```

Need: Create a demo without access to real data
1. Generate synthetic data for customer_demo:
Rows: 100
Columns:
- customer_id: STRING, unique, CUST-001 format
- name: STRING, realistic names
- email: STRING, 10% nulls
- segment: ENUM, Enterprise 20%, SMB 50%, Individual 30%
- country: STRING, US 60%, UK 20%, CA 10%, DE 10%
Output: demo/sample_data/customer_demo.csv with generator script
Then onboard it with the ONBOARD prompt above.
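For reference, the generator script Claude writes might look roughly like this. A sketch only; the column values, weights, and output path follow the spec above:

```python
import csv
import os
import random

random.seed(42)  # reproducible demo data
SEGMENTS = [("Enterprise", 0.2), ("SMB", 0.5), ("Individual", 0.3)]
COUNTRIES = [("US", 0.6), ("UK", 0.2), ("CA", 0.1), ("DE", 0.1)]

def pick(weighted):
    values, weights = zip(*weighted)
    return random.choices(values, weights=weights)[0]

os.makedirs("demo/sample_data", exist_ok=True)
with open("demo/sample_data/customer_demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "name", "email", "segment", "country"])
    for i in range(1, 101):
        # ~10% of emails left empty per the spec
        email = f"user{i}@example.com" if random.random() > 0.10 else ""
        writer.writerow([f"CUST-{i:03d}", f"Customer {i}", email, pick(SEGMENTS), pick(COUNTRIES)])
```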
Need: Connect orders to customers via foreign key
Add relationship between workloads:
Source: order_transactions
Target: customer_master
Relationship:
- FK: orders.customer_id → customers.customer_id
- Cardinality: many-to-one
- Description: "Each order belongs to one customer"
Integrity:
- Expected validity: 98%
- Orphan handling: QUARANTINE
Update semantic.yaml, scripts, DAG, tests.
What happens: Claude adds FK validation, updates configs, adds tests for referential integrity.
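The generated FK validation boils down to a membership check plus quarantine. A minimal pandas sketch of the idea; table paths and quarantine location are illustrative:

```python
import pandas as pd  # plus pyarrow for parquet support

orders = pd.read_parquet("silver/orders.parquet")        # illustrative path
customers = pd.read_parquet("silver/customers.parquet")  # illustrative path

# Validity: share of orders whose customer_id exists in the target table
valid = orders["customer_id"].isin(customers["customer_id"])
print(f"FK validity: {valid.mean():.1%} (expected >= 98%)")

# Orphan handling = QUARANTINE: set aside rows whose FK has no match
orphans = orders[~valid]
orphans.to_parquet("orphan_orders.parquet")  # quarantine location is illustrative
```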
Need: Business users want to visualize the data
Create QuickSight dashboard: [NAME]
Data sources:
- Dataset 1: [TABLE_NAME], mode [SPICE/DIRECT_QUERY]
Visuals:
1. [VISUAL_NAME]: Type [KPI/Bar/Line/Pie], measures [AGGREGATIONS]
2. ...
Permissions: [IAM users/groups]
Create dashboard and return URL.
What happens: Claude creates QuickSight data source, datasets, dashboard with all visuals, and grants permissions.
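Once created, you can confirm the dashboard built successfully with boto3. A sketch; the dashboard ID is a placeholder returned by the creation step, and the URL shown is the typical console pattern rather than an API response:

```python
import boto3

qs = boto3.client("quicksight", region_name="us-east-1")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Dashboard ID comes from the creation step; "customer-360" is a placeholder
dash = qs.describe_dashboard(AwsAccountId=account_id, DashboardId="customer-360")
print("status:", dash["Dashboard"]["Version"]["Status"])  # expect CREATION_SUCCESSFUL

# Typical console URL pattern for a published dashboard
print("url: https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/customer-360")
```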
Need: Compliance team needs lineage documentation
Analyze data lineage for [WORKLOAD_NAME]:
Provide:
1. Source → Bronze → Silver → Gold flow
2. FK relationships
3. Column-level transformations
4. Quality scores
Generate data_product_catalog.yaml and lineage diagram.
What happens: Claude creates lineage diagrams, relationship graphs, and structured metadata for governance.
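The catalog itself can be assembled mechanically from the per-workload configs. A sketch of the idea, assuming the `semantic.yaml` layout used above (key names are illustrative):

```python
from pathlib import Path

import yaml  # pip install pyyaml

catalog = {}
for workload in sorted(Path("workloads").iterdir()):
    semantic = workload / "config" / "semantic.yaml"  # assumed layout
    if not semantic.exists():
        continue
    spec = yaml.safe_load(semantic.read_text())
    catalog[workload.name] = {
        "flow": "source -> bronze -> silver -> gold",
        "relationships": spec.get("relationships", []),  # FKs documented by ENRICH
    }

Path("data_product_catalog.yaml").write_text(yaml.safe_dump(catalog))
```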
Need: Team wants cloud-hosted MCP tools without local Python/uv setup, but agent stays on laptop
1. Deploy all 13 MCP servers to Agentcore Gateway:
→ Run prompts/09-deploy-agentcore-gateway.md
2. Switch to Gateway tools:
→ Replace .mcp.json with .mcp.gateway.json
3. Onboard data as usual:
→ Run prompts/03-onboard-build-pipeline.md
What happens:
- Gateway: All 13 MCP servers (4 custom FastMCP + 9 PyPI) hosted in cloud, each with least-privilege IAM
- Agent: Runs in Claude Code on your laptop (human-in-the-loop)
- Sub-agents: Spawned locally via Claude Code
- Agent tools: Same onboarding workflow as before -- only the tool transport changes (stdio to SSE)
To revert to fully local: git checkout .mcp.json
Need: Agent accessible via API for production pipelines, integrations, or multi-user access
1. Deploy all 13 MCP servers to Agentcore Gateway:
→ Run prompts/09-deploy-agentcore-gateway.md (if not already deployed)
2. Deploy agent to Agentcore Runtime:
→ Run prompts/10-deploy-agentcore-runtime.md
3. Invoke agent via API:
→ aws bedrock-agent-runtime invoke-agent --agent-id {ID} --input-text "Onboard..."
What happens:
- Gateway: Same Gateway as Scenario 5a (deployed once, shared by both modes)
- Runtime: Data Onboarding Agent accessible via API, connected to all 13 Gateway tools, with persistent memory
- Human-in-the-loop: Optional -- agent can run autonomously or pause for approval via API
See prompts/environment-setup-agent/agentcore/README.md for architecture details.
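The CLI call above maps onto boto3 like this. A sketch; the agent and alias IDs are placeholders from the Runtime deployment step:

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_agent(
    agentId="AGENT_ID",             # placeholder from deployment
    agentAliasId="AGENT_ALIAS_ID",  # placeholder from deployment
    sessionId="onboarding-session-1",
    inputText="Onboard customer_master from s3://bucket/crm/customers.csv",
)

# The agent's reply streams back as chunk events
for event in response["completion"]:
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"), end="")
```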
Need: Deploy the DAG and dependencies to Amazon Managed Workflows for Apache Airflow
Option 1: Using the deployment script
```bash
# Deploy workload to MWAA S3 bucket
python3 deploy_to_aws.py --mwaa-bucket=my-mwaa-bucket-name --workload=customer_master

# The script will:
# 1. Upload DAG file to s3://my-mwaa-bucket-name/dags/
# 2. Sync shared utilities to s3://my-mwaa-bucket-name/plugins/
# 3. Upload Glue scripts to configured S3 location
# 4. Verify all files are in place
```

Option 2: Manual deployment
```bash
# Upload DAG
aws s3 cp workloads/customer_master/dags/customer_master_dag.py \
  s3://my-mwaa-bucket-name/dags/

# Sync shared utilities
aws s3 sync shared/ s3://my-mwaa-bucket-name/plugins/shared/ \
  --exclude "*.pyc" --exclude "__pycache__/*"
```

Set Airflow Variables (in MWAA UI or via CLI):

```json
{
  "glue_script_s3_path": "s3://my-glue-scripts-bucket/scripts/",
  "glue_iam_role": "arn:aws:iam::123456789012:role/GlueJobRole",
  "aws_account_id": "123456789012",
  "kms_key_alias": "alias/data-pipeline-key"
}
```
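Inside the generated DAG, these settings come back through Airflow's Variable API. A rough sketch of what that looks like; the job and script names are illustrative, not the generated code itself:

```python
from airflow.models import Variable
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

bronze_to_silver = GlueJobOperator(
    task_id="bronze_to_silver",
    job_name="customer_master_bronze_to_silver",  # illustrative job name
    script_location=Variable.get("glue_script_s3_path") + "bronze_to_silver.py",
    # The operator expects a role name, while the variable stores an ARN
    iam_role_name=Variable.get("glue_iam_role").split("/")[-1],
)
```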
Verify:
- Open MWAA Airflow UI
- Check that `customer_master_dag` appears in the DAG list
- Unpause the DAG
- Trigger a manual run to test
What happens: Your DAG is deployed to MWAA and ready to run on the configured schedule.
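To trigger the DAG programmatically rather than from the UI, MWAA exposes the Airflow CLI behind a token-authenticated endpoint. A sketch of that documented pattern; the environment name is a placeholder:

```python
import base64

import boto3
import requests

mwaa = boto3.client("mwaa", region_name="us-east-1")
token = mwaa.create_cli_token(Name="my-mwaa-environment")  # placeholder environment name

# POST an Airflow CLI command to the MWAA web server
resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={"Authorization": f"Bearer {token['CliToken']}", "Content-Type": "text/plain"},
    data="dags trigger customer_master_dag",
)
result = resp.json()
print(base64.b64decode(result["stdout"]).decode("utf-8"))
```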
Use these in order for a complete data onboarding:
- ROUTE: Check Existing ✅ Always first
- ONBOARD: Build Pipeline 📥 Create pipeline
- ENRICH: Link Datasets 🔗 Link datasets (optional)
- CONSUME: Create Dashboard 📊 Visualize (optional)
- GOVERN: Trace Lineage 📋 Document (optional)
Plus:
- GENERATE: Create Data 🎲 For demos/testing
Don't specify every detail on first try. Start with basic info:
Minimal prompt (works fine):
Onboard customer data from s3://bucket/customers.csv
Format: CSV
Frequency: Daily
Cleaning: Dedupe on customer_id, mask email/phone
Gold: Star schema for reporting
Claude will ask clarifying questions for anything missing.
Critical details (always specify):
- Source location (exact path)
- Format (CSV/JSON/Parquet)
- Key column (for deduplication)
- PII columns (for masking)
- Quality threshold (80% Silver, 95% Gold)
- Schedule (cron expression)
Less critical (Claude can infer or use defaults):
- Exact partitioning strategy
- Retry counts
- Alert recipients
- SLA minutes
Don't write prompts from scratch:
- Open `prompts/examples.md`
- Find a similar scenario (e.g., "CSV from S3, daily batch")
- Copy the whole prompt
- Replace placeholders with your values
First onboarding doesn't need to be perfect:
- Get basic pipeline working
- Run tests, see what fails
- Refine transformations/quality rules
- Re-run until all tests pass
Important: Regulation-specific prompts are NOT loaded by default. Only use them when compliance is explicitly required.
During discovery (Phase 1), if the user mentions compliance requirements:
Does this data require regulatory compliance? (GDPR, CCPA, HIPAA, SOX, PCI DSS)
If YES, load the appropriate prompt from prompts/regulation/:
- `prompts/regulation/gdpr.md` — GDPR (EU data protection)
- `prompts/regulation/ccpa.md` — CCPA (California privacy)
- `prompts/regulation/hipaa.md` — HIPAA (healthcare data)
- `prompts/regulation/sox.md` — SOX (financial reporting)
- `prompts/regulation/pci_dss.md` — PCI DSS (payment card data)
These prompts add:
- Mandatory data residency controls
- Enhanced encryption and access controls
- Audit trail requirements
- Data retention and deletion policies
- Consent tracking (GDPR/CCPA)
- Field-level encryption (HIPAA/PCI DSS)
Example:
User: "We need to onboard patient records"
Claude: "Does this data require HIPAA compliance?"
User: "Yes"
Claude: [loads prompts/regulation/hipaa.md] → adds PHI encryption, audit logging, access controls
```bash
# Read the test output carefully
pytest tests/unit/test_transformations.py -v

# Tests tell you exactly what's wrong:
# - "FK integrity: expected 98%, got 85%" → Need better data or lower threshold
# - "Schema mismatch: missing column revenue" → Add revenue calculation to transform script
# - "Quality score 89%, threshold 95%" → Need stricter data validation
```

Fix one test at a time, re-run, repeat.
```bash
# Run the DAG file directly to see Python errors
python3 workloads/my_dataset/dags/my_dataset_dag.py

# Common issues:
# - Missing imports: pip install apache-airflow-providers-amazon
# - Typo in operator name: GlueJobOperator (not AwsGlueJobOperator)
# - Invalid cron: Use "0 6 * * *" not "6am daily"
```
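A DagBag import test catches these errors before deployment. A minimal sketch, assuming Airflow is installed locally and the paths match your workload:

```python
from airflow.models import DagBag

def test_dag_imports_cleanly():
    # Parse only this workload's DAG folder, skipping Airflow's bundled examples
    bag = DagBag(dag_folder="workloads/my_dataset/dags", include_examples=False)
    assert not bag.import_errors, f"DAG import errors: {bag.import_errors}"
    assert "my_dataset_dag" in bag.dag_ids
```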
Be more specific in your prompt. Compare:

Vague → Many questions:
Onboard customer data
Specific → Few questions:
Onboard customer_master from s3://bucket/crm/customers.csv
- CSV, daily at 6am
- Columns: customer_id, name, email (PII), phone (PII), segment, status
- Dedupe on customer_id
- Mask email/phone in Silver
- Star schema in Gold (fact: customer_activity, dims: customer, geography)
- Quality: 95% threshold
Read workloads/[NAME]/config/semantic.yaml and workloads/[NAME]/scripts/transform/bronze_to_silver.py
Then: Update the [WHAT YOU WANT TO CHANGE] to [NEW BEHAVIOR]
Claude will edit existing files instead of creating new ones.
Do:
- Always run ROUTE (Check Existing) first
- Specify PII columns for masking
- Set quality thresholds (80% Silver, 95% Gold)
- Use Secrets Manager for credentials
- Test with small data first (100 rows)
- Document relationships in semantic.yaml
Don't:
- Skip ROUTE and create duplicate workloads
- Hardcode secrets in code or config
- Skip quality rules (bad data will reach Gold)
- Use Bronze for queries (use Silver/Gold)
- Modify Bronze zone data (it's immutable)
For your first onboarding:
- Open `prompts/examples.md`
- Find an example similar to your data
- Copy the ROUTE prompt, fill in your details, send to Claude
- If not found, copy the ONBOARD prompt, fill in, send
- Wait for pipeline generation (~5-10 minutes)
- Run tests: `pytest workloads/[NAME]/tests/ -v`
- If tests pass, you're done! Deploy to AWS with `deploy_to_aws.py`
For ongoing work:
- Keep the `prompts/` folder open for copy-paste templates
- Use GENERATE to create demo data for testing
- Use ENRICH to document relationships between datasets
- Use CONSUME to create dashboards for stakeholders
- Use GOVERN to generate lineage docs for governance
- Load `prompts/regulation/` only when compliance is required
Need help?
- See detailed examples: `prompts/examples.md`
- See full documentation: `SKILLS.md` → Modular Prompt Patterns
- See architecture: `CLAUDE.md`
- See deployment guide: `docs/aws-account-setup.md`
You now have:
- ✅ 6 reusable prompt patterns
- ✅ Copy-paste templates
- ✅ Real-world examples
- ✅ Troubleshooting guide
Start with ROUTE (Check Existing) and onboard your first dataset!