Getting Started: Reusable Data Onboarding Prompts

Welcome! This guide helps you onboard any dataset using modular, copy-paste prompts.


📚 Documentation Overview

Your data onboarding system now has reusable prompt patterns you can use for any dataset:

  • prompts/00-setup-environment.md — 🏗️ First-time AWS setup. Run ONCE after cloning the repo into a new AWS account.
  • prompts/ (01-route through 06-govern) — 📋 Copy-paste templates. Quick lookup for prompt structure.
  • prompts/examples.md — 📝 Filled-out examples. Real-world usage with full details.
  • prompts/regulation/ — 🔒 Regulation-specific controls. Use when GDPR, CCPA, HIPAA, SOX, or PCI DSS compliance is required.
  • SKILLS.md (bottom section) — 📖 Full documentation. Deep dive into each pattern.
  • CLAUDE.md — 🏗️ Architecture reference. Understand the system design.
  • deploy_to_aws.py — 🚀 Deployment script. Deploy a workload to AWS (Glue, MWAA, QuickSight).

🏗️ First Time? Set Up Your AWS Environment

If this is a fresh clone into a new AWS account, run the setup prompt first:

Setup AWS environment for the Agentic Data Onboarding platform.

Account details:
- AWS Region: us-east-1
- Project name: data-onboarding
- Environment: dev

What I need created:
- [x] IAM roles
- [x] S3 data lake bucket
- [x] KMS encryption keys
- [x] Glue databases
- [x] Lake Formation LF-Tags
- [x] Lake Formation TBAC grants

Existing resources: none

This creates all AWS prerequisites (IAM roles, S3 bucket, KMS keys, Glue databases, LF-Tags) interactively. See prompts/00-setup-environment.md for full details.
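
To sanity-check the setup afterwards, a few read-only CLI calls are enough (resource names below assume the defaults from the prompt above):

# Illustrative post-setup checks
aws s3 ls | grep data-onboarding                          # data lake bucket exists
aws glue get-databases --query "DatabaseList[].Name"      # Glue databases created
aws lakeformation list-lf-tags --query "LFTags[].TagKey"  # LF-Tags registered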

Multi-account deployment: The setup defaults to single-account. If you need the Glue catalog + Lake Formation in one account ("Account A") and Glue jobs + MWAA + S3 in a consumer account ("Account B"), see multi-account-deployment.md — the setup prompt will ask a single-vs-multi question and wire catalog_account_id + sts:AssumeRole across generated artifacts.
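
For orientation, multi-account mode ends up with a config fragment along these lines (keys other than catalog_account_id are illustrative, not the generator's actual schema):

# Hypothetical multi-account config fragment
catalog_account_id: "111111111111"      # Account A: Glue catalog + Lake Formation
workload_account_id: "222222222222"     # Account B: Glue jobs, MWAA, S3
catalog_access_role_arn: "arn:aws:iam::111111111111:role/CatalogAccessRole"  # assumed via sts:AssumeRole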


🚀 Quick Start: Onboard Your First Dataset

Step 1: Check if it already exists

Check if data from [YOUR_DATA_DESCRIPTION] has already been onboarded.

Source details:
- Location: [S3_PATH or DATABASE.TABLE]
- Format: [CSV/JSON/Parquet]
- Description: [Brief description]

Report: existing workload status or confirm new data.

What happens: Claude searches all existing workloads and tells you if this data is already onboarded.


Step 2: Onboard the dataset

If Step 1 says "not found", use this prompt:

Onboard new dataset: [DATASET_NAME]

Source:
- Type: [S3/Database/API]
- Location: [FULL_PATH]
- Format: [CSV/JSON/Parquet]
- Frequency: [Daily/Hourly/One-time]
- Credentials: [AWS Secrets Manager ARN]

Schema:
- column1: type, description, role (measure/dimension/identifier)
- column2: type, description, role

Bronze:
- Keep raw format: YES
- Retention: [DAYS]

Silver:
- Cleaning: [Dedupe on KEY, handle nulls, type casting]
- PII masking: [COLUMNS]
- Format: Apache Iceberg

Gold:
- Use case: [Reporting/Analytics/ML]
- Format: [Star Schema/Flat]
- Quality threshold: 95%

Quality Rules:
- Completeness: [Required columns non-null]
- Uniqueness: [Key must be unique]

Schedule:
- Frequency: [cron expression]
- SLA: [minutes]

Build complete pipeline with tests.

What happens: Claude creates:

  • Complete folder structure (workloads/[name]/ — see the layout sketch after this list)
  • Config files (source, semantic, transformations, quality, schedule)
  • Transformation scripts (Bronze→Silver→Gold)
  • Airflow DAG
  • Comprehensive tests (50+ tests)
  • README documentation
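
A typical generated layout looks roughly like this (illustrative; exact file names vary by workload):

workloads/[name]/
├── config/              # source, semantic, transformations, quality, schedule
├── scripts/transform/   # bronze_to_silver.py, silver_to_gold.py
├── dags/                # [name]_dag.py
├── tests/               # unit + integration tests
└── README.md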

Step 3: Validate it works

# Run tests
cd workloads/[YOUR_DATASET_NAME]
pytest tests/ -v

# Should see: 50+ tests passing ✓

# Check DAG
python3 dags/[your_dataset]_dag.py
# Should print: "DAG loaded successfully"

📖 Common Scenarios

Scenario 1: Demo with Synthetic Data

Need: Create a demo without access to real data

1. Generate synthetic data for customer_demo:

Rows: 100
Columns:
- customer_id: STRING, unique, CUST-001 format
- name: STRING, realistic names
- email: STRING, 10% nulls
- segment: ENUM, Enterprise 20%, SMB 50%, Individual 30%
- country: STRING, US 60%, UK 20%, CA 10%, DE 10%

Output: demo/sample_data/customer_demo.csv with generator script
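
For reference, a stdlib-only generator for this spec might look like the sketch below; the script Claude actually writes may differ:

import csv
import os
import random

random.seed(42)  # reproducible demo data
FIRST = ["Ava", "Ben", "Chloe", "Dan", "Elena", "Farid"]
LAST = ["Garcia", "Hughes", "Ito", "Khan", "Lopez", "Meyer"]

def pick(values, weights):
    """Weighted random choice, e.g. pick(["US", "UK"], [60, 40])."""
    return random.choices(values, weights=weights)[0]

os.makedirs("demo/sample_data", exist_ok=True)
with open("demo/sample_data/customer_demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "name", "email", "segment", "country"])
    for i in range(1, 101):  # 100 rows
        name = f"{random.choice(FIRST)} {random.choice(LAST)}"
        # email has 10% nulls per the spec
        email = "" if random.random() < 0.10 else f"{name.lower().replace(' ', '.')}@example.com"
        writer.writerow([
            f"CUST-{i:03d}",  # CUST-001 format
            name,
            email,
            pick(["Enterprise", "SMB", "Individual"], [20, 50, 30]),
            pick(["US", "UK", "CA", "DE"], [60, 20, 10, 10]),
        ])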

2. Onboard it with the ONBOARD prompt above.


Scenario 2: Link Two Datasets

Need: Connect orders to customers via foreign key

Add relationship between workloads:

Source: order_transactions
Target: customer_master

Relationship:
- FK: orders.customer_id → customers.customer_id
- Cardinality: many-to-one
- Description: "Each order belongs to one customer"

Integrity:
- Expected validity: 98%
- Orphan handling: QUARANTINE

Update semantic.yaml, scripts, DAG, tests.

What happens: Claude adds FK validation, updates configs, adds tests for referential integrity.
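
The resulting semantic.yaml entry might look roughly like this (field names are illustrative; defer to the repo's actual schema):

relationships:
  - name: orders_to_customers
    from: order_transactions.customer_id
    to: customer_master.customer_id
    cardinality: many_to_one
    description: "Each order belongs to one customer"
    integrity:
      expected_validity: 0.98
      orphan_handling: QUARANTINE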


Scenario 3: Create a Dashboard

Need: Business users want to visualize the data

Create QuickSight dashboard: [NAME]

Data sources:
- Dataset 1: [TABLE_NAME], mode [SPICE/DIRECT_QUERY]

Visuals:
1. [VISUAL_NAME]: Type [KPI/Bar/Line/Pie], measures [AGGREGATIONS]
2. ...

Permissions: [IAM users/groups]

Create dashboard and return URL.

What happens: Claude creates QuickSight data source, datasets, dashboard with all visuals, and grants permissions.
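
To confirm the result from the CLI (account ID is a placeholder):

aws quicksight list-dashboards --aws-account-id 123456789012 \
  --query "DashboardSummaryList[].{Name:Name,Id:DashboardId}"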


Scenario 4: Document for Audit

Need: Compliance team needs lineage documentation

Analyze data lineage for [WORKLOAD_NAME]:

Provide:
1. Source → Bronze → Silver → Gold flow
2. FK relationships
3. Column-level transformations
4. Quality scores

Generate data_product_catalog.yaml and lineage diagram.

What happens: Claude creates lineage diagrams, relationship graphs, and structured metadata for governance.
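
The catalog file might take a shape like this (keys and values are examples, not a fixed schema):

workload: customer_master
zones:
  bronze: { source: "s3://bucket/crm/customers.csv", format: csv }
  silver: { format: iceberg, pii_masked: [email, phone] }
  gold: { format: star_schema, quality_score: 0.97 }
relationships:
  - "orders.customer_id -> customers.customer_id (many-to-one)"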


Scenario 5a: Demo with Gateway (Cloud-Hosted Tools, Local Agent)

Need: Team wants cloud-hosted MCP tools without local Python/uv setup, but agent stays on laptop

1. Deploy all 13 MCP servers to Agentcore Gateway:
   → Run prompts/09-deploy-agentcore-gateway.md

2. Switch to Gateway tools:
   → Replace .mcp.json with .mcp.gateway.json

3. Onboard data as usual:
   → Run prompts/03-onboard-build-pipeline.md

What happens:

  • Gateway: All 13 MCP servers (4 custom FastMCP + 9 PyPI) hosted in the cloud, each with least-privilege IAM
  • Agent: Runs in Claude Code on your laptop (human-in-the-loop)
  • Sub-agents: Spawned locally via Claude Code Agent tool
  • Same onboarding workflow as before -- only the tool transport changes from stdio to SSE (see the sketch below)

To revert to fully local: git checkout .mcp.json
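
For orientation, the two files differ only in transport. Roughly (server name, script, and URL are placeholders -- check the generated .mcp.gateway.json for real values):

# .mcp.json (local) -- stdio transport
{ "mcpServers": { "glue-tools": { "command": "uv", "args": ["run", "glue_tools_server.py"] } } }

# .mcp.gateway.json (cloud) -- SSE transport
{ "mcpServers": { "glue-tools": { "type": "sse", "url": "https://<gateway-endpoint>/mcp" } } }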

Scenario 5b: Production with Runtime (Cloud-Hosted Tools + Cloud Agent)

Need: Agent accessible via API for production pipelines, integrations, or multi-user access

1. Deploy all 13 MCP servers to Agentcore Gateway:
   → Run prompts/09-deploy-agentcore-gateway.md (if not already deployed)

2. Deploy agent to Agentcore Runtime:
   → Run prompts/10-deploy-agentcore-runtime.md

3. Invoke agent via API:
   → aws bedrock-agent-runtime invoke-agent --agent-id {ID} --input-text "Onboard..."

What happens:

  • Gateway: Same Gateway as Scenario 5a (deployed once, shared by both modes)
  • Runtime: Data Onboarding Agent accessible via API, connected to all 13 Gateway tools, with persistent memory
  • Human-in-the-loop: Optional -- agent can run autonomously or pause for approval via API

See prompts/environment-setup-agent/agentcore/README.md for architecture details.


Scenario 6: Deploy to MWAA

Need: Deploy the DAG and dependencies to Amazon Managed Workflows for Apache Airflow

Option 1: Using the deployment script

# Deploy workload to MWAA S3 bucket
python3 deploy_to_aws.py --mwaa-bucket=my-mwaa-bucket-name --workload=customer_master

# The script will:
# 1. Upload DAG file to s3://my-mwaa-bucket-name/dags/
# 2. Sync shared utilities to s3://my-mwaa-bucket-name/plugins/
# 3. Upload Glue scripts to configured S3 location
# 4. Verify all files are in place

Option 2: Manual deployment

# Upload DAG
aws s3 cp workloads/customer_master/dags/customer_master_dag.py \
  s3://my-mwaa-bucket-name/dags/

# Sync shared utilities
aws s3 sync shared/ s3://my-mwaa-bucket-name/plugins/shared/ \
  --exclude "*.pyc" --exclude "__pycache__/*"

Set Airflow Variables (in MWAA UI or via CLI):

{
  "glue_script_s3_path": "s3://my-glue-scripts-bucket/scripts/",
  "glue_iam_role": "arn:aws:iam::123456789012:role/GlueJobRole",
  "aws_account_id": "123456789012",
  "kms_key_alias": "alias/data-pipeline-key"
}
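
Via the CLI, one route is the MWAA CLI token endpoint (environment name and variable values are placeholders):

# Request a short-lived token, then POST a raw Airflow CLI command
ENV=my-mwaa-environment
TOKEN=$(aws mwaa create-cli-token --name "$ENV" --query CliToken --output text)
HOST=$(aws mwaa create-cli-token --name "$ENV" --query WebServerHostname --output text)
curl -s -X POST "https://$HOST/aws_mwaa/cli" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: text/plain" \
  --data-raw "variables set aws_account_id 123456789012"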

Verify:

  1. Open MWAA Airflow UI
  2. Check that customer_master_dag appears in the DAG list
  3. Unpause the DAG
  4. Trigger a manual run to test

What happens: Your DAG is deployed to MWAA and ready to run on the configured schedule.


🎯 The Six Core Patterns

Use these in order for a complete data onboarding:

  1. ROUTE: Check Existing ✅ — always run this first
  2. ONBOARD: Build Pipeline 📥 — create the pipeline
  3. ENRICH: Link Datasets 🔗 — connect related datasets (optional)
  4. CONSUME: Create Dashboard 📊 — visualize the data (optional)
  5. GOVERN: Trace Lineage 📋 — document for governance (optional)

Plus:

  • GENERATE: Create Data 🎲 — synthesize data for demos/testing

📝 Customization Tips

1. Start Simple

Don't specify every detail on the first try. Start with the basics:

Minimal prompt (works fine):

Onboard customer data from s3://bucket/customers.csv

Format: CSV
Frequency: Daily
Cleaning: Dedupe on customer_id, mask email/phone
Gold: Star schema for reporting

Claude will ask clarifying questions for anything missing.

2. Be Specific Where It Matters

Critical details (always specify):

  • Source location (exact path)
  • Format (CSV/JSON/Parquet)
  • Key column (for deduplication)
  • PII columns (for masking)
  • Quality threshold (80% Silver, 95% Gold)
  • Schedule (cron expression)

Less critical (Claude can infer or use defaults):

  • Exact partitioning strategy
  • Retry counts
  • Alert recipients
  • SLA minutes

3. Copy from Examples

Don't write prompts from scratch:

  1. Open prompts/examples.md
  2. Find similar scenario (e.g., "CSV from S3, daily batch")
  3. Copy the whole prompt
  4. Replace placeholders with your values

4. Iterate

Your first onboarding doesn't need to be perfect:

  1. Get basic pipeline working
  2. Run tests, see what fails
  3. Refine transformations/quality rules
  4. Re-run until all tests pass

5. Load Regulation-Specific Prompts When Needed

Important: Regulation-specific prompts are NOT loaded by default. Only use them when compliance is explicitly required.

During discovery (Phase 1), if the user mentions compliance requirements:

Does this data require regulatory compliance? (GDPR, CCPA, HIPAA, SOX, PCI DSS)

If YES, load the appropriate prompt from prompts/regulation/:

  • prompts/regulation/gdpr.md — GDPR (EU data protection)
  • prompts/regulation/ccpa.md — CCPA (California privacy)
  • prompts/regulation/hipaa.md — HIPAA (healthcare data)
  • prompts/regulation/sox.md — SOX (financial reporting)
  • prompts/regulation/pci_dss.md — PCI DSS (payment card data)

These prompts add:

  • Mandatory data residency controls
  • Enhanced encryption and access controls
  • Audit trail requirements
  • Data retention and deletion policies
  • Consent tracking (GDPR/CCPA)
  • Field-level encryption (HIPAA/PCI DSS)

Example:

User: "We need to onboard patient records"
Claude: "Does this data require HIPAA compliance?"
User: "Yes"
Claude: [loads prompts/regulation/hipaa.md] → adds PHI encryption, audit logging, access controls

🛠️ Troubleshooting

"Tests are failing"

# Read the test output carefully
pytest tests/unit/test_transformations.py -v

# Tests tell you exactly what's wrong:
# - "FK integrity: expected 98%, got 85%" → Need better data or lower threshold
# - "Schema mismatch: missing column revenue" → Add revenue calculation to transform script
# - "Quality score 89%, threshold 95%" → Need stricter data validation

Fix one test at a time, re-run, repeat.
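
To focus on a single failure, run just that test (the test name here is illustrative):

pytest tests/unit/test_transformations.py::test_fk_integrity -v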

"DAG won't parse"

# Run the DAG file directly to see Python errors
python3 workloads/my_dataset/dags/my_dataset_dag.py

# Common issues:
# - Missing imports: pip install apache-airflow-providers-amazon
# - Typo in operator name: GlueJobOperator (not AwsGlueJobOperator)
# - Invalid cron: Use "0 6 * * *" not "6am daily"

"Claude asks too many questions"

Be more specific in your prompt. Compare:

Vague → Many questions:

Onboard customer data

Specific → Few questions:

Onboard customer_master from s3://bucket/crm/customers.csv
- CSV, daily at 6am
- Columns: customer_id, name, email (PII), phone (PII), segment, status
- Dedupe on customer_id
- Mask email/phone in Silver
- Star schema in Gold (fact: customer_activity, dims: customer, geography)
- Quality: 95% threshold

"I want to modify an existing workload"

Read workloads/[NAME]/config/semantic.yaml and workloads/[NAME]/scripts/transform/bronze_to_silver.py

Then: Update the [WHAT YOU WANT TO CHANGE] to [NEW BEHAVIOR]

Claude will edit existing files instead of creating new ones.


💡 Best Practices

✅ DO

  • Always run ROUTE (Check Existing) first
  • Specify PII columns for masking
  • Set quality thresholds (80% Silver, 95% Gold)
  • Use Secrets Manager for credentials
  • Test with small data first (100 rows)
  • Document relationships in semantic.yaml

❌ DON'T

  • Skip ROUTE and create duplicate workloads
  • Hardcode secrets in code or config
  • Skip quality rules (bad data will reach Gold)
  • Use Bronze for queries (use Silver/Gold)
  • Modify Bronze zone data (it's immutable)

📚 Next Steps

For your first onboarding:

  1. Open prompts/examples.md
  2. Find an example similar to your data
  3. Copy the ROUTE prompt, fill in your details, send to Claude
  4. If not found, copy the ONBOARD prompt, fill in, send
  5. Wait for pipeline generation (~5-10 minutes)
  6. Run tests: pytest workloads/[NAME]/tests/ -v
  7. If tests pass, you're done! Deploy to AWS with deploy_to_aws.py

For ongoing work:

  • Keep prompts/ folder open for copy-paste templates
  • Use GENERATE to create demo data for testing
  • Use ENRICH to document relationships between datasets
  • Use CONSUME to create dashboards for stakeholders
  • Use GOVERN to generate lineage docs for governance
  • Load prompts/regulation/ only when compliance is required

Need help?

  • See detailed examples: prompts/examples.md
  • See full documentation: SKILLS.md → Modular Prompt Patterns
  • See architecture: CLAUDE.md
  • See deployment guide: docs/aws-account-setup.md

🎉 You're Ready!

You now have:

  • ✅ 6 reusable prompt patterns
  • ✅ Copy-paste templates
  • ✅ Real-world examples
  • ✅ Troubleshooting guide

Start with ROUTE (Check Existing) and onboard your first dataset!