Cloud Native Architect & Agentic AI Platform Engineer
- Seoul, Korea
- LinkedIn: @youngjoonjeong
- Engineering Playbook: devfloor9.github.io/engineering-playbook
- GitHub: @devfloor9
Building production-grade cloud native platforms at the intersection of Kubernetes and Generative AI. I specialize in Amazon EKS architecture optimization, GPU infrastructure for large-scale model serving, and designing end-to-end Agentic AI platforms that actually work in production.
My work focuses on turning complex infrastructure challenges into battle-tested engineering patterns, from Cilium networking and Karpenter autoscaling to vLLM/llm-d distributed inference, AWS Neuron/Trainium2 serving, and MCP-based agent orchestration. Beyond infrastructure, I am open-sourcing the AIDLC (AI Development Lifecycle) × AgenticOps methodology itself, so that the design → construction → operations loop can be closed by agents rather than by humans at every step. Everything I build gets documented in the Engineering Playbook, an open knowledge base for the cloud native community.
- Deep expertise in Amazon EKS: cluster architecture, networking (Cilium, Gateway API), autoscaling (Karpenter), and operational excellence
- Node strategy across EKS Auto Mode / Karpenter / Managed Node Groups + DRA, including the incompatibility between DRA and Karpenter/Auto Mode and MNG-based fallbacks
- Production experience with hybrid infrastructure: EKS Hybrid Nodes, SR-IOV with DGX H200, on-prem/cloud bridging
- GitOps-driven operations with Argo CD, infrastructure as code, and automated deployment pipelines
- Performance optimization: CNI benchmarking, East-West traffic tuning, CoreDNS optimization
- End-to-end Agentic AI platform design on Kubernetes, from GPU resource management to agent observability
- Model serving at scale with vLLM, llm-d, NVIDIA Dynamo, NeMo Framework, and MoE architectures
- AWS Neuron / Trainium2 / Inferentia2 serving via vLLM + optimum-neuron (Llama 4 Scout/Maverick, BF16)
- Inference Gateway routing, KV Cache-aware scheduling, disaggregated serving, LWS multi-node, Bifrost → Bedrock fallback
- Cascade Routing tuning and semantic caching across 3 layers (KV / Prompt / Semantic)
- Self-Improving Agent Loop + continuous training pipeline (Langfuse trace → Ragas/LLM-as-a-Judge → GRPO/DPO → kgateway canary)
- Multi-Agent Collaboration patterns (LangGraph / CrewAI / AutoGen / Strands, Orchestrator/Voting/Debate/Supervisor)
- RAG pipelines with Milvus / Qdrant vector databases, evaluation with RAGAS + Langfuse Custom Evaluators
- Amazon Bedrock AgentCore and MCP (Model Context Protocol) integration for enterprise agent deployments
- AIDLC × AgenticOps: open-source methodology for closed-loop agentic delivery, with design, construction, and operations automated end-to-end (see oh-my-aidlcops)
- Regulatory Compliance: EU AI Act, NIST AI RMF 1.1, ISO/IEC 42001, Korean AI Basic Act, ISMS-P, and financial supervisory regulations
- Agent Versioning: prompt registry, Shadow / Canary / A-B / Blue-Green release patterns
- AI Gateway Guardrails: input/output guards with Guardrails AI, NeMo Guardrails, Llama Guard 3, Bedrock Guardrails
- Identity-First Security with EKS Pod Identity, IRSA, and Keycloak OIDC + OAuth2-Proxy two-tier flows
- Runtime threat detection with GuardDuty Extended Threat Detection
- Policy-as-code with Kyverno, supply chain security, and compliance automation
- Lead author on the AWS Workshop Studio EKS GenAI workshop (LLMs → scalable agent systems) and maintainer of the upstream starter kit it runs on
Container & Orchestration: Amazon EKS, EKS Auto Mode, Karpenter, MNG + DRA, Helm, Argo CD
Networking: Cilium, Gateway API, kgateway, CoreDNS, ALB/NLB, CloudFront + WAF + Shield
AI/ML Serving: vLLM, SGLang, llm-d, NVIDIA Dynamo, NeMo Framework, AWS Neuron / Trainium2 / Inferentia2, Amazon Bedrock AgentCore
AI Gateway: kgateway + agentgateway, Bifrost, LiteLLM (with Team RBAC), OpenClaw
AI Agents: Kagent, MCP (Model Context Protocol), LangGraph, CrewAI, AutoGen, Strands
Auth & Identity: Keycloak, OAuth2-Proxy (OIDC), EKS Pod Identity, IRSA, External Secrets Operator
GPU Management: NVIDIA GPU Operator, DCGM, DRA, MIG, KAI Scheduler, NIXL, CRIU (experimental)
MLOps: Kubeflow, MLflow, KServe, SageMaker
Observability: Prometheus, Grafana, AMP/AMG + ADOT, Langfuse, Hubble, OpenTelemetry
Security: Kyverno, GuardDuty, EKS Pod Identity, Harbor
Vector DB & Evaluation: Milvus, Qdrant, RAGAS, LLM-as-a-Judge, Langfuse Custom Evaluators
IaC & GitOps: CDK, Terraform, Argo CD, GitHub Actions
oh-my-aidlcops: AIDLC × AgenticOps plugin marketplace
Sibling project to oh-my-claudecode. A Claude Code / Kiro compatible marketplace that layers AgenticOps (self-improving loop, autopilot-deploy, incident response, continuous evaluation, cost governance) on top of the AWS-official AIDLC workflows, so the Inception → Construction → Operations lifecycle closes itself. Ships four plugins (agentic-platform, agenticops, aidlc-inception, aidlc-construction) and five Tier-0 workflows (/oma:autopilot, /oma:aidlc-loop, /oma:agenticops, /oma:self-improving, /oma:platform-bootstrap). Built on awslabs/mcp hosted MCP servers. Phase 1 MVP.
engineering-playbook: Cloud Native & Agentic AI engineering reference
77+ production-tested guides across 7 tracks, spanning Agentic AI platforms, EKS best practices, AIDLC/AgenticOps, hybrid infrastructure, security governance, ROSA, and quantitative benchmark reports. Trilingual (Korean / English / Chinese), Docusaurus 3.9.2. The April 2026 restructure introduced a 3-tier hierarchy (foundations / platform-selection / advanced-patterns), the Self-Improving Loop ADR, Continuous Training Pipeline, CRIU GPU Migration, Cascade Routing Tuning, AI Gateway Guardrails, and AWS Neuron Stack documents. Live at devfloor9.github.io/engineering-playbook.
aws-samples/sample-genai-on-eks-starter-kit: core maintainer (MEMBER)
The starter kit that powers the EKS GenAI workshop. Recent contributions:
- PR #131 (in review): full demo stack: Code Review MCP (RAG with TEI + Qdrant) + Keycloak OIDC + OAuth2-Proxy + kGateway 2-tier ingress with TLS + LiteLLM Team RBAC + per-user Langfuse tracing via OpenWebUI forward-user-info headers
- PR #141: restore Langfuse `session_id`/`tags` on LangChain traces (Langfuse Python SDK v3 moved these out of `RunnableConfig["tags"]` into `config["metadata"]`), unblocking Module 4 Custom Evaluators
- PR #127 / #126: External Secrets Operator + Amazon Managed Prometheus & Grafana (AMP/AMG) with ADOT collector
- PR #87: OpenClaw AI Agent Orchestrator component, new AI Agent category + DevOps and Doc Writer example agents
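The PR #141 change can be illustrated with a minimal sketch. It is not the code from the PR. It assumes the metadata key names `langfuse_session_id` and `langfuse_tags`, which is how I understand the Langfuse v3 LangChain integration to read trace attributes; the helper name `langfuse_run_config` and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Optional


def langfuse_run_config(
    session_id: str,
    tags: List[str],
    callbacks: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a RunnableConfig-style dict (hypothetical helper).

    Trace attributes go under config["metadata"] instead of the
    top-level "tags" key. The "langfuse_*" key names are an assumption
    based on the Langfuse v3 LangChain integration.
    """
    return {
        # A Langfuse CallbackHandler would normally be passed in here.
        "callbacks": list(callbacks or []),
        "metadata": {
            "langfuse_session_id": session_id,
            "langfuse_tags": list(tags),
        },
    }


# Sample values are illustrative only.
config = langfuse_run_config("workshop-session-1", ["module-4", "custom-evaluator"])
print(config["metadata"])
```

The resulting dict would be passed as `config=` to a chain's `invoke` call, with a Langfuse callback handler supplied through `callbacks`.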
awslabs/ai-on-eks + ai-on-eks-charts: upstream contributor
AWS official AI on EKS guides and Helm charts. Contributed the Llama 4 (Scout & Maverick) on Trainium2 vLLM deployment guide (ai-on-eks#276) and matching Helm values (ai-on-eks-charts#12): Karpenter-provisioned trn2.48xlarge, AWS Neuron DLC (optimum-neuron 0.4.5, Neuron SDK 2.26.1), tensor-parallel-size 16 across all 16 Trainium chips, BF16 inference without quantization, ~20 min end-to-end deploy.
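As a rough sketch of the serving parameters described above, the snippet below assembles a `vllm serve` command line with a tensor-parallel size of 16 and BF16 weights. It does not reproduce the Helm values from ai-on-eks-charts#12. The helper name `vllm_serve_args` and the Hugging Face model ID are assumptions for illustration, and Neuron-specific settings are deliberately left out.

```python
import shlex
from typing import List


def vllm_serve_args(
    model: str,
    tensor_parallel_size: int = 16,
    dtype: str = "bfloat16",
) -> List[str]:
    """Assemble a vLLM server command line (hypothetical helper)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be a positive integer")
    return [
        "vllm", "serve", model,
        # One shard per Trainium chip: 16 on a trn2.48xlarge.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # BF16 inference, no quantization.
        "--dtype", dtype,
    ]


# The model ID is an assumption; substitute the checkpoint you deploy.
args = vllm_serve_args("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(shlex.join(args))
```

In a Helm-based deployment these values would typically surface as container arguments rather than being built in Python; the list form simply makes the two key settings explicit.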
EKS Best Practices: Cilium ENI, Gateway API, CoreDNS, Karpenter, cost management, control plane scaling, Pod Identity
Operations & Observability: GitOps (Argo CD), node monitoring, EKS debugging & resiliency, Pod health lifecycle
Agentic AI Platform: EKS GPU node strategy (Auto Mode / Karpenter / MNG / DRA), vLLM, llm-d, NVIDIA Dynamo, MoE serving, AWS Neuron stack, Inference Optimization (KV Cache-aware / disaggregated serving / LWS multi-node / Bifrost → Bedrock fallback), LLM Gateway 2-tier (kgateway + agentgateway + Bifrost), Kagent, Milvus, RAGAS
AIDLC & AgenticOps: AIDLC framework principles, Ontology × Harness, Built-in/Org/Domain extensions, evaluation framework, multi-agent collaboration, agent versioning, regulatory compliance (EU AI Act / NIST AI RMF / ISO 42001 / Korean AI Basic Act), AgenticOps observability and DevOps Guru predictive operations
Hybrid Infrastructure: EKS Hybrid Nodes, SR-IOV with DGX H200, file storage strategy, Harbor registry integration
Security & Governance: Identity-First Security, GuardDuty Extended Threat Detection, Kyverno, supply chain security, AI Gateway guardrails
ROSA: Red Hat OpenShift on AWS installation, security, and compliance
Benchmark Reports: CNI performance (VPC CNI vs Cilium), Gateway API implementation, AI/ML workloads, AgentCore vs EKS self-hosted inference, NVIDIA Dynamo, hybrid infrastructure, security operations


