devfloor9/README.md

Cloud Native Architect & Agentic AI Platform Engineer

Summary

Building production-grade cloud native platforms at the intersection of Kubernetes and Generative AI. I specialize in Amazon EKS architecture optimization, GPU infrastructure for large-scale model serving, and designing end-to-end Agentic AI platforms that actually work in production.

My work focuses on turning complex infrastructure challenges into battle-tested engineering patterns, from Cilium networking and Karpenter autoscaling to vLLM/llm-d distributed inference, AWS Neuron/Trainium2 serving, and MCP-based agent orchestration. Beyond infrastructure, I am open-sourcing the AIDLC (AI-Driven Development Lifecycle) × AgenticOps methodology itself, so the design → construction → operations loop can be closed by agents, not humans at every step. Everything I build gets documented in the Engineering Playbook, an open knowledge base for the cloud native community.

Key Skills

Cloud Native Infrastructure

  • Deep expertise in Amazon EKS: cluster architecture, networking (Cilium, Gateway API), autoscaling (Karpenter), and operational excellence
  • Node strategy across EKS Auto Mode / Karpenter / Managed Node Groups + DRA, including the DRA ↔ Karpenter/Auto Mode incompatibility and MNG-based fallbacks
  • Production experience with hybrid infrastructure: EKS Hybrid Nodes, SR-IOV with DGX H200, on-prem/cloud bridging
  • GitOps-driven operations with Argo CD, infrastructure as code, and automated deployment pipelines
  • Performance optimization: CNI benchmarking, East-West traffic tuning, CoreDNS optimization

Agentic AI & GPU Platform Engineering

  • End-to-end Agentic AI platform design on Kubernetes, from GPU resource management to agent observability
  • Model serving at scale with vLLM, llm-d, NVIDIA Dynamo, NeMo Framework, and MoE architectures
  • AWS Neuron / Trainium2 / Inferentia2 serving via vLLM + optimum-neuron (Llama 4 Scout/Maverick, BF16)
  • Inference Gateway routing, KV Cache-aware scheduling, disaggregated serving, LWS multi-node, Bifrost → Bedrock fallback
  • Cascade Routing tuning and semantic caching across 3 layers (KV / Prompt / Semantic)
  • Self-Improving Agent Loop + continuous training pipeline (Langfuse trace → Ragas/LLM-as-a-Judge → GRPO/DPO → kgateway canary)
  • Multi-Agent Collaboration patterns (LangGraph / CrewAI / AutoGen / Strands, Orchestrator/Voting/Debate/Supervisor)
  • RAG pipelines with Milvus / Qdrant vector databases, evaluation with RAGAS + Langfuse Custom Evaluators
  • Amazon Bedrock AgentCore and MCP (Model Context Protocol) integration for enterprise agent deployments

Enterprise AI Platform

  • AIDLC × AgenticOps: open-source methodology for closed-loop agentic delivery, with design, construction, and operations automated end-to-end (see oh-my-aidlcops)
  • Regulatory Compliance: EU AI Act, NIST AI RMF 1.1, ISO/IEC 42001, Korean AI Basic Act, ISMS-P, and financial supervisory regulations
  • Agent Versioning: prompt registry, Shadow / Canary / A-B / Blue-Green release patterns
  • AI Gateway Guardrails: input/output guards with Guardrails AI, NeMo Guardrails, Llama Guard 3, Bedrock Guardrails

Security & Governance

  • Identity-First Security with EKS Pod Identity, IRSA, and Keycloak OIDC + OAuth2-Proxy two-tier flows
  • Runtime threat detection with GuardDuty Extended Threat Detection
  • Policy-as-code with Kyverno, supply chain security, and compliance automation

Workshop & Technical Writing

  • Lead author on the AWS Workshop Studio EKS GenAI workshop (LLMs → scalable agent systems) and maintainer of the upstream starter kit it runs on

Tech Stack

Container & Orchestration: Amazon EKS, EKS Auto Mode, Karpenter, MNG + DRA, Helm, Argo CD

Networking: Cilium, Gateway API, kgateway, CoreDNS, ALB/NLB, CloudFront + WAF + Shield

AI/ML Serving: vLLM, SGLang, llm-d, NVIDIA Dynamo, NeMo Framework, AWS Neuron / Trainium2 / Inferentia2, Amazon Bedrock AgentCore

AI Gateway: kgateway + agentgateway, Bifrost, LiteLLM (with Team RBAC), OpenClaw

AI Agents: Kagent, MCP (Model Context Protocol), LangGraph, CrewAI, AutoGen, Strands

Auth & Identity: Keycloak, OAuth2-Proxy (OIDC), EKS Pod Identity, IRSA, External Secrets Operator

GPU Management: NVIDIA GPU Operator, DCGM, DRA, MIG, KAI Scheduler, NIXL, CRIU (experimental)

MLOps: Kubeflow, MLflow, KServe, SageMaker

Observability: Prometheus, Grafana, AMP/AMG + ADOT, Langfuse, Hubble, OpenTelemetry

Security: Kyverno, GuardDuty, EKS Pod Identity, Harbor

Vector DB & Evaluation: Milvus, Qdrant, RAGAS, LLM-as-a-Judge, Langfuse Custom Evaluators

IaC & GitOps: CDK, Terraform, Argo CD, GitHub Actions

Featured Projects

oh-my-aidlcops: AIDLC × AgenticOps plugin marketplace

Sibling project to oh-my-claudecode. A Claude Code / Kiro compatible marketplace that layers AgenticOps (self-improving loop, autopilot-deploy, incident response, continuous evaluation, cost governance) on top of the AWS-official AIDLC workflows, so the Inception → Construction → Operations lifecycle closes itself. Ships four plugins (agentic-platform, agenticops, aidlc-inception, aidlc-construction) and five Tier-0 workflows (/oma:autopilot, /oma:aidlc-loop, /oma:agenticops, /oma:self-improving, /oma:platform-bootstrap). Built on awslabs/mcp hosted MCP servers. Phase 1 MVP.

engineering-playbook: Cloud Native & Agentic AI engineering reference

77+ production-tested guides across 7 tracks, spanning Agentic AI platforms, EKS best practices, AIDLC/AgenticOps, hybrid infrastructure, security governance, ROSA, and quantitative benchmark reports. Trilingual (Korean / English / Chinese), Docusaurus 3.9.2. The April 2026 restructure introduced a 3-tier hierarchy (foundations / platform-selection / advanced-patterns), the Self-Improving Loop ADR, Continuous Training Pipeline, CRIU GPU Migration, Cascade Routing Tuning, AI Gateway Guardrails, and AWS Neuron Stack documents. Live at devfloor9.github.io/engineering-playbook.

aws-samples/sample-genai-on-eks-starter-kit: core maintainer (MEMBER)

The starter kit that powers the EKS GenAI workshop. Recent contributions:

  • PR #131 (in review): full demo stack with Code Review MCP (RAG with TEI + Qdrant) + Keycloak OIDC + OAuth2-Proxy + kGateway 2-tier ingress with TLS + LiteLLM Team RBAC + per-user Langfuse tracing via OpenWebUI forward-user-info headers
  • PR #141: restore Langfuse session_id / tags on LangChain traces (Langfuse Python SDK v3 moved these out of RunnableConfig["tags"] into config["metadata"]), unblocking Module 4 Custom Evaluators
  • PR #127 / #126: External Secrets Operator + Amazon Managed Prometheus & Grafana (AMP/AMG) with ADOT collector
  • PR #87: OpenClaw AI Agent Orchestrator component, new AI Agent category + DevOps and Doc Writer example agents
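
The PR #141 change can be illustrated with a minimal sketch. It is not the code from the PR. The helper name `langfuse_run_config` is hypothetical, and the metadata key names (`langfuse_session_id`, `langfuse_tags`, `langfuse_user_id`) reflect my understanding of the Langfuse v3 LangChain integration, so verify them against the SDK version in use.

```python
from typing import Any, Dict, List, Optional


def langfuse_run_config(
    session_id: str,
    tags: List[str],
    user_id: Optional[str] = None,
    callbacks: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a RunnableConfig-style dict that carries trace attributes in metadata.

    Hypothetical helper: the key names below are assumed from the Langfuse v3
    LangChain integration and are not taken from the PR itself.
    """
    metadata: Dict[str, Any] = {
        "langfuse_session_id": session_id,
        "langfuse_tags": list(tags),  # copy so the caller's list is not shared
    }
    if user_id is not None:
        metadata["langfuse_user_id"] = user_id

    config: Dict[str, Any] = {"metadata": metadata}
    if callbacks:
        # Typically a Langfuse CallbackHandler instance would be passed here.
        config["callbacks"] = list(callbacks)
    return config


if __name__ == "__main__":
    print(langfuse_run_config("workshop-session-1", ["module-4"], user_id="alice"))
```

In a LangChain pipeline the returned dict would be passed as `config=` to `chain.invoke(...)`, with a Langfuse callback handler supplied through `callbacks`. Keeping the attributes in one helper makes it harder for session and tag information to drift back into the top-level `tags` field.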

awslabs/ai-on-eks + ai-on-eks-charts: upstream contributor

AWS official AI on EKS guides and Helm charts. Contributed the Llama 4 (Scout & Maverick) on Trainium2 vLLM deployment guide (ai-on-eks#276) and matching Helm values (ai-on-eks-charts#12): Karpenter-provisioned trn2.48xlarge, AWS Neuron DLC (optimum-neuron 0.4.5, Neuron SDK 2.26.1), tensor-parallel-size 16 across all 16 Trainium chips, BF16 inference without quantization, ~20 min end-to-end deploy.

Engineering Playbook Topics

📘 EKS Best Practices: Cilium ENI, Gateway API, CoreDNS, Karpenter, cost management, control plane scaling, Pod Identity

📗 Operations & Observability: GitOps (Argo CD), node monitoring, EKS debugging & resiliency, Pod health lifecycle

📙 Agentic AI Platform: EKS GPU node strategy (Auto Mode / Karpenter / MNG / DRA), vLLM, llm-d, NVIDIA Dynamo, MoE serving, AWS Neuron stack, Inference Optimization (KV Cache-aware / disaggregated serving / LWS multi-node / Bifrost → Bedrock fallback), LLM Gateway 2-tier (kgateway + agentgateway + Bifrost), Kagent, Milvus, RAGAS

📔 AIDLC & AgenticOps: AIDLC framework principles, Ontology × Harness, Built-in/Org/Domain extensions, evaluation framework, multi-agent collaboration, agent versioning, regulatory compliance (EU AI Act / NIST AI RMF / ISO 42001 / Korean AI Basic Act), AgenticOps observability and DevOps Guru predictive operations

📕 Hybrid Infrastructure: EKS Hybrid Nodes, SR-IOV with DGX H200, file storage strategy, Harbor registry integration

📓 Security & Governance: Identity-First Security, GuardDuty Extended Threat Detection, Kyverno, supply chain security, AI Gateway guardrails

📗 ROSA: Red Hat OpenShift on AWS installation, security, and compliance

📒 Benchmark Reports: CNI performance (VPC CNI vs Cilium), Gateway API implementation, AI/ML workloads, AgentCore vs EKS self-hosted inference, NVIDIA Dynamo, hybrid infrastructure, security operations

Pinned

  1. engineering-playbook (Public)

    A comprehensive engineering playbook covering AWS architecture patterns, coding standards, live coding templates, diagram specifications, and prompt engineering resources for modern software develo…

    JavaScript 19 1

  2. awslabs/ai-on-eks (Public)

    AI on EKS - Tested AI/ML for Amazon Elastic Kubernetes Service

    HCL 185 91

  3. aws-samples/sample-genai-on-eks-starter-kit (Public)

    A comprehensive toolkit for deploying production-ready Generative AI infrastructure on Amazon EKS. Includes pre-configured components for: 🚀 AI Gateway (LiteLLM) 🤖 LLM Serving (vLLM, SGLang, Ollama…

    JavaScript 63 35

  4. aws-samples/sample-oh-my-aidlcops (Public)

    AIDLC × AgenticOps: plugin marketplace that extends Claude Code and Kiro with agentic operations for the AWS AI-Driven Development Lifecycle (Inception → Construction → Operations).

    Python 7 2