# Architecture
The platform is a layered system: a conversational LLM agent on top, GitOps reconciliation in the middle, and three provisioning layers (Kubernetes namespaces, Kubernetes clusters via CAPI, and bare-metal/VM via Ansible) at the bottom.
## High-level system view
```mermaid
flowchart TD
    User([User / CI Pipeline]) -->|chat or REST| API[FastAPI<br/>Agent REST + WebSocket]
    API --> Router[Router<br/>intent classifier]
    Router --> Planner[Planner<br/>tool selection]
    Planner --> Executor[Executor<br/>tool invocation]
    Executor --> Reflector{Reflector<br/>continue?}
    Reflector -->|loop| Planner
    Reflector -->|done| Responder[Responder<br/>format reply]
    Responder --> User
    Executor -.commits.-> GitOps[(infra-provisioning<br/>Git repository)]
    GitOps -.reconciles.-> Flux[Flux Controllers<br/>Kustomize + Helm]
    Flux --> NS[K8s Namespace<br/>via Kustomize]
    Flux --> CAPI[Cluster API<br/>CAPV / Metal3]
    Flux --> Job[Ansible Job<br/>bare-metal / VM]
    Executor -.events.-> NATS[(NATS JetStream<br/>3 streams)]
    Executor -.state.-> PG[(PostgreSQL<br/>13-state machine)]
    Executor -.traces.-> LF[(LangFuse)]
    style API fill:#e3f2fd
    style GitOps fill:#fff3e0
    style PG fill:#f3e5f5
```
## Four layers
### 1. Conversational layer (FastAPI + LangGraph)
The agent is a LangGraph state machine with five nodes:
| Node | Responsibility |
|---|---|
| Router | Classify user intent (provision / release / extend / diagnose / query / chat). Fast keyword matcher with LLM fallback for ambiguous cases. |
| Planner | Decide which tools to invoke. Hybrid: deterministic for well-known intents, LLM-driven for free-form queries. |
| Executor | Run the tools. 30 total tools; only 19 are exposed to the LLM (the other 11 — GitOps writes, namespace deletion, secret encryption, NATS events — are reserved for the deterministic provisioner to prevent hallucinated infrastructure changes). |
| Reflector | Decide whether to loop back to planner or proceed to responder. Hard cap at 20 iterations. |
| Responder | Format the final response. |
State is persisted to PostgreSQL via `langgraph-checkpoint-postgres` so a
killed agent pod resumes mid-flight.
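A minimal sketch of how the five nodes wire together, assuming `langgraph` and `langgraph-checkpoint-postgres`. The `AgentState` fields, the node bodies, and the connection string are illustrative assumptions, not the project's real code:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver


class AgentState(TypedDict):
    intent: str
    tool_outputs: list
    iterations: int
    reply: str


def router(state: AgentState) -> dict:
    return {"intent": "query"}       # keyword match, LLM fallback when ambiguous

def planner(state: AgentState) -> dict:
    return {"iterations": state["iterations"] + 1}

def executor(state: AgentState) -> dict:
    return {"tool_outputs": state["tool_outputs"] + ["..."]}

def reflector(state: AgentState) -> dict:
    return {}                        # decision happens in the conditional edge

def responder(state: AgentState) -> dict:
    return {"reply": "done"}

def should_continue(state: AgentState) -> str:
    # Hard cap at 20 iterations, mirroring the Reflector row above.
    return "responder" if state["iterations"] >= 20 else "planner"


builder = StateGraph(AgentState)
for name, fn in [("router", router), ("planner", planner), ("executor", executor),
                 ("reflector", reflector), ("responder", responder)]:
    builder.add_node(name, fn)

builder.add_edge(START, "router")
builder.add_edge("router", "planner")
builder.add_edge("planner", "executor")
builder.add_edge("executor", "reflector")
builder.add_conditional_edges("reflector", should_continue)
builder.add_edge("responder", END)

# Checkpointing to PostgreSQL is what lets a killed pod resume mid-flight.
with PostgresSaver.from_conn_string("postgresql://agent@pg:5432/agent") as saver:
    saver.setup()  # create checkpoint tables on first run
    graph = builder.compile(checkpointer=saver)
    graph.invoke(
        {"intent": "", "tool_outputs": [], "iterations": 0, "reply": ""},
        config={"recursion_limit": 100, "configurable": {"thread_id": "env-1234"}},
    )
```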
#### Service classes (v0.21.0 SOLID refactor)
The route handlers (`chat.py`, `reservations.py`) and the FastAPI
lifespan are intentionally thin — non-trivial logic lives behind
focused service classes in `src/api/services/`, `src/api/lifecycle.py`,
`src/api/nats_consumer.py`, `src/agent/parsers.py`, and
`src/state/machine.py`:
| Class | Pattern | Role |
|---|---|---|
| `ResponseBuilder` | Extract Class + DI | Compose the chat/stream response from a graph result + injected continuations |
| `ProvisioningResultParser` / `ProvisioningOrchestrator` | Parser + Value Object | Turn `tool_outputs` into a typed `ProvisioningRequest`, dispatch to the deterministic provisioner, return a `ProvisioningSummary` |
| `ReservationReleaseService` | Facade + DI | Compose DB lookup + state advancement + (BM-only) NetBox cleanup for the post-`delete_from_gitops` continuation |
| `ReservationStateMachine` | State Machine | OO facade over the transition table; `plan_release()`, `plan_release_finalization()`, `find_path()` BFS |
| `InfrastructureCleanupService` | Strategy + Composite + DIP | `CleanupStep` protocol with `GitOpsCleanup` / `VsphereVmCleanup` / `KubernetesSecretCleanup` concrete steps |
| `LifecycleManager` | Composition + Builder | 9 startup phases; `start(app)` composes them, `stop()` unwinds |
| `NatsConsumer` / `NatsEventParser` / `ReservationReclaimedHandler` | Strategy + DI | NATS JetStream subscription with parsed-event dispatch |
| `ResourceSpecParser` | Parser + Value Object | Free-form English → `ProvisionSpec(env_type, ResourceSpec)` |
| `GitConflictResolver` | Extract Class | Encapsulates fetch + reset + reapply for rejected GitOps pushes |
| `DatabaseContext` | Service Locator | Injectable `session_factory` + `env_id_generator` for unit tests |
Adding new behavior (e.g. another infrastructure backend, another NATS event subject, another LLM-judge rubric) is a new class plus registration — the existing classes are not modified (Open/Closed).
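A hedged sketch of the Strategy + Composite + DIP shape named in the table. The `cleanup()` signature and the step bodies are assumptions for illustration:

```python
from typing import Protocol


class CleanupStep(Protocol):
    def cleanup(self, env_id: str) -> None: ...


class GitOpsCleanup:
    def cleanup(self, env_id: str) -> None:
        ...  # delete the overlay folder and push; Flux prunes the rest


class VsphereVmCleanup:
    def cleanup(self, env_id: str) -> None:
        ...  # power off + destroy the VM via the vSphere /api/


class InfrastructureCleanupService:
    """Composite over CleanupStep (DIP: depends only on the protocol)."""

    def __init__(self, steps: list[CleanupStep]) -> None:
        self._steps = steps

    def cleanup(self, env_id: str) -> None:
        for step in self._steps:
            step.cleanup(env_id)


# Open/Closed in practice: a new backend is one new class appended at
# registration time; nothing above changes.
service = InfrastructureCleanupService([GitOpsCleanup(), VsphereVmCleanup()])
service.cleanup("env-1234")
```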
### 2. GitOps reconciliation (Flux + Kustomize + Helm)
The agent never touches the live cluster directly for provisioning. It
commits to `infra-provisioning`, and Flux reconciles. This keeps the
source of truth in Git and makes every change auditable.
- 7 Flux `Kustomization` resources: `flux-system`, `infrastructure`, `apps`, `environments`, `capi-clusters`, `capi-addons`, `ansible-runner`
- 11 `HelmRelease` resources: PostgreSQL, NATS, Sealed Secrets, Jenkins, OpenSearch, OpenSearch Dashboards, SonarQube, kube-prometheus-stack, Fluent Bit, NetBox, LangFuse
- 9 `HelmRepository` sources (NATS, Jenkins, OpenSearch, SonarQube, Prometheus community, Fluent, etc.)
`prune: true` on the `environments` Kustomization means deleting the
overlay folder from Git deletes the environment in the cluster.
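Teardown is therefore just a Git operation. A minimal sketch; the repo layout, remote, and commit message are illustrative assumptions:

```python
import shutil
import subprocess
from pathlib import Path


def delete_environment_overlay(repo_root: Path, env_id: str) -> None:
    # Remove the Kustomize overlay from the working tree ...
    shutil.rmtree(repo_root / "environments" / env_id)
    # ... commit and push; Flux reconciles the environments Kustomization
    # and, because prune: true is set, deletes the live objects.
    subprocess.run(["git", "-C", str(repo_root), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(repo_root), "commit",
                    "-m", f"release {env_id}"], check=True)
    subprocess.run(["git", "-C", str(repo_root), "push"], check=True)
```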
### 3. Provisioning layers (3 patterns)
| Pattern | Use Case | How |
|---|---|---|
| K8s namespace | Lightweight test envs (most common) | Kustomize overlay → Flux reconciles → Namespace + ResourceQuota + NetworkPolicy + RBAC |
| K8s cluster (CAPI) | Cluster-scoped tests, multi-tenant isolation | Cluster + KubeadmControlPlane + MachineDeployment + VSphereCluster + IPAM InClusterIPPool + Calico CNI via ClusterResourceSet |
| Bare-metal / VM | OS-level testing, kernel modules, IPMI flows | K8s Job runs ansible-runner image → Ansible vSphere collection clones from template on vCenter (uses /api/, not /rest/) |
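For the third row, a sketch of launching the ansible-runner Job with the official `kubernetes` Python client. The image name, playbook, namespace, and Job name are illustrative assumptions:

```python
from kubernetes import client, config

config.load_incluster_config()  # the agent runs inside the cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="provision-env-1234",
                                 namespace="ansible-runner"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="runner",
                    image="registry.internal/ansible-runner:latest",
                    # ansible-runner invokes the vSphere collection playbook
                    # that clones the VM from a vCenter template.
                    args=["ansible-runner", "run", "/runner",
                          "-p", "clone_vm_from_template.yml"],
                )],
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="ansible-runner", body=job)
```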
### 4. Observability + reporting
- Prometheus + Grafana — RED metrics for the agent + standard kube-prometheus-stack
- OpenSearch + Fluent Bit — container logs, audit logs, application logs; ISM lifecycle policy (hot 30d → warm → delete 365d)
- LangFuse — LLM trace, token, and cost tracking with `env_id` correlation
- SonarQube — code coverage and quality gates from CI
- Jaeger (OpenTelemetry) — distributed traces across agent + tools
All five share the `env_id` correlation key so you can pivot from a
metric to a log to a trace for the same environment.
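Illustrative only: RED metrics via `prometheus_client` plus an `env_id`-tagged log line of the kind Fluent Bit would ship to OpenSearch. Metric and logger names are assumptions:

```python
import logging
from prometheus_client import Counter, Histogram

REQUESTS = Counter("agent_requests_total", "RED: rate", ["intent"])
ERRORS = Counter("agent_errors_total", "RED: errors", ["intent"])
LATENCY = Histogram("agent_request_seconds", "RED: duration", ["intent"])

log = logging.getLogger("agent")


def handle(intent: str, env_id: str) -> None:
    REQUESTS.labels(intent).inc()
    with LATENCY.labels(intent).time():
        try:
            ...  # route -> plan -> execute
            # env_id rides along as a structured field, not a metric label
            # (labels stay low-cardinality; logs and traces carry the key).
            log.info("request handled", extra={"env_id": env_id})
        except Exception:
            ERRORS.labels(intent).inc()
            raise
```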
## State machine (13 states)
```mermaid
stateDiagram-v2
    [*] --> REQUESTED
    REQUESTED --> VALIDATING
    VALIDATING --> PROVISIONING
    VALIDATING --> QUEUED: quota / capacity
    VALIDATING --> REJECTED: invalid
    QUEUED --> VALIDATING: capacity freed
    PROVISIONING --> READY
    PROVISIONING --> FAILED
    READY --> IN_USE: heartbeat
    READY --> RELEASING: explicit release
    IN_USE --> RELEASING: TTL or release
    RELEASING --> DEPROVISIONING
    RELEASING --> RELEASE_FAILED
    DEPROVISIONING --> RECLAIMED
    DEPROVISIONING --> TEARDOWN_FAILED
    RECLAIMED --> [*]
    REJECTED --> [*]
    FAILED --> [*]
```
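This transition table is what `ReservationStateMachine.find_path()` searches. A minimal sketch; the dict mirrors the diagram, and the BFS body is an assumption, not the project's implementation:

```python
from collections import deque

TRANSITIONS: dict[str, set[str]] = {
    "REQUESTED": {"VALIDATING"},
    "VALIDATING": {"PROVISIONING", "QUEUED", "REJECTED"},
    "QUEUED": {"VALIDATING"},
    "PROVISIONING": {"READY", "FAILED"},
    "READY": {"IN_USE", "RELEASING"},
    "IN_USE": {"RELEASING"},
    "RELEASING": {"DEPROVISIONING", "RELEASE_FAILED"},
    "DEPROVISIONING": {"RECLAIMED", "TEARDOWN_FAILED"},
}


def find_path(src: str, dst: str) -> list[str] | None:
    """Shortest legal transition sequence from src to dst (BFS)."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in TRANSITIONS.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


assert find_path("READY", "RECLAIMED") == [
    "READY", "RELEASING", "DEPROVISIONING", "RECLAIMED"]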
Background services watch the state machine:
- TTL supervisor — warns at 80% TTL, auto-releases at 100% (60-second tick; see the sketch after this list)
- Heartbeat monitor — releases on 2× heartbeat timeout (30-second tick)
- Orphan detector — reconciles the state store against GitOps directories every 15 minutes; force-cleans CAPI clusters stuck in `Deleting` for >15 min
- Queue processor — re-evaluates `QUEUED` requests when an `environment-released` event hits NATS
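A minimal sketch of the TTL supervisor tick referenced in the first bullet. The `store` and `release_environment` interfaces are hypothetical stand-ins:

```python
import asyncio
from datetime import datetime, timezone


async def ttl_supervisor(store, release_environment) -> None:
    while True:
        now = datetime.now(timezone.utc)
        for env in await store.active_environments():   # hypothetical API
            elapsed = (now - env.created_at).total_seconds()
            if elapsed >= env.ttl_seconds:
                await release_environment(env.env_id)    # READY/IN_USE -> RELEASING
            elif elapsed >= 0.8 * env.ttl_seconds and not env.ttl_warned:
                await store.mark_ttl_warned(env.env_id)  # warn once at 80%
        await asyncio.sleep(60)                          # 60-second tick
```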
## 3-tier LLM routing
```
┌─────────────┐    fallback     ┌────────────┐    fallback     ┌─────────┐
│   Tier 1    │ ──────────────▶ │   Tier 2   │ ──────────────▶ │ Tier 3  │
│  Local LLM  │                 │ OpenRouter │                 │Anthropic│
│   (SSO)     │                 │            │                 │ Claude  │
└─────────────┘                 └────────────┘                 └─────────┘
 gpt-oss-120b                   openrouter/free                claude-sonnet-4
 langchain-openai               langchain-openai               langchain-anthropic
```
All three speak the same `.bind_tools()` API, but Anthropic is invoked via
its native Messages API (not the OpenAI compatibility shim, which has
known function-calling quirks for chained tool use). The SSO token for the
local LLM flows through `RunnableConfig["configurable"]["sso_token"]` so
LangGraph nodes can forward it to nested LLM calls.
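A sketch of the fallback chain using LangChain's `with_fallbacks()`. The base URLs, the dummy tool, and the Claude version suffix are assumptions; `bind_tools()` is applied per model before the fallback wrapper:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic


@tool
def list_environments() -> str:
    """Hypothetical read-only tool, standing in for the 19 LLM-exposed tools."""
    return "env-1234: READY"


tools = [list_environments]

local = ChatOpenAI(model="gpt-oss-120b",
                   base_url="https://llm.internal/v1")        # Tier 1 (SSO)
openrouter = ChatOpenAI(model="openrouter/free",              # Tier 2
                        base_url="https://openrouter.ai/api/v1")
claude = ChatAnthropic(model="claude-sonnet-4-20250514")      # Tier 3, native Messages API

llm = local.bind_tools(tools).with_fallbacks(
    [openrouter.bind_tools(tools), claude.bind_tools(tools)]
)


# Inside a LangGraph node, the SSO token arrives via the RunnableConfig:
def planner_node(state, config):
    sso_token = config["configurable"]["sso_token"]  # forward to nested LLM calls
    return {"messages": [llm.invoke(state["messages"], config=config)]}
```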
## RBAC
5 roles, 11 permissions:
| Role | Can do |
|---|---|
| platform-admin | All 11 permissions |
| team-lead | CREATE, VIEW (own + team), RELEASE_OWN, PREEMPT, EXTEND_TTL, AUDIT_VIEW_OWN |
| developer | CREATE, VIEW_OWN, RELEASE_OWN, AUDIT_VIEW_OWN |
| ci-service | CREATE, VIEW_OWN, RELEASE_OWN, AUDIT_VIEW_OWN |
| viewer | VIEW_OWN, VIEW_TEAM |
Auth is header-based (`X-User`, `X-Role`, `X-Team`) at the proxy edge,
enforced by FastAPI middleware + `@require_permission(...)` decorators.
LLM tool access is further restricted via an `LLM_TOOLS` allowlist
(19 of 30 tools).
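A sketch of the header-based check behind `@require_permission(...)`. The decorator shape, the route path, and the trimmed role-to-permission map are assumptions consistent with the table above:

```python
import functools
from fastapi import FastAPI, HTTPException, Request

ROLE_PERMISSIONS: dict[str, set[str]] = {
    "developer": {"CREATE", "VIEW_OWN", "RELEASE_OWN", "AUDIT_VIEW_OWN"},
    "viewer": {"VIEW_OWN", "VIEW_TEAM"},
    # ... platform-admin, team-lead, ci-service as in the table
}

app = FastAPI()


def require_permission(permission: str):
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(request: Request, *args, **kwargs):
            role = request.headers.get("X-Role", "")  # set at the proxy edge
            if permission not in ROLE_PERMISSIONS.get(role, set()):
                raise HTTPException(status_code=403,
                                    detail=f"role {role!r} lacks {permission}")
            return await handler(request, *args, **kwargs)
        return wrapper
    return decorator


@app.post("/api/v1/reservations")        # hypothetical route
@require_permission("CREATE")
async def create_reservation(request: Request):
    return {"state": "REQUESTED"}
```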
## Key design decisions
- State store > `infra-provisioning` > live cluster — when they disagree, the database is right; the orphan detector reconciles the rest.
- vSphere `/api/` only — `/rest/` wraps everything in `{"value": ...}` and has broken filtering. vCenter 7.0+ exposes `/api/` natively (see the sketch after this list).
- Flux paths under `clusters/` must contain only valid Kubernetes YAML — non-K8s files (READMEs, scripts) cause reconciliation failure.
- Kubeadm clusters need namespace pre-creation before HelmReleases target them — pre-managed K8s often auto-creates namespaces; kubeadm doesn't.
- Secret `envFrom` overrides ConfigMap keys — empty Secret values silently override ConfigMap values. Never leave placeholder Secrets with empty keys that also exist in a ConfigMap.
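A sketch of the `/api/` call style from the second bullet. The host and credentials are placeholders:

```python
import requests

VCENTER = "https://vcenter.example.com"

s = requests.Session()

# POST /api/session returns the token as a bare JSON string,
# with no {"value": ...} envelope, unlike the legacy /rest/ endpoints.
token = s.post(f"{VCENTER}/api/session", auth=("svc-agent", "***")).json()
s.headers["vmware-api-session-id"] = token

# Server-side filtering works on /api/: only powered-on VMs come back.
vms = s.get(f"{VCENTER}/api/vcenter/vm",
            params={"power_states": "POWERED_ON"}).json()
```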