# Architecture

The platform is a layered system: a conversational LLM agent on top, GitOps reconciliation in the middle, and three provisioning layers (Kubernetes namespaces, Kubernetes clusters via CAPI, and bare-metal/VM via Ansible) at the bottom.

## High-level system view

```mermaid
flowchart TD
    User([User / CI Pipeline]) -->|chat or REST| API[FastAPI<br/>Agent REST + WebSocket]
    API --> Router[Router<br/>intent classifier]
    Router --> Planner[Planner<br/>tool selection]
    Planner --> Executor[Executor<br/>tool invocation]
    Executor --> Reflector{Reflector<br/>continue?}
    Reflector -->|loop| Planner
    Reflector -->|done| Responder[Responder<br/>format reply]
    Responder --> User

    Executor -.commits.-> GitOps[(infra-provisioning<br/>Git repository)]
    GitOps -.reconciles.-> Flux[Flux Controllers<br/>Kustomize + Helm]

    Flux --> NS[K8s Namespace<br/>via Kustomize]
    Flux --> CAPI[Cluster API<br/>CAPV / Metal3]
    Flux --> Job[Ansible Job<br/>bare-metal / VM]

    Executor -.events.-> NATS[(NATS JetStream<br/>3 streams)]
    Executor -.state.-> PG[(PostgreSQL<br/>13-state machine)]
    Executor -.traces.-> LF[(LangFuse)]

    style API fill:#e3f2fd
    style GitOps fill:#fff3e0
    style PG fill:#f3e5f5
```

## Four layers

### 1. Conversational layer (FastAPI + LangGraph)

The agent is a LangGraph state machine with five nodes:

| Node | Responsibility |
|---|---|
| Router | Classify user intent (provision / release / extend / diagnose / query / chat). Fast keyword matcher with LLM fallback for ambiguous cases. |
| Planner | Decide which tools to invoke. Hybrid: deterministic for well-known intents, LLM-driven for free-form queries. |
| Executor | Run the tools. 30 total tools; only 19 are exposed to the LLM (the other 11 — GitOps writes, namespace deletion, secret encryption, NATS events — are reserved for the deterministic provisioner to prevent hallucinated infrastructure changes). |
| Reflector | Decide whether to loop back to the planner or proceed to the responder. Hard cap at 20 iterations. |
| Responder | Format the final response. |

State is persisted to PostgreSQL via langgraph-checkpoint-postgres so a killed agent pod resumes mid-flight.
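The control flow above can be sketched dependency-free. Node names match the table, but the toy routing rule, the single `count_envs` tool, and the stop condition are purely illustrative; the real nodes call an LLM and a full tool registry, and state is checkpointed to PostgreSQL between steps.

```python
MAX_ITERATIONS = 20  # the Reflector's hard cap, as described above

def route(message: str) -> str:
    # Router: fast keyword matcher; the real node falls back to an LLM
    # for ambiguous messages.
    return "query" if "how many" in message else "chat"

def run_agent(message: str, tools: dict) -> dict:
    state = {"intent": route(message), "tool_outputs": [], "iterations": 0}
    for _ in range(MAX_ITERATIONS):
        # Planner: pick tools for the intent (trivial rule here)
        plan = ["count_envs"] if state["intent"] == "query" else []
        # Executor: run the selected tools
        state["tool_outputs"] += [tools[name]() for name in plan]
        state["iterations"] += 1
        # Reflector: stop when there is nothing left to do
        if not plan or state["tool_outputs"]:
            break
    # Responder: format the final reply
    state["reply"] = f"{len(state['tool_outputs'])} tool call(s) made"
    return state

result = run_agent("how many envs are live?", {"count_envs": lambda: 3})
```

Because the loop is bounded and every iteration is checkpointed, a killed pod replays from the last persisted state rather than restarting the conversation.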

#### Service classes (v0.21.0 SOLID refactor)

The route handlers (chat.py, reservations.py) and the FastAPI lifespan are intentionally thin — non-trivial logic lives behind focused service classes in src/api/services/, src/api/lifecycle.py, src/api/nats_consumer.py, src/agent/parsers.py, and src/state/machine.py:

| Class | Pattern | Role |
|---|---|---|
| `ResponseBuilder` | Extract Class + DI | Compose the chat/stream response from a graph result + injected continuations |
| `ProvisioningResultParser` / `ProvisioningOrchestrator` | Parser + Value Object | Turn `tool_outputs` into a typed `ProvisioningRequest`, dispatch to the deterministic provisioner, return a `ProvisioningSummary` |
| `ReservationReleaseService` | Facade + DI | Compose DB lookup + state advancement + (BM-only) NetBox cleanup for the post-`delete_from_gitops` continuation |
| `ReservationStateMachine` | State Machine | OO facade over the transition table; `plan_release()`, `plan_release_finalization()`, `find_path()` BFS |
| `InfrastructureCleanupService` | Strategy + Composite + DIP | `CleanupStep` protocol with `GitOpsCleanup` / `VsphereVmCleanup` / `KubernetesSecretCleanup` concrete steps |
| `LifecycleManager` | Composition + Builder | 9 startup phases; `start(app)` composes them, `stop()` unwinds |
| `NatsConsumer` / `NatsEventParser` / `ReservationReclaimedHandler` | Strategy + DI | NATS JetStream subscription with parsed-event dispatch |
| `ResourceSpecParser` | Parser + Value Object | Free-form English → `ProvisionSpec(env_type, ResourceSpec)` |
| `GitConflictResolver` | Extract Class | Encapsulates fetch + reset + reapply for rejected GitOps pushes |
| `DatabaseContext` | Service Locator | Injectable `session_factory` + `env_id_generator` for unit tests |

Adding new behavior (e.g. another infrastructure backend, another NATS event subject, another LLM-judge rubric) is a new class plus registration — the existing classes are not modified (Open/Closed).
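The Open/Closed shape is easiest to see in the cleanup service. Class names below match the table, but the method bodies are illustrative stand-ins for the real GitOps and Kubernetes calls:

```python
from typing import Protocol

class CleanupStep(Protocol):
    """Strategy interface: anything with cleanup(env_id) qualifies."""
    def cleanup(self, env_id: str) -> str: ...

class GitOpsCleanup:
    def cleanup(self, env_id: str) -> str:
        return f"removed overlay for {env_id}"   # real step: Git commit + push

class KubernetesSecretCleanup:
    def cleanup(self, env_id: str) -> str:
        return f"deleted secrets for {env_id}"   # real step: K8s API delete

class InfrastructureCleanupService:
    """Composite over CleanupStep strategies: a new backend is a new
    class passed to the constructor; nothing here changes."""
    def __init__(self, steps: list[CleanupStep]) -> None:
        self.steps = steps

    def cleanup(self, env_id: str) -> list[str]:
        return [step.cleanup(env_id) for step in self.steps]

svc = InfrastructureCleanupService([GitOpsCleanup(), KubernetesSecretCleanup()])
results = svc.cleanup("env-42")
```

A `VsphereVmCleanup` (or any future backend) slots into the same list without touching the composite or the existing steps.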

### 2. GitOps reconciliation (Flux + Kustomize + Helm)

The agent never touches the live cluster directly for provisioning. It commits to infra-provisioning, and Flux reconciles. This keeps the source of truth in Git and makes every change auditable.

  • 7 Flux Kustomization resources: flux-system, infrastructure, apps, environments, capi-clusters, capi-addons, ansible-runner
  • 11 HelmRelease resources: PostgreSQL, NATS, Sealed Secrets, Jenkins, OpenSearch, OpenSearch Dashboards, SonarQube, kube-prometheus-stack, Fluent Bit, NetBox, LangFuse
  • 9 HelmRepository sources (NATS, Jenkins, OpenSearch, SonarQube, Prometheus community, Fluent, etc.)

prune: true on the environments Kustomization means deleting the overlay folder from Git deletes the environment in the cluster.
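A minimal sketch of what that Kustomization could look like, using the Flux v2 API fields (the `interval` and repository name here are assumptions, not the real config):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: environments
  namespace: flux-system
spec:
  interval: 1m
  path: ./environments
  prune: true          # removing an overlay folder in Git removes it in-cluster
  sourceRef:
    kind: GitRepository
    name: infra-provisioning
```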

### 3. Provisioning layers (3 patterns)

| Pattern | Use case | How |
|---|---|---|
| K8s namespace | Lightweight test envs (most common) | Kustomize overlay → Flux reconciles → Namespace + ResourceQuota + NetworkPolicy + RBAC |
| K8s cluster (CAPI) | Cluster-scoped tests, multi-tenant isolation | Cluster + KubeadmControlPlane + MachineDeployment + VSphereCluster + IPAM InClusterIPPool + Calico CNI via ClusterResourceSet |
| Bare-metal / VM | OS-level testing, kernel modules, IPMI flows | K8s Job runs the ansible-runner image → Ansible vSphere collection clones from a template on vCenter (uses `/api/`, not `/rest/`) |

### 4. Observability + reporting

  • Prometheus + Grafana — RED metrics for the agent + standard kube-prometheus-stack
  • OpenSearch + Fluent Bit — container logs, audit logs, application logs; ISM lifecycle policy (hot 30d → warm → delete 365d)
  • LangFuse — LLM trace, token, cost tracking with env_id correlation
  • SonarQube — code coverage and quality gates from CI
  • Jaeger (OpenTelemetry) — distributed traces across agent + tools

All five share the env_id correlation key so you can pivot from a metric to a log to a trace for the same environment.

## State machine (13 states)

```mermaid
stateDiagram-v2
    [*] --> REQUESTED
    REQUESTED --> VALIDATING
    VALIDATING --> PROVISIONING
    VALIDATING --> QUEUED: quota / capacity
    VALIDATING --> REJECTED: invalid
    QUEUED --> VALIDATING: capacity freed
    PROVISIONING --> READY
    PROVISIONING --> FAILED
    READY --> IN_USE: heartbeat
    READY --> RELEASING: explicit release
    IN_USE --> RELEASING: TTL or release
    RELEASING --> DEPROVISIONING
    RELEASING --> RELEASE_FAILED
    DEPROVISIONING --> RECLAIMED
    DEPROVISIONING --> TEARDOWN_FAILED
    RECLAIMED --> [*]
    REJECTED --> [*]
    FAILED --> [*]
```

Background services watch the state machine:

  • TTL supervisor — warns at 80% TTL, auto-releases at 100% (60-second tick)
  • Heartbeat monitor — releases on 2× heartbeat timeout (30-second tick)
  • Orphan detector — reconciles state store vs GitOps directories every 15 minutes; force-cleans CAPI clusters stuck in Deleting for >15 min
  • Queue processor — re-evaluates QUEUED requests when an environment-released event hits NATS
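One TTL-supervisor tick under the thresholds above (warn at 80%, auto-release at 100%) can be sketched as follows; the reservation shape and callbacks are assumptions for illustration:

```python
def ttl_tick(reservations, now, warn, release):
    """One 60-second tick: warn once at 80% of TTL, release at 100%."""
    for r in reservations:
        elapsed = now - r["created_at"]
        if elapsed >= r["ttl"]:
            release(r["env_id"])                  # 100% of TTL: auto-release
        elif elapsed >= 0.8 * r["ttl"] and not r.get("warned"):
            r["warned"] = True                    # warn exactly once
            warn(r["env_id"])                     # 80% of TTL: warning

reservations = [
    {"env_id": "a", "created_at": 0, "ttl": 100},
    {"env_id": "b", "created_at": 0, "ttl": 200},
]
events = []
warn_cb = lambda e: events.append(("warn", e))
release_cb = lambda e: events.append(("release", e))
ttl_tick(reservations, now=90, warn=warn_cb, release=release_cb)
```

At `now=90`, environment `a` (TTL 100) is past its 80% mark and gets a warning; `b` (TTL 200) is untouched. A later tick past 100 would release `a`.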

## 3-tier LLM routing

```text
┌─────────────┐     fallback     ┌────────────┐     fallback     ┌─────────┐
│ Tier 1      │ ──────────────▶  │ Tier 2     │ ──────────────▶  │ Tier 3  │
│ Local LLM   │                  │ OpenRouter │                  │Anthropic│
│ (SSO)       │                  │            │                  │ Claude  │
└─────────────┘                  └────────────┘                  └─────────┘
   gpt-oss-120b                  openrouter/free                  claude-sonnet-4
   langchain-openai              langchain-openai                 langchain-anthropic
```

All three speak the same .bind_tools() API, but Anthropic is invoked via its native Messages API (not the OpenAI compatibility shim — that has known function-calling quirks for chained tool use). The SSO token for the local LLM flows through RunnableConfig["configurable"]["sso_token"] so LangGraph nodes can forward it to nested LLM calls.
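The fallback chain reduces to "try each tier in order, surface all errors if every tier fails." In the real stack the tiers are LangChain chat models; here they are plain callables so the control flow is visible (tier names and responses are illustrative):

```python
def with_fallbacks(tiers):
    """Return a callable that tries each (name, call) tier in order."""
    def invoke(prompt: str) -> str:
        errors = []
        for name, call in tiers:
            try:
                return f"{name}: {call(prompt)}"
            except Exception as exc:
                errors.append((name, exc))   # remember why this tier failed
        raise RuntimeError(f"all tiers failed: {errors}")
    return invoke

def local_llm(prompt):   # Tier 1: pretend the local endpoint is down
    raise ConnectionError("local LLM unreachable")

def openrouter(prompt):  # Tier 2: succeeds, so Tier 3 is never called
    return "ok"

chat = with_fallbacks([("local", local_llm),
                       ("openrouter", openrouter),
                       ("anthropic", lambda p: "ok")])
answer = chat("provision a namespace")
```

The same shape is what LangChain's `Runnable.with_fallbacks()` provides, which is why all three tiers sharing one `.bind_tools()` interface matters: any tier can transparently take over mid-conversation.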

## RBAC

5 roles, 11 permissions:

| Role | Permissions |
|---|---|
| platform-admin | All 11 permissions |
| team-lead | CREATE, VIEW (own + team), RELEASE_OWN, PREEMPT, EXTEND_TTL, AUDIT_VIEW_OWN |
| developer | CREATE, VIEW_OWN, RELEASE_OWN, AUDIT_VIEW_OWN |
| ci-service | CREATE, VIEW_OWN, RELEASE_OWN, AUDIT_VIEW_OWN |
| viewer | VIEW_OWN, VIEW_TEAM |

Auth is header-based (X-User, X-Role, X-Team) at the proxy edge, enforced by FastAPI middleware + @require_permission(...) decorators. LLM tool access is further restricted via an LLM_TOOLS allowlist (19 of 30 tools).
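A decorator of that shape can be sketched as below. This is a hypothetical stand-in, not the real middleware: the real decorator reads `X-Role` from the request context, while here the role is passed as an argument, and the permission table is a subset of the one above:

```python
import functools

ROLE_PERMISSIONS = {  # subset of the RBAC table, for illustration only
    "developer": {"CREATE", "VIEW_OWN", "RELEASE_OWN", "AUDIT_VIEW_OWN"},
    "viewer": {"VIEW_OWN", "VIEW_TEAM"},
}

def require_permission(permission: str):
    """Reject the call before the handler runs if the role lacks the permission."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(role, set()):
                raise PermissionError(f"{role} lacks {permission}")
            return fn(role, *args, **kwargs)
        return wrapper
    return decorator

@require_permission("CREATE")
def create_environment(role: str, name: str) -> str:
    return f"created {name}"

result = create_environment("developer", "test-env")  # developer has CREATE
```

A `viewer` calling `create_environment` is rejected before the handler body runs; the `LLM_TOOLS` allowlist is an independent second gate applied to the agent's tool registry.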

## Key design decisions

  • State store > infra-provisioning > live cluster — when they disagree, the database is right; the orphan detector reconciles the rest.
  • vSphere /api/ only: /rest/ wraps everything in {"value": ...} and has broken filtering, while vCenter 7.0+ exposes /api/ natively.
  • Flux paths under clusters/ must contain only valid Kubernetes YAML. Non-K8s files (READMEs, scripts) cause reconciliation failure.
  • Kubeadm clusters need namespace pre-creation before HelmReleases target them. Pre-managed K8s often auto-creates namespaces; kubeadm doesn't.
  • Secret envFrom overrides ConfigMap keys — empty Secret values silently override ConfigMap values. Never leave placeholder Secrets with empty keys that exist in a ConfigMap.
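The last bullet comes from Kubernetes `envFrom` precedence: when a key exists in multiple sources, the last source listed wins. A hypothetical pod spec fragment (resource names are illustrative):

```yaml
envFrom:
  - configMapRef:
      name: app-config    # defines LOG_LEVEL: info
  - secretRef:
      name: app-secrets   # an empty LOG_LEVEL here silently wins,
                          # because later envFrom sources take precedence
```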
