Back to Projects

AI Inference Platform / RAG / Multi-Tenant Systems

CampusRAG: Multi-Tenant AI Inference Platform

A university computing-center prototype for secure RAG-based AI services: tenant-isolated document retrieval, OpenAI-compatible inference APIs, usage accounting, request limits, and Prometheus-style monitoring.

Why it matches RRZE / HPC@FAU

  • Web UI and API for AI inference workflows
  • RAG-capable environment with tenant separation
  • Usage accounting and fair-resource controls
  • Docker-first architecture, Kubernetes and Slurm ready
  • Monitoring-ready service metrics for operations

3

Tenants

separate documents, vector collections, and usage records

5

Core APIs

/upload, /chat, /tenants, /status, and /metrics

RAG

Retrieval

source-grounded answers from tenant-local knowledge

SSO

Ready Design

planned Keycloak/OIDC integration for institutions

Interactive Prototype

Tenant-Isolated RAG Demo

Select tenant

Isolated documents

No documents uploaded for this tenant yet.

Upload tenant document

Choose a .txt file or paste text manually.

No document uploaded in this session.

RAG answer

Choose a tenant and run a query. The response will come from the CampusRAG API route.

source: not queried yettenant: medicinelimit remaining: 10

Architecture

Inference Service Layers

01

Web UI

Next.js chat, upload, tenant dashboard

02

FastAPI

/chat, /upload, /usage, /metrics

03

Tenant Layer

separate docs, vectors, limits, accounting

04

RAG Store

ChromaDB or pgvector collections per tenant

05

Model Gateway

OpenAI-compatible, LiteLLM/vLLM-ready

06

Metrics

Prometheus-ready usage and error counters

Operations Control Plane

Routing, Access, and GPU Resource Status

Access Control

tenant_id + role check before document retrieval

decision: allowed_tenant_scoped

Model Routing

OpenAI-compatible LiteLLM-style gateway

route: litellm/clinical-llama

GPU/HPC Resource

Slurm/Kubernetes-ready dispatch layer

gpu-clinical: 2 GPU share

Observability

Prometheus/Grafana-ready labels

latency: 420 ms, queue: 3

Runtime uploads and accounting mutations are persisted through a file-backed service store under data/campusrag-state.json, keeping the API layer ready for a later SQLite or PostgreSQL swap.

Accounting

Usage and Resource Management

0

Requests

Medicine chat calls this month

0

Documents

tenant-local uploads indexed for RAG

0

Tokens

estimated input and output usage

EUR 0.00

Cost

simple attribution estimate

Operations

Prometheus-Style Metrics

campusrag_requests_total{tenant="medicine"} 0
campusrag_documents_total{tenant="medicine"} 0
campusrag_tokens_total{tenant="medicine"} 0
campusrag_limit_remaining{tenant="medicine"} 10
campusrag_errors_total{tenant="medicine"} 0

Operational behavior

  • Every request is tagged with tenant_id
  • Usage counters can feed cost attribution
  • Rate limits protect shared GPU capacity
  • Metrics are shaped for Grafana dashboards

Implementation Plan

From Prototype to Real Service

01

Replace demo retrieval with ChromaDB or pgvector-backed embeddings

02

Route model calls through LiteLLM to Ollama, vLLM, or external APIs

03

Add Keycloak/OIDC SSO with project and group based access control

04

Deploy with Docker Compose today, Kubernetes or Slurm workers later

05

Attach Grafana dashboards for request, latency, token, and cost views