Building an EU-Only AI Stack: Nextcloud MCP on Leaf.cloud

A journey through self-hosted LLMs, MCP integration challenges, and cost-effective observability

The promise is compelling: connect your personal knowledge base to AI assistants while keeping everything within EU borders. No data leaving the continent. Full control over your infrastructure. GDPR compliance by design.

We set out to build exactly this—a private AI stack running on EU-only infrastructure, integrating with Nextcloud for notes, files, and project management. Here's what we learned.

The Infrastructure

Leaf.cloud caught our attention as an EU-only cloud provider running managed Kubernetes via Gardener. They offer a two-week free trial for evaluation, which gave us time to properly test GPU workloads without upfront commitment.

Our test cluster:

2 worker nodes running eg1.v100x1.2xlarge
8 vCPU, 16GB RAM, 1x Nvidia V100 GPU (16GB VRAM) per node
Managed Kubernetes with automatic updates and built-in DNS/TLS via Gardener

The pricing is competitive for GPU instances:

Instance	GPU	$/hr	$/month
eg1.v100x1.2xlarge	V100 16GB	$1.22	~$890
eg1.a100x1.V12-84	A100 80GB	$1.61	~$1,174
eg1.h100x1.V24_96	H100	$4.12	~$3,006

For our 2-node V100 cluster: approximately $1,780/month at full utilization.
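
The monthly figures above are consistent with roughly 730 billable hours per month. That hour count is our assumption (8,760 hours per year divided by 12), not a number taken from Leaf.cloud's pricing, so treat this as a sketch of the arithmetic rather than a quote:

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_cost(hourly_rate: float, nodes: int = 1,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Dollar cost of running `nodes` instances for `hours` hours."""
    return hourly_rate * nodes * hours


if __name__ == "__main__":
    # Hourly rates are the ones listed in the table above.
    for name, rate in [("V100", 1.22), ("A100", 1.61), ("H100", 4.12)]:
        print(f"{name}: ~${monthly_cost(rate):,.0f}/month per node")
    print(f"2-node V100 cluster: ~${monthly_cost(1.22, nodes=2):,.0f}/month")
```

The results land within a couple of dollars of the table, with the small differences coming from rounding in the hourly rates.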

The Stack

Our architecture connects several components:

Open WebUI serves as our chat interface, chosen for its MCP client support and clean UI. It connects to Ollama running on the GPU nodes for local model inference.

The Nextcloud MCP Server bridges the gap between LLMs and Nextcloud APIs—exposing Deck boards, Notes, and WebDAV file operations as MCP tools that AI assistants can invoke.

The MCP Server

The Nextcloud MCP Server exposes several Nextcloud apps as MCP tools:

Deck - Kanban boards for project management
Notes - Markdown note-taking with categories
WebDAV - Full file system operations
Calendar - Event management (available but not enabled in our test)

Deployment is straightforward:

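apiVersion: apps/v1
kind: Deployment
metadata:
  name: nextcloud-mcp
  namespace: ai
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels.
  # The label value here is our choice; any consistent label works.
  selector:
    matchLabels:
      app: nextcloud-mcp
  template:
    metadata:
      labels:
        app: nextcloud-mcp
    spec:
      containers:
        - name: mcp
          image: ghcr.io/cbcoutinho/nextcloud-mcp-server:latest
          command:
            - /app/.venv/bin/nextcloud-mcp-server
            - run
            - --host
            - 0.0.0.0
            # Each --enable-app flag turns on one Nextcloud app's tools
            - --enable-app
            - deck
            - --enable-app
            - webdav
            - --transport
            - streamable-http
          # Nextcloud credentials are injected from a Secret
          envFrom:
            - secretRef:
                name: nextcloud-mcp-secret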

The server uses streamable HTTP transport, making it accessible to MCP clients over the network.
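
To make the transport concrete, here is a minimal sketch of the first message a client would send over streamable HTTP: a JSON-RPC `initialize` request posted to the server. The Service hostname, port, and `/mcp` path are placeholders we made up for illustration, and the protocol version shown is simply one that includes streamable HTTP; check your own deployment for the real values. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Assumptions: hostname, port, and path are placeholders, not values
# taken from the deployment above.
MCP_URL = "http://nextcloud-mcp.ai.svc.cluster.local:8000/mcp"


def build_initialize_request(url: str = MCP_URL) -> urllib.request.Request:
    """Build (but do not send) an MCP initialize request."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The client accepts either a plain JSON reply or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_initialize_request()
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```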

The Context Overhead Problem

Here's where reality diverges from the ideal. With all tools enabled, the MCP server presents approximately 20,000 tokens of tool definitions to the LLM. This includes detailed schemas for every Deck operation (create board, create card, assign labels, move cards between stacks), every WebDAV operation (list, read, write, copy, move, search), and all Notes functionality.

For cloud LLMs with 100k+ context windows, this overhead is negligible. For local models running on a V100 with 16GB VRAM, it's a significant constraint.
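
A quick way to see why the same overhead is negligible in one case and painful in another is to compute how much of the context window the tool definitions consume. The window sizes below are illustrative assumptions, not the settings we ran with; the 20,000-token figure is the approximate overhead mentioned above.

```python
TOOL_DEFINITION_TOKENS = 20_000  # approximate overhead with all tools enabled


def overhead_share(context_window: int,
                   overhead: int = TOOL_DEFINITION_TOKENS) -> float:
    """Fraction of the context window used by tool definitions (capped at 1)."""
    return min(overhead / context_window, 1.0)


def remaining_context(context_window: int,
                      overhead: int = TOOL_DEFINITION_TOKENS) -> int:
    """Tokens left for the conversation after tool definitions are loaded."""
    return max(context_window - overhead, 0)


if __name__ == "__main__":
    # Illustrative window sizes only.
    for window in (8_192, 32_768, 100_000):
        print(f"{window:>7} tokens: {overhead_share(window):.0%} used, "
              f"{remaining_context(window)} left")
```

With a window smaller than the tool catalog, nothing is left for the conversation at all, which is one plausible reason small local setups fail before the model's reasoning ability even comes into play.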

Model Performance Reality

We tested a range of models through Ollama:

Model	Size	Tool Use Reliability
mistral:7b	7B	Unreliable with 20k context overhead
deepseek-r1:8b	8B	Inconsistent tool selection
qwen2.5:14b	14B	Better but still misses tool calls
deepseek-r1:14b	14B	Moderate success rate
ministral-3:14b	14B	Similar to qwen2.5
gpt-oss:20b	20B	Improved but not reliable
deepseek-r1:32b	32B	Best local option, still imperfect

Key findings:

Small models (7B-14B) struggle with the cognitive load of 60+ tool definitions. They often hallucinate tool names, miss required parameters, or fail to recognize when a tool should be used at all.
Larger models (32B+) perform better but still show inconsistency. The V100's 16GB VRAM limits which models we can run effectively—an A100 80GB would significantly expand our options.
Cloud LLMs (Claude, Mistral AI) handle the tool definitions without issue. They correctly identify when to use tools, select the right ones, and structure arguments properly.

This isn't a criticism of local models—they're impressive for their size. But MCP's design assumes LLMs can handle large tool catalogs gracefully, which is currently only reliable with frontier models.

MCP Client Limitations

Open WebUI supports MCP connections, but with significant limitations:

No MCP Sampling Support - The MCP specification includes a "sampling" feature that lets servers request LLM completions for sub-tasks. Open WebUI doesn't implement it, and neither do other MCP clients such as claude-code and gemini-cli. As a result, the MCP server can only provide tools; it cannot use the LLM for its own intelligent operations.
Static Tool Listing - Tools are loaded once when the connection is established. There's no dynamic tool registration based on context or user needs.
No Tool Filtering - You can't selectively enable/disable tools per conversation or assistant.
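
For context on the static tool listing point, the MCP specification does define a way for a server to signal that its tool set changed: a `notifications/tools/list_changed` notification, after which the client re-requests `tools/list`. The sketch below only builds those two JSON-RPC messages so the exchange is visible; the helper function names are ours, and nothing here reflects how Open WebUI behaves internally.

```python
import json


def tools_list_changed_notification() -> str:
    """Server-to-client notification: the available tools have changed.

    Notifications carry no "id" because no response is expected.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "notifications/tools/list_changed",
    })


def tools_list_request(request_id: int) -> str:
    """Client-to-server request to fetch the current tool list."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/list",
    })


if __name__ == "__main__":
    print(tools_list_changed_notification())
    print(tools_list_request(2))
```

A client that ignores the notification keeps working with whatever tool list it fetched at connection time, which matches the static behavior described above.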

The "App Expert" Workaround

To reduce context overhead and improve reliability, we found success with an App Expert pattern:

Instead of one assistant with all tools, create multiple specialized assistants:

Deck Expert - Only Deck tools enabled
Notes Expert - Only Notes tools enabled
Files Expert - Only WebDAV tools enabled

Each expert has a smaller tool set (~5-8k tokens instead of 20k), which smaller models handle more reliably. Users switch between experts based on their current task.
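
The grouping behind the pattern can be sketched as a simple filter over tool names. The tool names and prefixes below are hypothetical and chosen for illustration; the real server's tool names may differ, and in Open WebUI the selection is done through the assistant configuration rather than code like this.

```python
# Hypothetical tool names; the real Nextcloud MCP server may name them differently.
ALL_TOOLS = [
    "deck_create_board", "deck_create_card", "deck_move_card",
    "notes_create_note", "notes_search",
    "webdav_list_directory", "webdav_read_file", "webdav_write_file",
]

# Each expert sees only tools whose names start with its prefix (assumed scheme).
EXPERTS = {
    "Deck Expert": "deck_",
    "Notes Expert": "notes_",
    "Files Expert": "webdav_",
}


def tools_for_expert(expert: str, tools: list[str] = ALL_TOOLS) -> list[str]:
    """Return the subset of tools exposed to one specialized assistant."""
    prefix = EXPERTS[expert]
    return [tool for tool in tools if tool.startswith(prefix)]


if __name__ == "__main__":
    for expert in EXPERTS:
        print(expert, "->", tools_for_expert(expert))
```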

This works, but it's a workaround for what should be a protocol-level feature. The MCP specification supports dynamic tool sets (servers can announce changes with a tools/list_changed notification), but clients need to implement them.

Observability on a Budget

Grafana Cloud's free tier provides:

1,500 samples/second ingestion rate
15,000 sample burst limit
Prometheus metrics, Loki logs, and basic dashboards

The challenge: a Kubernetes cluster generates thousands of metrics per scrape. Without filtering, we'd exceed the free tier immediately.
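
As a rough sizing aid, the sustained ingestion rate is approximately the number of active series divided by the scrape interval. The series counts and the 60-second interval in this sketch are made-up inputs for illustration, not measurements from our cluster.

```python
FREE_TIER_SAMPLES_PER_SECOND = 1_500  # ingestion limit listed above


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Approximate sustained ingestion rate for a given series count."""
    return active_series / scrape_interval_s


def fits_free_tier(active_series: int, scrape_interval_s: float,
                   limit: float = FREE_TIER_SAMPLES_PER_SECOND) -> bool:
    """True if the sustained rate stays at or below the limit."""
    return samples_per_second(active_series, scrape_interval_s) <= limit


if __name__ == "__main__":
    # Hypothetical before/after series counts at a 60-second scrape interval.
    for series in (120_000, 30_000):
        rate = samples_per_second(series, 60)
        print(f"{series} series -> {rate:.0f} samples/s, "
              f"fits: {fits_free_tier(series, 60)}")
```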

Our solution uses Grafana Alloy with aggressive metric filtering:

prometheus.relabel "cadvisor_filter" {
  // forward_to is required. The target name below is an assumption;
  // point it at your own prometheus.remote_write component.
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]

  // Drop all histogram buckets (huge cardinality)
  rule {
    source_labels = ["__name__"]
    regex         = ".*_bucket"
    action        = "drop"
  }
  // Keep only essential container metrics
  rule {
    source_labels = ["__name__"]
    regex         = "container_(cpu_usage_seconds_total|memory_working_set_bytes|memory_usage_bytes|network_receive_bytes_total|network_transmit_bytes_total|fs_usage_bytes|fs_limit_bytes)|machine_(cpu_cores|memory_bytes)"
    action        = "keep"
  }
  // Drop kube-system containers to reduce noise
  rule {
    source_labels = ["namespace"]
    regex         = "kube-system"
    action        = "drop"
  }
}

We apply similar filtering to node exporter, kubelet, and DCGM (GPU) metrics. The result: comprehensive visibility into what matters while staying within free tier limits.
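
Before deploying a filter like this, it can help to check which metric names would survive it. The sketch below replays the three cAdvisor rules in Python, relying on the fact that Prometheus-style relabel regexes are fully anchored (hence `fullmatch`). The sample metric names are just examples, and this is a local approximation of the rules, not Alloy itself.

```python
import re

DROP_BUCKETS = re.compile(r".*_bucket")
KEEP_METRICS = re.compile(
    r"container_(cpu_usage_seconds_total|memory_working_set_bytes"
    r"|memory_usage_bytes|network_receive_bytes_total"
    r"|network_transmit_bytes_total|fs_usage_bytes|fs_limit_bytes)"
    r"|machine_(cpu_cores|memory_bytes)"
)
DROP_NAMESPACE = re.compile(r"kube-system")


def survives(metric_name: str, namespace: str = "") -> bool:
    """Apply the three rules in order; a missing label is an empty string."""
    if DROP_BUCKETS.fullmatch(metric_name):
        return False  # rule 1: drop histogram buckets
    if not KEEP_METRICS.fullmatch(metric_name):
        return False  # rule 2: keep only the allow-listed metrics
    if DROP_NAMESPACE.fullmatch(namespace):
        return False  # rule 3: drop kube-system
    return True


if __name__ == "__main__":
    samples = [
        ("container_cpu_usage_seconds_total", "ai"),
        ("container_cpu_usage_seconds_total", "kube-system"),
        ("container_fs_reads_total", "ai"),
        ("machine_cpu_cores", ""),
    ]
    for name, namespace in samples:
        print(f"{name} [{namespace or '-'}]: {survives(name, namespace)}")
```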

Key metrics we kept:

GPU: utilization, memory usage, temperature, power consumption
Containers: CPU, memory, network I/O for our workloads
Nodes: CPU, memory, disk, network at the host level
MCP Server: Request rates and latencies

What the Metrics Revealed

We ran the POC over five working days, with the cluster auto-hibernating overnight and over weekends. This gave us clean data on actual usage patterns versus idle overhead.

Cluster Activity Windows:

Day	Active Hours (CET)	Duration
Jan 26-30	08:25 - 16:55	~8.5 hrs/day
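
To put the hibernation schedule in cost terms, here is a back-of-the-envelope comparison between an always-on week and the roughly 8.5 hours per day, five days per week that the cluster was actually active. It assumes hibernated GPU nodes incur no compute charges and ignores storage and other fixed costs; both are our assumptions, so check them against your provider's billing before relying on the numbers.

```python
HOURLY_RATE = 1.22  # V100 instance rate from the pricing table
NODES = 2           # size of our test cluster


def weekly_cost(hours_per_day: float, days_per_week: int,
                rate: float = HOURLY_RATE, nodes: int = NODES) -> float:
    """Compute-only cost of one week at the given schedule."""
    return hours_per_day * days_per_week * rate * nodes


if __name__ == "__main__":
    always_on = weekly_cost(24, 7)
    hibernating = weekly_cost(8.5, 5)
    saved = 1 - hibernating / always_on
    print(f"always on:   ${always_on:.2f}/week")
    print(f"hibernating: ${hibernating:.2f}/week")
    print(f"saved:       {saved:.0%}")
```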

Resource Utilization Summary:

Metric	Idle	Peak (during inference)
GPU Utilization	0%	85%
GPU Power	26-27W	145.5W
GPU Temperature	35°C	69°C
Total AI Namespace Memory	~1 GB	7.1 GB
Ollama Memory (model loaded)	15 MB	5.3 GB

MCP Server Performance:

Metric	Value
Median latency (GET)	175-212 ms
Median latency (POST/PUT)	~175 ms
P95 latency	244-470 ms
Error rate	0%
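
For readers less familiar with the latency terms, the sketch below shows how a median and a P95 can be derived from raw request timings using the nearest-rank method. The sample values are invented purely to demonstrate the calculation and are not our measurements; dashboards that compute quantiles from histograms will also give slightly different answers than this direct method.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Invented latencies in milliseconds, for illustration only.
    latencies_ms = [160, 170, 175, 180, 190, 200, 210, 230, 260, 450]
    print("median:", statistics.median(latencies_ms))
    print("p95:   ", percentile(latencies_ms, 95))
```

A single slow request is enough to pull the P95 far above the median in a small sample, which is why the two are worth reporting separately.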

The V100 GPU was genuinely utilized—85% utilization during inference with power draw jumping from 27W idle to 145W. This confirms we weren't just burning GPU hours on CPU-bound work.

The Honest Assessment:

The infrastructure performed well. Zero errors across the five-day POC, sub-500ms API latencies, and efficient auto-hibernation. However, the observability data confirmed what we suspected from qualitative testing: smaller models struggled with MCP tool interactions due to context constraints.

With 20,000+ tokens of tool definitions competing for context space, models in the 7B-14B range frequently:

Failed to recognize when tools should be invoked
Hallucinated tool names or parameters
Lost track of multi-step operations

The 32B models showed improvement but still exhibited inconsistency. The V100's 16GB VRAM ceiling limits us to these smaller models—running a 70B parameter model that might handle the full tool catalog reliably would require an A100 80GB or H100.

Future Investigation:

A follow-up evaluation with an A100 instance ($1.61/hr vs $1.22/hr for the V100) would let us test whether larger models like deepseek-r1:70b or qwen2.5:72b can reliably handle the full MCP tool catalog. The 5x VRAM increase (80GB vs 16GB) opens up model sizes that may cross the threshold from "sometimes works" to "reliably works."

For now, the App Expert pattern (specialized assistants with reduced tool sets) remains the practical path for self-hosted deployments on V100-class hardware.

Lessons Learned

1. MCP Specification vs. Reality

The MCP specification is thoughtful and comprehensive. Client implementations are still catching up. Features like sampling, dynamic tools, and resource subscriptions exist in the spec but are rare in practice.

Recommendation for MCP server developers: Design for the lowest common denominator. Provide fewer, more focused tools rather than comprehensive coverage. Consider offering multiple tool "profiles" that clients can select.

2. Context Reduction Strategies

If you're building MCP servers:

Minimize tool descriptions - Every token counts for small models
Consolidate related operations - One manage_card tool with an action parameter beats five separate tools
Make parameters optional with sensible defaults
Consider tool "tiers" - Basic tools always available, advanced tools on request
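
As an illustration of the consolidation idea, here is a sketch of what a single `manage_card` tool definition with an `action` parameter might look like, plus a small validator for incoming calls. The field names follow the MCP tool shape (`name`, `description`, `inputSchema`), but the specific actions, parameters, and validation rules are hypothetical and not taken from the Nextcloud MCP server.

```python
# Hypothetical consolidated tool; actions and parameters are illustrative.
MANAGE_CARD_TOOL = {
    "name": "manage_card",
    "description": "Create, update, move, or delete a Deck card.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "action": {"type": "string",
                       "enum": ["create", "update", "move", "delete"]},
            "card_id": {"type": "integer"},
            "stack_id": {"type": "integer"},
            "title": {"type": "string"},
        },
        "required": ["action"],
    },
}

ACTIONS = MANAGE_CARD_TOOL["inputSchema"]["properties"]["action"]["enum"]


def validate_call(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    action = arguments.get("action")
    if action not in ACTIONS:
        return [f"unknown action: {action!r}"]
    if action == "create":
        needed = ["title", "stack_id"]
    elif action == "move":
        needed = ["card_id", "stack_id"]
    else:  # update, delete
        needed = ["card_id"]
    return [f"{action} requires {field}"
            for field in needed if field not in arguments]


if __name__ == "__main__":
    print(validate_call({"action": "create", "title": "Write post", "stack_id": 1}))
    print(validate_call({"action": "move", "card_id": 3}))
```

The trade-off is that one schema replaces several, saving description tokens, while the per-action requirements move out of the schema and into server-side validation and clear error messages.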

3. GPU Memory is the Constraint

For local LLM deployments, GPU VRAM determines what's possible more than compute. The V100's 16GB limits us to models that fit with room for context. The A100 80GB, at roughly $0.39/hr more ($1.61 vs $1.22), would dramatically expand model options.
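
A rough rule of thumb for the weights alone is parameter count times bits per parameter. The sketch below applies it to a few model sizes. It deliberately ignores the KV cache, activations, and runtime overhead, and real quantized formats use somewhat more than their nominal bit width, so treat the results as lower bounds rather than measurements.

```python
def weights_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for size in (7, 14, 32, 70):
        fp16 = weights_gib(size, 16)
        q4 = weights_gib(size, 4)
        print(f"{size:>3}B: ~{fp16:6.1f} GiB at 16-bit, ~{q4:5.1f} GiB at 4-bit")
```

Even at a nominal 4 bits, a 32B model's weights alone come close to 16GB before any context is allocated, while a 70B model at the same precision fits comfortably within 80GB.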

4. EU Infrastructure is Viable

Leaf.cloud proved capable for this workload. Gardener-based Kubernetes "just works"—automated TLS via cert-manager, DNS management, and straightforward GPU scheduling. The two-week free trial is genuinely useful for evaluation.

Where This Goes Next

The pieces are almost there. We need:

Better MCP client implementations - Sampling support, dynamic tools, tool filtering
Smarter tool presentation - Lazy-load tool definitions based on conversation context
Smaller, more capable models - The gap between 14B and 70B models is closing
Quantization improvements - Running larger models in less VRAM

The dream of a private AI assistant that knows your notes, manages your projects, and respects your data sovereignty is achievable today—with the right model and some workarounds. It'll be seamless within a year or two.

Try It Yourself

The stack we tested:

Leaf.cloud - EU Kubernetes with GPU instances
Open WebUI - Chat interface with MCP support
Ollama - Local model serving
Nextcloud MCP Server - MCP bridge to Nextcloud
Grafana Alloy - Observability pipeline

Start with cloud LLMs (Claude, Mistral) for reliable tool use, then experiment with local models once your MCP server is working. And if you're building MCP clients or servers—please prioritize the sampling specification. The ecosystem needs it.

Questions or experiences to share? The Nextcloud MCP server is open source and welcomes contributions.
