Wednesday, July 1, 2026

How We Run Our Own LLMs on Our Own Hardware (and Why We Don't Just Call the Big APIs)

When you need an LLM, the default move in 2026 is to grab an API key from one of the big providers and start calling. For a lot of what we build at Megam, that's the wrong default. Our automation reads client back-office data. Our telehealth platform touches patient records. Our conferencing tool handles private calls. None of that can be shipped off to someone else's cloud — and honestly, even where it could, the economics and the loss of control often don't make sense.

So we run our own inference. Here's how we stand it up, and where the line actually falls between self-hosting and calling an API.

Why self-host at all

  • Data residency & privacy. The data never leaves infrastructure we control. For regulated clients — healthcare, finance, GCC data-residency rules — this isn't a preference, it's the requirement.
  • Cost at volume. Per-token pricing is cheap for a demo and brutal at scale. A fixed box you've already paid for doesn't meter you by the request.
  • Control. The model doesn't change under you overnight, doesn't get deprecated, and doesn't rate-limit you mid-batch. You pin the version and it stays put.
  • Offline capability. Air-gapped and on-prem deployments simply can't reach a public API. Local is the only option.
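The cost-at-volume point comes down to break-even arithmetic. A rough sketch follows; the function name is hypothetical and every figure in it is a made-up placeholder, not a real price:

```python
def breakeven_months(hardware_cost, monthly_api_bill, monthly_running_cost):
    """Months until a purchased box becomes cheaper than a metered API.

    All inputs use the same currency. Returns None when self-hosting
    never pays off (running costs meet or exceed the API bill).
    """
    monthly_saving = monthly_api_bill - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Placeholder numbers for illustration only, not real prices.
print(breakeven_months(hardware_cost=6000, monthly_api_bill=1500, monthly_running_cost=300))
```

Plug in your own hardware quote, current API spend, and an honest estimate of power and ops time; the shape of the calculation matters more than these sample inputs.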

Step 1 — Pick the box, size the model to the VRAM

The single constraint that matters is GPU memory. Rough working guide, using 4-bit quantized models:

~7-8B   model  ->  ~6 GB VRAM   (fits almost anything)
~14B    model  ->  ~10-12 GB
~27-32B model  ->  ~20-24 GB    (a 32 GB card handles this comfortably)
~70B    model  ->  ~40 GB+      (big card or multi-GPU)

A modern 24-32 GB card covers the vast majority of real workloads. You don't need frontier-scale hardware to get genuinely useful results.
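For sizes not in the guide above, a back-of-the-envelope estimate can help. This is a sketch under two assumptions of ours, not measured figures: roughly 4.5 bits per weight for a typical 4-bit quantization, and about 1.5 GB of headroom for the KV cache and runtime. Longer contexts need more headroom.

```python
def estimate_vram_gb(params_billions, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough VRAM estimate for a quantized model.

    bits_per_weight and overhead_gb are assumed defaults, not measurements;
    real usage depends on the quantization format and context length.
    """
    weights_gb = params_billions * bits_per_weight / 8  # 1e9 params * bytes each
    return weights_gb + overhead_gb


for size in (8, 14, 32, 70):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

The results land in the same ballpark as the guide, slightly low for mid-size models, so treat them as a floor rather than a guarantee.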

Step 2 — Install Ollama and pull your models

Ollama is the simplest way to serve open models. One server can host several — chat, vision, and embeddings — and load them on demand.

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a few models for different jobs
ollama pull qwen2.5          # general chat / reasoning
ollama pull qwen2.5vl        # vision (images, screenshots, documents)
ollama pull nomic-embed-text # embeddings for RAG / search

Step 3 — Serve it on the network with an OpenAI-compatible API

By default Ollama listens only on localhost. To let your apps reach it, bind it to all interfaces:

# Expose Ollama on the LAN
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Important: Ollama has no built-in authentication, so never expose port 11434 to the open internet directly. Put Nginx in front of it for TLS and an API key, so only your apps get through:

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # ssl_certificate ... ssl_certificate_key ...;

    location /v1/ {
        # simple shared-secret gate
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;   # long generations
    }
}
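The YOUR_SECRET_KEY placeholder should be a long random value rather than something memorable. One way to generate one, using Python's standard library (the helper name is just for illustration):

```python
import secrets


def new_api_key(num_bytes=32):
    """Return a URL-safe random token built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


# Paste the printed value into the Nginx config and your app's settings.
print(new_api_key())
```

The token uses only letters, digits, hyphens and underscores, so it can sit inside the quoted string in the Nginx `if` check without escaping.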

Step 4 — Call it from any app (it's just the OpenAI SDK)

Because Ollama exposes an OpenAI-compatible API, your application code doesn't care that it's talking to your own box. You point base_url at your server, swap the API key and model name, and leave the rest of the code alone. Migrating from a public API, or back, is a configuration change rather than a rewrite.

from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourdomain.com/v1",
    api_key="YOUR_SECRET_KEY",
)

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Summarize this ticket in one line: ..."}],
)
print(resp.choices[0].message.content)
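To keep that switch out of the code entirely, the connection details can come from the environment. A minimal sketch: the variable names LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are our own invention, not something Ollama or the OpenAI SDK defines.

```python
import os


def llm_settings(env=os.environ):
    """Read endpoint settings from the environment, defaulting to the self-hosted box."""
    return {
        "base_url": env.get("LLM_BASE_URL", "https://ai.yourdomain.com/v1"),
        "api_key": env.get("LLM_API_KEY", "YOUR_SECRET_KEY"),
        "model": env.get("LLM_MODEL", "qwen2.5"),
    }


settings = llm_settings()
# Then, in the app:
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
#   client.chat.completions.create(model=settings["model"], messages=[...])
print(settings["base_url"], settings["model"])
```

Switching providers then means changing three environment variables in the deployment, with no code change.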

Step 5 — One server, many jobs

The same endpoint serves different tasks with different models; the model field in each request picks which one runs. Send chat to qwen2.5, screenshots and documents to qwen2.5vl, and use nomic-embed-text to build the vectors behind a RAG search. Ollama swaps models in and out of VRAM as calls come in, so a single machine covers a surprising amount of ground.
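In application code this usually reduces to a small lookup from task to model name, plus a similarity ranking over the embedding vectors. A sketch, with hypothetical helper names and toy two-dimensional vectors standing in for real embeddings:

```python
import math

# Task-to-model mapping, using the model names pulled in Step 2.
TASK_MODELS = {
    "chat": "qwen2.5",
    "vision": "qwen2.5vl",
    "embed": "nomic-embed-text",
}


def model_for(task):
    """Return the model name to put in the request's model field."""
    try:
        return TASK_MODELS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(query_vec, doc_vecs):
    """Index of the document vector most similar to the query vector."""
    return max(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]))


# Toy vectors; in practice these would come from the embedding model.
docs = [[1.0, 0.0], [0.0, 1.0]]
print(model_for("vision"), top_match([0.1, 0.9], docs))
```

A real RAG setup would typically store the vectors in a vector database rather than a Python list, but the ranking idea is the same.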

Where this powers our products

This isn't a lab experiment — it's the shared engine under everything we ship. The reasoning layer in MeBot (our agentic RPA) calls it to decide next actions. Datalytics uses it for natural-language analytics. Olivasal runs meeting summaries on it. Saffron, our telehealth platform, runs clinical models on it. One on-prem inference tier, many products — and no client data ever leaves the building.

The honest tradeoffs

Self-hosting isn't free. You own the ops: uptime, drivers, model updates. Open models are excellent now, but the very hardest reasoning tasks may still favour a frontier API. And you'll have to think about concurrency and batching, which a managed API hides from you. Our rule of thumb: self-host anything touching sensitive data or running at volume; reach for a frontier API only for the rare task that genuinely needs it. For most real work, your own hardware wins on cost, privacy, and control at the same time.
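On the concurrency point, one common client-side pattern is to cap how many requests are in flight so a batch job doesn't swamp a single GPU. A minimal sketch using asyncio; the fake_llm_call worker is a stand-in for a real request, and the limit of 2 is an arbitrary example value:

```python
import asyncio


async def run_bounded(items, worker, max_in_flight=4):
    """Run worker(item) for every item, with at most max_in_flight at once.

    Results come back in the same order as the inputs.
    """
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_llm_call(prompt):
    """Stand-in for a real model call; sleeps briefly, then echoes in upper case."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_llm_call, max_in_flight=2))
print(results)
```

The right limit depends on the model size and the card, so it is worth tuning against real traffic rather than guessing.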

If you need AI running inside your own walls — because your data can't leave, or the API bill doesn't scale — that's precisely what we build for clients → megamtech.com