
Showing posts with label Ollama. Show all posts

Wednesday, July 1, 2026

How We Run Our Own LLMs on Our Own Hardware (and Why We Don't Just Call the Big APIs)

When you need an LLM, the default move in 2026 is to grab an API key from one of the big providers and start calling. For a lot of what we build at Megam, that's the wrong default. Our automation reads client back-office data. Our telehealth platform touches patient records. Our conferencing tool handles private calls. None of that can be shipped off to someone else's cloud — and honestly, even where it could, the economics and the loss of control often don't make sense.

So we run our own inference. Here's how we stand it up, and where the line actually falls between self-hosting and calling an API.

Why self-host at all

  • Data residency & privacy. The data never leaves infrastructure we control. For regulated clients — healthcare, finance, GCC data-residency rules — this isn't a preference, it's the requirement.
  • Cost at volume. Per-token pricing is cheap for a demo and brutal at scale. A fixed box you've already paid for doesn't meter you by the request.
  • Control. The model doesn't change under you overnight, doesn't deprecate, doesn't rate-limit you mid-batch. You pin the version and it stays put.
  • Offline capability. Air-gapped and on-prem deployments simply can't reach a public API. Local is the only option.
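The cost-at-volume point comes down to simple break-even arithmetic. The sketch below is illustrative only: the function name and every figure in the example call (hardware price, running cost, token volume, per-token price) are made-up placeholders, not quotes from any provider or numbers from our deployments.

```python
def breakeven_months(hardware_cost, monthly_running_cost,
                     monthly_tokens_millions, api_price_per_million):
    """Months until a purchased box costs less than paying per token.

    Returns None when the API is cheaper every month, so the box never pays off.
    """
    monthly_api_cost = monthly_tokens_millions * api_price_per_million
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical inputs: a 3000-unit box, 100 per month for power and upkeep,
# 500M tokens per month at 2 per million tokens through an API.
months = breakeven_months(3000, 100, 500, 2.0)
print(f"break-even after {months:.1f} months")
```

Plug in your own volume and prices. At low volume the function returns None, which is the case where a metered API is still the cheaper option.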

Step 1 — Pick the box, size the model to the VRAM

The single constraint that matters is GPU memory. Rough working guide, using 4-bit quantized models:

~7-8B  model  ->  ~6 GB VRAM   (fits almost anything)
~14B   model  ->  ~10-12 GB
~27-32B model  ->  ~20-24 GB    (a 32 GB card handles this comfortably)
~70B   model  ->  ~40 GB+       (big card or multi-GPU)

A modern 24-32 GB card covers the vast majority of real workloads. You don't need frontier-scale hardware to get genuinely useful results.
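The guide above follows from simple arithmetic: quantized weights take roughly parameters times bits per weight divided by eight, plus some headroom for the runtime. Here is a minimal sketch of that estimate. The defaults (4.5 effective bits per weight, 1.5 GB flat overhead) are assumptions picked to land near the guide, and the KV cache, which grows with context length, is not modelled.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough VRAM need: quantized weights plus a flat allowance for runtime overhead.

    bits_per_weight and overhead_gb are assumed values, not measurements.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weights_gb + overhead_gb


for size in (8, 14, 32, 70):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

Treat the result as a lower bound: long prompts and several loaded models at once will push real usage higher.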

Step 2 — Install Ollama and pull your models

Ollama is the simplest way to serve open models. One server can host several — chat, vision, and embeddings — and load them on demand.

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a few models for different jobs
ollama pull qwen2.5          # general chat / reasoning
ollama pull qwen2.5vl        # vision (images, screenshots, documents)
ollama pull nomic-embed-text # embeddings for RAG / search
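After pulling, it can be useful to confirm from a script that the server really has every model your apps expect. The sketch below reads Ollama's /api/tags listing; the helper names and the WANTED list are ours, and the base URL assumes the default local port.

```python
import json
import urllib.request

WANTED = ["qwen2.5", "qwen2.5vl", "nomic-embed-text"]


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the Ollama server which models it has (GET /api/tags)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags, wanted=WANTED):
    """Return the wanted models that do not appear in a /api/tags response."""
    # Ollama reports names with a tag suffix such as "qwen2.5:latest".
    installed = {m["name"].split(":")[0] for m in tags.get("models", [])}
    return [name for name in wanted if name not in installed]


# Usage (needs a running server):
# print(missing_models(fetch_tags()) or "all models present")
```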

Step 3 — Serve it on the network with an OpenAI-compatible API

By default Ollama listens only on localhost. To let your apps reach it, bind it to all interfaces:

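# Expose Ollama on the LAN (runs the server in the foreground)
# Note: the Linux install script also registers a systemd service.
# Stop it first (sudo systemctl stop ollama), or set OLLAMA_HOST in the
# service's environment instead of running "ollama serve" by hand.
export OLLAMA_HOST=0.0.0.0:11434
ollama serve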

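Important: Ollama has no built-in authentication, so never expose port 11434 to the open internet directly. Put Nginx in front of it for TLS and an API key, so only your apps get through. If Nginx runs on the same machine, as in the config below where it proxies to 127.0.0.1, Ollama can stay on its default localhost binding and only port 443 needs to be reachable: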

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # ssl_certificate ... ssl_certificate_key ...;

    location /v1/ {
        # simple shared-secret gate
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;   # long generations
    }
}
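YOUR_SECRET_KEY in the config above should be a long random value, not something memorable. Here is a minimal sketch for generating one and for building the exact header value the Nginx rule compares against. The helper names are hypothetical; header_matches is only relevant if you mirror the same gate inside an app.

```python
import hmac
import secrets


def new_api_key(nbytes=32):
    """Generate a URL-safe random key to use in place of YOUR_SECRET_KEY."""
    return secrets.token_urlsafe(nbytes)


def expected_header(key):
    """The Authorization header value the Nginx check expects."""
    return f"Bearer {key}"


def header_matches(received, key):
    """Constant-time comparison, for app-side gates that mirror the Nginx rule."""
    return hmac.compare_digest(received.encode(), expected_header(key).encode())


key = new_api_key()
print(expected_header(key))
```

Store the key in your secrets manager or environment, and rotate it by updating the Nginx config and the clients together.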

Step 4 — Call it from any app (it's just the OpenAI SDK)

Because Ollama speaks the OpenAI API format, your application code doesn't care that it's talking to your own box. You change the base_url (plus the API key and model name) and leave the rest of the code alone. Migrating from a public API, or back, is a configuration change rather than a rewrite.

from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourdomain.com/v1",
    api_key="YOUR_SECRET_KEY",
)

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Summarize this ticket in one line: ..."}],
)
print(resp.choices[0].message.content)
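One way to keep that switch out of the code entirely is to read the endpoint settings from the environment. This is a sketch, not how any particular product does it: LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical variable names, and the defaults assume a local Ollama on its standard port.

```python
import os


def llm_settings(env=os.environ):
    """Read endpoint settings from the environment, defaulting to the local box.

    LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical variable names.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # Ollama itself ignores the key
        "model": env.get("LLM_MODEL", "qwen2.5"),
    }


def make_client(settings):
    """Build the OpenAI client; imported lazily so the settings logic has no dependencies."""
    from openai import OpenAI
    return OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])


settings = llm_settings()
print(settings["base_url"], settings["model"])
```

Application code then calls make_client(settings) and passes settings["model"] to each request, so moving between your own box and a hosted API never touches the call sites.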

Step 5 — One server, many jobs

The same endpoint serves every model; each request picks one through its model field. Send chat to qwen2.5, screenshots and documents to qwen2.5vl, and use nomic-embed-text to build the vectors behind a RAG search. Ollama loads and unloads models in VRAM as calls come in, so a single machine covers a surprising amount of ground, though the first call after a swap waits for the model to load.
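In application code, the routing is just a lookup from task to model name. Here is a minimal sketch that builds the arguments for a chat or vision request. The task labels and helper name are hypothetical, and the image is passed as a base64 data URL in the OpenAI message format.

```python
import base64

# Task-to-model table; the task names are hypothetical labels for this sketch.
MODEL_FOR_TASK = {
    "chat": "qwen2.5",
    "vision": "qwen2.5vl",
    "embed": "nomic-embed-text",
}


def chat_request(task, text, image_bytes=None, image_mime="image/png"):
    """Build keyword arguments for client.chat.completions.create()."""
    if task not in ("chat", "vision"):
        raise ValueError(f"not a chat-style task: {task}")
    if image_bytes is None:
        content = text
    else:
        # OpenAI-style multimodal message: a text part plus a base64 data URL.
        encoded = base64.b64encode(image_bytes).decode()
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{image_mime};base64,{encoded}"}},
        ]
    return {"model": MODEL_FOR_TASK[task], "messages": [{"role": "user", "content": content}]}


req = chat_request("vision", "What does this screenshot show?", b"\x89PNG...")
print(req["model"])
```

Pass the result as client.chat.completions.create(**req). Embeddings go through a different call, client.embeddings.create(model=MODEL_FOR_TASK["embed"], input=...), which is why "embed" is rejected here.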

Where this powers our products

This isn't a lab experiment — it's the shared engine under everything we ship. The reasoning layer in MeBot (our agentic RPA) calls it to decide next actions. Datalytics uses it for natural-language analytics. Olivasal runs meeting summaries on it. Saffron, our telehealth platform, runs clinical models on it. One on-prem inference tier, many products — and no client data ever leaves the building.

The honest tradeoffs

Self-hosting isn't free. You own the ops: uptime, drivers, model updates. Open models are excellent now, but the very hardest reasoning tasks may still favour a frontier API. And you'll have to think about concurrency and batching, which a managed API hides from you. Our rule of thumb: self-host anything touching sensitive data or running at volume; reach for a frontier API only for the rare task that genuinely needs it. For most real work, your own hardware wins on cost, privacy, and control at the same time.
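On the concurrency point: a single GPU only serves so many requests at once, so batch jobs usually need a cap on requests in flight. Here is a minimal sketch of that idea using an asyncio semaphore. run_bounded and fake_llm_call are hypothetical names, and the fake call stands in for a real request to your inference server.

```python
import asyncio


async def run_bounded(items, worker, max_in_flight=4):
    """Run worker(item) for every item with at most max_in_flight running at once.

    Results come back in the same order as items.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        async with gate:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_llm_call(prompt):
    """Stand-in for a real request to the inference server."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_llm_call, max_in_flight=2))
print(results)
```

The right value for max_in_flight depends on the model size, the card, and how the server is configured, so tune it by measurement rather than by guess.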

If you need AI running inside your own walls — because your data can't leave, or the API bill doesn't scale — that's precisely what we build for clients → megamtech.com

Build a Self-Hosted Meeting-Summary Pipeline (No Cloud, No Data Leaving Your Box)

Every AI notetaker on the market has the same catch: to summarize your meeting, it uploads your audio to its cloud. For a regulated team — healthcare, finance, anyone under data-residency rules — that's a non-starter. Your private calls quietly become someone else's data.

This is exactly why we built the summary engine inside Olivasal to run entirely on infrastructure you control. In this post I'll show you the core of it: take a recorded meeting, produce a clean transcript, and generate a structured summary with action items — all on your own box, nothing leaving the network.

What you'll build

A three-stage pipeline: recording → transcript → AI summary. Stage 1 uses your meeting recording (from LiveKit egress, or any file). Stage 2 transcribes with faster-whisper. Stage 3 summarizes with a local Qwen model served by Ollama. No external API is ever called.

Prerequisites

# System: ffmpeg for audio, plus Python 3.10+
# A GPU makes transcription much faster, but CPU works too

pip install faster-whisper openai

# Ollama for the local LLM (https://ollama.com)
ollama pull qwen2.5

Step 1 — Get the audio

If you're running a conference stack, LiveKit's egress writes a recording when the call ends. Composite egress gives you one mixed file; per-track egress gives you one file per participant (we'll use that later for speaker labels). Either way, extract a clean 16 kHz mono WAV — the format Whisper likes best:

ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
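If the pipeline is driven from Python, the same conversion can be wrapped in a small helper. This sketch assumes ffmpeg is on the PATH; the function names are ours, and it adds -y (overwrite) and -vn (skip video) to the command above.

```python
import subprocess


def ffmpeg_cmd(src, dst, sample_rate=16000):
    """Argument list for the conversion above, plus -y and -vn."""
    return [
        "ffmpeg", "-y",            # -y: overwrite dst without prompting
        "-i", src,
        "-vn",                     # ignore any video stream
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", "1",                # downmix to mono
        "-c:a", "pcm_s16le",       # 16-bit PCM
        dst,
    ]


def extract_audio(src, dst):
    """Run ffmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(ffmpeg_cmd(src, dst), check=True, capture_output=True)


print(" ".join(ffmpeg_cmd("meeting.mp4", "audio.wav")))
```

Passing the arguments as a list rather than a shell string means file names with spaces or quotes need no escaping.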

Step 2 — Transcribe locally with faster-whisper

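faster-whisper is a reimplementation of Whisper on top of CTranslate2; the project reports up to 4x faster transcription than the reference implementation at the same accuracy, with lower memory use. Setting vad_filter=True runs voice-activity detection first and skips non-speech audio, so you don't waste compute on dead air.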

from faster_whisper import WhisperModel

# use device="cpu", compute_type="int8" if you have no GPU
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    vad_filter=True,
)

# segments is a lazy generator: transcription actually runs while it is consumed here
transcript = " ".join(seg.text.strip() for seg in segments)
print(f"Detected language: {info.language}")
print(transcript)

That's your full transcript, produced without a single byte leaving the machine.
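Joining everything into one string throws away the timing that each segment carries. If you want timestamps in the transcript, a small formatter is enough. This sketch relies only on the .start and .text attributes of the segments; the namedtuple is a stand-in so the example runs without a model, and the function names are ours.

```python
from collections import namedtuple


def fmt_time(seconds):
    """Format seconds as MM:SS, or HH:MM:SS past the hour."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"


def timestamped_lines(segments):
    """Turn objects with .start and .text into '[MM:SS] text' lines."""
    return [f"[{fmt_time(seg.start)}] {seg.text.strip()}" for seg in segments]


# Stand-in for faster-whisper segments, which expose .start, .end and .text.
Seg = namedtuple("Seg", "start end text")
demo = [Seg(0.0, 4.2, " Let's get started."), Seg(65.5, 70.0, " Next item.")]
print("\n".join(timestamped_lines(demo)))
```

Because the real segments object is a generator, wrap it in list(...) first if you want both the plain transcript and the timestamped lines.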

Step 3 — Summarize with a local LLM

Ollama exposes an OpenAI-compatible endpoint, so you can use the standard openai client and just point it at localhost. We ask for structured JSON so the output is easy to store and render.

from openai import OpenAI
import json

# api_key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

prompt = f"""You are a meeting assistant. From the transcript below, return JSON with:
- "summary": a 3-sentence overview
- "decisions": a list of key decisions made
- "action_items": a list of objects with "task" and "owner"

Transcript:
{transcript}"""

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.2,
)

notes = json.loads(resp.choices[0].message.content)
print(json.dumps(notes, indent=2))

You now have clean, structured meeting notes — summary, decisions, and owner-tagged action items — generated entirely on your own hardware.
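One caveat for long meetings: depending on how the server is configured, the model's context window may be shorter than an hour of transcript, and the overflow can be dropped without an error. A common workaround is to summarize in chunks and then summarize the summaries. The sketch below shows that pattern with a fake summarizer; the function names are hypothetical, max_chars=8000 is an assumed value, and character count is only a rough proxy for tokens.

```python
def split_transcript(text, max_chars=8000):
    """Split text on whitespace into chunks of at most max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space
        if current and length + 1 + len(word) > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
        length += len(word) + (1 if current else 0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text, summarize, max_chars=8000):
    """Summarize each chunk, then summarize the joined partial summaries.

    summarize is any callable str -> str, e.g. a wrapper around the chat call above.
    """
    chunks = split_transcript(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))


# Demo with a fake summarizer that keeps the first three words of its input.
fake = lambda t: " ".join(t.split()[:3])
print(summarize_long("one two three four five six seven eight", fake, max_chars=20))
```

In the real pipeline, the per-chunk call would return plain-text notes and only the final pass would request the structured JSON.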

Step 4 — Add speaker labels (optional)

This is where per-track egress earns its keep. Instead of one mixed file, you get a separate audio track per participant. Transcribe each track on its own, tag every segment with that participant's identity, then merge all segments back together in timestamp order. The result is a speaker-attributed transcript — "Priya: … / Arun: …" — which makes the LLM's action-item ownership far more accurate, because it can see who actually committed to what.
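The merge step described above can be sketched in a few lines. This is a minimal illustration, not Olivasal's implementation: it assumes every track's timestamps are measured from the same start time (if participants joined late, add each track's offset first), and the Segment tuple stands in for the transcriber's segment objects.

```python
import heapq
from collections import namedtuple

Segment = namedtuple("Segment", "start end text")


def merge_tracks(tracks):
    """Merge per-speaker segment lists into one timeline ordered by start time.

    tracks maps a speaker name to that speaker's segments, each list already
    in time order (as a transcriber yields them).
    """
    tagged = (
        [(seg.start, speaker, seg.text.strip()) for seg in segs]
        for speaker, segs in tracks.items()
    )
    return list(heapq.merge(*tagged))


def render(timeline):
    """Format the merged timeline as 'Speaker: text' lines."""
    return "\n".join(f"{speaker}: {text}" for _, speaker, text in timeline)


tracks = {
    "Priya": [Segment(0.0, 3.0, " I'll send the report by Friday."),
              Segment(9.0, 11.0, " Sounds good.")],
    "Arun": [Segment(4.0, 8.0, " I'll review it on Monday.")],
}
print(render(merge_tracks(tracks)))
```

Feed the rendered text to the summarization prompt in place of the plain transcript, and the model can tie each commitment to a named speaker.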

Why self-hosted matters here

The convenience of AI summaries usually comes at the cost of handing your conversations to a vendor. This pipeline gives you the convenience and keeps every stage — media, transcript, and model — inside your own walls. For regulated teams, that's not a nice-to-have; it's the whole requirement.

This is the engine behind Olivasal, our self-hosted video conferencing with built-in AI summaries — and the same core powers the clinical scribe in our telehealth platform, Saffron. If you'd rather have this running out of the box than wire it up yourself, that's exactly what we built → olivasal.com