

Wednesday, July 1, 2026

How We Run Our Own LLMs on Our Own Hardware (and Why We Don't Just Call the Big APIs)

When you need an LLM, the default move in 2026 is to grab an API key from one of the big providers and start calling. For a lot of what we build at Megam, that's the wrong default. Our automation reads client back-office data. Our telehealth platform touches patient records. Our conferencing tool handles private calls. None of that can be shipped off to someone else's cloud — and honestly, even where it could, the economics and the loss of control often don't make sense.

So we run our own inference. Here's how we stand it up, and where the line actually falls between self-hosting and calling an API.

Why self-host at all

  • Data residency & privacy. The data never leaves infrastructure we control. For regulated clients — healthcare, finance, GCC data-residency rules — this isn't a preference, it's the requirement.
  • Cost at volume. Per-token pricing is cheap for a demo and brutal at scale. A fixed box you've already paid for doesn't meter you by the request.
  • Control. The model doesn't change under you overnight, doesn't deprecate, doesn't rate-limit you mid-batch. You pin the version and it stays put.
  • Offline capability. Air-gapped and on-prem deployments simply can't reach a public API. Local is the only option.
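
To make the cost argument concrete, here is a back-of-the-envelope break-even sketch. Every number in the usage line is a made-up placeholder, not a quote from any provider or vendor; plug in your own hardware price, running cost, and API rate.

```python
def breakeven_tokens(hardware_cost: float,
                     monthly_running_cost: float,
                     months: int,
                     api_price_per_million: float) -> float:
    """Tokens you must process over `months` before owning the box
    costs less than paying per token. All inputs are your own estimates."""
    total_owned = hardware_cost + monthly_running_cost * months
    return total_owned / api_price_per_million * 1_000_000


# Hypothetical inputs: a 3000 box, 50/month for power and hosting,
# a 24-month horizon, and an API priced at 2.00 per million tokens.
tokens = breakeven_tokens(3000, 50, 24, 2.0)
```

Below that token count the API is cheaper; above it, the fixed box wins and keeps winning.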

Step 1 — Pick the box, size the model to the VRAM

The single constraint that matters is GPU memory. Rough working guide, using 4-bit quantized models:

~7-8B   model  ->  ~6 GB VRAM   (fits almost anything)
~14B    model  ->  ~10-12 GB
~27-32B model  ->  ~20-24 GB    (a 32 GB card handles this comfortably)
~70B    model  ->  ~40 GB+      (big card or multi-GPU)

A modern 24-32 GB card covers the vast majority of real workloads. You don't need frontier-scale hardware to get genuinely useful results.
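
The guide above can be roughly reproduced with simple arithmetic: weight memory is parameters times bits per weight, plus headroom for the KV cache and runtime. The defaults below (about 4.5 bits per weight for a typical 4-bit quant, 20% headroom) are our assumptions for a sketch, not measured values; long contexts need more.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead_fraction: float = 0.2) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    bits_per_weight and overhead_fraction are assumed defaults;
    real usage depends on the quant format and context length.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


def fits(params_billion: float, card_vram_gb: float) -> bool:
    """True if the estimate fits on a card with the given VRAM."""
    return estimate_vram_gb(params_billion) <= card_vram_gb
```

With these defaults, `fits(32, 24)` is true and `fits(70, 24)` is false, which lines up with the table.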

Step 2 — Install Ollama and pull your models

Ollama is the simplest way to serve open models. One server can host several — chat, vision, and embeddings — and load them on demand.

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a few models for different jobs
ollama pull qwen2.5          # general chat / reasoning
ollama pull qwen2.5vl        # vision (images, screenshots, documents)
ollama pull nomic-embed-text # embeddings for RAG / search
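
A quick way to confirm the pulls landed is to ask the server what it has. This sketch reads Ollama's `/api/tags` listing and reports anything missing; the helper names and the default URL are ours, and the name matching (ignoring the `:tag` suffix) is a simplification.

```python
import json
from urllib.request import urlopen

WANTED = ["qwen2.5", "qwen2.5vl", "nomic-embed-text"]


def fetch_tags(base_url: str = "http://127.0.0.1:11434") -> dict:
    """Fetch the list of locally available models from Ollama."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def missing_models(tags: dict, wanted: list) -> list:
    """Return the wanted model names that the server does not list."""
    installed = set()
    for model in tags.get("models", []):
        name = model["name"]                   # e.g. "qwen2.5:latest"
        installed.add(name)
        installed.add(name.split(":", 1)[0])   # also match without the tag
    return [w for w in wanted if w not in installed]
```

On the server, `missing_models(fetch_tags(), WANTED)` should come back empty once all three pulls have finished.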

Step 3 — Serve it on the network with an OpenAI-compatible API

By default Ollama listens only on localhost. To let your apps reach it, bind it to all interfaces:

# Expose Ollama on the LAN
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Important: Ollama has no built-in authentication, so never expose port 11434 to the open internet directly. Put Nginx in front of it for TLS and an API key, so only your apps get through:

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # ssl_certificate ... ssl_certificate_key ...;

    location /v1/ {
        # simple shared-secret gate
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;   # long generations
    }
}
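
Once the proxy is up, it is worth checking that the gate actually closes. A minimal smoke test, assuming the hostname and key from the config above and that the server answers on `/v1/models`; the helper names are ours.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def auth_headers(key=None) -> dict:
    """Build the Authorization header the Nginx gate compares against."""
    return {"Authorization": f"Bearer {key}"} if key else {}


def status_for(url: str, key=None) -> int:
    """Return the HTTP status for a GET, with or without the shared secret."""
    req = Request(url, headers=auth_headers(key))
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code   # 401 is what the gate should send without a key
```

We would expect `status_for("https://ai.yourdomain.com/v1/models")` to return 401 and the same call with `key="YOUR_SECRET_KEY"` to return 200. If the first call succeeds, the gate is not doing its job.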

Step 4 — Call it from any app (it's just the OpenAI SDK)

Because Ollama speaks the OpenAI API format, your application code doesn't care that it's talking to your own box. You change the base_url (along with the API key and model name, which are configuration rather than code) and nothing else. Migrating from a public API, or back, is a configuration change, not a rewrite.

from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourdomain.com/v1",
    api_key="YOUR_SECRET_KEY",
)

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Summarize this ticket in one line: ..."}],
)
print(resp.choices[0].message.content)
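
One way to keep that switch out of the code entirely is to read the three values from the environment. A small sketch; the variable names `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` are hypothetical, and the defaults are just the placeholders used above.

```python
import os


def llm_settings(env=None) -> dict:
    """Resolve endpoint, key, and model from environment variables.

    The variable names are illustrative; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "https://ai.yourdomain.com/v1"),
        "api_key": env.get("LLM_API_KEY", "YOUR_SECRET_KEY"),
        "model": env.get("LLM_MODEL", "qwen2.5"),
    }
```

The client then becomes `OpenAI(base_url=s["base_url"], api_key=s["api_key"])` with `model=s["model"]` on each call, and pointing the app at a different backend means changing the environment, not the code.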

Step 5 — One server, many jobs

The same endpoint serves different tasks with different models; the caller picks one through the model field on each request. Send chat to qwen2.5, screenshots and documents to qwen2.5vl, and use nomic-embed-text to build the vectors behind a RAG search. Ollama swaps models in and out of VRAM as calls come in, so a single machine covers a surprising amount of ground.
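
In application code this usually reduces to a lookup table for the model name plus a similarity search over stored vectors. A minimal sketch: the task labels and helper names are ours, and the vectors in a real system would come from an embeddings call such as `client.embeddings.create(model=pick_model("embed"), input=text)`.

```python
import math

# Task label -> model name; the labels are illustrative.
MODEL_FOR_TASK = {
    "chat": "qwen2.5",
    "vision": "qwen2.5vl",
    "embed": "nomic-embed-text",
}


def pick_model(task: str) -> str:
    """Map a task label to the model that should handle it."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}")


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(query_vec, doc_vecs: dict) -> str:
    """Return the key of the stored vector closest to the query."""
    return max(doc_vecs, key=lambda k: cosine_similarity(query_vec, doc_vecs[k]))
```

For anything beyond a few thousand documents you would swap the linear scan in `top_match` for a vector index, but the shape of the lookup stays the same.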

Where this powers our products

This isn't a lab experiment — it's the shared engine under everything we ship. The reasoning layer in MeBot (our agentic RPA) calls it to decide next actions. Datalytics uses it for natural-language analytics. Olivasal runs meeting summaries on it. Saffron, our telehealth platform, runs clinical models on it. One on-prem inference tier, many products — and no client data ever leaves the building.

The honest tradeoffs

Self-hosting isn't free. You own the ops: uptime, drivers, model updates. Open models are excellent now, but the very hardest reasoning tasks may still favour a frontier API. And you'll have to think about concurrency and batching, which a managed API hides from you. Our rule of thumb: self-host anything touching sensitive data or running at volume; reach for a frontier API only for the rare task that genuinely needs it. For most real work, your own hardware wins on cost, privacy, and control at the same time.
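
On the concurrency point, the simplest client-side guard is to cap how many requests are in flight so a batch job cannot swamp a single GPU. A sketch using an asyncio semaphore; `fake_llm_call` is a stand-in for a real async request, and the limit of 4 is an arbitrary assumption to tune per box.

```python
import asyncio


async def run_bounded(items, worker, max_in_flight: int = 4):
    """Run worker(item) for every item, at most max_in_flight at a time.

    Results come back in the same order as the inputs.
    """
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(guarded(i) for i in items))


async def fake_llm_call(prompt: str) -> str:
    """Placeholder for a real async call to the inference server."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_llm_call, max_in_flight=2))
```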

If you need AI running inside your own walls — because your data can't leave, or the API bill doesn't scale — that's precisely what we build for clients → megamtech.com

Catching Insurance Claim Denials Before You Submit

In revenue-cycle management, the denial is the most expensive message you'll ever receive. Every denied claim is money you already earned — care delivered, work done — now frozen behind rework, appeals, and a filing clock that's ticking down. Some of it comes back after weeks of effort. A painful share never comes back at all, and quietly gets written off.

What makes it worse is when most teams deal with denials: after they happen. The claim goes out, the payer sends it back, and only then does someone open a worklist and start the reactive scramble. It's the most expensive possible point to intervene — you're now paying staff to recover revenue you'd already booked.

Here's the thing we kept noticing across RCM work: denials aren't random. They cluster into a handful of predictable patterns — a missing or expired prior authorization, an eligibility lapse, a diagnosis-to-service code mismatch, a modifier issue, a timely-filing miss, a duplicate. And if a pattern is predictable, it's catchable — before the claim leaves the building, when the fix costs a few minutes instead of a multi-week appeal.

That single shift — from working denials to preventing them — is where the money is. So we've been building it.

Why a generic scrubber doesn't cut it

The hard part is that "what causes a denial" depends entirely on where you're billing.

  • In the US, claims run on X12, and the reasons come back as CARC/RARC codes buried in the remittance.
  • In KSA, it's NPHIES — a FHIR R4 world with its own adjudication outcomes and error structures.
  • In the UAE, it's eClaimLink / DHPO — an XML format with DHA and DOH code sets of their own.

A rule that catches a denial in one market is meaningless in another. And on top of the market differences, every individual payer has its own quirks — the undocumented reasons this insurer rejects that service. A one-size-fits-all checker misses most of what actually matters.
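
One way to picture this is to scope every check by market and, optionally, by payer. The sketch below is illustrative only: the claim fields, `PAYER_X`, and `SVC-123` are hypothetical, and a real claim would first be parsed from X12, FHIR, or XML into some structure like this.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    market: str                      # "US", "KSA", or "UAE"
    payer: Optional[str]             # None = every payer in that market
    check: Callable[[dict], bool]    # True when the claim is at risk
    reason: str


RULES = [
    Rule("missing_prior_auth", "US", None,
         lambda c: bool(c.get("requires_auth")) and not c.get("auth_number"),
         "Prior authorization missing"),
    Rule("payer_x_service_quirk", "UAE", "PAYER_X",      # hypothetical payer quirk
         lambda c: c.get("service_code") == "SVC-123",
         "PAYER_X is known to reject SVC-123"),
]


def flag_claim(claim: dict, rules=RULES) -> list:
    """Return the reasons for every rule that applies to this claim and fires."""
    hits = []
    for rule in rules:
        if rule.market != claim["market"]:
            continue
        if rule.payer is not None and rule.payer != claim.get("payer"):
            continue
        if rule.check(claim):
            hits.append(rule.reason)
    return hits
```

The US prior-auth rule never fires on a KSA claim, and the payer quirk never fires for any other payer, which is the whole point.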

What we're building

The approach we're rolling into our RCM stack has two layers, on purpose:

  1. An explainable rule engine. Market-aware, payer-aware rules that check a claim before submission — eligibility, authorization, coding logic, completeness, timely-filing windows. Rules are the right first layer because they're editable and they tell you exactly why a claim is risky. When a client says "this payer always denies X," that becomes a rule in minutes.
  2. A denial-probability model on top. Trained on the client's own historical claims and remittances, it scores each new claim and surfaces the likely denial reason plus a suggested fix — learning the payer quirks the rules haven't caught yet.
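
To give a feel for the second layer, here is a deliberately simple stand-in: a smoothed historical denial rate per payer and service code. It is not the model described above, just the most basic thing that turns past remittances into a score; the field names and the smoothing prior are assumptions.

```python
from collections import defaultdict


def fit_denial_rates(history, prior_denied: float = 1.0, prior_total: float = 2.0) -> dict:
    """Smoothed denial rate per (payer, service_code) from past claims.

    The prior (1 denial in 2 claims) keeps tiny samples from scoring 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])          # key -> [denied, total]
    for claim in history:
        key = (claim["payer"], claim["service_code"])
        counts[key][0] += int(claim["denied"])
        counts[key][1] += 1
    return {key: (denied + prior_denied) / (total + prior_total)
            for key, (denied, total) in counts.items()}


def score(claim: dict, rates: dict, default: float = 0.5) -> float:
    """Estimated denial probability; unseen combinations get the default."""
    return rates.get((claim["payer"], claim["service_code"]), default)
```

A trained classifier over richer features would replace `fit_denial_rates`, but the contract is the same: history in, a per-claim probability out.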

Everything lands where the client can see it: a live dashboard (built on Datalytics) showing first-pass denial rate, the top denial reasons by payer and service, and — the number that actually matters — how that rate drops over time.
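
The headline numbers on that dashboard are simple to compute once each claim carries its first-pass outcome. A sketch, assuming hypothetical `denied_first_pass` and `reason` fields on each claim record:

```python
from collections import Counter


def first_pass_denial_rate(claims) -> float:
    """Share of claims denied on first submission (0.0 for an empty list)."""
    if not claims:
        return 0.0
    denied = sum(1 for c in claims if c["denied_first_pass"])
    return denied / len(claims)


def top_denial_reasons(claims, n: int = 3) -> list:
    """Most common denial reasons among first-pass denials, as (reason, count)."""
    reasons = Counter(c["reason"] for c in claims if c["denied_first_pass"])
    return reasons.most_common(n)
```

Running both per payer and per month gives the trend line: the rate dropping over time is the evidence that prevention is working.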

The whole philosophy fits on a sticky note: the cheapest denial is the one that never happens. Prevention beats appeal every time, because prevention costs minutes and appeals cost weeks.

We're putting this to work in RCM across the UAE, KSA, and US markets. If denials are eating into your margin and your team is stuck working them after the fact, that's exactly the problem we're solving — let's talk → megamtech.com