Spreading Knol: Tutorial

Wednesday, July 1, 2026

Give Your LLM Hands: Turn Browser Automation Into MCP Tools

LLMs are brilliant at deciding what to do and useless at actually doing it. Automation tools are the opposite — they execute flawlessly but can't reason about anything. The obvious move is to bolt the two together: let the model plan, and let deterministic automation carry out each step. The question is how you connect them cleanly.

The answer that's become a standard is MCP — the Model Context Protocol. It's a common language for exposing "tools" to a model. Wrap your automation actions as MCP tools, point a reasoning model at them, and the model can drive real software. This is the exact pattern under MeBot's agentic RPA. Let's build a small version of it.

Step 1 — Install the pieces

pip install "mcp[cli]" playwright
playwright install chromium

Step 2 — Stand up an MCP server

FastMCP (bundled with the MCP SDK) turns plain Python functions into tools. Any function you decorate becomes something a model can call — the docstring and type hints are the tool's description, so the model knows when to use it.

from mcp.server.fastmcp import FastMCP
from playwright.async_api import Page, async_playwright

mcp = FastMCP("mebot-browser")
_page = None  # created on first use, inside the server's event loop

async def get_page() -> Page:
    """Launch Chromium once; async API because tools run in an asyncio loop."""
    global _page
    if _page is None:
        pw = await async_playwright().start()
        browser = await pw.chromium.launch(headless=False)
        _page = await browser.new_page()
    return _page

Step 3 — Expose automation actions as tools

Here's the key design decision: the tools themselves stay strictly deterministic. Each one does exactly one concrete thing, with no cleverness and no improvising. You never want an LLM "creatively interpreting" a click on a banking screen. The intelligence lives in the model's choice of which tool to call; the execution is boringly reliable.

@mcp.tool()
async def open_url(url: str) -> str:
    """Navigate the browser to a URL."""
    page = await get_page()
    await page.goto(url)
    return f"Opened {url}"

@mcp.tool()
async def click(text: str) -> str:
    """Click the first visible element matching the given text."""
    page = await get_page()
    await page.get_by_text(text, exact=False).first.click()
    return f"Clicked '{text}'"

@mcp.tool()
async def type_text(label: str, value: str) -> str:
    """Type text into the field with the given label."""
    page = await get_page()
    await page.get_by_label(label).fill(value)
    return f"Typed into '{label}'"

@mcp.tool()
async def read_screen() -> str:
    """Return the visible text on the current page."""
    page = await get_page()
    return await page.inner_text("body")

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
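
To make the "docstring and type hints are the tool's description" idea concrete, here is a rough standard-library sketch of how a function's signature can be turned into something a model can read. It is only an illustration: describe_tool is a hypothetical helper, and the schema FastMCP really generates is richer than this.

```python
import inspect
from typing import get_type_hints


def describe_tool(fn) -> dict:
    """Build a simple tool description from a function's signature and docstring."""
    hints = get_type_hints(fn)
    # Map each parameter name to the name of its annotated type (default: str).
    params = {
        name: hints.get(name, str).__name__
        for name in inspect.signature(fn).parameters
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }


def type_text(label: str, value: str) -> str:
    """Type text into the field with the given label."""
    return f"Typed into '{label}'"


print(describe_tool(type_text))
```

The takeaway is practical: a vague docstring or a missing type hint gives the model less to go on when it decides which tool to call.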

Step 4 — Let a reasoning model drive

Now connect a model as an MCP client. It requests the tool list, then works toward a goal by calling tools and reading results — a loop of observe → decide → act. Give it an objective like:

Goal: "Log into the portal and download this month's invoice."

The model plans and calls tools in sequence:
  open_url("https://portal.example.com")
  read_screen()                      -> sees a login form
  type_text("Username", "...")
  type_text("Password", "...")
  click("Sign in")
  read_screen()                      -> sees the dashboard
  click("Invoices")
  ...

Crucially, the model reads the screen between steps. When a button moves or a field is renamed, it adapts — because it's reasoning about what's actually there, not replaying recorded coordinates. That's the difference between a bot that breaks on the first UI change and one that copes.
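
The observe → decide → act loop can be sketched without any model or browser at all. Everything below is a stand-in: fake_read_screen and fake_click play the role of MCP tool calls, and decide is a hard-coded placeholder for the reasoning model. It is a toy to show the control flow, not how MeBot or any MCP client is actually written.

```python
from typing import Optional, Tuple


def fake_read_screen(state: dict) -> str:
    """Stand-in for the read_screen tool."""
    return state["screen"]


def fake_click(state: dict, text: str) -> str:
    """Stand-in for the click tool: signing in moves us to the dashboard."""
    if text == "Sign in":
        state["screen"] = "Dashboard: Invoices"
    return f"Clicked '{text}'"


def decide(goal: str, observation: str) -> Optional[Tuple[str, dict]]:
    """Placeholder for the model: pick the next tool call from what is on screen."""
    if "Sign in" in observation:
        return ("click", {"text": "Sign in"})
    return None  # nothing left to do for this toy goal


def run_agent(goal: str, state: dict, max_steps: int = 10) -> list:
    tools = {"click": fake_click}
    log = []
    for _ in range(max_steps):                    # hard cap so the loop always ends
        observation = fake_read_screen(state)     # observe
        action = decide(goal, observation)        # decide
        if action is None:
            break
        name, args = action
        log.append(tools[name](state, **args))    # act
    return log


state = {"screen": "Login form: Username, Password, Sign in"}
print(run_agent("Log into the portal", state))
```

The step cap matters in practice: a model that keeps choosing the same failing action should run out of budget rather than loop forever.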

Step 5 — Guardrails

Autonomy without brakes is a liability, so real deployments add two things: an approvals gate for irreversible or high-risk actions (payments, deletions, submissions) that pauses for a human, and vision — OCR and UI detection — for legacy screens where there's no clean text or label to target. Together they make the agent safe to point at production systems, including the old Windows apps that classic RPA struggles with most.
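
An approvals gate can be as small as a decorator that asks before it acts. The sketch below is a minimal version under assumptions: requires_approval, ApprovalDenied, console_approver, and submit_payment are all hypothetical names, and a real deployment would route the question to an inbox or chat rather than the console.

```python
import functools
from typing import Callable


class ApprovalDenied(Exception):
    """Raised when a reviewer rejects a high-risk action."""


def requires_approval(approve: Callable[[str, dict], bool]):
    """Wrap a tool so it only runs after approve(name, kwargs) returns True."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            if not approve(fn.__name__, kwargs):
                raise ApprovalDenied(f"{fn.__name__} was not approved")
            return fn(**kwargs)
        return wrapper
    return decorator


def console_approver(name: str, kwargs: dict) -> bool:
    """Simplest possible human gate: ask on the console."""
    answer = input(f"Allow {name}({kwargs})? [y/N] ")
    return answer.strip().lower() == "y"


def submit_payment(amount: float) -> str:
    """Stand-in for an irreversible action."""
    return f"Paid {amount}"
```

Wiring it up is one line, for example gated = requires_approval(console_approver)(submit_payment). Read-only tools like read_screen stay ungated, so the model can still observe freely.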

Deterministic execution, reasoning on top

That's the whole philosophy in one line: keep the doing precise, put the thinking in a layer above it. Wrapping actions as MCP tools is what lets a model orchestrate real work without you hard-coding every branch — and it's exactly how MeBot turns brittle scripts into automation that adapts.

If you'd rather deploy this than assemble it — with the vision layer, approvals inbox, and legacy-app support already built — that's what we ship → takemebot.com

Embed Analytics Into Your Product in an Afternoon (Without Building a BI Team)

Sooner or later, every SaaS product hears the same request from customers: "Can we get reporting on our own data, inside the app?" And every engineering team gives the same sigh, because building analytics from scratch — a query layer, a charting engine, per-tenant security, a dashboard builder — is months of work that has nothing to do with your actual product.

That's the whole reason Datalytics ships as an embeddable SDK. Instead of building BI, you drop a component into your app and pass it a secure token. Here's the full integration, end to end.

Step 1 — Drop in the component

Datalytics dashboards are web components (built on Angular Elements), which means they run in any frontend — React, Vue, Angular, or plain HTML. Add the script and place the element:

<script src="https://cdn.getdatalytics.com/embed.js"></script>

<datalytics-dashboard
    dashboard-id="sales-overview"
    theme="light">
</datalytics-dashboard>

At this point the component exists but shows nothing — because it doesn't yet know who's asking or what they're allowed to see. That's the important part.

Step 2 — Mint a signed token on your backend

Embedded analytics lives or dies on security: customer A must never see customer B's data. The clean way to enforce this is a two-token handshake. Token one is an identity assertion your own backend signs — you already know who the logged-in user is and which tenant they belong to, so you vouch for them.

Your server signs a short-lived RS256 JWT with your private key. Datalytics verifies it with your public key — so the browser never holds any long-lived secret:

// Node / NestJS backend
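import jwt from 'jsonwebtoken';

// Assumption: the PEM-encoded RSA private key comes from an environment
// variable; the variable name here is only an example.
const PRIVATE_KEY = process.env.DATALYTICS_SIGNING_KEY;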

function mintDatalyticsToken(user) {
  return jwt.sign(
    {
      sub: user.id,
      tenant: user.tenantId,          // scopes all queries to this tenant
      scope: ['dashboard:sales-overview'],
    },
    PRIVATE_KEY,                       // your RSA private key
    {
      algorithm: 'RS256',
      expiresIn: '10m',                // short-lived on purpose
      audience: 'datalytics',
      issuer: 'your-app',
    },
  );
}

Step 3 — The two-token exchange

Your frontend fetches that identity token and hands it to the component. Datalytics verifies the signature, reads the tenant and scope, and exchanges it for its own short-lived session token that governs the live dashboard connection. Two tokens, two jobs: yours proves identity, theirs runs the session.

const el = document.querySelector('datalytics-dashboard');

// fetch the signed identity token from YOUR backend
const res = await fetch('/api/analytics-token');
if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
const { token } = await res.json();

// hand it over — the component does the exchange internally
el.token = token;

The dashboard renders, and every query it runs is automatically filtered to that tenant. Row-level security is enforced by the signed tenant claim — not by anything editable in the browser.
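
To picture what "reads the tenant and scope" means on the receiving side, here is a small Python sketch of the checks a verifier might run once the signature has already been validated by a JWT library. This is not Datalytics' actual implementation: authorize_dashboard is a hypothetical function, and the claim names simply mirror the token minted in Step 2 (aud and exp are what the audience and expiresIn options produce).

```python
import time
from typing import Optional


def authorize_dashboard(claims: dict, dashboard_id: str,
                        now: Optional[float] = None) -> dict:
    """Check already-verified claims and return the tenant filter to apply."""
    now = time.time() if now is None else now
    if claims.get("aud") != "datalytics":
        raise PermissionError("wrong audience")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    if f"dashboard:{dashboard_id}" not in claims.get("scope", []):
        raise PermissionError("dashboard not in scope")
    tenant = claims.get("tenant")
    if not tenant:
        raise PermissionError("missing tenant claim")
    # Every query for this session gets this filter; the browser never sets it.
    return {"tenant_id": tenant}
```

The point of the design is visible in the last line: the tenant filter is derived from a signed claim, so nothing the browser sends can widen it.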

Step 4 — Make it look like your app

An embedded dashboard should feel native, not bolted-on. Theme it with CSS custom properties so it inherits your product's look:

datalytics-dashboard {
  --dl-accent: #0e9384;
  --dl-font: 'Inter', sans-serif;
  --dl-radius: 10px;
}

What you just avoided

No query engine, no charting library, no dashboard builder, no per-tenant security model — and no BI team to maintain any of it. Your users get live, self-serve analytics inside your product; you shipped it in an afternoon and moved on with your roadmap.

If you've been putting off "in-app reporting" because building it is a project of its own, this is the shortcut → getdatalytics.com

How We Run Our Own LLMs on Our Own Hardware (and Why We Don't Just Call the Big APIs)

When you need an LLM, the default move in 2026 is to grab an API key from one of the big providers and start calling. For a lot of what we build at Megam, that's the wrong default. Our automation reads client back-office data. Our telehealth platform touches patient records. Our conferencing tool handles private calls. None of that can be shipped off to someone else's cloud — and honestly, even where it could, the economics and the loss of control often don't make sense.

So we run our own inference. Here's how we stand it up, and where the line actually falls between self-hosting and calling an API.

Why self-host at all

Data residency & privacy. The data never leaves infrastructure we control. For regulated clients — healthcare, finance, GCC data-residency rules — this isn't a preference, it's the requirement.
Cost at volume. Per-token pricing is cheap for a demo and brutal at scale. A fixed box you've already paid for doesn't meter you by the request.
Control. The model doesn't change under you overnight, doesn't deprecate, doesn't rate-limit you mid-batch. You pin the version and it stays put.
Offline capability. Air-gapped and on-prem deployments simply can't reach a public API. Local is the only option.
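
The cost argument is easy to check for your own numbers. The sketch below is plain arithmetic with a hypothetical helper, breakeven_months; the figures in the example call are placeholders chosen for round numbers, not real prices or measurements.

```python
def breakeven_months(hardware_cost, monthly_running_cost,
                     tokens_per_month, api_price_per_million):
    """Months until a self-hosted box pays for itself, or None if it never does."""
    api_monthly = tokens_per_month / 1_000_000 * api_price_per_million
    saving = api_monthly - monthly_running_cost
    if saving <= 0:
        return None  # at this volume the API is cheaper than running the box
    return hardware_cost / saving


# Placeholder inputs: a 6000 box, 200/month to run, 500M tokens/month, 2.00 per million tokens
print(breakeven_months(6000, 200, 500_000_000, 2.0))
```

Plug in a demo-sized volume and the function returns None, which is the "cheap for a demo, brutal at scale" point in numbers.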

Step 1 — Pick the box, size the model to the VRAM

The single constraint that matters is GPU memory. Rough working guide, using 4-bit quantized models:

~7-8B   model  ->  ~6 GB VRAM   (fits almost anything)
~14B    model  ->  ~10-12 GB
~27-32B model  ->  ~20-24 GB    (a 32 GB card handles this comfortably)
~70B    model  ->  ~40 GB+      (big card or multi-GPU)

A modern 24-32 GB card covers the vast majority of real workloads. You don't need frontier-scale hardware to get genuinely useful results.
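
The arithmetic behind that guide is simple enough to sketch. The weights alone take roughly parameters × bits per weight ÷ 8 bytes; the figures in the table sit above that because the KV cache and runtime buffers need room too. The 1.3 headroom factor below is an assumption for illustration, not a measured value, and both function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float,
         bits_per_weight: float = 4.0, headroom: float = 1.3) -> bool:
    """True if the weights plus an assumed headroom factor fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) * headroom <= vram_gb


for size in (8, 14, 32, 70):
    print(f"{size}B at 4-bit: ~{weights_gb(size):.1f} GB of weights")
```

Long context windows push the real requirement higher, so treat the result as a floor rather than a promise.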

Step 2 — Install Ollama and pull your models

Ollama is the simplest way to serve open models. One server can host several — chat, vision, and embeddings — and load them on demand.

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a few models for different jobs
ollama pull qwen2.5          # general chat / reasoning
ollama pull qwen2.5vl        # vision (images, screenshots, documents)
ollama pull nomic-embed-text # embeddings for RAG / search

Step 3 — Serve it on the network with an OpenAI-compatible API

By default Ollama listens only on localhost. To let your apps reach it, bind it to all interfaces:

# Expose Ollama on the LAN
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Important: Ollama has no built-in authentication, so never expose port 11434 to the open internet directly, and firewall it on the LAN so only trusted hosts can reach it. Put Nginx in front of it for TLS and an API key, so only your apps get through:

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    # ssl_certificate ... ssl_certificate_key ...;

    location /v1/ {
        # simple shared-secret gate
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;   # long generations
    }
}

Step 4 — Call it from any app (it's just the OpenAI SDK)

Because Ollama speaks the OpenAI API format, your application code doesn't care that it's talking to your own box. You change the base_url, the API key, and the model name; the rest of the code stays the same. Migrating from a public API, or back, is a few lines of configuration.

from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourdomain.com/v1",
    api_key="YOUR_SECRET_KEY",
)

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Summarize this ticket in one line: ..."}],
)
print(resp.choices[0].message.content)

Step 5 — One server, many jobs

The same endpoint serves different tasks with different models; the caller picks one through the model field of each request. Send chat to qwen2.5, screenshots and documents to qwen2.5vl, and use nomic-embed-text to build the vectors behind a RAG search. Ollama swaps models in and out of VRAM as calls come in, so a single machine covers a surprising amount of ground.
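
On the application side, this usually boils down to a small lookup from task to model name. A minimal sketch, assuming the three models pulled in Step 2; MODEL_FOR_TASK, pick_model, and chat_request are hypothetical names, not part of Ollama or the OpenAI SDK.

```python
MODEL_FOR_TASK = {
    "chat": "qwen2.5",
    "vision": "qwen2.5vl",
    "embed": "nomic-embed-text",
}


def pick_model(task: str) -> str:
    """Return the model name for a task, failing loudly on unknown tasks."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


def chat_request(task: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": pick_model(task),
        "messages": [{"role": "user", "content": prompt}],
    }


print(chat_request("chat", "Summarize this ticket in one line: ..."))
```

With the client from Step 4, the call becomes client.chat.completions.create(**chat_request("chat", "...")). Keeping the mapping in one place means swapping a model is a one-line change.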

Where this powers our products

This isn't a lab experiment — it's the shared engine under everything we ship. The reasoning layer in MeBot (our agentic RPA) calls it to decide next actions. Datalytics uses it for natural-language analytics. Olivasal runs meeting summaries on it. Saffron, our telehealth platform, runs clinical models on it. One on-prem inference tier, many products — and no client data ever leaves the building.

The honest tradeoffs

Self-hosting isn't free. You own the ops: uptime, drivers, model updates. Open models are excellent now, but the very hardest reasoning tasks may still favour a frontier API. And you'll have to think about concurrency and batching that a managed API hides from you. Our rule of thumb: self-host anything touching sensitive data or running at volume; reach for a frontier API only for the rare task that genuinely needs it. For most real work, your own hardware wins on cost, privacy, and control at the same time.

If you need AI running inside your own walls — because your data can't leave, or the API bill doesn't scale — that's precisely what we build for clients → megamtech.com

Build a Self-Hosted Meeting-Summary Pipeline (No Cloud, No Data Leaving Your Box)

Every AI notetaker on the market has the same catch: to summarize your meeting, it uploads your audio to its cloud. For a regulated team — healthcare, finance, anyone under data-residency rules — that's a non-starter. Your private calls quietly become someone else's data.

This is exactly why we built the summary engine inside Olivasal to run entirely on infrastructure you control. In this post I'll show you the core of it: take a recorded meeting, produce a clean transcript, and generate a structured summary with action items — all on your own box, nothing leaving the network.

What you'll build

A three-stage pipeline: recording → transcript → AI summary. Stage 1 uses your meeting recording (from LiveKit egress, or any file). Stage 2 transcribes with faster-whisper. Stage 3 summarizes with a local Qwen model served by Ollama. No external API is ever called.

Prerequisites

# System: ffmpeg for audio, plus Python 3.10+
# A GPU makes transcription much faster, but CPU works too

pip install faster-whisper openai

# Ollama for the local LLM (https://ollama.com)
ollama pull qwen2.5

Step 1 — Get the audio

If you're running a conference stack, LiveKit's egress writes a recording when the call ends. Composite egress gives you one mixed file; per-track egress gives you one file per participant (we'll use that later for speaker labels). Either way, extract a clean 16 kHz mono WAV — the format Whisper likes best:

ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

Step 2 — Transcribe locally with faster-whisper

faster-whisper is a reimplementation of Whisper that's several times quicker and lighter on memory. The vad_filter flag trims silence so you don't waste compute on dead air.

from faster_whisper import WhisperModel

# use device="cpu", compute_type="int8" if you have no GPU
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    vad_filter=True,
)

transcript = " ".join(seg.text.strip() for seg in segments)
print(f"Detected language: {info.language}")
print(transcript)

That's your full transcript, produced without a single byte leaving the machine.

Step 3 — Summarize with a local LLM

Ollama exposes an OpenAI-compatible endpoint, so you can use the standard openai client and just point it at localhost. We ask for structured JSON so the output is easy to store and render.

from openai import OpenAI
import json

# api_key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

prompt = f"""You are a meeting assistant. From the transcript below, return JSON with:
- "summary": a 3-sentence overview
- "decisions": a list of key decisions made
- "action_items": a list of objects with "task" and "owner"

Transcript:
{transcript}"""

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.2,
)

notes = json.loads(resp.choices[0].message.content)
print(json.dumps(notes, indent=2))

You now have clean, structured meeting notes — summary, decisions, and owner-tagged action items — generated entirely on your own hardware.
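
JSON mode makes the output parseable, but it doesn't guarantee the model returned every key in the shape you asked for. Before storing the notes, it can help to normalize them. The function below is a hypothetical sketch of that step; the "unassigned" fallback owner is an assumption, so pick whatever default suits your app.

```python
def normalize_notes(raw: dict) -> dict:
    """Coerce model output into the expected shape, dropping malformed items."""
    summary = raw.get("summary")
    decisions = raw.get("decisions")
    items = raw.get("action_items")

    clean_items = []
    for item in items if isinstance(items, list) else []:
        # Keep only objects that actually carry a task.
        if isinstance(item, dict) and item.get("task"):
            clean_items.append({
                "task": str(item["task"]),
                "owner": str(item.get("owner") or "unassigned"),
            })

    return {
        "summary": summary if isinstance(summary, str) else "",
        "decisions": [str(d) for d in decisions] if isinstance(decisions, list) else [],
        "action_items": clean_items,
    }
```

In the pipeline above you would call it right after parsing: notes = normalize_notes(json.loads(...)).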

Step 4 — Add speaker labels (optional)

This is where per-track egress earns its keep. Instead of one mixed file, you get a separate audio track per participant. Transcribe each track on its own, tag every segment with that participant's identity, then merge all segments back together in timestamp order. The result is a speaker-attributed transcript — "Priya: … / Arun: …" — which makes the LLM's action-item ownership far more accurate, because it can see who actually committed to what.
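
The merge step is a sort by start time. Here is a minimal sketch, assuming you've already transcribed each track and collected (start, end, text) tuples from the segments' start, end, and text fields. It also assumes all timestamps are relative to the same meeting start; if participants joined late and their files start at different offsets, you'd need to add each track's offset first. merge_tracks is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def merge_tracks(tracks: Dict[str, List[Segment]]) -> str:
    """Interleave per-speaker segments into one transcript ordered by start time."""
    labelled = [
        (start, speaker, text.strip())
        for speaker, segments in tracks.items()
        for start, _end, text in segments
    ]
    labelled.sort(key=lambda row: row[0])
    return "\n".join(f"{speaker}: {text}" for _start, speaker, text in labelled)


tracks = {
    "Priya": [(0.0, 2.0, " I'll send the deck. "), (6.0, 7.0, "Thanks.")],
    "Arun": [(3.0, 5.0, "I'll book the room.")],
}
print(merge_tracks(tracks))
```

Feed the merged text into the Step 3 prompt in place of the plain transcript and the model can tie each commitment to a name.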

Why self-hosted matters here

The convenience of AI summaries usually comes at the cost of handing your conversations to a vendor. This pipeline gives you the convenience and keeps every stage — media, transcript, and model — inside your own walls. For regulated teams, that's not a nice-to-have; it's the whole requirement.

This is the engine behind Olivasal, our self-hosted video conferencing with built-in AI summaries — and the same core powers the clinical scribe in our telehealth platform, Saffron. If you'd rather have this running out of the box than wire it up yourself, that's exactly what we built → olivasal.com
