# mpv AI Companion
A floating panel that captures frames from mpv and sends them to vision LLMs. Ask questions about what you're watching, with conversation memory.
## The problem
I watch a lot of foreign films and obscure stuff. Sometimes I want to know what a character just said, what a sign in the background reads, or what's going on in a visually dense scene. I already run Ollama with vision models locally. Why not just point them at the current frame?
The idea is simple. You're watching something in mpv. You type a question. The tool captures the exact frame on screen, sends it to a vision model along with your question, and shows the response in a side panel. It remembers your conversation, so follow-up questions work without resending the image.
## How it works
mpv exposes a JSON IPC interface over a Unix socket (or a named pipe on Windows). You start mpv with `--input-ipc-server=/tmp/mpvsocket`, and any process can then connect and send commands. The key command is `screenshot-to-file`, which grabs the current video frame and writes it to disk.
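On the wire, each command is one newline-terminated JSON object carrying a `request_id`. A minimal sketch of building that payload (the `make_command` helper is my illustration, not from the project):

```python
import json

def make_command(req_id: int, *args) -> bytes:
    # mpv's IPC protocol: one JSON object per line, newline-terminated.
    # The "request_id" lets the client match the eventual reply.
    payload = {"command": list(args), "request_id": req_id}
    return (json.dumps(payload) + "\n").encode("utf-8")

msg = make_command(1, "screenshot-to-file", "/tmp/frame.png", "video")
```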
Here's the IPC class that handles the socket communication:
```python
import socket
import threading

class MpvIPC:
    """JSON IPC bridge to a running mpv instance."""

    def __init__(self, path: str):
        self.path = path
        self._sock = None
        self._lock = threading.Lock()
        self._req_id = 0

    def connect(self):
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.connect(self.path)
        self._sock.settimeout(5.0)

    def screenshot(self, path: str) -> bool:
        result = self._send(["screenshot-to-file", path, "video"])
        return result.get("error") == "success"
```
Every command gets a unique request_id. mpv sends back events and replies on the same socket, so the _send method has to loop through incoming lines and match the right reply. This was one of the trickier parts. mpv is chatty. It fires property-change events, seek events, all kinds of stuff. You have to filter for your specific response by ID.
## Frame capture details
When you ask a question, the panel grabs the current playback position with `get_property time-pos`, then calls `screenshot-to-file` with the `"video"` flag. That flag means "just the video, no subtitles or OSD." The frame goes to a temp file as a PNG.
```python
# Capture frame at current position
t = mpv.get_time_pos()
path = os.path.join(tmp_dir, f"mpv_comp_{int(t * 1000)}.png")
if mpv.screenshot(path):
    _downscale_image(path)  # cap at 720px wide
    image_paths.append(path)
```
The GUI version downscales the image to 720px wide before sending. Vision models don't need 4K frames, and smaller images mean faster inference. The downscaling uses Qt's built-in `QImage.scaledToWidth`, so there's no PIL dependency.
After the response comes back, the temp file gets deleted. No frame data accumulates on disk.
## Conversation memory
The model remembers earlier questions within a session. But there's a catch with vision models: sending images in every message gets expensive and slow. So I only send the current frame image with the current question. Previous turns are text-only.
```python
if not s["history"]:
    prompt = (
        f"[System: {SYSTEM_PROMPT}]\n\n"
        f"Film: {s['media_title']}\n"
        f"Timestamp: {ts_str}\n\n"
        f"{self.user_input}"
    )
else:
    prompt = f"[{ts_str}] {self.user_input}"
```
The first question includes the system prompt, film title, and timestamp. Follow-ups just get a timestamp prefix. History is capped at 20 turns (40 messages) so the context window doesn't blow up during a long movie session.
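Putting the two rules together, the message assembly can be sketched like this (`build_messages`, `MAX_TURNS`, and the exact message shape are my illustration, not the project's code):

```python
MAX_TURNS = 20  # 20 question/answer pairs = 40 messages

def build_messages(history, prompt, image_paths):
    """History stays text-only; images ride along only with the
    current question, keeping requests fast and token counts low."""
    msgs = [{"role": role, "content": text}
            for role, text in history[-MAX_TURNS * 2:]]
    current = {"role": "user", "content": prompt}
    if image_paths:
        current["images"] = list(image_paths)  # only the live frame
    msgs.append(current)
    return msgs
```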
## Multi-provider support
It started as Ollama-only. Then I added Gemini, OpenAI, and Anthropic because sometimes you want a better model for a hard question, or you want to compare answers. Each provider has its own client class, but they all share the same interface: `query(prompt, image_paths, history)`.
| Provider | Default Model | Env Variable |
|---|---|---|
| Ollama (local) | granite3.2-vision | None needed |
| Google Gemini | gemini-2.5-flash | GEMINI_API_KEY |
| OpenAI | gpt-4.1-mini | OPENAI_API_KEY |
| Anthropic | claude-sonnet-4-6 | ANTHROPIC_API_KEY |
A factory function picks the right client:
```python
def create_client(provider: str, model: str = "", **kwargs):
    if provider == "gemini":
        return GeminiClient(model or "gemini-2.5-flash")
    elif provider == "openai":
        return OpenAIClient(model or "gpt-4.1-mini")
    elif provider == "anthropic":
        return AnthropicClient(model or "claude-sonnet-4-6")
    else:
        return OllamaClient(model or DEFAULT_MODEL, **kwargs)
```
Each cloud provider handles images differently. Gemini wants `inline_data` with a mime type. OpenAI wants `image_url` with a data URI. Anthropic wants a `source` object with base64. The client classes hide all of that. Same thing with the history format: Gemini uses `"model"` for assistant turns while everyone else uses `"assistant"`.
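Condensed, the three image shapes look roughly like this (field names per each vendor's API as I understand them; verify against current docs before relying on them):

```python
import base64

def image_part(provider: str, png_bytes: bytes) -> dict:
    """Build the provider-specific image part for one PNG frame."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    if provider == "gemini":
        return {"inline_data": {"mime_type": "image/png", "data": b64}}
    if provider == "openai":
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}
    if provider == "anthropic":
        return {"type": "image",
                "source": {"type": "base64",
                           "media_type": "image/png",
                           "data": b64}}
    raise ValueError(f"unknown provider: {provider}")
```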
## The floating panel
The GUI is a frameless, translucent PyQt6 window that snaps to the right edge of mpv. Every 100ms, a timer finds the mpv window position using platform APIs and repositions the panel.
```python
def _snap_to_mpv(self):
    rect = get_mpv_window_rect()
    if rect:
        x, y, w, h = rect
        pw = COLLAPSED_WIDTH if self.collapsed else PANEL_WIDTH
        self.move(x + w, y)
        self.resize(pw, h)
```
On macOS, window detection uses `CGWindowListCopyWindowInfo` from the Quartz framework. On Windows, it enumerates windows with `user32.EnumWindows` and filters by title. Both paths are handled in one function, with no external dependencies beyond pyobjc-framework-Quartz on macOS.
The panel collapses to a 36px strip when you close it, so it stays accessible without covering the video. LLM queries run on a QThread to keep the UI responsive, with a pulsing opacity animation on the status text while waiting.
## CLI mode
There's also a terminal-only version for people who don't want a GUI. It uses pynput for a global hotkey (Ctrl+Space) that pre-captures a frame while you're still watching. Then you type your question and hit Enter.
The CLI uses Rich for formatted output. The hotkey listener runs in a daemon thread so it doesn't block the input loop. If you don't press Ctrl+Space before asking, it just grabs the live frame when you hit Enter.
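Stripped of the pynput wiring, the pre-capture fallback is just a one-slot buffer that the hotkey thread fills and the input loop drains (a sketch; `FrameSlot` and `capture_live` are illustrative names, not the project's):

```python
class FrameSlot:
    """Holds at most one pre-captured frame path. The hotkey handler
    fills it; the input loop takes it, falling back to a live grab."""

    def __init__(self):
        self._path = None

    def fill(self, path: str):
        self._path = path  # overwrite any stale pre-capture

    def take(self, capture_live):
        path, self._path = self._path, None
        return path if path is not None else capture_live()
```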
## Things I learned
### mpv's IPC is noisy
The socket sends unsolicited events constantly. Property changes, seek notifications, file loads. The _send method has to scan every line for a matching request_id and ignore everything else. On Windows, named pipes don't support timeouts on readline(), so I had to wrap reads in a thread with a deadline.
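The thread-with-a-deadline trick reduces to this pattern (a generic sketch; the real code reads from the named pipe handle rather than an arbitrary callable):

```python
import threading

def read_with_deadline(readline, timeout: float):
    """Run a blocking readline() in a worker thread and give up after
    `timeout` seconds; returns None if the deadline passes first."""
    box = []
    t = threading.Thread(target=lambda: box.append(readline()), daemon=True)
    t.start()
    t.join(timeout)
    return box[0] if box else None
```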
### Image size matters more than quality
Sending a 1920x1080 PNG to a local vision model is noticeably slower than sending 720px wide. The model doesn't need pixel-perfect resolution to tell you what's on screen. Downscaling before encoding is worth it.
### Text-only history works surprisingly well
I expected follow-up questions to fail without re-sending the image. They mostly don't. The model remembers what it described earlier and can answer based on that context. This keeps requests fast and token counts low.
## Running it
```sh
# Start mpv with IPC enabled
mpv --input-ipc-server=/tmp/mpvsocket your_film.mkv

# In another terminal, start the panel
python panel.py

# Or use CLI mode (Ollama only)
python companion.py --model gemma3:4b
```
The whole thing is three files. `core.py` has the IPC bridge and all four LLM clients. `panel.py` is the PyQt6 GUI. `companion.py` is the CLI version. No config files, no database, no build step.
Source is on GitHub.