Vision|Pipe

Give your LLM eyes.

screenshot | llm — now a reality.

Vision|Pipe is a lightweight open source utility that captures your screen and pipes it, along with your voice, text, or drawn annotations and rich contextual metadata, directly into any LLM.

Built for developers who think in pipes.

Download for Mac

Windows support coming soon

View on GitHub →

The Loop You’re Stuck In

You’re working with an AI and need to show it what’s on your screen. You describe it in words. It misunderstands. You describe again. Repeat.

Every time you type “the button in the top right of the modal” instead of just pointing at it, you’re paying a productivity tax that compounds across every debugging session, every code review, every UI bug report you file.

The gap between what you see and what your AI understands is costing you hours.

Vision|Pipe Skips the Description

Capture the screen. Annotate however feels natural — speak it, type it, or draw it. Paste the full context — image, annotation, and metadata — into your LLM in one action.

No uploads. No integrations. No UI sprawl.

Just the Unix philosophy applied to AI vision: do one thing, do it well, compose it with everything else.

Every other tool

Capture

Upload image

Switch to LLM

Type context

Submit

Vision|Pipe

Capture + Comment

Paste

Submit

Five Steps. One Keystroke.

1

Press your hotkey


One keystroke activates the capture overlay. No menus, no clicks. Default is Cmd+Shift+C (Mac) / Ctrl+Shift+C (Windows) — configurable to whatever you prefer.

2

Select a region

Drag to capture any area of your screen. Full screen or surgical precision.

3

Annotate your intent

Speak it, type it, or draw it. Voice, text, and markup — all at the moment of capture.

4

Hit Enter

Screenshot + annotation + rich metadata are bundled into one clipboard payload.

5

Paste into any LLM

GPT-4, Claude, Gemini, Codex — any AI that accepts images. Your LLM gets it right on the first try.

Captures What You Mean,
Not Just What You See

Every other screenshot tool captures pixels. Vision|Pipe captures intent.

Speak It

Record a voice note alongside your screenshot. Vision|Pipe transcribes it automatically using on-device Whisper and bundles the transcript with the image.

"This dropdown is rendering below the viewport on Safari — why?"

Type It

Add a written comment at the exact moment of capture. Your intent travels with the image as a single payload.

Why is this button misaligned in dark mode?

Draw It

Circle the problem. Highlight the element. Draw an arrow. A lightweight markup layer so your LLM knows exactly what to focus on.

All three, combined. Voice, text, and drawing can be used simultaneously. The full context is bundled into one clipboard payload. Paste once — your AI has everything.

Your LLM Gets the Full Picture

Vision|Pipe doesn’t just send a screenshot. It sends the complete context of where, and from what, the image was captured — automatically appended to every clipboard payload.

Spatial & Display

Capture region: x: 240, y: 180
Capture dimensions: 1200 × 800 px
Screen resolution: 2560 × 1600
DPI / scale factor: 2x (Retina)
Monitor: LG UltraFine 5K
Color profile: Display P3

Window & Application

Active application: Visual Studio Code
Window title: visionpipe — README.md
Window size: 1440 × 900
Window state: Windowed
Process ID: 4821

Browser Context

Browser: Chrome 124.0
Active tab URL: github.com/visionpipe
Page title: Vision|Pipe — GitHub
Viewport: 1440 × 789

System

Timestamp: 2026-04-11T14:32:01Z
Operating system: macOS 15.3.2
Hostname: colins-macbook-pro
Screen count: 2

Captured via macOS Accessibility API and Windows UI Automation. No browser extension required.
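As a rough illustration, the bundled context might resemble the sketch below. The field names and structure are hypothetical, not Vision|Pipe's actual schema; the values are taken from the examples above, and in practice the screenshot itself would travel alongside this as image data on the clipboard.

```python
import json

# Hypothetical sketch of a Vision|Pipe clipboard payload.
# Field names are illustrative; the real schema may differ.
payload = {
    "annotation": {
        "voice_transcript": "Why is this button misaligned in dark mode?",
        "text": None,
        "markup": "arrow pointing at submit button",
    },
    "metadata": {
        "capture_region": {"x": 240, "y": 180, "width": 1200, "height": 800},
        "display": {"resolution": "2560x1600", "scale": 2, "color_profile": "Display P3"},
        "window": {"app": "Visual Studio Code", "title": "visionpipe — README.md"},
        "system": {"os": "macOS 15.3.2", "timestamp": "2026-04-11T14:32:01Z"},
    },
}

# Serialized once, pasted once — the LLM sees image and context together.
print(json.dumps(payload, indent=2))
```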

Every Other Tool Was Built for Humans

Vision|Pipe was built for your AI.

| Tool | Built For | LLM-Native | Annotate at Capture | Rich Metadata |
| --- | --- | --- | --- | --- |
| Playwright | Programmatic browser automation | No | No | Partial |
| Zight / CleanShot X | Sharing with humans | No | Post-capture only | No |
| Snagit | Documentation & tutorials | No | Post-capture only | No |
| macOS Screenshot | General capture | No | No | No |
| Vision|Pipe | Piping visual context into LLMs | Yes | Voice, text, drawing | Yes |
“If Playwright gives your test suite vision, Vision|Pipe gives you vision.”

Built the Right Way

Tauri

Lightweight and secure — not Electron. Minimal memory footprint.

Rust

Systems-level metadata capture, performance, and reliability.

Whisper

On-device transcription — no audio leaves your machine.

Built in the Open

Vision|Pipe is source-available and community-driven. The code is visible and forkable, and pull requests are welcome.

# Fork the repo, then:

git checkout -b feature/your-feature

git commit -am 'Add your feature'

git push origin feature/your-feature

# Open a Pull Request

Questions? Open an issue or reach out on X @Vision_Pipe.

Stop Describing. Start Showing.

Free for personal use. Open for contributions. Built for developers.

Download for Mac

Windows support coming soon

Everything It Does. Nothing It Doesn’t.

Lightweight

Minimal CPU and memory footprint — Tauri, not Electron

Fast

Capture and copy in milliseconds

Multi-modal

Voice, text, and drawing annotation in one tool

Auto-transcription

Voice notes converted to text on-device via Whisper

Rich metadata

Spatial, window, browser, and system context bundled automatically

Open source

See exactly what you're running

Cross-platform

Mac and Windows

Keyboard-first

One hotkey does everything

LLM-agnostic

Works with any AI that accepts images

No accounts

No API keys, no logins, no cloud dependency

Clipboard-native

Composes naturally with every LLM UI on the planet