Vision|Pipe

Give your LLM eyes.

screenshot | llm — now a reality.

Vision|Pipe is a lightweight open source utility that captures your screen and pipes it, along with your voice, text, or drawn annotations and rich contextual metadata, directly into any LLM.

Built for developers who think in pipes.

Download for Mac

Windows support coming soon

View on GitHub →

The Loop You’re Stuck In

You’re working with an AI and need to show it what’s on your screen. You describe it in words. It misunderstands. You describe again. Repeat.

Every time you type “the button in the top right of the modal” instead of just pointing at it, you’re paying a productivity tax that compounds across every debugging session, every code review, every UI bug report you file.

The gap between what you see and what your AI understands is costing you hours.

Vision|Pipe Skips the Description

Capture the screen. Annotate however feels natural — speak it, type it, or draw it. Paste the full context — image, annotation, and metadata — into your LLM in one action.

No uploads. No integrations. No UI sprawl.

Just the Unix philosophy applied to AI vision: do one thing, do it well, compose it with everything else.

Every other tool

Capture

Upload image

Switch to LLM

Type context

Submit

Vision|Pipe

Capture + Comment

Paste

Submit

Five Steps. One Keystroke.

1

Press your hotkey


One keystroke activates the capture overlay. No menus, no clicks. Default is Cmd+Shift+C (Mac) / Ctrl+Shift+C (Windows) — configurable to whatever you prefer.

2

Select a region

Drag to capture any area of your screen. Full screen or surgical precision.

3

Annotate your intent

Speak it, type it, or draw it. Voice, text, and markup — all at the moment of capture.

4

Hit Enter

Screenshot + annotation + rich metadata are bundled into one clipboard payload.

5

Paste into any LLM

GPT-4, Claude, Gemini, Codex — any AI that accepts images. Your LLM gets it right on the first try.

Captures What You Mean,
Not Just What You See

Every other screenshot tool captures pixels. Vision|Pipe captures intent.

Speak It

Record a voice note alongside your screenshot. Vision|Pipe transcribes it automatically using on-device Whisper and bundles the transcript with the image.

"This dropdown is rendering below the viewport on Safari — why?"

Type It

Add a written comment at the exact moment of capture. Your intent travels with the image as a single payload.

Why is this button misaligned in dark mode?

Draw It

Circle the problem. Highlight the element. Draw an arrow. A lightweight markup layer so your LLM knows exactly what to focus on.

All three, combined. Voice, text, and drawing can be used simultaneously. The full context is bundled into one clipboard payload. Paste once — your AI has everything.

Your LLM Gets the Full Picture

Vision|Pipe doesn’t just send a screenshot. It sends the complete context of where, and from what, the image was captured — automatically appended to every clipboard payload.

Spatial & Display

Capture region: x: 240, y: 180
Capture dimensions: 1200 × 800 px
Screen resolution: 2560 × 1600
DPI / scale factor: 2x (Retina)
Monitor: LG UltraFine 5K
Color profile: Display P3

Window & Application

Active application: Visual Studio Code
Window title: visionpipe — README.md
Window size: 1440 × 900
Window state: Windowed
Process ID: 4821

Browser Context

Browser: Chrome 124.0
Active tab URL: github.com/visionpipe
Page title: Vision|Pipe — GitHub
Viewport: 1440 × 789

System

Timestamp: 2026-04-11T14:32:01Z
Operating system: macOS 15.3.2
Hostname: colins-macbook-pro
Screen count: 2

Captured via macOS Accessibility API and Windows UI Automation. No browser extension required.
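As a rough illustration, the bundled context might resemble the sketch below. The field names and structure are hypothetical, not Vision|Pipe's actual schema; the values are taken from the examples above, and in practice the screenshot itself would travel alongside this as image data on the clipboard.

```python
import json

# Hypothetical sketch of a Vision|Pipe clipboard payload.
# Field names are illustrative; the real schema may differ.
payload = {
    "annotation": {
        "voice_transcript": "Why is this button misaligned in dark mode?",
        "text": None,
        "markup": "arrow pointing at submit button",
    },
    "metadata": {
        "capture_region": {"x": 240, "y": 180, "width": 1200, "height": 800},
        "display": {"resolution": "2560x1600", "scale": 2, "color_profile": "Display P3"},
        "window": {"app": "Visual Studio Code", "title": "visionpipe — README.md"},
        "system": {"os": "macOS 15.3.2", "timestamp": "2026-04-11T14:32:01Z"},
    },
}

# Serialized once, pasted once — the LLM sees image and context together.
print(json.dumps(payload, indent=2))
```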

Every Other Tool Was Built for Humans

Vision|Pipe was built for your AI.

| Tool | Built For | LLM-Native | Annotate at Capture | Rich Metadata |
| --- | --- | --- | --- | --- |
| Playwright | Programmatic browser automation | No | No | Partial |
| Zight / CleanShot X | Sharing with humans | No | Post-capture only | No |
| Snagit | Documentation & tutorials | No | Post-capture only | No |
| macOS Screenshot | General capture | No | No | No |
| Vision|Pipe | Piping visual context into LLMs | Yes | Voice, text, drawing | Yes |
“If Playwright gives your test suite vision, Vision|Pipe gives you vision.”

Built the Right Way

Tauri

Lightweight and secure — not Electron. Minimal memory footprint.

Rust

Systems-level metadata capture, performance, and reliability.

Whisper

On-device transcription — no audio leaves your machine.

Built in the Open

Vision|Pipe is source-available and community-driven. The code is visible and forkable, and pull requests are welcome.

# Fork the repo, then:

git checkout -b feature/your-feature

git commit -am 'Add your feature'

git push origin feature/your-feature

# Open a Pull Request

Questions? Open an issue or reach out on X @Vision_Pipe.

Stop Describing. Start Showing.

Free for personal use. Open for contributions. Built for developers.

Download for Mac

Windows support coming soon

Everything It Does. Nothing It Doesn’t.

Lightweight

Minimal CPU and memory footprint — Tauri, not Electron

Fast

Capture and copy in milliseconds

Multi-modal

Voice, text, and drawing annotation in one tool

Auto-transcription

Voice notes converted to text on-device via Whisper

Rich metadata

Spatial, window, browser, and system context bundled automatically

Open source

See exactly what you're running

Cross-platform

Mac and Windows

Keyboard-first

One hotkey does everything

LLM-agnostic

Works with any AI that accepts images

No accounts

No API keys, no logins, no cloud dependency

Clipboard-native

Composes naturally with every LLM UI on the planet