
Image and screen input

10 min · beginner

Most people only ever feed AI tools text. But chat agents have eyes too. Learning to use them turns “I am stuck on this thing in front of me” into a five-second prompt.

The obvious uses

Most of the time, an image is the obvious right answer: the broken UI, the whiteboard at the end of the meeting, the error message you do not want to retype, the chart you cannot quite read. If you have used a chat agent for a week or two, you have probably already done this without anyone teaching you. Three patterns that cover almost all of it:

  • The “what is wrong with this” screenshot. Take a screenshot of the broken UI, the failing chart, the email that does not look right. Drag it in and ask the agent what is off. Especially good for visual layouts where the bug is “looks weird” and you cannot quite name it.
  • The whiteboard photo. Snap your phone camera at a whiteboard or notepad. Ask the agent to extract the structure into clean text, group the related items, or turn it into a working outline. Beats retyping six bullet points while squinting at a blurry photo.
  • The screen-share or live mode. Gemini Live (Android, Chrome) and ChatGPT Advanced Voice (iOS, Android) let you share your screen or camera live during a voice conversation. Useful when “look at this with me” is faster than “let me describe what I am looking at”. Claude does not have an equivalent at the time of writing, but you can still drag screenshots into a regular chat. Whenever you use live screen-share, the conversational model is the one talking back, so use it for thinking out loud and walking through a problem, not for careful written work. (Same warning we hit in Talking to your AI.)

The model reads the image into its context the same way it reads text. Same rules apply: relevant content beats more content, a tight crop around the bug beats a full-screen capture.

The non-obvious one: when a file is not enough

Sometimes you already have the file the agent could read, but the agent still needs the screenshot to actually see what you are looking at. This is the move most people miss for months.

The classic case is PowerPoint. Drop the .pptx into a chat and the agent reads the text and structure: titles, bullets, speaker notes. It does not see the layout. The headline that runs off the slide, the chart that overlaps the photo, the colour combo that fights itself: all of that is invisible to the model when it parses the file. Screenshot the rendered slide, drop that in alongside, and ask “what is wrong with how this looks?”. Suddenly the agent sees what your audience would see.

Same pattern shows up across formats:

  • Web design. The agent reads your HTML and CSS as code. It does not know whether the rendered page is broken until you screenshot it. (A minimal sketch of this gap follows the list.)
  • Documents with layout. Word docs, PDFs with columns or boxed callouts, anything where the visual hierarchy carries meaning. The text content lands; the visual question stays unanswered.
  • Charts inside a doc. The agent gets the underlying numbers (or worse, the alt-text). The bad axis label, the unreadable colour, the misaligned legend: those are only visible in the rendered image.
  • Spreadsheets where the formatting carries the meaning. Conditional formatting, merged cells, embedded charts. The cells read fine; the dashboard does not exist for the agent until you screenshot it.
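
To make the web-design case concrete, here is a minimal sketch. Everything in it (the file name, class name, and headline) is invented for illustration; the point is that both CSS rules are syntactically valid, so a model reading the file as text has nothing to flag, while the rendered page is visibly broken in any narrow window.

    <!-- sketch.html: hypothetical page that reads fine as code but renders broken -->
    <!DOCTYPE html>
    <html>
      <head>
        <style>
          /* Valid CSS: nothing here errors when parsed as text. */
          .banner { width: 1200px; }          /* fixed width overflows any narrower window */
          .banner h1 { white-space: nowrap; } /* long headline runs off the edge instead of wrapping */
        </style>
      </head>
      <body>
        <div class="banner">
          <h1>A headline long enough to escape its container on most screens</h1>
        </div>
      </body>
    </html>

Paste that file into a chat and the agent can only reason about the rules in the abstract; screenshot the rendered page and the clipped headline is the first thing it sees.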

Rule of thumb: if the answer depends on what something looks like, not what it says, give the agent the screenshot, not the file. Often you want both.

What images cannot replace

An image is great context, not a substitute for the underlying data. A screenshot of a 200-row spreadsheet is a worse prompt than the spreadsheet attached as a file; the model can read the file directly but has to OCR the screenshot. Same for a PDF when the question is about the text, not the layout: attach the PDF, do not photograph it. Reach for an image when the visual itself is the question. When the question is about the data, use the file.

Hands-on

01

Pick something on your screen right now that is mildly annoying or confusing: a chart, a UI you do not understand, an error message, a doc layout. Take a screenshot, drop it into your chat agent, and ask “What is going on here? What would you do?”

02

Pick a slide deck or web page you are working on. Give a chat agent the source (the .pptx file, the page URL, or the rendered HTML) and ask “review this for visual problems”. Then screenshot one slide or section that looks weak, drop the screenshot into the same chat, and ask the same question. Compare the two answers. The screenshot answer will catch things the file alone cannot.

03

Open live screen-share in Gemini Live (Android or Chrome) or ChatGPT Advanced Voice (iOS or Android). Spend two minutes walking the agent through one of your open windows out loud. Notice the gap between this and a typed prompt: faster for thinking, looser for careful answers.

Reflect

  • Where in your week do you describe something visual in words because handing over the picture has not occurred to you? What is one thing you could screenshot tomorrow instead of writing about?
  • Pick the slide deck, doc, or web page you would normally share with someone for feedback. Have you been giving the agent the file or the screenshot? Which one would actually tell you more about how it looks to your audience?