For 25 years, optimizing a landing page for search meant optimizing the text. The HTML, the headings, the meta tags, the alt text, the schema. Images were "important for users" but functionally invisible to ranking engines except through their alt attributes. Videos were even less visible. The visual contents of your page were, from the search engine's perspective, an opaque blob of pixels next to the actual content that mattered.

That model is over.

Gemini Omni, the multimodal flagship Google announced at I/O 2026, treats images and video as first-class input on both ends — query and content. ChatGPT and Claude's vision layers have been growing in capability through Q1 and Q2. The new Google Search box accepts images, files, and videos as query input, and AI Mode matches those visual queries against the visual content of indexed pages. When a buyer attaches a screenshot of their current tool and asks "is there a better version of this?" — the engine is doing a real cross-modal match against the images on the pages it considers.

The practical consequence: your landing page's screenshots, product images, hero illustrations, and embedded demo videos are now AEO signals in the same way your headings and body copy have always been. They count. They get cited. They get matched against queries. They influence whether your page surfaces.

And — based on the 50+ landing pages I've audited for multimodal-readiness in the last six weeks — almost nobody has noticed.

What Multimodal Engines Actually See

Before getting into what to do, it's worth being concrete about what these engines actually extract from a page's visual content. The behavior varies by engine, but the shared patterns are clear enough now.

Images get visual embeddings. Every image on your page that the engine fetches gets converted to a vector that captures its visual content — what's depicted, what UI patterns are present, what colors and shapes dominate, what objects appear. Those embeddings are compared against query-attached images for similarity, and against the embeddings of competing pages' images for relative relevance. A page whose hero image visually matches the user's query gets a meaningful lift in candidate ranking.
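To make "compared for similarity" concrete, here is a minimal sketch of how two image embeddings could be scored against each other. The four-dimensional vectors are made-up toy values (real embeddings run to hundreds or thousands of dimensions), and cosine similarity is one common metric assumed here for illustration; no engine has published which model or metric it actually uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from any real model.
query_screenshot = [0.9, 0.1, 0.3, 0.0]  # buyer's attached kanban screenshot
product_hero = [0.8, 0.2, 0.4, 0.1]      # a page whose hero is a real kanban UI
stock_photo = [0.0, 0.9, 0.1, 0.7]       # a page whose hero is people at laptops

score_product = cosine_similarity(query_screenshot, product_hero)
score_stock = cosine_similarity(query_screenshot, stock_photo)
print(f"product screenshot vs query: {score_product:.2f}")
print(f"stock photo vs query:        {score_stock:.2f}")
```

The point of the toy numbers is only the ordering: an image that depicts the same kind of thing as the query lands much closer to it than an image that depicts something unrelated.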

Images get OCR'd. Any text inside an image — UI labels, button copy, headings inside a screenshot, captions inside an infographic — is extracted via OCR and added to the page's textual content for ranking purposes. This is why screenshots are now structurally valuable: a clean screenshot of your product UI is roughly equivalent to publishing all the labels and copy from your product as additional text on the page. Without the screenshot, those labels live behind a login wall and aren't indexed at all.

Videos get frame-sampled at roughly 1 fps. Multimodal AI engines don't watch your video at 30 frames per second. They sample approximately one frame per second, run each sampled frame through the same visual embedding pipeline as images, and combine those frame embeddings with the audio transcript (if captions exist) to construct a representation of the video's content. This has a critical implication: if your demo video uses fast cuts where key UI is on screen for under a second, the shot can fall entirely between two samples, and the AI engine never sees it. Slower-paced, longer-hold visuals get indexed. Fast cuts can disappear.
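A quick way to reason about this is to count how many sample instants land inside each shot of your edit. The sketch below assumes uniform sampling that starts at t = 0 (an assumption; real engines may offset or vary the interval), and the shot list is a hypothetical cut sheet, not data from a real video.

```python
import math

def sampled_frames_per_shot(shots, fps=1.0):
    """Count the sample instants (t = 0, 1/fps, 2/fps, ...) that fall inside each shot.

    shots: list of (label, start_seconds, end_seconds); the end is exclusive.
    """
    counts = {}
    for label, start, end in shots:
        first = math.ceil(start * fps)   # index of the first sample at or after start
        last = math.ceil(end * fps) - 1  # index of the last sample strictly before end
        counts[label] = max(0, last - first + 1)
    return counts

# Hypothetical cut sheet for the opening seconds of a demo video.
shots = [
    ("dashboard overview", 0.0, 2.5),
    ("export menu", 2.5, 2.9),
    ("settings panel", 2.9, 3.4),
    ("report view", 3.4, 6.4),
]

counts = sampled_frames_per_shot(shots)
for label, n in counts.items():
    print(f"{label}: {n} sampled frame(s)")
```

Under these assumptions a sub-second shot is captured only if it happens to straddle a sample instant, which is not something you can plan around. Shots held for two seconds or more are captured regardless of where the samples fall.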

Captions are primary, not secondary. Closed captions (a .vtt or .srt file delivered alongside the video) are the cleanest semantic anchor multimodal engines have for video content. Auto-generated YouTube captions are okay; hand-cleaned captions are dramatically better. The video itself adds visual signal on top, but the caption file is the textual representation the engine ranks against.
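If you have never opened a caption file, it is just timed text. Here is a minimal sketch that writes a WebVTT file from a list of cues; the cue text is invented for illustration, and the function names are my own rather than part of any captioning tool.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def build_webvtt(cues):
    """cues: list of (start_seconds, end_seconds, text). Returns the file contents."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        if end <= start:
            raise ValueError(f"cue ends before it starts: {text!r}")
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # Cues are separated from the header and from each other by a blank line.
    return "\n\n".join(blocks) + "\n"

# Hypothetical cues for the first few seconds of a demo.
cues = [
    (0.0, 3.0, "Open the Reports tab to see every client on one screen."),
    (3.0, 6.5, "Click a client row to drill into this month's retainer hours."),
]
vtt = build_webvtt(cues)
print(vtt)
```

Naming the UI elements in the cue text ("Reports tab", "client row") is what ties the words to what is on screen during those seconds.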

Open Graph images get preview placement. When AI engines surface your URL in conversational responses, the preview thumbnail is your Open Graph image (og:image). For 90% of landing pages, this is still the auto-generated screenshot or a stale brand graphic. For a small but growing minority, it's a deliberately chosen canonical image — and those pages get visibly better engagement when previewed.

The Multimodal-Readiness Audit

Here's a 20-minute audit that surfaces where your landing page is leaving multimodal AEO signal on the table. You can do this in a browser tab with view-source open.

Audit 1: Above-the-fold visual. Open your landing page. Look at the image that's most prominent above the fold. Is it (a) a real screenshot of your product UI, (b) a custom illustration of an abstract concept, or (c) a stock photo of people looking at laptops? Only option (a) produces multimodal signal that helps you rank. Options (b) and (c) are effectively invisible to query-attached image matching, because they don't depict anything a buyer would screenshot when looking for alternatives in your category.

Audit 2: Alt text quality. Open each meaningful image on the page and check the alt attribute. Three categories: (a) empty alt or generic ("dashboard," "hero image"), (b) a short label ("product dashboard view"), or (c) a full sentence describing what's depicted and what use case it represents. Only (c) gives the engine the contextual anchor it needs to match the image to a buyer's intent. (a) and (b) are technical compliance, not multimodal AEO.

Audit 3: Image captions on the page. Many landing pages have images with no visible caption. The image sits alone, with only an alt attribute the user never sees. The pattern that earns more multimodal citations: every meaningful image has a visible caption underneath, in the same prose tone as the body copy, describing the use case. Your alt text serves the screen reader and gives the engine a text description attached to the image itself; your visible caption serves both the human reader and the engine's text-context layer for the adjacent image.

Audit 4: Demo video presence and quality. Do you have a video embedded above the fold? If yes: does it have a real closed caption file, or just auto-generated YouTube captions? Are key UI elements held on screen for 2-3 seconds or longer, or do you cut every 0.5 seconds? Most demo videos fail one or both of these. The few that pass both produce some of the highest-yielding multimodal AEO assets on the entire page.

Audit 5: Open Graph image. Check your og:image meta tag. Is it the canonical screenshot of your product (good) or a stale brand graphic from 2023 (bad)? AI engines preview your page using this image when surfacing it in conversational responses. If the preview doesn't depict what your product actually does, you're losing the visual handshake.

Audit 6: Schema markup for visual content. Search your page source for ImageObject and VideoObject in JSON-LD blocks. If they're absent (likely), your visual content has no structured-data layer telling the engine what it is, when it was created, who made it, and what it depicts. The schema adds explicit semantic context on top of what the engine can infer visually.

Most landing pages I audit fail four or five of these six checks. Each failure is a missed multimodal signal. Each fix is a small, cheap engineering or design change.
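Audits 2, 5, and 6 can be roughed out in a script before you do the manual pass. The sketch below uses only Python's standard library. The list of "generic" alt values and the eight-word threshold for a "sentence" are my own assumptions, not established cutoffs, and the parser only reads a top-level string @type, so JSON-LD that nests entities under @graph would need extra handling.

```python
import json
from html.parser import HTMLParser

GENERIC_ALTS = {"", "dashboard", "hero image", "image", "screenshot"}  # assumed list

class VisualAudit(HTMLParser):
    """Collects img alt text, the og:image URL, and JSON-LD @type values."""

    def __init__(self):
        super().__init__()
        self.alts = []
        self.og_image = None
        self.schema_types = set()
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.alts.append((attrs.get("alt") or "").strip())
        elif tag == "meta" and attrs.get("property") == "og:image":
            self.og_image = attrs.get("content")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is skipped, not fatal
            for item in doc if isinstance(doc, list) else [doc]:
                if isinstance(item, dict) and isinstance(item.get("@type"), str):
                    self.schema_types.add(item["@type"])

def classify_alt(alt):
    """Bucket alt text into the three categories from Audit 2 (rough heuristic)."""
    if alt.lower() in GENERIC_ALTS:
        return "a: empty or generic"
    if len(alt.split()) < 8:
        return "b: short label"
    return "c: descriptive sentence"

# Hypothetical page source; in practice, paste in your own view-source output.
sample = """
<html><head>
<meta property="og:image" content="https://example.com/og.png">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "ImageObject"}
</script>
</head><body>
<img src="a.png" alt="">
<img src="b.png" alt="Product dashboard view">
<img src="c.png" alt="The dashboard a 5-person agency sees when reviewing this month's retainer reports">
</body></html>
"""

audit = VisualAudit()
audit.feed(sample)
for alt in audit.alts:
    print(classify_alt(alt), "|", alt)
print("og:image:", audit.og_image)
print("missing schema:", {"ImageObject", "VideoObject"} - audit.schema_types)
```

A script like this tells you what is present, not whether it is good. Whether the og:image actually depicts the product, and whether the hero is a real screenshot, still takes a human look.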

What to Build Instead of What You Have

The pattern that's working for landing pages competing in multimodal AI search, based on the rebuilt pages whose citation rates I've watched climb in April and May 2026:

The canonical product screenshot

The single highest-value visual asset on your page is a clean, recognizable screenshot of your product that visually represents what the product does. Not a marketing rendering. Not a 3D illustration of an abstract metaphor. The actual UI, captured cleanly, showing the canonical use case.

"Canonical" is doing real work here. If your product is a kanban tool, the canonical screenshot is a kanban board with realistic-looking columns and cards. If it's a code editor, it's the editor with realistic code on screen. If it's a CRM, it's a contact view with realistic contact data. The visual content of the screenshot should match what a buyer would picture in their head when imagining a tool in your category. Visual canonicity is the rough multimodal equivalent of keyword match.

Three rules for canonical screenshots:

Use realistic content, not placeholder. "Lorem ipsum" in your screenshot tells the OCR layer nothing useful. Real-looking contact names, real-looking task descriptions, real-looking numbers — these become part of the indexable text of the page through OCR.

Show the use case, not the empty state. An empty kanban with three columns labeled "To Do / Doing / Done" is visually less distinctive than a populated board with realistic tasks across realistic categories. The populated version generates a richer visual embedding and a more useful OCR text layer.

Include UI labels at readable resolution. Don't compress your screenshot so aggressively that the UI text becomes illegible. OCR fails at low resolution. The text you wanted indexed gets lost. Serve images at the resolution they were captured at, with appropriate srcset handling for performance.

Captions written in buyer language

Every meaningful image gets a visible caption underneath it. Not a label. A sentence. The sentence describes the use case in the language a buyer would use.

The format that works: "[Who] uses [what view] to [accomplish what outcome]." Concrete example: "The dashboard a 5-person agency sees when reviewing this month's retainer reports across 14 clients on different billing cycles." That caption tells the engine the image is relevant to queries about agencies, retainer reporting, multi-client management, and billing cycles — none of which the visual embedding alone would have surfaced.

Most landing pages caption images as labels: "Dashboard view." That tells the engine almost nothing useful about who the image is for or what they'd be doing with it. Switching from labels to sentences is a one-afternoon edit that materially improves multimodal AEO yield.
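If your images come out of a template or CMS, the label-to-sentence switch can be enforced in code. This is a minimal sketch, assuming you render HTML from Python; the function names and the sample values are hypothetical.

```python
from html import escape

def caption_sentence(who, view, outcome):
    """Fill the '[Who] uses [what view] to [accomplish what outcome]' pattern."""
    return f"{who} uses {view} to {outcome}."

def render_figure(src, alt, caption):
    """Return a <figure> with an escaped alt attribute and a visible <figcaption>."""
    return (
        "<figure>\n"
        f'  <img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">\n'
        f"  <figcaption>{escape(caption, quote=False)}</figcaption>\n"
        "</figure>"
    )

caption = caption_sentence(
    "A 5-person agency",
    "the retainer dashboard",
    "review monthly reports across 14 clients on different billing cycles",
)
figure_html = render_figure(
    "/images/dashboard.png",
    "Retainer dashboard listing 14 clients with hours used and billing cycle",
    caption,
)
print(figure_html)
```

Requiring the three caption parts as separate arguments makes it harder to fall back to a bare label like "Dashboard view".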

Short demo videos with real captions

A 30-90 second demo video, embedded above or near the fold, with a properly hand-cleaned closed caption file, is one of the highest-leverage multimodal AEO assets you can ship. The combination is doing several things:

The video adds visual signal across roughly 30-90 frame embeddings (at 1 fps sampling). Each frame the engine sees becomes part of the page's visual representation.

The captions add a clean, dense text layer that's specifically anchored to what's on screen at each moment — far more contextually rich than the same text appearing as body copy.

The combination of video + captions tells the engine the page is producing experiential proof, not just claims. Pages with demo video evidence get weighed differently than pages with claims alone.

Three rules for demo videos that get cited:

Hold key UI on screen for 2-3 seconds. If you're showing the user clicking through a flow, give each step enough screen time that at least 2-3 frames at 1 fps capture it. Fast tutorials look impressive to humans and disappear to AI engines.

Hand-clean the captions. Auto-generated captions hallucinate words, miss proper nouns, and ignore UI labels. Take 30 minutes after recording to clean them. Hand-cleaned captions consistently outperform auto-generated ones in multimodal citation rate by a meaningful margin in our tracking.

Self-host or use a hosting solution that exposes the caption file. If your video player hides the caption file behind a stream URL the AI engine can't reach, you lose the semantic anchor. YouTube exposes captions accessibly; Vimeo with proper settings does; some embedded SaaS players don't. Check.
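Part of the caption cleanup can be scripted as a first pass before you read through by hand. The sketch below swaps known mis-transcriptions for the correct terms. The glossary entries and the product name "AcmeFlow" are invented for illustration; you would build your own list from the errors you actually see.

```python
import re

# Hypothetical glossary: mis-transcription -> correct term.
GLOSSARY = {
    "can ban": "kanban",
    "acme flow": "AcmeFlow",  # made-up product name
    "retain her report": "retainer report",
}

def apply_glossary(text, glossary=GLOSSARY):
    """Replace each known mis-transcription, matching whole words case-insensitively.

    Returns the corrected text and the number of replacements made.
    """
    fixes = 0
    for wrong, right in glossary.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text, n = pattern.subn(right, text)
        fixes += n
    return text, fixes

raw = "Open the can ban board in Acme Flow and pull the retain her report."
cleaned, fixes = apply_glossary(raw)
print(cleaned)
print(f"{fixes} replacement(s)")
```

This only catches errors you already know about. Hallucinated words and missing UI labels still need a person watching the video with the captions on.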

ImageObject and VideoObject schema

JSON-LD schema for your visual content tells the engine explicitly what each asset is, when it was made, what it depicts, and who created it. The minimum-viable version looks like this:

{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "contentUrl": "https://yoursite.com/images/dashboard.png",
  "caption": "The dashboard a 5-person agency sees when reviewing retainer reports.",
  "creator": { "@type": "Organization", "name": "Your Company" },
  "datePublished": "2026-05-15"
}

For video:

{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "60-second product walkthrough",
  "description": "Quick walkthrough of how a project manager uses the tool to...",
  "thumbnailUrl": "https://yoursite.com/video-thumb.jpg",
  "uploadDate": "2026-05-15",
  "duration": "PT1M",
  "contentUrl": "https://yoursite.com/demo.mp4",
  "transcript": "Full transcript text here..."
}

Including the full transcript inside the VideoObject schema is the move most teams skip. It gives the engine the entire text of the video in a structured form, attached explicitly to the video entity. Pages that include video transcripts in schema get cited noticeably more often for query-from-video matches.
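Pasting a transcript into JSON by hand is where quoting mistakes creep in, so it is worth generating the block. Here is a minimal sketch, assuming you already have the cleaned caption text as a list of lines; the helper names are my own, and the field values mirror the example above.

```python
import json

def iso8601_duration(total_seconds):
    """Convert whole seconds to an ISO 8601 duration such as PT1M30S."""
    hours, rem = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    out = "PT"
    if hours:
        out += f"{hours}H"
    if minutes:
        out += f"{minutes}M"
    if seconds or out == "PT":
        out += f"{seconds}S"
    return out

def video_jsonld_script(video, caption_lines):
    """Return a <script> tag holding VideoObject JSON-LD with the transcript inlined.

    video: dict of the other VideoObject fields; caption_lines: cleaned caption text.
    """
    data = {"@context": "https://schema.org", "@type": "VideoObject", **video}
    data["transcript"] = " ".join(l.strip() for l in caption_lines if l.strip())
    # Escape "</" so caption text can never close the <script> tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'

tag = video_jsonld_script(
    {
        "name": "60-second product walkthrough",
        "thumbnailUrl": "https://yoursite.com/video-thumb.jpg",
        "uploadDate": "2026-05-15",
        "duration": iso8601_duration(60),
        "contentUrl": "https://yoursite.com/demo.mp4",
    },
    ["Open the Reports tab.", "", "Pick a client to see this month's retainer hours."],
)
print(tag)
```

Deriving the duration from the real length in seconds keeps it from drifting out of sync with the video's name and description when the video is re-cut.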

An Open Graph image worth previewing

Your og:image is the thumbnail AI engines preview when they surface your page in a response. For most B2B landing pages this is auto-generated or has been ignored for a year. The fix:

Create a deliberately designed Open Graph image (1200x630) that's the canonical screenshot of your product, the visual equivalent of "this is what this URL is about." Add your brand wordmark in a corner. Add a one-line description of the page's primary value. Save it once. Reference it from every key page's <meta property="og:image"> tag.

This is a one-hour design task that improves the visual handshake for every conversational AI surface that previews your URL. It pays dividends every time a user sees your card in a response and decides whether to click.
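To keep the tags consistent across pages, you can generate them from one function. This sketch emits og:image along with the width, height, and alt properties defined by the Open Graph protocol; the URL and alt text are placeholders, and whether a given AI engine reads the extra properties is an assumption on my part.

```python
from html import escape

OG_WIDTH, OG_HEIGHT = 1200, 630  # the size recommended above

def og_image_tags(image_url, alt_text, width=OG_WIDTH, height=OG_HEIGHT):
    """Return the og:image meta tags as a newline-separated HTML string."""
    props = [
        ("og:image", image_url),
        ("og:image:width", str(width)),
        ("og:image:height", str(height)),
        ("og:image:alt", alt_text),
    ]
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}">'
        for name, value in props
    )

tags = og_image_tags(
    "https://yoursite.com/images/og-dashboard.png",
    "Retainer dashboard showing 14 client accounts and their billing cycles",
)
print(tags)
```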

What to Stop Doing

Three patterns that were defensible in 2024 and are now actively costing you multimodal AEO signal:

Stop using custom illustrations as primary hero imagery. Custom illustrations look beautiful on the page. They contain no information a multimodal engine can match against a buyer's screenshot-attached query. If your hero is a vector illustration of a person interacting with floating UI elements, you're winning a design award and losing visual queries. Replace it (or supplement it) with a real product screenshot.

Stop fast-cutting demo videos. A 30-second demo with 60 cuts looks impressive to humans, but an engine sampling at 1 fps takes only about 30 frames from it, so at least half of those shots are never seen. Slow down. Hold each shot for 2-3 seconds. The video should still feel deliberate to human viewers, and it becomes far more useful to multimodal indexers.

Stop relying on auto-generated captions. Auto-captions on YouTube, Loom, and most video tools mis-transcribe product names, miss UI labels, and sometimes hallucinate words entirely. Each error is a piece of bad data the AI engine indexes against your page. Hand-clean the captions for any video that's a primary AEO asset. The 30 minutes pays off.

The Compounding Advantage

The reason multimodal AEO is worth investing in now, before it's widely understood, is that the work compounds in a particularly durable way. Text-based SEO and AEO advantages can be replicated quickly — a competitor can write a similar blog post, refresh similar meta descriptions, build similar schema. Multimodal advantages are slower to replicate.

Replacing a stock photo with a real product screenshot requires having a product worth screenshotting. Recording a slow-paced demo with hand-cleaned captions requires the discipline to actually do it. Building a library of captioned, schema-marked-up images requires sustained ongoing investment — not a one-time content sprint.

The teams that build out a full multimodal AEO surface in the next two quarters are accumulating visual signal that competitors can't catch up on without doing the same multi-month work themselves. The accumulated visual signal becomes part of the page's authority in AI search the way accumulated backlinks became part of authority in 2010-era Google.

For the categories where multimodal queries are highest — design tools, dev tools, productivity software, ecommerce, hardware — the gap between teams who shipped multimodal AEO in 2026 and teams who didn't is going to be one of the most consequential SEO patterns of the year.

A 14-Day Sprint to Multimodal-Readiness

Days 1-2: Take new product screenshots. Real data, realistic content, full-resolution. Replace the hero illustration on your highest-traffic landing page with the canonical product screenshot.

Days 3-4: Rewrite alt text and add visible captions for every meaningful image on your top 3 landing pages. Buyer's language. Full sentences. Use-case-specific.

Days 5-7: Record a 60-90 second demo video with deliberate pacing. Each step held for 2-3 seconds. Upload to YouTube or self-host. Embed above the fold.

Days 8-9: Hand-clean the video captions. Upload as a .vtt file. Add to the embed.

Day 10: Add JSON-LD ImageObject and VideoObject schema for the canonical screenshot and the demo video. Include the full transcript inside the VideoObject.

Days 11-12: Design and ship a new Open Graph image — canonical screenshot, wordmark, one-line description. Update og:image tags across the site.

Days 13-14: Test in AI Mode by attaching a screenshot from a competitor's product and asking "is there a better version of this?" — see whether your page surfaces in the response. Iterate on the screenshot composition and caption based on what's cited above you.

Two weeks. One product designer, one engineer, one writer. The result is a landing page that produces multimodal signal across every dimension AI engines now weigh — and a meaningful first-mover advantage in a query class that's growing fast.

If you want a starting-line audit of how your existing landing page handles multimodal AEO (including image alt quality, caption density, video pacing, and schema coverage), our free landing page analysis includes a visual-content audit alongside the standard copy and structure checks. Most pages discover they're failing four or five of the six multimodal checks on their first audit. The fixes are small. The compounding advantage of shipping them in mid-2026 is not.

AI Search Can See Your Landing Page Now. Your Screenshots Are AEO Signals — and Most of Them Are Wasted.