What is AI Vision?
AI Vision uses multimodal AI models to understand and interact with web pages through screenshots rather than DOM selectors. While traditional Browser Automation relies on CSS selectors and DOM traversal to find elements, AI Vision looks at the page the same way a human does — visually.
This is a critical fallback for the many situations where selector-based automation breaks down:
- Canvas-rendered content — applications that draw directly to HTML canvas have no DOM elements to select
- Complex dynamic layouts — heavily JavaScript-driven UIs that restructure the DOM on every render
- Shadow DOM and web components — encapsulated components that resist external selector access
- Obfuscated class names — sites that randomize CSS classes on every page load to prevent scraping
- PDF viewers and embedded content — rendered content that exists as pixels, not DOM nodes
How It Works Under the Hood
When the agent encounters a page that resists traditional selector-based interaction, it switches to vision mode:
- A full-page screenshot is captured
- The screenshot is sent to a multimodal AI model (like GPT-4V or Claude Vision)
- The model analyzes the image and identifies text, UI elements, layout structure, and interactive components
- Based on the analysis, the agent can extract text, click at specific coordinates, or describe what it sees
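The loop above can be sketched in a few lines. This is an illustrative outline only, with stand-in functions (`capture_screenshot`, `analyze_with_vision_model`) replacing the real browser and model calls; none of these names come from Autonoly's API:

```python
import base64

def capture_screenshot(page) -> bytes:
    # Stand-in: a real agent would request a full-page PNG from the browser.
    return page["screenshot_png"]

def analyze_with_vision_model(image_png: bytes, task: str) -> dict:
    # Stand-in: a real agent would send the base64-encoded image to a
    # multimodal model API and parse its structured reply.
    encoded = base64.b64encode(image_png).decode()
    return {"task": task, "elements": [], "encoded_size": len(encoded)}

def vision_interact(page, task: str) -> dict:
    screenshot = capture_screenshot(page)                    # step 1: capture
    analysis = analyze_with_vision_model(screenshot, task)   # steps 2-3: analyze
    return analysis                                          # step 4: act on the result

result = vision_interact({"screenshot_png": b"\x89PNG"}, "find the login button")
```

The key point is that the model receives pixels, not markup, so the analysis works identically whether the page is semantic HTML, canvas, or an embedded PDF.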
Visual Data Extraction
AI Vision excels at extracting data from visually structured content that lacks proper DOM structure:
Charts and Graphs
Bar charts, line graphs, pie charts — these are often rendered on canvas or as SVG images. AI Vision can read the data values, labels, and trends directly from the visual representation. Combined with Data Processing, you can turn chart images into actual data tables.
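As a hypothetical example of that pipeline, the agent can ask the vision model to answer in structured JSON and then flatten the reply into rows. The response schema shown here is an assumption for illustration, not a documented format:

```python
import json

# Hypothetical vision-model reply when asked to read a bar chart
# and respond with structured JSON (this schema is an assumption):
model_reply = json.dumps({
    "chart_type": "bar",
    "series": [
        {"label": "Q1", "value": 120},
        {"label": "Q2", "value": 95},
        {"label": "Q3", "value": 143},
    ],
})

def chart_to_rows(reply: str) -> list[tuple[str, float]]:
    # Flatten the model's series into (label, value) rows
    # ready for a spreadsheet or database insert.
    data = json.loads(reply)
    return [(p["label"], float(p["value"])) for p in data["series"]]

rows = chart_to_rows(model_reply)
# rows == [("Q1", 120.0), ("Q2", 95.0), ("Q3", 143.0)]
```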
Complex Tables
Some websites render tables using absolute positioning, overlapping divs, or canvas rather than semantic HTML table elements. Traditional extraction sees a mess of positioned elements; AI Vision sees a clean table and extracts it correctly.
Infographics and Dashboards
Business intelligence dashboards, analytics platforms, and data visualization tools often render their output in ways that defeat DOM-based extraction. AI Vision reads KPIs, metrics, and data points directly from the visual display.
Smart Fallback System
AI Vision integrates with the AI Agent Chat as an intelligent fallback. The agent first attempts standard selector-based interaction. If that fails — element not found, wrong element selected, or interaction doesn't produce expected results — the agent automatically switches to vision mode. This happens seamlessly without any configuration:
- Attempt CSS selector interaction
- If it fails, capture screenshot and use vision analysis
- If vision succeeds, cache the approach for future runs via Cross-Session Learning
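The fallback-and-cache behavior can be modeled as a small sketch. The cache structure and function names here are hypothetical, used only to show the control flow:

```python
# Hypothetical fallback logic: selectors first, vision second,
# with the winning strategy cached per (site, step).
strategy_cache: dict[tuple[str, str], str] = {}

def interact(site: str, step: str, try_selector, try_vision) -> str:
    cached = strategy_cache.get((site, step))
    if cached != "vision":          # selectors are faster and cheaper, try them first
        try:
            try_selector()
            return "selector"
        except Exception:
            pass                    # selector failed; fall back to vision
    try_vision()
    strategy_cache[(site, step)] = "vision"  # remember for future runs
    return "vision"

# Simulate a page where the selector always fails:
def broken_selector():
    raise RuntimeError("element not found")

def vision_click():
    pass  # pretend the vision click succeeded

first = interact("example.com", "login", broken_selector, vision_click)
second = interact("example.com", "login", broken_selector, vision_click)
# Both runs return "vision"; the second skips the doomed selector attempt.
```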
Coordinate-Based Interaction
When vision mode identifies an element, the agent can click at specific pixel coordinates on the page. This bypasses DOM-level restrictions entirely and works on any visible element — including those inside iframes, shadow DOM, canvas elements, or embedded content.
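Vision models typically return bounding boxes in the screenshot's pixel space, so before clicking, the agent converts a box to a click point and corrects for display scaling. The following is an illustrative calculation, not Autonoly's actual implementation:

```python
def click_point(bbox: dict, device_pixel_ratio: float = 1.0) -> tuple[float, float]:
    """Center of a vision-model bounding box, mapped from screenshot
    pixels back to CSS pixels for the click call."""
    cx = bbox["x"] + bbox["width"] / 2
    cy = bbox["y"] + bbox["height"] / 2
    # Screenshots on high-DPI displays are larger than the CSS viewport,
    # so divide by the device pixel ratio before issuing the click.
    return (cx / device_pixel_ratio, cy / device_pixel_ratio)

# A 100x40 button whose top-left corner is at (200, 300) on a 2x display:
x, y = click_point({"x": 200, "y": 300, "width": 100, "height": 40},
                   device_pixel_ratio=2.0)
# x == 125.0, y == 160.0
```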
Practical Applications
AI Vision shines in scenarios where traditional automation struggles:
- Legacy enterprise applications — older web apps with non-standard HTML that modern selectors can't target
- Canvas-based applications — design tools, mapping applications, and data visualization platforms
- Dynamic dashboards — analytics and BI tools where the DOM structure changes with every data update
- Image-heavy content — extracting text from product images, screenshots, or embedded graphics
- [PDF & OCR](/features/pdf-ocr) integration — vision analysis complements OCR for complex document layouts
Best Practices
Getting the most out of AI Vision requires understanding when and how to deploy it effectively. Here are practical tips for optimizing your vision-based automations:
Let the smart fallback system work first. AI Vision is most efficient when used as a fallback rather than a default. The agent's selector-based interaction is faster and cheaper. Vision kicks in automatically when selectors fail, so there is rarely a reason to force it on every step. Reserve forced vision mode for known problematic pages — canvas apps, obfuscated sites, or embedded content where selectors simply cannot reach.
Capture high-quality screenshots. Vision accuracy depends heavily on image quality. Ensure the browser viewport is large enough to display content clearly, and avoid triggering vision on pages that are still loading. The smart wait system handles most timing issues, but for particularly heavy dashboards, add a short delay before vision analysis to let all chart animations and data renders complete.
Combine vision with [Data Processing](/features/data-processing) for validation. When extracting numerical data from charts or dashboards, pipe the vision output through a data processing step that validates ranges, formats, and data types. This catches occasional misreads — for example, a "5" misread as an "S" — before the data reaches your spreadsheets or databases.
Use coordinate clicking sparingly. While clicking at pixel coordinates bypasses all DOM-level protections, coordinates can shift when page layouts change. Whenever possible, let the agent try selector-based clicking first. Coordinate clicking is best for truly static UI elements inside canvas or embedded content.
Cache and reuse successful approaches. If you find that a specific site consistently requires vision mode, the Cross-Session Learning system automatically caches the successful approach. Future runs on that site will apply vision immediately, skipping the initial selector attempt and saving time. You can verify cached approaches in your workspace settings.
Security & Compliance
AI Vision involves sending screenshots to multimodal AI models for analysis, which raises important considerations for teams handling sensitive data. Autonoly takes several steps to protect your information throughout this process.
All screenshots are processed in memory and are not persisted to disk beyond the duration of the analysis. They are transmitted over encrypted TLS 1.3 connections to the vision model API, and the response is returned directly to the agent. Screenshots are never stored in logs, never shared across workspaces, and never used for model training. The credential vault ensures that any login credentials visible on a page are masked before screenshot capture when possible, and execution environments are destroyed after each run, eliminating any residual image data.
For organizations with strict data handling policies, it is worth noting that vision analysis uses external AI model APIs (such as Claude Vision or GPT-4V). If your compliance requirements prohibit sending page screenshots to third-party services, you can disable vision mode for specific workflows in the workflow settings. To discuss enterprise deployment options with on-premises vision processing, contact our team. For a deeper understanding of how Autonoly protects data across all features, visit the Security feature page.
Common Use Cases
AI Vision unlocks automation scenarios that are simply impossible with traditional selector-based tools. Here are real-world examples from teams using the feature:
Extracting Data from Business Intelligence Dashboards
A marketing analytics team needed to pull weekly KPI summaries from a proprietary BI tool that renders all charts on HTML canvas. Traditional scrapers saw nothing — no DOM elements, no text nodes, just a blank canvas tag. With AI Vision, the agent captures a screenshot of the dashboard, reads the KPI values, chart labels, and trend indicators directly from the visual output, and pushes the data to Google Sheets for the weekly report. Combined with Scheduled Execution, this runs every Monday morning without human intervention. For teams building similar monitoring workflows, our guide on ecommerce price monitoring covers related strategies for tracking visual data across sites.
Automating Legacy Enterprise Applications
A logistics company relies on a legacy web application built in the early 2000s with non-standard HTML, nested frames, and ActiveX components. CSS selectors cannot reliably target input fields or buttons because the DOM structure changes unpredictably. AI Vision identifies the login form, navigation menus, and data entry fields visually, enabling full automation of daily shipment tracking updates. The agent fills out forms by clicking at the correct coordinates and typing values, entirely bypassing the broken DOM. Teams interested in similar approaches can read more about automating data entry across difficult interfaces.
Reading Charts and Graphs for Competitive Intelligence
A product research team monitors competitor pricing trends displayed as line charts on a competitor's public dashboard. The charts are SVG-based with no accessible data attributes. AI Vision reads the data points, axis labels, and trend lines directly from the chart image. The extracted values feed into a Data Processing pipeline that calculates week-over-week price changes, flags significant movements, and generates a summary report sent to Slack. The best web scraping tools guide explains how vision-augmented scraping compares to traditional approaches for these scenarios.
Navigating Map and Design Applications
A real estate data team needed to extract property boundary information from a web-based mapping application that renders entirely on canvas. No DOM elements exist for the map pins, boundaries, or labels. AI Vision identifies property markers, reads address labels, and captures boundary shapes from the visual display. The data is structured and exported for GIS analysis. For real estate teams exploring similar workflows, see our guide on real estate automation.
Visit pricing for details on AI Vision usage limits and model availability.