Omni-Vision (Visual Intelligence)
Omni-Vision is Xenon's visual intelligence layer. It enables coordinate-free element interaction and visual assertions — testing your app as a human sees it, not as the DOM describes it.
The implementation is OCR-first: every screen is run through Tesseract.js (with image preprocessing via sharp to clean up small fonts and low-contrast text) to produce a list of on-screen text fragments with their bounding boxes. For matches that OCR can't disambiguate (visual descriptions, icon recognition, complex assertions), the configured LLM provider (Gemini / OpenAI / Anthropic) is used as a fallback over the same screenshot.
For interactive exploration and code generation powered by Omni-Vision, see the OmniInspector.
Capabilities
Visual Element Detection
Instead of fragile XPath or accessibility selectors, describe what you want to interact with:
// Traditional approach (breaks when DOM changes):
driver.findElement(By.xpath("//XCUIElementTypeButton[@name='Submit Order']"));
// Omni-Vision approach (resilient to UI changes):
// Xenon exposes Omni-Vision as session-scoped plugin endpoints.
// Use the AI locator test endpoint with a custom strategy:
//
// POST /session/:sessionId/xenon/test-locator
// Body: { "strategy": "-custom:ai-icon", "selector": "the green Submit Order button at the bottom of the screen" }
//
// Returns: an array of matches with rect + confidence (virtual elements).
The returned coordinates are mapped to the device's actual screen resolution, accounting for DPI scaling and viewport offsets.
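The mapping from screenshot pixels to device coordinates can be sketched as follows. This is a minimal illustration, not Xenon's actual code: the class name `ScreenshotToDeviceMapper`, its parameters, and the assumption that the screenshot is a uniformly scaled copy of the viewport are all hypothetical.

```java
// Hypothetical sketch: converts a rectangle found in a screenshot into a
// tap point in device coordinates. Assumes the screenshot is the viewport
// scaled uniformly by `scale` (screenshot pixels per device point) and that
// the viewport starts at (offsetX, offsetY) on the device screen.
// None of these names come from Xenon itself.
final class ScreenshotToDeviceMapper {
    private final double scale;
    private final int offsetX;
    private final int offsetY;

    ScreenshotToDeviceMapper(double scale, int offsetX, int offsetY) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        this.scale = scale;
        this.offsetX = offsetX;
        this.offsetY = offsetY;
    }

    /** Returns {x, y}: the centre of a screenshot-space rect, in device coordinates. */
    int[] centerOnDevice(int rectX, int rectY, int rectWidth, int rectHeight) {
        double centerX = rectX + rectWidth / 2.0;
        double centerY = rectY + rectHeight / 2.0;
        // Undo the DPI scaling first, then shift by the viewport offset.
        int deviceX = (int) Math.round(centerX / scale) + offsetX;
        int deviceY = (int) Math.round(centerY / scale) + offsetY;
        return new int[] { deviceX, deviceY };
    }
}
```

For example, with a 3x screenshot and a 44-point vertical offset, a 120 x 60 rect at (300, 600) in the screenshot maps to the device point (120, 254).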
Visual Assertions
Verify screen state using natural language instead of element queries:
// Assert that a success message is visible
//
// POST /session/:sessionId/xenon/assert
// Body: { "instruction": "A green success toast message is visible at the top" }
//
// Returns: { "result": true/false, "message": "..." }
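A client could build this request with the JDK's built-in `java.net.http` API, as in the sketch below. The endpoint path and the `instruction` field come from the description above. The class name `VisualAssertRequest`, the server URL used in the example, and the hand-rolled JSON escaping are assumptions made for illustration; a real client would normally use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) the visual-assert request.
final class VisualAssertRequest {

    /** Minimal JSON string escaping: backslash, double quote, and newline only. */
    static String jsonString(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            if (c == '\\' || c == '"') {
                sb.append('\\').append(c);
            } else if (c == '\n') {
                sb.append("\\n");
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Request body in the shape shown above: { "instruction": "..." }. */
    static String body(String instruction) {
        return "{\"instruction\":" + jsonString(instruction) + "}";
    }

    /** serverUrl is an assumption, e.g. a local Appium server address. */
    static HttpRequest build(String serverUrl, String sessionId, String instruction) {
        URI uri = URI.create(serverUrl + "/session/" + sessionId + "/xenon/assert");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(instruction)))
                .build();
    }
}
```

The resulting request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, after which the `result` and `message` fields are read from the JSON response.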
Smart Interaction (Enterprise)
Omni-Vision integrates with Xenon's Omni-Interaction layer:
- smartTap: POST /session/:sessionId/xenon/smart-tap (OCR-driven tap by visible text)
- uiInventory: POST /session/:sessionId/xenon/ui-inventory (export UI metadata derived from OCR + lightweight heuristics)
These commands can also be invoked via executeScript from any Appium client:
// Java: Smart Tap
driver.executeScript("xe:smartTap", Map.of("text", "Submit", "index", 1, "takeANewScreenShot", true));
// Java: UI Inventory
driver.executeScript("xe:uiInventory", Map.of("maxItems", 200, "takeANewScreenShot", true));
Aliases: xe:omniClick is an alias for xe:smartTap, and xe:uiScanExport is an alias for xe:uiInventory.
Architecture
OmniVisionService handles:
- Screenshot capture and preprocessing (sharp: contrast adjustment, conditional upscaling)
- Tesseract.js OCR with a shared worker
- AI-vision fallback through the configured LLM provider for ambiguous cases
- Coordinate mapping from OCR/model output to device screen coordinates
- DPI and viewport offset correction
- A short-lived UI-lens cache (10 s TTL) to avoid repeated OCR on rapid calls
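The short-lived cache in the last item can be illustrated with a single-entry TTL cache. This is only a sketch of the general technique, not the plugin's actual cache: the class name `TtlCache`, the injected clock, and the single-slot design are assumptions.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: caches one value (e.g. the latest OCR result) and
// recomputes it only once the entry is older than the TTL.
final class TtlCache<T> {
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can control time
    private T value;
    private long storedAt;
    private boolean populated;

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, calling `loader` only if the entry is missing or expired. */
    synchronized T get(Supplier<T> loader) {
        long now = clock.getAsLong();
        if (!populated || now - storedAt >= ttlMillis) {
            value = loader.get();
            storedAt = now;
            populated = true;
        }
        return value;
    }
}
```

With a 10 s TTL, `new TtlCache<>(10_000L, System::currentTimeMillis)` would run the expensive OCR pass at most once per ten seconds while rapid successive calls reuse the same result.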
VisionAssertionService handles:
- Natural language assertion parsing
- Confidence scoring
- Pass/fail determination with explanation
Integration with Self-Healing
Omni-Vision powers Tier 4 of the Self-Healing Engine. When structural matching (Tiers 1–3) fails to find a broken element, the healing orchestrator falls back to visual detection:
- Captures a screenshot of the current screen
- Runs Tesseract.js OCR over the preprocessed screenshot
- Tries to match the broken element's expected text against the OCR results
- If OCR can't pin a unique match (or the element has no useful text), falls back to the AI-vision provider for description-based grounding
- Returns coordinates for interaction
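The steps above amount to "try OCR text first, fall back to AI vision otherwise". The sketch below shows that control flow under stated assumptions: `VisualFallback`, `TextBox`, the exact case-insensitive text comparison, and the `aiVision` callback are hypothetical stand-ins, not Xenon's internal types or matching rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of the Tier 4 decision: OCR match first, AI vision second.
final class VisualFallback {

    /** One OCR hit: recognised text plus its bounding box in screen coordinates. */
    static final class TextBox {
        final String text;
        final int x, y, width, height;

        TextBox(String text, int x, int y, int width, int height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Returns the {x, y} centre of the single OCR box whose text equals
     * expectedText (ignoring case). If there is no usable text, no match, or
     * more than one match, the result of the aiVision callback is returned.
     */
    static Optional<int[]> locate(String expectedText, List<TextBox> ocrResults,
                                  Supplier<Optional<int[]>> aiVision) {
        if (expectedText != null && !expectedText.trim().isEmpty()) {
            String wanted = expectedText.trim();
            List<TextBox> matches = new ArrayList<>();
            for (TextBox box : ocrResults) {
                if (box.text.trim().equalsIgnoreCase(wanted)) {
                    matches.add(box);
                }
            }
            if (matches.size() == 1) {
                TextBox hit = matches.get(0);
                return Optional.of(new int[] { hit.x + hit.width / 2, hit.y + hit.height / 2 });
            }
        }
        // OCR could not pin a unique match: defer to description-based grounding.
        return aiVision.get();
    }
}
```

Passing the AI step as a callback keeps the expensive provider call lazy, so it only runs when OCR fails to produce exactly one candidate.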
This makes Xenon's self-healing resilient even when the DOM changes completely — as long as the element is visually present on screen.
Requirements
- Tesseract.js is bundled with the plugin — OCR works out of the box, no setup needed.
- AI-vision fallback requires a configured LLM provider. Set XENON_AI_PROVIDER and the matching API key (see AI Features). Without this, Omni-Vision still works for OCR-driven flows but gracefully degrades on visually ambiguous matches.
- Screenshot capabilities must be enabled on the session.
- Works with both iOS and Android devices.
Omni-Vision works best for elements with distinctive visual characteristics (unique text, icons, colors). For visually ambiguous elements (plain list items, generic buttons), structural matching in Tiers 1–3 is more reliable.