Vision Layer Interface (VLI)

Software interfaces were built for humans. VLI enables AI Agents to use them too!

We have unlocked vision capabilities for AI Agents to harness.

Building AI Agents that go beyond chatbots or simplistic tasks requires interacting with a variety of applications, file systems, and unstructured data.

Office workflows involve:

Spreadsheets (embedded logic)
Multiple web portals (sensitive to bot detection)
PDFs (unstructured and often scanned images)
Desktop apps (without APIs or backend integrations)
File systems (ever-changing)

This is all optimized for human workers; they can navigate and interact with screens using ‘vision’.

Office applications and interfaces connected for vision-based AI agents

VLI sits in the architecture when there is no other way besides a screen

AI agent connecting through API, MCP, and VLI to structured data, tools, and human interfaces

How VLI Works

01
VLI captures the current screen state. It sees the screen the way a human would:
- open windows
- browser tabs
- forms
- tables
- buttons
- menus
- popups
- errors
- document viewers

02
VLI interprets the screen semantically. It processes pixels to determine (x, y) coordinates and identifies meaningful interface objects such as buttons, checkboxes, disabled fields, table rows, windows, text, and form elements.

03
The Launch AI workflow engine determines the next action. Some actions are deterministic, such as precise keystrokes to type a predetermined value. Others require in situ reasoning, such as deciding whether a popup is relevant or should be dismissed.

04
VLI sends instructions to the OS via keyboard and mouse functions.

In summary, VLI is an interactive layer that manipulates the screen for the Agent.
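The interpret and decide steps can be sketched roughly as follows. This is an illustration under stated assumptions, not VLI's implementation: the `UIElement` schema, the `popup_close` kind, and the `decide` function are hypothetical names. The rule "always dismiss a popup first" stands in for the model reasoning that would judge whether a popup is relevant.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """One interface object recovered from the screen (hypothetical schema)."""
    kind: str                        # e.g. "button", "field", "popup_close"
    label: str
    box: Tuple[int, int, int, int]   # left, top, width, height in pixels
    enabled: bool = True

    def center(self) -> Tuple[int, int]:
        left, top, width, height = self.box
        return (left + width // 2, top + height // 2)


def decide(elements: List[UIElement],
           planned: Tuple[str, str]) -> Tuple[str, Tuple[int, int]]:
    """Pick the next OS-level instruction as (verb, (x, y)).

    A popup close control is handled first (the in situ case, simplified
    here to "always dismiss"). Otherwise the planned, deterministic step
    is resolved to the center of the matching enabled element.
    """
    for element in elements:
        if element.kind == "popup_close":
            return ("click", element.center())
    verb, target = planned
    for element in elements:
        if element.label == target and element.enabled:
            return (verb, element.center())
    raise LookupError(f"no enabled element labelled {target!r}")
```

With a single "Save" button at box (800, 500, 80, 24), the planned step ("click", "Save") resolves to a click at (840, 512). Once a popup close control appears in the element list, it takes priority over the planned step.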

  action = {
      0: "open TMS dashboard",
      1: "click 'Exceptions'",
      2: "scrape 'all visible shipment exceptions with shipment ID, status, carrier, ETA, and customer'",
      3: "scroll 'through shipment exceptions table without missing rows'",
      4: "for each delayed shipment, open carrier portal",
      5: "type '{{tracking_number}}' into tracking search",
      6: "scrape 'latest tracking event, location, timestamp, and delay reason'",
      7: "return to TMS",
      8: "open shipment '{{shipment_id}}'",
      9: "update status notes using carrier tracking result",
      10: "click 'Save'",
      11: "scrape 'confirmation that shipment record was updated'"
  }

  decision = {
      0: "if carrier portal shows customs hold, flag shipment for human review",
      1: "if shipment ETA changed by more than 24 hours, create customer notification draft",
      2: "if tracking number is not found, mark exception as data issue"
  }
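Steps such as action 5 and action 8 carry `{{...}}` placeholders that must be filled with runtime data before execution. One way this could work is sketched below. The `render_step` helper and the sample shipment values are hypothetical and are not part of VLI.

```python
import re
from typing import Dict

# Matches placeholders written as {{name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_step(step: str, variables: Dict[str, str]) -> str:
    """Replace each {{name}} in a workflow step with its runtime value."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Fail loudly rather than typing a literal "{{...}}" into a form.
            raise KeyError(f"missing runtime value for {name!r}")
        return variables[name]
    return PLACEHOLDER.sub(substitute, step)


# Two steps copied from the workflow map above, with made-up shipment data.
action = {
    5: "type '{{tracking_number}}' into tracking search",
    8: "open shipment '{{shipment_id}}'",
}
shipment = {"tracking_number": "TRK-001", "shipment_id": "SHP-42"}
rendered = {index: render_step(step, shipment) for index, step in action.items()}
```

Raising on a missing variable is a deliberate choice in this sketch: typing an unfilled placeholder into a live system is harder to detect and undo than a step that stops early.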

VLI training

To navigate the interface of any software (web- or desktop-based) intelligently, VLI references embedded skills acquired during training sessions.

In a training session, VLI teaches models about all possible navigation routes, almost like traversing the DOM of the application.
Every navigation route is mapped into a tree whose nodes are buttons, fields, or other navigable elements in the UI.
This way, when your Agent has a task, VLI already knows how to accomplish the steps involved.
Navigation speeds up, there is far less need to make decisions in situ, and, most importantly, token cost drops substantially.
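To make the navigation-tree idea concrete, here is a minimal sketch of looking up a route in such a tree with a breadth-first search. The tree contents and the `find_route` function are hypothetical. How VLI actually stores and queries trained routes is not described here.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical navigation tree: each key is a UI element, and each value
# lists the elements that become reachable after interacting with it.
NAV_TREE: Dict[str, List[str]] = {
    "Dashboard": ["Exceptions", "Reports menu"],
    "Exceptions": ["Shipment row"],
    "Reports menu": ["Monthly statement"],
    "Shipment row": ["Status notes", "Save"],
}


def find_route(tree: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest interaction path from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for child in tree.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None  # goal is not reachable from start in the trained tree


route = find_route(NAV_TREE, "Dashboard", "Save")
```

Because the route is read from a precomputed structure, no model call is needed to choose each intermediate element, which is the source of the speed and token savings described above.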

Sequence diagram of VLI training flow between User, Training Agent, Target Application, UI Observer, VLI Knowledge Base, and Route validator

Command example

The agent must locate the visible interface element best matching “Approve invoice,” determine its screen coordinates, and execute a click.

  click 'Approve invoice'
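A rough sketch of "best matching" follows. It assumes the screen has already been interpreted into labelled bounding boxes and uses simple string similarity from Python's standard `difflib`. The `best_match` function, the 0.6 threshold, and the sample boxes are illustrative assumptions. A vision model would do this matching in practice.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height in pixels


def best_match(description: str, elements: Dict[str, Box],
               threshold: float = 0.6) -> Tuple[str, Tuple[int, int]]:
    """Return (label, center point) of the element most similar to description."""
    def score(label: str) -> float:
        return SequenceMatcher(None, description.lower(), label.lower()).ratio()

    label = max(elements, key=score)
    if score(label) < threshold:
        raise LookupError(f"nothing on screen resembles {description!r}")
    left, top, width, height = elements[label]
    return label, (left + width // 2, top + height // 2)


# Made-up detections for an invoice screen.
visible = {
    "Approve Invoice": (600, 400, 120, 32),
    "Reject invoice": (740, 400, 120, 32),
    "Cancel": (880, 400, 80, 32),
}
label, point = best_match("Approve invoice", visible)
```

The threshold guards against clicking the least-bad candidate when the intended element is simply not on screen.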

Precise coordinates + Intelligent models + App knowledge = Success

VLI agent analyzing overlapping application interfaces to derive precise screen coordinates

The success of the paradigm depends on the Agent's ability to translate visual understanding into precise screen action. It is not enough for the Agent to know that a button exists (LLMs do that part pretty well).

The system must identify where that button exists on the screen, determine whether it is interactable, and execute the correct mouse or keyboard event.

VLI connects model-level reasoning, app-specific knowledge, and OS-level control

A vision model may determine that the target is the "Submit Claim" button near the bottom-right of a form. VLI then converts that understanding into coordinates, validates the region, performs the click, captures the post-action screen, and determines whether the expected state change occurred.

PyAutoGUI or an equivalent OS-level automation layer can be used for the physical execution of mouse and keyboard events. But the core value of VLI is not the click itself. The value is deciding what to click, why to click it, how to recover if the click fails, and how to prove what happened afterward.
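The validate, click, and re-capture sequence might look like the sketch below. The `capture` and `click` callables are injected so that the logic stays independent of any particular automation library. All function names here are hypothetical, and comparing screenshot hashes is only a crude proxy for "the expected state change occurred".

```python
import hashlib
from typing import Callable, Dict, Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height in pixels


def region_is_valid(box: Box, screen_size: Tuple[int, int]) -> bool:
    """True when the box has positive area and lies fully on the screen."""
    left, top, width, height = box
    screen_w, screen_h = screen_size
    return (width > 0 and height > 0 and left >= 0 and top >= 0
            and left + width <= screen_w and top + height <= screen_h)


def click_and_verify(box: Box, screen_size: Tuple[int, int],
                     capture: Callable[[], bytes],
                     click: Callable[[int, int], None]) -> Dict[str, object]:
    """Validate the region, click its center, and report whether the screen changed."""
    if not region_is_valid(box, screen_size):
        return {"clicked": False, "changed": False, "point": None}
    before = hashlib.sha256(capture()).hexdigest()
    point = (box[0] + box[2] // 2, box[1] + box[3] // 2)
    # In a real run, `click` could be an OS-level call such as pyautogui.click.
    click(*point)
    after = hashlib.sha256(capture()).hexdigest()
    return {"clicked": True, "changed": before != after, "point": point}
```

Returning a structured result instead of raising keeps the outcome available for the recovery and evidence steps: the caller can retry, re-resolve the target, or log the failed attempt.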

That is the difference between desktop scripting and vision-based agentic execution

What does VLI do during runtime?

Load: Read the next action from the workflow map
Observe: Capture a before screenshot
Resolve: Determine the target element, region, value, or coordinate
Act: Execute mouse, keyboard, scroll, or scrape behavior
Capture: Take an after screenshot
Annotate: Draw a red box around the interaction or screen region used
Verify: Confirm the expected screen change or extracted result
Log: Save screenshots, action metadata, coordinates, and output
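The eight runtime phases can be read as one loop per workflow step. The following is a minimal sketch with every capability injected as a plain function. `StepRecord`, `run_workflow`, and the callables are hypothetical names, and the annotation is kept as the box a renderer would outline in red rather than an actual drawing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height in pixels


@dataclass
class StepRecord:
    index: int
    action: str
    before: bytes
    after: bytes
    point: Tuple[int, int]
    annotation: Box      # region to outline in red on the saved screenshot
    verified: bool


def run_workflow(workflow: Dict[int, str],
                 capture: Callable[[], bytes],
                 resolve: Callable[[str, bytes], Box],
                 act: Callable[[str, Tuple[int, int]], None],
                 verify: Callable[[str, bytes, bytes], bool]) -> List[StepRecord]:
    """Run each step through Load, Observe, Resolve, Act, Capture, Annotate, Verify, Log."""
    records: List[StepRecord] = []
    for index in sorted(workflow):
        action = workflow[index]                                  # Load
        before = capture()                                        # Observe
        box = resolve(action, before)                             # Resolve
        point = (box[0] + box[2] // 2, box[1] + box[3] // 2)
        act(action, point)                                        # Act
        after = capture()                                         # Capture
        annotation = box                                          # Annotate
        ok = verify(action, before, after)                        # Verify
        records.append(StepRecord(index, action, before, after,
                                  point, annotation, ok))         # Log
        if not ok:
            break  # stop so a recovery strategy or a human can take over
    return records
```

Logging the record before checking the verification result means a failed step still leaves its screenshots and coordinates behind as evidence.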

VLI Reference sheet

Atomic Step | Purpose | Example Call
click | Select a visible UI element by description | click 'Approve button'
click with coordinates | Click a known screen position | click '840, 512'
left_click | Explicit left mouse click on an element | left_click 'Save changes'
double_click | Open a file, folder, row, app, or record | double_click 'tenant ledger PDF'
right_click | Open a context menu or secondary action menu | right_click 'shipment exception row'
middle_click | Trigger middle-click behavior, such as opening a link in a new tab | middle_click 'customer profile link'
hover | Move cursor over an element to reveal menus, tooltips, or hidden controls | hover 'Reports menu'
move_to | Move cursor to an element or coordinate without clicking | move_to 'download icon'
drag | Click, hold, move, and release an item onto a target | drag 'invoice PDF' to 'Upload box'
drag_select | Select text, rows, cells, or a screen region by dragging | drag_select 'rows 10 through 25 in table'
resize | Resize a window, pane, column, or object by dragging its edge | resize 'spreadsheet column B' to fit content
type | Enter static text into the active field | type 'Approved after validation'
type with variable | Enter runtime data into the active field | type '{{invoice_number}}'
type from previous action | Reuse captured output from an earlier action | type 'action 6'
clear | Clear text or selected content from a field | clear 'Search field'
copy | Copy selected text, file, table, or region | copy 'selected invoice rows'
paste | Paste clipboard contents into the active field or location | paste '{{copied_table}}'
cut | Cut selected text, file, table rows, or content | cut 'selected spreadsheet rows'
press | Send a single keyboard input or shortcut | press 'enter'
press with modifier | Send a keyboard combination | press 'cmd+space'
key_sequence | Send multiple key events in sequence | key_sequence 'tab, tab, enter'
scroll | Scroll to reveal hidden content or find a described element | scroll 'to find payment history table'
scroll_down | Scroll downward within the active region | scroll_down 'invoice line items table'
scroll_up | Scroll upward within the active region | scroll_up 'return to table header'
scroll_left | Scroll horizontally left within a wide table or panel | scroll_left 'wide spreadsheet table'
scroll_right | Scroll horizontally right within a wide table or panel | scroll_right 'to find total amount column'
page_down | Move down by one page of content | page_down 'PDF document viewer'
page_up | Move up by one page of content | page_up 'PDF document viewer'
go_to_top | Move to the top of a page, table, or document | go_to_top 'shipment table'
go_to_bottom | Move to the bottom of a page, table, or document | go_to_bottom 'activity log'
scrape | Extract visible text or values from the screen | scrape 'invoice total and due date'
scrape_table | Extract structured rows and columns from a visible table | scrape_table 'shipment exceptions table'
scrape_region | Extract data from a specific visual region | scrape_region 'top-right summary card'
screenshot | Capture the current screen or described context | screenshot 'invoice approval page'
capture_region | Capture a specific area of the screen | capture_region 'confirmation message area'
find | Locate a visible element, label, field, value, or text on screen | find 'Submit claim button'
wait | Pause until a described condition becomes true | wait 'until export completes'
wait_for_element | Wait for a specific UI element to appear | wait_for_element 'Download complete message'
wait_for_disappear | Wait until an element, spinner, modal, or loader disappears | wait_for_disappear 'loading spinner'
verify | Confirm that an expected screen state is visible | verify 'approval confirmation is visible'
verify_text | Confirm that specific text is present on screen | verify_text 'Invoice approved successfully'
verify_value | Confirm that a visible value matches the expected value | verify_value 'invoice total equals {{po_total}}'
compare | Compare visible values, extracted outputs, or screen states | compare 'invoice total' with '{{purchase_order_total}}'
download | Trigger and capture a file download from the interface | download 'monthly statement CSV'
upload | Upload a file through a file picker or drag-and-drop zone | upload '{{invoice_pdf}}' to 'Upload invoice box'
open_file_picker | Open a system file picker from the interface | open_file_picker 'Upload supporting document'
choose_file | Select a file inside a system dialog | choose_file '{{lease_document_path}}'
save_file | Save a file through a system dialog | save_file 'reconciliation_report.csv'
annotate | Mark an interaction or extraction region in the screenshot log | annotate 'clicked Approve button region'
log | Record structured metadata, output, coordinates, or decision state | log 'approval confirmation ID and timestamp'
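The example calls share a simple shape: a verb followed by one or more single-quoted arguments. A parser for that shape might look like the sketch below. The `Command` type and the three helpers are hypothetical, and the grammar is inferred from the examples on this page rather than from a published specification. Unquoted trailing words, such as "to fit content" in the resize example, are ignored by this sketch.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Command(NamedTuple):
    verb: str
    args: List[str]


QUOTED = re.compile(r"'([^']*)'")


def parse_command(line: str) -> Command:
    """Split a call such as "drag 'a' to 'b'" into its verb and quoted arguments."""
    verb, _, rest = line.strip().partition(" ")
    return Command(verb, QUOTED.findall(rest))


def as_coordinates(arg: str) -> Optional[Tuple[int, int]]:
    """Return (x, y) when the argument looks like '840, 512', else None."""
    match = re.fullmatch(r"\s*(\d+)\s*,\s*(\d+)\s*", arg)
    return (int(match.group(1)), int(match.group(2))) if match else None


def as_keys(arg: str) -> List[str]:
    """Split 'cmd+space' or 'tab, tab, enter' into individual key names."""
    return [key.strip() for key in re.split(r"[+,]", arg) if key.strip()]
```

A dispatcher could then check `as_coordinates` first for `click` calls and fall back to resolving the argument as an element description when it returns None.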
