ByteDance’s UI-TARS can take over your PC, outperforming GPT-4o and Claude

A new AI agent has emerged from TikTok’s parent company to control your computer and perform complex workflows.

Just like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning, and takes autonomous actions step by step.

The PC/MacOS agents, trained on roughly 50 billion tokens and offered in 7B- and 72B-parameter versions, achieve state-of-the-art (SOTA) performance on 10-plus GUI benchmarks spanning performance, perception, grounding and overall agent capabilities, consistently beating OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS constantly learns from its mistakes and adapts to unexpected situations with minimal human intervention,” researchers from ByteDance and Tsinghua University wrote in a new research paper.

Source: Arxiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.

Its user interface features two tabs – one on the left that shows its step-by-step “thinking,” and a larger one on the right where it pulls up files, websites and apps and takes action automatically.

For example, in a demo video released today, the model is asked to “search for round-trip flights from SEA to New York City on the 5th and back on the 10th of the following month and filter by price in ascending order.”

In response, UI-TARS goes to the Delta Airlines website, fills out the From and To fields, clicks on the relevant dates, and sorts and filters by price, explaining each step in its thinking box before taking action.

In another scenario, UI-TARS is asked to install the autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes the task, with a rough code sketch of the underlying loop after the list:

  • It first states, “To continue installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”
  • Recognizing that the VS Code window is not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionality is available for the next steps.”
  • It then explains that it must access the Extensions view in VS Code, which can be done by clicking a tab in the left sidebar.
  • Running into a minor glitch, it notes: “The previous attempt to click on the Extensions tab may not have been accurate enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”
  • Once in the Extensions view, it types “autoDocstring” before concluding: “Since the installation process has already started, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
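
Both demos follow the same general observe-think-act pattern common to GUI agents. Below is a minimal, hypothetical Python sketch of that loop; the agent object and its observe/think/act methods are assumptions for illustration, not ByteDance’s actual API.

```python
# Conceptual sketch of a GUI agent's observe-think-act loop; every name here
# (Step, agent.observe, agent.think, agent.act) is an illustrative placeholder,
# not part of UI-TARS's real interface.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str                # the reasoning shown in the left-hand "thinking" pane
    action: str                 # e.g. "click", "type", "wait", "finish"
    target: str | None = None   # element or text the action applies to

def run_task(agent, task: str, max_steps: int = 30) -> list[Step]:
    history: list[Step] = []
    for _ in range(max_steps):
        screenshot = agent.observe()                   # capture the current GUI state
        step = agent.think(task, screenshot, history)  # decide the next step and explain it
        print(step.thought)                            # surface the reasoning to the user
        if step.action == "finish":
            break
        agent.act(step)                                # perform the click, keystroke or wait
        history.append(step)                           # keep context for subsequent steps
    return history
```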

Outperforming its competitors

Across a variety of benchmarks, the researchers reported that UI-TARS consistently outperformed OpenAI’s GPT-4o; Anthropic’s Claude 3.5 Sonnet; Google’s Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For example, in VisualWebBench — which measures a model’s ability to ground web elements, including webpage question answering and optical character recognition — UI-TARS 72B scored 82.8%, besting GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed much better on the WebSRC (understanding semantic content and layout in web contexts) and ScreenQA-short (understanding complex mobile screen layouts and web architecture) benchmarks. UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, beating Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers wrote. “This cognitive ability lays the foundation for agent tasks, where accurate environmental understanding is critical for task execution and decision making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which evaluate a model’s ability to understand and localize elements in GUIs. The researchers further tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, benchmarking it on OSWorld (which evaluates open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).

Source: Arxiv

Under the hood

To help it take step-by-step actions and recognize what it sees, UI-TARS was trained on a large dataset of screenshots annotated with metadata including element description and type, visual description, bounding boxes (positional information), element function, and text drawn from websites, apps and various operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only the elements but also their spatial relationships and the overall layout.
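
The article does not spell out the exact annotation schema, but a record of this kind can be pictured roughly as follows; the field names and example values are assumptions for illustration, not the released dataset format.

```python
# Rough sketch of a per-element screenshot annotation; field names and values
# are illustrative assumptions, not the actual dataset schema.
from dataclasses import dataclass

@dataclass
class ElementAnnotation:
    element_type: str                        # e.g. "button", "text_field", "tab"
    description: str                         # what the element is
    visual_description: str                  # how it looks (color, icon, shape)
    bounding_box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    function: str                            # what interacting with it does
    text: str                                # any visible text

# Hypothetical example: the Extensions tab in the VS Code sidebar
extensions_tab = ElementAnnotation(
    element_type="tab",
    description="Extensions view button in the left activity bar",
    visual_description="icon of four squares with one detached",
    bounding_box=(0, 180, 48, 228),
    function="opens the Extensions marketplace panel",
    text="",
)
```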

The model also uses state-transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action – such as a mouse click or keyboard entry – has occurred. Meanwhile, Set-of-Mark (SoM) prompting superimposes distinct marks (letters and numbers) onto specific regions of an image.
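
As a rough illustration of SoM-style marking (not ByteDance’s implementation), numbered labels can be drawn over candidate element regions with a few lines of Pillow; the file names and box coordinates below are made up.

```python
# Minimal sketch of Set-of-Mark (SoM) prompting: draw numbered markers over
# candidate GUI elements so the model can refer to regions by label.
# Box coordinates and file names are illustrative.
from PIL import Image, ImageDraw

def overlay_marks(screenshot: Image.Image,
                  boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)  # outline the element
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")           # numeric mark
    return marked

screenshot = Image.open("screen.png")                       # current GUI screenshot
overlay_marks(screenshot, [(0, 180, 48, 228)]).save("screen_marked.png")
```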

The model is equipped with both short-term and long-term memory to handle the task at hand while also retaining historical interactions to improve later decision-making. The researchers trained it to perform both System 1 thinking (fast, automatic and intuitive) and System 2 thinking (slow and deliberate). This allows for multi-step decision-making, “reflective” thinking, milestone recognition and error correction.
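
One simple way to picture how the two memories feed each decision is to build the model’s context from the last few steps plus a summary of older interactions; this is a hedged sketch of the idea, not UI-TARS’s actual mechanism.

```python
# Hedged sketch: combine long-term memory (a summary of past interactions)
# with short-term memory (the most recent steps) when building the context
# for the next decision. Not UI-TARS's actual implementation.
def build_context(task: str, recent_steps: list[str],
                  long_term_summary: str, window: int = 5) -> str:
    short_term = "\n".join(recent_steps[-window:])   # working memory: last few steps verbatim
    return (
        f"Task: {task}\n"
        f"Relevant past experience: {long_term_summary}\n"  # condensed long-term memory
        f"Recent steps:\n{short_term}\n"
        "Decide the next step:"
    )
```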

The researchers emphasized that it is crucial for the model to maintain consistent goals and to engage in trial and error, hypothesizing, testing and evaluating potential actions before completing a task. To support this, they curated two types of data: error-correction data and post-reflection data. The error-correction data identifies mistakes and labels the corrective actions; the post-reflection data simulates the recovery steps that follow.
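
These two data types can be pictured as annotated trajectory snippets along the following lines; the structure, field names and action strings are assumptions for illustration, not the paper’s released format.

```python
# Illustrative sketch of the two reflection-tuning data types described above;
# structure, field names and action strings are assumptions, not the paper's format.

error_correction_example = {
    "observation": "Extensions panel did not open after the last click",
    "erroneous_action": "click(sidebar_icon_3)",     # the step that went wrong
    "error_label": "clicked the wrong sidebar tab",
    "corrective_action": "click(extensions_tab)",    # what should have been done instead
}

post_reflection_example = {
    "observation": "Extensions panel still closed after the first attempt",
    "reflection": "The previous click may have missed the Extensions tab; retry it",
    "recovery_steps": [                              # how the agent gets back on track
        "click(extensions_tab)",
        "verify(extensions_panel_open)",
    ],
}
```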

“This strategy ensures that the agent not only learns to avoid errors but also dynamically adapts when they occur,” the researchers wrote.

UI-TARS clearly shows impressive potential, and it will be interesting to watch its use cases evolve in the increasingly competitive field of AI agents. As the researchers note: “Looking to the future, while native agents represent a major leap forward, the future lies in the integration of active learning and lifelong learning, where agents autonomously drive their own learning through ongoing real-world interactions.”

The researchers note that Claude Computer Use “performs strongly on web-based tasks but has significant difficulties in mobile scenarios, suggesting that Claude’s ability to operate a GUI has not transferred well to the mobile domain.”

By contrast, “UI-TARS shows excellent performance in both the website and mobile domain.”
