OpenAI’s agent tool may be nearing release

OpenAI may be about to release an AI tool that can control your computer and perform actions on your behalf.

Tibor Blaho, a software engineer known for accurately leaking upcoming AI products, claims to have uncovered evidence of OpenAI’s long-rumored Operator tool. Publications including Bloomberg have previously reported on Operator, which is said to be an “agent” system capable of autonomously handling tasks such as writing code and booking travel.

According to The Information, OpenAI is targeting January as Operator’s release month. The code revealed by Blaho this weekend adds credence to that reporting.

OpenAI’s ChatGPT client for macOS has gained options, hidden for now, to define shortcuts for “Toggle Operator” and “Force Quit Operator,” according to Blaho. OpenAI has also added references to Operator on its website, though they are not yet publicly visible, Blaho said.

The OpenAI website already contains references to Operator/OpenAI CUA (Computer Use Agent) – “Operator System Card Table”, “Operator Research Evaluation Table”, and “Operator Rejection Rate Table”

Including comparisons with Claude 3.5 Sonnet computer use, Google Mariner, etc.

(View tables… pic.twitter.com/OOBgC3ddkU

– Tibor Blaho (@btibor91) January 20, 2025

According to Blaho, the OpenAI website also contains not-yet-public tables comparing Operator’s performance to that of other computer-using AI systems. The tables may be placeholders, but if the numbers are accurate, they suggest that Operator is not 100% reliable, depending on the task.

On OSWorld, a benchmark that attempts to mimic a real computer environment, the “OpenAI Computer Use Agent (CUA)” (possibly the AI model powering Operator) scores 38.1%, ahead of Anthropic’s computer-controlling model but well short of the 72.4% that humans score. OpenAI CUA outperforms humans on WebVoyager, which evaluates an AI’s ability to navigate and interact with websites, but falls short of human-level scores on another web-based benchmark, WebArena, according to the leaked benchmarks.

Operator also struggles with tasks that a human could perform easily, if the leak is to be believed. In a test that tasked Operator with signing up with a cloud provider and launching a virtual machine, it succeeded only 60% of the time. Tasked with creating a Bitcoin wallet, Operator succeeded only 10% of the time.

OpenAI’s imminent entry into the AI agent space comes as rivals, including the aforementioned Anthropic, Google, and others, make plays for the nascent segment. AI agents may be risky and speculative, but tech giants are already touting them as the next big thing in AI. According to analytics firm Markets and Markets, the market for AI agents could be worth $47.1 billion by 2030.

Agents today are fairly primitive. But some experts have raised concerns about their safety if the technology improves quickly.

One of the leaked charts shows Operator performing well on selected safety evaluations, including tests that try to get the system to perform “illicit activities” and search for “sensitive personal data.” Safety testing is reportedly among the reasons for Operator’s long development cycle. In a recent X post, Wojciech Zaremba, a co-founder of OpenAI, criticized Anthropic for releasing an agent that he claims lacks safety mitigations.

“I can only imagine the negative reactions if OpenAI made a similar release,” Zaremba wrote.

It is worth noting that OpenAI has been criticized by AI researchers, including former employees, for allegedly deemphasizing safety work in favor of shipping its technology quickly.
