LangChain shows AI agents are not yet at human level because they get overwhelmed by tools

As AI agents have shown promise, organizations have had to grapple with whether a single agent is sufficient, or whether they should invest in building a broader multi-agent network that touches more points in their organization.

Orchestration framework company LangChain sought to get closer to an answer. It subjected an AI agent to several experiments, which found that single agents hit a limit of context and tools before their performance begins to degrade. These experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.

In a blog post, LangChain detailed a set of experiments it ran with a single ReAct agent and benchmarked its performance. The main question LangChain hoped to answer was, “At what point does a single ReAct agent become overloaded with instructions and tools, and subsequently sees performance degrade?”

LangChain chose to use the ReAct agent framework because it is “one of the most basic agentic architectures.”

While benchmarking agent performance can often lead to misleading results, LangChain chose to limit the test to two easily quantifiable tasks: answering questions and scheduling meetings.

“There are plenty of existing benchmarks for tool use and tool calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, which is responsible for two main domains of work — responding to and scheduling meeting requests and supporting customers with their questions.”

LangChain’s experiment parameters

LangChain mainly used prebuilt ReAct agents through its LangGraph platform. These agents were instantiated with tool-calling large language models (LLMs) that became part of the benchmark. The LLMs included Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and three OpenAI models: GPT-4o, o1 and o3-mini.
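
For illustration, here is a minimal sketch of what such a setup looks like with LangGraph’s prebuilt ReAct agent. The tool functions and the sample request are hypothetical stand-ins — LangChain did not publish its assistant’s actual code:

```python
# A minimal sketch of a prebuilt ReAct agent in LangGraph, assuming
# hypothetical stub tools; LangChain's real email assistant is not published.
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent


def send_email(to: str, subject: str, body: str) -> str:
    """Send an email reply (stubbed for illustration)."""
    return f"Email sent to {to}"


def schedule_meeting(attendee: str, start: str, duration_minutes: int) -> str:
    """Create a calendar event (stubbed for illustration)."""
    return f"Meeting with {attendee} at {start} for {duration_minutes} min"


# One of the models LangChain tested; the others were swapped in the same way.
model = ChatAnthropic(model="claude-3-5-sonnet-20241022")
agent = create_react_agent(model, tools=[send_email, schedule_meeting])

result = agent.invoke(
    {"messages": [("user", "Reply to Jane and book 30 minutes tomorrow at 10am.")]}
)
print(result["messages"][-1].content)
```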

The company broke the test down to better evaluate the email assistant’s performance on the two tasks, and created a list of steps for it to follow. It started with the customer support capabilities, looking at how the agent accepts an email from a customer and responds with an answer.

LangChain first evaluated the tool-calling trajectory, or the tools the agent invoked. If the agent followed the correct order, it passed the test. Next, the researchers asked the assistant to respond to an email and used an LLM to judge its performance.
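
A rough sketch of that two-stage check might look like the following. The helper names, expected-trajectory format and judging prompt are assumptions, not LangChain’s published rubric:

```python
# Sketch of the two-stage evaluation described above, with hypothetical names;
# LangChain's actual grading prompts and rubric are not published.
from langchain_openai import ChatOpenAI


def tool_call_names(messages) -> list[str]:
    """Extract the ordered list of tool names the agent called."""
    names = []
    for msg in messages:
        for call in getattr(msg, "tool_calls", []) or []:
            names.append(call["name"])
    return names


def passes_trajectory(messages, expected: list[str]) -> bool:
    """Stage 1: the agent must call the expected tools in the expected order."""
    return tool_call_names(messages) == expected


def judge_reply(reply: str, request: str) -> bool:
    """Stage 2: an LLM grades the drafted email against the customer request."""
    judge = ChatOpenAI(model="gpt-4o")
    verdict = judge.invoke(
        "Does this reply correctly and politely answer the request? "
        f"Answer PASS or FAIL.\n\nRequest: {request}\n\nReply: {reply}"
    )
    return "PASS" in verdict.content
```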

For the second domain of work, calendar scheduling, LangChain focused on the agent’s ability to follow instructions.

“In other words, the agent needs to remember specific instructions provided, such as how long meetings with different parties should last,” the researchers wrote.
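
To make that concrete, here is a sketch of how such niche rules might be folded into the agent’s system prompt. The rules themselves are invented for illustration; `model` and `schedule_meeting` come from the earlier sketch, and the `prompt` argument assumes a recent langgraph version:

```python
# Illustrative only: the kind of niche scheduling rules the agent had to
# remember, baked into its system prompt. These rules are invented.
SCHEDULING_RULES = (
    "You are an internal email assistant.\n"
    "- Meetings with external customers last 45 minutes.\n"
    "- Internal syncs last 15 minutes.\n"
    "- Never book anything on Friday afternoons.\n"
)

scheduling_agent = create_react_agent(
    model, tools=[schedule_meeting], prompt=SCHEDULING_RULES
)
```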

Overloading the agent

Once the parameters were established, LangChain set out to overwhelm the email assistant agent.

It defined 30 tasks each for calendar scheduling and customer support. These were run three times (for a total of 90 runs). The researchers created separate calendar scheduling and customer support agents to better evaluate the tasks.
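
A minimal harness in that spirit might look like this, reusing `agent` and `passes_trajectory` from the earlier sketches; the structure and contents of `tasks` are hypothetical:

```python
# Sketch of a repeated-run evaluation loop; the task format is an assumption.
tasks = [
    {"request": "Book 30 minutes with Jane tomorrow at 10am.",
     "expected_tools": ["schedule_meeting"]},
    # ...more tasks per domain in LangChain's setup...
]

results = []
for task in tasks:
    for _rep in range(3):  # each task is run multiple times
        out = agent.invoke({"messages": [("user", task["request"])]})
        results.append(passes_trajectory(out["messages"], task["expected_tools"]))

print(f"pass rate: {sum(results) / len(results):.0%}")
```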

“The calendar scheduling agent only has access to the calendar scheduling domain, and the customer support agent only has access to the customer support domain,” LangChain explained.

The researchers then added more domain tasks and tools to the agents to increase their number of responsibilities. These ranged from human resources to technical quality assurance to legal and compliance, among other areas.
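
The overload setup can be sketched as growing a single agent’s tool set with distractor domains and re-running the same evaluation at each size. Every function here is an invented stand-in for the extra domains described above, again reusing `create_react_agent`, `model` and `schedule_meeting` from the earlier sketch:

```python
# Sketch of the overload setup: grow the scheduling agent's tool set with
# distractor-domain stubs and re-run the same tasks at each size.
def hr_lookup(employee: str) -> str:
    """Look up an employee's HR record (stub)."""
    return "record"

def qa_file_ticket(summary: str) -> str:
    """File a technical QA ticket (stub)."""
    return "ticket-1"

def legal_review(document: str) -> str:
    """Request a legal review (stub)."""
    return "queued"

def compliance_check(policy: str) -> str:
    """Run a compliance check (stub)."""
    return "ok"

DISTRACTORS = [hr_lookup, qa_file_ticket, legal_review, compliance_check]

for n in range(len(DISTRACTORS) + 1):
    overloaded = create_react_agent(model, tools=[schedule_meeting] + DISTRACTORS[:n])
    # ...re-run the 30 scheduling tasks against `overloaded` and record
    # the pass rate for this tool-set size...
```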

Instruction degradation in single agents

After running the evaluations, LangChain found that single agents often became overwhelmed when told to do too many things. They began forgetting to call tools or were unable to complete tasks when given more instructions and context.

LangChain found that calendar scheduling with GPT-4o “performed worse than Claude-3.5-sonnet, o1 and o3-mini across varying context sizes, and its performance dropped off more sharply than the other models when larger context was provided.” GPT-4o’s calendar scheduling performance fell to 2% when the number of domains increased to at least seven.

Other models fared little better. Llama-3.3-70B forgot to call the send_email tool, “so it failed every test case.”

Only Claude-3.5-sonnet, o1 and o3-mini all remembered to call the tool, but Claude-3.5-sonnet performed worse than the other two models. However, o3-mini’s performance degraded once irrelevant domains were added to the scheduling instructions.

The customer support agent could call more tools, but for this test, LangChain said Claude-3.5-sonnet performed just as well as o3-mini and o1. It also showed a shallower performance decline as more domains were added. When the context window grew longer, however, the Claude model performed worse.

GPT-4o again performed the worst among the models tested.

“We saw that as more context was provided, instruction following became worse,” LangChain noted, adding that some of its tasks were designed to follow niche, specific instructions (for example, not performing a certain action for EU-based customers). “We found that these instructions would be successfully followed by agents with fewer domains, but as the number of domains increased, these instructions were more often forgotten, and the tasks subsequently failed.”

The company said it is exploring how to evaluate multi-agent architectures using the same domain-overloading method.

LangChain is already invested in agent performance, having introduced the concept of “ambient agents,” or agents that run in the background and are triggered by specific events. These experiments could make it easier to determine how best to ensure agentic performance.
