AI Agents: Bridging Tool Calling and Reasoning in Generative AI

Exploring Problem Solving and Tool-Driven Decision Making in AI

Introduction: The Emergence of Agentic AI

Recent advancements in libraries and low-code platforms have simplified the creation of AI agents, often referred to as digital workers. Tool calling stands out as a key capability that enhances the “agentic” nature of Generative AI models, enabling them to move beyond mere conversational tasks. By executing tools (functions), these agents can act on your behalf and tackle intricate, multi-step problems requiring sound decision-making and interaction with diverse external data sources.

This insight explores the role of reasoning in tool calling, examines the challenges associated with tool usage, discusses common evaluation methods for tool-calling proficiency, and provides examples of how various models and agents engage with tools.

Reasoning as a Means of Problem-Solving

Successful agents rely on two fundamental expressions of reasoning: reasoning through evaluation and planning, and reasoning through tool use.

  1. Reasoning through Evaluation and Planning: This aspect pertains to an agent’s ability to deconstruct a problem effectively by iteratively planning, assessing progress, and adapting its strategy until the task is completed. Techniques such as Chain-of-Thought (CoT), ReAct, and Prompt Decomposition are employed to enhance the model’s strategic reasoning capabilities, allowing for a more comprehensive approach to solving complex challenges. This reasoning type is macro-level, ensuring accuracy by considering outcomes at each stage of the task.
  2. Reasoning through Tool Use: This refers to an agent’s proficiency in engaging with its environment, determining which tools to call and how to structure each call. These tools enable the agent to retrieve data, execute code, access APIs, and more. The effectiveness of this reasoning lies in the accurate execution of tool calls, rather than reflecting on the results. (A minimal sketch combining both expressions follows this list.)
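
Here is that sketch: a minimal, ReAct-style agent loop in Python. It is illustrative only; call_model is a stub standing in for a real LLM call, and get_current_weather is a hypothetical tool rather than part of any particular framework.

```python
# Minimal ReAct-style loop: the model alternates between reasoning over progress
# (evaluation and planning) and emitting tool calls (tool use). The runtime, not
# the model, executes each call and feeds the observation back.

TOOLS = {
    # Hypothetical tool; a real agent might wrap an API or database query here.
    "get_current_weather": lambda city: f"Sunny, 22C in {city}",
}

def call_model(messages):
    """Stub for an LLM call. A real implementation would send `messages` to a
    model fine-tuned for tool calling and parse its structured response."""
    if any(m["role"] == "tool" for m in messages):
        return {"action": None, "final_answer": "It is sunny and 22C in New York."}
    return {"action": {"tool": "get_current_weather", "args": {"city": "New York"}},
            "final_answer": None}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)                   # plan the next move
        if step["final_answer"]:                      # evaluate: is the task complete?
            return step["final_answer"]
        tool = TOOLS[step["action"]["tool"]]          # select the right tool
        observation = tool(**step["action"]["args"])  # executed outside the model
        messages.append({"role": "tool", "content": observation})
    return "Stopped without a final answer."

print(run_agent("What's the weather in New York?"))
```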

While both reasoning expressions are vital, they don’t always need to be combined to yield powerful solutions. For instance, OpenAI’s new o1 model excels in reasoning through evaluation and planning, having been trained to utilize chain of thought effectively. This has notably enhanced its ability to address complex challenges, achieving human PhD-level accuracy on benchmarks like GPQA across physics, biology, and chemistry, and ranking in the 86th-93rd percentile on Codeforces contests. However, the o1 model currently lacks explicit tool calling capabilities.

Conversely, many models are specifically fine-tuned for reasoning through tool use, allowing them to generate function calls and interact with APIs effectively. These models focus on executing the right tool at the right moment but may not evaluate their results as thoroughly as the o1 model. The Berkeley Function Calling Leaderboard (BFCL) serves as an excellent resource for comparing the performance of various models on tool-calling tasks and provides an evaluation suite for assessing fine-tuned models against challenging scenarios. The recently released BFCL v3 now includes multi-step, multi-turn function calling, raising the standards for tool-based reasoning tasks.

Both reasoning types are powerful in their own right, and their combination holds the potential to develop agents that can effectively deconstruct complex tasks and autonomously interact with their environments. For more insights into AI agent architectures for reasoning, planning, and tool calling, check out my team’s survey paper on ArXiv.

Challenges in Tool Calling: Navigating Complex Agent Behaviors

Creating robust and reliable agents necessitates overcoming various challenges. In tackling complex problems, an agent often must juggle multiple tasks simultaneously, including planning, timely tool interactions, accurate formatting of tool calls, retaining outputs from prior steps, avoiding repetitive loops, and adhering to guidelines to safeguard the system against jailbreaks and prompt injections.

Such demands can easily overwhelm a single agent, leading to a trend where what appears to an end user as a single agent is actually a coordinated effort of multiple agents and prompts working in unison to divide and conquer the task. This division enables tasks to be segmented and addressed concurrently by distinct models and agents, each tailored to tackle specific components of the problem.

This is where models with exceptional tool-calling capabilities come into play. While tool calling is a potent method for empowering productive agents, it introduces its own set of challenges. Agents must grasp the available tools, choose the appropriate one from a potentially similar set, accurately format the inputs, execute calls in the correct sequence, and potentially integrate feedback or instructions from other agents or humans. Many models are fine-tuned specifically for tool calling, allowing them to specialize in selecting functions accurately at the right time.

Key considerations when fine-tuning a model for tool calling include:

  • Proper Tool Selection: The model must understand the relationships among available tools, make nested calls when necessary, and select the appropriate tool even amidst similar options.
  • Addressing Structural Challenges: While most models utilize JSON format for tool calling, formats like YAML or XML may also be applicable. Consider whether the model should generalize across formats or specialize in one. Regardless, it must include the right parameters for each call, possibly using outputs from previous calls in subsequent ones. (A sketch of this call structure follows the list.)
  • Ensuring Dataset Diversity and Robust Evaluations: The dataset must be diverse enough to encompass the complexities of multi-step, multi-turn function calling. Comprehensive evaluations are essential to prevent overfitting and avoid benchmark contamination.
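
The structural side is easiest to see concretely. The snippet below sketches a hypothetical JSON-style tool definition and two calls, where the second call reuses the first call’s simulated output; the tool names and fields are illustrative, loosely following the JSON-schema conventions most tool-calling APIs share.

```python
import json

# Hypothetical tool definition in the JSON-schema style most tool-calling APIs expect.
weather_tool = {
    "name": "get_current_weather",
    "description": "Return current weather conditions for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A well-formed call the model might emit...
first_call = {"name": "get_current_weather", "arguments": {"city": "New York"}}

# ...a simulated result of executing it...
first_result = {"temperature_c": 22, "conditions": "sunny"}

# ...and a follow-up call that feeds that output into another (hypothetical) tool.
second_call = {
    "name": "suggest_packing_list",
    "arguments": {"conditions": first_result["conditions"]},
}
print(json.dumps(second_call, indent=2))
```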

Common Benchmarks for Evaluating Tool Calling

As tool usage in language models becomes increasingly significant, numerous datasets have emerged to facilitate the evaluation and enhancement of model tool-calling capabilities. Two prominent benchmarks include the Berkeley Function Calling Leaderboard and the Nexus Function Calling Benchmark, both utilized by Meta to assess the performance of their Llama 3.1 model series. The recent ToolACE paper illustrates how agents can generate a diverse dataset for fine-tuning and evaluating model tool use.

Here’s a closer look at each benchmark:

  • Berkeley Function Calling Leaderboard (BFCL): The BFCL features 2,000 question-function-answer pairs spanning multiple programming languages. Currently, three versions of the BFCL dataset exist, each with enhancements for real-world applicability. For instance, BFCL-V2, released on August 19, 2024, includes user-contributed samples to address evaluation challenges linked to dataset contamination. BFCL-V3, released on September 19, 2024, introduces multi-turn, multi-step tool calling to the benchmark, crucial for agentic applications requiring multiple tool calls over time to complete a task. Instructions for evaluating models on BFCL can be found on GitHub, with the latest dataset available on Hugging Face and the current leaderboard published online. The Berkeley team has also released various versions of their Gorilla Open-Functions model, fine-tuned for function-calling tasks.
  • Nexus Function Calling Benchmark: This benchmark assesses models on zero-shot function calling and API usage across nine tasks, categorized into single, parallel, and nested tool calls. Nexusflow has released NexusRaven-V2, a model tailored for function calling. The Nexus benchmark is available on GitHub, and the corresponding leaderboard can be found on Hugging Face.
  • ToolACE: The ToolACE paper showcases a creative strategy for addressing the challenges of gathering real-world data for function calling. The research team developed an agentic pipeline to generate a synthetic dataset for tool calling, comprising over 26,000 different APIs. This dataset includes examples of single, parallel, and nested tool calls, as well as non-tool-based interactions, supporting both single and multi-turn dialogs. The team also released a fine-tuned version of Llama-3.1–8B-Instruct, ToolACE-8B, designed to tackle complex tool-calling tasks. A subset of the ToolACE dataset is available on Hugging Face.

Each of these benchmarks enhances our ability to evaluate model reasoning through tool calling. They reflect a growing trend toward developing specialized models for specific tasks and extending the capabilities of LLMs to interact with the real world.
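
As a rough illustration of what a tool-call evaluation can reduce to, the sketch below compares a model’s emitted call against a reference call. The record fields and the exact-match rule are simplified placeholders rather than any benchmark’s actual schema; suites like BFCL use richer checks (for example, AST-based comparison).

```python
import json

def call_matches(predicted_json: str, reference: dict) -> bool:
    """Exact match on tool name and arguments. Real evaluation suites are more
    forgiving (e.g. tolerating argument order or semantically equivalent values)."""
    pred = json.loads(predicted_json)
    return (pred.get("name") == reference.get("name")
            and pred.get("arguments") == reference.get("arguments"))

# Hypothetical benchmark record: a question plus the reference call it expects.
example = {
    "question": "What's the weather in New York?",
    "name": "get_current_weather",
    "arguments": {"city": "New York"},
}

# Raw output a model might emit for that question (normally produced by your inference code).
model_output = '{"name": "get_current_weather", "arguments": {"city": "New York"}}'

print(call_matches(model_output, example))  # True
```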

Practical Applications of Tool Calling

If you’re interested in observing tool calling in action, here are some examples to consider, ordered by ease of use: from simple built-in tools to fine-tuned models and agents with tool-calling capabilities.

  • Level 1 — ChatGPT: A great starting point for experiencing live tool calling is through ChatGPT. You can use GPT-4o via the chat interface to execute tools for web browsing. For instance, when prompted with, “What’s the latest AI news this week?” ChatGPT-4o will perform a web search and provide a response based on its findings. Note that the new o1 model currently lacks tool-calling abilities.

While the built-in web search feature is convenient, most applications require defining custom tools that can be integrated into your model workflows. This leads us to the next complexity level.

  • Level 2 — Using a Model with Tool Calling Abilities and Defining Custom Tools: This level involves utilizing a model with tool-calling capabilities to gauge its effectiveness in selecting and utilizing tools. It’s crucial to note that when a model is trained for tool calling, it generates the text or code for the tool call without executing it. An external entity must invoke the tool, marking the transition from language model capabilities to agentic systems.

To observe how models articulate tool calls, you can use the Databricks Playground. For example, select the Llama 3.1 405B model and grant access to sample tools like get_distance_between_locations and get_current_weather. When prompted with, “I am going on a trip from LA to New York. How far are these two cities? And what’s the weather like in New York? I want to be prepared for when I get there,” the model will decide which tools to call and what parameters to provide for an effective response.

In this scenario, the model suggests two tool calls. Since the model cannot execute the tools itself, the user must supply a sample result to simulate each tool’s output before the model can compose its final answer.
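
Outside of a playground UI, the same Level 2 pattern looks roughly like the sketch below, written against an OpenAI-compatible chat completions client. The endpoint URL, model identifier, and tool schemas are illustrative assumptions, not the exact Databricks configuration.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client works for this pattern

client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="...")  # assumed endpoint

tools = [
    {"type": "function", "function": {
        "name": "get_distance_between_locations",
        "description": "Approximate distance in miles between two cities.",
        "parameters": {"type": "object",
                       "properties": {"origin": {"type": "string"},
                                      "destination": {"type": "string"}},
                       "required": ["origin", "destination"]}}},
    {"type": "function", "function": {
        "name": "get_current_weather",
        "description": "Current weather conditions for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]

messages = [{"role": "user", "content":
             "I am going on a trip from LA to New York. How far are these two cities? "
             "And what's the weather like in New York?"}]

response = client.chat.completions.create(
    model="llama-3.1-405b-instruct",  # assumed model identifier
    messages=messages,
    tools=tools,
)

# The model only proposes tool calls; inspect them, then execute them yourself.
assistant_msg = response.choices[0].message
for call in assistant_msg.tool_calls or []:
    print(call.function.name, call.function.arguments)

# Simulate tool results and hand them back so the model can compose its final answer.
if assistant_msg.tool_calls:
    messages.append(assistant_msg)
    fake_results = {"get_distance_between_locations": {"distance_miles": 2790},
                    "get_current_weather": {"conditions": "sunny", "temperature_f": 75}}
    for call in assistant_msg.tool_calls:
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(fake_results[call.function.name])})
    followup = client.chat.completions.create(
        model="llama-3.1-405b-instruct", messages=messages, tools=tools)
    print(followup.choices[0].message.content)
```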

  • Level 3 — Deploying a Fine-Tuned Model or Agent with Integrated Tool Calling: At this level, you can utilize a fine-tuned model or agent designed for specific tasks. In this case, the model should seamlessly call APIs and return information based on the user’s query.

Suppose you employ a model fine-tuned on the Berkeley Function Calling Leaderboard dataset. When prompted, “How many times has the word ‘freedom’ appeared in the entire works of Shakespeare?” the model will successfully retrieve and return the answer, executing the required tool calls without the user needing to define any input or manage the output format. Such models handle multi-turn interactions adeptly, processing past user messages, managing context, and generating coherent, task-specific outputs.
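
The difference at Level 3 is that the dispatch loop lives inside the system rather than being simulated by hand. Below is a hedged sketch of such a loop, reusing the OpenAI-compatible client pattern from the previous example; the endpoint, model name, and word-count tool are hypothetical stand-ins.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="...")  # assumed endpoint
MODEL = "your-tool-calling-model"  # e.g. a model fine-tuned for function calling

def count_word_in_shakespeare(word: str) -> int:
    """Hypothetical backing function; a real deployment would query a corpus or API."""
    return 0  # placeholder; wire this to an actual corpus lookup

REGISTRY = {"count_word_in_shakespeare": count_word_in_shakespeare}
TOOLS = [{"type": "function", "function": {
    "name": "count_word_in_shakespeare",
    "description": "Count occurrences of a word across Shakespeare's complete works.",
    "parameters": {"type": "object",
                   "properties": {"word": {"type": "string"}},
                   "required": ["word"]}}}]

def answer(question: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        msg = client.chat.completions.create(
            model=MODEL, messages=messages, tools=TOOLS).choices[0].message
        if not msg.tool_calls:          # no further calls: the model is done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:     # execute every proposed call automatically
            result = REGISTRY[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "No final answer within the turn limit."

# answer("How many times has the word 'freedom' appeared in the entire works of Shakespeare?")
```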

As AI agents evolve to encompass advanced reasoning and problem-solving capabilities, they will become increasingly adept at managing and executing complex, multi-step tasks across diverse domains.

Conclusion

The landscape of AI agents is rapidly evolving, and the integration of reasoning with tool calling is vital for their success. By harnessing the power of models designed for tool interaction and enhancing their reasoning capabilities, we can create intelligent systems that not only respond to inquiries but also engage in sophisticated decision-making processes, ultimately transforming how we interact with technology.

As this field advances, ongoing improvements in training methodologies and benchmarking practices, together with the exploration of novel use cases, will define the future of AI agents. Stay tuned for developments in this exciting domain as we collectively navigate the transformative potential of AI agents in our lives.
