AI Agents Connect Tool Calling and Reasoning

AI Agents: Bridging Tool Calling and Reasoning in Generative AI

Exploring Problem Solving and Tool-Driven Decision Making in AI

Introduction: The Emergence of Agentic AI

Recent advancements in libraries and low-code platforms have simplified the creation of AI agents, often referred to as digital workers. Tool calling stands out as a key capability that enhances the "agentic" nature of Generative AI models, enabling them to move beyond mere conversational tasks. By executing tools (functions), these agents can act on your behalf and tackle intricate, multi-step problems requiring sound decision-making and interaction with diverse external data sources.

This insight explores the role of reasoning in tool calling, examines the challenges associated with tool usage, discusses common evaluation methods for tool-calling proficiency, and provides examples of how various models and agents engage with tools.

Reasoning as a Means of Problem-Solving

Successful agents rely on two fundamental expressions of reasoning: reasoning through evaluation and planning, and reasoning through tool use. While both reasoning expressions are vital, they don't always need to be combined to yield powerful solutions. For instance, OpenAI's o1 model excels in reasoning through evaluation and planning, having been trained to utilize chain of thought effectively. This has notably enhanced its ability to address complex challenges, achieving human PhD-level accuracy on benchmarks like GPQA across physics, biology, and chemistry, and ranking in the 86th-93rd percentile on Codeforces contests. However, the o1 model currently lacks explicit tool-calling capabilities.

Conversely, many models are specifically fine-tuned for reasoning through tool use, allowing them to generate function calls and interact with APIs effectively. These models focus on executing the right tool at the right moment but may not evaluate their results as thoroughly as the o1 model. The Berkeley Function Calling Leaderboard (BFCL) serves as an excellent resource for comparing the performance of various models on tool-calling tasks and provides an evaluation suite for assessing fine-tuned models against challenging scenarios. The recently released BFCL v3 now includes multi-step, multi-turn function calling, raising the standards for tool-based reasoning tasks.

Both reasoning types are powerful in their own right, and their combination holds the potential to develop agents that can effectively deconstruct complex tasks and autonomously interact with their environments. For more insights into AI agent architectures for reasoning, planning, and tool calling, check out my team's survey paper on arXiv.

Challenges in Tool Calling: Navigating Complex Agent Behaviors

Creating robust and reliable agents necessitates overcoming various challenges. In tackling complex problems, an agent often must juggle multiple tasks simultaneously, including planning, timely tool interactions, accurate formatting of tool calls, retaining outputs from prior steps, avoiding repetitive loops, and adhering to guidelines that safeguard the system against jailbreaks and prompt injections. Such demands can easily overwhelm a single agent, leading to a trend where what appears to an end user as a single agent is actually a coordinated effort of multiple agents and prompts working in unison to divide and conquer the task. This division enables tasks to be segmented and addressed concurrently by distinct models and agents, each tailored to tackle specific components of the problem.
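To make this division of labor concrete, here is a minimal, illustrative sketch of the pattern in Python. The planner, worker, and writer prompts and the call_llm helper are hypothetical placeholders for this post, not part of any particular framework:

# Illustrative only: what looks like a single agent is a coordinator that
# routes sub-tasks to specialized prompts (or even different models).
def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical placeholder for a call to your LLM provider of choice."""
    raise NotImplementedError("Wire this up to your model provider.")

SPECIALIST_PROMPTS = {
    "planner": "Break the user's request into a short, ordered list of sub-tasks.",
    "worker": "Complete a single sub-task as precisely as possible.",
    "writer": "Combine the sub-task results into one coherent reply for the user.",
}

def coordinated_agent(user_request: str) -> str:
    # 1. A planning prompt decomposes the request into sub-tasks.
    plan = call_llm(SPECIALIST_PROMPTS["planner"], user_request)

    # 2. Each sub-task is handled by a specialist prompt; these calls could
    #    run concurrently and use different models tuned for each job.
    results = [
        call_llm(SPECIALIST_PROMPTS["worker"], sub_task)
        for sub_task in plan.splitlines()
        if sub_task.strip()
    ]

    # 3. A final prompt merges the partial results into the answer the
    #    end user sees as if it came from one agent.
    return call_llm(SPECIALIST_PROMPTS["writer"], "\n".join(results))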
This division of labor is where models with exceptional tool-calling capabilities come into play. While tool calling is a potent method for empowering productive agents, it introduces its own set of challenges. Agents must grasp the available tools, choose the appropriate one from a potentially similar set, accurately format the inputs, execute calls in the correct sequence, and potentially integrate feedback or instructions from other agents or humans. Many models are fine-tuned specifically for tool calling, allowing them to specialize in selecting functions accurately at the right time, and there are several key considerations to weigh when fine-tuning a model for this purpose.

Common Benchmarks for Evaluating Tool Calling

As tool usage in language models becomes increasingly significant, numerous datasets have emerged to facilitate the evaluation and enhancement of model tool-calling capabilities. Two prominent benchmarks are the Berkeley Function Calling Leaderboard and the Nexus Function Calling Benchmark, both used by Meta to assess the performance of their Llama 3.1 model series. The recent ToolACE paper illustrates how agents can generate a diverse dataset for fine-tuning and evaluating model tool use. Each of these benchmarks enhances our ability to evaluate model reasoning through tool calling, and they reflect a growing trend toward developing specialized models for specific tasks and extending the capabilities of LLMs to interact with the real world.

Practical Applications of Tool Calling

If you're interested in observing tool calling in action, here are some examples to consider, categorized by ease of use, from simple built-in tools to fine-tuned models and agents with tool-calling capabilities. While a built-in web search feature is convenient, most applications require defining custom tools that can be integrated into your model workflows, which leads to the next level of complexity.

To observe how models articulate tool calls, you can use the Databricks Playground. For example, select the Llama 3.1 405B model and grant access to sample tools like get_distance_between_locations and get_current_weather. When prompted with, "I am going on a trip from LA to New York. How far are these two cities? And what's the weather like in New York? I want to be prepared for when I get there," the model decides which tools to call and what parameters to provide for an effective response. In this scenario, the model suggests two tool calls. Since the Playground cannot execute the tools itself, the user must supply a sample result to simulate the tool output.

Now suppose you employ a model fine-tuned on the Berkeley Function Calling Leaderboard dataset. When prompted, "How many times has the word 'freedom' appeared in the entire works of Shakespeare?" the model will successfully retrieve and return the answer, executing the required tool calls without the user needing to define any input or manage the output format. Such models handle multi-turn interactions adeptly, processing past user messages, managing context, and generating coherent, task-specific outputs. As AI agents evolve to encompass advanced reasoning and problem-solving capabilities, they will become increasingly adept at managing complex, multi-step tasks on our behalf.
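To see the mechanics behind an exchange like the trip-planning example above, here is a minimal sketch of a tool-calling loop. The tool implementations, their return values, and the shape of model_tool_calls are illustrative assumptions for this post; each provider (Databricks, OpenAI, Amazon Bedrock, and so on) defines its own schema for declaring tools and returning tool-call requests:

import json

# Hypothetical tool implementations standing in for real services.
def get_distance_between_locations(origin: str, destination: str) -> str:
    return json.dumps({"origin": origin, "destination": destination, "miles": 2450})

def get_current_weather(location: str) -> str:
    return json.dumps({"location": location, "forecast": "54F, partly cloudy"})

TOOLS = {
    "get_distance_between_locations": get_distance_between_locations,
    "get_current_weather": get_current_weather,
}

# A model fine-tuned for tool calling returns something like this structure;
# the exact format varies by provider, so this shape is only illustrative.
model_tool_calls = [
    {"name": "get_distance_between_locations",
     "arguments": {"origin": "Los Angeles", "destination": "New York"}},
    {"name": "get_current_weather",
     "arguments": {"location": "New York"}},
]

# The application executes each requested tool and feeds the results back
# to the model as additional messages so it can compose the final answer.
tool_messages = []
for call in model_tool_calls:
    result = TOOLS[call["name"]](**call["arguments"])
    tool_messages.append({"role": "tool", "name": call["name"], "content": result})

print(tool_messages)

The key point is that the model only proposes the calls; the surrounding application executes them and feeds the results back, which is exactly what you simulate by hand when you paste a sample result into the Playground.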

Prompt Decomposition

Optimizing Generative AI: Overcoming Adoption Barriers Through Prompt Decomposition

Understand and Control Every Element of Your Workload

Challenges in Scaling Generative AI

As a Generative AI Specialist at AWS, I've worked with over 50 customers in the last 18 months, encountering numerous generative AI proofs of concept (PoCs). Many teams struggle to move beyond the PoC stage due to several common challenges.

Solution: Prompt Decomposition

Prompt decomposition offers a solution to these common issues by breaking down complex prompts into manageable parts. While other techniques exist, prompt decomposition stands out for its ability to address these blockers effectively.

Does Prompt Decomposition Really Work?

Yes, it does. This technique has proven effective in unlocking scalability for some of AWS's largest clients across various sectors. In this blog post, I will share code examples for two use cases that illustrate how prompt decomposition can improve accuracy and reduce latency. Each example will demonstrate changes in cost, latency, and accuracy before and after applying prompt decomposition.

What is Prompt Decomposition?

Prompt decomposition involves breaking down a complex prompt into smaller, more manageable components. This approach simplifies large tasks into sequential, manageable steps, improving execution efficiency.

Example: Summer Camp Recommendation System

Consider a system recommending summer camps based on a child's age, desired camp date, and interests. The process can be decomposed into three steps, one focused on each of those attributes.

Parallel Execution

For particularly lengthy prompts, decomposing them into parallel tasks can significantly reduce execution time. For example, a prompt initially taking 43 seconds can be broken into three parallel parts, reducing the total execution time to under 10 seconds without sacrificing accuracy.

Conclusion

Prompt decomposition is a powerful technique to overcome common challenges in generative AI projects. By breaking down complex tasks, teams can improve accuracy, manage costs and latency, and gain better control and metrics, leading to more scalable and reliable solutions.

Ready to Build?

For those ready to dive in, full code examples are available in the GitHub repository linked below. The repository includes a Jupyter Notebook (Prompt_Decomposition.ipynb) with two examples: one focused on accuracy and the other on latency. An updated evaluation function for multithreaded calls to Amazon Bedrock is also included.

Starting with Evaluation

Automated evaluation is crucial for assessing generative AI performance. Begin with a gold-standard set of input/output pairs created by humans to serve as a benchmark. Avoid using generative AI to create this set, as it may introduce inaccuracies. The evaluation function compares the correct and generated answers, scoring them much as a teacher would grade student work. Here's a sample evaluation prompt:

test_prompt_template_system = """You are a detail-oriented teacher. You are grading an exam,
looking at a correct answer and a student submitted answer. Your goal is to score the student
answer based on how close it is to the correct answer. This is a pass/fail test. If the two
answers are basically the same, the score should be 100. Minor things like punctuation,
capitalization, or spelling should not impact the score. If the two answers are different,
then the score should be 0. Please put your score in a 'score' XML tag, and any reasoning
in a 'reason' XML tag."""
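As a rough sketch of how an evaluation function might use this prompt, the grading response can be parsed for the score and reason tags as shown below. The call_model helper is a hypothetical placeholder for your Amazon Bedrock (or other LLM) call, not the notebook's actual implementation:

import re

def call_model(system_prompt: str, user_message: str) -> str:
    """Hypothetical placeholder: swap in your own call to Amazon Bedrock
    (or any other LLM endpoint) that returns the model's text response."""
    raise NotImplementedError

def grade_answer(correct_answer: str, generated_answer: str) -> tuple[int, str]:
    # test_prompt_template_system is the grading prompt defined above.
    user_message = (
        f"Correct answer:\n{correct_answer}\n\n"
        f"Student answer:\n{generated_answer}"
    )
    response = call_model(test_prompt_template_system, user_message)

    # Pull the pass/fail score and the reasoning out of the XML tags the
    # prompt asks the model to emit.
    score_match = re.search(r"<score>\s*(\d+)\s*</score>", response)
    reason_match = re.search(r"<reason>(.*?)</reason>", response, re.DOTALL)
    score = int(score_match.group(1)) if score_match else 0
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason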
Task-Based Decomposition Example

For the summer camp recommendation system, we decompose the task into steps that focus on age, date, and interests, allowing for accurate recommendations based on predefined test cases.

Volume-Based Decomposition Use Case

To handle long prompts efficiently, such as analyzing an entire novel, we break the task into smaller, parallel parts, significantly improving execution time and accuracy. A minimal sketch of this parallel approach appears at the end of this post.

Prompt Decomposition

Creating a flowchart for your task and selecting the best tools for each step can greatly enhance your generative AI workflows. Explore the full code in the GitHub repository, and feel free to comment with questions or share your own experiences. Let's build something amazing by breaking it down into manageable pieces!
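To close, here is a minimal sketch of the volume-based, parallel pattern described above. The chunk size, the summarization prompts, and the reuse of the hypothetical call_model helper from the evaluation sketch are assumptions for illustration, not the repository's actual code:

from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(text: str, chunk_size: int = 20000) -> list[str]:
    """Split a long document (e.g., a novel) into roughly equal pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_chunk(chunk: str) -> str:
    # Reuses the hypothetical call_model helper from the evaluation sketch.
    return call_model("Summarize the key events in this passage.", chunk)

def analyze_novel(novel_text: str) -> str:
    chunks = split_into_chunks(novel_text)

    # Each chunk is processed in parallel, so wall-clock time is roughly the
    # slowest single call rather than the sum of all calls.
    workers = min(8, max(1, len(chunks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_summaries = list(pool.map(summarize_chunk, chunks))

    # A final call merges the partial results into one answer.
    return call_model(
        "Combine these partial summaries into a single coherent summary.",
        "\n\n".join(partial_summaries),
    )

Because the chunks are independent, total latency is bounded by the slowest single call plus the final merge, which is how a prompt that once took 43 seconds can drop to under 10 seconds.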
