Meta Archives - gettectonic.com
Salesforce AI Introduces SFR-Judge

Salesforce AI Introduces SFR-Judge: A Family of Three Evaluation Models with 8B, 12B, and 70B Parameters, Powered by Meta Llama 3 and Mistral NeMo

The rapid development of large language models (LLMs) has transformed natural language processing, making accurate evaluation of these models more critical than ever. Traditional human evaluations, while effective, are time-consuming and impractical at the pace at which AI models now evolve. To address this, Salesforce AI Research has introduced SFR-Judge, a family of LLM-based judge models designed to change how AI outputs are evaluated. Built on Meta Llama 3 and Mistral NeMo, the SFR-Judge family includes models with 8 billion (8B), 12 billion (12B), and 70 billion (70B) parameters. These models handle evaluation tasks such as pairwise comparisons, single ratings, and binary classifications, streamlining the evaluation process for AI researchers.

Overcoming Limitations in Traditional Judge Models

Traditional LLMs used for evaluation often suffer from biases such as position bias (favoring responses based on their order) and length bias (preferring longer responses regardless of their accuracy). SFR-Judge addresses these issues with Direct Preference Optimization (DPO), a training method that lets the model learn from both positive and negative examples, reducing bias and producing more consistent, accurate evaluations.

Performance and Benchmarking

SFR-Judge was tested across 13 benchmarks covering three key evaluation tasks. It outperformed existing judge models, including proprietary models like GPT-4o, achieving top performance on 10 of the 13 benchmarks. Notably, SFR-Judge reached 92.7% accuracy on the RewardBench leaderboard, a new high for LLM-based evaluation, demonstrating its potential not only as an evaluation tool but also as a reward model in reinforcement learning from human feedback (RLHF) scenarios.

Innovative Training Approach

The SFR-Judge models were trained on three distinct data formats, and this diversity allows SFR-Judge to generate well-rounded, accurate evaluations, making it a more reliable and robust tool for model assessment.

Bias Mitigation and Robustness

SFR-Judge was tested on EvalBiasBench, a benchmark designed to measure six types of bias. The results showed significantly lower bias levels than competing models, along with high consistency in pairwise order comparisons. This robustness means SFR-Judge's evaluations remain stable even when the order of responses is altered, making it a scalable and reliable alternative to human annotation.

Conclusion

Salesforce AI Research's SFR-Judge represents a breakthrough in the automated evaluation of large language models. By incorporating Direct Preference Optimization and a diverse training approach, SFR-Judge sets a new standard for accuracy, bias reduction, and consistency. Its ability to provide detailed feedback and adapt to various evaluation tasks makes it a powerful tool for the AI community, streamlining LLM assessment and setting the stage for future advances in AI evaluation.
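To make the evaluation tasks concrete, the snippet below sketches the kind of pairwise-comparison judging that SFR-Judge automates. It is a minimal illustration using a generic OpenAI-compatible chat client and a hypothetical model identifier ("my-judge-model"); SFR-Judge itself is not exposed through this API, and the prompt format here is ours, not Salesforce's.

```python
# Minimal sketch of LLM-as-a-judge pairwise comparison.
# Assumptions: an OpenAI-compatible endpoint and a hypothetical model name
# ("my-judge-model"); this is not the SFR-Judge API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY, or point base_url at your own endpoint

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question and pick the better one. Explain your reasoning briefly, then
finish with a verdict line: "Verdict: A" or "Verdict: B".

Question: {question}

Response A: {response_a}

Response B: {response_b}
"""

def pairwise_judge(question: str, response_a: str, response_b: str) -> str:
    """Ask the judge model which of two candidate responses is better."""
    completion = client.chat.completions.create(
        model="my-judge-model",  # hypothetical identifier; swap in your judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response_a=response_a, response_b=response_b)}],
        temperature=0,  # deterministic judging
    )
    return completion.choices[0].message.content
```

Running the same comparison a second time with the responses swapped, and checking that the two verdicts agree, is a cheap way to detect the position bias discussed above.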

Zendesk Launches AI Agent Builder

Zendesk Launches AI Agent Builder and Enhances Agent Copilot

Zendesk has unveiled its AI Agent Builder, a key feature in a series of significant updates across its platform. The new tool enables customer service teams to create bots, now referred to as "AI Agents," from natural language descriptions. For example, a user might input: "A customer wants to return a product." The AI Agent Builder recognizes the scenario and automatically creates a framework for the AI Agent, which can then be reviewed, tested, and deployed. That framework might include essential steps like checking the order number, verifying the items for return, and cross-referencing the return policy.

Matthias Goehler, CTO for EMEA at Zendesk, explains, "You can define any number of workflows in the same straightforward manner. The best part is that business users can do this without needing to design complex flowcharts or decision trees." However, developers may still need to consult an API when creating AI Agents that interact with multiple third-party applications.

Other Enhancements to Zendesk's AI Agents

The AI Agent Builder simplifies the automation of customer interactions that involve multiple steps. For more straightforward queries, Zendesk can connect a single AI Agent to trusted knowledge sources, allowing it to answer autonomously. The vendor recently extended this capability to email and strengthened its partnership with PolyAI to bring conversational AI to the voice channel. Goehler remarked, "When I first heard a Poly bot, I thought it was a human; it even had subtle dialects and varied pacing." This natural-sounding voice, combined with real-time data processing, lets the bot understand customer intent and guide customers through various processes.

Zendesk aims to help customers automate up to 80 percent of their service inquiries. However, Goehler acknowledges that some situations will always require human intervention, whether due to case complexity or customer preference. The company therefore continues to enhance its Agent Copilot, which now includes several new features.

The "Enhanced" Zendesk Agent Copilot

One of the most notable new features in Agent Copilot is its "Procedure" capability, which allows contact centers to define specific procedures for the Copilot to execute on behalf of live agents. Users can specify these procedures in natural language, such as: "Do this first, then this, and finally this." During live interactions, agents can ask the Copilot to carry out tasks like scheduling appointments or sending shipping labels.

The Copilot can also proactively suggest procedures, share recommended responses, and offer guidance through its new "auto-assist" mode. The live agent remains in control and can approve the Copilot's suggestions, letting it handle much of the workload. Goehler noted, "If the agent wants to adjust something, they can do that, too. The AI continues to suggest steps and solutions." This is particularly beneficial for companies facing high staff turnover, as it helps new agents ramp up quickly with consistent, high-quality guidance.

Zendesk has also introduced Agent Copilot for Voice, making many of these capabilities available during customer calls. Agents receive live call insights and relevant knowledge base content to enhance their interactions.

Elsewhere at Zendesk

2024 has been a transformative year for Zendesk.
The company has entered the workforce engagement management (WEM) market with acquisitions of Klaus and Tymeshift. This follows the integration of Ultimate, which laid the groundwork for the new Zendesk AI Agents and significantly enhanced the vendor's conversational AI expertise. Additionally, Zendesk has developed a customer messaging app in collaboration with Meta, established a venture arm for AI startups, and announced new partnerships with AWS and Anthropic.

Notably, Zendesk has gained attention for introducing an "industry-first" outcome-based pricing model. The move is significant because many CCaaS and CRM vendors, facing pressure from AI solutions that reduce headcount, have traditionally relied on seat-based pricing. By adopting outcome-based pricing, Zendesk ensures customers pay more only when they achieve the desired outcomes, addressing a key challenge in the industry.

Small Language Models Explained

Exploring Small Language Models (SLMs): Capabilities and Applications

Large Language Models (LLMs) have been prominent in AI for some time, but Small Language Models (SLMs) are now enhancing our ability to work with natural and programming languages. While LLMs excel at general language understanding, certain applications require more accuracy and domain-specific knowledge than these models can provide. This has created demand for custom SLMs that offer LLM-like performance while reducing runtime costs and providing a secure, manageable environment.

In this insight, we dig into the world of SLMs, exploring their characteristics, benefits, and applications. We also discuss fine-tuning methods applied to Llama-2-13b, an SLM, to address specific challenges. The goal is to make the fine-tuning process platform-independent. We selected Databricks for this purpose because of its compatibility with the major cloud providers: Azure, Amazon Web Services (AWS), and Google Cloud Platform.

What Are Small Language Models?

In AI and natural language processing, SLMs are lightweight generative models focused on specific tasks. The term "small" refers to the model's parameter count and hardware footprint rather than its capability: SLMs like Google Gemini Nano, Microsoft's Orca-2-7b, and Meta's Llama-2-13b run efficiently on a single GPU while still including over 5 billion parameters.

SLMs vs. LLMs

Compared with LLMs, SLMs trade breadth of general knowledge for lower cost, faster inference, and easier deployment on constrained hardware.

Applications of SLMs

SLMs are increasingly used across sectors including healthcare, technology, and beyond. Common applications include domain-specific text generation and question answering.

Fine-Tuning Small Language Models

Fine-tuning involves additional training of a pre-trained model to make it more domain-specific. The process updates the model's parameters with new data to improve its performance on targeted applications, such as text generation or question answering.

Hardware Requirements for Fine-Tuning

Hardware needs depend on the model size, the scale of the project, and the dataset.

Data Preparation

Preparing data involves extracting text from PDFs, cleaning it, generating question-and-answer pairs, and then fine-tuning the model. Although GPT-3.5 was used here to generate the Q&A pairs, an SLM can also be used for this step, depending on the use case.

Fine-Tuning Process

We used HuggingFace tools to fine-tune Llama-2-13b-chat-hf. The dataset was converted into a HuggingFace-compatible format, and quantization techniques were applied to optimize performance. Fine-tuning took about 16 hours over 50 epochs and cost around $100/£83, excluding trial runs.

Results and Observations

The fine-tuned model performed strongly, with over 70% of its answers highly similar to those generated by GPT-3.5. The SLM achieved comparable results despite having far fewer parameters. The process ran successfully on both AWS and Databricks, demonstrating the approach's portability.

SLMs still have limitations compared to LLMs, such as restricted knowledge bases, but they offer benefits in efficiency, cost, versatility, and environmental impact. As SLMs continue to evolve, their relevance and popularity are likely to grow, especially with new models like Gemini Nano and Mixtral entering the market.
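The article summarizes rather than lists the fine-tuning code, so the sketch below shows what a quantized (QLoRA-style) HuggingFace fine-tune of Llama-2-13b-chat-hf typically looks like. The dataset path, LoRA settings, and hyperparameters are illustrative assumptions, not the values used in the experiment described above, and the transformers/peft/trl APIs shown are those of the 2024-era releases.

```python
# Rough sketch: 4-bit (QLoRA-style) fine-tuning of Llama-2-13b-chat-hf with
# HuggingFace tools. Dataset path, LoRA targets, and hyperparameters are
# placeholders, not the settings from the experiment described in the article.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = BitsAndBytesConfig(              # 4-bit quantization to fit a single GPU
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Q&A pairs, one JSON record per line with a pre-formatted "text" field.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="llama2-13b-finetuned",
                           num_train_epochs=50,          # the article reports ~50 epochs
                           per_device_train_batch_size=4,
                           learning_rate=2e-4,
                           bf16=True),
)
trainer.train()
```

The same script can run in a Databricks notebook attached to a GPU cluster, which is one way the platform-independence goal described above is typically met across Azure, AWS, and GCP.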

A Company in Transition

OpenAI Restructures: Increased Flexibility, But Raises Concerns

OpenAI's decision to restructure into a for-profit entity offers more freedom for the company and its investors but raises questions about its commitment to ethical AI development. Founded in 2015 as a nonprofit, OpenAI moved to a hybrid model in 2019 with the creation of a for-profit subsidiary. The restructuring widely reported this week signals a further shift: the nonprofit arm will no longer influence the day-to-day operations of the for-profit side. CEO Sam Altman is set to receive equity in the restructured company, which will operate as a benefit corporation (B Corp), similar to competitors like Anthropic and Sama.

A Company in Transition

The move comes on the heels of a turbulent year. OpenAI's board initially voted to remove Altman over concerns about transparency, then rehired him after significant backlash and the resignation of several board members. The company has seen a number of high-profile departures since, including co-founder Ilya Sutskever, who left in May to start Safe Superintelligence (SSI), an AI safety-focused venture that recently secured $1 billion in funding. This week, CTO Mira Murati, along with research leaders Bob McGrew and Barret Zoph, also announced their departures.

OpenAI's restructuring also coincides with an anticipated multi-billion-dollar investment round involving major players such as Nvidia, Apple, and Microsoft, potentially pushing the company's valuation as high as $150 billion.

Complex But Expected Move

According to Michael Bennett, AI policy advisor at Northeastern University, the restructuring isn't surprising given OpenAI's rapid growth and increasingly complex structure. "Considering OpenAI's valuation, it's understandable that the company would simplify its governance to better align with investor priorities," Bennett said. The transition to a benefit corporation signals a shift toward prioritizing shareholder interests, but it also raises concerns about whether OpenAI will maintain its ethical obligations. "By moving away from its nonprofit roots, OpenAI may scale back its commitment to ethical AI," Bennett noted.

Ethical and Safety Concerns

OpenAI has faced scrutiny over its rapid deployment of generative AI models since the release of ChatGPT in November 2022. Critics, including Elon Musk, have accused the company of failing to be transparent about the data and methods used to train its models; Musk, a co-founder of OpenAI, has filed a lawsuit alleging breach of contract. Concerns persist that the restructuring could lead to less ethical oversight, particularly in preventing issues like biased outputs, hallucinations, and broader societal harm from AI.

Despite the potential risks, Bennett acknowledged that the company will gain operational freedom. "They will likely move faster and with greater focus on what benefits their shareholders," he said. This could come at the expense of the ethical commitments OpenAI emphasized as a nonprofit.

Governance and Regulation

Some industry voices argue that OpenAI's corporate structure shouldn't dictate its commitment to ethical AI. Veera Siivonen, co-founder and chief commercial officer of AI governance vendor Saidot, emphasized the role of regulation in ensuring responsible AI development. "Major players like Anthropic, Cohere, and tech giants such as Google and Meta are all for-profit entities," Siivonen said.
"It's unfair to expect OpenAI to operate under a nonprofit model when others in the industry aren't bound by the same restrictions." Siivonen also pointed to OpenAI's participation in global AI governance initiatives: the company recently signed the European Union AI Pact, a voluntary agreement to adhere to the principles of the EU's AI Act, signaling its commitment to safety and ethics.

Challenges for Enterprises

The restructuring raises potential concerns for enterprises relying on OpenAI's technology, said Dion Hinchcliffe, an analyst with Futurum Group. OpenAI may be able to innovate faster under its new structure, but the reduced influence of nonprofit oversight could make some companies question the vendor's long-term commitment to safety. Hinchcliffe noted that the departure of key staff could signal a shift away from prioritizing AI safety, potentially prompting enterprises to reconsider their trust in OpenAI.

New Developments Amid Restructuring

Despite the ongoing changes, OpenAI continues to roll out new technology. The company recently introduced a new moderation model, "omni-moderation-latest," built on GPT-4o. Available through the Moderation API, the model lets developers flag harmful content in both text and images.

A Company in Transition

As OpenAI navigates its restructuring, balancing rapid innovation with ethical standards will be crucial to sustaining enterprise trust and market leadership.
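The Moderation API itself is publicly documented, so the new model can be tried directly. The sketch below is a minimal call; the sample text and image URL are placeholders of ours.

```python
# Minimal sketch: screening content with OpenAI's omni-moderation-latest model.
# The sample inputs are placeholders; an image is checked by passing an
# image_url entry alongside the text.
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Example user message to screen before posting."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
    ],
)

result = response.results[0]
print("flagged:", result.flagged)          # True if any category triggers
print("categories:", result.categories)    # per-category booleans
print("scores:", result.category_scores)   # per-category confidence scores
```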

AI Agents Connect Tool Calling and Reasoning

AI Agents: Bridging Tool Calling and Reasoning in Generative AI

Exploring Problem Solving and Tool-Driven Decision Making in AI

Introduction: The Emergence of Agentic AI

Recent advances in libraries and low-code platforms have simplified the creation of AI agents, often referred to as digital workers. Tool calling stands out as a key capability that enhances the "agentic" nature of generative AI models, enabling them to move beyond purely conversational tasks. By executing tools (functions), these agents can act on your behalf and tackle intricate, multi-step problems that require sound decision-making and interaction with diverse external data sources.

This insight explores the role of reasoning in tool calling, examines the challenges of tool usage, discusses common ways of evaluating tool-calling proficiency, and provides examples of how various models and agents engage with tools.

Reasoning as a Means of Problem-Solving

Successful agents rely on two fundamental expressions of reasoning: reasoning through evaluation and planning, and reasoning through tool use. Both are vital, but they don't always need to be combined to yield powerful solutions.

For instance, OpenAI's new o1 model excels at reasoning through evaluation and planning, having been trained to use chain of thought effectively. This has notably enhanced its ability to address complex challenges, achieving human PhD-level accuracy on benchmarks like GPQA across physics, biology, and chemistry, and ranking in the 86th-93rd percentile on Codeforces contests. However, the o1 model currently lacks explicit tool-calling capabilities.

Conversely, many models are fine-tuned specifically for reasoning through tool use, allowing them to generate function calls and interact with APIs effectively. These models focus on calling the right tool at the right moment but may not evaluate their own results as thoroughly as the o1 model. The Berkeley Function Calling Leaderboard (BFCL) is an excellent resource for comparing models on tool-calling tasks, and it provides an evaluation suite for assessing fine-tuned models against challenging scenarios. The recently released BFCL v3 adds multi-step, multi-turn function calling, raising the bar for tool-based reasoning tasks.

Both types of reasoning are powerful in their own right, and combining them holds the potential to build agents that can effectively break down complex tasks and autonomously interact with their environments. For more on AI agent architectures for reasoning, planning, and tool calling, check out my team's survey paper on ArXiv.

Challenges in Tool Calling: Navigating Complex Agent Behaviors

Creating robust and reliable agents requires overcoming several challenges. In tackling complex problems, an agent often must juggle multiple tasks at once: planning, timely tool interactions, accurate formatting of tool calls, retaining outputs from prior steps, avoiding repetitive loops, and adhering to guardrails that protect the system against jailbreaks and prompt injections. Such demands can easily overwhelm a single agent, which is why what appears to an end user as a single agent is often a coordinated effort of multiple agents and prompts working together to divide and conquer the task. This division lets tasks be segmented and addressed concurrently by distinct models and agents, each tailored to a specific component of the problem.
This is where models with exceptional tool-calling capabilities come into play. While tool calling is a potent method for empowering productive agents, it introduces its own set of challenges. Agents must understand the available tools, choose the appropriate one from a potentially similar set, format the inputs accurately, execute calls in the correct sequence, and potentially integrate feedback or instructions from other agents or humans. Many models are fine-tuned specifically for tool calling, allowing them to specialize in selecting the right function at the right time, though fine-tuning a model for tool calling involves several key considerations of its own.

Common Benchmarks for Evaluating Tool Calling

As tool usage in language models becomes increasingly significant, numerous datasets have emerged to evaluate and improve model tool-calling capabilities. Two prominent benchmarks are the Berkeley Function Calling Leaderboard and the Nexus Function Calling Benchmark, both used by Meta to assess the Llama 3.1 model series. The recent ToolACE paper shows how agents can generate a diverse dataset for fine-tuning and evaluating model tool use. Each of these benchmarks improves our ability to evaluate model reasoning expressed through tool calling, and together they reflect a growing trend toward developing specialized models for specific tasks and extending the capabilities of LLMs to interact with the real world.

Practical Applications of Tool Calling

If you want to see tool calling in action, the following examples progress from simple built-in tools to fine-tuned models and agents with tool-calling capabilities. While a built-in web search feature is convenient, most applications require defining custom tools that can be integrated into your model workflows, which leads to the next level of complexity.

To observe how models articulate tool calls, you can use the Databricks Playground. For example, select the Llama 3.1 405B model and grant access to sample tools like get_distance_between_locations and get_current_weather. When prompted with, "I am going on a trip from LA to New York. How far are these two cities? And what's the weather like in New York? I want to be prepared for when I get there," the model decides which tools to call and what parameters to provide for an effective response. In this scenario, the model suggests two tool calls. Since the model cannot execute the tools itself, the user must supply a sample result to simulate the tool output, as in the sketch that follows.

Suppose you instead use a model fine-tuned on the Berkeley Function Calling Leaderboard dataset. When prompted, "How many times has the word 'freedom' appeared in the entire works of Shakespeare?" the model retrieves and returns the answer, executing the required tool calls without the user needing to define any input or manage the output format. Such models handle multi-turn interactions adeptly, processing past user messages, managing context, and generating coherent, task-specific outputs.

As AI agents evolve to encompass advanced reasoning and problem-solving capabilities, they will become increasingly adept at managing complex, multi-step tasks on our behalf.
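The Playground exchange above can also be reproduced in code. The sketch below assumes an OpenAI-compatible chat endpoint (Databricks model serving speaks this protocol) and hand-written schemas for the two sample tools named above; the tool implementations are stubbed out with canned results, standing in for the user-supplied sample results mentioned in the article.

```python
# Sketch of the tool-calling flow described above, using an OpenAI-compatible
# chat client. Tool schemas and stub implementations are assumptions made for
# illustration; the article's get_distance_between_locations and
# get_current_weather tools are not publicly available.
import json
from openai import OpenAI

client = OpenAI()  # point base_url/api_key at your own OpenAI-compatible endpoint
MODEL = "gpt-4o"   # swap in your tool-calling model, e.g. a Llama 3.1 405B endpoint

tools = [
    {"type": "function", "function": {
        "name": "get_distance_between_locations",
        "description": "Return the distance in miles between two cities.",
        "parameters": {"type": "object",
                       "properties": {"origin": {"type": "string"},
                                      "destination": {"type": "string"}},
                       "required": ["origin", "destination"]}}},
    {"type": "function", "function": {
        "name": "get_current_weather",
        "description": "Return the current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"location": {"type": "string"}},
                       "required": ["location"]}}},
]

def run_tool(name: str, args: dict) -> str:
    """Stub executor standing in for real API calls."""
    if name == "get_distance_between_locations":
        return json.dumps({"distance_miles": 2451})   # canned sample result
    if name == "get_current_weather":
        return json.dumps({"location": args["location"],
                           "forecast": "62F, partly cloudy"})
    return json.dumps({"error": f"unknown tool {name}"})

messages = [{"role": "user", "content":
             "I am going on a trip from LA to New York. How far are these two "
             "cities? And what's the weather like in New York?"}]

reply = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
msg = reply.choices[0].message

# If the model decided to call tools, execute each call and send the results back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(msg.content)
```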

OpenAI’s o1 model

The release of OpenAI's o1 model has sparked some confusion. Unlike previous releases that focused on increasing parameters and capabilities, this one takes a different approach. Let's explore the technical distinctions first, share a real-world experience, and wrap up with recommendations on when to use each model.

Technical Differences

The core difference is that o1 acts as an "agentic wrapper" around GPT-4 (or a similar model): it adds a layer of metacognition, or "thinking about thinking," before addressing a query. Instead of answering immediately, o1 first evaluates the best strategy for tackling the question by breaking it into subtasks. Once this analysis is complete, o1 begins executing each subtask and, depending on the intermediate answers, may adjust its approach. The method resembles the "tree of thought" strategy, and users can see real-time summaries of the subtasks being addressed. For a deeper dive into agentic approaches, we highly recommend Andrew Ng's letters on the topic.

However, this method comes at a cost: it is roughly six times more expensive and roughly six times slower than the traditional approach. And while the metacognitive process can deepen understanding, it doesn't guarantee better answers for straightforward factual queries or tasks like generating trivia questions, where simplicity may yield better results.

Real-World Example

To illustrate the practical implications, Tectonic set out to deepen its understanding of variational autoencoders, a trend in multimodal LLMs. While we had a basic grasp of the concept, we had specific questions about their advantages over traditional autoencoders and the nuances of training them. This kind of information isn't easily found through a simple search; it's more like seeking insight from a domain expert.

To deepen our comprehension, we put the same questions to both GPT-4 and o1. We quickly noticed that o1's responses were more thoughtful and supported a meaningful dialogue. In contrast, GPT-4 tended to recycle the same information, offering limited depth, much like some people do in conversation. A particularly striking example occurred when we attempted to clarify our understanding: o1 responded like a thoughtful colleague, addressing our specific points, while GPT-4 felt more like a know-it-all friend who rambled on, requiring us to sift through the information for the valuable insights.

Summary and Recommendations

If we were to personify these models, GPT-4 would be the overzealous friend who dives straight into a stream of consciousness, while o1 would be the attentive listener who takes a moment to reflect before delivering precise, relevant insights. In scenarios that demand multi-step reasoning or expert-level dialogue, o1 may outperform GPT-4 enough to justify its higher cost, while straightforward factual queries remain faster and cheaper with GPT-4. By weighing these trade-offs, you can better match each model's strengths to your tasks and inquiries.
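If you want to run the same comparison yourself, both models are reachable through the standard chat API. The sketch below assumes the gpt-4o and o1-preview model identifiers OpenAI exposed at the time of writing; o1 models take plain user messages and restrict sampling parameters, so the call keeps the defaults.

```python
# Quick side-by-side: ask the same expert-level question to a GPT-4-class and
# an o1-class model. Model identifiers are assumptions (gpt-4o, o1-preview);
# swap in whatever your account exposes.
from openai import OpenAI

client = OpenAI()

question = ("What advantages do variational autoencoders offer over ordinary "
            "autoencoders, and what are the main pitfalls when training them?")

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # o1 models expect plain user turns
    )
    return response.choices[0].message.content

print("--- gpt-4o ---")
print(ask("gpt-4o", question))

print("--- o1-preview ---")   # slower and pricier: it reasons through subtasks first
print(ask("o1-preview", question))
```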
