An Independent Analysis of GPT-4o’s Classification Abilities

Article by Lars Wilk


OpenAI’s recent unveiling of GPT-4o marks a significant advancement in AI language models, transforming how we interact with them. The most impressive feature is the live interaction capability with ChatGPT, allowing for seamless conversational interruptions.

Despite a few hiccups during the live demo, the achievements of the OpenAI team are undeniably impressive. Best of all, immediately after the demo, OpenAI granted access to the GPT-4o API.

In this article, I present my independent analysis comparing the classification abilities of GPT-4o with GPT-4, GPT-4 Turbo, Google’s Gemini models, and PaLM 2 Unicorn, using an English dataset I created. Which of these models is the strongest at understanding English?

What’s New with GPT-4o?

GPT-4o introduces the concept of an Omni model, designed to seamlessly process text, audio, and video. OpenAI aims to democratize GPT-4 level intelligence, making it accessible even to free users. Enhanced quality and speed across more than 50 languages, combined with a lower price point, promise a more inclusive and globally accessible AI experience. Additionally, paid subscribers will benefit from five times the capacity compared to non-paid users. OpenAI also announced a desktop version of ChatGPT to facilitate real-time reasoning across audio, vision, and text interfaces.

How to Use the GPT-4o API

The new GPT-4o model follows the existing chat-completion API, ensuring backward compatibility and ease of use:

import asyncio

from openai import AsyncOpenAI

OPENAI_API_KEY = "<your-api-key>"

def openai_chat_resolve(response, strip_tokens=None) -> str:
    """Extract the assistant's message text from a chat-completion
    response, removing any unwanted tokens along the way."""
    if strip_tokens is None:
        strip_tokens = []
    if response and response.choices and len(response.choices) > 0:
        content = response.choices[0].message.content.strip()
        if content:
            for token in strip_tokens:
                content = content.replace(token, '')
            return content
    raise Exception(f'Cannot resolve response: {response}')

async def openai_chat_request(prompt: str, model_name: str, temperature=0.0):
    """Send a single-turn user prompt to the chat-completions endpoint."""
    message = {'role': 'user', 'content': prompt}
    client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return await client.chat.completions.create(
        model=model_name,
        messages=[message],
        temperature=temperature,
    )

response = asyncio.run(openai_chat_request(prompt="Hello!", model_name="gpt-4o-2024-05-13"))
print(openai_chat_resolve(response))

GPT-4o is also accessible via the ChatGPT interface.

Official Evaluation

OpenAI’s blog post includes evaluation scores on known datasets such as MMLU and HumanEval, showcasing GPT-4o’s state-of-the-art performance. However, many models claim superior performance on open datasets, often due to overfitting. Independent analyses using lesser-known datasets are crucial for a realistic assessment.

My Evaluation Dataset

I created a dataset of 200 sentences categorized under 50 topics, designed to make classification deliberately challenging. The dataset is manually labeled in English. For this evaluation, I used only the English version to avoid potential biases from using the same language model for both dataset creation and topic prediction. You can check out the dataset here.
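To give a feel for the structure (the field names below are my own hypothetical rendering, not the dataset's actual schema), each record pairs a sentence with exactly one of the 50 topic labels:

dataset = [
    # Hypothetical records; sentences and topics are illustrative only.
    {"sentence": "The striker curled the free kick into the top corner.", "topic": "Football"},
    {"sentence": "The central bank raised interest rates by 25 basis points.", "topic": "Monetary Policy"},
]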

Performance Results

I evaluated the following models:

  • GPT-4o: gpt-4o-2024-05-13
  • GPT-4: gpt-4-0613
  • GPT-4-Turbo: gpt-4-turbo-2024-04-09
  • Gemini 1.5 Pro: gemini-1.5-pro-preview-0409
  • Gemini 1.0: gemini-1.0-pro-002
  • PaLM 2 Unicorn: text-unicorn@001

The task was to match each sentence with the correct topic; from the predictions, I calculated an accuracy score and error rate for each model. A lower error rate indicates better performance.
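Concretely, such an evaluation loop can be sketched as follows, reusing the request helpers from above. The prompt wording, topic list, and dataset format here are my own illustration, not the exact setup used:

TOPICS = ["Football", "Monetary Policy"]  # illustrative; the real list has 50 topics

async def classify_sentence(sentence: str, model_name: str) -> str:
    # Ask the model to pick exactly one topic from the list.
    prompt = (
        "Classify the following sentence into exactly one of these topics:\n"
        f"{', '.join(TOPICS)}\n\n"
        f"Sentence: {sentence}\n"
        "Topic:"
    )
    response = await openai_chat_request(prompt, model_name)
    return openai_chat_resolve(response)

async def error_rate(dataset: list[dict], model_name: str) -> float:
    # Error rate = fraction of sentences whose predicted topic
    # does not match the manually assigned label.
    predictions = await asyncio.gather(
        *(classify_sentence(item["sentence"], model_name) for item in dataset)
    )
    errors = sum(
        pred.strip().lower() != item["topic"].lower()
        for pred, item in zip(predictions, dataset)
    )
    return errors / len(dataset)

Accuracy is then simply one minus the error rate, computed per model over the same 200 sentences.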

Conclusion

This analysis, based on a uniquely crafted English dataset, offers an independent view of the capabilities of these advanced language models. GPT-4o stands out with the lowest error rate, affirming OpenAI’s performance claims. Independent evaluations with diverse datasets are essential for a clearer picture of a model’s practical effectiveness beyond standardized benchmarks.

Note that the dataset is fairly small, and results may vary with different datasets. This evaluation was conducted using the English dataset only; a multilingual comparison will be conducted at a later time.
