This insight explores the confidence scores available through the OpenAI API and how to put them to use.

The first section delves into these scores and explains their significance using a custom chat interface. The second demonstrates how to apply them programmatically.

Understanding Confidence Scores

To begin, it’s important to understand what an LLM (Large Language Model) is doing for each token in its response:

  • The model generates a raw score (a logit) for each token in its vocabulary (around 100,000 values).
  • These values are transformed, via the softmax function, into what is often referred to as “probabilities,” which are the focus of this discussion.
  • A token is selected based on these values (sometimes the most likely one, sometimes not), which then becomes part of the response.

However, the term “probabilities” here is somewhat misleading. Mathematically, the values do qualify as a probability distribution (they add up to one), but they don’t necessarily reflect true confidence or likelihood in the way we might expect, so they should be treated with caution.
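To make the transformation concrete, here is a minimal sketch of the softmax step in Python. The logit values are invented for illustration; a real model produces one score per vocabulary token (roughly 100,000 of them), not three:

```python
import math

def softmax(logits):
    """Turn raw model scores into positive values that sum to one."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens at a single position.
logits = {"choose": 2.1, "pick": 1.8, "select": 0.4}
for token, p in zip(logits, softmax(list(logits.values()))):
    print(f"{token}: {p:.2f}")  # choose: 0.52, pick: 0.39, select: 0.09
```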

A useful way to think about these values is to consider them as “confidence” scores, though it’s crucial to remember that, much like humans, LLMs can be confident and still be wrong. The values themselves are not inherently meaningful without additional context or validation.

Example: Using a Chat Interface

An example of exploring these confidence scores can be seen in a chat interface where:

  • Confidence for each token in the LLM’s response is represented by a red underline—brighter red indicating lower confidence.
  • Hovering over a word reveals the top 10 possible tokens for that position, ranked by confidence score.

In one case, when asked to “pick a number,” the LLM chose the word “choose” despite it having only a 21% chance of being selected. This demonstrates that LLMs don’t always pick the most likely token unless configured to do so.
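To see how a 21% token can still win, here is a minimal sketch of weighted sampling versus greedy selection. Every score except the 21% is invented for illustration:

```python
import random

# Hypothetical next-token scores after "pick a number"; only the 21%
# for "choose" comes from the example above.
candidates = {"random": 0.35, "number": 0.28, "choose": 0.21, "seven": 0.16}

sampled = random.choices(list(candidates), weights=list(candidates.values()))[0]
greedy = max(candidates, key=candidates.get)  # roughly what temperature=0 does

print(f"sampled: {sampled}, greedy: {greedy}")  # sampled varies run to run
```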

Additionally, this interface shows how the model might struggle with questions that have no clear answer, offering insights into detecting possible hallucinations. For example, when asked to list famous people with an interpunct in their name, the model shows low confidence in its guesses. This visible uncertainty can be an early warning that an incorrect response is coming.

Hallucinations and Confidence Scores

The discussion also touches on the question of whether low confidence scores can help detect hallucinations—cases where the model generates false information. While low confidence often correlates with potential hallucinations, it’s not a foolproof indicator. Some hallucinations may come with high confidence, while low-confidence tokens might simply reflect natural variability in language.

For instance, when asked about the capital of Kazakhstan, the model shows uncertainty: the capital was renamed from Astana to Nur-Sultan in 2019 and back to Astana in 2022, so the model’s training data contains conflicting answers. The confidence scores reflect this inconsistency, highlighting how the model can still select an answer despite holding conflicting information.

Using Confidence Scores in Code

The next part of the discussion covers how to leverage confidence scores programmatically. For simple yes/no questions, it’s possible to compress the response into a single token and calculate the confidence score using OpenAI’s API. Key API settings include:

  • temperature=0 ensures the model selects the token with the highest confidence.
  • max_tokens=1 limits the response to a single token.
  • logprobs=True requests the log probabilities for each token.

Using this setup, one can extract the model’s confidence in its response, converting log probabilities back into regular probabilities using math.exp.
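A minimal sketch of this setup, assuming the current openai Python SDK and a placeholder model name; the exact response structure may differ across SDK versions:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def yes_no_confidence(question: str, model: str = "gpt-4o-mini"):
    """Ask a yes/no question; return the one-token answer and its confidence."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: Yes or No."},
            {"role": "user", "content": question},
        ],
        temperature=0,  # take the highest-confidence token
        max_tokens=1,   # force a single-token answer
        logprobs=True,  # return that token's log probability
    )
    token_info = response.choices[0].logprobs.content[0]
    return token_info.token, math.exp(token_info.logprob)

answer, confidence = yes_no_confidence("Is Paris the capital of France?")
print(answer, f"({confidence:.1%})")
```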

Expanding to Real-World Applications

The post extends this concept to more complex scenarios, such as verifying whether an image of a driver’s license is valid. By analyzing the model’s confidence in its answer, developers can determine when to flag responses for human review based on predefined confidence thresholds.
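The image-handling details are omitted here, but the routing step itself is short. A sketch reusing the hypothetical yes_no_confidence helper from the previous example, with an arbitrary 0.9 cutoff that would need tuning against real evaluation data:

```python
REVIEW_THRESHOLD = 0.9  # illustrative cutoff, not a recommended value

answer, confidence = yes_no_confidence(
    "Does the attached image show a valid driver's license?"
)
if confidence < REVIEW_THRESHOLD:
    print(f"Low confidence ({confidence:.1%}): flagged for human review")
else:
    print(f"Auto-processed: {answer} ({confidence:.1%})")
```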

This technique can also be applied to multiple-choice questions, allowing developers to extract not only the top token but also the top 10 options, along with their confidence scores.
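In the API this uses the top_logprobs parameter (an integer up to 20, passed alongside logprobs=True). A sketch continuing from the client above; the model name is assumed and the prompt is a placeholder:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any chat model exposing logprobs works
    messages=[{"role": "user", "content": "Answer with A, B, C, or D only: <question>"}],
    temperature=0,
    max_tokens=1,
    logprobs=True,
    top_logprobs=10,  # also return the ten most likely alternatives
)

# Rank the candidate answers for the single response token.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, f"{math.exp(candidate.logprob):.1%}")
```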

Conclusion

While confidence scores from LLMs aren’t a perfect solution for detecting accuracy or truthfulness, they can provide useful insights in certain scenarios. With careful application and evaluation, developers can make informed decisions about when to trust the model’s responses and when to intervene.

The final takeaway is that confidence scores, while not foolproof, can play a role in improving the reliability of LLM outputs—especially when combined with thoughtful design and ongoing calibration.
