LLM Evaluation & Scoring
What is LLM Evaluation?
Evaluation is a critical aspect of developing and deploying LLM applications. Teams typically combine several evaluation methods to score the performance of their AI application, depending on the use case and the stage of the development process.
Langfuse provides a flexible scoring system to capture all your evaluations in one place and make them actionable.
Why are LLM Evals Important?
LLM evaluation is crucial for improving the accuracy and robustness of language models, which in turn improves the user experience and trust in your AI application. It helps detect hallucinations and measure performance across diverse tasks. Structured evaluation in production is vital for continuously improving your application.
What are common Evaluation Methods?
Common evaluation methods for LLMs combine quantitative metrics with qualitative assessment to measure performance, coherence, and relevance. These evaluations help pinpoint the LLM application's strengths and areas for improvement, so that it produces accurate and contextually appropriate outputs.
Langfuse supports all of these evaluation methods through its open architecture and API (learn more about the Langfuse score object here). Depending on your needs in the development process, you can use one or more of the following evaluation methods:
1. Model-based Evaluation (LLM-as-a-Judge)
Model-based evaluations (LLM-as-a-judge) are a powerful tool to automatically assess LLM applications integrated with Langfuse. With this approach, an LLM scores a particular session, trace, or LLM call in Langfuse based on factors such as accuracy, toxicity, or hallucinations.
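To make the idea concrete, here is a minimal, self-contained sketch of an LLM-as-a-judge step. It is an illustration under stated assumptions, not Langfuse's implementation: the `JudgeScore` dataclass, the prompt wording, the `judge_accuracy` helper, and the `call_model` callback are all hypothetical, and the judge model is replaced by a stub so the example runs without network access. Attaching the resulting score to a trace in Langfuse is not shown.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical container for a judge result; not a Langfuse type.
@dataclass
class JudgeScore:
    name: str
    value: float
    comment: str

# Illustrative prompt; the wording and the JSON reply format are assumptions.
JUDGE_PROMPT = (
    "Rate the answer for factual accuracy on a scale from 0 to 1.\n"
    "Question: {question}\nAnswer: {answer}\n"
    'Reply with JSON: {{"score": <number>, "reasoning": "<text>"}}'
)

def judge_accuracy(question: str, answer: str,
                   call_model: Callable[[str], str]) -> JudgeScore:
    """Ask a judge model to grade one answer and parse its JSON verdict."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    parsed = json.loads(call_model(prompt))
    # Clamp to [0, 1] in case the judge replies outside the requested range.
    value = min(1.0, max(0.0, float(parsed["score"])))
    return JudgeScore(name="accuracy", value=value,
                      comment=str(parsed.get("reasoning", "")))

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return json.dumps({"score": 0.9, "reasoning": "Matches the reference."})

result = judge_accuracy("What is the capital of France?", "Paris", fake_model)
print(result)
```

In a real setup, `call_model` would wrap the LLM client of your choice, and the name, value, and comment would be recorded as a score on the corresponding trace or observation.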
There are two ways to run model-based evaluations in Langfuse:
2. Manual Annotation (in UI)
With manual annotations, you can annotate a subset of traces and observations by hand. This allows you to collaborate with your team and add scores via the Langfuse UI. Annotations can be used to establish a baseline for your evaluation metrics and to compare them with automated evaluations.
3. User Feedback
Capturing user feedback in your AI application can be a valuable evaluation metric. You can add explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output, human-in-the-loop) user feedback to your LLM traces in Langfuse.
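As a rough sketch of how explicit and implicit feedback signals might be normalized before being attached to a trace, the hypothetical helper below maps three kinds of feedback to a name/value pair. The function name, the score names (`user_thumbs`, `user_stars`, `output_accepted`), the dictionary keys, and the trace ID are assumptions chosen for illustration; sending the result to Langfuse is not shown.

```python
from typing import Union

def feedback_to_score(trace_id: str, kind: str, raw: Union[bool, int]) -> dict:
    """Convert one raw user-feedback event into a score-like dictionary."""
    if kind == "thumbs":
        # Explicit feedback: thumbs up -> 1, thumbs down -> 0.
        return {"trace_id": trace_id, "name": "user_thumbs", "value": 1 if raw else 0}
    if kind == "stars":
        # Explicit feedback: keep the 1-5 star rating as a numeric value.
        stars = int(raw)
        if not 1 <= stars <= 5:
            raise ValueError(f"star rating out of range: {stars}")
        return {"trace_id": trace_id, "name": "user_stars", "value": stars}
    if kind == "accepted":
        # Implicit feedback: the user accepted or rejected a generated output.
        return {"trace_id": trace_id, "name": "output_accepted", "value": 1 if raw else 0}
    raise ValueError(f"unknown feedback kind: {kind}")

# "trace-123" is a made-up trace ID used only for this example.
print(feedback_to_score("trace-123", "thumbs", True))
print(feedback_to_score("trace-123", "stars", 4))
```

Keeping each signal under its own score name makes it possible to compare explicit ratings with implicit signals later instead of blending them into a single number.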
4. Custom via SDKs/API
Langfuse gives you full flexibility to ingest custom scores via the Langfuse SDKs or API. The scoring workflow allows you to run custom quality checks (e.g. valid structured output format) on the output of your workflows at runtime, or to run custom external evaluation workflows.
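The structured-output check mentioned above could look roughly like the following sketch. It validates that a model output is a JSON object containing a set of required keys and returns a pass/fail score with a short comment. The function name, the score name `valid_structured_output`, and the returned dictionary shape are assumptions for illustration; ingesting the result through the Langfuse SDKs or API is not shown.

```python
import json
from typing import Iterable

SCORE_NAME = "valid_structured_output"  # hypothetical score name

def structured_output_score(output: str, required_keys: Iterable[str]) -> dict:
    """Return value 1 if output is a JSON object with all required keys, else 0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": SCORE_NAME, "value": 0, "comment": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"name": SCORE_NAME, "value": 0, "comment": "top-level value is not an object"}
    missing = sorted(set(required_keys) - set(parsed))
    if missing:
        return {"name": SCORE_NAME, "value": 0, "comment": "missing keys: " + ", ".join(missing)}
    return {"name": SCORE_NAME, "value": 1, "comment": "ok"}

good = structured_output_score('{"title": "Report", "summary": "All clear"}', ["title", "summary"])
bad = structured_output_score('{"title": "Report"}', ["title", "summary"])
print(good)
print(bad)
```

A check like this is cheap enough to run on every output at runtime, and the comment field records why a given output failed.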
Getting Started
Learn how to configure and utilize scores in Langfuse to assess quality, accuracy, style, and security metrics in your LLM applications.