Backed by Y Combinator

Run Evals while developing, in your CI/CD pipeline, and in Production

Detect hallucinations and bad outputs, measure performance, and prevent regressions.

A library of 50+ evaluation metrics for your entire pipeline

All evaluators can be run programmatically using our SDK, or automatically using our SaaS platform.
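For illustration only (this is not the Athina SDK; the evaluator, dataset, and result names below are hypothetical), a programmatic run usually amounts to looping an evaluator over a small dataset and failing the build when any check fails:

```python
# Minimal sketch of a programmatic eval run with a hypothetical pass/fail
# evaluator interface. Not the Athina SDK; adapt the names to your setup.
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    passed: bool
    reason: str

def does_response_answer_query(query: str, response: str) -> EvalResult:
    """Toy stand-in for an LLM-graded check: naive word overlap between query and response."""
    overlap = set(query.lower().split()) & set(response.lower().split())
    return EvalResult("does_response_answer_query", len(overlap) >= 2, f"overlap={sorted(overlap)}")

dataset = [
    {"query": "Which spaceship was first to land on the moon?",
     "response": "Apollo 11's lunar module Eagle landed on the moon in 1969."},
]

results = [does_response_answer_query(row["query"], row["response"]) for row in dataset]
failures = [r for r in results if not r.passed]
assert not failures, f"Eval failures: {failures}"  # a failing assert fails the CI job
```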

Evaluator categories: LLM, Functions, Ragas, Conversation, Safety

Answer Similarity (Pass/Fail)
Checks if the response is similar to the expected response (ground truth).

Context Similarity (Pass/Fail)
Checks if the context is similar to the response.

Groundedness (Pass/Fail)
Checks if the response is grounded in the provided context.

Summarization Accuracy (Pass/Fail)
Checks if the summary has any discrepancies from the source.

Response Faithfulness (Pass/Fail)
Checks if the response is faithful to the provided context.

Context Sufficiency (Pass/Fail)
Checks if the context contains enough information to answer the user's query.

Answer Completeness (Pass/Fail)
Checks if the response answers the user's query completely.

Custom Prompt (Pass/Fail)
Evaluates the response using a custom prompt.

Grading Criteria (Pass/Fail)
Checks the response according to your grading criteria.

Ragas metrics return a score between 0 and 1 (see the thresholding sketch after this list).

Ragas Context Relevancy (score 0 to 1)
Measures the relevancy of the retrieved context, calculated based on both the query and contexts.

Ragas Faithfulness (score 0 to 1)
Measures the factual consistency of the generated answer against the given context.

Ragas Answer Correctness (score 0 to 1)
Checks the accuracy of the generated LLM response when compared to the ground truth.

Ragas Answer Relevancy (score 0 to 1)
Measures how pertinent the generated response is to the given prompt.

Ragas Answer Semantic Similarity (score 0 to 1)
Measures the semantic resemblance between the generated response and the expected response.

Ragas Context Precision (score 0 to 1)
Evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not.

Ragas Context Recall (score 0 to 1)
Measures the extent to which the retrieved context aligns with the expected response.

Ragas Coherence (score 0 to 1)
Checks if the generated response presents ideas, information, or arguments in a logical and organized manner.

Ragas Conciseness (score 0 to 1)
Checks if the generated response conveys information or ideas clearly and efficiently, without unnecessary or redundant details.

Ragas Harmfulness (score 0 to 1)
Checks the potential of the generated response to cause harm to individuals, groups, or society at large.

Ragas Maliciousness (score 0 to 1)
Checks the potential of the generated response to harm, deceive, or exploit users.

Conversation Resolution (Pass/Fail)
Checks if every user message in the conversation was successfully resolved.

Conversation Coherence (Pass/Fail)
Checks if the conversation was coherent given the previous messages in the chat.

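Where a metric returns a 0 to 1 score rather than a verdict, a common pattern is to map it onto Pass/Fail with a per-metric threshold. A minimal sketch (the threshold values are illustrative, not recommendations):

```python
# Convert scored metrics (0.0 to 1.0) into pass/fail verdicts via thresholds.
# Threshold values below are illustrative only.
SCORE_THRESHOLDS = {
    "ragas_faithfulness": 0.8,
    "ragas_answer_relevancy": 0.7,
    "ragas_context_recall": 0.6,
}

def to_verdict(metric: str, score: float) -> str:
    return "Pass" if score >= SCORE_THRESHOLDS[metric] else "Fail"

scores = {"ragas_faithfulness": 0.83, "ragas_answer_relevancy": 0.65, "ragas_context_recall": 0.72}
for metric, score in scores.items():
    print(f"{metric}: {score:.2f} -> {to_verdict(metric, score)}")
```
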
WORKS WITH RAG

Evaluate your entire RAG pipeline

Athina evaluators run on your entire RAG pipeline, not just on the responses.
User Query

Which spaceship was first to land on the moon?

Retrieved Context

Neil Armstrong was the first astronaut to land on the moon in 1969

Insufficient context
The retrieved context doesn’t contain information about the name of the spaceship.
Prompt Sent

You are an expert...

Prompt Response

The first spaceship to land on the moon in 1969 carried astronauts Neil Armstrong and Buzz...
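A rough sketch of what pipeline-level evaluation looks like in code: checks attach to individual stages of a trace (retrieval, response), not only to the final answer. Both checks below are crude keyword heuristics standing in for LLM-graded evaluators, and every name is hypothetical:

```python
# Sketch: run stage-specific checks over a single RAG trace.
# Heuristics stand in for LLM-graded evaluators; names are hypothetical.
trace = {
    "query": "Which spaceship was first to land on the moon?",
    "context": "Neil Armstrong was the first astronaut to land on the moon in 1969",
    "prompt": "You are an expert...",
    "response": "The first spaceship to land on the moon in 1969 carried astronauts "
                "Neil Armstrong and Buzz...",
}

def context_sufficiency(trace: dict) -> dict:
    """Retrieval-stage check: does the context name any spacecraft?"""
    spacecraft_terms = ("spaceship", "spacecraft", "apollo", "module", "eagle")
    ok = any(term in trace["context"].lower() for term in spacecraft_terms)
    return {"eval": "Context Sufficiency", "passed": ok,
            "reason": "ok" if ok else "context does not name the spaceship"}

def answer_completeness(trace: dict) -> dict:
    """Response-stage check: does the response actually name the spaceship asked about?"""
    ok = any(term in trace["response"].lower() for term in ("apollo", "eagle"))
    return {"eval": "Answer Completeness", "passed": ok,
            "reason": "ok" if ok else "response does not name the spaceship"}

for check in (context_sufficiency, answer_completeness):
    print(check(trace))
# Both checks fail for this trace; the Context Sufficiency failure matches the
# "Insufficient context" flag shown above (the context never names the spaceship).
```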

Custom Evaluators

Bring your own eval

Create your own evaluator on Athina in a matter of seconds, using either an LLM prompt or a custom function.

Prompt | Function

Describe the prompt you would like to use for the evaluator:

Response must directly address the user's query

Response {response}

User Query {query}
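
A sketch of the prompt-based flavor, assuming the OpenAI Python client as the grader model (any LLM client would do); the placeholder filling mirrors the {response} and {query} variables above, but none of this is Athina's actual implementation:

```python
# Sketch of a custom prompt evaluator: fill the template, then ask a grader model
# for a one-word Pass/Fail verdict. Assumes the OpenAI Python SDK and OPENAI_API_KEY;
# the model name is an arbitrary example, not a requirement.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "Response must directly address the user's query.\n"
    "Response: {response}\n"
    "User Query: {query}\n"
    "Answer with exactly one word: Pass or Fail."
)

def custom_prompt_eval(query: str, response: str, model: str = "gpt-4o-mini") -> str:
    graded = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TEMPLATE.format(query=query, response=response)}],
    )
    return graded.choices[0].message.content.strip()

print(custom_prompt_eval(
    query="Which spaceship was first to land on the moon?",
    response="Apollo 11's lunar module Eagle was the first crewed spacecraft to land on the moon.",
))
```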

ANALYTICS
Track model performance over time
Track a historical record of your pipeline's performance and usage metrics over time.
[Dashboard chart: per-day averages for Response Faithfulness, Answer Relevancy, Context Relevancy, Answer Completeness, Context Sufficiency, and Groundedness; e.g., Answer Relevancy per day, 0.65 avg.]
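Under the hood, a per-day average is just a group-by over logged eval results; a minimal sketch with invented sample records:

```python
# Sketch: compute a per-day average for one metric from logged eval results.
# The sample records are invented for illustration.
from collections import defaultdict
from statistics import mean

logged_results = [
    {"date": "2024-05-01", "metric": "answer_relevancy", "score": 0.72},
    {"date": "2024-05-01", "metric": "answer_relevancy", "score": 0.58},
    {"date": "2024-05-02", "metric": "answer_relevancy", "score": 0.65},
]

by_day = defaultdict(list)
for record in logged_results:
    if record["metric"] == "answer_relevancy":
        by_day[record["date"]].append(record["score"])

for day, scores in sorted(by_day.items()):
    print(day, round(mean(scores), 2))  # e.g. "2024-05-01 0.65"
```
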
Frequently Asked Questions

Everything you need to know about Evals.

Can't find the answer you're looking for?

Feel free to contact us

Why do LLM evals work?

Do all evals require a labeled dataset?

Can I create custom evaluators?

Which models can I use as evaluators?

How can I use evals in my CI / CD pipeline?
