If you only measure three things about your AI system, measure these. Factuality, relevance, and faithfulness are the metrics that directly determine whether users trust your system enough to keep using it. In this lesson, I will define each one precisely, show you how to compute them, and explain the trade-offs you will encounter in practice.
Every AI failure I have debugged in production falls into one of three categories:

- The system stated something false (a factuality failure).
- The system answered a different question than the one asked (a relevance failure).
- The system ignored the context it was given and answered from its own training data (a faithfulness failure).

These are distinct failure modes with distinct causes and distinct fixes. Collapsing them into a single "quality score" hides the signal you need to improve.
Definition: Does the system's output contain statements that are verifiably true?
Factuality measures whether the response aligns with ground truth or real-world knowledge. This is the hallucination metric. When a system invents a statistic, cites a paper that does not exist, or states a date incorrectly, factuality catches it.
Approach 1: Claim Decomposition + Verification
Break the response into individual claims, then verify each one.
```python
async def measure_factuality(response: str, knowledge_base: list[str]) -> float:
    # Step 1: Decompose response into atomic claims
    claims = await extract_claims(response)
    # Example: ["The refund window is 30 days",
    #           "Refunds are processed within 5 business days"]

    # Step 2: Verify each claim against the knowledge base
    verified = 0
    for claim in claims:
        is_supported = await verify_claim(claim, knowledge_base)
        if is_supported:
            verified += 1

    # Step 3: Factuality = verified claims / total claims
    return verified / len(claims) if claims else 0.0
```
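`extract_claims` and `verify_claim` above are placeholders for LLM calls. To make the pipeline concrete, here is a deliberately naive, deterministic stand-in for `verify_claim` that treats a claim as supported when most of its content words appear in a single knowledge-base entry. This is a toy sketch for smoke-testing the plumbing, not a real verifier; in practice you would back it with an LLM or an NLI model.

```python
def verify_claim_naive(claim: str, knowledge_base: list[str]) -> bool:
    """Toy stand-in for an LLM/NLI verifier: a claim counts as supported
    if >= 60% of its content words appear in one knowledge-base entry."""
    stopwords = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
    words = {w for w in claim.lower().split() if w not in stopwords}
    if not words:
        return False
    for doc in knowledge_base:
        doc_words = set(doc.lower().split())
        if len(words & doc_words) / len(words) >= 0.6:
            return True
    return False

kb = ["The refund window is 30 days for all purchases."]
print(verify_claim_naive("The refund window is 30 days", kb))  # True
print(verify_claim_naive("Refunds take 5 business days", kb))  # False
```

The word-overlap heuristic will miss paraphrases and negations, which is exactly why production verifiers use a model; but it is useful for validating the claim-loop logic before spending LLM calls.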
Approach 2: Reference-Based Scoring
When you have a known-correct answer, compare directly.
```python
from deepeval.metrics import HallucinationMetric

metric = HallucinationMetric(threshold=0.8)
# Compares actual_output against the provided context
# Returns a score where lower hallucination = higher factuality
```
Approach 3: Calibration-Aware Factuality
The latest research from 2025 reframes factuality as a calibration problem. A well-calibrated system should express high confidence when it is correct and low confidence when it is uncertain. Benchmarks like SimpleQA measure this by grading responses as correct, incorrect, or "not attempted," explicitly rewarding systems that abstain when uncertain rather than confabulating.
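The scoring logic behind that grading scheme is simple to sketch. Assuming each response has already been graded as `correct`, `incorrect`, or `not_attempted`, the key metric is accuracy conditioned on attempting, which rewards abstention over confabulation (the function name and metric keys below are my own, not a SimpleQA API):

```python
from collections import Counter

def calibration_scores(grades: list[str]) -> dict[str, float]:
    """grades: one of 'correct', 'incorrect', 'not_attempted' per response."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall_accuracy": counts["correct"] / total,
        # Accuracy conditioned on attempting: abstaining on hard
        # questions raises this, guessing wrong lowers it.
        "accuracy_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
        "abstention_rate": counts["not_attempted"] / total,
    }

scores = calibration_scores(["correct", "correct", "incorrect", "not_attempted"])
print(scores["accuracy_given_attempted"])  # 2 of 3 attempted -> ~0.67
```

A system that confabulates on every hard question can match the overall accuracy of one that abstains, but the conditional accuracy separates them immediately.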
When I use factuality: Any system that surfaces facts to users -- customer-facing RAG, knowledge bases, report generators, data analysis tools.
Definition: Does the system's output actually address the user's question?
Relevance measures alignment between the query and the response. A system can be perfectly factual but completely irrelevant. If a user asks about pricing and gets an accurate history of the company, factuality is 1.0 and relevance is 0.0.
Approach 1: Answer Relevancy (RAGAS)
RAGAS measures relevance by generating synthetic questions from the answer, then checking if those questions match the original query.
```python
from ragas.metrics import answer_relevancy
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": [
        "What programming languages does the API support?"
    ],
    "answer": [
        "The API supports Python, TypeScript, and Go. "
        "SDKs are available on our GitHub."
    ],
    "contexts": [[
        "Our API provides official SDKs for Python 3.8+, "
        "TypeScript 4.5+, and Go 1.19+."
    ]],
})

result = evaluate(dataset, metrics=[answer_relevancy])
print(result["answer_relevancy"])
# 0.94 -- high relevance, the answer addresses the question directly
```
The intuition: if I can reconstruct the original question from the answer, the answer is relevant. If I cannot, the answer wandered off.
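That reverse-question intuition can be sketched without RAGAS. The toy version below scores relevance as the average bag-of-words cosine similarity between the original query and questions regenerated from the answer. In the real metric an LLM generates the questions and embeddings measure similarity; here the regenerated questions are hand-written and the similarity is word overlap, purely for illustration.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def answer_relevancy_toy(query: str, regenerated: list[str]) -> float:
    # Average similarity of regenerated questions to the original query
    return sum(cosine(query, q) for q in regenerated) / len(regenerated)

query = "what languages does the api support"
# Questions an LLM might regenerate from the answer (hand-written here)
regenerated = ["what languages does the api support",
               "which languages does the api support"]
print(answer_relevancy_toy(query, regenerated))  # > 0.9 -- relevant
```

If the answer had wandered into company history, the regenerated questions would be about history, the similarity to the pricing-style query would collapse, and the score would fall with it.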
Approach 2: LLM-as-Judge with a Relevance Rubric
```python
import json

RELEVANCE_RUBRIC = """
Score 1 (PASS): The response directly addresses the user's question.
All key aspects of the question are covered. No major tangents.

Score 0 (FAIL): The response misses the user's question, addresses
a different topic, or contains excessive irrelevant information.
"""

async def judge_relevance(query: str, response: str) -> dict:
    prompt = f"""Evaluate whether the response is relevant to the query.

Query: {query}
Response: {response}

{RELEVANCE_RUBRIC}

Return JSON: {{"score": 0 or 1, "reasoning": "..."}}"""
    result = await judge_llm.generate(prompt)
    return json.loads(result)
I prefer binary scoring for relevance. In my experience, a response either addresses the question or it does not. Graded scales introduce inconsistency without adding useful signal.
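Binary judges can still flip on borderline cases from sample to sample. One cheap stabilizer is to sample the judge several times and take the majority verdict. The sketch below assumes a `judge_relevance` with the shape defined above; it is stubbed with a constant verdict so the sketch runs standalone, where a real judge would vary across samples.

```python
import asyncio

async def judge_relevance(query: str, response: str) -> dict:
    # Stub standing in for the LLM judge; a real judge may return
    # different verdicts across repeated samples.
    return {"score": 1, "reasoning": "addresses the question"}

async def judge_relevance_voted(query: str, response: str, n: int = 3) -> int:
    # Sample the binary judge n times and take the majority verdict
    verdicts = await asyncio.gather(
        *(judge_relevance(query, response) for _ in range(n))
    )
    votes = sum(v["score"] for v in verdicts)
    return 1 if votes * 2 > n else 0

score = asyncio.run(judge_relevance_voted(
    "What is the refund window?",
    "Refunds are accepted within 30 days.",
))
print(score)  # 1 -- all three sampled verdicts voted PASS
```

Use an odd `n` so the vote cannot tie; three samples is usually enough to smooth out judge noise without tripling your evaluation budget unreasonably.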
When I use relevance: Every system. There is no scenario where answering the wrong question is acceptable.
Definition: Is the system's output grounded in the evidence it was given?
Faithfulness is the metric that matters most for RAG systems. It asks: did the system use the retrieved documents to generate its response, or did it ignore them and rely on its parametric knowledge?
This is different from factuality. A response can be factually correct (matches reality) but unfaithful (the model "knew" the answer from training data and ignored the retrieved context). This matters because retrieval context is your control surface. If the model ignores it, you cannot steer the system.
Approach 1: RAGAS Faithfulness
RAGAS decomposes the response into statements, then checks whether each statement can be inferred from the retrieved context.
```python
from ragas.metrics import faithfulness
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["When was the company founded?"],
    "answer": [
        "The company was founded in 2019 by Jane Smith. "
        "It has since grown to 500 employees."
    ],
    "contexts": [[
        "Acme Corp was founded in 2019 by Jane Smith.",
        "The company is headquartered in Austin, Texas."
    ]],
})

result = evaluate(dataset, metrics=[faithfulness])
print(result["faithfulness"])
# 0.5 -- only 1 of 2 claims is supported by context
#   "founded in 2019 by Jane Smith" = supported
#   "grown to 500 employees" = NOT in the context (unfaithful)
```
This example illustrates a subtle and common failure. The "500 employees" claim might be factually true (the model may know this from training data), but it is unfaithful because the retrieved documents do not support it. In a RAG system, this is a problem. If the employee count changes and you update your documents, the model might still output stale training-data knowledge.
Approach 2: DeepEval Faithfulness
```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="Founded in 2019 by Jane Smith. Now 500 employees.",
    retrieval_context=[
        "Acme Corp was founded in 2019 by Jane Smith.",
        "The company is headquartered in Austin, Texas."
    ]
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score)   # 0.5
print(metric.reason)  # "1 of 2 claims unsupported by context"
```
When I use faithfulness: Any RAG system. Any system where you provide context and expect the model to use it.
These metrics are independent axes, not a hierarchy.
```
                  Factual
                     |
                     |
Faithful ------------+------------ Unfaithful
                     |
                     |
                Not Factual
```
A response can be:

- Factual and faithful: the ideal case.
- Factual but unfaithful: correct by luck, drawn from training data instead of your context.
- Faithful but not factual: the model used the context, but the context itself was wrong or stale.
- Neither factual nor faithful: the worst case, usually a prompting or model-capability problem.

Each combination points to a different root cause and a different fix. This is why measuring all three independently is essential.
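The quadrant view translates directly into a triage helper. The sketch below maps a (factuality, faithfulness) score pair to a likely root cause; the 0.8 cutoff and the diagnosis strings are illustrative defaults I chose, not standard values.

```python
def diagnose(factuality: float, faithfulness: float,
             threshold: float = 0.8) -> str:
    """Map a (factuality, faithfulness) pair to a likely root cause.
    The 0.8 cutoff and the messages are illustrative, not standard."""
    factual = factuality >= threshold
    faithful = faithfulness >= threshold
    if factual and faithful:
        return "healthy: grounded and correct"
    if factual:
        return "correct by luck: model bypassed context -- tighten grounding"
    if faithful:
        return "bad evidence: context is wrong or stale -- fix retrieval/corpus"
    return "broken: neither grounded nor correct -- revisit prompt and model"

print(diagnose(0.95, 0.40))
# correct by luck: model bypassed context -- tighten grounding
```

A dashboard that only reported an averaged "quality score" would show the same number for the "correct by luck" and "bad evidence" cases, even though one is fixed in the prompt and the other in the retrieval pipeline.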
Based on what I have seen work in production:
| Metric | Minimum Viable | Production Target | Notes |
|---|---|---|---|
| Factuality | 0.80 | 0.95+ | Below 0.80, users notice errors |
| Relevance | 0.85 | 0.95+ | Below 0.85, users feel ignored |
| Faithfulness | 0.75 | 0.90+ | Below 0.75, RAG is not adding value |
These are starting points. Calibrate to your domain. A medical Q&A system needs 0.99 factuality. A creative writing assistant can tolerate 0.70.
Take 10 test cases from the golden dataset you built in Lesson 2. Run each one through your system. Score every response on all three metrics. Record the results in a table like this:
```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    case_id: str
    input: str
    factuality: float
    relevance: float
    faithfulness: float
    notes: str

results: list[MetricResult] = []

# After scoring all 10 cases:
for r in results:
    print(
        f"{r.case_id}: F={r.factuality:.2f} "
        f"R={r.relevance:.2f} Fa={r.faithfulness:.2f} "
        f"| {r.notes}"
    )

# Compute your baselines
avg_factuality = sum(r.factuality for r in results) / len(results)
avg_relevance = sum(r.relevance for r in results) / len(results)
avg_faithfulness = sum(r.faithfulness for r in results) / len(results)

print(f"\nBaselines: F={avg_factuality:.2f} "
      f"R={avg_relevance:.2f} Fa={avg_faithfulness:.2f}")

# Compare against thresholds
thresholds = {"factuality": 0.80, "relevance": 0.85, "faithfulness": 0.75}
for name, baseline in [
    ("factuality", avg_factuality),
    ("relevance", avg_relevance),
    ("faithfulness", avg_faithfulness),
]:
    status = "PASS" if baseline >= thresholds[name] else "FAIL"
    print(f"  {name}: {baseline:.2f} vs {thresholds[name]} -> {status}")
```
Write down these baselines. They are the numbers your regression tests will protect in the next lesson.
You have metrics and baselines. But right now you are running them manually. Next, we automate these evaluations into your CI/CD pipeline so they run on every change, catching regressions before they reach users.