If you only measure three things about your AI system, measure these. Factuality, relevance, and faithfulness are the metrics that directly determine whether users trust your system enough to keep using it. In this lesson, I will define each one precisely, show you how to compute them, and explain the trade-offs you will encounter in practice.
Every AI failure I have debugged in production falls into one of three categories:

- The system stated something false (a factuality failure).
- The system answered a different question than the one asked (a relevance failure).
- The system ignored the context it was given and answered from its own training data (a faithfulness failure).

These are distinct failure modes with distinct causes and distinct fixes. Collapsing them into a single "quality score" hides the signal you need to improve.
Definition: Does the system's output contain statements that are verifiably true?
Factuality measures whether the response aligns with ground truth or real-world knowledge. This is the hallucination metric. When a system invents a statistic, cites a paper that does not exist, or states a date incorrectly, factuality catches it.
Approach 1: Claim Decomposition + Verification
Break the response into individual claims, then verify each one.
```python
async def measure_factuality(response: str, knowledge_base: list[str]) -> float:
    # Step 1: Decompose response into atomic claims
    claims = await extract_claims(response)
    # Example: ["The refund window is 30 days",
    #           "Refunds are processed within 5 business days"]

    # Step 2: Verify each claim against the knowledge base
    verified = 0
    for claim in claims:
        is_supported = await verify_claim(claim, knowledge_base)
        if is_supported:
            verified += 1

    # Step 3: Factuality = verified claims / total claims
    return verified / len(claims) if claims else 0.0
```
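`extract_claims` and `verify_claim` above are placeholders for LLM calls. To make the pipeline concrete, here is a deliberately naive, deterministic stand-in for `verify_claim` that treats a claim as supported when most of its content words appear in a single knowledge-base entry. This is a toy sketch for smoke-testing the plumbing, not a real verifier; in practice you would back it with an LLM or an NLI model.

```python
def verify_claim_naive(claim: str, knowledge_base: list[str]) -> bool:
    """Toy stand-in for an LLM/NLI verifier: a claim counts as supported
    if >= 60% of its content words appear in one knowledge-base entry."""
    stopwords = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
    words = {w for w in claim.lower().split() if w not in stopwords}
    if not words:
        return False
    for doc in knowledge_base:
        doc_words = set(doc.lower().split())
        if len(words & doc_words) / len(words) >= 0.6:
            return True
    return False

kb = ["The refund window is 30 days for all purchases."]
print(verify_claim_naive("The refund window is 30 days", kb))  # True
print(verify_claim_naive("Refunds take 5 business days", kb))  # False
```

The word-overlap heuristic will miss paraphrases and negations, which is exactly why production verifiers use a model; but it is useful for validating the claim-loop logic before spending LLM calls.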
Approach 2: Reference-Based Scoring
When you have a known-correct answer, compare directly.
```python
from deepeval.metrics import HallucinationMetric

metric = HallucinationMetric(threshold=0.8)
# Compares actual_output against the provided context
# Returns a score where lower hallucination = higher factuality
```
Approach 3: Calibration-Aware Factuality
The latest research from 2025 reframes factuality as a calibration problem. A well-calibrated system should express high confidence when it is correct and low confidence when it is uncertain. Benchmarks like SimpleQA measure this by grading responses as correct, incorrect, or "not attempted," explicitly rewarding systems that abstain when uncertain rather than confabulating.
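The scoring logic behind that grading scheme is simple to sketch. Assuming each response has already been graded as `correct`, `incorrect`, or `not_attempted`, the key metric is accuracy conditioned on attempting, which rewards abstention over confabulation (the function name and metric keys below are my own, not a SimpleQA API):

```python
from collections import Counter

def calibration_scores(grades: list[str]) -> dict[str, float]:
    """grades: one of 'correct', 'incorrect', 'not_attempted' per response."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall_accuracy": counts["correct"] / total,
        # Accuracy conditioned on attempting: abstaining on hard
        # questions raises this, guessing wrong lowers it.
        "accuracy_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
        "abstention_rate": counts["not_attempted"] / total,
    }

scores = calibration_scores(["correct", "correct", "incorrect", "not_attempted"])
print(scores["accuracy_given_attempted"])  # 2 of 3 attempted -> ~0.67
```

A system that confabulates on every hard question can match the overall accuracy of one that abstains, but the conditional accuracy separates them immediately.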
When I use factuality: Any system that surfaces facts to users -- customer-facing RAG, knowledge bases, report generators, data analysis tools.
Definition: Does the system's output actually address the user's question?
Relevance measures alignment between the query and the response. A system can be perfectly factual but completely irrelevant. If a user asks about pricing and gets an accurate history of the company, factuality is 1.0 and relevance is 0.0.
Approach 1: Answer Relevancy (RAGAS)
RAGAS measures relevance by generating synthetic questions from the answer, then checking if those questions match the original query.
```python
from ragas.metrics import answer_relevancy
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": [
        "What programming languages does the API support?"
    ],
    "answer": [
        "The API supports Python, TypeScript, and Go. "
        "SDKs are available on our GitHub."
    ],
    "contexts": [[
        "Our API provides official SDKs for Python 3.8+, "
        "TypeScript 4.5+, and Go 1.19+."
    ]],
})

result = evaluate(dataset, metrics=[answer_relevancy])
print(result["answer_relevancy"])
# 0.94 -- high relevance, the answer addresses the question directly
```
The intuition: if I can reconstruct the original question from the answer, the answer is relevant. If I cannot, the answer wandered off.
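That reverse-question intuition can be sketched without RAGAS. The toy version below scores relevance as the average bag-of-words cosine similarity between the original query and questions regenerated from the answer. In the real metric an LLM generates the questions and embeddings measure similarity; here the regenerated questions are hand-written and the similarity is word overlap, purely for illustration.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def answer_relevancy_toy(query: str, regenerated: list[str]) -> float:
    # Average similarity of regenerated questions to the original query
    return sum(cosine(query, q) for q in regenerated) / len(regenerated)

query = "what languages does the api support"
# Questions an LLM might regenerate from the answer (hand-written here)
regenerated = ["what languages does the api support",
               "which languages does the api support"]
print(answer_relevancy_toy(query, regenerated))  # > 0.9 -- relevant
```

If the answer had wandered into company history, the regenerated questions would be about history, the similarity to the pricing-style query would collapse, and the score would fall with it.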
Approach 2: LLM-as-Judge with a Relevance Rubric
```python
import json

RELEVANCE_RUBRIC = """
Score 1 (PASS): The response directly addresses the user's question.
All key aspects of the question are covered. No major tangents.

Score 0 (FAIL): The response misses the user's question, addresses
a different topic, or contains excessive irrelevant information.
"""

async def judge_relevance(query: str, response: str) -> dict:
    prompt = f"""Evaluate whether the response is relevant to the query.

Query: {query}
Response: {response}

{RELEVANCE_RUBRIC}

Return JSON: {{"score": 0 or 1, "reasoning": "..."}}"""
    result = await judge_llm.generate(prompt)
    return json.loads(result)
I prefer binary scoring for relevance. In my experience, a response either addresses the question or it does not. Graded scales introduce inconsistency without adding useful signal.
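Binary judges can still flip on borderline cases from sample to sample. One cheap stabilizer is to sample the judge several times and take the majority verdict. The sketch below assumes a `judge_relevance` with the shape defined above; it is stubbed with a constant verdict so the sketch runs standalone, where a real judge would vary across samples.

```python
import asyncio

async def judge_relevance(query: str, response: str) -> dict:
    # Stub standing in for the LLM judge; a real judge may return
    # different verdicts across repeated samples.
    return {"score": 1, "reasoning": "addresses the question"}

async def judge_relevance_voted(query: str, response: str, n: int = 3) -> int:
    # Sample the binary judge n times and take the majority verdict
    verdicts = await asyncio.gather(
        *(judge_relevance(query, response) for _ in range(n))
    )
    votes = sum(v["score"] for v in verdicts)
    return 1 if votes * 2 > n else 0

score = asyncio.run(judge_relevance_voted(
    "What is the refund window?",
    "Refunds are accepted within 30 days.",
))
print(score)  # 1 -- all three sampled verdicts voted PASS
```

Use an odd `n` so the vote cannot tie; three samples is usually enough to smooth out judge noise without tripling your evaluation budget unreasonably.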
When I use relevance: Every system. There is no scenario where answering the wrong question is acceptable.
Definition: Is the system's output grounded in the evidence it was given?
Faithfulness is the metric that matters most for RAG systems. It asks: did the system use the retrieved documents to generate its response, or did it ignore them and rely on its parametric knowledge?
This is different from factuality. A response can be factually correct (matches reality) but unfaithful (the model "knew" the answer from training data and ignored the retrieved context). This matters because retrieval context is your control surface. If the model ignores it, you cannot steer the system.
Approach 1: RAGAS Faithfulness
RAGAS decomposes the response into statements, then checks whether each statement can be inferred from the retrieved context.
```python
from ragas.metrics import faithfulness
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["When was the company founded?"],
    "answer": [
        "The company was founded in 2019 by Jane Smith. "
        "It has since grown to 500 employees."
    ],
    "contexts": [[
        "Acme Corp was founded in 2019 by Jane Smith.",
        "The company is headquartered in Austin, Texas."
    ]],
})

result = evaluate(dataset, metrics=[faithfulness])
print(result["faithfulness"])
# 0.5 -- only 1 of 2 claims is supported by context
#   "founded in 2019 by Jane Smith" = supported
#   "grown to 500 employees" = NOT in the context (unfaithful)
```
This example illustrates a subtle and common failure. The "500 employees" claim might be factually true (the model may know this from training data), but it is unfaithful because the retrieved documents do not support it. In a RAG system, this is a problem. If the employee count changes and you update your documents, the model might still output stale training-data knowledge.
Approach 2: DeepEval Faithfulness
```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="Founded in 2019 by Jane Smith. Now 500 employees.",
    retrieval_context=[
        "Acme Corp was founded in 2019 by Jane Smith.",
        "The company is headquartered in Austin, Texas."
    ]
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score)   # 0.5
print(metric.reason)  # "1 of 2 claims unsupported by context"
```
When I use faithfulness: Any RAG system. Any system where you provide context and expect the model to use it.
These metrics are independent axes, not a hierarchy.
```
                  Factual
                     |
                     |
Faithful ------------+------------ Unfaithful
                     |
                     |
                Not Factual
```
A response can be:

- Factual and faithful: the ideal case.
- Factual but unfaithful: correct by luck, drawn from training data instead of your context.
- Faithful but not factual: the model used the context, but the context itself was wrong or stale.
- Neither factual nor faithful: the worst case, usually a prompting or model-capability problem.

Each combination points to a different root cause and a different fix. This is why measuring all three independently is essential.
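The quadrant view translates directly into a triage helper. The sketch below maps a (factuality, faithfulness) score pair to a likely root cause; the 0.8 cutoff and the diagnosis strings are illustrative defaults I chose, not standard values.

```python
def diagnose(factuality: float, faithfulness: float,
             threshold: float = 0.8) -> str:
    """Map a (factuality, faithfulness) pair to a likely root cause.
    The 0.8 cutoff and the messages are illustrative, not standard."""
    factual = factuality >= threshold
    faithful = faithfulness >= threshold
    if factual and faithful:
        return "healthy: grounded and correct"
    if factual:
        return "correct by luck: model bypassed context -- tighten grounding"
    if faithful:
        return "bad evidence: context is wrong or stale -- fix retrieval/corpus"
    return "broken: neither grounded nor correct -- revisit prompt and model"

print(diagnose(0.95, 0.40))
# correct by luck: model bypassed context -- tighten grounding
```

A dashboard that only reported an averaged "quality score" would show the same number for the "correct by luck" and "bad evidence" cases, even though one is fixed in the prompt and the other in the retrieval pipeline.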
Based on what I have seen work in production:
| Metric | Minimum Viable | Production Target | Notes |
|---|---|---|---|
| Factuality | 0.80 | 0.95+ | Below 0.80, users notice errors |
| Relevance | 0.85 | 0.95+ | Below 0.85, users feel ignored |
| Faithfulness | 0.75 | 0.90+ | Below 0.75, RAG is not adding value |
These are starting points. Calibrate to your domain. A medical Q&A system needs 0.99 factuality. A creative writing assistant can tolerate 0.70.
Take 10 test cases from the golden dataset you built in Lesson 2. Run each one through your system. Score every response on all three metrics. Record the results in a table like this:
```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    case_id: str
    input: str
    factuality: float
    relevance: float
    faithfulness: float
    notes: str

results: list[MetricResult] = []

# After scoring all 10 cases:
for r in results:
    print(
        f"{r.case_id}: F={r.factuality:.2f} "
        f"R={r.relevance:.2f} Fa={r.faithfulness:.2f} "
        f"| {r.notes}"
    )

# Compute your baselines
avg_factuality = sum(r.factuality for r in results) / len(results)
avg_relevance = sum(r.relevance for r in results) / len(results)
avg_faithfulness = sum(r.faithfulness for r in results) / len(results)

print(f"\nBaselines: F={avg_factuality:.2f} "
      f"R={avg_relevance:.2f} Fa={avg_faithfulness:.2f}")

# Compare against thresholds
thresholds = {"factuality": 0.80, "relevance": 0.85, "faithfulness": 0.75}
for name, baseline in [
    ("factuality", avg_factuality),
    ("relevance", avg_relevance),
    ("faithfulness", avg_faithfulness),
]:
    status = "PASS" if baseline >= thresholds[name] else "FAIL"
    print(f"  {name}: {baseline:.2f} vs {thresholds[name]} -> {status}")
```
Write down these baselines. They are the numbers your regression tests will protect in the next lesson.
You have metrics and baselines. But right now you are running them manually. Next, we automate these evaluations into your CI/CD pipeline so they run on every change, catching regressions before they reach users.