The Core Insight: Inductive Quality Definition

Sageloop takes a different approach from traditional evaluation tools.

Traditional Approach (Deductive)

  1. Define quality criteria upfront (“responses must be 100-200 words”)
  2. Test AI outputs against criteria
  3. Pass/fail based on rules
Problem: PMs can’t articulate quality upfront. You know good output when you see it, but describing it is hard.

Sageloop Approach (Inductive)

  1. Generate examples of AI behavior
  2. Rate examples based on your intuition
  3. Discover patterns in your ratings
  4. Improve your prompt based on discovered patterns
Why it works better:
  • You don’t need to know criteria upfront
  • You can recognize patterns you could never have described upfront
  • Real outputs reveal edge cases you’d never think of
  • Faster: Rate 30 outputs in 5 minutes vs. 2 hours debating criteria

Real Example

A support bot PM at a startup tested 20 refund scenarios.
Traditional approach:
  1. PM spends 2 hours defining criteria: “Responses should be friendly, include timeline, apologize”
  2. Launches bot. Users complain: “Why didn’t it mention the return shipping fee?”
  3. Missed edge case. Rework prompt. Relaunch.
Sageloop approach:
  1. PM rates 20 real refund questions in 5 minutes
  2. Pattern extraction finds: “All 5-star outputs mention the return shipping fee”
  3. PM adds to prompt: “Always mention return shipping details”
  4. Retests. Quality jumps to 95%.
Winner: Sageloop (faster, catches edge cases, clearer patterns)

The Evaluation Loop

Scenarios → Outputs → Ratings → Insights → Fixes → Retest → Iterate

1. Scenarios (Inputs)

Test inputs representing real user requests. Good scenario:
  • Specific and realistic
  • Represents actual use case
  • Clear user intent
Example: “Where is my refund?”

2. Outputs (AI Responses)

AI-generated responses using your system prompt.
Sageloop’s Advantage: Generates all outputs at once so you can compare them visually.
Visual Pattern Recognition:
  • Response 1: “Refunds take 5-7 business days”
  • Response 2: “Refunds arrive shortly”
  • Response 3: “Timeline: 5-7 business days”
  • Response 4: “Refunds processed soon”
You see it immediately: Half mention specific timeline, half say “soon”

3. Ratings (Your Judgment)

Your 1-5 star ratings teach Sageloop what quality means to you. Rating Principle: Trust your gut.
  • 5 stars: Perfect
  • 4 stars: Good
  • 3 stars: Okay
  • 2 stars: Problem
  • 1 star: Unacceptable
For 1-2 star outputs, add feedback explaining WHY.

4. Insights (Pattern Extraction)

Sageloop’s AI clusters your low-rated outputs by root cause.

Insights are concrete, not generic:
  • Generic (unhelpful): “Outputs should be better”
  • Concrete (actionable): “4 outputs missing specific refund timeline (5-7 days)”
Example insight:
Cluster: Vague Timelines (4 outputs, high confidence)
Root Cause: System prompt doesn’t specify an exact timeline
Fix: Add “Always say exactly ‘5-7 business days’”
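To make the shape of an insight concrete, here is a minimal sketch of a cluster represented as structured data. The class name, field names, and output IDs are illustrative assumptions, not a Sageloop data model or export format:

```python
from dataclasses import dataclass

@dataclass
class InsightCluster:
    """Hypothetical structure for one extracted insight (illustrative only)."""
    name: str            # short label for the cluster
    output_ids: list     # low-rated outputs sharing the root cause
    confidence: str      # e.g. "high"
    root_cause: str      # why these outputs were rated low
    suggested_fix: str   # concrete prompt change to try

vague_timelines = InsightCluster(
    name="Vague Timelines",
    output_ids=[2, 4, 7, 9],  # made-up IDs for the four low-rated outputs
    confidence="high",
    root_cause="System prompt doesn't specify an exact refund timeline",
    suggested_fix="Add: Always say exactly '5-7 business days'",
)
print(vague_timelines.suggested_fix)
```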

5. Fixes (Iteration)

Apply suggested fixes and regenerate only failed scenarios. Example:
  • Version 1: “You are a helpful support agent”
  • Version 2: “You are a helpful support agent. Always mention ‘5-7 business days’ for refund timeline.”
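Sageloop itself requires no code, but if you want to spot-check the same prompt change by hand, a minimal sketch using the OpenAI Python client could look like this (the model name and client setup are assumptions; any chat-completion provider works the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Version 2 of the system prompt, with the concrete fix applied
SYSTEM_PROMPT_V2 = (
    "You are a helpful support agent. "
    "Always mention '5-7 business days' for refund timeline."
)

# Regenerate only the scenario that failed under version 1
failed_scenario = "Where is my refund?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat model works for this check
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_V2},
        {"role": "user", "content": failed_scenario},
    ],
)
print(response.choices[0].message.content)
```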

6. Retest (Validation)

Rate new outputs. Did quality improve? Example:
  • Before: 4 failures → 73% success rate
  • After fix: 1 failure → 91% success rate
  • Result: “3/4 outputs now pass” ✅

Key Terminology

  • Project: A workspace for evaluating a specific AI behavior (e.g., “Customer Support Bot v1”)
  • Scenario: A single user input you want to test (e.g., “Where is my refund?”)
  • Output: The AI-generated response to a scenario
  • Rating: Your 1-5 star quality judgment + optional feedback
  • Pattern extraction: The AI-powered analysis that finds patterns in your ratings
  • Cluster: A group of low-rated outputs with the same root cause
  • Golden example: A 5-star rated output representing ideal AI behavior
  • Success rate: % of outputs rated 4-5 stars (your quality benchmark)
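Since the success rate above is just the share of outputs rated 4-5 stars, here is a quick sketch of the arithmetic (the ratings list is made up):

```python
ratings = [5, 4, 3, 5, 2, 4, 5, 4, 1, 5]  # hypothetical 1-5 star ratings for 10 outputs

# Success rate = share of outputs rated 4 or 5 stars
success_rate = sum(r >= 4 for r in ratings) / len(ratings)
print(f"Success rate: {success_rate:.0%}")  # -> Success rate: 70%
```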

Why Batch Evaluation Matters

Single Output (ChatGPT Style)

  • ❌ Test one output at a time
  • ❌ Hard to remember previous tests
  • ❌ Patterns hidden

Batch Evaluation (Sageloop)

  • ✅ See 20-30 outputs at once
  • ✅ Compare visually
  • ✅ Patterns obvious

The Difference

Testing date parsing one output at a time:
  • Test 1: “2024-01-15” ✅
  • Test 2: “2024-01-22” ✅
  • Test 3: (wait, what did test 1 do?)
  • Test 10: “2022” ✅ (Wait, this is wrong!)
  • You miss the pattern: some outputs use 2022 instead of 2024

When to Use Sageloop

Perfect For

Discovery

Figuring out what “good” AI behavior means

Prompt Iteration

Testing different prompt versions

Quality Spec Creation

Defining standards for engineering

Edge Case Discovery

Finding failure modes

Not Ideal For

  • Production monitoring (use PromptLayer, LangSmith)
  • Automated testing in CI (export to pytest instead)
  • Fine-tuning models (use training platforms)
  • A/B testing in production (use experimentation platforms)

Perfect Workflow

Sageloop (discovery) → Export golden examples → CI/CD testing (production)
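As a rough illustration of the “export to pytest” hand-off, here is a hedged sketch of a CI regression test over exported golden examples. The file name, JSON shape, and generate_response helper are assumptions for the sketch, not a documented Sageloop export format:

```python
import json

import pytest

from my_bot import generate_response  # hypothetical wrapper around your AI system prompt

# Assumed export shape: a JSON list of {"scenario": ..., "must_mention": [...]} records
with open("golden_examples.json") as f:
    GOLDEN_EXAMPLES = json.load(f)

@pytest.mark.parametrize("example", GOLDEN_EXAMPLES, ids=lambda e: e["scenario"][:40])
def test_matches_golden_example(example):
    output = generate_response(example["scenario"])
    # Require each phrase the golden example established as essential
    for phrase in example["must_mention"]:
        assert phrase in output, f"Missing required phrase: {phrase!r}"
```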

The Methodology in Numbers

How Many Scenarios?

Sweet Spot: 15-30 scenarios
  • Too few (fewer than 10): Patterns won’t be statistically meaningful
  • Too many (more than 50): Rating takes forever; diminishing returns
  • 15-30: Fast to rate, enough for reliable patterns

Success Rate Target

Start with whatever you have. Sageloop will help you improve. Typical Journey:
  1. Iteration 1: 65% success rate
  2. Iteration 2: 80% success rate
  3. Iteration 3: 90% success rate
  4. Iteration 4: 95%+ success rate

Expected Timeline

| Activity | Time | Notes |
| --- | --- | --- |
| Create project | 2 min | One-time |
| Add 20 scenarios | 3 min | Bulk paste |
| Generate outputs | 2 min | AI does the work |
| Rate outputs | 5 min | Keyboard shortcuts |
| Extract patterns | 2 min | AI does the work |
| Total to first insights | 14 min | Very fast |
| Apply fix + retest | 5 min | Optional iteration |

Comparison Table

| Aspect | Traditional | Sageloop |
| --- | --- | --- |
| Define criteria | Upfront (hard) | From examples (natural) |
| Test method | One at a time | Batch visual |
| Pattern discovery | Manual | AI-powered |
| Iteration | Guess and check | Targeted fixes |
| PM-friendly | No | Yes |
| Code required | Maybe | No |
| Time to value | Weeks | 15 minutes |

Core Principles

  1. PMs Know Quality When They See It: Even if they can’t articulate why.
  2. Patterns Emerge from Examples: More reliable than pre-defined criteria.
  3. Batch > Individual: Comparing 20 outputs reveals patterns; testing one hides them.
  4. Iteration Works: Small prompt changes → test → validate → repeat.
  5. Actionability Matters: Generic fixes (“be better”) don’t work. Concrete fixes do (“always say 5-7 days”).

Next Steps