Understand how Sageloop’s AI learns from your ratings to find patterns and suggest improvements.
Pattern extraction is Sageloop’s AI-powered analysis that discovers quality criteria from your ratings.
How It Works:
- You rate 15-30 outputs (1-5 stars + feedback)
- You click “Run Pattern Extraction”
- Sageloop’s AI analyzes ratings and clusters failures by root cause
- You get actionable insights and suggested prompt fixes
Why It’s Powerful: You don’t need to articulate quality criteria upfront. The AI learns from your examples.
Minimum: 15 rated outputs
Ideal: 20-30 rated outputs
Sageloop Warning: If you have fewer than 15 outputs, extraction will show low confidence
Sample Size Matters
Too few ratings = noise, not signal.
| Ratings | Confidence | Action |
|---|---|---|
| Fewer than 10 | Very Low | Wait and rate more |
| 10-14 | Low | Rate more for reliability |
| 15-29 | Medium-High | Good for extraction |
| 30+ | High | Very reliable patterns |
Step 1: You Rate Outputs
Sageloop Sees:
- 12 outputs rated 5 stars
- 4 outputs rated 4 stars
- 2 outputs rated 3 stars
- 4 outputs rated 1-2 stars
- Your feedback: “Missing timeline”, “Too formal”, “No apology”
Step 2: AI Analysis
Sageloop sends all low-rated outputs (≤2 stars) + your feedback to an LLM with this prompt:
“A PM rated these AI outputs poorly. Find common root causes and suggest specific fixes to the system prompt.”
The AI:
- Groups failures by root cause (not just symptoms)
- Identifies concrete patterns
- Suggests copy-pasteable prompt improvements
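Sageloop’s internals aren’t public, but conceptually this analysis step looks like the sketch below, which uses the OpenAI Python client as a stand-in LLM. The `ratings` shape, the client setup, and the model name are all illustrative assumptions, not Sageloop’s actual code:

```python
# Conceptual sketch of the analysis step -- not Sageloop's actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical shape for rated outputs: stars plus optional feedback.
ratings = [
    {"scenario": "Where is my refund?", "stars": 2,
     "feedback": "Missing timeline", "output": "Your refund is coming soon."},
    {"scenario": "I want a refund", "stars": 5,
     "feedback": "", "output": "I sincerely apologize..."},
    # ...15-30 rated outputs in total
]

# Only low-rated outputs (<= 2 stars) and their feedback are analyzed.
failures = [r for r in ratings if r["stars"] <= 2]

prompt = (
    "A PM rated these AI outputs poorly. Find common root causes and "
    "suggest specific fixes to the system prompt.\n\n"
    + "\n\n".join(
        f"Scenario: {r['scenario']}\nFeedback: {r['feedback']}\nOutput: {r['output']}"
        for r in failures
    )
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # clusters and suggested fixes, as text
```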
Step 3: Insights & Recommendations
You get:
- Failure clusters - Groups of low-rated outputs with same root cause
- Quality patterns - Characteristics of 5-star outputs
- Success metrics - Your current success rate and distribution
- Suggested fixes - Copy-pasteable prompt improvements
Understanding Your Insights
Failure Clusters
Example:
Cluster 1: Missing Specific Timeline (4 outputs, High Confidence)
Pattern: All low-rated outputs use vague language like “soon” or “shortly” instead of a specific timeframe
Root Cause: System prompt doesn’t specify the 5-7 business day refund timeline
Suggested Fix:
Add to your system prompt:
"Always mention that refunds take 5-7 business days to process."
Affected Scenarios:
- Scenario #2: “Where is my refund?”
- Scenario #5: “How long does a refund take?”
- Scenario #9: “When will I get my money back?”
- Scenario #14: “I’m waiting for my refund”
Severity: High (4 out of 22 outputs failed)
✅ Concrete: Not “be more specific”—actual words to add
✅ Root Cause: Identifies WHY failures happened
✅ Actionable: Copy-paste fix available
✅ Evidence: Shows which scenarios failed
✅ Severity: Tells you impact (4/22 = 18%)
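If you export or post-process insights yourself, a cluster maps naturally onto a small structured record. A sketch of that shape (the field names are assumptions, not Sageloop’s actual export schema):

```python
# Illustrative structure for one failure cluster.
# Field names are assumptions, not Sageloop's export schema.
cluster = {
    "name": "Missing Specific Timeline",
    "confidence": "High",
    "pattern": "Vague language ('soon', 'shortly') instead of a specific timeframe",
    "root_cause": "System prompt doesn't specify the 5-7 business day refund timeline",
    "suggested_fix": "Always mention that refunds take 5-7 business days to process.",
    "affected_scenarios": [2, 5, 9, 14],
    "severity": {"failed": 4, "total": 22},  # 4/22 = ~18%
}
```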
Quality Patterns (5-Star Outputs)
Example:
5-Star Outputs: Common Characteristics (12 outputs)
Length: 150-250 words
Structure: 1. Apology → 2. Explanation → 3. Timeline → 4. Next steps
Tone: Empathetic but professional (“I sincerely apologize…” not “Sorry.”)
Key Elements Always Present:
- Starts with an apology or empathetic statement
- Mentions “5-7 business days” for the timeline
- Includes a link or direction to the account dashboard
- Ends with “Let me know if you need anything else!”
Example:
“I sincerely apologize for the delay with your refund. Refunds take 5-7 business days to process from the date of your return. You can track the status in your account dashboard here. Let me know if you have any questions!”
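These “always present” elements are concrete enough to check mechanically. A minimal sketch of such checks (the phrasings tested are assumptions drawn from this example, not a Sageloop feature):

```python
import re

# Quick checks for the four "always present" elements above.
# The phrasings are assumptions based on this example, not a Sageloop feature.
def check_output(text: str) -> dict:
    return {
        "apology": bool(re.search(r"\b(apologi[sz]e|sorry)\b", text, re.I)),
        "timeline": "5-7 business days" in text,
        "dashboard": "dashboard" in text.lower(),
        "closing": "let me know" in text.lower(),
    }

sample = (
    "I sincerely apologize for the delay with your refund. Refunds take "
    "5-7 business days to process. You can track the status in your account "
    "dashboard. Let me know if you have any questions!"
)
print(check_output(sample))  # all four checks pass for this 5-star example
```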
Success Metrics
Current Success Rate: 73%
Definition: Outputs rated 4-5 stars
Rating Distribution:
- ⭐⭐⭐⭐⭐ (5 stars): 12 outputs (55%)
- ⭐⭐⭐⭐ (4 stars): 4 outputs (18%)
- ⭐⭐⭐ (3 stars): 2 outputs (9%)
- ⭐⭐ (2 stars): 3 outputs (14%)
- ⭐ (1 star): 1 output (5%)
Goal: Improve 1-2 star outputs to 4-5 stars
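The success rate is simply the share of outputs rated 4-5 stars. For instance, recomputing the numbers above in plain Python (nothing Sageloop-specific):

```python
from collections import Counter

# Star ratings for the 22 outputs above: twelve 5s, four 4s, two 3s, three 2s, one 1.
stars = [5] * 12 + [4] * 4 + [3] * 2 + [2] * 3 + [1]

distribution = Counter(stars)
success_rate = sum(1 for s in stars if s >= 4) / len(stars)

print(distribution)           # Counter({5: 12, 4: 4, 2: 3, 3: 2, 1: 1})
print(f"{success_rate:.0%}")  # 73% (16 of 22 outputs rated 4-5 stars)
```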
How to Use Insights
Step 1: Review Failure Clusters
Focus on High Severity First: Fix clusters affecting the most outputs (4+ failures).
Ask Yourself:
- Does the root cause make sense?
- Is the suggested fix accurate?
- Are there other failures I noticed?
If a fix seems wrong, don’t apply it. The AI can make mistakes. Review the extraction rationale and trust your judgment.
Step 2: Apply Suggested Fixes
Option A: Manual Copy-Paste
- Click “Copy Fix”
- Go to Project Settings
- Paste into system prompt
- Save
Option B: One-Click Apply & Retest (Recommended)
- Click “Apply Fix & Retest”
- Review before/after prompt diff
- Edit if needed
- Click “Update & Retest”
- Only failed scenarios regenerate (saves time)
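Under the hood, Option B amounts to appending the fix to your system prompt and regenerating only the failed scenarios. A rough, runnable sketch of that logic (all names here are illustrative, not Sageloop’s API):

```python
# Rough sketch of "Apply Fix & Retest" -- names are illustrative, not Sageloop's API.

def generate(prompt: str, scenario: str) -> str:
    """Stand-in for the LLM call that produces an output for one scenario."""
    return f"[new output for: {scenario}]"

def apply_fix_and_retest(system_prompt: str, fix: str, ratings: list[dict]) -> dict:
    updated_prompt = system_prompt.rstrip() + "\n" + fix  # append the suggested fix

    # Only failed scenarios (<= 2 stars) regenerate; passing outputs are kept.
    failed = [r["scenario"] for r in ratings if r["stars"] <= 2]
    new_outputs = {s: generate(updated_prompt, s) for s in failed}

    return {"prompt": updated_prompt, "regenerated": new_outputs}

ratings = [
    {"scenario": "Where is my refund?", "stars": 2},
    {"scenario": "I want a refund", "stars": 5},
]
result = apply_fix_and_retest(
    "You are a support agent.",
    "Always mention that refunds take 5-7 business days to process.",
    ratings,
)
print(result["regenerated"])  # only "Where is my refund?" was regenerated
```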
Step 3: Retest & Validate
After applying fix:
- New outputs generated for failed scenarios
- Rate the new outputs
- Check if success rate improved
Example:
- Before: 4 failures → 73% success rate
- After: 1 failure → 86% success rate
- Result: “3/4 outputs now pass ✅”
If Still Failing: You’ve found a deeper issue. The prompt needs more fundamental changes.
Step 4: Iterate
Typical Flow:
- Extract patterns (find 4 failures)
- Apply fix (improve 3, 1 still fails)
- Extract again (deeper root cause emerges)
- Apply second fix (last failure resolved)
- Success rate: 95%+
Expect 2-4 Iterations: Complex behavior takes multiple rounds.
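As a toy sketch of this loop, assume each extraction round yields one fix that resolves a small cluster of failures (all names and numbers below are illustrative):

```python
# The extract -> apply -> retest loop as a runnable toy sketch.
# All functions are stand-ins for Sageloop UI actions, not an API.

def success_rate(stars: list[int]) -> float:
    return sum(s >= 4 for s in stars) / len(stars)

def apply_cluster_fix(stars: list[int], cluster_size: int = 3) -> list[int]:
    """Stand-in: one prompt fix resolves a cluster of low-rated outputs."""
    fixed = stars.copy()
    remaining = cluster_size
    for i, s in enumerate(fixed):
        if s <= 3 and remaining > 0:
            fixed[i] = 5
            remaining -= 1
    return fixed

stars = [5] * 12 + [4] * 4 + [3] * 2 + [2] * 3 + [1]
rounds = 0
while success_rate(stars) < 0.95 and rounds < 10:
    stars = apply_cluster_fix(stars)
    rounds += 1

print(rounds, f"{success_rate(stars):.0%}")  # 2 rounds here; expect 2-4 in practice
```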
Confidence Scores
Sageloop shows confidence scores for each pattern.
High Confidence (90%+)
Pattern appears in at least 90% of samples (e.g., 14+ out of 15).
Example: “100% of 5-star outputs include an apology”
Action: Trust this pattern. It’s statistically robust.
Medium Confidence (70-89%)
Pattern appears in 70-89% of samples (e.g., 11-13 out of 15).
Example: “9 out of 12 (75%) of 5-star outputs use bullet points”
Action: Likely meaningful, but consider adding more scenarios to confirm.
Low Confidence (below 70%)
Pattern appears in fewer than 70% of samples (e.g., 10 or fewer out of 15).
Example: “2 out of 15 outputs mention a phone number”
Action: May be noise, not signal. Add more scenarios or ignore.
Why Confidence Matters: Low sample sizes create false patterns. You need 15-30 rated outputs for reliable insights. If confidence is low, Sageloop will show: “⚠️ Add 5-10 more rated scenarios for more reliable patterns.”
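In other words, confidence tracks how often a pattern appears across your rated samples. A minimal sketch of that mapping (the thresholds mirror the bands above; the function is illustrative, not Sageloop’s scoring code):

```python
def confidence_band(matches: int, samples: int) -> str:
    """Map a pattern's frequency to the confidence bands above (illustrative)."""
    freq = matches / samples
    if freq >= 0.90:
        return "High"
    if freq >= 0.70:
        return "Medium"
    return "Low"

print(confidence_band(12, 12))  # High   (100% of 5-star outputs include an apology)
print(confidence_band(9, 12))   # Medium (75% use bullet points)
print(confidence_band(2, 15))   # Low    (13% mention a phone number)
```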
Common Extraction Results
Result 1: Clear Failure Cluster
Scenario: 6 outputs fail for the same reason.
Extraction:
Cluster: Missing Empathy (6 outputs)
Fix: Add “Always start responses with an apology or empathetic acknowledgment.”
Action: Apply fix confidently—pattern is strong.
Result 2: Multiple Small Clusters
Scenario: 2 failures for reason A, 2 for reason B, 1 for reason C.
Extraction:
Cluster 1: Missing timeline (2 outputs)
Cluster 2: Too formal (2 outputs)
Cluster 3: Incorrect policy (1 output)
Action: Fix Clusters 1 and 2 first (higher frequency). Cluster 3 may be an edge case.
Result 3: No Clear Clusters
Scenario: Every failure is unique.
Extraction:
“No strong failure patterns detected. Failures appear unrelated.”
Action:
- Review individual failures manually
- Your ratings may be inconsistent (re-check)
- Prompt may need fundamental rework, not incremental fix
Result 4: High Variance in 5-Star Outputs
Scenario: 5-star outputs are very different from each other.
Extraction:
Pattern: 5-star outputs range from 50 to 400 words and use different tones and structures.
Insight: You may value flexibility—quality isn’t one-size-fits-all.
Action: This is fine! Not all use cases need rigid criteria.
Troubleshooting
Problem: Patterns Don’t Match My Intuition
Example: Extraction says “5-star outputs are 200-300 words” but you rated some 100-word outputs as 5-star.
Possible Causes:
- Inconsistent ratings: You rated similar outputs differently
- Small sample size: Not enough data for robust pattern
- Edge case: the 100-word output was an exception (short but perfect)
Fix:
- Review ratings for consistency
- Add more scenarios
- Manually adjust extracted criteria if needed
Problem: No Meaningful Patterns Found
Example: “No strong patterns detected.”
Possible Causes:
- Too few ratings: Need 15-30 minimum
- All high ratings: If everything is 4-5 stars, no failures to cluster
- Random failures: Each failure is unique
Fix:
- Add more scenarios and rate them
- If quality is truly high, extraction may not be needed
- Review individual failures manually
Problem: Suggested Fix Seems Wrong
Example: Fix says “Use bullet points” but you don’t want bullet points.
Possible Causes:
- Coincidence: Some 5-star outputs had bullets, but it’s not causal
- AI hallucination: Extraction made incorrect inference
Fix:
- Don’t blindly apply suggested fixes
- Review extraction rationale
- Manually edit fix before applying
Problem: Low Confidence Warnings
Example: “⚠️ Confidence: 60% (low sample size)”
Cause: Only 10 outputs rated.
Fix: Add 5-10 more scenarios and rate them. Re-run extraction.
Best Practices
✅ Rate 15-30 outputs before extraction (minimum sample size)
✅ Provide feedback on 1-2 star ratings (helps clustering)
✅ Review suggested fixes before applying (don’t trust blindly)
✅ Expect 2-4 iterations to reach high quality
✅ Check confidence scores (ignore low-confidence patterns)
❌ Don’t extract with fewer than 10 ratings (too small sample)
❌ Don’t apply fixes without reviewing (AI can hallucinate)
❌ Don’t expect perfection in one pass (iteration is normal)
❌ Don’t trust low-confidence patterns (need more data)
Extraction History
Every extraction is saved with a version number.
Why: Track how your understanding of quality evolves.
Example:
- Version 1 (Jan 10): “Outputs should be concise (100-150 words)”
- Version 2 (Jan 15): “Outputs should be detailed (200-300 words)”
(You changed your mind after user feedback)
View History: Go to Insights → History to see all past extractions.
Compare Versions: See how criteria changed over time.
Next Steps