Workflow 1: Evaluating a New Feature
Time: 30 minutes | Goal: Define quality for a new AI feature
Steps
1. Create Project (2 min)
- Name: “Feature Name - Evaluation”
- Write initial system prompt based on requirements
2. Add Scenarios (5 min)
- 15-20 real examples from product spec
- Include edge cases and boundary conditions (see the sketch after these steps)
3. Generate & Rate (15 min)
- Generate outputs
- Rate all outputs (5 min)
- Add feedback on low ratings
4. Extract Patterns (2 min)
- Run extraction
- Review quality patterns
5. Document Findings (5 min)
- Export golden examples
- Share with engineering team
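Scenario sets like the one in step 2 can be drafted as plain data before they go into the tool. A minimal sketch for a hypothetical support-bot feature; the field names are illustrative only, not the tool's import schema:

```python
# Hypothetical scenario set for a support-bot evaluation.
# Field names are illustrative only, not the tool's import schema.
scenarios = [
    # Happy path: typical requests lifted from the product spec
    {"input": "How do I reset my password?", "type": "happy_path"},
    {"input": "Where can I download last month's invoice?", "type": "happy_path"},
    # Edge cases: vague, hostile, or ambiguous requests
    {"input": "it doesnt work", "type": "edge_case"},
    {"input": "Cancel everything and refund me NOW.", "type": "edge_case"},
    # Boundary conditions: the limits of what the feature should handle
    {"input": "Summarize all 400 of my open tickets.", "type": "boundary"},
]
```

Aiming for 15-20 entries with a deliberate mix of types keeps the later pattern extraction meaningful.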
Output
Clear quality definition ready for implementation.
Workflow 2: Iterating on Existing Prompt
Time: 20 minutes per iteration | Goal: Improve an underperforming feature
Steps
1. Rate Current Outputs (5 min)
- Open existing project
- Rate all outputs if not already done
2. Extract Patterns (2 min)
- Identify failure clusters
- Review root causes
3. Apply Suggested Fix (3 min)
- Click “Apply Fix & Retest”
- Review prompt changes
- Confirm update
4. Rate New Outputs (5 min)
- Rate regenerated scenarios
- Check if quality improved
5. Iterate or Ship (5 min)
- If >90% success: Export and ship
- If <90%: Repeat steps 2-5
Expected Result
Success rate typically improves 10-20% per iteration.
Workflow 3: Collaborative Evaluation
Time: Ongoing | Goal: Team alignment on quality standards
Setup
- Create project (PM)
- Add scenarios (PM or team)
- Generate outputs (PM)
- Share project with team (PM)
- Each team member rates independently (all)
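Once everyone has rated, comparing scores per scenario shows where standards diverge. A minimal sketch, assuming each rater's scores can be collected as scenario-to-rating maps on a 1-5 scale (the data shape is an assumption, not the tool's export format):

```python
# Flag scenarios where team members' ratings diverge.
# The 1-5 scale and the data shape are assumptions, not the tool's format.
ratings = {
    "pm":     {"reset_password": 5, "refund_request": 2, "vague_bug_report": 4},
    "design": {"reset_password": 5, "refund_request": 4, "vague_bug_report": 2},
    "eng":    {"reset_password": 4, "refund_request": 2, "vague_bug_report": 2},
}

for scenario in next(iter(ratings.values())):
    scores = [scored[scenario] for scored in ratings.values()]
    if max(scores) - min(scores) >= 2:  # wide spread: likely a subjective standard
        print(f"Discuss {scenario!r}: ratings {scores}")
```

Scenarios everyone scores the same point to objective failures; wide spreads are the subjective calls to align on.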
Why It Works
- Multiple perspectives on quality
- Discover subjective vs. objective failures
- Align team on standards
Deliverable
- Consensus on quality definition
- Points of disagreement documented
- Clear behavioral spec
Workflow 4: Comparing Two Models
Time: 30 minutes | Goal: Decide between GPT-4 and Claude
Steps
1. Create Two Projects
- Project A: GPT-4
- Project B: Claude
- Same system prompt and scenarios
2. Generate Outputs (3 min)
- Run Generation on both
3. Rate Independently (15 min)
- Rate Project A outputs
- Rate Project B outputs
4. Compare Results (5 min)
- Check success rates
- Read actual outputs
- Assess quality differences
5. Document Decision (7 min)
- Model choice
- Success rate difference
- Quality rationale
Quick Decision Matrix (example)
| Metric | GPT-4 | Claude |
|---|---|---|
| Quality (%) | 92% | 88% |
| Speed | Slower | Faster |
With 15-20 scenarios per project, a four-point quality gap is roughly one differently rated scenario, so weigh the actual outputs as heavily as the percentages.
Workflow 5: Export for CI/CD Integration
Time: 10 minutes | Goal: Get test suite into engineering workflow
Prerequisites
- Achieved >90% success rate
- Have golden examples and failure patterns
Steps
- Go to Insights
- Click “Export”
- Choose “Test Suite (pytest)”
- Download JSON
- Share with engineering
Engineering Integration
Engineers can now run the exported cases as part of their test suite.
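A minimal sketch of what that could look like, assuming the exported JSON is saved as `exported_test_suite.json` and exposes a list of cases; the file name, JSON shape, and `generate()` helper are placeholders to adapt to the actual export, not the tool's documented format:

```python
# Hypothetical pytest suite replaying exported evaluation cases.
# File name, JSON shape, and generate() are placeholders, not the tool's format.
import json
import pytest

with open("exported_test_suite.json") as f:
    CASES = json.load(f)["cases"]

def generate(system_prompt: str, user_input: str) -> str:
    """Replace with a call to your production model."""
    raise NotImplementedError

@pytest.mark.parametrize("case", CASES)
def test_matches_golden_example(case):
    output = generate(case["system_prompt"], case["input"])
    # Minimal check: required phrases from the golden example must appear.
    # Real suites often use semantic similarity or rubric scoring instead.
    for phrase in case["required_phrases"]:
        assert phrase in output, f"missing: {phrase!r}"
```

Wired into CI, a failing case flags a prompt regression before it ships.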
Workflow 6: Testing Multiple Variations
Time: 40 minutes | Goal: A/B test different prompt versions
Steps
1. Create Base Project (5 min)
- Name: “Support Bot - Base”
- Initial system prompt
- Add 15 scenarios
2. Generate & Rate Base (15 min)
- Generate outputs
- Rate all outputs
- Note the success rate (e.g., 70%)
3. Create Variation Project (5 min)
- Clone scenarios
- Modify system prompt (e.g., “be more casual”)
- Generate
4. Rate Variation (10 min)
- Rate new outputs
- Compare success rates (see the sketch below)
5. Choose Winner (5 min)
- Which variation performed better?
- Update production prompt
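To pick the winner, compute each project's success rate from its ratings and compare. A minimal example assuming a 1-5 rating scale where 4 and above counts as success (both the scale and the threshold are assumptions to match to your rubric):

```python
# Compare two prompt variations by success rate.
# The 1-5 scale and the >=4 success threshold are assumptions.
def success_rate(ratings: list[int], threshold: int = 4) -> float:
    return sum(r >= threshold for r in ratings) / len(ratings)

base      = [5, 4, 2, 4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5, 4]  # "Support Bot - Base"
variation = [5, 5, 4, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4, 5, 4]  # casual-tone variation

rate_base, rate_var = success_rate(base), success_rate(variation)
print(f"Base: {rate_base:.0%}  Variation: {rate_var:.0%}")
print("Winner:", "variation" if rate_var > rate_base else "base")
```

Here the base lands near the 70% from step 2 and the variation clears 90%, so the variation would ship.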
Example Variations
- Version A: Formal tone
- Version B: Casual tone
- Version C: Balanced
Quick Reference: Common Scenarios by Role
Product Manager
- Daily: Rate outputs as they’re generated
- Weekly: Run extraction to find patterns
- Monthly: Export insights to engineering
Design Lead
- Discovery: Define tone and personality
- Evaluation: Rate based on brand alignment
- Feedback: Add feedback explaining brand misalignment
Engineering Lead
- Review: Examine extracted patterns
- Implement: Build bot using specifications
- Test: Run exported test suite in CI/CD
Customer Support Lead
- Input: Provide real support questions
- Rating: Rate responses from customer perspective
- Feedback: Explain what customers expect
Tips for Smooth Workflows
Naming Convention: Use a consistent project naming scheme (e.g., “Support Bot - Base” and “Support Bot - Casual” for a variation)
Troubleshooting Common Workflow Issues
Issue: Patterns Not Found
Cause: Fewer than 15 rated outputs, or all ratings are high
Fix: Add more scenarios and rate them; extraction needs a minimum of 15-30 ratings.
Issue: Team Has Different Standards
Cause: Subjective quality definitions
Fix:
- Compare ratings
- Discuss disagreements
- Document final standard
- Re-rate together if needed
Issue: Iterations Not Improving Quality
Cause: Root cause not properly addressed
Fix:
- Review failure cluster reasoning
- Make larger prompt changes
- Add more specific instructions