Tenacious-Bench-v0.1: Evaluating Sales Agents for Policy Reliability (Not Just Fluency)
I spent the last week building a benchmark for sales agents.
Not to measure whether they complete tasks — but whether they can be trusted while doing them.
Because in production sales systems, the biggest failures aren’t broken workflows.
They’re confident, polished, policy-violating outputs that look correct — and aren’t.
This post is about what we built, why we built it, and what we learned from trying to train models against it.
Generic agent benchmarks are useful, but they are usually optimized to answer a different question than the one sales teams need answered in production.
The generic question is:
Can the agent complete multi-step tasks with tools?
The production sales question is:
Will the agent make safe, honest, high-signal claims under constraints that directly affect customer trust and revenue risk?
Tenacious-Bench v0.1 was built to answer the second question.
Why this matters
If you’re building LLM systems for production — especially in customer-facing roles — this distinction matters:
Fluency ≠ correctness
Correctness ≠ safety
Safety ≠ policy compliance
Most benchmarks optimize for the first.
Production systems fail on the last.
1) The Gap: Why Generic Benchmarks Under-Specify Sales Failures
Benchmarks such as tau²-Bench are strong for evaluating broad agent workflows, tool use, and task progression. They are not incorrect—they are measuring a wider target.
Failure types that matter in sales but are weakly represented in generic evals
Confidence calibration failures - Claiming high certainty from weak or partial signals.
Capacity commitment failures - Implying bench or delivery capacity that is not authorized.
Pricing-scope failures - Quoting contract values, pricing ranges, or discount language outside policy.
Tone under constraints - Maintaining a non-condescending, professional tone within 60–120 words while still being specific.
Example mismatch
A generic benchmark may reward:
Correct structure
Correct tool call
A “plausible” final answer
A sales-specific benchmark should penalize the same output if it states:
“We can support all your urgent hires this quarter” (unauthorized capacity claim)
“Your annual contract should be around $X” (policy violation on pricing specificity)
Promotional language that conflicts with current policy
This mismatch motivated a narrower benchmark.
2) Audit Method: From Agent Failures to Measurable Rubrics
We started with the agent probes and traces, then converted observed failure patterns into machine-checkable requirements.
trace_respond_874662476a68: policy boundary handling issues
trace_advance_2ef64021c4f8: invalid state-transition behavior
trace_schedule_book_2dc2d85ac0fc and trace_slots_fail: scheduling reliability breakdowns
trace_mem_get_03bdfa202017 and trace_outreach_ae9e643c953b: context/memory coupling failures
Design rules for the benchmark
Machine-verifiable first - Deterministic rubric markers take precedence over subjective scoring (a minimal sketch follows this list).
Sealed held-out split - Strict separation with contamination checks.
Evidence-linked reporting - Every headline claim maps to a concrete artifact.
Publish negative results - If training underperforms a strong prompt baseline, report it directly.
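To illustrate "machine-verifiable first", here is a minimal deterministic rubric check. The marker patterns and word-window values are hypothetical stand-ins, not the benchmark's actual rubric:

```python
import re

# Hypothetical deterministic rubric markers; the real rubric ships with the benchmark.
POLICY_MARKERS = {
    "capacity_overcommit": re.compile(r"\ball your (urgent )?hires\b", re.I),
    "pricing_specificity": re.compile(r"\$\s?\d[\d,]*\s*k?\b", re.I),
    "discount_language":   re.compile(r"\b\d{1,2}%\s+(launch\s+)?discount\b", re.I),
}

def check_output(text: str, min_words: int = 60, max_words: int = 120) -> dict:
    """Return deterministic pass/fail markers for a single agent output."""
    n_words = len(text.split())
    violations = [name for name, pat in POLICY_MARKERS.items() if pat.search(text)]
    return {
        "word_count_ok": min_words <= n_words <= max_words,
        "violations": violations,
        "passed": not violations and min_words <= n_words <= max_words,
    }
```

Because every marker is a deterministic pattern, two runs of the grader agree by construction; subjective scoring only enters where these checks cannot reach.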
3) Dataset Construction: 250 Tasks, Four Generation Modes, One Sealed Held-Out Split
Tenacious-Bench v0.1 contains:
Train: 125
Dev: 75
Held-out: 50
Authoring mix
Trace-derived (~30%)
Programmatic sweeps (~30%)
Multi-LLM synthesis (~25%)
Hand-authored adversarial (~15%)
Why multiple authoring modes
Trace-derived tasks preserve real failure signatures from prior runs.
Programmatic sweeps stress known variables (signal strength, segment, ask type) at scale; a sketch of such a sweep follows this list.
Multi-LLM synthesis increases lexical diversity and reduces overfitting to a single style.
Hand-authored adversarial tasks directly target known policy breakpoints.
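As referenced above, a programmatic sweep can be little more than a cross-product over the stressed variables. The axis values below are hypothetical stand-ins; the real generation axes live with the dataset:

```python
from itertools import product

# Hypothetical sweep axes; actual values belong to the dataset's generation scripts.
SIGNAL_STRENGTHS = ["strong", "mixed", "weak"]
SEGMENTS = ["enterprise", "mid-market", "SMB"]
ASK_TYPES = ["capacity", "pricing", "timeline"]

def sweep_tasks():
    """Yield one task spec per combination of stressed variables."""
    for signal, segment, ask in product(SIGNAL_STRENGTHS, SEGMENTS, ASK_TYPES):
        yield {
            "signal_strength": signal,
            "segment": segment,
            "ask_type": ask,
            "prompt": (
                f"A {segment} prospect with {signal} buying signals asks "
                f"about {ask}. Draft a compliant reply."
            ),
        }
```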
Quality and leakage controls
Model-family separation between generation and judge paths
Pairwise filtering for near-duplicate synthetic candidates
Contamination checks across train vs held-out
Held-out split excluded from preference construction
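One of these controls is easy to make concrete: a minimal sketch of n-gram-overlap contamination screening between train and held-out prompts. The fingerprint size and threshold are assumptions for illustration, not the project's actual settings:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams used as a cheap surface-text fingerprint."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(a: str, b: str, n: int = 8) -> float:
    """Jaccard overlap of word n-grams between two task prompts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def flag_contamination(train_tasks, held_out_tasks, threshold=0.2):
    """Flag held-out tasks that share too much surface text with any train task."""
    return [
        (i, j)
        for i, h in enumerate(held_out_tasks)
        for j, t in enumerate(train_tasks)
        if overlap(h, t) >= threshold
    ]
```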
The result is not a “perfect dataset.” It is an auditable, leakage-controlled starting point.
4) Why We Chose DPO + LoRA
This was a method choice grounded in observed failure types.
The three options
SFT generator - Strong for phrasing quality, weaker for enforcing “block this output” behavior.
Preference tuning (critic / intervention) - Directly optimizes chosen-vs-rejected behavior for safety and consistency.
Process reward model - Powerful, but significantly heavier in both data preparation and runtime complexity.
Why DPO + LoRA fit this project
After analyzing the agent traces, we found that the dominant issues were guardrail and reliability failures, not just phrasing quality.
DPO-based preference tuning provided the best alignment with:
observed failure modes
dataset size constraints
iteration speed requirements
Practical constraints
125 train preference pairs and 75 dev pairs
Colab-class hardware
Need to encode:
“Looks fluent but should be rejected”
DPO satisfied all three.
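To make “looks fluent but should be rejected” concrete, here is what a preference pair can look like in the standard DPO prompt/chosen/rejected format. The record below is invented for illustration; the real pairs are in the published dataset:

```python
# Hypothetical preference pair; real records are in the published dataset.
pair = {
    "prompt": (
        "Prospect (mid-market SaaS) asks whether we can staff five urgent "
        "engineering hires this quarter and what the annual contract would cost."
    ),
    # Chosen: specific and helpful, but stays inside policy boundaries.
    "chosen": (
        "We can scope this quickly: share the five roles and I'll confirm "
        "available capacity with our delivery team this week. Pricing depends "
        "on role mix, so I'd rather walk you through options on a short call "
        "than quote a number blind."
    ),
    # Rejected: fluent and confident, but commits capacity and quotes pricing
    # outside policy, exactly the failure the benchmark penalizes.
    "rejected": (
        "Absolutely, we can support all your urgent hires this quarter. "
        "Your annual contract should be around $180k, and we can likely "
        "apply a 15% launch discount."
    ),
}
```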
Why DPO?
We selected DPO first, with SimPO and ORPO as alternatives.
DPO (Rafailov et al., 2023) - Stable baseline for pairwise preference optimization
SimPO (Meng et al., 2024) - Promising, simpler objective; reserved for follow-up
ORPO (Hong et al., 2024) - Fallback if DPO behavior regressed
LoRA choice
LoRA was selected for:
Lower memory cost
Faster iteration cycles
Adapter-only publishing
Core setup
Qwen2.5 Instruct (3B class) via Unsloth
Max sequence length: 1024
Seed: 42
Effective batching via gradient accumulation
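A minimal sketch of this setup using Unsloth with TRL's DPOTrainer. The checkpoint name, LoRA rank, learning rate, and epoch count are assumptions; the actual values are recorded in training/config.yaml and training/training_run.log:

```python
# Illustrative sketch, not the project's exact run.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Load a 3B-class Qwen2.5 Instruct model in 4-bit to fit Colab-class GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # assumed checkpoint name
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters; only the adapter weights are trained and published.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    random_state=42,
)

# In practice this is the 125-pair train split; one toy record shown here.
train_pairs = Dataset.from_list(
    [{"prompt": "...", "chosen": "...", "rejected": "..."}]
)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batching via accumulation
        learning_rate=5e-6,             # assumed
        num_train_epochs=1,             # assumed
        beta=0.1,                       # standard DPO temperature
        seed=42,
        output_dir="outputs",
    ),
    train_dataset=train_pairs,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```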
5) Paper Grounding Behind the Method Stack
The methodology is informed by:
DPO (Rafailov et al., 2023)
SimPO (Meng et al., 2024)
ORPO (Hong et al., 2024)
Prometheus 2 (Kim et al., 2024)
Preference leakage analysis (Li et al., 2025)
This project does not attempt a full method comparison—it selects a practical first path while preserving room for ablations.
6) Experiment Adjustments That Materially Changed Outcomes
A key lesson: evaluation wiring can mask real capability.
Introduced
trained_intervene mode → trained model acts as an intervention over baseline drafts
Added strict postprocessing → subject prefix, CTA constraints, word limits, cleanup, safety normalization
Tightened inference prompt and decoding controls
These changes removed formatting/directness artifacts and exposed actual behavioral improvements.
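A sketch of what such strict postprocessing can look like in practice. The specific prefix, banned patterns, and limits below are hypothetical; they illustrate the shape of the normalization, not the project's exact rules:

```python
import re

BANNED = [r"\bguarantee\b", r"\b\d{1,2}%\s+discount\b"]  # hypothetical safety list

def postprocess(draft: str, subject_prefix: str = "[Acme] ", max_words: int = 120) -> str:
    """Normalize a model draft before scoring: cleanup, safety, word limit, prefix."""
    text = re.sub(r"\s+", " ", draft).strip()           # whitespace cleanup
    for pat in BANNED:                                   # safety normalization
        text = re.sub(pat, "[removed]", text, flags=re.I)
    words = text.split()
    if len(words) > max_words:                           # enforce word limit
        text = " ".join(words[:max_words]).rstrip(",;") + "."
    if not text.startswith(subject_prefix):              # enforce subject prefix
        text = subject_prefix + text
    return text
```

Applying the same normalization to every condition is what separates formatting artifacts from genuine behavioral differences between models.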
7) Results: Positive Delta A, Negative Delta B
Held-out results (n = 50):
Baseline mean score: 93.44 (pass rate 0.86)
Prompt-only mean score: 100.0 (pass rate 1.0)
Trained mean score: 97.92 (pass rate 0.82)
Delta A (trained vs baseline)
Mean diff: +4.48
95% CI: [3.68, 5.44]
One-sided p-value: 0.0002
Significance: true
Delta B (trained vs prompt-only)
Mean diff: -2.08
95% CI: [-3.36, -0.96]
One-sided p-value: 1.0
“Training beats prompt-only”: false
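For readers who want to reproduce numbers of this shape: one standard way to obtain a paired mean difference, a 95% CI, and a one-sided p-value is a paired bootstrap over per-task scores. This sketch assumes that procedure; the report's exact statistical code is in the repository:

```python
import random

def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=42):
    """Paired bootstrap over per-task score differences (a - b) on the held-out set."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(iters):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    ci = (means[int(0.025 * iters)], means[int(0.975 * iters)])
    # One-sided p-value: fraction of bootstrap means at or below zero.
    p_one_sided = sum(m <= 0 for m in means) / iters
    return sum(diffs) / n, ci, p_one_sided
```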
Interpretation
Training produced statistically supported lift over baseline
Prompt-only intervention remained stronger
This is not contradictory—it defines the boundary of improvement for this setup.
8) What Did Not Work (and Why We Kept It)
After fixing evaluation issues, failures concentrated in:
Capacity over-commitment
TCV quoting
Discount/promo language
These matter because they:
pass superficial fluency checks
introduce real operational risk
We kept them visible because:
They define the next benchmark expansion target
They prevent over-claiming progress
9) What Changes in v0.2
Planned updates:
Expand adversarial slices for capacity/pricing/promo constraints
Add stricter intervention-time policy enforcement
Run cost-aware ablations
Maintain explicit non-win reporting
10) Reproducibility and Artifacts
Public artifacts:
Dataset: https://huggingface.co/datasets/gemechisw/tenacious_bench_v0.1
Model adapter: https://huggingface.co/gemechisw/Tenacious-DPO-LoRA-v0.1
Repository: https://github.com/gemechisworku/tenacious_bench_v01
Key evidence files
The evidence files below are available in the project's GitHub repository linked above.
ablation_results.json
held_out_traces.jsonl
training/config.yaml
training/metrics.json
training/training_run.log
methodology_rationale.md
References
Rafailov et al., 2023. Direct Preference Optimization (DPO). https://arxiv.org/abs/2305.18290
Meng et al., 2024. SimPO. https://arxiv.org/abs/2405.14734
Hong et al., 2024. ORPO. https://arxiv.org/abs/2403.07691
Kim et al., 2024. Prometheus 2. https://arxiv.org/abs/2405.01535
Li et al., 2025. Preference leakage analysis. https://arxiv.org/abs/2502.01534
