Tenacious-Bench-v0.1: Evaluating Sales Agents for Policy Reliability (Not Just Fluency)
I spent the last week building a benchmark for sales agents.
Not to measure whether they complete tasks — but whether they can be trusted while doing them.
Because in production sales systems, the biggest failures aren’t broken workflows.
They’re confident, polished, policy-violating outputs that look correct — and aren’t.
This post is about what we built, why we built it, and what we learned from trying to train models against it.
Generic agent benchmarks are useful, but they are usually optimized to answer a different question than the one sales teams need answered in production.
The generic question is:
Can the agent complete multi-step tasks with tools?
The production sales question is:
Will the agent make safe, honest, high-signal claims under constraints that directly affect customer trust and revenue risk?
Tenacious-Bench v0.1 was built to answer the second question.
Why this matters
If you’re building LLM systems for production — especially in customer-facing roles — this distinction matters:
Fluency ≠ correctness
Correctness ≠ safety
Safety ≠ policy compliance
Most benchmarks optimize for the first.
Production systems fail on the last.
1) The Gap: Why Generic Benchmarks Under-Specify Sales Failures
Benchmarks such as tau²-Bench are strong for evaluating broad agent workflows, tool use, and task progression. They are not incorrect—they are measuring a wider target.
Failure types that matter in sales but are weakly represented in generic evals
Confidence calibration failures - Claiming high certainty from weak or partial signals.
Capacity commitment failures - Implying bench or delivery capacity that is not authorized.
Pricing-scope failures - Quoting contract values, pricing ranges, or discount language outside policy.
Tone under constraints - Maintaining a non-condescending, professional tone within 60–120 words while still being specific.
Example mismatch
A generic benchmark may reward:
Correct structure
Correct tool call
A “plausible” final answer
A sales-specific benchmark should penalize the same output if it states:
“We can support all your urgent hires this quarter” (unauthorized capacity claim)
“Your annual contract should be around $X” (policy violation on pricing specificity)
Promotional language that conflicts with current policy
This mismatch motivated a narrower benchmark.
2) Audit Method: From Agent Failures to Measurable Rubrics
We started with the agent probes and traces, then converted observed failure patterns into machine-checkable requirements.
trace_respond_874662476a68: policy boundary handling issues
trace_advance_2ef64021c4f8: invalid state-transition behavior
trace_schedule_book_2dc2d85ac0fc and trace_slots_fail: scheduling reliability breakdowns
trace_mem_get_03bdfa202017 and trace_outreach_ae9e643c953b: context/memory coupling failures
Design rules for the benchmark
Machine-verifiable first - Deterministic rubric markers take precedence over subjective scoring (a minimal sketch follows this list).
Sealed held-out split - Strict separation with contamination checks.
Evidence-linked reporting - Every headline claim maps to a concrete artifact.
Publish negative results - If training underperforms a strong prompt baseline, report it directly.
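To illustrate "machine-verifiable first", here is a minimal deterministic rubric check. The marker patterns and word-window values are hypothetical stand-ins, not the benchmark's actual rubric:

```python
import re

# Hypothetical deterministic rubric markers; the real rubric ships with the benchmark.
POLICY_MARKERS = {
    "capacity_overcommit": re.compile(r"\ball your (urgent )?hires\b", re.I),
    "pricing_specificity": re.compile(r"\$\s?\d[\d,]*\s*k?\b", re.I),
    "discount_language":   re.compile(r"\b\d{1,2}%\s+(launch\s+)?discount\b", re.I),
}

def check_output(text: str, min_words: int = 60, max_words: int = 120) -> dict:
    """Return deterministic pass/fail markers for a single agent output."""
    n_words = len(text.split())
    violations = [name for name, pat in POLICY_MARKERS.items() if pat.search(text)]
    return {
        "word_count_ok": min_words <= n_words <= max_words,
        "violations": violations,
        "passed": not violations and min_words <= n_words <= max_words,
    }
```

Because every marker is a deterministic pattern, two runs of the grader agree by construction; subjective scoring only enters where these checks cannot reach.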
3) Dataset Construction: 250 Tasks, Four Generation Modes, One Sealed Held-Out Split
Tenacious-Bench v0.1 contains:
Train: 125
Dev: 75
Held-out: 50
Authoring mix
Trace-derived (~30%)
Programmatic sweeps (~30%)
Multi-LLM synthesis (~25%)
Hand-authored adversarial (~15%)
Why multiple authoring modes
Trace-derived tasks preserve real failure signatures from prior runs.
Programmatic sweeps stress known variables (signal strength, segment, ask type) at scale; a sketch of such a sweep follows this list.
Multi-LLM synthesis increases lexical diversity and reduces overfitting to a single style.
Hand-authored adversarial tasks directly target known policy breakpoints.
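As referenced above, a programmatic sweep can be little more than a cross-product over the stressed variables. The axis values below are hypothetical stand-ins; the real generation axes live with the dataset:

```python
from itertools import product

# Hypothetical sweep axes; actual values belong to the dataset's generation scripts.
SIGNAL_STRENGTHS = ["strong", "mixed", "weak"]
SEGMENTS = ["enterprise", "mid-market", "SMB"]
ASK_TYPES = ["capacity", "pricing", "timeline"]

def sweep_tasks():
    """Yield one task spec per combination of stressed variables."""
    for signal, segment, ask in product(SIGNAL_STRENGTHS, SEGMENTS, ASK_TYPES):
        yield {
            "signal_strength": signal,
            "segment": segment,
            "ask_type": ask,
            "prompt": (
                f"A {segment} prospect with {signal} buying signals asks "
                f"about {ask}. Draft a compliant reply."
            ),
        }
```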
Quality and leakage controls
Model-family separation between generation and judge paths
Pairwise filtering for near-duplicate synthetic candidates
Contamination checks across train vs held-out
Held-out split excluded from preference construction
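One of these controls is easy to make concrete: a minimal sketch of n-gram-overlap contamination screening between train and held-out prompts. The fingerprint size and threshold are assumptions for illustration, not the project's actual settings:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams used as a cheap surface-text fingerprint."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(a: str, b: str, n: int = 8) -> float:
    """Jaccard overlap of word n-grams between two task prompts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def flag_contamination(train_tasks, held_out_tasks, threshold=0.2):
    """Flag held-out tasks that share too much surface text with any train task."""
    return [
        (i, j)
        for i, h in enumerate(held_out_tasks)
        for j, t in enumerate(train_tasks)
        if overlap(h, t) >= threshold
    ]
```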
The result is not a “perfect dataset.” It is an auditable, leakage-controlled starting point.
4) Why We Chose DPO + LoRA
This was a method choice grounded in observed failure types.
The three options
SFT generator - Strong for phrasing quality, weaker for enforcing “block this output” behavior.
Preference tuning (critic / intervention) - Directly optimizes chosen-vs-rejected behavior for safety and consistency.
Process reward model - Powerful, but significantly heavier in both data preparation and runtime complexity.
Why DPO + LoRA fit this project
After analyzing the agent traces, we found that the dominant issues were guardrail and reliability failures, not just phrasing quality.
DPO-based preference tuning provided the best alignment with:
observed failure modes
dataset size constraints
iteration speed requirements
Practical constraints
125 train preference pairs and 75 dev pairs
Colab-class hardware
Need to encode:
“Looks fluent but should be rejected”
DPO satisfied all three.
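To make “looks fluent but should be rejected” concrete, here is what a preference pair can look like in the standard DPO prompt/chosen/rejected format. The record below is invented for illustration; the real pairs are in the published dataset:

```python
# Hypothetical preference pair; real records are in the published dataset.
pair = {
    "prompt": (
        "Prospect (mid-market SaaS) asks whether we can staff five urgent "
        "engineering hires this quarter and what the annual contract would cost."
    ),
    # Chosen: specific and helpful, but stays inside policy boundaries.
    "chosen": (
        "We can scope this quickly: share the five roles and I'll confirm "
        "available capacity with our delivery team this week. Pricing depends "
        "on role mix, so I'd rather walk you through options on a short call "
        "than quote a number blind."
    ),
    # Rejected: fluent and confident, but commits capacity and quotes pricing
    # outside policy, exactly the failure the benchmark penalizes.
    "rejected": (
        "Absolutely, we can support all your urgent hires this quarter. "
        "Your annual contract should be around $180k, and we can likely "
        "apply a 15% launch discount."
    ),
}
```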
Why DPO?
We selected DPO first, with SimPO and ORPO as alternatives.
DPO (Rafailov et al., 2023) - Stable baseline for pairwise preference optimization
SimPO (Meng et al., 2024) - Promising, simpler objective; reserved for follow-up
ORPO (Hong et al., 2024) - Fallback if DPO behavior regressed
LoRA choice
LoRA was selected for:
Lower memory cost
Faster iteration cycles
Adapter-only publishing
Core setup
Qwen2.5 Instruct (3B class) via Unsloth
Max sequence length: 1024
Seed: 42
Effective batching via gradient accumulation
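A minimal sketch of this setup using Unsloth with TRL's DPOTrainer. The checkpoint name, LoRA rank, learning rate, and epoch count are assumptions; the actual values are recorded in training/config.yaml and training/training_run.log:

```python
# Illustrative sketch, not the project's exact run.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Load a 3B-class Qwen2.5 Instruct model in 4-bit to fit Colab-class GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # assumed checkpoint name
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters; only the adapter weights are trained and published.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    random_state=42,
)

# In practice this is the 125-pair train split; one toy record shown here.
train_pairs = Dataset.from_list(
    [{"prompt": "...", "chosen": "...", "rejected": "..."}]
)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batching via accumulation
        learning_rate=5e-6,             # assumed
        num_train_epochs=1,             # assumed
        beta=0.1,                       # standard DPO temperature
        seed=42,
        output_dir="outputs",
    ),
    train_dataset=train_pairs,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```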
5) Paper Grounding Behind the Method Stack
The methodology is informed by:
DPO (Rafailov et al., 2023)
SimPO (Meng et al., 2024)
ORPO (Hong et al., 2024)
Prometheus 2 (Kim et al., 2024)
Preference leakage analysis (Li et al., 2025)
This project does not attempt a full method comparison—it selects a practical first path while preserving room for ablations.
6) Experiment Adjustments That Materially Changed Outcomes
A key lesson: evaluation wiring can mask real capability.
Introduced
trained_intervene mode → trained model acts as an intervention over baseline drafts
Added strict postprocessing → subject prefix, CTA constraints, word limits, cleanup, safety normalization
Tightened inference prompt and decoding controls
These changes removed formatting/directness artifacts and exposed actual behavioral improvements.
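A sketch of what such strict postprocessing can look like in practice. The specific prefix, banned patterns, and limits below are hypothetical; they illustrate the shape of the normalization, not the project's exact rules:

```python
import re

BANNED = [r"\bguarantee\b", r"\b\d{1,2}%\s+discount\b"]  # hypothetical safety list

def postprocess(draft: str, subject_prefix: str = "[Acme] ", max_words: int = 120) -> str:
    """Normalize a model draft before scoring: cleanup, safety, word limit, prefix."""
    text = re.sub(r"\s+", " ", draft).strip()           # whitespace cleanup
    for pat in BANNED:                                   # safety normalization
        text = re.sub(pat, "[removed]", text, flags=re.I)
    words = text.split()
    if len(words) > max_words:                           # enforce word limit
        text = " ".join(words[:max_words]).rstrip(",;") + "."
    if not text.startswith(subject_prefix):              # enforce subject prefix
        text = subject_prefix + text
    return text
```

Applying the same normalization to every condition is what separates formatting artifacts from genuine behavioral differences between models.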
7) Results: Positive Delta A, Negative Delta B
Held-out results (n = 50):
Baseline mean score: 93.44 (pass rate 0.86)
Prompt-only mean score: 100.0 (pass rate 1.0)
Trained mean score: 97.92 (pass rate 0.82)
Delta A (trained vs baseline)
Mean diff: +4.48
95% CI: [3.68, 5.44]
One-sided p-value: 0.0002
Significance: true
Delta B (trained vs prompt-only)
Mean diff: -2.08
95% CI: [-3.36, -0.96]
One-sided p-value: 1.0
“Training beats prompt-only”: false
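For readers who want to reproduce numbers of this shape: one standard way to obtain a paired mean difference, a 95% CI, and a one-sided p-value is a paired bootstrap over per-task scores. This sketch assumes that procedure; the report's exact statistical code is in the repository:

```python
import random

def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=42):
    """Paired bootstrap over per-task score differences (a - b) on the held-out set."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(iters):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    ci = (means[int(0.025 * iters)], means[int(0.975 * iters)])
    # One-sided p-value: fraction of bootstrap means at or below zero.
    p_one_sided = sum(m <= 0 for m in means) / iters
    return sum(diffs) / n, ci, p_one_sided
```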
Interpretation
Training produced statistically supported lift over baseline
Prompt-only intervention remained stronger
This is not contradictory—it defines the boundary of improvement for this setup.
8) What Did Not Work (and Why We Kept It)
After fixing evaluation issues, failures concentrated in:
Capacity over-commitment
TCV quoting
Discount/promo language
These matter because they:
pass superficial fluency checks
introduce real operational risk
We kept them visible because:
They define the next benchmark expansion target
They prevent over-claiming progress
9) What Changes in v0.2
Planned updates:
Expand adversarial slices for capacity/pricing/promo constraints
Add stricter intervention-time policy enforcement
Run cost-aware ablations
Maintain explicit non-win reporting
10) Reproducibility and Artifacts
Public artifacts:
Dataset: https://huggingface.co/datasets/gemechisw/tenacious_bench_v0.1
Model adapter: https://huggingface.co/gemechisw/Tenacious-DPO-LoRA-v0.1
Repository: https://github.com/gemechisworku/tenacious_bench_v01
Key evidence files
The evidence files below are available in the project's GitHub repository linked above.
ablation_results.json
held_out_traces.jsonl
training/config.yaml
training/metrics.json
training/training_run.log
methodology_rationale.md
References
Rafailov et al., 2023. Direct Preference Optimization (DPO). https://arxiv.org/abs/2305.18290
Meng et al., 2024. SimPO. https://arxiv.org/abs/2405.14734
Hong et al., 2024. ORPO. https://arxiv.org/abs/2403.07691
Kim et al., 2024. Prometheus 2. https://arxiv.org/abs/2405.01535
Li et al., 2025. Preference leakage analysis. https://arxiv.org/abs/2502.01534
