How do you measure whether sales coaching is actually changing rep behavior versus just feeling good in the moment?

The Pulse 4-Quadrant Coaching Diagnostic: every claim that "our coaching program works" must survive all four orthogonal tests. Most programs don't survive even one.

Quadrant 1 — Behavior Specificity. Can you name the *exact* observable behavior at call-tag granularity? Not "discovery skills" but *discovery-question count per first call, target 4–6 from baseline 1–2*. Not "objection handling" but *time-to-first-objection-acknowledgment, target <8s from baseline 22s*.
Instrumentation: Gong call-tags (gong.io/labs/coaching-velocity-2024) or Chorus.ai. Failure mode: programs that can't pass Q1 are *unfalsifiable* — they cannot be wrong, therefore they cannot be right. Cross-ref /knowledge/q88 on instrumentation cost and /knowledge/q08 on activity-vs-outcome metrics.
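A minimal sketch of what "call-tag granularity" could look like in practice, assuming tagged moments have been exported as one record per tag. The field names (`rep`, `call_id`, `call_type`, `tag`) and the sample rows are hypothetical and do not reflect a real Gong or Chorus.ai schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export: one dict per tagged moment in a call.
TAGS = [
    {"rep": "ana", "call_id": "c1", "call_type": "first_call", "tag": "discovery_question"},
    {"rep": "ana", "call_id": "c1", "call_type": "first_call", "tag": "discovery_question"},
    {"rep": "ana", "call_id": "c1", "call_type": "first_call", "tag": "pricing_mention"},
    {"rep": "ana", "call_id": "c2", "call_type": "first_call", "tag": "discovery_question"},
    {"rep": "ana", "call_id": "c9", "call_type": "late_stage", "tag": "discovery_question"},
] + [
    {"rep": "ben", "call_id": "c3", "call_type": "first_call", "tag": "discovery_question"}
] * 5


def discovery_questions_per_first_call(tags):
    """Return {rep: mean discovery-question count per first call}."""
    counts = defaultdict(lambda: defaultdict(int))  # rep -> call_id -> count
    for t in tags:
        if t["call_type"] != "first_call":
            continue
        # Adding a bool registers the call even when the tag is something else,
        # so first calls with other tags but zero discovery questions count as 0.
        counts[t["rep"]][t["call_id"]] += t["tag"] == "discovery_question"
    return {rep: mean(calls.values()) for rep, calls in counts.items()}


def in_target(value, low=4, high=6):
    """Target band from the text: 4 to 6 discovery questions per first call."""
    return low <= value <= high


per_rep = discovery_questions_per_first_call(TAGS)
for rep, value in per_rep.items():
    print(rep, value, "in target" if in_target(value) else "off target")
```

A first call that carries no tags at all would not appear in this export, so a real pipeline would need the full call list as the denominator.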
Quadrant 2 — Counterfactual Identification. Match each coached rep to an uncoached rep on (a) trailing-90-day attainment quartile, (b) tenure bucket, (c) territory ACV decile, (d) ICP overlap. Minimum cell size: 30 reps per arm for behavior, 80 for revenue. Pre-register hypothesis.
Bonferroni-correct when testing >3 metrics. Report Cohen's d, not just p-values. HBR 2024 meta-analysis (hbr.org/2024/11/the-coaching-illusion, n=43 studies) found 60% of published coaching ROI numbers are statistically meaningless — the dominant flaw is letting managers *choose* who to coach (they choose their best reps, then claim the lift).
See /knowledge/q156 on causal inference.
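The matching step can be sketched as exact matching on the four strata named above. This is one simple approach, not necessarily the one any vendor uses; the field names are hypothetical, and "ICP overlap" is assumed here to be reduced to a categorical segment label.

```python
from collections import defaultdict

# Hypothetical field names for the four matching dimensions in the text.
MATCH_KEYS = ("attainment_quartile", "tenure_bucket", "acv_decile", "icp_segment")
MIN_CELL = {"behavior": 30, "revenue": 80}  # minimum reps per arm, from the text


def match_cohorts(coached, uncoached, keys=MATCH_KEYS):
    """Pair each coached rep with one unused uncoached rep in the same stratum."""
    pool = defaultdict(list)
    for rep in uncoached:
        pool[tuple(rep[k] for k in keys)].append(rep)
    pairs, unmatched = [], []
    for rep in coached:
        candidates = pool[tuple(rep[k] for k in keys)]
        if candidates:
            pairs.append((rep["id"], candidates.pop()["id"]))
        else:
            unmatched.append(rep["id"])  # no comparable control: exclude, don't guess
    return pairs, unmatched


def meets_min_cell(pairs, metric="behavior"):
    """Check the matched sample against the minimum cell size for the metric."""
    return len(pairs) >= MIN_CELL[metric]


coached = [
    {"id": "c1", "attainment_quartile": 1, "tenure_bucket": "0-1y", "acv_decile": 3, "icp_segment": "smb"},
    {"id": "c2", "attainment_quartile": 4, "tenure_bucket": "3y+", "acv_decile": 9, "icp_segment": "ent"},
]
uncoached = [
    {"id": "u1", "attainment_quartile": 1, "tenure_bucket": "0-1y", "acv_decile": 3, "icp_segment": "smb"},
    {"id": "u2", "attainment_quartile": 2, "tenure_bucket": "1-3y", "acv_decile": 5, "icp_segment": "smb"},
]
pairs, unmatched = match_cohorts(coached, uncoached)
print(pairs, unmatched, meets_min_cell(pairs))
```

Coached reps with no comparable control are reported rather than force-matched, which keeps the comparison honest at the cost of sample size.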
Quadrant 3 — Stage-3 Deployment. Behavior must show up in *late-stage, high-pressure* calls — not just role-play and discovery. Sales Management Association 2025 (salesmanagement.org/research/2025-coaching-roi, n=1,103 reps) found r=0.71 between stage-3 deployment and revenue, vs r=0.09 for role-play deployment.
Target: 35%+ of late-stage calls by day 60. Most programs never measure this — they stop at "the rep can do it in practice" — see /knowledge/q201 on attribution stacks.
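One way to compute the late-stage deployment rate, assuming each call record carries a numeric pipeline stage, a day offset from program start, and a set of tags. Those fields and the days 30 to 60 measurement window are assumptions for illustration; the text specifies only the 35% target by day 60.

```python
TARGET_RATE = 0.35  # from the text: 35%+ of late-stage calls by day 60


def late_stage_deployment_rate(calls, behavior, min_stage=3, start_day=30, end_day=60):
    """Share of late-stage calls in the window where the behavior was tagged."""
    window = [
        c for c in calls
        if c["stage"] >= min_stage and start_day <= c["day"] <= end_day
    ]
    if not window:
        return None  # no late-stage calls observed, so the rate is undefined
    return sum(behavior in c["tags"] for c in window) / len(window)


# Hypothetical call records.
CALLS = [
    {"stage": 3, "day": 35, "tags": {"discovery_question"}},
    {"stage": 4, "day": 50, "tags": set()},
    {"stage": 3, "day": 58, "tags": {"discovery_question", "objection_ack"}},
    {"stage": 3, "day": 59, "tags": {"objection_ack"}},
    {"stage": 1, "day": 40, "tags": {"discovery_question"}},  # early stage, excluded
    {"stage": 3, "day": 75, "tags": {"discovery_question"}},  # outside window, excluded
]

rate = late_stage_deployment_rate(CALLS, "discovery_question")
print(rate, rate is not None and rate >= TARGET_RATE)
```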
Quadrant 4 — Durability Stress Test. Behavior must survive (a) the coaching manager rotating out, (b) a comp-plan change, (c) end-of-quarter pressure. Gong 2024 baseline (gong.io/research/coaching-effectiveness, n=519k calls): 28% industry-wide durability.
Target: 70%+. RAIN Group 2024 (rainsalestraining.com/research/2024-coaching-effectiveness, n=287 programs): 72% of programs that hit Tier 1 leading metrics fail durability. Reference /knowledge/q142 on Goodhart's Law.
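The text does not define how durability is computed. One possible reading, assumed here purely for illustration, is the share of the coaching gain that remains after each stress event, with the worst case taken as the program's durability. All numbers below are made up.

```python
DURABILITY_TARGET = 0.70  # from the text: 70%+


def retained_share(baseline, post_coaching, post_stress):
    """Fraction of the coaching gain still present after a stress event."""
    gain = post_coaching - baseline
    if gain <= 0:
        return 0.0  # nothing was gained, so nothing can be retained
    return max(0.0, (post_stress - baseline) / gain)


def durability(baseline, post_coaching, stress_readings):
    """Worst-case retained share across all stress events (assumed definition)."""
    return min(
        retained_share(baseline, post_coaching, value)
        for value in stress_readings.values()
    )


# Hypothetical discovery-question counts per first call.
readings = {
    "manager_rotation": 4.5,
    "comp_plan_change": 4.0,
    "end_of_quarter": 4.2,
}
score = durability(baseline=1.5, post_coaching=5.0, stress_readings=readings)
print(round(score, 3), score >= DURABILITY_TARGET)
```

Taking the minimum rather than the average means a behavior that collapses under any single stressor fails the quadrant, which matches the "must survive (a), (b), and (c)" framing.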
The Pulse Coaching Attribution Equation. *True Coaching Lift* = (Coached cohort behavior delta) − (Matched control cohort behavior delta) − (Hawthorne adjustment) − (Selection-bias residual). When all four terms are honestly computed, industry-average True Coaching Lift drops from the *claimed* 23–31% revenue impact to a *measured* 4–7% (Bridge Group 2024, bridgegrouppinc.com/sales-coaching-roi, n=412 orgs).
That 4–7% is still worth the spend at $1,600/seat for Gong, but only if the program clears day 90 of the negative-then-positive ROI curve.
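The attribution equation translates directly into a small function. The input values below are invented, in percentage points, and chosen only to show how a headline number can shrink once each term is subtracted. They are not taken from the Bridge Group dataset.

```python
def true_coaching_lift(coached_delta, control_delta, hawthorne_adj, selection_residual):
    """True Coaching Lift as defined above; all terms in the same unit."""
    return coached_delta - control_delta - hawthorne_adj - selection_residual


# Hypothetical inputs, in percentage points.
headline = 27            # coached cohort delta, the number usually reported
lift = true_coaching_lift(
    coached_delta=headline,
    control_delta=8,     # what matched uncoached reps gained anyway
    hawthorne_adj=6,     # estimated boost from being observed
    selection_residual=8,  # remaining bias from who was chosen for coaching
)
print(f"headline {headline} pts -> true lift {lift} pts")  # headline 27 pts -> true lift 5 pts
```

How the Hawthorne adjustment and selection-bias residual are estimated is left open here, since the text does not specify a method.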
Vendor benchmark (2026 verified).
- Gong: $1,600/seat/yr (best Q1).
- Chorus.ai: $1,200/seat/yr (better CRM sync, weaker tagging).
- Atrium (atriumhq.com): $89/seat/mo (best Q2, cohort matching built-in).
- Salesloft Rhythm: $165/seat/mo (weakest Q3).
- Clari Copilot: $1,800/seat/yr (strongest Q4 longitudinal tracking).
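Because the benchmark mixes monthly and annual prices, a like-for-like comparison needs everything on one basis. The sketch below annualizes the listed prices, assuming monthly plans are billed for 12 months with no annual discount (an assumption, not a vendor term).

```python
# Prices as listed above: (amount in USD per seat, billing period).
VENDORS = {
    "Gong": (1600, "yr"),
    "Chorus.ai": (1200, "yr"),
    "Atrium": (89, "mo"),
    "Salesloft Rhythm": (165, "mo"),
    "Clari Copilot": (1800, "yr"),
}


def annual_cost(price, period):
    """Convert a per-seat price to USD per seat per year."""
    return price * 12 if period == "mo" else price


annualized = {name: annual_cost(*terms) for name, terms in VENDORS.items()}
ranked = sorted(annualized.items(), key=lambda item: item[1])
for name, cost in ranked:
    print(f"{name}: ${cost:,}/seat/yr")
```

On this basis Atrium comes to $1,068/seat/yr and Salesloft Rhythm to $1,980/seat/yr, which makes Salesloft Rhythm the most expensive option listed rather than one of the cheapest.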
90-day implementation playbook.
- *Days 0–7:* Pre-register hypothesis. Pick ONE behavior per Quadrant. Build matched cohort with revops.
- *Days 7–30:* Weekly Gong call-review. Tier 1 leading targets: discovery questions +150% off baseline, MEDDIC completion 22%→78%, call-prep doc 40%→90%.
- *Days 30–60:* Stage-3 deployment tracking begins. Manager 1:1 notes in CRM (free, brutally underused — see /knowledge/q03).
- *Days 60–90:* Tier 2 lagging: stage-2-to-3 conversion +12 pts, cycle -18 days, discount -4 pts.
- *Days 90–180:* Durability stress tests. Manager rotation simulation. Comp-plan shock test. Hawthorne control: blind audit week.
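The Tier 1 leading targets from the Days 7–30 step can be written as a simple check. Note that "+150% off baseline" means 2.5 times the baseline, not 1.5 times. The function name and inputs are hypothetical; the thresholds come from the playbook above.

```python
def tier1_status(baseline_dq, current_dq, meddic_rate, prep_doc_rate):
    """Return pass/fail per Tier 1 leading target from the playbook."""
    return {
        # +150% off baseline means current >= baseline * (1 + 1.5)
        "discovery_questions": current_dq >= baseline_dq * 2.5,
        "meddic_completion": meddic_rate >= 0.78,
        "call_prep_doc": prep_doc_rate >= 0.90,
    }


# Hypothetical day-30 readings for one cohort.
status = tier1_status(baseline_dq=2.0, current_dq=5.0, meddic_rate=0.74, prep_doc_rate=0.92)
print(status)
print("all Tier 1 targets met:", all(status.values()))
```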
Bear Case — five named, quantified failures (one per failure mode).
*Failure 1 — Premature termination (Outreach 2024, ~$410M ARR).* Killed program day 45 after 4-point win-rate dip. Bridge Group's negative-then-positive curve predicted day-90 recovery. Estimated $18M in 2025 expansion bookings lost to under-coached reps.
Reinstated Q3 2025 after CRO turnover. See /knowledge/q47.
*Failure 2 — Manager NPS trap (Salesloft 2023).* Measured how reps *felt* about coaching, not what they *did*. Manager NPS 32→61. Pipeline coverage flat. Classic Goodhart's Law: /knowledge/q142.
*Failure 3 — Activity-vs-outcome confusion (mid-market SaaS, Gartner 2025).* Tracked coaching session count against 4-per-rep-per-month target. 102% target attainment. Discovery questions 1.8→1.9 (noise). $2.1M spent, zero behavior delta. /knowledge/q08.
*Failure 4 — Selection bias (Forrester 2024 case).* Reported 31% revenue lift from coached reps. Matched-pairs re-analysis: actual lift 4%, indistinguishable from noise. Managers coached their best reps. /knowledge/q201.
*Failure 5 — Hawthorne effect (anonymous F500, 2025).* Behaviors moved 40% during the 6-week study window *with observers present*. Six weeks post-study: regression to baseline within 14 days. The coaching wasn't working — the *observation* was. See /knowledge/q156 on observer effects.
Verified numbers. 60% of studies statistically meaningless: HBR 2024, n=43. 72% durability failure rate: RAIN 2024, n=287. 41% stage-2 revert rate: RAIN 2024, n=287. 28% durability baseline: Gong 2024, n=519k. r=0.71 stage-3-to-revenue, r=0.09 role-play-to-revenue: SMA 2025, n=1,103. 60/90-day ROI inflection: Bridge Group 2024, n=412.
True coaching lift 4–7% (vs claimed 23–31%): Pulse Attribution Equation applied to Bridge Group dataset.
One-line answer. Coaching is working only when matched-cohort behavior deltas show Cohen's d>0.5 across all four quadrants — Behavior Specificity, Counterfactual Identification, Stage-3 Deployment, Durability Stress Test — with pre-registered hypotheses and Hawthorne controls.
Anything less is selection bias, Goodhart's Law, or premature termination dressed up as ROI.
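The d>0.5 gate can be sketched with the standard pooled-standard-deviation form of Cohen's d, plus the Bonferroni-adjusted significance threshold mentioned under Quadrant 2. The sample values and per-quadrant effect sizes below are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treated, control):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold when n_tests metrics are tested."""
    return alpha / n_tests


def passes_all_quadrants(d_by_quadrant, threshold=0.5):
    """True only if every quadrant's effect size exceeds the threshold."""
    return all(d > threshold for d in d_by_quadrant.values())


# Hypothetical behavior deltas: coached reps vs matched controls.
d_example = cohens_d([4, 5, 6], [1, 2, 3])

# Hypothetical per-quadrant effect sizes.
effects = {"Q1": 0.8, "Q2": 0.6, "Q3": 0.55, "Q4": 0.3}
print(d_example, bonferroni_alpha(0.05, 4), passes_all_quadrants(effects))
```

In this example the program fails overall because the Quadrant 4 effect size is below 0.5, even though the other three quadrants clear the bar.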
FAQ
What are the four quadrants every coaching-works claim must survive? The Pulse 4-Quadrant Coaching Diagnostic tests Behavior Specificity, Counterfactual Identification, Stage-3 Deployment, and the Durability Stress Test. A program that cannot pass Quadrant 1 is unfalsifiable, meaning it cannot be wrong and therefore cannot be proven right.
The article notes most programs do not survive even one quadrant.
What does Behavior Specificity require you to name? It requires naming the exact observable behavior at call-tag granularity, such as discovery-question count per first call (target 4-6 from a baseline of 1-2) or time-to-first-objection-acknowledgment (target under 8 seconds from a baseline of 22 seconds), not vague skills like "discovery." Instrumentation comes from Gong call-tags or Chorus.ai.
Without this specificity the program is unfalsifiable.
Why is the matched-control cohort step so important? Each coached rep must be matched to an uncoached rep on trailing-90-day attainment quartile, tenure bucket, territory ACV decile, and ICP overlap, with minimum cell sizes of 30 reps per arm for behavior and 80 for revenue. An HBR 2024 meta-analysis of 43 studies found 60% of published coaching ROI numbers are statistically meaningless, the dominant flaw being managers choosing who to coach (they pick their best reps, then claim the lift).
The article requires reporting Cohen's d, not just p-values.
What does honest computation do to the claimed coaching lift? When all four terms of the attribution equation are honestly computed, industry-average True Coaching Lift drops from a claimed 23-31% revenue impact to a measured 4-7% (Bridge Group 2024). That 4-7% is still worth the spend at $1,600 per seat for Gong, but only if the program clears day 90 of the negative-then-positive ROI curve.
The equation subtracts matched-control behavior delta, a Hawthorne adjustment, and a selection-bias residual.
How do the named vendor benchmarks compare on price and strength? Gong is $1,600/seat/yr and best for Quadrant 1, Chorus.ai is $1,200/seat/yr with better CRM sync but weaker tagging, Atrium is $89/seat/mo and best for Quadrant 2 with built-in cohort matching, Salesloft Rhythm is $165/seat/mo and weakest on Quadrant 3, and Clari Copilot is $1,800/seat/yr and strongest on Quadrant 4 longitudinal tracking.
The article also cites the Outreach 2024 failure where killing a program on day 45 after a 4-point win-rate dip lost an estimated $18M in 2025 expansion bookings.
