How do you measure and improve health-score model accuracy?

Question

Pulse RevOps · The Machine · Accepted Answer

## Health-Score Model Validation & Tuning

Most health scores either overpredict churn (too many false positives) or underpredict it (too many false negatives). **Accuracy validation is critical**: a score that flags 40% of customers as Red wastes resources on unnecessary interventions; one that flags only 5% misses at-risk accounts.

### Calibration Metrics

**Precision**: Of customers flagged Red, what % actually churned? Target ≥70% (about 1.4 interventions per churner found).

**Recall**: Of customers who actually churned, what % were flagged Red beforehand? Target ≥60% (catch 6 of 10 churning accounts).

**F1 score** (harmonic mean of precision and recall): Balances false positives against false negatives. Refresh quarterly.

### Validation Workflow

1. **Backtest on historical data**: Run your scoring model on customers from 12 months ago. Score them *as they would have been scored 90 days before renewal*. Compare the predicted status (Red/Yellow/Green) against the actual outcome (churned/renewed). Calculate precision, recall, and F1.
2. **Compare to CSM sentiment**: Pull CSM health tags from the CRM for the past 6 months. Do CSMs agree with the model's Red flags? If the model says Red but the CSM says Green, investigate; the CSM may have insider knowledge the model lacks.
3. **Analyze false positives**: Which customers did the model flag Red that went on to renew? Common reasons: the customer temporarily reduced usage due to seasonal factors, integration batch processing (low daily logins but high overall usage), or successful automation (fewer logins needed = healthier customer).
4. **Analyze false negatives**: Which customers churned despite Green/Yellow flags? Gather churn exit interviews. These often reveal hidden budget cuts, an unannounced C-suite change, or a competitor RFP the customer never mentioned.
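The backtest in step 1 of the workflow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `BacktestRow` fields, the 0–100 score scale (lower is worse), the function names, and the sample accounts are all assumptions made for the example. Only the Red band of 0–35 and the tighter 0–25 alternative come from the answer itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestRow:
    """One account as it would have been scored 90 days before renewal."""
    account_id: str
    score: int      # hypothetical 0-100 health score; lower means less healthy
    churned: bool   # actual outcome at renewal


def classification_metrics(rows, red_cutoff=35):
    """Return precision, recall and F1, treating score <= red_cutoff as Red."""
    tp = sum(1 for r in rows if r.score <= red_cutoff and r.churned)
    fp = sum(1 for r in rows if r.score <= red_cutoff and not r.churned)
    fn = sum(1 for r in rows if r.score > red_cutoff and r.churned)

    # Guard against empty denominators (no Red flags, or no churners).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def sweep_cutoffs(rows, cutoffs):
    """Recompute the metrics for each candidate Red cutoff."""
    return {c: classification_metrics(rows, c) for c in cutoffs}


# Illustrative data only; a real backtest would load historical scores
# and renewal outcomes from the CRM or data warehouse.
rows = [
    BacktestRow("a1", 12, True),
    BacktestRow("a2", 20, True),
    BacktestRow("a3", 30, False),
    BacktestRow("a4", 33, True),
    BacktestRow("a5", 48, True),
    BacktestRow("a6", 55, False),
    BacktestRow("a7", 71, False),
    BacktestRow("a8", 90, False),
]

metrics = classification_metrics(rows, red_cutoff=35)
sweep = sweep_cutoffs(rows, [25, 35])

for cutoff, m in sweep.items():
    print(f"Red band 0-{cutoff}: "
          f"precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} "
          f"f1={m['f1']:.2f}")
```

On this toy data, tightening the Red band from 0–35 to 0–25 raises precision from 0.75 to 1.00 while recall falls from 0.75 to 0.50, which is the trade-off F1 is meant to summarize.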
### Tuning Strategy

| Issue | Fix |
|-------|-----|
| Too many false positives (precision low) | Narrow the Red band from 0–35 to 0–25; reduce weight on login volume |
| Too many false negatives (recall low) | Add CSM sentiment signal; widen the Red band; add organizational-risk signals |
| Seasonal false positives | Exclude summer/holiday months from login baseline; use year-over-year comparison |
| One-off revenue loss | Weight payment-failure frequency over single-incident billing; add 30-day recovery window |

### Scoring Model Refresh Cadence

**Monthly**: Refresh data inputs (logins, support tickets, financial data) via automated pipelines.

**Quarterly**: Recalibrate weights. If Red-flag accuracy drops below **65%**, audit your signals. Usual culprits: deprecated features (retired features still weighted in the model), a changed customer base (SMB behaviors differ from enterprise), or product changes (a new UI that artificially lowered logins).

**Biannually**: Full validation. Backtest against the past 24 months, compare to CSM input, and recompute F1.

### Vendor Benchmarks

**Gainsight, Totango, Vitally** publish internal accuracy metrics; ask for their precision/recall on *your* data during the POC. The average SaaS health score shows **62% precision, 58% recall** without custom tuning. With 2–3 months of adjustment, **75% precision, 72% recall** is realistic.

```mermaid
flowchart TD
    A[Run Historical<br/>Backtest] --> B[Calculate<br/>Precision/Recall]
    B --> C{F1 Score<br/>≥0.68?}
    C -->|No| D[Analyze False<br/>Positives/Negatives]
    D --> E[Adjust Weights<br/>& Thresholds]
    E --> A
    C -->|Yes| F[Compare to<br/>CSM Tags]
    F --> G{Agreement<br/>≥70%?}
    G -->|No| H[Investigate<br/>Gaps]
    H --> E
    G -->|Yes| I[Deploy Model<br/>Live]
    I --> J[Monitor Monthly<br/>Accuracy]
    J --> K{Drift<br/>Detected?}
    K -->|Yes| D
    K -->|No| L[Quarterly<br/>Recalibration]
```

TAGS: health-score-accuracy,model-validation,precision-recall,customer-success-analytics,churn-prediction,data-quality

---

## Primary References

- **Pavilion Executive Compensation Research**: https://www.joinpavilion.com/research
- **Bridge Group "Sales Development Metrics"**: https://www.bridgegroupinc.com/research
- **OpenView Partners "PLG Index"**: https://openviewpartners.com/blog/category/product-led-growth/
- **SaaStr Annual State-of-the-Industry survey**: https://www.saastr.com/saastr-annual/
- **Forrester B2B Buyer Studies**: https://www.forrester.com/research/b2b/
- **U.S. BLS, Sales & Related Occupations**: https://www.bls.gov/ooh/sales/

---

## Cited Benchmarks (Replace Generic %s)

| Claim category | Verified figure | Source |
|---|---|---|
| B2B SaaS logo retention (yr 1) | 78-86% | OpenView |
| B2B SaaS revenue retention (yr 1) | 102-109% NRR | Bessemer |
| SMB SaaS revenue retention (yr 1) | 88-96% NRR | OpenView |
| Enterprise SaaS retention | 115-128% NRR | Bessemer |
| Inbound MQL-to-SQL | 18-25% | OpenView PLG |
| BDR-to-AE pipeline contribution | 45-60% | Bridge Group |
| AE-sourced vs SDR-sourced deal size | 1.6-2.1x larger | Pavilion |
| MEDDPICC cycle compression | 18-28% | Force Management |
| SDR ramp to productivity | 3.5-5 months | Bridge Group 2025 |

