How do I clean a CRM that has 5 years of bad data?
Declare CRM bankruptcy. Archive deals older than 12 months as read-only, run a one-time 2-week cleanup on accounts with deals touched in the last 90 days, then hold all new data to a forward-only data contract enforced by validation rules and enrichment webhooks. Total project: 4 weeks elapsed, $20K-$40K (mostly Apollo or ZoomInfo enrichment plus a RevOps lead and an analyst, each at 0.5 FTE). The reason this beats a 6-month historical cleanup: per Validity's 2025 State of CRM Data Report, 91% of CRM data is incomplete, stale, or duplicated within 12 months, and Salesforce's own Data Quality whitepaper says ~30% of records become inaccurate per year regardless of cleanup effort. You cannot out-clean entropy. You can only out-govern it. Of the projects that fail, roughly two in three fail because teams skip the forward contract and just deduplicate; the rest fail because management refuses to enforce. Plan accordingly.
Why traditional cleanup fails (the math)
Most RevOps teams attack the symptom: "let's deduplicate, fill empty fields, standardize names." The brutal arithmetic:
- Decay rate: Gartner's 2024 CRM benchmark pegs B2B contact data decay at 22.5% per year (people change jobs, companies merge, domains lapse). Validity's State of CRM Data Report puts it higher, near 30%.
- Effort to clean 5 years of records: For a typical 100K-account org, manual + tool-assisted cleanup runs 600-900 hours. At a $120/hr loaded RevOps cost, that's $72K-$108K.
- Half-life of cleaned data: Without governance, ~40% of cleaned records are dirty again within 60 days (Validity benchmark). You spent $100K to buy 60 days.
ROI is not just negative; it is *structurally* negative. The fix is not more cleaning. It is a new contract.
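The arithmetic above can be checked with a short sketch. The hours, hourly rate, and decay rates come from the bullets; the assumption that decay compounds at a constant annual rate is added here for illustration and is not a claim made by the cited reports. The function names are hypothetical.

```python
def cleanup_cost(hours_low, hours_high, hourly_rate):
    """Return the (low, high) dollar cost of a historical cleanup."""
    return hours_low * hourly_rate, hours_high * hourly_rate


def accurate_share_after(years, annual_decay):
    """Share of records still accurate after `years`.

    Assumption: decay compounds at a constant annual rate.
    """
    return (1 - annual_decay) ** years


low, high = cleanup_cost(600, 900, 120)
print(f"Historical cleanup: ${low:,} to ${high:,}")

for rate in (0.225, 0.30):
    share = accurate_share_after(5, rate)
    print(f"At {rate:.1%}/yr decay, {share:.0%} of untouched records are still accurate after 5 years")
```

Under that compounding assumption, most of a 5-year-old dataset has already decayed, so a one-time pass over it mostly spends money on records that will be stale again soon.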
The bankruptcy approach (real mechanics)
Step 1 - Define the active set (Day 1-2)
In Salesforce, create a custom checkbox field Active_Cohort__c and a scheduled flow that sets it to TRUE for any Account where at least one of the following holds:
- Opportunity.LastModifiedDate >= TODAY() - 90, OR
- Account.Stage = 'Customer' AND Renewal_Date__c <= TODAY() + 365, OR
- An Activity (Task/Event) was logged on the Account in the last 90 days.
Important Salesforce governor caveat: automation that retrieves more than 50,000 records in a single transaction will hit the SOQL query-row governor limit. Use Batch Apex or a Queueable Apex job that works through the Accounts 200 records at a time, rather than a single pass over the whole table. Do not learn this in production; learn it in a full-copy sandbox.
Everything else is the Archive cohort. Archive cohort gets moved to a separate Salesforce record type (Account_Archive) with read-only page layouts. You are not deleting; you are removing it from list views, reports, and dashboards. This single move usually shrinks the "active" surface by 70-85%.
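As a rough illustration of the active-set criteria, here is a minimal Python sketch that classifies one exported account record. The dictionary keys (opp_last_modified, stage, renewal_date, last_activity) are hypothetical names for an export, not Salesforce API fields, and the real implementation would live in a flow or Apex.

```python
from datetime import date, timedelta


def is_active(account, today):
    """Apply the three active-set criteria to one account dict."""
    cutoff = today - timedelta(days=90)
    renewal_horizon = today + timedelta(days=365)

    # Criterion 1: any Opportunity modified in the last 90 days.
    recent_opp = any(d >= cutoff for d in account.get("opp_last_modified", []))

    # Criterion 2: a customer whose renewal date falls within the next year.
    renewal = account.get("renewal_date")
    renewing_customer = (
        account.get("stage") == "Customer"
        and renewal is not None
        and renewal <= renewal_horizon
    )

    # Criterion 3: any Task/Event logged in the last 90 days.
    last_activity = account.get("last_activity")
    recent_activity = last_activity is not None and last_activity >= cutoff

    return recent_opp or renewing_customer or recent_activity


today = date(2025, 6, 1)
accounts = [
    {"name": "Acme", "opp_last_modified": [date(2025, 5, 10)]},
    {"name": "Globex", "stage": "Customer", "renewal_date": date(2025, 12, 1)},
    {"name": "Initech", "last_activity": date(2023, 1, 15)},
]
archive = [a["name"] for a in accounts if not is_active(a, today)]
print("Archive cohort:", archive)
```

Running the classifier over an export before building the flow gives an early read on how much of the database would land in the Archive cohort.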
Step 2 - Deduplicate the active cohort (Week 1)
Use Salesforce Duplicate Management (setup docs). Create a Matching Rule with:
- Account name: Fuzzy match (Levenshtein), threshold 85%
- Website domain: Exact match (after stripping www. and protocol)
- Billing country + state: Exact
Then a Duplicate Rule that runs the matching rule on insert/update and fires a "Merge Required" alert. For existing dupes, run DemandTools or Cloudingo (both have free trial tiers). Master record selection rule: keep the record with (a) most recent Opportunity activity, (b) tie-break on most populated fields, (c) tie-break on oldest CreatedDate. Document this in a 1-page runbook so the merge decisions are defensible later.
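The matching and master-record logic can be prototyped outside Salesforce before the runbook is written. The sketch below is an approximation under stated assumptions: it uses Python's difflib similarity ratio as a stand-in for Salesforce's edit-distance matcher (the two will not score identically), and the record shape (last_opp_activity, fields, created) is hypothetical.

```python
from datetime import date
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize_domain(website):
    """Strip protocol, leading www., path, and case from a website value."""
    value = website.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat the value as a host
    host = urlparse(value).netloc
    return host[4:] if host.startswith("www.") else host


def names_match(a, b, threshold=0.85):
    """Fuzzy name comparison; difflib's ratio stands in for Levenshtein."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def pick_master(records):
    """Pick the surviving record: latest opportunity activity, then most
    populated fields, then oldest created date.

    Assumption: every record has a date in last_opp_activity.
    """
    def populated(rec):
        return sum(1 for v in rec["fields"].values() if v not in (None, ""))

    return max(
        records,
        key=lambda r: (r["last_opp_activity"], populated(r), -r["created"].toordinal()),
    )


dupes = [
    {"id": "A", "last_opp_activity": date(2025, 5, 1),
     "fields": {"Industry": "SaaS", "Employees": None}, "created": date(2020, 1, 1)},
    {"id": "B", "last_opp_activity": date(2025, 5, 1),
     "fields": {"Industry": "SaaS", "Employees": 200}, "created": date(2022, 1, 1)},
]
print(normalize_domain("https://www.Acme.com/about"), pick_master(dupes)["id"])
```

Encoding the tie-breaks as a single sort key keeps the merge decision deterministic, which is what makes it defensible later.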
Step 3 - Enrich, don't fill (Week 2)
Every active Account gets pushed through Apollo or ZoomInfo bulk enrichment API. Apollo's 2026 list price is roughly $0.10-$0.20/record at volume; ZoomInfo runs $0.40-$1.00. For 5,000 active accounts, budget $500-$5,000.
Fields to enrich (non-negotiable): Industry, Employees, Revenue_Band__c, HQ_Country__c, Tech_Stack__c (if available). Do not try to fill Notes, Description, or any free-text field with enrichment data; those are human-entered context and dirty enrichment will pollute search.
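One way to keep enrichment out of free-text fields is an explicit whitelist applied when vendor data is merged into a record. This is a minimal sketch: the field list comes from the paragraph above, while the function names and the choice to overwrite existing values with non-empty vendor values are assumptions for illustration. The budget helper simply multiplies record count by a per-record price range.

```python
ENRICHABLE_FIELDS = (
    "Industry",
    "Employees",
    "Revenue_Band__c",
    "HQ_Country__c",
    "Tech_Stack__c",
)


def apply_enrichment(record, enriched):
    """Return a copy of `record` with only whitelisted fields updated.

    Free-text fields such as Notes or Description are never touched,
    even if the vendor payload includes them.
    """
    updated = dict(record)
    for field in ENRICHABLE_FIELDS:
        value = enriched.get(field)
        if value not in (None, ""):
            updated[field] = value
    return updated


def enrichment_budget(record_count, price_low, price_high):
    """Return the (low, high) enrichment cost for a given record count."""
    return record_count * price_low, record_count * price_high


account = {"Name": "Acme", "Industry": "", "Description": "Met CFO at conference"}
vendor = {"Industry": "Software", "Employees": 250, "Description": "Acme Inc. is a..."}
print(apply_enrichment(account, vendor))
print(enrichment_budget(5000, 0.10, 1.00))
```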
Step 4 - Forward data contract (Week 3-4)
This is the only step that compounds. Implement four hard validation rules in Salesforce (or HubSpot equivalent):
- Lead/Contact creation: (Email != null AND Email NOT LIKE '%@gmail.com' AND Email NOT LIKE '%@yahoo.com') OR Source = 'Inbound Demo'. This blocks free webmail on B2B lead intake while still letting inbound demo requests through.
- Opportunity creation: 5 required fields enforced by validation rule - Account, Contact_Role__c, Industry__c, Stage, ACV__c. No exceptions.
- Stage advancement: Rule blocks Stage > 'Discovery' if Decision_Maker_Identified__c = FALSE. Same for Stage > 'Proposal' if Mutual_Action_Plan__c = NULL.
- Closed-Won handoff: Apex trigger (OpportunityHandoffTrigger) requires the CSM checkbox CS_Handoff_Verified__c = TRUE before stage flips to 'Closed Won'. If it is false, the trigger calls addError('CS handoff verification required before close') on the record, which blocks the save. CSM has 5 fields to verify; if any are wrong, they un-check the box and email the rep. Code lives in version control, deployed via SFDX/CI, never edited in production.
Plus one webhook: Apollo enrichment fires on Lead/Account creation (Zapier or native Salesforce flow with HTTP callout). New record gets industry/size in <2 minutes. Zero rep effort.
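Before building the rules in Salesforce, it can help to express the four checks as plain predicates and run them against an export to see how many existing records would fail. The sketch below is not Salesforce code. The STAGE_ORDER list is a hypothetical pipeline (the source does not spell out the full stage sequence), and the function names are illustrative.

```python
FREE_WEBMAIL = ("@gmail.com", "@yahoo.com")
REQUIRED_OPP_FIELDS = ("Account", "Contact_Role__c", "Industry__c", "Stage", "ACV__c")
# Assumption: a hypothetical stage sequence; substitute your org's real stages.
STAGE_ORDER = ["Prospecting", "Discovery", "Proposal", "Negotiation", "Closed Won"]


def lead_errors(lead):
    """Rule 1: require a non-webmail email unless the source is Inbound Demo."""
    if lead.get("Source") == "Inbound Demo":
        return []
    email = (lead.get("Email") or "").lower()
    if not email:
        return ["Email is required"]
    if email.endswith(FREE_WEBMAIL):
        return ["Free webmail addresses are blocked on B2B intake"]
    return []


def opportunity_errors(opp):
    """Rules 2-4: required fields, stage gates, and the Closed Won handoff."""
    errors = [f"{f} is required" for f in REQUIRED_OPP_FIELDS if opp.get(f) in (None, "")]
    stage = opp.get("Stage")
    rank = STAGE_ORDER.index(stage) if stage in STAGE_ORDER else -1
    if rank > STAGE_ORDER.index("Discovery") and not opp.get("Decision_Maker_Identified__c"):
        errors.append("Decision maker must be identified past Discovery")
    if rank > STAGE_ORDER.index("Proposal") and not opp.get("Mutual_Action_Plan__c"):
        errors.append("Mutual action plan required past Proposal")
    if stage == "Closed Won" and not opp.get("CS_Handoff_Verified__c"):
        errors.append("CS handoff verification required before close")
    return errors


opp = {"Account": "Acme", "Contact_Role__c": "Champion", "Industry__c": "Software",
       "Stage": "Negotiation", "ACV__c": 48000, "Decision_Maker_Identified__c": True}
print(opportunity_errors(opp))
```

A dry run like this against the active cohort shows how many open deals would be blocked on day one, which is useful input for the sandbox and change-management period.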
Cleanup project scope (the full 4-week plan)
| Week | Workstream | Owner | Effort | Deliverable |
|---|---|---|---|---|
| 1 | Define Active cohort, archive rest | RevOps Lead | 5 days | Active_Cohort__c flag populated; Archive record type live |
| 1-2 | Deduplicate active accounts (DemandTools/Cloudingo) | Data Analyst | 8 days | <2% dupe rate on active cohort |
| 2 | Enrich active cohort (Apollo/ZoomInfo bulk) | Data Analyst | 3 days | 95%+ fill on Industry/Employees/Revenue |
| 3 | Build 4 validation rules + 1 enrichment webhook | RevOps + Salesforce Admin | 5 days | Rules in production, tested in sandbox first |
| 3-4 | Train reps + managers on new contract | RevOps Lead | 2 days | 30-min training, written 1-pager, manager sign-off |
| 4 | Run first monthly audit, calibrate rules | RevOps Lead | 2 days | Audit report, 3-5 rule tweaks |
Budget: $20K-$40K all-in. Enrichment up to $5K (Apollo or ZoomInfo). DemandTools/Cloudingo $3K-$8K (3-month license). 0.5 FTE RevOps lead for 4 weeks (~$15K loaded). 0.5 FTE data analyst (~$8K). No external vendor required.
Governance rules (the part that compounds)
- Inbound prospect: Auto-enriched within 2 hours via Apollo/ZoomInfo webhook. Rep cannot create a Lead without an email domain that resolves to a company record.
- Outbound prospect: Rep clicks "Add to Salesforce" from Apollo/Outreach plugin (not manual entry). The plugin fills 80% of fields; rep verifies in 30 seconds.
- Deal creation: 5 required fields enforced at the database layer (Account, Contact Role, Industry, Stage, ACV). Validation rule fires on save. No "I'll fix it later."
- Deal close: Apex trigger requires CS handoff checkbox. CSM verifies 5 fields (contact email, company legal name, billing country, ACV, contract end date) before stage flips. If wrong, CSM unchecks; rep gets a Slack ping.
- Monthly audit: Each manager runs a saved report on their team's 10 oldest open opps. Wrong/missing fields = coaching. Two violations in 90 days = comp accelerator suspended for the quarter. Public, predictable, applied uniformly.
This is not "best practices"; it is forcing the cost of bad data onto the person who created it, in real time. That is the only mechanism that works.
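The monthly audit rule is mechanical enough to sketch. The example below assumes violations are logged as dates per rep and open opportunities carry a created date; both shapes, and the function names, are hypothetical. It shows the "10 oldest open opps" selection and the "two violations in 90 days" check.

```python
from datetime import date, timedelta


def oldest_open_opps(opps, n=10):
    """Return the n oldest open opportunities, oldest first."""
    open_opps = [o for o in opps if not o.get("is_closed")]
    return sorted(open_opps, key=lambda o: o["created"])[:n]


def accelerator_suspended(violation_dates, as_of, window_days=90, threshold=2):
    """True if the rep has `threshold` or more violations in the trailing window."""
    window_start = as_of - timedelta(days=window_days)
    recent = [d for d in violation_dates if window_start <= d <= as_of]
    return len(recent) >= threshold


violations = [date(2025, 3, 1), date(2025, 5, 1)]
print(accelerator_suspended(violations, as_of=date(2025, 5, 15)))
```

Keeping the rule this simple is part of what makes it predictable: anyone can recompute the outcome from the audit log.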
Tools (with real prices)
*HubSpot equivalents in parentheses where applicable. Most of this approach maps 1:1 to HubSpot Operations Hub Enterprise.*
- Salesforce Duplicate Management - included in Sales Cloud Enterprise+. Use for prevention rules.
- DemandTools by Validity - $5K-$15K/year, full-featured cleanup. Use for the one-time bulk merge.
- Apollo enrichment - $0.10-$0.20/record, $99-$149/user/month for plug-in.
- ZoomInfo enrichment - $0.40-$1.00/record; better for enterprise data, worse on SMB.
- Salesforce Einstein Data Detect - bundled with Einstein 1, flags anomalies. Worth turning on if you already pay.
Bear Case (when bankruptcy fails)
This approach is wrong in three scenarios:
- You are in a regulated industry where archived records still need active monitoring - e.g., pharma deal records under HIPAA, financial services under FINRA, or EU customers under GDPR retention rules. Archiving without active governance can violate retention policy. Fix: archive to a compliant cold-storage system (Salesforce Big Objects, S3 with Glacier, or your records-management platform) with an audit log, not just a record-type change. SOC2/SOX auditors will ask for the audit log on day 1; if you cannot produce one, you have a finding.
- Your old data is your moat - if you sell to the same accounts repeatedly (renewals + expansion at >70% of revenue), the historical context (champions who left, product complaints, deal stalls) is institutional knowledge. Archiving it kills the AE's first-call advantage. Fix: keep the *contact* and *activity* history hot for any account that ever signed a contract; only archive cold prospects. Quick math: if 70% of next year's revenue comes from existing accounts and the average AE saves 30 minutes of research per renewal call by having the history hot, then on a 200-rep org with 10 renewal calls per rep per quarter that is 200 x 10 x 30 = 60,000 minutes per quarter, or 240,000 minutes (4,000 hours) per year. At the same $120/hr loaded rate used for the cleanup estimate, that is roughly $480K of rep time a year - more than the cleanup budget.
- Your reps will revolt and game the validation rules - if the org has weak management, reps will create fake Industry = 'Other' and ACV = 1 placeholder records to bypass rules. Common gaming patterns: Industry = "Other" on 60% of new accounts, ACV = 1 placeholder records that get fixed "later" (never), copy-pasted Decision_Maker_Identified flags with no actual person named, fake Mutual_Action_Plan = "TBD" text. The rules become noise, the data gets worse, trust collapses. Fix: do not roll out validation rules without a 90-day audit + consequence plan. If your sales leaders won't enforce, do not start. The rules without enforcement are negative-value.
If any of those three apply, do not declare bankruptcy. Do a slower, surgical cleanup with a vendor (Cloudingo + a 2-person consulting team for 8 weeks, $60K-$100K), and skip the validation rules until you have management buy-in.
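The gaming patterns listed in the third scenario are easy to measure once validation rules are live. Here is a minimal sketch that counts them in an export; the record shapes and the function name are hypothetical, and what level counts as a problem is left to the audit owner.

```python
def gaming_signals(new_accounts, opportunities):
    """Summarize how often the known placeholder patterns appear."""
    total_accounts = len(new_accounts) or 1  # avoid dividing by zero
    other = sum(1 for a in new_accounts if a.get("Industry") == "Other")
    acv_placeholders = sum(1 for o in opportunities if o.get("ACV__c") == 1)
    tbd_plans = sum(
        1
        for o in opportunities
        if (o.get("Mutual_Action_Plan__c") or "").strip().upper() == "TBD"
    )
    return {
        "industry_other_share": other / total_accounts,
        "acv_placeholder_count": acv_placeholders,
        "tbd_plan_count": tbd_plans,
    }


accounts = [{"Industry": "Other"}, {"Industry": "Other"}, {"Industry": "Other"},
            {"Industry": "Software"}, {"Industry": "Retail"}]
opps = [{"ACV__c": 1, "Mutual_Action_Plan__c": "tbd"},
        {"ACV__c": 52000, "Mutual_Action_Plan__c": "Pilot by Q3, legal review in July"}]
print(gaming_signals(accounts, opps))
```

Tracking these three numbers in the monthly audit gives managers an early warning that the rules are being bypassed rather than followed.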
Success metrics (90 days post-cleanup)
- Duplicate account rate on active cohort: <2% (typical pre-cleanup: 8-15%)
- Industry/Employees/Revenue fill rate on active accounts: 95%+ (was 50-70%)
- "Which deal is this?" Slack questions to RevOps: down 70%
- Forecast variance (committed vs. actual): improves 200-400 bps quarter-over-quarter
- Time from Lead-In to first rep call: down 30-50% (auto-enrichment removes the research step)
- Rep CRM hours/week: down from ~6 to ~3 (validation + enrichment kills manual entry)
What NOT to do
- Do not hire a "CRM cleanup vendor" to replay 5 years of history. You will pay $100K and reps will not trust the result.
- Do not roll out validation rules in production without a 2-week sandbox + change-management plan. You will brick rep workflows on a Monday morning and own the political fallout.
- Do not try to fill every field. Pick the 5 that drive segmentation, routing, and forecasting; ignore the rest.
- Do not start in Q4 or Q1. Run cleanup in slow season (June-August or December dead week).
- Do not let leadership exempt themselves. CRO records get audited like everyone else.
What 10/10 success looks like at month 6
- Active cohort dupe rate stable <2% (it stays clean because the matching rule fires on every insert).
- New Lead-to-MQL time down from days to hours (enrichment webhook removes the research bottleneck).
- Forecast variance is *boring*: managers stop asking "is this number real?" and start asking "what do we do about it?"
- Reps stop emailing RevOps screenshots of broken records, because the validation rules made them impossible to create.
- The cleanup project has paid for itself 4-6x in saved rep hours + improved forecast credibility with the board.
Action (this week)
- Pull a count of accounts where LastActivityDate >= TODAY() - 90. That is your active cohort size.
- Run a duplicate report: SELECT Name, Website, COUNT(Id) FROM Account GROUP BY Name, Website HAVING COUNT(Id) > 1. That is your dupe baseline.
- Quote Apollo or ZoomInfo for bulk enrichment of N records (where N = active cohort size).
- Block 4 weeks on the calendar. Get CRO sponsor sign-off in writing. Start Monday of week 1.
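If running the aggregate query in Salesforce is inconvenient, the same dupe baseline can be computed from an exported account list. The sketch below mirrors the GROUP BY Name, Website query on a list of dictionaries. The duplicate-rate definition (surplus copies divided by total records) is an assumption, since the article does not define how the rate is calculated.

```python
from collections import Counter


def duplicate_groups(accounts):
    """Group by exact (Name, Website) and keep groups with more than one record."""
    counts = Counter((a["Name"], a["Website"]) for a in accounts)
    return {key: n for key, n in counts.items() if n > 1}


def duplicate_rate(accounts):
    """Share of records that are surplus copies (assumed definition)."""
    if not accounts:
        return 0.0
    surplus = sum(n - 1 for n in duplicate_groups(accounts).values())
    return surplus / len(accounts)


export = [
    {"Name": "Acme", "Website": "acme.com"},
    {"Name": "Acme", "Website": "acme.com"},
    {"Name": "Globex", "Website": "globex.com"},
    {"Name": "Initech", "Website": "initech.com"},
]
print(duplicate_groups(export))
print(f"Dupe rate: {duplicate_rate(export):.0%}")
```

Exact matching on Name and Website understates the true duplicate count, so treat this number as a floor for the baseline rather than the final figure.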
Related reading: see /knowledge/q110 for stack selection (Outreach/Salesloft/Apollo), /knowledge/q112 for attribution implications of clean data, /knowledge/q115 for the org chart that owns this work, /knowledge/q300 for how clean pipeline data shows up in forecast reliability, and /knowledge/q250 for what CSM notes + product usage data should feed back into the CRM post-handoff.
TAGS: crm-cleanup, data-quality, salesforce-dedup, data-governance, database-maintenance, validity-demandtools, apollo-enrichment