Outbound hypothesis testing: the exact playbook we use
Benchmarks, sample sizes, reply rates — a full breakdown of how we validate outbound segments every week.
June 17, 2025
Hey — I’m Rinat.
We at Sally help 25+ US & EU B2B teams scale sales through structured outbound systems. Every week, I share frameworks, benchmarks, and behind-the-scenes lessons on how outbound actually works in practice.
Most outbound teams hit the same wall early:
they spend weeks perfecting copy — without knowing if they’re even targeting the right segments.
That’s where most of the waste happens.
The messaging doesn’t convert, not because it’s bad — but because they’re pitching the wrong audience.
The fix is simple:
validate the segment first, then fine-tune the message.
In this article, I’ll share the exact framework we use to run outbound hypothesis tests, including the sample sizes, reply benchmarks, and volume thresholds we track.
You can benchmark these numbers against your own process — or borrow parts of the system directly.
Whenever we start a new outbound project, we don’t build messaging first
We build hypotheses: which segments we want to test — and why we believe they might convert.
We pull this from actual sales data:
- where deals have already closed
- where there was strong intent in past sales cycles
- which industries have shown early traction
We don’t guess. We start from fact.
Every Monday, our clients get an action plan:
• which segments we're testing
• which channels we’re using
• how many contacts go into each batch
For each hypothesis, we typically run ~500 contacts.
That’s the sample size where we get clean data to evaluate the segment itself.
If we want to A/B test messaging, copy, or language, we split the segment into ~250/250 batches inside that same volume.
The goal is always to isolate variables:
— first we validate the segment,
— then we refine the message.
Running tests on 50–100 contacts doesn’t produce any meaningful signal. At ~500, real patterns start to show.
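If you want a rough feel for why small batches fail: at a ~1% reply rate, the statistical noise on a 100-contact batch is larger than the rate you're trying to measure. Here's a quick back-of-the-envelope sketch (standard binomial margin of error; the function and numbers are illustrative, not part of our tooling):

```python
import math

def reply_rate_margin(n, p=0.01):
    """Approximate 95% margin of error for a reply rate p measured on n contacts."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(round(reply_rate_margin(100), 4))  # ~0.0195 -> +/-1.95 points around a 1% rate
print(round(reply_rate_margin(500), 4))  # ~0.0087 -> the signal starts to separate
```

At 100 contacts the uncertainty band swallows the signal entirely; at 500 it's tight enough to tell a working segment from a dead one.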
What happens next:
• If a segment shows steady replies → we double down.
• If a segment stays cold → we kill it.
👉 If we push 400+ contacts through and see zero replies, it’s dead simple: move on.
Benchmarks we track:
📬 Email:
• 0.5% lead rate → enough to keep testing
• 1–1.5% → ready to scale
🔗 LinkedIn:
• 3–4% reply rate per segment
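If you track batches in a script or spreadsheet, the decision rule is easy to encode. A minimal sketch (the helper and its names are mine; the thresholds are the ones above):

```python
def evaluate_segment(channel, replies, contacts):
    """Classify a segment against the benchmarks above (illustrative only)."""
    rate = replies / contacts
    if channel == "email":
        if rate >= 0.01:   # 1-1.5%+ lead rate: ready to scale
            return "scale"
        if rate >= 0.005:  # 0.5%: enough to keep testing
            return "keep testing"
    elif channel == "linkedin":
        if rate >= 0.03:   # 3-4% reply rate per segment
            return "scale"
    return "kill"

print(evaluate_segment("email", 6, 500))     # 1.2% -> "scale"
print(evaluate_segment("linkedin", 8, 400))  # 2.0% -> "kill"
```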
Why channels behave differently
Email gives fast feedback: opens → replies → leads.
LinkedIn takes longer:
• 15–20% first wave invite acceptance
• 200 invites → ~30–40 connections
• replies often land days or even weeks later
That’s why LinkedIn hypotheses always run for at least 2 full weeks before we evaluate.
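The invite math above is just the acceptance range applied to your send volume. A tiny sketch, assuming the 15–20% first-wave range (the function is mine, for illustration):

```python
def expected_connections(invites, accept_range=(0.15, 0.20)):
    """Estimate the first-wave connection range from the 15-20% acceptance rate above."""
    lo, hi = accept_range
    return round(invites * lo), round(invites * hi)

print(expected_connections(200))  # (30, 40), matching the numbers above
```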
Once the segment works — scaling comes down to data and execution
When a segment proves itself, scaling becomes purely operational — but only if your data layer can keep up.
We’ve invested heavily into this layer. For every project, we build segment-specific databases — pulling from multiple sources:
- LinkedIn
- business directories
- niche listings
- open datasets

all filtered by geography, roles, industries, and different buying signals.
This allows us to spin up fresh, highly targeted lists for every new test — without having to rebuild databases manually for each batch.
Once this data layer is stable, scaling becomes a matter of execution:
1️⃣ More inboxes → more daily volume
2️⃣ More valid contacts → broader reach
3️⃣ More SDRs → faster inbound processing
This system removes the biggest trap most teams fall into: over-optimizing copy before they even know who converts.
In outbound, segment beats copy — every time.