
Outbound hypothesis testing: the exact playbook we use

Benchmarks, sample sizes, reply rates — a full breakdown of how we validate outbound segments every week.

June 17, 2025

Hey — I’m Rinat.

At Sally, we help 25+ US & EU B2B teams scale sales through structured outbound systems. Every week, I share frameworks, benchmarks, and behind-the-scenes lessons on how outbound actually works in practice.

Most outbound teams hit the same wall early:
they spend weeks perfecting copy — without knowing if they’re even targeting the right segments.

That’s where most of the waste happens.
The messaging doesn’t convert, not because it’s bad — but because they’re pitching the wrong audience.

The fix is simple:
validate the segment first, then fine-tune the message.

In this article, I’ll share the exact framework we use to run outbound hypothesis tests — including the sample sizes, reply benchmarks, and volume thresholds we track.
You can benchmark these numbers against your own process — or borrow parts of the system directly.

Whenever we start a new outbound project, we don’t build messaging first

We build hypotheses: which segments we want to test — and why we believe they might convert.

We pull this from actual sales data:

  • where deals have already closed

  • where there was strong intent in past sales cycles

  • which industries have shown early traction

We don’t guess. We start from fact.

Every Monday, our clients get an action plan:

• which segments we're testing

• which channels we’re using

• how many contacts go into each batch

For each hypothesis, we typically run ~500 contacts.

That’s the sample size where we get clean data to evaluate the segment itself.

If we want to A/B test messaging, copy, or language, we split the segment into ~250/250 batches inside that same volume.

The goal is always to isolate variables:

— first we validate the segment,

— then we refine the message.

Running tests on 50–100 contacts doesn’t produce any meaningful signal. At ~500, real patterns start to show.
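A quick back-of-the-envelope sketch of why small batches are noisy (my own illustration, not part of the Sally stack): at a 1% lead rate, 100 contacts yield only ~1 expected reply, and the statistical margin of error is bigger than the rate you're trying to measure.

```python
import math

def expected_replies(contacts: int, reply_rate: float) -> float:
    """Expected number of replies for a batch at a given reply rate."""
    return contacts * reply_rate

def margin_of_error(contacts: int, reply_rate: float) -> float:
    """95% normal-approximation margin of error on the observed rate."""
    return 1.96 * math.sqrt(reply_rate * (1 - reply_rate) / contacts)

rate = 0.01  # the 1% "ready to scale" email lead rate
for n in (50, 100, 500):
    print(n, expected_replies(n, rate), round(margin_of_error(n, rate), 4))
```

At 50 contacts the margin of error (~±2.8%) dwarfs the 1% benchmark itself; at 500 it shrinks to under ±1%, which is why real patterns only start to show around that volume.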

What happens next:

• If a segment shows steady replies → we double down.

• If a segment stays cold → we kill it.

👉 If we push 400+ contacts through and see zero replies, it’s dead simple: move on.

Benchmarks we track:

📬 Email:

• 0.5% lead rate → enough to keep testing

• 1–1.5% → ready to scale

🔗 LinkedIn:

• 3–4% reply rate per segment
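The email thresholds above boil down to a simple decision rule. Here's a sketch (the function name and the 400-contact cutoff for a verdict are my framing of the numbers in this article):

```python
def evaluate_email_segment(contacts: int, leads: int) -> str:
    """Classify an email segment against the benchmarks above.

    <0.5% lead rate after 400+ contacts -> kill,
    0.5-1% -> keep testing, >=1% -> scale.
    """
    if contacts < 400:
        return "keep testing"  # not enough volume for a verdict yet
    lead_rate = leads / contacts
    if lead_rate >= 0.01:
        return "scale"
    if lead_rate >= 0.005:
        return "keep testing"
    return "kill"

print(evaluate_email_segment(500, 6))  # 1.2% -> "scale"
print(evaluate_email_segment(500, 0))  # zero replies -> "kill"
```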

Why channels behave differently

Email gives fast feedback: opens → replies → leads.

LinkedIn takes longer:

• 15–20% first wave invite acceptance

• 200 invites → ~30–40 connections

• replies often land days or even weeks later

That’s why LinkedIn hypotheses always run for at least 2 full weeks before we evaluate.
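The LinkedIn funnel math above works out like this (a sketch using the article's numbers; acceptance rate is the only input):

```python
def linkedin_connections(invites: int, acceptance_rate: float) -> int:
    """Estimate connections from one wave of invites."""
    return round(invites * acceptance_rate)

# 200 invites at the 15-20% first-wave acceptance range
low = linkedin_connections(200, 0.15)   # -> 30
high = linkedin_connections(200, 0.20)  # -> 40
```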

Once the segment works — scaling comes down to data and execution

When a segment proves itself, scaling becomes purely operational — but only if your data layer can keep up.

We’ve invested heavily in this layer. For every project, we build segment-specific databases — pulling from multiple sources:

  • LinkedIn

  • business directories

  • niche listings

  • open datasets

All of it filtered by geography, roles, industries, and different buying signals.

This allows us to spin up fresh, highly targeted lists for every new test — without having to rebuild databases manually for each batch.


Once this data layer is stable, scaling becomes a matter of execution:

1️⃣ More inboxes → more daily volume
2️⃣ More valid contacts → broader reach
3️⃣ More SDRs → faster inbound processing

This system removes the biggest trap most teams fall into: over-optimizing copy before they even know who converts.

In outbound, segment beats copy — every time.

