
Outbound hypothesis testing: the exact playbook we use

Benchmarks, sample sizes, reply rates — a full breakdown of how we validate outbound segments every week.

June 17, 2025

Hey — I’m Rinat.

At Sally, we help 25+ US & EU B2B teams scale sales through structured outbound systems. Every week, I share frameworks, benchmarks, and behind-the-scenes lessons on how outbound actually works in practice.

Most outbound teams hit the same wall early:
they spend weeks perfecting copy — without knowing if they’re even targeting the right segments.

That’s where most of the waste happens.
The messaging doesn’t convert, not because it’s bad — but because they’re pitching the wrong audience.

The fix is simple:
validate the segment first, then fine-tune the message.

In this article, I’ll share the exact framework we use to run outbound hypothesis tests — including the sample sizes, reply benchmarks, and volume thresholds we track.
You can benchmark these numbers against your own process — or borrow parts of the system directly.

Whenever we start a new outbound project, we don’t build messaging first

We build hypotheses: which segments we want to test — and why we believe they might convert.

We pull this from actual sales data:

  • where deals have already closed

  • where there was strong intent in past sales cycles

  • which industries have shown early traction

We don’t guess. We start from fact.

Every Monday, our clients get an action plan:

• which segments we're testing

• which channels we’re using

• how many contacts go into each batch

For each hypothesis, we typically run ~500 contacts.

That’s the sample size where we get clean data to evaluate the segment itself.

If we want to A/B test messaging, copy, or language, we split the segment into ~250/250 batches inside that same volume.

The goal is always to isolate variables:

— first we validate the segment,

— then we refine the message.

Running tests on 50–100 contacts doesn’t produce any meaningful signal. At ~500, real patterns start to show.
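A quick back-of-the-envelope sketch of why small batches are noisy (my own illustration, not part of the Sally stack): at a 1% lead rate, 100 contacts yield only ~1 expected reply, and the statistical margin of error is bigger than the rate you're trying to measure.

```python
import math

def expected_replies(contacts: int, reply_rate: float) -> float:
    """Expected number of replies for a batch at a given reply rate."""
    return contacts * reply_rate

def margin_of_error(contacts: int, reply_rate: float) -> float:
    """95% normal-approximation margin of error on the observed rate."""
    return 1.96 * math.sqrt(reply_rate * (1 - reply_rate) / contacts)

rate = 0.01  # the 1% "ready to scale" email lead rate
for n in (50, 100, 500):
    print(n, expected_replies(n, rate), round(margin_of_error(n, rate), 4))
```

At 50 contacts the margin of error (~±2.8%) dwarfs the 1% benchmark itself; at 500 it shrinks to under ±1%, which is why real patterns only start to show around that volume.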

What happens next:

• If a segment shows steady replies → we double down.

• If a segment stays cold → we kill it.

👉 If we push 400+ contacts through and see zero replies, it’s dead simple: move on.

Benchmarks we track:

📬 Email:

• 0.5% lead rate → enough to keep testing

• 1–1.5% → ready to scale

🔗 LinkedIn:

• 3–4% reply rate per segment
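The email thresholds above boil down to a simple decision rule. Here's a sketch (the function name and the 400-contact cutoff for a verdict are my framing of the numbers in this article):

```python
def evaluate_email_segment(contacts: int, leads: int) -> str:
    """Classify an email segment against the benchmarks above.

    <0.5% lead rate after 400+ contacts -> kill,
    0.5-1% -> keep testing, >=1% -> scale.
    """
    if contacts < 400:
        return "keep testing"  # not enough volume for a verdict yet
    lead_rate = leads / contacts
    if lead_rate >= 0.01:
        return "scale"
    if lead_rate >= 0.005:
        return "keep testing"
    return "kill"

print(evaluate_email_segment(500, 6))  # 1.2% -> "scale"
print(evaluate_email_segment(500, 0))  # zero replies -> "kill"
```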

Why channels behave differently

Email gives fast feedback: opens → replies → leads.

LinkedIn takes longer:

• 15–20% first wave invite acceptance

• 200 invites → ~30–40 connections

• replies often land days or even weeks later

That’s why LinkedIn hypotheses always run for at least 2 full weeks before we evaluate.
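The LinkedIn funnel math above works out like this (a sketch using the article's numbers; acceptance rate is the only input):

```python
def linkedin_connections(invites: int, acceptance_rate: float) -> int:
    """Estimate connections from one wave of invites."""
    return round(invites * acceptance_rate)

# 200 invites at the 15-20% first-wave acceptance range
low = linkedin_connections(200, 0.15)   # -> 30
high = linkedin_connections(200, 0.20)  # -> 40
```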

Once the segment works — scaling comes down to data and execution

When a segment proves itself, scaling becomes purely operational — but only if your data layer can keep up.

We’ve invested heavily in this layer. For every project, we build segment-specific databases — pulling from multiple sources:

  • LinkedIn

  • business directories

  • niche listings

  • open datasets

All of it filtered by geography, roles, industries, and different buying signals.

This allows us to spin up fresh, highly targeted lists for every new test — without having to rebuild databases manually for each batch.


Once this data layer is stable, scaling becomes a matter of execution:

1️⃣ More inboxes → more daily volume
2️⃣ More valid contacts → broader reach
3️⃣ More SDRs → faster inbound processing

This system removes the biggest trap most teams fall into: over-optimizing copy before they even know who converts.

In outbound, segment beats copy — every time.

