Guide · Updated June 2, 2026

When is scraping the right solution?

The short answer

Scrape when the data you need isn't in any database you can buy: a niche or data-poor market, a field no vendor sells, or freshness a static file can't match. For mainstream, well-covered B2B data, buying is faster and cheaper; by industry estimates, custom scraping is the right tool for only about 10–20% of data needs. And it's a commitment, not a one-off: maintaining a scraper costs roughly 20–30% of its build cost every year as sites and anti-bot defenses change. Scrape for the gap, buy for the bulk, and never scrape what's already a column you can purchase.

When scraping is the right call

Scraping earns its keep exactly where databases give up. If your buyers live in a niche or data-poor market the major vendors barely index, or you need a field no one sells (a specific certification, a marketplace rating, a tech signal, a hiring trigger), there's no list to buy. You have to go get it.

It's also the answer when freshness matters more than a vendor's refresh cycle, or when the market you need to cover is larger than any off-the-shelf export will hand you.

Coverage gaps in niche, regional, or net-new markets
Unique fields no data provider offers
Freshness a periodic database file can't match
A market larger than off-the-shelf exports allow

When it's the wrong tool

If the data is already a column you can buy, scraping it yourself is just rebuilding a worse, more fragile version of a vendor's database. Buying is faster and cheaper for the well-covered bulk of mainstream B2B.

Skip scraping when the list is tiny, when you need contractual accuracy guarantees or an SLA, when targets run heavy anti-bot defenses across many fragmented sites, or when you have no one to maintain it. Industry estimates put custom scraping as the right tool in only about 10–20% of data needs — the genuine gaps.
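The checklist above can be written down as a small decision helper. This is only a sketch: the `DataNeed` class, its field names, and the wording of each reason are hypothetical, chosen to mirror the conditions in this section rather than any established scoring method.

```python
from dataclasses import dataclass


@dataclass
class DataNeed:
    """One data need. All field names here are hypothetical, for illustration."""
    purchasable: bool       # already a column a vendor sells
    tiny_list: bool         # only a handful of records needed
    needs_sla: bool         # contractual accuracy guarantees or an SLA
    hostile_targets: bool   # heavy anti-bot defenses across many fragmented sites
    has_maintainer: bool    # someone is staffed to keep the scraper alive


def reasons_to_skip(need: DataNeed) -> list[str]:
    """Return every checklist item that argues against scraping."""
    checks = [
        (need.purchasable, "the data is already a column you can buy"),
        (need.tiny_list, "the list is tiny"),
        (need.needs_sla, "you need accuracy guarantees or an SLA"),
        (need.hostile_targets, "targets are fragmented and heavily defended"),
        (not need.has_maintainer, "no one is available to maintain it"),
    ]
    return [reason for applies, reason in checks if applies]


# Two made-up examples: a niche gap and a mainstream, well-covered need.
niche = DataNeed(purchasable=False, tiny_list=False, needs_sla=False,
                 hostile_targets=False, has_maintainer=True)
mainstream = DataNeed(purchasable=True, tiny_list=False, needs_sla=True,
                      hostile_targets=False, has_maintainer=False)

for label, need in [("niche", niche), ("mainstream", mainstream)]:
    reasons = reasons_to_skip(need)
    verdict = "scrape the gap" if not reasons else "skip: " + "; ".join(reasons)
    print(label, "->", verdict)
```

Treating any single reason as disqualifying is a deliberately strict assumption; in practice you might weigh the reasons against how valuable the missing data is.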

The real cost is maintenance, not the first run

The demo scraper is easy. Keeping it alive is the job. Sites change their structure, and anti-bot systems have spread sharply (providers report defenses rising on the order of 78% year over year), so a scraper that worked last month can silently break this month.

Plan for roughly 20–30% of the original build cost in maintenance every year, plus proxy and infrastructure costs that commonly run a few hundred to a few thousand dollars a month at volume. The gap shows up in results too: professionally managed scraping tends to clear 90%+ extraction success where ad-hoc DIY efforts often sit near half that.
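To make the arithmetic concrete, here is a rough sketch of the yearly upkeep range. The 20–30% maintenance rate comes from the figures above; the $40,000 build cost and the $300 and $3,000 monthly infrastructure figures are made-up inputs standing in for "a few hundred to a few thousand dollars", so substitute your own.

```python
def annual_upkeep(build_cost: float, maintenance_pct: int, monthly_infra: float) -> float:
    """Yearly cost of keeping a scraper running: maintenance plus 12 months of infrastructure."""
    maintenance = build_cost * maintenance_pct / 100
    return maintenance + 12 * monthly_infra


BUILD_COST = 40_000  # hypothetical one-time build cost, in dollars

# Low end: 20% maintenance, $300/month proxies and infrastructure (assumed).
low = annual_upkeep(BUILD_COST, 20, 300)
# High end: 30% maintenance, $3,000/month proxies and infrastructure (assumed).
high = annual_upkeep(BUILD_COST, 30, 3_000)

print(f"Yearly upkeep: ${low:,.0f} to ${high:,.0f}")
print(f"Three-year total incl. build: ${BUILD_COST + 3 * low:,.0f} to ${BUILD_COST + 3 * high:,.0f}")
```

With these illustrative inputs the upkeep lands between $11,600 and $48,000 a year, so at the high end a single year of running the scraper costs more than building it did.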

Is it legal? Settled vs. contested (2025–26)

Settled enough to act on: scraping data that's publicly accessible (no login, no password) is generally not a Computer Fraud and Abuse Act violation. The Ninth Circuit's hiQ v. LinkedIn ruling established that, and the Supreme Court's Van Buren decision reinforced that the line is whether you bypass an authentication gate. Scraping data behind a login is a different, far riskier story.

Narrower than people think: a CFAA-clean scrape can still face contract, copyright, and trespass claims. Breaking a site's terms can be a breach even when it isn't a CFAA crime, and courts enforce click-to-agree (clickwrap) terms far more readily than passive browsewrap. In 2024, Meta v. Bright Data found that scraping public data while logged out didn't breach Meta's terms, since a logged-out scraper never agreed to them — but that's a single district-court ruling on Meta's specific terms, not nationwide law.

Personal data carries its own rules: GDPR and CCPA apply to scraped personal information regardless of how public it was. None of this is legal advice — but the defensible zone is public, logged-out, non-personal data on the right side of a site's enforceable terms.

Scraping, APIs, and waterfall enrichment are complementary

This isn't scraping versus buying. The strongest setups layer all three: buy or waterfall-enrich the well-covered bulk of your market, use a provider's API where one exists, and scrape only the gap nothing else fills — then unify it into one clean, scored dataset.
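One way to picture the layering is a per-field waterfall: for each field, take the first non-empty value from the most trusted source and record where it came from. The sketch below assumes a priority order of purchased data, then a provider API, then scraped data. The source names, the sample record, and the ordering itself are illustrative assumptions, and the scoring step mentioned above is left out.

```python
from typing import Optional

# Assumed priority, highest first: purchased data, provider API, then scrape.
SOURCE_PRIORITY = ["vendor", "api", "scrape"]


def waterfall_merge(
    records_by_source: dict[str, dict[str, Optional[str]]],
) -> dict[str, dict[str, str]]:
    """For each field, keep the first non-empty value in priority order, with its source."""
    merged: dict[str, dict[str, str]] = {}
    for source in SOURCE_PRIORITY:
        for field, value in records_by_source.get(source, {}).items():
            # Skip empty values and fields a higher-priority source already filled.
            if value and field not in merged:
                merged[field] = {"value": value, "source": source}
    return merged


# Made-up record: the vendor has no certification field, so the scrape fills that gap.
company = {
    "vendor": {"name": "Acme GmbH", "employees": "120", "certification": None},
    "api": {"employees": "135", "rating": "4.6"},
    "scrape": {"certification": "ISO 9001", "rating": "4.4"},
}

for field, cell in waterfall_merge(company).items():
    print(f"{field}: {cell['value']} (from {cell['source']})")
```

Keeping the source next to each value makes it easy to see how much of the final dataset actually needed scraping, which is the gap you are paying to maintain.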

Picking the right tool for each slice of the problem, and owning the maintenance so it doesn't rot, is most of the work. It's also exactly what we do.

Common questions

Is web scraping legal?

Scraping publicly accessible data (no login) is generally not a CFAA violation under the Ninth Circuit's hiQ ruling, but data behind a password is risky, and contract, copyright, and GDPR/CCPA obligations still apply. It's situational, and this isn't legal advice.

Should we build scraping in-house or have it run for us?

Maintenance is the real cost: roughly 20–30% of the build per year as sites and anti-bot defenses change. Build in-house only if it's core and you'll staff it; otherwise managed scraping clears far higher success rates without the upkeep.

Scrape or just buy the data?

Buy the bulk that vendors already cover well; it's faster and cheaper. Scrape only the gap they don't: niche markets, unique fields, or freshness a static file can't match.
