Scraping Career Pages
This is a summary of my day 5 research. (I gave my rough notes to AI for summarization.)
Good news: Scraping company career pages is generally lower risk than scraping job aggregators. Here’s why:
- Public-facing data – Career pages are intentionally public
- Legitimate purpose – Helping job seekers find opportunities is generally viewed favorably
- No competitive harm – You’re not competing with the companies whose pages you’re scraping
- Career pages often welcome traffic – Companies want people to see their job listings
Why career pages specifically?
Job aggregators like Indeed and LinkedIn actively prohibit scraping in their Terms of Service and have legal teams enforcing it; HuntKit would be a direct competitor.
Company career pages are different: there’s no TOS to violate, no authentication to bypass, and you’re actually helping them by driving qualified candidates their way.
The legal calculus is completely different.
robots.txt Guidance
You should definitely respect it. Here’s how:
- Check robots.txt before scraping each domain
- Identify yourself properly with a clear User-Agent, e.g. "HuntKit/1.0 (+https://yoursite.com/about; [email protected])" – this shows you’re legitimate, not malicious
- Honor Crawl-delay: if robots.txt says Crawl-delay: 10, wait 10 seconds between requests; even without it, add delays (2-5 seconds minimum)
- What if you’re blocked?
- Some companies disallow all bots – respect that
- Focus on the companies that allow it (you’ll still have hundreds)
- Can always email companies directly: “We’re aggregating jobs to help candidates – can we scrape your career page?”
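The checks above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not HuntKit's actual code; the "HuntKit/1.0" agent string, the `check_robots` helper, and the sample rules are assumptions:

```python
# Minimal sketch: check whether a URL is allowed and what Crawl-delay applies,
# using Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

def check_robots(robots_txt: str, url: str, user_agent: str = "HuntKit/1.0"):
    """Return (allowed, delay_seconds) for a URL, honoring Crawl-delay."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse pre-fetched robots.txt text
    allowed = rp.can_fetch(user_agent, url)
    delay = rp.crawl_delay(user_agent)
    return allowed, (delay if delay is not None else 3.0)  # polite default gap

# Illustrative robots.txt content (not from any real company):
SAMPLE = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""
```

In a real crawler you would fetch each domain's /robots.txt once (e.g. via `RobotFileParser.set_url()` plus `read()`) and cache the parsed result per domain rather than re-downloading it per request.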
Practical Risk Mitigation
Do:
- ✅ Scrape during off-peak hours
- ✅ Cache results (don’t re-scrape unchanged pages)
- ✅ Have a clear privacy policy on your site
- ✅ Provide an opt-out mechanism for companies
- ✅ Link back to original job postings (drive traffic to them)
Don’t:
- ❌ Scrape behind login walls
- ❌ Republish entire job descriptions (use excerpts + links)
- ❌ Scrape at aggressive rates
- ❌ Ignore cease & desist requests
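The "cache results" and "don't scrape at aggressive rates" points can be combined in one loop: hash each page body and skip re-processing anything unchanged since the last run. A hypothetical sketch; `scrape_politely`, its arguments, and the sha256-digest cache are assumptions, not HuntKit's design:

```python
# Sketch: only re-process pages whose content changed, with a polite pause
# between fetches. `pages` maps URL -> fetched HTML; `cache` persists digests.
import hashlib
import time

def scrape_politely(pages: dict, cache: dict, delay: float = 3.0) -> list:
    """Return the URLs whose content changed since the last run."""
    changed = []
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if cache.get(url) == digest:
            continue  # unchanged since last run: skip re-processing
        cache[url] = digest
        changed.append(url)
        time.sleep(delay)  # polite gap between requests to the same site
    return changed
```

Persisting `cache` to disk (or a small database) between runs is what makes the "don't re-scrape unchanged pages" rule actually stick.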
For Your 500-Company List
Build in compliance from day 1:
- Check robots.txt for all 500 companies
- Filter out those that explicitly disallow crawlers
- Keep a “robots.txt respect” layer in your architecture
- Log which companies you can/can’t scrape
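The "respect layer" described above could look like the following filter: given pre-computed robots.txt decisions, keep scrapeable companies and log the rest. The `filter_scrapable` helper and the dict shapes are hypothetical, shown only to make the architecture concrete:

```python
# Sketch of a "robots.txt respect" layer: filter the company list against
# already-fetched robots.txt decisions, logging every skip for the audit trail.
import logging

def filter_scrapable(companies: list, allowed: dict) -> list:
    """Keep companies whose career page may be scraped; log the others."""
    scrapable = []
    for company in companies:
        url = company["careers_url"]
        if allowed.get(url, False):  # unknown -> default to NOT scraping
            scrapable.append(company)
        else:
            logging.info("Skipping %s: robots.txt disallows or unchecked",
                         company["name"])
    return scrapable
```

Defaulting unknown domains to "do not scrape" is the conservative choice: a company only enters the scrape set after its robots.txt has been fetched and explicitly permits the crawler.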
Summing Up
Most companies with career pages won’t care if you:
- Scrape respectfully (slow rate)
- Link back to them
- Help drive qualified candidates
The ones who do care will either:
- Block you in robots.txt (easy to detect)
- Send a cease & desist (rare for non-commercial student projects)
Company Search
I could only finish the 100-company list today.
I’ll start building tomorrow.
That’s it for today.