Scraping Career Pages
This is a summary of my day 5 research. (I gave my rough notes to AI for summarization.)
Good news: Scraping company career pages is generally lower risk than scraping job aggregators. Here’s why:
- Public-facing data – Career pages are intentionally public
- Legitimate purpose – Helping job seekers find opportunities is generally viewed favorably
- No competitive harm – You’re not competing with the companies whose pages you’re scraping
- Career pages often welcome traffic – Companies want people to see their job listings
Why career pages specifically?
Job aggregators like Indeed and LinkedIn actively prohibit scraping in their Terms of Service and have legal teams enforcing it; HuntKit would be a direct competitor.
Company career pages are different: there’s no TOS to violate, no authentication to bypass, and you’re actually helping them by driving qualified candidates their way.
The legal calculus is completely different.
robots.txt Guidance
You should definitely respect it. Here’s how:
- Check robots.txt before scraping each domain
- Identify yourself properly with a clear User-Agent, e.g. "HuntKit/1.0 (+https://yoursite.com/about; [email protected])" – this shows you’re legitimate, not malicious
- Honor Crawl-delay: if robots.txt says Crawl-delay: 10, wait 10 seconds between requests; even without it, add delays (2-5 seconds minimum)
- What if you’re blocked?
- Some companies disallow all bots – respect that
- Focus on the companies that allow it (you’ll still have hundreds)
- Can always email companies directly: “We’re aggregating jobs to help candidates – can we scrape your career page?”
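The checks above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not HuntKit's actual code; the "HuntKit/1.0" agent string, the `check_robots` helper, and the sample rules are assumptions:

```python
# Minimal sketch: check whether a URL is allowed and what Crawl-delay applies,
# using Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

def check_robots(robots_txt: str, url: str, user_agent: str = "HuntKit/1.0"):
    """Return (allowed, delay_seconds) for a URL, honoring Crawl-delay."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse pre-fetched robots.txt text
    allowed = rp.can_fetch(user_agent, url)
    delay = rp.crawl_delay(user_agent)
    return allowed, (delay if delay is not None else 3.0)  # polite default gap

# Illustrative robots.txt content (not from any real company):
SAMPLE = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""
```

In a real crawler you would fetch each domain's /robots.txt once (e.g. via `RobotFileParser.set_url()` plus `read()`) and cache the parsed result per domain rather than re-downloading it per request.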
Practical Risk Mitigation
Do:
- ✅ Scrape during off-peak hours
- ✅ Cache results (don’t re-scrape unchanged pages)
- ✅ Have a clear privacy policy on your site
- ✅ Provide an opt-out mechanism for companies
- ✅ Link back to original job postings (drive traffic to them)
Don’t:
- ❌ Scrape behind login walls
- ❌ Republish entire job descriptions (use excerpts + links)
- ❌ Scrape at aggressive rates
- ❌ Ignore cease & desist requests
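The "cache results" and "don't scrape at aggressive rates" points can be combined in one loop: hash each page body and skip re-processing anything unchanged since the last run. A hypothetical sketch; `scrape_politely`, its arguments, and the sha256-digest cache are assumptions, not HuntKit's design:

```python
# Sketch: only re-process pages whose content changed, with a polite pause
# between fetches. `pages` maps URL -> fetched HTML; `cache` persists digests.
import hashlib
import time

def scrape_politely(pages: dict, cache: dict, delay: float = 3.0) -> list:
    """Return the URLs whose content changed since the last run."""
    changed = []
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if cache.get(url) == digest:
            continue  # unchanged since last run: skip re-processing
        cache[url] = digest
        changed.append(url)
        time.sleep(delay)  # polite gap between requests to the same site
    return changed
```

Persisting `cache` to disk (or a small database) between runs is what makes the "don't re-scrape unchanged pages" rule actually stick.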
For Your 500-Company List
Build in compliance from day 1:
- Check robots.txt for all 500 companies
- Filter out those that explicitly disallow crawlers
- Keep a “robots.txt respect” layer in your architecture
- Log which companies you can/can’t scrape
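The "respect layer" described above could look like the following filter: given pre-computed robots.txt decisions, keep scrapeable companies and log the rest. The `filter_scrapable` helper and the dict shapes are hypothetical, shown only to make the architecture concrete:

```python
# Sketch of a "robots.txt respect" layer: filter the company list against
# already-fetched robots.txt decisions, logging every skip for the audit trail.
import logging

def filter_scrapable(companies: list, allowed: dict) -> list:
    """Keep companies whose career page may be scraped; log the others."""
    scrapable = []
    for company in companies:
        url = company["careers_url"]
        if allowed.get(url, False):  # unknown -> default to NOT scraping
            scrapable.append(company)
        else:
            logging.info("Skipping %s: robots.txt disallows or unchecked",
                         company["name"])
    return scrapable
```

Defaulting unknown domains to "do not scrape" is the conservative choice: a company only enters the scrape set after its robots.txt has been fetched and explicitly permits the crawler.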
Summing Up
Most companies with career pages won’t care if you:
- Scrape respectfully (slow rate)
- Link back to them
- Help drive qualified candidates
The ones who do care will either:
- Block you in robots.txt (easy to detect)
- Send a cease & desist (rare for non-commercial student projects)
Company Search
I could only finish the 100-company list today.
I’ll start building tomorrow.
That’s it for today.