Day 9: Scraped 35 Companies (Brute Force), Tested OK

After yesterday, my goal for today was very simple: build a basic working scraper that pulls data from web pages.

Today’s target: from the sitemaps, get the career and about page links and store them in the DB.

That’s it.

Instead of struggling with top companies whose details are available everywhere, I decided to start with the startups on my list.

So I picked 35 companies from my list and found their sitemaps easily.

The Strategy

  1. CSV Parsing: Some (3-4) companies didn’t have sitemaps, so I had to supply direct links for them.
  2. Sitemap Filtering: Instead of saving every URL, I store only the career and about page URLs. (The about page will be needed later when writing emails.)
  3. MongoDB Integration: Store the data in this structure inside MongoDB:
    • "Company Name": { "URL": site_url, "Sitemap URL": sitemap_url, "Career Pages": [career_pages], "About Page": about_page }
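
To make steps 2 and 3 concrete, here is a minimal sketch of how the filtering and the document shape could look. This is not the code from the repo: the function names (`parse_sitemap`, `filter_pages`, `build_document`) and the keyword lists are my own assumptions for illustration, and it only handles a plain `<urlset>` sitemap in the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Assumed keyword lists; the real filter may use different patterns.
CAREER_KEYWORDS = ("career", "jobs")
ABOUT_KEYWORDS = ("about",)

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def filter_pages(urls: List[str]) -> Tuple[List[str], Optional[str]]:
    """Keep only career URLs (all matches) and the first about URL."""
    career_pages: List[str] = []
    about_page: Optional[str] = None
    for url in urls:
        path = urlparse(url).path.lower()
        if any(keyword in path for keyword in CAREER_KEYWORDS):
            career_pages.append(url)
        elif about_page is None and any(keyword in path for keyword in ABOUT_KEYWORDS):
            about_page = url
    return career_pages, about_page


def build_document(name: str, site_url: str, sitemap_url: str, urls: List[str]) -> dict:
    """Build one record in the structure listed in step 3."""
    career_pages, about_page = filter_pages(urls)
    return {
        name: {
            "URL": site_url,
            "Sitemap URL": sitemap_url,
            "Career Pages": career_pages,
            "About Page": about_page,
        }
    }
```

The dictionary returned by `build_document` can then be handed to whatever MongoDB client is in use. With pymongo, for example, that would be a call like `collection.insert_one(doc)`. Matching on the URL path only (not the full URL) avoids false hits when a domain name itself contains a word like "jobs".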

I later added docstrings and type hints with the help of AI.

The code is updated on GitHub: https://github.com/maitry4/HuntKit

Today was easier. I learnt that a “works first, scales next” approach gets you to a scalable system faster.

A blurry sneak peek of my database:
