Architecture
Today wasn’t a great day. I went through error logs and more error logs…
A Peek at My Error Logs
🔍 Processing Google...
🔍 Processing Apple...
🔍 Processing Amazon...
Request for URL https://www.google.com/slides/sitemaps.xml failed: 404 Not Found
Request for URL https://www.google.com/sheets/sitemaps.xml failed: 404 Not Found
Parsing sitemap from URL https://www.google.com/travel/flights/unsupported?ucpp=CjFodHRwczovL3d3dy5nb29nbGUuY29tL3RyYXZlbC9mbGlnaHRzL3NpdGVtYXAueG1s failed: syntax error: line 1, column 0
Parsing sitemap from URL https://business.google.com/in/business-profile/ failed: syntax error: line 1, column 0
Request for URL https://www.google.com/search/about/sitemap.xml failed: 404 Not Found
Request for URL https://www.google.com/calendar/about/sitemap.xml failed: 404 Not Found
E:\HuntKit\get_career_pages.py:49: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
"last_updated": datetime.utcnow().isoformat()
⚠️ Amazon: No matches in sitemap
🔍 Processing Microsoft...
E:\HuntKit\get_career_pages.py:37: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
"last_updated": datetime.utcnow().isoformat(),
✅ Google: Found 3 pages
🔍 Processing Meta...
Unable to gunzip response for https://www.metacareers.com/sitemap/www_metacareers_com_sitemap.xml.gz, maybe it's a non-gzipped sitemap: Unable to gunzip data: Not a gzipped file (b'<!')
Parsing sitemap from URL https://www.metacareers.com/sitemap/www_metacareers_com_sitemap.xml.gz failed: Unsupported root element 'html'.
Request for URL https://www.metacareers.com/sitemap_index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/.sitemap.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/.sitemap.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/.sitemap.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/.sitemap.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/.sitemap.xml failed: 500 Internal Server Error
Parsing sitemap from URL https://www.metacareers.com/sitemap/sitemap-index.xml failed: Unsupported root element 'html'.
Request for URL https://www.metacareers.com/sitemap-index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-news.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap-index.xml failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml.gz failed: 500 Internal Server Error
Request for URL https://www.metacareers.com/sitemap_news.xml.gz failed: 500 Internal Server Error
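The two DeprecationWarning lines in that log have a small, well-documented fix. Here is a minimal sketch of the timezone-aware replacement the warning itself suggests. The helper name `utc_timestamp` and the one-key record are assumptions for illustration, not code from get_career_pages.py.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as a timezone-aware ISO 8601 string."""
    # datetime.UTC (named in the warning) is an alias added in Python 3.11;
    # timezone.utc is the same object and also works on older Python 3 versions.
    return datetime.now(timezone.utc).isoformat()


# Hypothetical record shaped like the dict entry shown in the warning.
record = {"last_updated": utc_timestamp()}
print(record)
```

One behavioural difference to keep in mind: the aware timestamp carries a "+00:00" offset at the end, while utcnow().isoformat() produced a string with no offset, so anything that compares old and new values as plain strings may need a migration step.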
The Messy Middle: Why I’m Trading “Magic” for Bare-Metal Architecture
Consistency is easy when things are simple. After 7 days of smooth sailing with HuntKit, I hit a wall. Two days of silence followed. But today, I’m back with a new philosophy: Pure focus over fixed targets.
As I started scaling the scraper to handle the “Big Tech” giants (Google, Meta, Amazon), the “messy” reality of the web hit me. I realized I was falling into the Prompting Trap—copy-pasting AI code to handle sitemap errors without actually understanding the underlying system.
Today was about taking the reins back.
The Problem: When Sitemaps Bite Back
I thought sitemaps were simple XML files. I was wrong. When you try to scrape 250 enterprise sites, you aren’t just fetching files; you are navigating a minefield:
- Bot Protection: Meta and Amazon don’t like scripts. They serve 500 errors or “Are you a human?” HTML when they detect a bare request.
- Namespace Nightmares: Naive tag lookups come back empty because sitemap elements live in an XML namespace, and big tech sites layer their own extension namespaces (image, news, video) on top.
- The “Magic” Library Curse: Using high-level libraries made me feel productive, but when they failed, I was helpless.
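The namespace problem from the list above is easy to reproduce with the standard library. The sketch below uses a made-up sample document and hypothetical helper names (`local_name`, `extract_page_urls`). It shows a namespace-unaware lookup returning nothing, then one possible workaround that matches on the local tag name and rejects anything whose root is not a sitemap element. It is an illustration under those assumptions, not the parser HuntKit uses.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Made-up sample: a sitemap that also declares the image extension namespace.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/careers</loc>
    <image:image><image:loc>https://example.com/a.png</image:loc></image:image>
  </url>
  <url><loc>https://example.com/jobs</loc></url>
</urlset>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_page_urls(xml_bytes: bytes) -> list[str]:
    """Collect the <loc> of every <url>/<sitemap> entry, whatever the namespace."""
    root = ET.fromstring(xml_bytes)
    if local_name(root.tag) not in ("urlset", "sitemapindex"):
        # Covers the case where a site answers with an HTML page instead of XML.
        raise ValueError(f"Unsupported root element {local_name(root.tag)!r}")
    urls = []
    for entry in root:
        if local_name(entry.tag) not in ("url", "sitemap"):
            continue
        # Only direct children are checked, so nested image:loc values are skipped.
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return urls


root = ET.fromstring(SAMPLE)
naive = root.findall("url")                      # ignores the namespace
aware = root.findall(f"{{{SITEMAP_NS}}}url")     # fully qualified tag name
print(len(naive), len(aware))
print(extract_page_urls(SAMPLE))
```

Matching on the local name is deliberately lenient: it tolerates sites that declare a slightly different namespace URI, at the cost of not validating the document against the sitemap protocol.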
The Shift: Building a “Bare-Metal” Discovery Engine
Instead of asking an LLM to “fix my script,” I sat down to architect the logic myself. I moved from being a “user” to an “architect” by:
- Handling Identity: Implementing custom User-Agent headers to look like a real browser.
- Namespace Mapping: Writing my own XPath logic to handle the different ways sites define their sitemap tags.
- Async Concurrency: Using semaphores to cap the number of in-flight requests, so I’m less likely to get IP-banned while scanning dozens of sites at once.
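The identity and concurrency points above can be sketched with asyncio alone. Everything named here is an assumption for illustration: the header values, the `fetch_all` helper, and the `fake_fetch` stand-in that replaces a real HTTP client so the example runs without touching the network. The part that matters is the pattern: every request goes through one shared `asyncio.Semaphore`, and failures are returned instead of cancelling the whole batch.

```python
import asyncio

# Assumption: an illustrative browser-like header set, not values from HuntKit.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "application/xml,text/xml;q=0.9,*/*;q=0.8",
}


async def fetch_all(urls, fetch, max_concurrent=5):
    """Call fetch(url, headers) for every URL, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:  # waits here while max_concurrent calls are running
            try:
                return url, await fetch(url, BROWSER_HEADERS)
            except Exception as exc:  # keep one bad site from sinking the batch
                return url, exc

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


# Stand-in fetcher that records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url, headers):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return f"<urlset/> from {url}"


demo_urls = [f"https://example.com/sitemap-{i}.xml" for i in range(12)]
results = asyncio.run(fetch_all(demo_urls, fake_fetch, max_concurrent=3))
print(len(results), "responses, peak concurrency:", stats["peak"])
```

A semaphore only caps how many requests run at once. It does not space them out over time, so a per-host delay or rate limit would be a separate addition.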
The Lesson: The 1-Hour Rule
My morning update on X said it best: No more targets, just 1 hour of pure focus.
When you focus on the time and the depth of the work rather than just “shipping a feature,” the quality of your system design skyrockets.
I’m no longer just building a scraper; I’m building a robust data pipeline that I actually understand.
Next Step: Handling the massive “Sitemap Indexes” of companies like Google without blowing up my memory.
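One common way to approach that next step is to stream the index instead of loading it whole. The sketch below is a possible direction rather than a finished design: `iter_child_sitemaps` is a hypothetical helper, the sample index is made up, and it assumes the input is a sitemap index whose entries are `<sitemap>` elements containing a `<loc>`.

```python
import io
import xml.etree.ElementTree as ET


def iter_child_sitemaps(xml_stream):
    """Yield each child sitemap URL from a file-like sitemap index, one at a time."""
    # iterparse reads the stream incrementally instead of building the full tree.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
        if name == "loc" and elem.text:
            yield elem.text.strip()
        elif name == "sitemap":
            # Free this entry's children once it has been fully read.
            # The root still keeps the emptied element itself.
            elem.clear()


# Made-up sample index for demonstration.
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-a.xml</loc></sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-b.xml</loc>
    <lastmod>2024-01-01</lastmod>
  </sitemap>
</sitemapindex>"""

children = list(iter_child_sitemaps(io.BytesIO(SAMPLE_INDEX)))
print(children)
```

Because the function is a generator, the caller can hand each child URL to the fetcher as it arrives and never hold the full list in memory. The `list(...)` call here is only for the demonstration.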