Build Story
1. Starting From Zero
I didn't learn Python by reading a book. I learned it by having a problem that needed solving. I needed a tool that could pull technical documents from hundreds of websites, save them in an organized way, and never download the same thing twice. A real tool, not a tutorial exercise.
The early wins were small. Getting a script to read a JSON file. Getting it to make an HTTP request and save the response to disk. Each one felt like a real step forward.
2. Files, Paths, and Being Defensive
Working with files and paths was the first real lesson. Python's pathlib module became essential. Learning the difference between Path.mkdir(parents=True, exist_ok=True) and just assuming a folder exists — then watching your script crash when it doesn't — teaches you to be defensive fast.
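The defensive pattern looks roughly like this — the directory layout and filenames here are hypothetical stand-ins, and a throwaway temp directory keeps the sketch self-contained:

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory; the nested layout below is a
# stand-in for whatever structure the scraper actually uses.
base = Path(tempfile.mkdtemp())
out_dir = base / "docs" / "manufacturer_a"  # hypothetical layout

# Defensive: create missing parents, and don't crash if it already exists.
out_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)  # idempotent: safe to call again

target = out_dir / "datasheet.html"
target.write_text("<html>stub</html>")
```

Without `parents=True`, the first `mkdir` fails because `docs` doesn't exist yet; without `exist_ok=True`, the second call raises `FileExistsError`. Both flags together make the call safe to run on every start.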
JSON was the backbone of the whole project. The database was JSON, the logs were JSON, the state tracking was JSON. Error handling with try/except stopped being something I added grudgingly and started being something I reached for automatically.
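A minimal sketch of that reach-for-it-automatically pattern — the file name and state shape are made up for illustration, but the structure (read JSON, fall back on missing or corrupt files) is the one described above:

```python
import json
import tempfile
from pathlib import Path

def load_state(path: Path, default=None):
    """Read a JSON state file; fall back if it's missing or corrupt."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {} if default is None else default

def save_state(path: Path, data) -> None:
    path.write_text(json.dumps(data, indent=2))

tmp = Path(tempfile.mkdtemp())
state_file = tmp / "state.json"            # hypothetical state file

missing = load_state(state_file)           # no file yet -> {}
save_state(state_file, {"seen": 3})
loaded = load_state(state_file)            # round-trips cleanly

state_file.write_text("{not json")         # simulate a corrupt file
corrupt = load_state(state_file, default={"seen": 0})
```

Catching only `FileNotFoundError` and `json.JSONDecodeError` (rather than a bare `except`) means genuine bugs still surface instead of being silently swallowed.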
3. The 10-Method Waterfall
HTTP requests were where things got interesting. The web is not simple. Sites have bot detection. Some require JavaScript to render. Some block certain user-agent strings.
I built a 10-method fallback waterfall: plain requests first, then httpx for HTTP/2 support, then Playwright for JavaScript-heavy pages, then Selenium, then raw curl via subprocess as a last resort. If one method fails, try the next. Building that taught me Python's module ecosystem in a very practical way — what each tool actually does and when to reach for a heavier one.
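The control flow of the waterfall can be sketched generically. The stub fetchers below stand in for the real libraries (requests, httpx, Playwright, and so on) so the example runs without any of them installed; only the try-each-in-order logic is the point:

```python
def fetch_with_fallbacks(url, methods):
    """Try each fetch method in order; return the first success.
    `methods` is an ordered list of callables, cheapest first."""
    errors = []
    for method in methods:
        try:
            return method(url), method.__name__
        except Exception as exc:
            errors.append((method.__name__, exc))
    raise RuntimeError(f"all {len(methods)} methods failed: {errors}")

# Stubs standing in for the real libraries in the chain.
def plain_requests(url):
    raise ConnectionError("blocked by bot detection")  # simulated failure

def httpx_http2(url):
    raise TimeoutError("timed out")                    # simulated failure

def playwright_render(url):
    return "<html>rendered page</html>"                # simulated success

body, used = fetch_with_fallbacks(
    "https://example.com/datasheet",                   # hypothetical URL
    [plain_requests, httpx_http2, playwright_render],
)
```

Ordering the list cheapest-first means the heavyweight browser automation only runs when the lighter HTTP clients have already struck out.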
4. SHA256 Deduplication
If you run the scraper twice, you don't want to download the same file twice. The solution was to hash every piece of content before saving it and keep a registry of hashes seen. If the hash already exists, skip it.
I'd never used a hash function in any language before. Learning that hashlib.sha256(content).hexdigest() gives you a reliable fingerprint for any piece of data — and that a dictionary makes a fast lookup table — felt like real computer science knowledge, not just Python trivia.
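The whole dedup scheme fits in a few lines. This is a sketch of the pattern, not the project's actual code; the content and destination paths are invented:

```python
import hashlib

seen: dict[str, str] = {}  # sha256 hex digest -> where we first saved it

def register(content: bytes, dest: str) -> bool:
    """Return True if content is new (and record it), False if seen before."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return False          # duplicate: skip the save
    seen[digest] = dest
    return True

first  = register(b"datasheet bytes", "docs/a.pdf")  # new content
second = register(b"datasheet bytes", "docs/b.pdf")  # same bytes, new name
```

Hashing the content rather than the URL is what makes this robust: the same document served from two different links still collapses to one entry.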
5. Classes, PDFs, and Finishing It
PDF is a binary format that makes extracting text surprisingly difficult. I worked through pdfminer.six for text extraction, pymupdf for faster processing, and pytesseract for OCR on scanned documents. Checking raw_bytes[:4] == b"%PDF" to detect a PDF by magic bytes rather than file extension felt like a real programmer move.
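The magic-bytes check is tiny, which is part of its appeal. The sample byte strings here are fabricated for illustration:

```python
def looks_like_pdf(raw_bytes: bytes) -> bool:
    # Real PDFs start with the magic bytes "%PDF" ("%PDF-1.7" etc.),
    # regardless of what the file extension or URL claims.
    return raw_bytes[:4] == b"%PDF"

pdf_sample  = b"%PDF-1.7\n%..."        # a genuine PDF header
html_sample = b"<!DOCTYPE html>..."    # an error page served as ".pdf"

is_pdf  = looks_like_pdf(pdf_sample)
not_pdf = looks_like_pdf(html_sample)
```

The second case is the one that bites in practice: a server returning an HTML error page for a .pdf URL will happily be saved as a broken "PDF" unless you check the bytes.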
Class design was the biggest conceptual jump. The scraper grew into HashRegistry, ScrapeLog, UrlProcessor, and ManufacturerCrawler. Learning when to use a class versus a function clicked not by reading about it but by needing it to keep the code manageable.
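The text doesn't show HashRegistry's actual interface, but a plausible sketch makes the function-versus-class point concrete: the class earns its keep by bundling the hash set together with the JSON persistence that lets a restarted run remember what it already saved.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class HashRegistry:
    """Sketch of a hash registry: tracks content fingerprints and
    persists them as JSON so state survives a restart. One plausible
    shape, not the project's actual class."""

    def __init__(self, path: Path):
        self.path = path
        try:
            self._seen = set(json.loads(path.read_text()))
        except (FileNotFoundError, json.JSONDecodeError):
            self._seen = set()   # fresh start or unreadable state file

    def is_new(self, content: bytes) -> bool:
        """Record content's hash; return True only the first time it's seen."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

    def save(self) -> None:
        self.path.write_text(json.dumps(sorted(self._seen)))

registry_file = Path(tempfile.mkdtemp()) / "hashes.json"
reg = HashRegistry(registry_file)
fresh = reg.is_new(b"doc")           # first sighting
dup = reg.is_new(b"doc")             # same content again
reg.save()

reloaded = HashRegistry(registry_file)  # simulate a restart
still_dup = reloaded.is_new(b"doc")     # state survived the restart
```

As a bare function, the hash set and its file path would have to be threaded through every call site; owning both in one object is exactly the "needing it to keep the code manageable" moment.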
The scraper works. It handles HTML, PDFs, binary files, redirects, failures, retries, and restarts gracefully. It runs on a Raspberry Pi. None of that was obvious when I started.