Build Story
1. Starting From Zero
I didn't learn Python by reading a book. I learned it by having a problem that needed solving. I needed a tool that could pull technical documents from hundreds of websites, save them in an organized way, and never download the same thing twice. A real tool, not a tutorial exercise.
The early wins were small. Getting a script to read a JSON file. Getting it to make an HTTP request and save the response to disk. Each one felt like a real step forward.
2. Files, Paths, and Being Defensive
Working with files and paths was the first real lesson. Python's pathlib module became essential. Learning the difference between Path.mkdir(parents=True, exist_ok=True) and just assuming a folder exists — then watching your script crash when it doesn't — teaches you to be defensive fast.
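The defensive pattern looks roughly like this — the directory layout and filenames here are hypothetical stand-ins, and a throwaway temp directory keeps the sketch self-contained:

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory; the nested layout below is a
# stand-in for whatever structure the scraper actually uses.
base = Path(tempfile.mkdtemp())
out_dir = base / "docs" / "manufacturer_a"  # hypothetical layout

# Defensive: create missing parents, and don't crash if it already exists.
out_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)  # idempotent: safe to call again

target = out_dir / "datasheet.html"
target.write_text("<html>stub</html>")
```

Without `parents=True`, the first `mkdir` fails because `docs` doesn't exist yet; without `exist_ok=True`, the second call raises `FileExistsError`. Both flags together make the call safe to run on every start.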
JSON was the backbone of the whole project. The database was JSON, the logs were JSON, the state tracking was JSON. Error handling with try/except stopped being something I added grudgingly and started being something I reached for automatically.
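A minimal sketch of that reach-for-it-automatically pattern — the file name and state shape are made up for illustration, but the structure (read JSON, fall back on missing or corrupt files) is the one described above:

```python
import json
import tempfile
from pathlib import Path

def load_state(path: Path, default=None):
    """Read a JSON state file; fall back if it's missing or corrupt."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {} if default is None else default

def save_state(path: Path, data) -> None:
    path.write_text(json.dumps(data, indent=2))

tmp = Path(tempfile.mkdtemp())
state_file = tmp / "state.json"            # hypothetical state file

missing = load_state(state_file)           # no file yet -> {}
save_state(state_file, {"seen": 3})
loaded = load_state(state_file)            # round-trips cleanly

state_file.write_text("{not json")         # simulate a corrupt file
corrupt = load_state(state_file, default={"seen": 0})
```

Catching only `FileNotFoundError` and `json.JSONDecodeError` (rather than a bare `except`) means genuine bugs still surface instead of being silently swallowed.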
3. The 10-Method Waterfall
HTTP requests were where things got interesting. The web is not simple. Sites have bot detection. Some require JavaScript to render. Some block certain user-agent strings.
I built a 10-method fallback waterfall: plain requests first, then httpx for HTTP/2 support, then Playwright for JavaScript-heavy pages, then Selenium, then raw curl via subprocess as a last resort. If one method fails, try the next. Building that taught me Python's module ecosystem in a very practical way — what each tool actually does and when to reach for a heavier one.
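The control flow of the waterfall can be sketched generically. The stub fetchers below stand in for the real libraries (requests, httpx, Playwright, and so on) so the example runs without any of them installed; only the try-each-in-order logic is the point:

```python
def fetch_with_fallbacks(url, methods):
    """Try each fetch method in order; return the first success.
    `methods` is an ordered list of callables, cheapest first."""
    errors = []
    for method in methods:
        try:
            return method(url), method.__name__
        except Exception as exc:
            errors.append((method.__name__, exc))
    raise RuntimeError(f"all {len(methods)} methods failed: {errors}")

# Stubs standing in for the real libraries in the chain.
def plain_requests(url):
    raise ConnectionError("blocked by bot detection")  # simulated failure

def httpx_http2(url):
    raise TimeoutError("timed out")                    # simulated failure

def playwright_render(url):
    return "<html>rendered page</html>"                # simulated success

body, used = fetch_with_fallbacks(
    "https://example.com/datasheet",                   # hypothetical URL
    [plain_requests, httpx_http2, playwright_render],
)
```

Ordering the list cheapest-first means the heavyweight browser automation only runs when the lighter HTTP clients have already struck out.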
4. SHA256 Deduplication
If you run the scraper twice, you don't want to download the same file twice. The solution was to hash every piece of content before saving it and keep a registry of hashes seen. If the hash already exists, skip it.
I'd never used a hash function in any language before. Learning that hashlib.sha256(content).hexdigest() gives you a reliable fingerprint for any piece of data — and that a dictionary makes a fast lookup table — felt like real computer science knowledge, not just Python trivia.
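The whole dedup scheme fits in a few lines. This is a sketch of the pattern, not the project's actual code; the content and destination paths are invented:

```python
import hashlib

seen: dict[str, str] = {}  # sha256 hex digest -> where we first saved it

def register(content: bytes, dest: str) -> bool:
    """Return True if content is new (and record it), False if seen before."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return False          # duplicate: skip the save
    seen[digest] = dest
    return True

first  = register(b"datasheet bytes", "docs/a.pdf")  # new content
second = register(b"datasheet bytes", "docs/b.pdf")  # same bytes, new name
```

Hashing the content rather than the URL is what makes this robust: the same document served from two different links still collapses to one entry.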
5. Classes, PDFs, and Finishing It
PDF is a binary format that makes extracting text surprisingly difficult. I worked through pdfminer.six for text extraction, pymupdf for faster processing, and pytesseract for OCR on scanned documents. Checking raw_bytes[:4] == b"%PDF" to detect a PDF by magic bytes rather than file extension felt like a real programmer move.
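The magic-bytes check is tiny, which is part of its appeal. The sample byte strings here are fabricated for illustration:

```python
def looks_like_pdf(raw_bytes: bytes) -> bool:
    # Real PDFs start with the magic bytes "%PDF" ("%PDF-1.7" etc.),
    # regardless of what the file extension or URL claims.
    return raw_bytes[:4] == b"%PDF"

pdf_sample  = b"%PDF-1.7\n%..."        # a genuine PDF header
html_sample = b"<!DOCTYPE html>..."    # an error page served as ".pdf"

is_pdf  = looks_like_pdf(pdf_sample)
not_pdf = looks_like_pdf(html_sample)
```

The second case is the one that bites in practice: a server returning an HTML error page for a .pdf URL will happily be saved as a broken "PDF" unless you check the bytes.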
Class design was the biggest conceptual jump. The scraper grew into HashRegistry, ScrapeLog, UrlProcessor, and ManufacturerCrawler. Learning when to use a class versus a function clicked not by reading about it but by needing it to keep the code manageable.
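The text doesn't show HashRegistry's actual interface, but a plausible sketch makes the function-versus-class point concrete: the class earns its keep by bundling the hash set together with the JSON persistence that lets a restarted run remember what it already saved.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class HashRegistry:
    """Sketch of a hash registry: tracks content fingerprints and
    persists them as JSON so state survives a restart. One plausible
    shape, not the project's actual class."""

    def __init__(self, path: Path):
        self.path = path
        try:
            self._seen = set(json.loads(path.read_text()))
        except (FileNotFoundError, json.JSONDecodeError):
            self._seen = set()   # fresh start or unreadable state file

    def is_new(self, content: bytes) -> bool:
        """Record content's hash; return True only the first time it's seen."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

    def save(self) -> None:
        self.path.write_text(json.dumps(sorted(self._seen)))

registry_file = Path(tempfile.mkdtemp()) / "hashes.json"
reg = HashRegistry(registry_file)
fresh = reg.is_new(b"doc")           # first sighting
dup = reg.is_new(b"doc")             # same content again
reg.save()

reloaded = HashRegistry(registry_file)  # simulate a restart
still_dup = reloaded.is_new(b"doc")     # state survived the restart
```

As a bare function, the hash set and its file path would have to be threaded through every call site; owning both in one object is exactly the "needing it to keep the code manageable" moment.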
The scraper works. It handles HTML, PDFs, binary files, redirects, failures, retries, and restarts gracefully. It runs on a Raspberry Pi. None of that was obvious when I started.