I have a Homebox instance running on my Unraid server. It tracks everything I own. Every router, every cable, every random USB hub I bought at 2 AM. The problem is actually putting things in it.
Adding a single item means: open the Amazon order, copy the product name, strip out all the SEO garbage, find the manufacturer, find the model number, download the product photo, search for the user manual PDF, then type it all into the Homebox web UI. For one item, that's five minutes. For the 30 things sitting in my "to inventory" pile, that's an evening I'll never get back.
So I built homebox-tools. Paste an Amazon URL. Get a fully populated inventory item with image, specs, price, and manuals attached. Done.
## The Pipeline
Here's what happens when you run a single command:
```
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --location "Network Closet" --tags networking
Scraping Amazon...
Searching for product manuals...
Found 2 manual(s):
  - TL-SG108E Installation Guide
  - TL-SG108E User Manual
Upload these manuals? [Y/n]: y
Creating item: TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch
Uploading product image...
Uploading manual: TL-SG108E Installation Guide (842 KB)
Uploading manual: TL-SG108E User Manual (3201 KB)
Item created: http://192.168.50.224:3100/item/abc123-...
```
One command. The item shows up in Homebox with manufacturer, model number, purchase price, product photo, two PDF manuals, and specs as custom fields. That five-minute manual process is now about 15 seconds.
## The Amazon Name Problem
If you've ever copied a product title from Amazon, you know the pain. Amazon sellers stuff every keyword they can think of into the title. A simple network switch becomes a 200-character monster.
Here's what the name cleaner does to real product titles:
```
[2024 Upgraded] TP-Link TL-SG108E 8 Port Gigabit Easy Smart Switch, Ethernet Managed Desktop Network Internet Splitter, QoS, VLAN, IGMP Snooping, Compatible With Alexa, Ideal for Home Office - Black
→ TP-Link TL-SG108E 8 Port Gigabit Easy Smart Switch

APC UPS Battery Backup and Surge Protector, 1500VA, APC Back-UPS Pro (BN1500M2), Perfect for Home Office and Electronics, 10 Outlets, 2 USB Charging Ports, Designed for Gaming PCs - Black
→ APC UPS Battery Backup and Surge Protector, 1500VA, APC Back-UPS Pro (BN1500M2)

Anker USB C Charger, 67W 3-Port PIQ 3.0 Compact & Foldable Fast Charger for MacBook Pro/Air, Galaxy S23/S22, Dell XPS 13, Note 20/10+, iPhone 15/14/Pro, iPad Pro, Pixel, and More
→ Anker USB C Charger, 67W 3-Port PIQ 3.0 Compact & Foldable Fast Charger
```
The cleaner strips bracket tags like [2024 Upgraded], cuts off "for iPhone/Samsung/MacBook" tails, removes trailing color variants like "- Black", drops parenthetical junk like "(Renewed)" and "(Frustration-Free Packaging)", and title-cases all-caps brand shouting. It's a series of regex passes, each one targeting a specific type of Amazon SEO garbage.
```python
import re

# Patterns that indicate SEO junk after them
SEO_CUTOFF_PATTERNS = [
    r"\bIdeal for\b",
    r"\bGreat for\b",
    r"\bPerfect for\b",
    r"\bDesigned for\b",
    r"\bCompatible [Ww]ith\b",
    r"\bWorks [Ww]ith\b",
    r"\bfor (?:iPhone|Samsung|Galaxy|iPad|MacBook|Laptop)\b",
    r"\bA Certified\b",
    r"\bLifetime (?:Internet |)Security\b",
]

# Trailing color removal
TRAILING_COLOR_RE = re.compile(
    r"\s*[-]\s*(?:Black|White|Silver|Gray|Grey|"
    r"Blue|Red|Green|Pink|Gold|Space Gray)\s*$",
    re.IGNORECASE,
)
```
Each pattern triggers a cutoff. If "Perfect for" appears in the title, everything from that point onward gets dropped. Same for "Compatible With", "Ideal for", and all the other filler phrases Amazon sellers love.
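The cutoff pass itself fits in a few lines. Here's a minimal, hypothetical version — the function name and the trimmed-down pattern list are mine, not the tool's actual code:

```python
import re

# Trimmed pattern list for illustration; the real tool has more
CUTOFF_PATTERNS = [
    r"\bIdeal for\b",
    r"\bPerfect for\b",
    r"\bCompatible [Ww]ith\b",
]

def cut_seo_tail(title: str) -> str:
    """Truncate the title at the earliest matching filler phrase."""
    cut = len(title)
    for pattern in CUTOFF_PATTERNS:
        m = re.search(pattern, title)
        if m:
            cut = min(cut, m.start())
    # Drop trailing separators left behind by the cut
    return title[:cut].rstrip(" ,-")

print(cut_seo_tail("TP-Link Switch, QoS, Compatible With Alexa, Ideal for Home"))
# prints: TP-Link Switch, QoS
```

Taking the *earliest* match matters: a title containing both "Compatible With" and "Ideal for" should be cut at whichever appears first.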
## Scraping Without Getting Blocked
Amazon does not want you scraping their product pages. Fair enough. But I'm scraping my own order history, one product at a time, for personal inventory. So I built the scraper to act like a real person.
The scraper uses Playwright in headed mode (visible browser window, not headless) with the playwright-stealth library. It maintains a persistent browser session so you only log into Amazon once. Random delays between page loads. Real user agent strings. No parallel requests.
```python
# Launch a real browser with persistent login session
self._browser = await self._pw.chromium.launch_persistent_context(
    user_data_dir=self._session_dir,
    headless=False,  # visible window, not headless
    viewport={"width": 1280, "height": 900},
    user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
)

# Apply stealth patches to the browser context
stealth = Stealth()
await stealth.apply_stealth_async(self._browser)

# Random delays between navigations (2-5 seconds)
async def _random_delay(self, min_s=2.0, max_s=5.0):
    await asyncio.sleep(random.uniform(min_s, max_s))
```
First-time setup requires a one-time interactive login where you manually sign into Amazon in the browser window. After that, the session cookies persist on disk at ~/.config/homebox-tools/amazon-session/. The scraper detects expired sessions and CAPTCHA pages, telling you to re-login instead of silently failing.
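Detecting a bot-check page is mostly string matching. A heuristic sketch (not the tool's actual code — the marker phrases are assumptions based on what Amazon's robot-check pages typically say):

```python
def looks_like_captcha(html: str) -> bool:
    """Return True if the page text resembles an Amazon bot-check page.

    Marker strings are assumptions; the real detector may differ.
    """
    markers = (
        "enter the characters you see below",
        "type the characters you see in this image",
        "to discuss automated access to amazon data",
    )
    lower = html.lower()
    return any(m in lower for m in markers)
```

On a hit, the right move is exactly what the tool does: stop and ask the user to re-login in the visible browser window rather than retrying blindly.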
## What Gets Extracted
The scraper pulls everything useful from a product page:
| Field | Source |
|---|---|
| Product title | #productTitle |
| Brand | #bylineInfo (stripped of "Visit the" / "Store") |
| Manufacturer + Model | Tech spec table rows |
| Price | span.a-price > span.a-offscreen with fallbacks |
| Description | Feature bullets, product facts, or A+ content |
| Product image | High-res from data-old-hires attribute |
| Specs | All rows from the technical details table |
The price parser handles edge cases: comma-separated thousands, price ranges (takes the first price), "Currently unavailable" text, and "list:" prefixes. The image URL gets cleaned too, stripping Amazon's resize parameters to get the full-resolution version.
## Finding Manuals Automatically
This is the part I'm most proud of. When you buy a network switch or a UPS, the manual PDF exists somewhere online. Finding it manually means Googling, navigating support pages, downloading. The manual finder does this automatically with a three-tier search.
### Tier 0: Manufacturer Direct
For known brands, the tool goes straight to the source. It has dedicated scrapers for TP-Link, ASUS, Samsung, APC/Schneider Electric, and Anker. Each scraper knows where that brand hosts their PDFs.
| Brand | Strategy |
|---|---|
| TP-Link | Scrape /support/download/{model}/, follow document redirects to static CDN |
| ASUS | Parse Nuxt SSR payload from support page, prioritize English user manuals |
| Samsung | Hit the public JSON API at /us/api/support/product/detail/{model}.json |
| APC | Try predictable CDN URL pattern, then scrape se.com for SPD_ document refs |
| Anker | Search service portal for article links, extract S3-hosted PDFs |
When the manufacturer is unknown, it just tries all five scrapers. Each one returns an empty list on failure, so the cost of a miss is a single fast HTTP request.
```python
# Samsung exposes a public JSON API. No auth needed.
# Returns product metadata including download URLs.
resp = requests.get(
    f"https://www.samsung.com/us/api/support/product/"
    f"detail/{model}.json"
)
data = resp.json()

# Response structure:
# [{"downloads": {
#     "UserManual": {
#         "ENGLISH": [{"downloadUrl": "https://...pdf"}]
#     },
#     "QuickStartGuide": {...}
# }}]
#
# Priority: UserManual ENGLISH > QuickStartGuide > other
# Deduplicate by CttFileID to avoid downloading the same file twice
```
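The known-brand-or-try-everything dispatch can be sketched like this — `try_scrapers` is a hypothetical helper (not the tool's actual function) that takes a map of lowercase brand names to scraper callables:

```python
def try_scrapers(scrapers, manufacturer, model):
    """Tier 0 dispatch: a known brand goes straight to its scraper;
    an unknown brand tries them all. Each scraper returns a
    (possibly empty) list of PDF URLs."""
    if manufacturer and manufacturer.lower() in scrapers:
        return scrapers[manufacturer.lower()](model)
    results = []
    for scrape in scrapers.values():
        results.extend(scrape(model))  # each returns [] on a miss
    return results

# Usage with stand-in scrapers:
fake_scrapers = {
    "tp-link": lambda m: [f"https://example.com/{m}.pdf"],
    "anker": lambda m: [],
}
print(try_scrapers(fake_scrapers, "TP-Link", "TL-SG108E"))
```

Because a miss is just an empty list, falling through the whole map costs only a handful of fast HTTP requests, matching the behavior described above.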
### Tier 1: Internet Archive
The Internet Archive has a massive manuals collection. The tool searches it via the advanced search API with the model number and manufacturer name. No API key needed. Free, public, and surprisingly complete for older hardware.
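A sketch of that lookup against the Archive's public `advancedsearch.php` endpoint. Scoping to the `manuals` collection and requesting only identifiers are my assumptions about how the tool narrows its search:

```python
ARCHIVE_SEARCH = "https://archive.org/advancedsearch.php"

def build_archive_query(model: str, manufacturer: str) -> dict:
    """Query params for the Internet Archive advanced-search API."""
    return {
        "q": f'collection:manuals AND "{manufacturer}" AND "{model}"',
        "fl[]": "identifier",  # only need the item identifier
        "rows": "10",
        "output": "json",
    }

def search_archive(model: str, manufacturer: str) -> list[str]:
    """Return archive.org details-page URLs for matching items."""
    import requests  # local import keeps the query builder dependency-free
    resp = requests.get(
        ARCHIVE_SEARCH,
        params=build_archive_query(model, manufacturer),
        timeout=30,
    )
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [f"https://archive.org/details/{d['identifier']}" for d in docs]
```

No API key, no auth, and each identifier maps directly to a details page where the PDF files can be listed and downloaded.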
### Tier 2: DuckDuckGo Fallback
If the first two tiers come up empty, the tool searches DuckDuckGo's HTML endpoint (no API key, no rate limits worth worrying about). It runs three targeted queries:
```
site:manualslib.com "{model}" user manual
"{model}" filetype:pdf user manual
site:{manufacturer-domain} "{model}" filetype:pdf
```
Every downloaded file gets validated: the magic bytes must start with %PDF-, files are capped at 20 MB each and 50 MB total, and SHA-256 hashing catches duplicates. The cap is five manuals per item, because nobody needs more than that.
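The validation pass can be sketched in a few lines — a hypothetical version under the limits described above:

```python
import hashlib

MAX_FILE_BYTES = 20 * 1024 * 1024   # 20 MB per file
MAX_TOTAL_BYTES = 50 * 1024 * 1024  # 50 MB per item
MAX_MANUALS = 5                     # enforced by the caller

def validate_pdf(data: bytes, seen_hashes: set, total_so_far: int) -> bool:
    """Accept a downloaded file only if it is a real PDF, within the
    size budget, and not a duplicate of one we already have."""
    if not data.startswith(b"%PDF-"):
        return False  # not a PDF, whatever the URL claimed
    if len(data) > MAX_FILE_BYTES or total_so_far + len(data) > MAX_TOTAL_BYTES:
        return False
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return False  # same file served from two mirrors
    seen_hashes.add(digest)
    return True
```

The magic-byte check matters more than it looks: search results frequently point at HTML landing pages that merely *link* to the PDF, and those should never end up attached to an item.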
## The Homebox API Dance
Homebox's REST API has some quirks. Creating an item is a two-phase operation.
Phase 1: POST only accepts basic fields: name, description, location ID, and tags. Everything else gets silently ignored. So you create a bare-bones item first.
Phase 2: GET then PUT. You fetch the item back to get its current state, then PUT the full object with manufacturer, model number, purchase price, and specs added as custom fields. The PUT expects flat field references (locationId, not nested location.id). Custom fields are full-replacement, so you must include all existing fields or they get deleted.
```python
# Phase 1: Create item with basic fields only
item_id = client.create_item(
    name="TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch",
    description="- 8 10/100/1000Mbps RJ45 Ports\n- QoS, VLAN...",
    location_id=location_id,
    tag_ids=["networking-tag-id"],
)

# Phase 2: Fetch current state, merge extended fields, PUT back
item_data = client.get_item(item_id)
update_data = {
    "id": item_data["id"],
    "name": item_data["name"],
    "manufacturer": "TP-Link",
    "modelNumber": "TL-SG108E",
    "purchasePrice": 29.99,
    "purchaseFrom": "Amazon",
    # ... flatten location/tags to IDs
    "locationId": item_data["location"]["id"],
    "tagIds": [t["id"] for t in item_data["tags"]],
    # ... preserve + append custom fields
}
client.update_item(item_id, update_data)

# Upload product photo as primary attachment
client.upload_attachment(item_id, photo_path, type="photo", primary=True)

# Upload each manual PDF
for manual in manuals:
    client.upload_attachment(item_id, manual.path, type="manual")
```
Attachment uploads require a name form field or you get a 422. The Homebox docs don't mention this. I found it by staring at a 422 response body for way too long.
The API client handles token refresh automatically. Tokens expire after 7 days (or 28 days with stayLoggedIn). On a 401, it tries a token refresh first, then falls back to a full re-login. Retries on 429 and 5xx with exponential backoff: 1s, 2s, 4s.
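The backoff loop is simple enough to sketch. A hypothetical version of just the retry policy (the real client also folds in the 401 refresh logic):

```python
import time

def request_with_retry(send, max_retries: int = 3):
    """Retry on 429 and 5xx with exponential backoff (1s, 2s, 4s).

    `send` is any zero-argument callable returning an object with
    a `status_code` attribute, e.g. a bound requests call.
    """
    for attempt in range(max_retries + 1):
        resp = send()
        if resp.status_code != 429 and resp.status_code < 500:
            return resp  # success, or a 4xx not worth retrying
        if attempt < max_retries:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
    return resp  # exhausted retries; hand back the last response
```

Note that other 4xx codes return immediately: a 404 or 422 will never succeed on retry, so backing off would just waste time.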
## Other Modes
### Dry Run
The --dry-run flag scrapes and processes everything but stops before creating the Homebox item. Combined with --json, it gives you the structured output for scripting or just checking what the tool extracted.
```
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --dry-run --json
{
  "name": "TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch",
  "manufacturer": "TP-Link",
  "model": "TL-SG108E",
  "price": 29.99,
  "description": "- 8 10/100/1000Mbps RJ45 Ports\n- QoS...",
  "image_path": "/tmp/homebox_B0BTMKSGSC.jpg",
  "specs": [
    {"name": "Item Weight", "value": "9.9 ounces", "type": "text"},
    {"name": "Connectivity", "value": "Ethernet", "type": "text"}
  ],
  "asin": "B0BTMKSGSC"
}
```
### Folder Mode
Not everything comes from Amazon. For items I already have product files for, there's --folder mode. Point it at a directory with a product.json, some photos, and PDF manuals, and it skips the scraping step entirely.
```
# Directory structure:
my-product/
  product.json      # optional, same schema as --json output
  photo.jpg         # first image found becomes the product photo
  user-manual.pdf   # all PDFs become manual attachments

$ python -m homebox_tools --folder ./my-product/ --location "Office"
Creating item: My Product
Uploading product image...
Uploading manual: user-manual (1204 KB)
Item created: http://192.168.50.224:3100/item/def456-...
```
If there's no product.json, the tool infers the product name from the folder name (replacing underscores and hyphens with spaces), grabs the first image file, and collects all PDFs as manuals. Minimal but it works.
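That inference step can be sketched as follows — function name and extension list are assumptions:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def infer_from_folder(folder: Path):
    """Fallback when there is no product.json: name from the folder
    name, first image as the photo, every PDF as a manual."""
    name = folder.name.replace("_", " ").replace("-", " ").strip()
    files = sorted(folder.iterdir())
    images = [p for p in files if p.suffix.lower() in IMAGE_EXTS]
    pdfs = [p for p in files if p.suffix.lower() == ".pdf"]
    return name, (images[0] if images else None), pdfs
```

Sorting the directory listing first keeps "first image" deterministic across runs, rather than depending on filesystem ordering.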
### Field Overrides
Sometimes the scraper gets it mostly right but you want to fix one field. The --overrides flag takes a JSON string and patches the product data before creation.
```
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --overrides '{"name":"Office Network Switch","price":24.99}'
```
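Under the hood this needs nothing fancier than a shallow dict merge. A hypothetical sketch:

```python
import json

def apply_overrides(product: dict, overrides_json: str) -> dict:
    """Parse the --overrides JSON string and patch the scraped
    product data (shallow merge; override keys win)."""
    patched = dict(product)
    patched.update(json.loads(overrides_json))
    return patched

scraped = {"name": "TP-Link TL-SG108E 8-Port...", "price": 29.99}
patched = apply_overrides(scraped, '{"name": "Office Network Switch", "price": 24.99}')
# patched now has the overridden name and price
```

A shallow merge is the right call here: overriding a top-level field like `name` shouldn't require re-specifying nested data like the specs list.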
## Duplicate Detection
Before creating anything, the tool searches existing Homebox items by model number (or name, if there's no model). If it finds potential matches, it warns you before proceeding. No accidental double-entries for the same router because you forgot you already added it last month.
```
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC"
Scraping Amazon...
Warning: possible duplicates found: TP-Link TL-SG108E Switch
Creating item: TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch
```
## Architecture
The whole thing is about 900 lines of Python across six modules. No framework. No database. Just a CLI tool that calls some APIs.
| Module | Purpose |
|---|---|
| __main__.py | CLI entry point, argparse, orchestration |
| amazon_scraper.py | Playwright browser automation for Amazon |
| name_cleaner.py | Regex-based title cleanup (no dependencies) |
| manual_finder.py | Three-tier PDF discovery across multiple sources |
| homebox_client.py | REST API client with retry logic and token management |
| models.py | ProductData, ManualInfo, SpecField dataclasses |
Dependencies are minimal: Playwright and playwright-stealth for the browser automation, requests for HTTP calls, PyYAML for config. That's it. Config lives at ~/.config/homebox-tools/config.yaml with env var overrides for CI or Docker usage.
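The env-var override lookup is one function. A sketch — the `HOMEBOX_TOOLS_` prefix and key naming are assumptions, not the tool's documented scheme:

```python
import os

def resolve_setting(config: dict, key: str, env_prefix: str = "HOMEBOX_TOOLS_"):
    """Return the value for `key`, letting an environment variable
    like HOMEBOX_TOOLS_API_URL override the YAML config value."""
    env_val = os.environ.get(env_prefix + key.upper())
    return env_val if env_val is not None else config.get(key)
```

This is the usual 12-factor pattern: the YAML file holds the defaults for interactive use, and CI or Docker injects overrides without touching the file.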
End to end, the whole flow takes about 15 seconds: paste the URL, confirm the location, confirm the manuals, done. The scraper takes a few seconds to load Amazon, name cleanup and manual search run in parallel with the page load, and the Homebox API calls are instant on a local network.
## What I'd Build Next
The tool handles my core use case (Amazon orders going into Homebox) well enough that I actually use it. A few things I'd add if I had the time:
- Batch mode: Feed it a list of Amazon URLs from an order export CSV and let it rip through them all
- More scraper sources: Best Buy, Newegg, and B&H Photo all have product pages worth scraping
- Receipt attachment: Pull the order receipt from Amazon and attach it alongside the manuals
- Barcode/UPC lookup: Scan a barcode, hit a UPC database, skip the URL entirely
But for now, it solves the problem it was built to solve. I no longer have a pile of things waiting to be inventoried. And when I buy something new, adding it to Homebox takes less time than opening the shipping box.
The code is on GitHub. MIT licensed. Python 3.10+, make setup, and you're running.