Automation 2026-03-20

Paste an Amazon URL, Get a Homebox Item


python homebox scraping automation inventory

I have a Homebox instance running on my Unraid server. It tracks everything I own. Every router, every cable, every random USB hub I bought at 2 AM. The problem is actually putting things in it.

Adding a single item means: open the Amazon order, copy the product name, strip out all the SEO garbage, find the manufacturer, find the model number, download the product photo, search for the user manual PDF, then type it all into the Homebox web UI. For one item, that's five minutes. For the 30 things sitting in my "to inventory" pile, that's an evening I'll never get back.

So I built homebox-tools. Paste an Amazon URL. Get a fully populated inventory item with image, specs, price, and manuals attached. Done.

The Pipeline

Here's what happens when you run a single command:

Amazon URL → Playwright scrape → Name cleanup → Manual search → Homebox API → Upload attachments
adding a TP-Link switch to inventory
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --location "Network Closet" --tags networking

Scraping Amazon...
Searching for product manuals...
Found 2 manual(s):
  - TL-SG108E Installation Guide
  - TL-SG108E User Manual
Upload these manuals? [Y/n]: y

Creating item: TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch
Uploading product image...
Uploading manual: TL-SG108E Installation Guide (842 KB)
Uploading manual: TL-SG108E User Manual (3201 KB)

Item created: http://192.168.50.224:3100/item/abc123-...

One command. The item shows up in Homebox with manufacturer, model number, purchase price, product photo, two PDF manuals, and specs as custom fields. That five-minute manual process is now about 15 seconds.

The Amazon Name Problem

If you've ever copied a product title from Amazon, you know the pain. Amazon sellers stuff every keyword they can think of into the title. A simple network switch becomes a 200-character monster.

Here's what the name cleaner does to real product titles:

Amazon title: [2024 Upgraded] TP-Link TL-SG108E 8 Port Gigabit Easy Smart Switch, Ethernet Managed Desktop Network Internet Splitter, QoS, VLAN, IGMP Snooping, Compatible With Alexa, Ideal for Home Office - Black
After cleanup: TP-Link TL-SG108E 8 Port Gigabit Easy Smart Switch

Amazon title: APC UPS Battery Backup and Surge Protector, 1500VA, APC Back-UPS Pro (BN1500M2), Perfect for Home Office and Electronics, 10 Outlets, 2 USB Charging Ports, Designed for Gaming PCs - Black
After cleanup: APC UPS Battery Backup and Surge Protector, 1500VA, APC Back-UPS Pro (BN1500M2)

Amazon title: Anker USB C Charger, 67W 3-Port PIQ 3.0 Compact & Foldable Fast Charger for MacBook Pro/Air, Galaxy S23/S22, Dell XPS 13, Note 20/10+, iPhone 15/14/Pro, iPad Pro, Pixel, and More
After cleanup: Anker USB C Charger, 67W 3-Port PIQ 3.0 Compact & Foldable Fast Charger

The cleaner strips bracket tags like [2024 Upgraded], cuts off "for iPhone/Samsung/MacBook" tails, removes trailing color variants like "- Black", drops parenthetical junk like "(Renewed)" and "(Frustration-Free Packaging)", and title-cases all-caps brand shouting. It's a series of regex passes, each one targeting a specific type of Amazon SEO garbage.

name_cleaner.py - SEO cutoff patterns
import re

# Patterns that indicate SEO junk after them
SEO_CUTOFF_PATTERNS = [
    r"\bIdeal for\b",
    r"\bGreat for\b",
    r"\bPerfect for\b",
    r"\bDesigned for\b",
    r"\bCompatible [Ww]ith\b",
    r"\bWorks [Ww]ith\b",
    r"\bfor (?:iPhone|Samsung|Galaxy|iPad|MacBook|Laptop)\b",
    r"\bA Certified\b",
    r"\bLifetime (?:Internet |)Security\b",
]

# Trailing color removal
TRAILING_COLOR_RE = re.compile(
    r"\s*[-]\s*(?:Black|White|Silver|Gray|Grey|"
    r"Blue|Red|Green|Pink|Gold|Space Gray)\s*$",
    re.IGNORECASE,
)

Each pattern triggers a cutoff. If "Perfect for" appears in the title, everything from that point onward gets dropped. Same for "Compatible With", "Ideal for", and all the other filler phrases Amazon sellers love.
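The cutoff pass itself is a scan for the earliest match. A minimal sketch (pattern list abbreviated, and cut_seo_tail is an illustrative name, not the real function):

```python
import re

SEO_CUTOFF_PATTERNS = [
    r"\bIdeal for\b",
    r"\bPerfect for\b",
    r"\bCompatible [Ww]ith\b",
]

def cut_seo_tail(title: str) -> str:
    """Drop everything from the first SEO filler phrase onward."""
    earliest = len(title)
    for pattern in SEO_CUTOFF_PATTERNS:
        m = re.search(pattern, title)
        if m:
            earliest = min(earliest, m.start())
    # Trim the trailing comma/dash left behind by the cut
    return title[:earliest].rstrip(" ,-")

cut_seo_tail("TP-Link Switch, QoS, VLAN, Ideal for Home Office")
# -> "TP-Link Switch, QoS, VLAN"
```

Taking the earliest match across all patterns matters: a title can contain both "Compatible With" and "Ideal for", and you want to cut at whichever comes first.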

Scraping Without Getting Blocked

Amazon does not want you scraping their product pages. Fair enough. But I'm scraping my own order history, one product at a time, for personal inventory. So I built the scraper to act like a real person.

The scraper uses Playwright in headed mode (visible browser window, not headless) with the playwright-stealth library. It maintains a persistent browser session so you only log into Amazon once. Random delays between page loads. Real user agent strings. No parallel requests.

amazon_scraper.py - browser setup
import asyncio
import random

from playwright_stealth import Stealth

# Launch a real browser with persistent login session
self._browser = await self._pw.chromium.launch_persistent_context(
    user_data_dir=self._session_dir,
    headless=False,  # visible window, not headless
    viewport={"width": 1280, "height": 900},
    user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
)

# Apply stealth patches to the browser context
stealth = Stealth()
await stealth.apply_stealth_async(self._browser)

# Random delays between navigations (2-5 seconds)
async def _random_delay(self, min_s=2.0, max_s=5.0):
    await asyncio.sleep(random.uniform(min_s, max_s))

First-time setup requires a one-time interactive login where you manually sign into Amazon in the browser window. After that, the session cookies persist on disk at ~/.config/homebox-tools/amazon-session/. The scraper detects expired sessions and CAPTCHA pages, telling you to re-login instead of silently failing.
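The detection itself can be as simple as string checks on the page URL and body text. A rough sketch, with illustrative marker strings rather than the tool's actual ones:

```python
# Marker strings are illustrative examples of what Amazon's
# CAPTCHA interstitial and sign-in pages contain.
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "Type the characters you see in this image",
)

def classify_page(url: str, body_text: str) -> str:
    """Decide whether we landed on a CAPTCHA, a login wall,
    or an actual product page."""
    if any(marker in body_text for marker in CAPTCHA_MARKERS):
        return "captcha"
    if "/ap/signin" in url:
        return "login"
    return "product"
```

On anything other than "product", the tool can stop and ask you to open the browser window and log in again, rather than scraping a garbage page.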

What Gets Extracted

The scraper pulls everything useful from a product page:

Product title: #productTitle
Brand: #bylineInfo (stripped of "Visit the" / "Store")
Manufacturer + model: tech spec table rows
Price: span.a-price > span.a-offscreen, with fallbacks
Description: feature bullets, product facts, or A+ content
Product image: high-res URL from the data-old-hires attribute
Specs: all rows from the technical details table

The price parser handles edge cases: comma-separated thousands, price ranges (takes the first price), "Currently unavailable" text, and "list:" prefixes. The image URL gets cleaned too, stripping Amazon's resize parameters to get the full-resolution version.

Finding Manuals Automatically

This is the part I'm most proud of. When you buy a network switch or a UPS, the manual PDF exists somewhere online. Finding it manually means Googling, navigating support pages, downloading. The manual finder does this automatically with a three-tier search.

Tier 0: Manufacturer Direct

For known brands, the tool goes straight to the source. It has dedicated scrapers for TP-Link, ASUS, Samsung, APC/Schneider Electric, and Anker. Each scraper knows where that brand hosts their PDFs.

TP-Link: scrape /support/download/{model}/, follow document redirects to the static CDN
ASUS: parse the Nuxt SSR payload from the support page, prioritizing English user manuals
Samsung: hit the public JSON API at /us/api/support/product/detail/{model}.json
APC: try a predictable CDN URL pattern, then scrape se.com for SPD_ document refs
Anker: search the service portal for article links, extract the S3-hosted PDFs

When the manufacturer is unknown, it just tries all five scrapers. Each one returns an empty list on failure, so the cost of a miss is a single fast HTTP request.

manual_finder.py - Samsung API example
import requests

# Samsung exposes a public JSON API. No auth needed.
# Returns product metadata including download URLs.

resp = requests.get(
    f"https://www.samsung.com/us/api/support/product/"
    f"detail/{model}.json"
)
data = resp.json()

# Response structure:
# [{"downloads": {
#     "UserManual": {
#         "ENGLISH": [{"downloadUrl": "https://...pdf"}]
#     },
#     "QuickStartGuide": {...}
# }}]

# Priority: UserManual ENGLISH > QuickStartGuide > other
# Deduplicate by CttFileID to avoid downloading the same file twice

Tier 1: Internet Archive

The Internet Archive has a massive manuals collection. The tool searches it via the advanced search API with the model number and manufacturer name. No API key needed. Free, public, and surprisingly complete for older hardware.
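Assuming the standard advancedsearch.php endpoint, building such a query might look like this (the exact query shape is illustrative, not the tool's actual one):

```python
from urllib.parse import urlencode

def archive_search_url(model: str, manufacturer: str) -> str:
    """Build an Internet Archive advanced-search URL scoped to
    the manuals collection, searching by model number."""
    query = f'collection:(manuals) AND ("{model}" OR "{manufacturer} {model}")'
    params = {
        "q": query,
        "fl[]": "identifier",   # only need item identifiers back
        "rows": "10",
        "output": "json",
    }
    return "https://archive.org/advancedsearch.php?" + urlencode(params)
```

The JSON response lists matching item identifiers, and each item's file listing can then be checked for PDFs.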

Tier 2: DuckDuckGo Fallback

If the first two tiers come up empty, the tool searches DuckDuckGo's HTML endpoint (no API key, no rate limits worth worrying about). It runs three targeted queries:

  1. site:manualslib.com "{model}" user manual
  2. "{model}" filetype:pdf user manual
  3. site:{manufacturer-domain} "{model}" filetype:pdf

Every downloaded file gets validated: the magic bytes must start with %PDF-, files are capped at 20 MB each and 50 MB total, and SHA-256 hashing catches duplicates. The limit is five manuals per item, because nobody needs more than that.
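A sketch of that validation pass, assuming the candidates have already been downloaded as bytes (names and exact ordering are illustrative):

```python
import hashlib

MAX_FILE_BYTES = 20 * 1024 * 1024   # 20 MB per file
MAX_TOTAL_BYTES = 50 * 1024 * 1024  # 50 MB across all manuals
MAX_MANUALS = 5

def validate_pdfs(candidates: list[bytes]) -> list[bytes]:
    """Keep only real, non-duplicate PDFs within the size budgets."""
    kept, seen, total = [], set(), 0
    for data in candidates:
        if not data.startswith(b"%PDF-"):
            continue                      # magic bytes check
        if len(data) > MAX_FILE_BYTES:
            continue                      # per-file cap
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue                      # identical content already kept
        if total + len(data) > MAX_TOTAL_BYTES or len(kept) >= MAX_MANUALS:
            break                         # budget exhausted
        seen.add(digest)
        kept.append(data)
        total += len(data)
    return kept
```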

The Homebox API Dance

Homebox's REST API has some quirks. Creating an item is a two-phase operation.

Phase 1: POST only accepts basic fields: name, description, location ID, and tags. Everything else gets silently ignored. So you create a bare-bones item first.

Phase 2: GET then PUT. You fetch the item back to get its current state, then PUT the full object with manufacturer, model number, purchase price, and specs added as custom fields. The PUT expects flat field references (locationId, not nested location.id). Custom fields are full-replacement, so you must include all existing fields or they get deleted.

two-phase item creation
# Phase 1: Create item with basic fields only
item_id = client.create_item(
    name="TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch",
    description="- 8 10/100/1000Mbps RJ45 Ports\n- QoS, VLAN...",
    location_id=location_id,
    tag_ids=["networking-tag-id"],
)

# Phase 2: Fetch current state, merge extended fields, PUT back
item_data = client.get_item(item_id)
update_data = {
    "id": item_data["id"],
    "name": item_data["name"],
    "manufacturer": "TP-Link",
    "modelNumber": "TL-SG108E",
    "purchasePrice": 29.99,
    "purchaseFrom": "Amazon",
    # ... flatten location/tags to IDs
    "locationId": item_data["location"]["id"],
    "tagIds": [t["id"] for t in item_data["tags"]],
    # ... preserve + append custom fields
}
client.update_item(item_id, update_data)

# Upload product photo as primary attachment
client.upload_attachment(item_id, photo_path, type="photo", primary=True)

# Upload each manual PDF
for manual in manuals:
    client.upload_attachment(item_id, manual.path, type="manual")
API gotcha:

Attachment uploads require a name form field or you get a 422. The Homebox docs don't mention this. I found it by staring at a 422 response body for way too long.

The API client handles token refresh automatically. Tokens expire after 7 days (or 28 days with stayLoggedIn). On a 401, it tries a token refresh first, then falls back to a full re-login. Retries on 429 and 5xx with exponential backoff: 1s, 2s, 4s.
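The retry half of that can be sketched as a thin wrapper around a requests-style session (illustrative, not the tool's actual client; the 401 refresh path is omitted):

```python
import time

def request_with_retry(session, method, url, *, retries=3, **kwargs):
    """Retry on 429 and 5xx with exponential backoff: 1s, 2s, 4s."""
    resp = None
    for attempt in range(retries + 1):
        resp = session.request(method, url, **kwargs)
        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp                   # success or a non-retryable error
        if attempt < retries:
            time.sleep(2 ** attempt)      # 1, 2, 4 seconds
    return resp                           # retries exhausted
```

Anything that isn't a 429 or 5xx returns immediately, so a 401 can still bubble up to the token-refresh logic.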

Other Modes

Dry Run

The --dry-run flag scrapes and processes everything but stops before creating the Homebox item. Combined with --json, it gives you the structured output for scripting or just checking what the tool extracted.

dry run with JSON output
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --dry-run --json

{
  "name": "TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch",
  "manufacturer": "TP-Link",
  "model": "TL-SG108E",
  "price": 29.99,
  "description": "- 8 10/100/1000Mbps RJ45 Ports\n- QoS...",
  "image_path": "/tmp/homebox_B0BTMKSGSC.jpg",
  "specs": [
    {"name": "Item Weight", "value": "9.9 ounces", "type": "text"},
    {"name": "Connectivity", "value": "Ethernet", "type": "text"}
  ],
  "asin": "B0BTMKSGSC"
}

Folder Mode

Not everything comes from Amazon. For items I already have product files for, there's --folder mode. Point it at a directory with a product.json, some photos, and PDF manuals, and it skips the scraping step entirely.

folder mode
# Directory structure:
my-product/
  product.json    # optional, same schema as --json output
  photo.jpg       # first image found becomes the product photo
  user-manual.pdf # all PDFs become manual attachments

$ python -m homebox_tools --folder ./my-product/ --location "Office"
Creating item: My Product
Uploading product image...
Uploading manual: user-manual (1204 KB)
Item created: http://192.168.50.224:3100/item/def456-...

If there's no product.json, the tool infers the product name from the folder name (replacing underscores and hyphens with spaces), grabs the first image file, and collects all PDFs as manuals. Minimal but it works.
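A sketch of that inference, with load_folder as a hypothetical name for whatever the real code calls it:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def load_folder(folder: Path) -> dict:
    """Infer product data from a bare folder (no product.json):
    name from the folder name, first image as the photo,
    every PDF as a manual."""
    files = sorted(folder.iterdir())
    images = [p for p in files if p.suffix.lower() in IMAGE_EXTS]
    return {
        "name": folder.name.replace("_", " ").replace("-", " ").strip(),
        "photo": images[0] if images else None,
        "manuals": [p for p in files if p.suffix.lower() == ".pdf"],
    }
```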

Field Overrides

Sometimes the scraper gets it mostly right but you want to fix one field. The --overrides flag takes a JSON string and patches the product data before creation.

overriding the scraped name
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC" \
    --overrides '{"name":"Office Network Switch","price":24.99}'

Duplicate Detection

Before creating anything, the tool searches existing Homebox items by model number (or name, if there's no model). If it finds potential matches, it warns you before proceeding. No accidental double-entries for the same router because you forgot you already added it last month.

duplicate warning
$ python -m homebox_tools "https://amazon.com/dp/B0BTMKSGSC"

Scraping Amazon...
Warning: possible duplicates found: TP-Link TL-SG108E Switch
Creating item: TP-Link TL-SG108E 8-Port Gigabit Easy Smart Switch

Architecture

The whole thing is about 900 lines of Python across six modules. No framework. No database. Just a CLI tool that calls some APIs.

__main__.py: CLI entry point, argparse, orchestration
amazon_scraper.py: Playwright browser automation for Amazon
name_cleaner.py: regex-based title cleanup (no dependencies)
manual_finder.py: three-tier PDF discovery across multiple sources
homebox_client.py: REST API client with retry logic and token management
models.py: ProductData, ManualInfo, SpecField dataclasses

Dependencies are minimal: Playwright and playwright-stealth for the browser automation, requests for HTTP calls, PyYAML for config. That's it. Config lives at ~/.config/homebox-tools/config.yaml with env var overrides for CI or Docker usage.

Total time to add a new item:

About 15 seconds. Paste URL, confirm location, confirm manuals, done. The scraper takes a few seconds to load Amazon, name cleanup and manual search run in parallel with the page load, and the Homebox API calls are instant on a local network.

What I'd Build Next

The tool handles my core use case (Amazon orders going into Homebox) well enough that I actually use it. There are a few things I'd add if I had the time.

But for now, it solves the problem it was built to solve. I no longer have a pile of things waiting to be inventoried. And when I buy something new, adding it to Homebox takes less time than opening the shipping box.

The code is on GitHub. MIT licensed. Python 3.10+, make setup, and you're running.