twin-nvenc: Two NVENC Chips, Two Encodes, One GPU
I had about 400 GB of OBS screen recordings sitting on a drive. Raw captures, game footage, tutorial recordings. All of them enormous, none of them compressed. I'd been meaning to batch-encode them to AV1 for months. So I wrote a script.
Then I learned something about the RTX 4090 that changed the whole approach.
The Hidden Second Encoder
Most NVIDIA GPUs have one NVENC chip. The RTX 4090 (and the 4080) has two: two independent hardware encoders on the same card. NVIDIA doesn't advertise this much. The driver exposes both, but a single ffmpeg process only ever drives one at a time.
If you run a single ffmpeg encode, even on a 4090, only one NVENC chip is active. The other sits idle. You're leaving half your encoding hardware on the table.
There is no tool that handles this. HandBrake uses one NVENC chip. FFmpeg uses one NVENC chip. Every batch encoding script on GitHub launches sequential encodes on one chip. Nobody has built a tool that specifically targets dual NVENC and schedules work across both chips with zero idle gaps. So I built one.
The fix is dead simple: run two ffmpeg processes at the same time. The driver routes them to separate chips automatically. No device selection, no API flags, no CUDA context juggling. Just two processes.
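A minimal version of that idea fits in a few lines of standard-library Python. This is a sketch, not twin-nvenc's actual invocation: the function name, file names, and flags here are placeholders.

```python
import subprocess

def encode_pair(cmd_a: list[str], cmd_b: list[str]) -> list[int]:
    """Launch two encodes simultaneously; the NVIDIA driver
    load-balances them across the GPU's NVENC chips."""
    procs = [subprocess.Popen(cmd_a), subprocess.Popen(cmd_b)]
    return [p.wait() for p in procs]

# Hypothetical invocation (paths and flags are illustrative):
# encode_pair(
#     ["ffmpeg", "-y", "-i", "a.mkv", "-c:v", "av1_nvenc", "a.av1.mp4"],
#     ["ffmpeg", "-y", "-i", "b.mkv", "-c:v", "av1_nvenc", "b.av1.mp4"],
# )
```

Both processes start before either is waited on, so both chips are busy for as long as both encodes run.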
That's what twin-nvenc does. It runs N encodes in parallel using an asyncio semaphore. On a 4090, -j 2 saturates both chips. The moment one encode finishes, the next file starts on the freed chip. No gaps, no wasted time.
The Architecture
It started as a bash script. 340 lines. It worked, but the parallel logic was ugly: launch a batch, wait for all of them, launch the next batch. If one file in a batch was huge and another was tiny, the tiny one would finish and the chip would sit idle until the big one caught up.
The Python rewrite uses asyncio.Semaphore to fix this. All files are launched as coroutines immediately. The semaphore limits concurrency to N (your NVENC chip count). The next file starts the instant a chip frees up. No batching, no idle gaps.
```python
# The core scheduling logic. That's it.
semaphore = asyncio.Semaphore(config.parallel)

# Each file gets its own coroutine
async def _encode_one(input_path, output_path, size, duration):
    async with semaphore:  # blocks until a chip is free
        result = await encode_file(input_path, output_path, ...)
        return result

# Launch all at once, semaphore handles the rest
tasks = [_encode_one(inp, out, sz, dur) for inp, out, sz, dur in files]
results = await asyncio.gather(*tasks)
```
Five lines of real logic. The semaphore is doing all the work. I spent more time tuning the ffmpeg flags than writing the scheduler.
The ffmpeg Flags
Getting the right NVENC flags took some research. The defaults ffmpeg uses are not great. Here's what twin-nvenc passes to every encode:
```
-hwaccel cuda
-hwaccel_output_format cuda   # decode stays on GPU, no PCIe round-trips
-c:v av1_nvenc                # AV1 hardware encoder (40-series+)
-rc vbr                       # variable bitrate with constant quality
-cq 28                        # quality target (lower = bigger/better)
-b:v 0                        # no bitrate cap, let CQ decide
-multipass fullres            # two-pass for better bit allocation
-rc-lookahead 32              # 32-frame lookahead for scene changes
-spatial-aq 1                 # adaptive quantization (spatial)
-temporal-aq 1                # adaptive quantization (temporal)
-bf 3                         # 3 B-frames with adaptive placement
-b_adapt 1
```
The important one is -hwaccel cuda -hwaccel_output_format cuda. Without this, ffmpeg decodes on the CPU, transfers the frame to GPU for encoding, then pulls the result back. With it, both decode and encode happen on the GPU. No PCIe bus traffic for raw frames.
Also: -rc vbr -cq N -b:v 0 instead of -rc constqp -qp N. VBR with CQ adapts the bitrate to scene complexity. Static screens get tiny bitrates. Fast action gets more bits. constqp doesn't adapt. The original bash script used constqp. Switching to VBR+CQ saved about 15-20% more space at equivalent visual quality.
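Assembled into a full argv, the flags above look something like this. `build_cmd` is an illustrative helper and the file names are placeholders; twin-nvenc's real command construction may differ.

```python
def build_cmd(src: str, dst: str, cq: int = 28) -> list[str]:
    """Assemble the ffmpeg argv with the NVENC flags described above."""
    return [
        "ffmpeg", "-y",
        "-hwaccel", "cuda",
        "-hwaccel_output_format", "cuda",  # keep decoded frames on the GPU
        "-i", src,
        "-c:v", "av1_nvenc",
        "-rc", "vbr", "-cq", str(cq), "-b:v", "0",
        "-multipass", "fullres",
        "-rc-lookahead", "32",
        "-spatial-aq", "1", "-temporal-aq", "1",
        "-bf", "3", "-b_adapt", "1",
        dst,
    ]

cmd = build_cmd("capture.mkv", "capture.av1.mp4")
```

Note that the hardware-decode flags come before `-i` (they configure the decoder), while everything after it configures the encoder.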
Progress Parsing
ffmpeg has a hidden flag that most people don't know about: -progress pipe:1. Instead of the usual dancing cursor on stderr, it writes structured key-value pairs to stdout:
```
frame=150
fps=45.2
out_time_us=5000000
speed=2.5x
progress=continue
```
Each block ends with progress=continue (or progress=end). You parse the key-value pairs, divide out_time_us by the total duration, and you've got a percentage. Divide remaining time by the speed multiplier and you've got an ETA. Simple, reliable, no regex against stderr.
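A parser for that stream takes only a few lines. This is a sketch: the function name is made up, and the 20-second total duration is an assumed input that would normally come from ffprobe.

```python
def parse_progress(block: str, total_duration_s: float):
    """Parse one key=value block from `ffmpeg -progress pipe:1`
    and return (percent_done, eta_seconds)."""
    kv = dict(
        line.split("=", 1)
        for line in block.strip().splitlines()
        if "=" in line
    )
    done_s = int(kv.get("out_time_us", 0)) / 1_000_000
    percent = min(100.0, 100.0 * done_s / total_duration_s)
    speed = float(kv.get("speed", "0x").rstrip("x") or 0)
    eta = (total_duration_s - done_s) / speed if speed > 0 else None
    return percent, eta

block = """frame=150
fps=45.2
out_time_us=5000000
speed=2.5x
progress=continue"""

print(parse_progress(block, total_duration_s=20.0))  # → (25.0, 6.0)
```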
This is what powers the TUI. Both NVENC slots update their progress bars independently, in real time.
The TUI
I built a terminal dashboard using Textual. It shows each NVENC chip as a separate progress bar with speed and ETA. Below that, a completed files list color-coded by compression ratio. Green means good compression. Yellow means marginal. Red means the output was bigger than the input (which gets auto-deleted).
```
73.5%  4.39x  ETA 0:07
27.3%  4.50x  ETA 0:55
```
There's also a --demo flag that runs simulated encodes so you can preview the TUI without actually encoding anything. Useful for screenshots and testing the layout.
Smart Defaults
A few decisions that saved me headaches:
- Skip-if-bigger: Every encode writes to a `.tmp.mp4` file. After encoding, twin-nvenc compares sizes. If the output is larger than the input, it deletes the output and moves on. Some already-compressed files just can't be shrunk further. No point keeping a bigger copy.
- Atomic rename: The temp file only gets renamed to the final name after a successful encode + size check. If ffmpeg crashes mid-encode or you Ctrl+C, there's no half-written output pretending to be complete.
- Resume-safe: Before encoding a file, it checks if `compressed/filename.mp4` already exists. If it does, skip it. You can kill the process and restart. It picks up where it left off.
- Auto-detect ffmpeg: On Windows, ffmpeg is rarely in PATH. The tool checks common install locations like ShareX's bundled copy, Chocolatey, and `C:\ffmpeg`.
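The skip-if-bigger and atomic-rename behavior together fit in a few lines. A sketch of the idea — the helper name is hypothetical, not twin-nvenc's API:

```python
import os

def finalize(src: str, tmp: str, final: str) -> bool:
    """Keep the encode only if it's smaller than the source, then
    rename atomically. Returns True if the output was kept."""
    if os.path.getsize(tmp) >= os.path.getsize(src):
        os.remove(tmp)      # output grew: not worth keeping
        return False
    os.replace(tmp, final)  # atomic on the same filesystem
    return True
```

`os.replace` is the key call: on both POSIX and Windows it atomically overwrites the destination, so a crash can never leave a half-written file under the final name.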
Profiles
I got tired of remembering flag combos, so I added TOML config with named profiles. Run `twin-nvenc --init-config` and it creates `~/.config/twin-nvenc/config.toml`:
```toml
[defaults]
codec = "av1_nvenc"
preset = "p4"
quality = 28
parallel = 2

[presets.screen]
quality = 26    # sharp text, lower CQ preserves detail

[presets.gaming]
quality = 32    # fast action barely compresses anyway

[presets.archival]
preset = "p7"
quality = 24    # maximum quality, slow
```
Then: `twin-nvenc -P screen "F:/OBS Captures"`. CLI flags always override profile values, so you can do `-P archival -q 20` to tweak on the fly.
The layering is: dataclass defaults, then [defaults] from TOML, then [presets.name], then CLI flags. Each layer only overrides what it sets.
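That layering reduces to successive dict merges where a layer overrides only the keys it actually sets. A sketch under assumed values (the `layer` helper and the specific numbers are illustrative, not twin-nvenc's internals):

```python
def layer(*levels: dict) -> dict:
    """Merge config layers left to right; later layers win,
    but only for keys they explicitly set (non-None)."""
    merged: dict = {}
    for level in levels:
        merged.update({k: v for k, v in level.items() if v is not None})
    return merged

defaults = {"codec": "av1_nvenc", "preset": "p4", "quality": 28, "parallel": 2}
toml_defaults = {"quality": 27}           # [defaults] in config.toml
profile = {"preset": "p7", "quality": 24} # [presets.archival]
cli = {"quality": 20, "preset": None}     # -q 20, no preset flag given

config = layer(defaults, toml_defaults, profile, cli)
print(config)  # → {'codec': 'av1_nvenc', 'preset': 'p7', 'quality': 20, 'parallel': 2}
```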
NVENC Chip Count by GPU
| GPU | NVENC Chips | -j flag |
|---|---|---|
| RTX 4090 | 2 | 2 |
| RTX 4080 | 2 | 2 |
| RTX 4070 and below | 1 | 1 |
| RTX 30-series | 1 | 1 |
| RTX 20-series | 1 | 1 |
You can set -j 1 for single-chip GPUs. The tool still works fine. You just don't get the parallel speedup.
Quality Guide
The -q flag controls the constant quality value. Lower numbers mean higher quality and bigger files. Here's how I think about it:
| CQ Range | Use |
|---|---|
| 20 - 24 | High quality, moderate compression. Good for archival if you care about every pixel. |
| 25 - 28 | Balanced. My default. Good for screen recordings with text. |
| 29 - 32 | Aggressive. Fine for game footage and casual recordings. |
| 33 - 38 | Maximum compression. Artifacts visible on close inspection. Fine for throwaway clips. |
The Evolution
The project went through a few phases:
- Bash script (340 lines) - basic batching. Launch N encodes, `wait` for all, launch the next N. Idle gaps when files had different durations. Used `constqp` rate control.
- Python rewrite - asyncio semaphore for gapless scheduling. VBR+CQ for better compression. Progress parsing via `-progress pipe:1`. Rich colored output.
- TUI dashboard - Textual app showing both NVENC chips in real time. Per-file progress bars, speed, ETA, running totals.
- Config profiles - TOML config with named presets. No more remembering flag combos.
Both the bash script and the Python tool are in the repo. The Python version is the one I use now.
What I Learned
The RTX 4090's second NVENC chip is completely invisible unless you know to look for it. NVIDIA's docs mention it in one table. No driver settings. No GPU-Z indicator. You just run two encodes and the throughput doubles.
The other surprise was how much the ffmpeg flags matter. Switching from constqp to vbr -cq and adding spatial/temporal AQ gave noticeably better quality at the same file size. Adding -hwaccel cuda -hwaccel_output_format cuda eliminated CPU decode bottlenecks entirely. The GPU does everything.
And asyncio.Semaphore is an underrated primitive. Five lines replaced a whole batch-and-wait loop with something that never wastes a cycle.
I looked for existing tools before building this. HandBrake, Tdarr, Unmanic, StaxRip - none of them schedule across dual NVENC chips. They all treat the GPU as a single encoder. The 4090 and 4080 owners who figure out the dual chip trick end up writing one-off bash scripts. twin-nvenc is the only dedicated tool that does this properly: gapless scheduling, resume support, auto-cleanup, researched encoder flags, and a TUI that shows both chips working in real time.
Source: github.com/andrewle8/twin-nvenc
Python 3.11+, an NVENC-capable GPU, and ffmpeg. `pip install -e .` and you're encoding.