Claude Video Editing Just Became Unrecognizable

ELI5/TLDR

A YouTuber drops a 50-second raw recording into Claude Code and gets back a 27-second edited video with motion graphics, subtitles, and a final “thanks for watching” card — all from natural-language prompts. The stack is Claude Code as orchestrator, a tool called video-use for trimming filler words, and hyperframes for HTML-based motion graphics. He still has to be very specific in the prompts, iterate two or three rounds, and burn ~238k tokens for one short video. The pitch: build up a folder of your own approved edits, then future videos converge toward “drop file, get edit.”

The Full Story

The pipeline before vs after

The old loop: record raw, hand-trim filler words in Premiere, hand-place animations, render. Four manual steps. Hyperframes (released a few days before this video) collapsed the animation step into a Claude tool call. The new tool, video-use, collapses the trimming step too. So now the manual surface is just recording the raw clip — everything downstream is a Claude Code conversation.

“Right now what we’re doing is we’re dropping in a raw file and then it’s basically being trimmed, edited, animated, and rendered for us.”

He notes you could remove the recording step too with a HeyGen avatar reading a script, but he wants to keep his face in the videos, so he doesn’t.

The actual stack

Claude Code (desktop app) — the orchestrator. He uses the desktop app over VS Code in this tutorial because the interface is less intimidating, though VS Code gives you the file tree.
video-use — a GitHub repo / skill that handles transcription + trimming. It defaults to the ElevenLabs API for transcription but can also use the OpenAI Whisper API or a locally run Whisper model. It outputs an edited.mp4 plus a JSON transcript with word-level timestamps.
hyperframes — another GitHub repo / skill that produces motion graphics as HTML compositions. He prefers it to video-use’s built-in Remotion option because the look is more “iOS 26 liquid glass premium UI” — which is the aesthetic he keeps reaching for.
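To make the trimming step concrete, here is a minimal Python sketch of filler-word and dead-air detection over a word-level transcript. The JSON shape (a `words` array of `text`/`start`/`end` entries), the `FILLERS` set, and the 0.5 s gap threshold are all assumptions for illustration. video-use's actual transcript schema and heuristics are not shown in the video and may differ.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical transcript shape; video-use's real JSON schema may differ.
SAMPLE = """
{"words": [
  {"text": "So",      "start": 0.00, "end": 0.30},
  {"text": "um",      "start": 0.45, "end": 0.80},
  {"text": "today",   "start": 0.95, "end": 1.30},
  {"text": "we're",   "start": 1.30, "end": 1.55},
  {"text": "uh",      "start": 1.70, "end": 2.10},
  {"text": "editing", "start": 2.20, "end": 2.70},
  {"text": "okay",    "start": 3.60, "end": 3.90}
]}
"""

# Illustrative filler list; a real tool would likely use context, not a fixed set.
FILLERS = {"um", "uh", "erm"}

def find_fillers(transcript: Dict[str, Any]) -> List[Tuple[float, float, str]]:
    """Return (start, end, word) for every word that looks like a filler."""
    hits = []
    for w in transcript["words"]:
        token = w["text"].lower().strip(".,!?")
        if token in FILLERS:
            hits.append((w["start"], w["end"], w["text"]))
    return hits

def find_gaps(transcript: Dict[str, Any], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) for silences between consecutive words.
    The 0.5 s default is an arbitrary assumption."""
    words = transcript["words"]
    return [
        (a["end"], b["start"])
        for a, b in zip(words, words[1:])
        if b["start"] - a["end"] >= min_gap
    ]

if __name__ == "__main__":
    t = json.loads(SAMPLE)
    print("fillers:", find_fillers(t))
    print("dead air:", find_gaps(t))
```

Both lists are candidate cuts; in the video, Claude presents candidates like these for approval rather than cutting blindly.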

Setup is just: paste the two GitHub URLs into Claude Code and tell it “set this up as my video-editing studio, pull in the skills.” API keys (e.g., ElevenLabs) go into a .env file, never the chat — standard hygiene.
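A minimal sketch of the .env habit: the key lives in a file in the project folder and is read into the environment at runtime, so it never appears in the conversation. The variable name `ELEVENLABS_API_KEY` is a hypothetical placeholder (check the repo’s README for the name it actually expects). In practice the `python-dotenv` package does this parsing more robustly.

```python
import os
from pathlib import Path
from typing import Dict

def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks and '#' comments; strip matching quotes."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env

def load_dotenv(path: str = ".env") -> None:
    """Copy .env entries into os.environ without overwriting variables already set."""
    p = Path(path)
    if p.exists():
        for key, value in parse_dotenv(p.read_text(encoding="utf-8")).items():
            os.environ.setdefault(key, value)

def require_key(name: str) -> str:
    """Fail loudly instead of prompting for the secret in chat."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; put it in .env rather than pasting it into the chat")
    return value

if __name__ == "__main__":
    load_dotenv()
    # Hypothetical variable name; use whatever the tool's README specifies.
    print("key loaded:", bool(os.environ.get("ELEVENLABS_API_KEY")))
```

Remember to add `.env` to `.gitignore` so the key doesn’t end up in the repo either.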

Why trim first, animate second

Animation timing has to lock onto specific words. So the pipeline must:

Trim the raw clip.
Generate a transcript with per-word timestamps on the edited timeline.
Sync motion graphics to anchor words.
Render.

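Skip step one and the animations sync to filler words and dead air. Skip step two and there’s nothing to anchor to. Swap the order and every later cut shifts the timestamps the animations were anchored to, so they drift out of sync.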
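The ordering constraint is easiest to see as a timestamp mapping. In this hypothetical Python sketch, `KEEP` lists the spans of the raw clip that survive the cut (the numbers are made up). Any anchor timed against the raw clip has to shift earlier by the total length of the cuts before it, and an anchor that fell inside a cut has no position at all. Transcribing against the edited timeline sidesteps this bookkeeping.

```python
from typing import List, Optional, Tuple

# Spans of the raw recording (seconds) kept after trimming; illustrative values only.
KEEP: List[Tuple[float, float]] = [(0.0, 4.0), (5.5, 22.0), (23.0, 40.0)]

def raw_to_edited(t: float, keep: List[Tuple[float, float]]) -> Optional[float]:
    """Map a raw-clip timestamp to the edited timeline, or None if it was cut."""
    offset = 0.0  # length of edited video produced by earlier kept spans
    for start, end in keep:
        if start <= t < end:
            return offset + (t - start)
        offset += end - start
    return None

if __name__ == "__main__":
    for raw_t in (2.0, 5.0, 6.0, 23.0):
        print(raw_t, "->", raw_to_edited(raw_t, KEEP))
```

A word at 6.0 s in the raw clip lands at 4.5 s in the edit; a word at 5.0 s was cut and maps to None.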

How a session actually goes

He drops the raw 50-second clip in, says “use video-use to remove filler words and retakes.” Claude returns a list of cuts (“false start at 0:04, stutter around 0:22, trailing ‘so’ at 0:42 — keep as breath or cut?”). He approves with “make it punchy.” Out comes edited.mp4 at 32 seconds plus a JSON transcript.
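Once a cut list is approved, applying it is mechanical. The Python sketch below inverts a list of (start, end) cuts into the spans that survive and builds an ffmpeg command that uses the `select`/`aselect` filters with `between(t,start,end)` expressions. This is one common way to apply a cut list with ffmpeg, not necessarily what video-use does internally. The cut times are made-up spans placed near the cuts Claude proposed, so they do not reproduce the 32-second result.

```python
from typing import List, Tuple

Span = Tuple[float, float]

def keep_segments(duration: float, cuts: List[Span]) -> List[Span]:
    """Return the parts of [0, duration] not covered by any cut."""
    keeps: List[Span] = []
    cursor = 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping cuts
    if cursor < duration:
        keeps.append((cursor, duration))
    return keeps

def ffmpeg_args(src: str, dst: str, keeps: List[Span]) -> List[str]:
    """Build an ffmpeg argv that keeps only the given spans (re-encodes)."""
    expr = "+".join(f"between(t,{s:g},{e:g})" for s, e in keeps)
    return [
        "ffmpeg", "-y", "-i", src,
        # select drops frames outside the spans; setpts/asetpts close the gaps
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        dst,
    ]

if __name__ == "__main__":
    cuts = [(3.2, 5.0), (21.6, 23.1), (41.8, 43.0)]  # hypothetical spans
    keeps = keep_segments(50.0, cuts)
    print(keeps, "->", round(sum(e - s for s, e in keeps), 2), "s kept")
    print(" ".join(ffmpeg_args("raw.mp4", "edited.mp4", keeps)))
```

Passing the list to `subprocess.run(..., check=True)` avoids shell quoting. The single quotes belong to ffmpeg’s filtergraph syntax, which needs them because the expression contains commas.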

Then he switches to plan mode before asking for motion graphics. Plan mode makes Claude lay out every scene — anchor word, position, color palette, copy, font size — before writing a single line of HTML. He reads the plan, asks for one revision (add a “thanks for watching” outro with a vertical-cropped face cam on the right), approves it, and Claude builds the compositions.
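Part of why plan mode pays off is that a plan is cheap to check before any HTML exists. Here is a hypothetical Python sketch of a scene plan as data, with one sanity check: every anchor word should occur exactly once in the edited transcript, so the animation has an unambiguous place to land. The `Scene` fields simply mirror what the plan lists per scene; they are not hyperframes’ format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scene:
    # Hypothetical fields mirroring the plan: anchor word, position, copy, font size, palette.
    anchor_word: str
    position: str               # e.g. "left", "right", "lower-third"
    copy_text: str
    font_size_px: int
    palette: List[str] = field(default_factory=list)  # hex colours

def normalize(word: str) -> str:
    return word.lower().strip(".,!?\"'")

def check_plan(scenes: List[Scene], transcript_words: List[str]) -> List[str]:
    """Return a list of problems; an empty list means every anchor resolves to one word."""
    counts = Counter(normalize(w) for w in transcript_words)
    problems = []
    for i, scene in enumerate(scenes):
        n = counts[normalize(scene.anchor_word)]
        if n == 0:
            problems.append(f"scene {i}: anchor {scene.anchor_word!r} not in transcript")
        elif n > 1:
            problems.append(f"scene {i}: anchor {scene.anchor_word!r} occurs {n} times; pin which one")
    return problems

if __name__ == "__main__":
    words = "so today we are editing this video and thanks for watching".split()
    plan = [
        Scene("editing", "left", "Raw to edit in one pass", 56, ["#ffffff", "#7dd3fc"]),
        Scene("thanks", "right", "Thanks for watching", 72),
        Scene("subscribe", "center", "Subscribe", 64),
    ]
    for problem in check_plan(plan, words):
        print(problem)
```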

“This planning stage is very important… we can iterate a little bit before it actually wastes our time and our Claude session limit actually coding out and building the HTML for the video.”

First render is decent but has bugs — a card covers his face, a grid pattern shows up across the whole video, the final crop only crops one side. He sends a second-round prompt with surgical fixes. Second render is the keeper.
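The “card covers his face” bug is the kind of thing a simple geometric check can flag before a render. In this hypothetical Python sketch, overlays and the face-cam region are axis-aligned pixel boxes; how you obtain the face box (by hand or from a face detector) is outside the sketch. A card is flagged if it covers more than a small fraction of the face, and the 5% tolerance is an arbitrary assumption.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned rectangle in pixels, origin at the top-left of the frame."""
    x: int
    y: int
    w: int
    h: int

def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two boxes (0 if they do not touch)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0

def occludes(card: Box, face: Box, tolerance: float = 0.05) -> bool:
    """True if the card hides more than `tolerance` of the face box."""
    return overlap_area(card, face) > tolerance * face.w * face.h

if __name__ == "__main__":
    face = Box(1200, 200, 400, 500)  # hypothetical face region in a 1920x1080 frame
    cards = {"title": Box(80, 300, 600, 300), "stat": Box(1000, 250, 500, 300)}
    for name, card in cards.items():
        print(name, "occludes face" if occludes(card, face) else "ok")
```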

The Hyperframes timeline editor

After Claude builds the scenes, there’s a visual timeline showing each animation as a draggable block. He can shorten, move, or delete elements directly, and the changes write back to the underlying code so Claude stays in sync. This is the part that actually makes iteration feel like editing rather than re-prompting.

Why specificity is the whole game

He spent a long time dictating exactly what each scene should do — “liquid glass card on the left, karaoke-style subtitles, scissor animation cutting red text” — using a voice-to-text tool to keep up. The argument for doing this work: every approved edit becomes training data for future videos.

“All of these videos are training data… build a lesson design markdown philosophy file, which means every time I build a lesson, just use that. And that’s where you truly get to the point of dropping in a raw file and having it edited end to end.”

So the upfront cost is high. The payoff compounds once you have, say, five lesson videos in your folder and Claude can pattern-match instead of asking.

Cost

This single short video burned ~238,000 tokens across all the prompting and HTML generation. Not catastrophic, but enough that “shoot first, iterate later” gets expensive fast. Hence plan mode.
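To turn the token count into money, multiply by whatever per-token rate applies to you. The rate in this Python sketch is a placeholder, not a quoted price. Real billing charges input and output tokens at different rates, and subscription plans meter usage differently, so treat the result as an order-of-magnitude check only.

```python
import math

def session_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Blended cost of a session at a single per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens

def videos_per_budget(budget_usd: float, tokens_per_video: int, usd_per_million_tokens: float) -> int:
    """How many videos of similar size fit in a budget."""
    return math.floor(budget_usd / session_cost(tokens_per_video, usd_per_million_tokens))

if __name__ == "__main__":
    PLACEHOLDER_RATE = 10.0  # USD per million tokens; substitute your real blended rate
    print(f"one video: ${session_cost(238_000, PLACEHOLDER_RATE):.2f}")
    print("videos per $100:", videos_per_budget(100.0, 238_000, PLACEHOLDER_RATE))
```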

One nice trick

He tells Claude to take screenshots of rendered scenes and verify them itself before declaring done. Otherwise Claude has a tendency to claim success on output that looks visually broken.
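The video doesn’t show how the screenshots are captured. One hedged way to do it yourself is to pull a still at each scene’s anchor time with ffmpeg and review the PNGs (or hand them back to Claude). The anchor times and file names below are hypothetical; the ffmpeg flags (`-ss` before `-i` to seek the input, `-frames:v 1` to write a single frame) are standard.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

def frame_grab_cmd(video: str, t: float, out: Path) -> List[str]:
    """ffmpeg argv that writes one still image at time t (seconds)."""
    return ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video, "-frames:v", "1", str(out)]

def grab_checkpoints(video: str, anchors: Dict[str, float],
                     outdir: str = "checks", run: bool = False) -> List[List[str]]:
    """Build (and optionally run) one frame-grab command per named anchor."""
    cmds = []
    for name, t in anchors.items():
        cmd = frame_grab_cmd(video, t, Path(outdir) / f"{name}.png")
        cmds.append(cmd)
        if run:
            Path(outdir).mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return cmds

if __name__ == "__main__":
    # Hypothetical scene anchors on the edited timeline, in seconds.
    anchors = {"intro_card": 1.2, "scissor_cut": 9.8, "outro": 25.0}
    for cmd in grab_checkpoints("final.mp4", anchors):
        print(" ".join(cmd))
```

Set `run=True` once ffmpeg is installed. A frame per scene is a cheap way to catch layout bugs like the occluded face or stray grid from the first render.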

Key Takeaways

Stack: Claude Code (desktop or VS Code) + video-use repo + hyperframes repo, both pulled in as skills.
Transcription options: ElevenLabs API (his pick), OpenAI Whisper API, or local Whisper. Keys go in .env, not chat.
Pipeline order: trim → word-level transcript → animate (anchored to words) → render.
Always switch to plan mode before asking for motion graphics. Approve or revise the plan before code is written.
Hyperframes timeline editor lets you drag/shorten/delete animation blocks visually; edits sync back to code.
Tell Claude to screenshot-verify its own renders to catch visual bugs.
Hyperframes vs Remotion: both work; he prefers Hyperframes for the “liquid glass” aesthetic, and Remotion is fine for a cleaner, simpler look.
Cost reference point: one short video ≈ 238k tokens.
Compounding move: save approved edits as reference projects; future similar videos converge toward one-shot.

Claude’s Take

The headline word is “unrecognizable” and it isn’t. What he’s actually showing is a well-orchestrated three-tool pipeline that needs a plan-mode pass, two iteration rounds, and a couple thousand words of voice-dictated direction to produce a polished 30-second clip. That’s genuinely useful — it’s also not “drop file, get edit.” Calibrate the title down by one notch.

The substance underneath the hype is real, though. The architectural insight worth keeping: trimming and transcription have to happen before animation because animations anchor on words, and timestamps on the edited timeline only exist after the cut. That ordering constraint is the actual product. Hyperframes is a nice-looking HTML composition layer; Claude is doing the gluing; ElevenLabs is doing the listening. None of the underlying capabilities (transcription, filler trimming, code-generated motion graphics) are new; what’s new is that Claude Code now feels comfortable orchestrating them end-to-end with skills loaded from GitHub repos.

The “build a folder of approved styles as training data” point is the most useful framing in the video. It mirrors how prompt libraries work for any kind of generative workflow: the first one is expensive, the tenth is nearly free, and the moat is your accumulated reference set.

Two honest caveats Nate gets to but doesn’t dwell on: 238k tokens for one short clip is steep, and the first render had real visual bugs (face occluded, mystery grid overlay, wrong crop). This is not a turnkey tool. It rewards careful, specific direction.

For someone who already lives in Claude Code and edits short-form video occasionally, the setup is worth a weekend. For anyone hoping to skip learning to direct, it isn’t.

Score: 6/10 — useful tutorial, real pipeline, oversold title.