Building a highlight reel without watching the video

I had a 9-minute screen recording of a design session in Hudson and wanted a 45-second highlight reel. I never opened a video editor.

Getting frames out of the video

Vision models can’t take video files directly, so I pulled 7 keyframes with ffmpeg and sent each one to MiniMax M2.7 (via the MCP understand_image tool).

It picked up the app name, the specific modules on screen (Scout Radar, Scout Lattice, Scout Radio), terminal conversations, even which slider the cursor was on. Seven frames was enough to know what happened across the whole recording.
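The sampling step is simple enough to sketch. This is not the post's actual script; the helper name and the ffmpeg flags in the comment are illustrative, but the idea is just picking evenly spaced timestamps across the recording and grabbing one frame at each:

```javascript
// Pick N evenly spaced timestamps across a recording of durationSec seconds.
// Each sample sits in the middle of its slice, so we never grab the very
// first or last frame (usually a blank desktop or a fade-out).
const keyframeTimestamps = (durationSec, count) =>
  Array.from({ length: count }, (_, i) => ((i + 0.5) * durationSec) / count);

// A 9-minute (540 s) recording, 7 frames:
const stamps = keyframeTimestamps(540, 7);

// Each timestamp then becomes one ffmpeg call, roughly:
//   ffmpeg -ss <t> -i session.mp4 -frames:v 1 frame-<i>.png
```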

Editing by talking

Instead of scrubbing a timeline, I just said what I wanted:

  • “Make a 45-second highlight reel capturing the good stuff”
  • “The transitions are too rough, make them smooth cross-dissolves”
  • “The S glyph is still inverted, push the timecode forward”
  • “Squeeze everything in a bit more, I want to see the browser chrome”

Each note became a code change in a Remotion composition. Clips are Sequence elements with overlapping cross-dissolves. The whole edit is a React component:

const OVERLAP_FRAMES = 18; // ~0.6s cross-dissolve

// Timeline position after the intro cards.
let cursor = introFrames + promptFrames;
const clipSequences = CLIPS.map((clip, i) => {
  const clipFrames = Math.floor(clip.duration * fps);
  const from = cursor;
  // Start the next clip before this one ends, so the two
  // overlap for the length of the dissolve.
  cursor += clipFrames - OVERLAP_FRAMES;
  return { clip, from, clipFrames, index: i };
});
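The dissolve itself reduces to an opacity ramp over those overlapping frames. In the real composition this would be a Remotion interpolate call; the plain function below is a sketch with illustrative names, written so the ramp is easy to see:

```javascript
const OVERLAP_FRAMES = 18;

// Opacity of one clip at `localFrame` (frames since the clip started),
// for a clip that is `clipFrames` long: fade in over the first overlap,
// hold at 1, fade out over the last overlap.
function clipOpacity(localFrame, clipFrames) {
  if (localFrame < OVERLAP_FRAMES) {
    return localFrame / OVERLAP_FRAMES; // fading in under the previous clip
  }
  if (localFrame > clipFrames - OVERLAP_FRAMES) {
    return (clipFrames - localFrame) / OVERLAP_FRAMES; // fading out under the next
  }
  return 1; // fully visible between the ramps
}
```

Because adjacent Sequences overlap by exactly OVERLAP_FRAMES, one clip's fade-out always lines up with the next clip's fade-in.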

It opens with a typewriter prompt card, then five clips with eased cross-dissolves, music on its own layer, intro and outro. When I wanted something different I just said so and re-rendered.
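The typewriter card is the same kind of frame-driven code. A minimal sketch, assuming a fixed typing speed (the constant and names are mine, not the post's):

```javascript
// Illustrative: reveal a prompt string character by character as the
// frame counter advances. In the composition this string would be
// rendered inside the intro card.
const CHARS_PER_FRAME = 0.5; // assumed typing speed

const typed = (text, frame) =>
  text.slice(0, Math.floor(frame * CHARS_PER_FRAME));
```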

The loop

The actual workflow is: say what you want, render, watch, say what’s wrong. “Make the intro snappier” changes the clip array. “Center it more” changes a constant. It’s fast enough that you don’t lose the thread.
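Concretely, the edit surface is just data. A sketch of what the clip array might look like (sources, timecodes, and field names are illustrative, not taken from the post):

```javascript
// Each feedback note maps to editing this data, not scrubbing a timeline.
const CLIPS = [
  { src: 'session.mp4', start: 12, duration: 8 },   // "the good stuff"
  { src: 'session.mp4', start: 95, duration: 10 },
];
// "Push the timecode forward"    -> bump a `start`.
// "Make the intro snappier"      -> drop or shorten an entry.

const totalSeconds = CLIPS.reduce((sum, c) => sum + c.duration, 0);
```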

Claude Code (Opus) writes the Remotion compositions and adjusts timecodes. For the frame analysis I used MiniMax M2.7 through MCP because it's cheap under their token plan; analyzing seven frames in parallel costs almost nothing. And since it's MCP, it's just another tool call in the same conversation. No tab-switching.

The whole thing runs locally with Remotion and bun. I didn’t sign up for anything or upload anything anywhere.


I wanted to start sharing my work in a way that makes it feel less like throwaway manual effort and more like reusable engineering. Sharing this video meant building a pipeline that turns a 9-minute source recording into a 45-second reel, and once the basics were running, every further change was just more conversation.