I used this to build a CLI that indexes hours of footage into ChromaDB, then searches it with natural language and auto-trims the matching clip. Demo video on the GitHub README. Indexing costs ~$2.50/hr of footage. Still-frame detection skips idle chunks, so security camera / sentry mode footage is much cheaper.
Then again, let's not be too hasty here. Let's see what you're willing to offer. I can sell you the eyeballs of the AI ad-watcher running in my closet for $10/impression. Or, for $1000/impression, you can bring your message to the attention of myself, an actual human. A bargain at any price!
Thanks for sharing!
Imagine a Premiere plugin where you could say "remove all scenes containing cats" and it'll spit out an EDL (Edit Decision List) that you can still manually adjust.
SentrySearch already returns precise in/out timestamps for any natural-language query and uses ffmpeg to auto-trim clips. Turning that into an EDL (or even a direct Premiere plugin that exports an editable cut list) feels natural.
I’m not a Premiere expert myself, but I’d love to see this happen. If you (or anyone) wants to sketch out a quick EDL exporter or plugin, I’ll happily review + merge a PR and help wherever I can. Just drop a GitHub issue if you start something!
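For anyone wanting to sketch that exporter: here's a minimal, hedged example of turning (start, end) second pairs into a CMX3600-style EDL. None of this is SentrySearch's actual code; the function names, the single `AX` source reel, and the 30 fps assumption are all mine.

```python
def tc(seconds: float, fps: int = 30) -> str:
    """Convert seconds to HH:MM:SS:FF timecode at the given frame rate."""
    total = int(round(seconds * fps))
    f = total % fps
    s = (total // fps) % 60
    m = (total // (fps * 60)) % 60
    h = total // (fps * 3600)
    return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

def to_edl(clips: list[tuple[float, float]], fps: int = 30) -> str:
    """Render (start, end) second pairs as a minimal CMX3600 EDL.

    Each event maps source in/out to a running record timeline, so the
    result is an editable cut list rather than a baked-in trim."""
    lines = ["TITLE: SENTRYSEARCH CUT", "FCM: NON-DROP FRAME", ""]
    record = 0.0
    for i, (start, end) in enumerate(clips, 1):
        lines.append(
            f"{i:03d}  AX       V     C        "
            f"{tc(start, fps)} {tc(end, fps)} "
            f"{tc(record, fps)} {tc(record + end - start, fps)}"
        )
        record += end - start
    return "\n".join(lines)

print(to_edl([(10.0, 25.0), (90.5, 112.0)]))
```

Premiere can import an EDL like this directly, which keeps the human in the loop for final adjustments.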
collections.lwarfield.dev
I believe you could combine ffmpeg's select and scene filter parameters to do this automatically each time a video chunk is created.
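Roughly, that could look like the sketch below: ffmpeg's real `select` filter exposes a per-frame `scene` change score, and `showinfo` logs the timestamps of the frames that pass. The threshold, filenames, and helper name here are made up for illustration.

```python
import subprocess

def scene_detect_command(src: str, threshold: float = 0.4) -> list[str]:
    """Build an ffmpeg command that keeps only frames whose scene-change
    score exceeds `threshold` and logs their timestamps via showinfo.
    A caller could parse that log to cut the chunk at scene boundaries."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",  # discard output; only showinfo's log matters
    ]

cmd = scene_detect_command("chunk_0001.mp4")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, capture_output=True)
```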
This very well might be a reality in a couple years though!
>Indexing 1 hour of footage costs ~$2.84 with Gemini's embedding API (default settings: 30s chunks, 5s overlap):
>1 hour = 3,600 seconds of video = 3,600 frames processed by the model. 3,600 frames × $0.00079 = ~$2.84/hr
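The arithmetic quoted above, spelled out (the per-frame rate is the one quoted in the thread):

```python
SECONDS_PER_HOUR = 3_600       # Gemini samples 1 frame per second
PRICE_PER_FRAME = 0.00079      # USD per frame, from the quote above

cost_per_hour = SECONDS_PER_HOUR * PRICE_PER_FRAME
print(f"${cost_per_hour:.2f}/hr")  # → $2.84/hr
```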
The Gemini API natively extracts and tokenizes exactly 1 frame per second from uploaded video, regardless of the file's actual frame rate. The preprocessing step (which downscales chunks to 480p at 5fps via ffmpeg) is a local/bandwidth optimization: it keeps payload sizes small so API requests are fast and don't time out, but it does not change the number of frames the API processes.
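A rough sketch of that preprocessing step (the filter chain matches the 480p/5fps description above; the helper name and file paths are my own, not the project's actual code):

```python
import subprocess

def preprocess_chunk(src: str, dst: str) -> list[str]:
    """Downscale a chunk to 480p / 5 fps to shrink the upload payload.
    This does not affect how many frames the API samples (still 1/s)."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", "scale=-2:480,fps=5",  # 480p tall, width kept even, 5 fps
        dst,
    ]

cmd = preprocess_chunk("chunk_0001.mp4", "chunk_0001_small.mp4")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```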
for example, if i search "cybertruck" in my indexed dashcam footage right now, i don't have any cybertrucks in my footage, so it'll return a clip of the next best match: a big truck, but not a cybertruck
Would love to see open-weight models with this capability since it would eliminate the API cost and the privacy concern of uploading footage.
a bit expensive right now so it's not as practical at scale. but once the embedding model comes out of public preview, and we hopefully get a local equivalent, this will be a lot more practical.
If there is text on the video (like a caption or wtv), will the embedding capture that? Never thought about this before.
If the video has audio, does the embedding capture that too?
Cool Project, thanks for sharing!
To me at least, cameras everywhere become considerably more concerning than the status quo when there is an AI watching and indexing every second of every feed, where camera owners, manufacturers, or governments could set simple natural-language parameters to be notified about highly specific people or activities. There are obviously compelling and easy-to-sell cases here that will surely drive adoption as it becomes cost effective: get an alert to a crime in progress, get an alert when a neighbor doesn't clean up after his dog, get an alert when someone has fallen... but the potential implications of living in a panopticon like this, if not well regulated, are pretty ugly.
https://ai.google.dev/gemini-api/docs/pricing#gemini-embeddi...
(The code also tries to skip "still" frames, but if your video is dynamic you're looking at the cost above.)
regardless of the file's frame rate, the gemini api natively extracts and tokenizes exactly 1 fps. the 5 fps downscaling just keeps the payload sizes small so the api requests are fast and don't timeout.
i'll update the readme to make this more clear. thanks for bringing this up.
The problems start cropping up when you get things like Flock where governments start deploying cameras on a massive scale, or Ring where a single company has unrestricted access to everyone's private cameras.
I don't think it's a good thing but it seems the limiting factor has been technological feasibility instead of any kind of principle against it.
I've been hearing warnings that AI would be used for this since well before it seemed feasible.