People sit on mountains of raw assets - product walkthroughs, customer interviews, travel videos, screen recordings, changelogs - that could become testimonials, ads, vlogs, launch videos, and more.
Instead they sit in cloud storage or on hard drives, because getting to a first cut takes hours: scrubbing through the raw footage manually, arranging clips in the correct sequence, syncing music, exporting, uploading to cloud storage to share, getting feedback on WhatsApp/iMessage/Slack, and then redoing the whole thing until everyone is happy.
We grew up together and have been friends for 15 years. Saksham creates content on socials with ~250K views/month and kept hitting the wall where editing took longer than creating. Ishan was producing launch videos for HackerRank's all-hands demo days and spent most of his time on cuts and sequencing rather than storytelling. We both felt that while tools like Premiere Pro and DaVinci are powerful, they have a steep learning curve and involve lots of manual labor.
So we built Cardboard. You tell it to "make a 60s recap from this raw footage" or "cut this into a 20s ad" or "beat-sync this to the music I just added" and it proposes a first draft on the timeline that you can refine further.
We built a custom hardware-accelerated renderer on WebCodecs / WebGL2: there's no server-side rendering and no plugins; everything runs client-side in your browser. Video understanding tasks go through a series of cloud VLMs and traditional ML models, and we use third-party foundation models for agent orchestration, with a dropdown so end users can pick the model themselves.
We've shipped 13 releases since November (https://www.usecardboard.com/changelog). The editor handles multi-track timelines with keyframe animations, shot detection, beat sync via percussion detection, voiceover generation, voice cloning, background removal, multilingual captions that are spatially aware of subjects in frame, and Premiere Pro/DaVinci/FCP XML exports so you can move projects into your existing tools if you want.
Where we're headed next: real-time collaboration (video git) to avoid inefficient feedback loops, and eventually a prediction engine that learns your editing patterns and suggests the next low-entropy actions - similar to how Cursor's tab completion works, but for timeline actions.
We believe that video creation tools today are stuck where developer tools were in the early 2000s: local-first, zero collaboration, and really slow feedback loops.
Here are some videos that we made with Cardboard:

- https://www.usecardboard.com/share/YYsstWeWE9KI
- https://www.usecardboard.com/share/nyT9oj93sm1e
- https://www.usecardboard.com/share/xK9mP2vR7nQ4
We would love to hear your thoughts/feedback.
We'll be in the comments all day :)
- https://news.ycombinator.com/item?id=42806616
- https://news.ycombinator.com/item?id=45980760
- https://news.ycombinator.com/item?id=46759180
- https://github.com/saurav-shakya/Video-AI-Agent
- It's going to be rather tough to differentiate.
I would like to:
- upload a bunch of surf footage
- let it sort through the surfers
- pick the three longest waves surfed by each surfer
- create a montage grouped by surfer, ordered by shortest to longest wave for that surfer.
Thank you!
I've spent a bit of time on something related, AI-generating motion graphics videos from code, also editable/renderable in-browser. Here's a few things I ran into:
- I see you mentioned being aware of Remotion in another comment. In my experience Remotion is not the right tool for adding motion graphics to what you're building. There are a few reasons for this, but basically declarative markup is not a great language for motion graphics beyond anything very basic. Also, in-browser rendering is only going to work with canvas-based components. I also wasn't a huge fan of their license.
- WebCodecs may not be as reliable as you think. I've verified several issues where I get a different output across browsers and operating systems, and even different permutations of flags, browser and OS. Is there a reason why your tool needs to be browser-based?
We've been eager to experiment with this for a while, just have to prioritize other user requests for now. Will definitely try a few approaches and see what sticks. (Also noticed they have an experimental client-side rendering version built on mediabunny, haven't tried it yet: https://www.remotion.dev/docs/client-side-rendering/)
- On WebCodecs, there are a fair set of challenges, but we wanted to take the bet. The reason we're browser-based is the same reason I love Figma and Google Docs: no install, no waiting, just open and start. That said, for broader codec support (ProRes, RAW, etc.) we'll rely on server-side transcoding with proxies where needed.
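For the curious, the client-vs-server routing decision can hang off `VideoDecoder.isConfigSupported`. This is a simplified sketch, not our actual code - the function name and fallback policy are illustrative, and the decoder class is injected so the logic can run (and be tested) outside a browser:

```typescript
// Sketch: decide per-codec whether to decode in the browser or route the
// file to server-side transcoding. In a real browser, DecoderClass
// defaults to the global VideoDecoder from WebCodecs.
type SupportResult = { supported?: boolean };

interface DecoderLike {
  isConfigSupported(config: { codec: string }): Promise<SupportResult>;
}

async function canDecodeInBrowser(
  codec: string,
  DecoderClass: DecoderLike | undefined = (globalThis as any).VideoDecoder
): Promise<boolean> {
  if (!DecoderClass) return false; // no WebCodecs at all: server transcode
  try {
    const { supported } = await DecoderClass.isConfigSupported({ codec });
    return supported === true;
  } catch {
    return false; // malformed codec string, etc.
  }
}
```

So something like `canDecodeInBrowser('avc1.42E01E')` (baseline H.264) would typically come back true in Chrome, while ProRes would come back unsupported and get routed to the proxy/transcode path.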
Just to clarify I still think code-driven graphics is the correct approach, but in my case I opted for a different library with a more powerful imperative API.
> Also noticed they have an experimental client-side rendering version built on mediabunny
Yes, I've tried it out, it was a non-starter for me because it only supports canvas-based components, and Remotion didn't seem to have good support for text on canvas because they rely on HTML for most of that.
> On WebCodecs, there are a fair set of challenges, but we wanted to take the bet
Totally understand the appeal and immediacy of a browser app, I was lured in by that too. For what it's worth I've reported showstopping WebCodecs issues in Chromium and there's basically no indication they'll get fixed on a predictable timeline.
Another issue I ran into that I just remembered is animating text on canvas. It's basically impossible to get pixel-perfect anti-aliased text animation using a canvas. I would have to dig up the exact details but it was something to do with how browsers handle sub-pixel positioning for canvas text, so there was always some jitter when animating. This coupled with the aforementioned WebCodecs issues led me to conclude that professional-quality video rendering is not currently possible in the browser environment. Aliasing, jitter and artifacts are immediately perceptible and are the type of thing that users have zero tolerance for (speaking from experience).
This is not meant to be discouraging in any way, I've just been very deep into this rabbithole and there are some very nasty well-hidden pitfalls.
Interestingly I have the exact opposite experience: I've reported issues both in the WebCodecs specification and the Chromium implementation, and in all cases they were fixed within weeks. Simply through reports on public bug trackers; it wasn't really a major issue in any instance.
> Another issue I ran into that I just remembered is animating text on canvas. It's basically impossible to get pixel-perfect anti-aliased text animation using a canvas.
We're doing SOTA quality video rendering with WebCodecs + Chromium with millions of videos produced daily, or near SOTA if you consider subpixel AA a requirement for text. In general for pixel perfection of text, especially across different browsers and operating systems, you can't just use text elements in DOM or in canvas context, instead text needs to be rasterized to vector shapes and rendered as such. Honestly not sure about potential jittering when animating text, but we've never had any complaints about anything regarding text animations and users are very often comparing our video exports with videos produced in Adobe AE or similar.
That's fair, they are responsive most of the time. I do have one major rendering issue in particular I've been waiting on with no movement for months, so I might be biased.
> In general for pixel perfection of text, especially across different browsers and operating systems, you can't just use text elements in DOM or in canvas context, instead text needs to be rasterized to vector shapes and rendered as such.
So you use a library that takes in text and vectorizes it to canvas shapes? That could work in theory, do you have a demo of this?
Yea, it's harfbuzz compiled to WASM: https://harfbuzz.github.io/harfbuzzjs/ Then all text layout features must be implemented on top of it: line breaking, text alignment, line spacing, kerning, text direction, decoration, etc.
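To give a flavor of what "implemented on top of it" means: the shaper only hands back per-glyph advances, so even basic line breaking is on you. Here's a toy greedy breaker over pre-measured word widths (real layout also has to handle bidi, hyphenation, kerning across spaces, trailing-space trimming, and so on - the types here are hypothetical):

```typescript
// Toy greedy line breaker: words arrive with widths already measured by
// the shaper; emit lines whose total width (including inter-word spaces)
// stays within maxWidth. Each word is assumed to fit on a line by itself.
interface Word { text: string; width: number }

function breakLines(words: Word[], spaceWidth: number, maxWidth: number): string[] {
  const lines: string[] = [];
  let line: string[] = [];
  let lineWidth = 0;
  for (const w of words) {
    // Width if we append this word to the current line.
    const needed = line.length === 0 ? w.width : lineWidth + spaceWidth + w.width;
    if (line.length > 0 && needed > maxWidth) {
      lines.push(line.join(' ')); // close the current line
      line = [w.text];
      lineWidth = w.width;
    } else {
      line.push(w.text);
      lineWidth = needed;
    }
  }
  if (line.length > 0) lines.push(line.join(' '));
  return lines;
}
```

And that's the easy part - text direction and script itemization are where it gets genuinely hairy.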
would you mind sharing the name?
It's not really designed for the animation code to be dynamically changed on the fly, but I've hacked together this feature in my fork.
For some of the examples we shared though, we've created sample projects right within the product itself. They contain the raw assets and the exact prompts used to create the videos. You can try them out directly at https://demo.usecardboard.com and see the whole process!
I recently started making videos for a loved one who lives far away. I started using CapCut, and this is the kind of thing where I kept thinking "I wish it did that".
I'll definitely try it out. Congrats!
will definitely check the XML exports, ty :)
Firefox is not supported ...
But why?

The short answer: Firefox doesn't support the File System Access API (https://caniuse.com/?search=File+System+Access+API).
We made a deliberate decision to go client-first. Video editing happens entirely in your browser, without your raw footage ever being uploaded to our servers: no bandwidth costs for you, no storing your video on our end. The File System Access API is what makes that possible, and unfortunately Firefox just doesn't have it yet.
It's not a forever thing though. For cloud-based projects where files live on our end anyway, Firefox support is very much on the roadmap. But for the local-first editing flow, our hands are a bit tied until Mozilla ships it.
Hope that makes sense, and fingers crossed Firefox adds support soon!
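For the curious, the capability gate itself is tiny - when the API is missing you'd fall back to regular file inputs for cloud projects. A sketch (not our actual code; the window object is injected so this can run outside a browser):

```typescript
// Feature-detect the File System Access API. In a real browser you'd
// pass `window`; the pickers are only present in Chromium-based browsers.
function supportsFileSystemAccess(win: Record<string, unknown>): boolean {
  return typeof win['showOpenFilePicker'] === 'function' &&
         typeof win['showDirectoryPicker'] === 'function';
}
```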
Cool to see the space evolving from so many directions! :)
Great website and good luck!
I also saw another YC company, Mosaic, doing something similar. But your approach of chat-based editing is a lot closer to what I'm building. Shameless plug: I'm also working on a chat-based media processor. https://chatoctopus.com
But you guys are way ahead! will be looking at you for inspiration.
Great website btw. The onboarding was very pleasing
I played around on a sample video and it worked great. I wanted to undo one AI edit but couldn't find an undo button.
There is an undo button — it's on the bottom right of each user message in the chat. That said, sounds like it wasn't obvious enough, so I'll rethink the UX there for sure!
Cardboard looks really well polished, well done!
My co-founder and I met in high school, and we wanted the name to carry a sense of craft. Cardboard was always that material in school projects that was firm enough to hold structure but malleable enough to build almost anything out of. That balance of structure and flexibility felt like a good metaphor for what we're building.
Also we just thought it was a cool name and bought a bunch of domains... https://cardboard.mov is one of my favorites :)
Regardless, having a tool that knows the content of your footage is a huge time saver. Good luck with the product.
That's also why we built a full editor alongside the agentic experience. Use AI where it helps, like finding the right shot or removing silences, and do the rest manually. And if you'd rather finish in your editor of choice, we support XML export for Premiere, DaVinci, etc.
And agreed, there's really no substitute for the kind of intentionality Herzog brings to his work :)
Aight imma head out. Holy moly.
The value might not be co-editing the timeline; it's making the feedback / iteration loops faster.
We deliberately avoided credits/usage-based pricing because as founders using this in our own creative workflow, we hate the cognitive load that comes with it.
If I don't like a voiceover/variation, I should have the freedom to regenerate it until I'm happy without thinking about whether it's "worth" a credit.
That said, we could be wrong! Genuinely curious what you think would feel fair?
One thing I've been thinking about building in this space: there's a fundamental split between understanding what to edit (where VLMs/agents shine) and executing the edit precisely (where you need deterministic operations, not model inference).
Most "AI video editors" blur these two together — they use the same probabilistic approach for both understanding and execution. But when a user says "cut the first 3 seconds and add a 0.5s crossfade," that shouldn't go through a model. That should be a precise, repeatable operation.
The Cursor analogy in your roadmap is apt — Cursor works because it predicts intent but executes through deterministic code transforms, not by asking an LLM to write the whole file. Same principle applies to video.
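To make the split concrete: the execution layer can be plain pure functions over timeline state, with the agent only deciding which ones to call and with what arguments. A sketch with hypothetical types (times in seconds):

```typescript
// Hypothetical timeline model. The agent proposes operations like
// "cut the first 3 seconds"; these functions execute them
// deterministically and immutably, so every edit is repeatable.
interface Clip { src: string; start: number; end: number } // end exclusive
interface Timeline { clips: Clip[] }

// "cut the first 3 seconds": trim the head of the first clip.
// (Toy version: no validation that `seconds` fits within the clip.)
function trimHead(t: Timeline, seconds: number): Timeline {
  const [first, ...rest] = t.clips;
  return { clips: [{ ...first, start: first.start + seconds }, ...rest] };
}

// "add a 0.5s crossfade": modeled as an explicit transition record
// attached between clip N and clip N+1, applied at render time.
interface Transition { afterClip: number; kind: 'crossfade'; duration: number }

function addCrossfade(afterClip: number, duration: number): Transition {
  return { afterClip, kind: 'crossfade', duration };
}
```

The model never touches frames or timestamps directly; it just emits calls like `trimHead(timeline, 3)` and `addCrossfade(0, 0.5)` that are trivially replayable and diffable.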
Curious how you handle the boundary between agent-proposed edits and deterministic timeline operations under the hood?