Passing tests in your repo are great documentation of the tool at a microscopic level. And rerunning tests only burns tokens on failures (since passed tests just print a dot), so it's token-efficient too.
Some other neat tricks:
- For greater efficiency, configure your test runner to print nothing (not even a dot or filename) for test successes. Agents don't need progress dots, only the exit code and the failure details (see the sketch after this list).
- Have your agent implement a 10ms timeout per test (pytest has hooks to do this). The agent will see tests time out and mock out all I/O and third-party code - why test what one assumes third parties have tested already! Your test suite then becomes CPU-bound with no shared database, no shared data, and no tests that interfere with or depend on each other, so tests can run in parallel.
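On the first point: one way to get that with pytest - a minimal sketch, assuming a recent pytest, dropped into `conftest.py` - is the `pytest_report_teststatus` hook, which lets you emit nothing at all for passing tests:

```python
# conftest.py - suppress per-test output for passing tests.
# Failures and errors still print in full, and the exit code is unchanged.
def pytest_report_teststatus(report, config):
    # Only silence the "call" phase of tests that passed: returning an empty
    # short letter means pytest prints no dot for them.
    if report.when == "call" and report.passed:
        return report.outcome, "", ""
```

Combined with `pytest -q`, the remaining output is essentially just failures and the final summary line.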
I'm OK with longer-running tests because I always have them run against a real database (often SQLite, sometimes PostgreSQL) and real files created in temporary directories, but I can see how the time limit might be useful for tests that don't need those kinds of components.
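A minimal sketch of that pattern, assuming pytest with its built-in `tmp_path` fixture and the standard library's `sqlite3` (not code from any specific project):

```python
# test_notes.py - each test gets a real SQLite database file
# in its own temporary directory, so tests stay isolated.
import sqlite3

import pytest


@pytest.fixture
def db(tmp_path):
    conn = sqlite3.connect(tmp_path / "test.db")
    conn.execute("create table notes (id integer primary key, body text)")
    yield conn
    conn.close()


def test_insert_and_read(db):
    db.execute("insert into notes (body) values (?)", ("hello",))
    assert db.execute("select body from notes").fetchall() == [("hello",)]
```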
It's basically an automated test, but at a higher abstraction level and with manual verification - using CLI tools rather than a test harness. Really great work!
Showboat documents look neater when each cell is a single one-line command that does something useful. Dumping a full Playwright script into a cell is less readable.
Showboat also has a special feature where you can embed an image directly in the document by running:
showboat image doc.md 'rodney screenshot'
The command you call should return a path to an image file as the last line of output. Rodney does exactly that.
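As a purely hypothetical illustration of that contract - the `chart.py` name and the Pillow dependency are my own assumptions, not part of Showboat or Rodney - any command that finishes by printing an image path should work:

```python
# chart.py - hypothetical command usable with `showboat image`:
# do some work, save an image, and print its path as the last line of output.
from PIL import Image

img = Image.new("RGB", (320, 200), color="navy")
img.save("demo.png")
print("demo.png")  # Showboat embeds whatever path appears on this final line
```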
It may well turn out that Rodney is unnecessary and people find better patterns using Showboat with existing tools like playwright-cli - in which case it won't matter, because Showboat and Rodney aren't coupled to each other at all.

Showboat is definitely the more significant of the two projects.
It's also interesting that you've shifted to Go for your agent-coded CLI tools, Simon.
... but then I'm mostly running them with "uvx name-of-tool" because it turns out Python's packaging infrastructure for binary tools is so good!
But I can definitely see how someone with `uv` muscle memory wants everything in the same command.
`uv` is the best thing that happened to the Python ecosystem since... I don't know... maybe Numpy.
The main difference is that Rodney can be installed as a single Go binary or via uv/pip, while agent-browser is Rust and distributed via npm.
Looks like agent-browser was first released at the start of January, it's very new.
It would be interesting to experiment with Jupyter notebooks as an alternative that could work in Claude Code for web.
I had a poke around just now and couldn't find an existing CLI tool that lets you build those up a section at a time in the same way as Showboat. I did find this Python library though:
uv run --with nbformat python -c '
import nbformat
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# NBTerm Exploration"))
nb.cells.append(nbformat.v4.new_code_cell("import sys\nprint(f\"Python {sys.version}\")"))
nb.cells.append(nbformat.v4.new_code_cell("x = [i**2 for i in range(10)]\nprint(x)"))
nb.cells.append(nbformat.v4.new_code_cell("sum(x)"))
with open("demo.ipynb", "w") as f:
nbformat.write(nb, f)
'
So you could tell the agent to run code like that and then inspect the `demo.ipynb` notebook later on. It doesn't show the result of evaluating the cells though - you need to run this afterwards to have that happen:

uv run --with nbformat --with nbclient --with ipykernel python -c '
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("demo.ipynb", as_version=4)
client = NotebookClient(nb, timeout=60)
client.execute()
nbformat.write(nb, "demo_executed.ipynb")
'
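If you then want to see what the cells actually produced without opening Jupyter, a small follow-up sketch (my own addition, using the same nbformat API) can pull the recorded outputs back out of `demo_executed.ipynb`:

```python
# inspect_outputs.py - print each code cell's source followed by any
# text output recorded in the executed notebook.
import nbformat

nb = nbformat.read("demo_executed.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type != "code":
        continue
    print(">>>", cell.source)
    for output in cell.get("outputs", []):
        if output.get("output_type") == "stream":
            # stream text may be a string or a list of strings
            print("".join(output["text"]), end="")
        elif "text/plain" in output.get("data", {}):
            print("".join(output["data"]["text/plain"]))
```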
https://github.com/microsoft/playwright-cli

Different from the CLI for running tests etc. that comes bundled with Playwright.
Sample use:
playwright-cli open https://demo.playwright.dev/todomvc/ --headed
playwright-cli type "Buy groceries"
playwright-cli press Enter
playwright-cli type "Water flowers"
playwright-cli press Enter
playwright-cli check e21
playwright-cli check e35
playwright-cli screenshot

- E2E testing of browser components
- Taking screenshots before and after and having Claude look at them to double check things
- Driving it with an API and CLI as a headless browser
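On the before-and-after screenshots point above, a hedged sketch of how that might look with Playwright for Python (assuming `pip install playwright` plus `playwright install chromium`; the demo site and selector are purely illustrative):

```python
# before_after.py - capture a screenshot before and after an interaction
# so an agent (or a human) can compare the two images.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://demo.playwright.dev/todomvc/")
    page.screenshot(path="before.png")
    page.fill(".new-todo", "Buy groceries")
    page.press(".new-todo", "Enter")
    page.screenshot(path="after.png")
    browser.close()
```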
Will definitely give Rodney a look.
Also, I am sure you must already know about Playwright MCP, so why this? If your goal isn't to make the CLI human-friendly, which is the only advantage CLIs have over MCPs doing the same thing, then why not just use the MCP? It doesn't even handle multiple sessions and has a single global state file - this is slop.
Otherwise it's just writing a document, not building a demo you can review.
As far as I can tell you can't hook MCPs up to Claude Code for web.
I originally planned to support separate sessions but decided to leave that out for the initial release. I've opened an issue for that here: https://github.com/simonw/rodney/issues/6
Or alternatively, it could just be a skill rather than a tool.
My “agents” already demo stuff all the time just by being prompted to do so. I have notations in my standard Agents.md for how I want my documentation, testing, etc. For example:
Run uvx showboat --help and
uvx rodney --help and use those
tools to demo the feature you built
The help text effectively doubles as a skill. I don't want to duplicate my skills into all those repos (and keep them updated), so I prefer the "uvx tool --help" pattern.
I saw an MCP I've set up on claude.ai show up in my local Claude Code MCP list the other day - it seems inevitable that there will be skills integration across environments as well at some point.
Please respect the Hacker News community and read https://news.ycombinator.com/item?id=46747998.
Human heuristics - I've prompted millions of tokens across every frontier model iteration for all manner of writing styles and purposes - also help greatly.
Concerning to me are long-time posters who (perhaps unknowingly) advance the decline of this human community by encouraging the people breaking HN guidelines. Perhaps spending a few hours on Moltbook might help develop such a heuristic, since "someone who knows how to write" is just a Claude model with a link to the blog post.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Thanks for your comment!
Anyway, LLMs don’t have underlying intent, so maybe it is fine to just let them express what they can in Markdown?
I didn't know about `sh-session` - is that documented anywhere?
Showboat seems like it could actually be quite useful for humans too, just for making quick notes from a CLI without opening an editor. The "pop" command makes me wonder if there would be a benefit to also having an array-like interface in addition to the stack-like one. It seems like it would be fairly trivial to generate an index of markdown blocks so that they could be edited individually.
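That index idea seems tractable. A rough sketch - my own, not anything Showboat actually does - that splits a markdown file into blocks (keeping fenced code blocks intact, otherwise splitting on blank lines) and prints a numbered index:

```python
# index_blocks.py - number the top-level blocks of a markdown document
# so individual blocks could be addressed for editing.
import sys

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_blocks(text):
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
            current.append(line)
            continue
        if not in_fence and not line.strip():
            if current:  # a blank line outside a fence ends the current block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


if __name__ == "__main__":
    for i, block in enumerate(split_blocks(open(sys.argv[1]).read())):
        print(f"[{i}] {block.splitlines()[0][:60]}")
```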
I like the idea of Rodney, but I wonder if you might actually have better results by asking the agent to generate equivalent Selenium scripts instead. I'm specifically suggesting Selenium because it's been around for so long that I assume there's a lot of Selenium in the LLMs' training data, but there are other options that might work too.
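For comparison with the Playwright sketch earlier, a rough Selenium equivalent - assuming Selenium 4, which can fetch a driver automatically via Selenium Manager; the site and selector are again just illustrative:

```python
# selenium_demo.py - roughly the same interaction as the Playwright example
# above, expressed as a Selenium script.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://demo.playwright.dev/todomvc/")
box = driver.find_element(By.CSS_SELECTOR, ".new-todo")
box.send_keys("Buy groceries", Keys.ENTER)
driver.save_screenshot("todo.png")
driver.quit()
```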
I've found the models are so good at Playwright that I don't consider Selenium any more. Rodney is my first experiment not using Playwright.
https://i.postimg.cc/zDMD9nYD/Simon.png