Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs
33 points by darkrishabh 7 hours ago | 10 comments

ssgodderidge 3 hours ago
The example model in the documentation is 4o-mini; you might want to update that to a more recent model.

As an aside, 4o-mini came out months before agent skills were released… I’m curious how it performs at choosing to load skills in the first place.

reply
stingraycharles 2 hours ago
It’s an artifact of the documentation being AI-generated: the models usually pick GPT-4-era model names without giving it further thought.

For Gemini it always seems to pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models.

Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.

reply
block_dagger 2 hours ago
The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP, etc.).
reply
ssgodderidge 26 minutes ago
The metadata is loaded by the harness, but the LLM still needs to choose to load the rest of the skill, no?
reply
egeozcan 3 hours ago
Are there any published results gathered using this?
reply
jarym 46 minutes ago
Not sure, but I'm interested in trying it because I've sensed for a while that adding SKILLS.md degraded my overall experience - most probably I wrote them wrong. This sort of tooling might help me figure that out.
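The kind of check I'd want is roughly this (a sketch with hypothetical helper names, not this tool's actual API): run the same tasks with and without the skill text and compare judge scores.

```python
# Illustrative with/without-skill A/B sketch; run_agent and judge are
# stand-ins for a real agent call and a real (e.g. LLM) judge.
import statistics

def run_eval(tasks, run_agent, judge, skill_text=None):
    """Run each task, optionally with the skill prepended; return mean score."""
    scores = []
    for task in tasks:
        prompt = f"{skill_text}\n\n{task}" if skill_text else task
        output = run_agent(prompt)
        scores.append(judge(task, output))  # e.g. a 0.0-1.0 rating
    return statistics.mean(scores)

def skill_delta(tasks, run_agent, judge, skill_text):
    """Positive delta -> the skill helped on this task set."""
    with_skill = run_eval(tasks, run_agent, judge, skill_text)
    without = run_eval(tasks, run_agent, judge)
    return with_skill - without
```

A negative delta would at least tell me the SKILLS.md is hurting rather than helping, which is exactly what I haven't been able to confirm by feel.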
reply
ianhxu 3 hours ago
How do you iterate on the judge prompt? Is there an auto rater?
reply
datadrivenangel 30 minutes ago
That is the billion-dollar question. Who watches the watchmen?
reply
ianhxu 5 minutes ago
exactly
reply
blitzar 29 minutes ago
the watchwatchmen
reply