It hardens a skill through judge-panel refinement rounds; it’s a quality gate that runs after authoring, not an authoring tool.
MisterBiggs 1 days ago [-]
This is pretty neat. I suspect that eventually every skill will have some sort of validation/verification loop like this.
bob1029 1 days ago [-]
I've been able to avoid this kind of markdown library architecture with very chatty tool feedback. Interaction with a responsive environment is much better than static chunks of "skill" text. For example, imagine a domain constraint:
"You must use tool ABC before calling tool XYZ"
This can either be in some static prompt scheme somewhere, or it can be the live result of a tool call.
If you make everything tool calling and environmental, you effectively have a lazily evaluated & dynamic prompt scheme.
I like to think of this as context for the context. The better you map the environment and descriptions of it to the agent, the less top-down prompting is required.
If you set up the harness correctly, you can run circles around a lot of what passes as AI innovation with powershell in a while loop. Adding static markdown document soup on top of this would only reduce performance in the general case.
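A minimal sketch of the "constraint as live tool feedback" idea above, under stated assumptions: the `Environment` class and the tool names `abc` and `xyz` are hypothetical stand-ins, not part of any real harness. The ordering rule lives in the tool's own response, so the agent only sees it when it actually gets the order wrong.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy tool environment; `abc` and `xyz` are hypothetical stand-ins."""
    called: set = field(default_factory=set)

    def abc(self) -> str:
        self.called.add("abc")
        return "abc: ok. You may now call xyz."

    def xyz(self) -> str:
        # The domain constraint lives here, not in a static prompt.
        if "abc" not in self.called:
            return "xyz: refused. You must use tool abc before calling tool xyz."
        return "xyz: ok."


def run(env: Environment, calls: list[str]) -> list[str]:
    """Dispatch tool calls by name and collect the feedback the agent would see."""
    return [getattr(env, name)() for name in calls]


if __name__ == "__main__":
    for line in run(Environment(), ["xyz", "abc", "xyz"]):
        print(line)
```

The refusal text doubles as the prompt: it is evaluated lazily, only on the call path that needs it.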
MisterBiggs 1 days ago [-]
Yup! I feel pretty strongly that every little nitpick and instruction you pass into your model is murdering your output. Having a hook that executes on tool calls is significantly better than telling your agent to follow your repo's specific format/lint/style/test constraints.
cush 16 hours ago [-]
A good agent and harness should notice that an instruction like "You must use tool ABC before calling tool XYZ" is best implemented as a pretooluse hook
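One way the hook version of that rule could look, as a rough sketch. The `Harness` class, the `Hook` signature, and `require_before` are all assumptions made for illustration; real agent harnesses expose hooks through their own configuration formats, which this does not try to reproduce.

```python
from typing import Callable, Optional

# A hook receives the tool name plus the call history and returns a denial
# reason, or None to allow the call. This signature is an assumption.
Hook = Callable[[str, list[str]], Optional[str]]


def require_before(first: str, then: str) -> Hook:
    """Build a hook enforcing: `first` must have run before `then`."""
    def hook(tool: str, history: list[str]) -> Optional[str]:
        if tool == then and first not in history:
            return f"Blocked: you must use tool {first} before calling tool {then}."
        return None
    return hook


class Harness:
    """Hypothetical harness that runs pre-tool-use hooks before each call."""

    def __init__(self, tools: dict[str, Callable[[], str]], hooks: list[Hook]):
        self.tools = tools
        self.hooks = hooks
        self.history: list[str] = []

    def call(self, tool: str) -> str:
        # Run every hook first; the first denial short-circuits the call.
        for hook in self.hooks:
            reason = hook(tool, self.history)
            if reason is not None:
                return reason
        self.history.append(tool)
        return self.tools[tool]()
```

Compared with putting the rule in a prompt, the hook costs no context until it fires, and it cannot be ignored.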
_boffin_ 1 days ago [-]
Can you go into more detail about your setup and use cases?
dhedlund 8 hours ago [-]
The article assumes that the skills need to be used by the same model. If a model like Opus develops a skill that is then used by another model like Qwen3.6, that feels like it could also add value.
TheGoddessInari 20 hours ago [-]
It always feels a bit vexing when people complain about skills. Personally, we treat them like manuals whose goal is to patch knowledge, not (typically) to be a from-scratch primer.
Letting an instruction-following LLM do deep research and iterate has given fantastic results before.
Being able to construct non-trivial Zig 0.16 programs without slowing down for version-hallucinating compilation errors is nice as a random example.
nilirl 1 days ago [-]
I read this post thinking "Finally! Finally someone will explain to me what I've been missing because 'skills' just seem to be re-usable text that help make prompting faster."
Nope. Still the same.
RugnirViking 1 hours ago [-]
Yeah, that's what they are, but that's useful! You have an agents.md that gets put into every conversation. But studies and experience both show that as it gets longer, the agent becomes less capable. So instead of telling it everything useful under the sun, you put only the really important things there, and the rest of the advice, for common but not every-time actions, goes into skills. I personally have like 5 skills. One works with the database and has a bunch of context about the schema, how to connect and work with documents, example queries written just how I like them (and pre-written with filters to reduce the risk of the AI ingesting a million rows' worth of tokens for no reason), and a Python script I wrote to do certain common operations, with notes on how to use it for different tasks.
So in essence, the ideal skill imo is pretty much a list of shell commands with a sentence next to each saying when to use it.
With these, I personally have skills for:
- dealing with our metrics and tracing platform
- dealing with jira
- dealing with confluence (mostly finding info I need via different search strategies without using too many tokens)
- dealing with the database
- doing reviews (this one is more prompting about what info I need to review well myself, rather than commands, though it does instruct the agent to download the branch into a new worktree and clean it up after it's done with specific commands)
I'm generally suspicious of people with hundreds of skills, especially those I open and find AI-generated writing inside. Skills should be a list of commands, maybe with some pitfalls for the agent to avoid, added only by human experience (agents are terrible at prompting).
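The "list of shell commands with a sentence next to each" shape described above could be sketched like this. The `Skill` structure, the `index` helper, and the `../review` path are assumptions for illustration; only the two `git worktree` subcommands are real git commands.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str                 # the short line that is always in context
    commands: list[tuple[str, str]]  # (shell command, when to use it)

    def render(self) -> str:
        """Full skill text, loaded only when the skill is actually needed."""
        lines = [f"# {self.name}", self.description, ""]
        lines += [f"- `{cmd}`: {when}" for cmd, when in self.commands]
        return "\n".join(lines)


def index(skills: list[Skill]) -> str:
    """The cheap part that goes into every conversation."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


# Hypothetical review skill; the worktree path is a placeholder.
REVIEW = Skill(
    name="review",
    description="Check out a branch into a throwaway worktree for review.",
    commands=[
        ("git worktree add ../review <branch>", "before reading the branch"),
        ("git worktree remove ../review", "after the review is done"),
    ],
)
```

Only `index(...)` needs to sit in every conversation; `render()` is paid for when the task actually calls for that skill.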
morelandjs 1 days ago [-]
Agree on article frustrations. Perhaps a better explanation: skills are just disk-cached prompts conditioned on verified success. The "conditioned on verified success" part might seem inconsequential, but it’s the whole thing that gives skills their value. There is also the fact that their loading can be scoped to a certain calling context.
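Read literally, "disk-cached prompts conditioned on verified success" might look like the sketch below. Everything here is an assumption: `save_if_verified`, `load_for_context`, the `.md` file layout, and the `verify` callback (which could be anything from a passing test run to a human sign-off) are illustrative, and matching on the file name is a deliberately crude stand-in for scoping.

```python
from pathlib import Path
from typing import Callable


def save_if_verified(skill_dir: Path, name: str, prompt: str,
                     verify: Callable[[str], bool]) -> bool:
    """Write `prompt` to disk as a skill only if `verify` reports success."""
    if not verify(prompt):
        return False
    skill_dir.mkdir(parents=True, exist_ok=True)
    (skill_dir / f"{name}.md").write_text(prompt, encoding="utf-8")
    return True


def load_for_context(skill_dir: Path, context: str) -> list[str]:
    """Load only the skills whose file name appears in the calling context."""
    return [p.read_text(encoding="utf-8")
            for p in sorted(skill_dir.glob("*.md"))
            if p.stem in context]
```

The cache never stores a prompt that failed its check, and a prompt is only read back when the calling context mentions it.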
nilirl 1 days ago [-]
> conditioned on verified success
Thank you! That made it clear to me why it's a useful caching technique.
JambalayaJim 22 hours ago [-]
Can you elaborate on what "verified success" means?
noodletheworld 1 days ago [-]
Agree; posts like this frustrate me.
Tldr: you're doing it wrong but I will not show you how to do it right. I also did not run the bench using my approach but it definitely “vibes better” to me, and I reject your actual research paper.
Come on, show us some actual skills.
That one you use all the time looks a hell of a lot like “I want a deterministic shell script for something, plus a skill saying ‘run the shell script’”.
Is that what you do? How much time do you spend on them? How do you stop the agent from making a bunch of very similar skills? How do you deal with the explosion of the total number of skills impacting your token use? Do you use skills from github, or is that bad practice? Why?
So many unanswered questions; so little content. :/
tuo-lei 22 hours ago [-]
[flagged]
ben30 22 hours ago [-]
[flagged]
basedrum 1 days ago [-]
Could you publish your gitlab skill to give an example?
oniony 1 days ago [-]
You're probably using adverbs wrongly.
lofaszvanitt 18 hours ago [-]
The solution: you have to beat the AI agents like you did the cow avatar in Black and White because it watered the fields while the temple was on fire :DDD.
whattheheckheck 1 days ago [-]
What if I want a way to open up a latent space prompt without having to type it all out every time?
MisterBiggs 1 days ago [-]
Skills for repetition are totally valid. Having a version control skill that explains that I use gitea works great. My point is that asking for a skill that tells us if our program will get stuck before taking on a halting problem won't get you any further than just starting the task with xhigh thinking.
theowaway213456 1 days ago [-]
TL;DR don't have your agent write skills using only its latent knowledge, otherwise you may as well not use a skill in the first place and let it summon that latent knowledge on the fly.
Not sure if this take is correct though. I suspect self-generated skills help the agent avoid having to "decompress" its latent knowledge, which might save tokens? idk, I am not an expert
solarkraft 1 days ago [-]
It seems so obvious: How would it know better than it already does?
Yet I’ve seen people succeed with "write me a prompt" prompts. The model makes something up, and often it makes sense.
They are like plans in that way: It’s not exactly novel knowledge, but it at least encodes it somewhere to make the process verifiable beforehand and a bit more repeatable.
I wouldn’t be surprised if it improves performance a little, just like thinking blocks do (every model reasons now).
bigcat12345678 1 days ago [-]
I now have rules to not let the agent write any docs or processes. Pretty much anything auto-generated by an LLM has zero reuse value.
imhoguy 1 days ago [-]
Autogenerated content is good scaffolding, but then I have a rule where if I mark a heading with "(by-human)", the section shouldn't be changed by the LLM without permission.
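A rule like this can also be checked mechanically rather than trusted. The sketch below assumes Markdown-style `#` headings and that a section runs until the next heading; both are guesses about the setup, and `protected_sections` and `violations` are hypothetical helpers.

```python
import re

MARK = "(by-human)"


def protected_sections(doc: str) -> dict[str, str]:
    """Map each heading marked (by-human) to the body text under it."""
    sections: dict[str, str] = {}
    heading = None
    for line in doc.splitlines():
        if re.match(r"#+\s", line):
            # A new heading starts; track it only if it carries the marker.
            heading = line.strip() if MARK in line else None
            if heading is not None:
                sections[heading] = ""
        elif heading is not None:
            sections[heading] += line + "\n"
    return sections


def violations(before: str, after: str) -> list[str]:
    """Headings whose protected body was changed or removed in `after`."""
    new = protected_sections(after)
    return [h for h, body in protected_sections(before).items()
            if new.get(h) != body]
```

Run against the file before and after an agent edit, a non-empty result means a protected section was touched and the change should be rejected or escalated.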
cassianoleal 1 days ago [-]
Skills can transfer one session's latent knowledge to all other sessions.
E.g., ask the agent to write a skill, then get it to prompt a subagent to use the skill, then iterate until it verifies the task was completed correctly.
https://github.com/bjcoombs/ai-native-toolkit/blob/main/skil...
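The write, delegate, verify, revise loop described in this comment can be sketched as a small driver. This is not the linked toolkit's implementation; `refine_skill` and its three callbacks are hypothetical, and in practice `run_subagent`, `verify`, and `revise` would each wrap a model call or a test run.

```python
from typing import Callable


def refine_skill(draft: str,
                 run_subagent: Callable[[str], str],
                 verify: Callable[[str], bool],
                 revise: Callable[[str, str], str],
                 max_rounds: int = 5) -> tuple[str, bool]:
    """Iterate on a skill until a fresh subagent completes the task with it."""
    skill = draft
    for _ in range(max_rounds):
        result = run_subagent(skill)   # a session that sees only the skill
        if verify(result):
            return skill, True
        skill = revise(skill, result)  # fold the failure back into the skill
    return skill, False
```

The subagent starts without the authoring session's context, so a pass says the skill text alone carried the knowledge; the round cap keeps a skill that never verifies from looping forever.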