I Decided to Call Them Harness Skills — Breaking the Illusion of Doing Well and Opening Up My Harness#

2026-06-08

Harness Skills — selectively absorbing external skills into your own harness

Facing Things Without a Name#

When I first saw something called LLM Wiki, and then GStack, the first thing that came to mind was, surprisingly: “What should I even call these?”

It was clear that both were means for handling AI agents better. From the perspective of harness engineering — the discipline of designing infrastructure to operate AI agents safely and reliably, which I covered in an earlier piece — these were obviously “tools you reach for when building a harness.”

Yet when I tried to group them under a single word, I got stuck. Calling them “frameworks” felt off. In software, a framework usually means a code scaffold that takes over control flow, the inversion-of-control arrangement of “you plug your code into my rules and I run it,” as with LangChain. LLM Wiki and GStack, by contrast, are less an execution scaffold than structured assets you lay on top of an agent to reference or apply. The word “framework” fails to capture that distinction.

This seemingly trivial naming concern took me much further than I expected.

There Was No Standard Term Yet#

So I researched it directly. To cut to the conclusion: there is still no settled term that pins down exactly what these things are.

Entering 2026, the industry’s vocabulary has been converging rapidly. The definition “Agent = Model + Harness” has taken hold, and the parts that make up a harness are called Harness Components or Harness Primitives. LangChain named Skills a “harness-level primitive.” But there was still no agreed-upon name for the thing itself, “a packaged means you lay on top to strengthen a harness,” such as LLM Wiki or GStack. That seat was empty.

Interestingly, the two are actually distributed through the same mechanism. GStack installs as a skill pack for Claude Code, and LLM Wiki installs as a skill or an AGENTS.md protocol file. Both, in form, are Skills.

So should we just call them “skills”? There’s a problem: that would blur them with the ordinary skills a user writes directly to perform specific tasks. The axis of distinction lies not in form but in purpose. A skill that performs a task and a skill that configures the agent’s operating environment are different things.

| Category | Purpose | Examples |
| --- | --- | --- |
| General skill (task / capability) | Perform a specific task | “Write a blog post,” data lookup |
| Harness skill | Configure the operating environment (memory / guardrails / verification / roles) | LLM Wiki, GStack |

So I decided to call these “Harness Skills,” and a bundle of them a “Harness Skill Pack.” This is my own naming, not a standard term. Still, because the seat was empty, there is no existing standard to clash with, and it is a modest extension that simply adds a purpose modifier to the industry terms “skill” and “skill pack.”

Once I had named it, a more uncomfortable question surfaced.

I Had Been Closed Off#

I actively use Cursor, Claude Code, and Hermes in both my company work and personal projects. I have also built my harness configuration to fit my own usage patterns. Honestly, I thought I was doing harness engineering fairly well.

That’s why I never touched harness skills made by others. The reason was simple: I was afraid that something built by someone else for general use would interfere with my usage patterns. I had tools tuned to my own hand, so I saw no reason to bring in someone else’s way of working and disrupt my flow.

But this time, while attaching the name “harness skill” and peering into that world, I confronted an uncomfortable fact. That defensive attitude had actually been blocking the very possibility of getting better.

The feeling of “doing well” is dangerous. It often slips into the illusion of “there’s nothing more to learn.” While I was satisfied with my own harness, validated patterns refined together by tens of thousands of people were evolving fast on the outside. In guarding my patterns, I had been turning away from the very material that could improve them.

After this realization, I changed my mind. I decided to try the widely loved harness skills one by one — not to tear everything down and replace it, but to pick the good ideas inside them and absorb them into my harness.

So I Decided to Open Up — The Representative Harness Skills#

First, I surveyed what’s out there. Based on GitHub stars and actual adoption, I narrowed down the harness skills loved by many. What matters is which harness component each one strengthens and how invasive it is to an existing environment.

| Name | Popularity / origin | Component strengthened | Invasiveness |
| --- | --- | --- | --- |
| Serena | ~25k★ | Context (code understanding) + session memory | Low |
| LLM Wiki | Karpathy’s proposal | Agent memory / accumulated context | Low |
| Claude Task Master | ~27k★ | Task decomposition + context persistence | Medium |
| GitHub Spec Kit | ~110k★ | Context (spec) + verification loop | Medium |
| OpenSpec | Small/medium | Spec change management (brownfield) | Medium |
| oh-my-claudecode | ~36k★ | Multi-agent orchestration + verification | High |
| BMAD-METHOD | ~37k★ | Multi-agent role boundaries | High |
| SuperClaude | ~23k★ | Personas / commands / memory integration | High |
| GStack | YC Garry Tan | Role boundaries + verification + guardrails | High |

Context / Memory — Low Invasiveness#

Serena is a semantic code tool based on the Language Server Protocol (LSP). It lets an agent explore and edit code at the symbol level rather than reading whole files, dramatically saving tokens while also providing cross-session project memory. Because it’s just one MCP server you attach, it barely touches your existing workflow.
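To make the symbol-level idea concrete, here is a minimal Python sketch. It is not Serena's API (Serena works through language servers attached over MCP); it uses the standard-library `ast` module as a stand-in, and `find_symbol` and the sample source are hypothetical names invented for illustration. The point is only that an agent can be handed one definition instead of the whole file.

```python
import ast
from typing import Optional

# Hypothetical file contents; a real tool would read these from disk.
SOURCE = '''\
import os

def load_config(path):
    """Read a config file."""
    with open(path) as f:
        return f.read()

def unrelated_helper():
    return [i * i for i in range(1000)]
'''

def find_symbol(source: str, name: str) -> Optional[str]:
    """Return only the source text of the named top-level function or class."""
    tree = ast.parse(source)
    for node in tree.body:
        is_definition = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_definition and node.name == name:
            return ast.get_source_segment(source, node)
    return None

snippet = find_symbol(SOURCE, "load_config")
print(snippet)
print(f"{len(snippet)} of {len(SOURCE)} characters sent to the agent")
```

The saving in this toy case is small, but the same lookup against a file of several thousand lines is where the token difference would start to matter.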

LLM Wiki is a pattern proposed by Andrej Karpathy, in which an agent continuously builds and maintains an interlinked Markdown wiki. Unlike RAG, which re-derives answers from raw sources on every query, it compiles the knowledge once and keeps the result current. If you already use repository-based memory (.memory/), it is the closest philosophical fit.
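The "compile once, keep current" contrast can be sketched in a few lines of Python. This is an illustration of the idea under my own assumptions, not the LLM Wiki protocol itself: `WikiCache` and `toy_compile` are hypothetical names, and `toy_compile` stands in for the agent call that would actually write a wiki page.

```python
import hashlib

class WikiCache:
    """Keep one compiled page per source and rebuild it only when the source changes."""

    def __init__(self, compile_page):
        self._compile_page = compile_page  # stand-in for the agent that writes the page
        self._pages = {}                   # source_id -> (digest, page_text)
        self.compile_count = 0

    def page_for(self, source_id: str, raw_text: str) -> str:
        digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        cached = self._pages.get(source_id)
        if cached is not None and cached[0] == digest:
            return cached[1]               # source unchanged: reuse the compiled page
        page = self._compile_page(source_id, raw_text)
        self.compile_count += 1
        self._pages[source_id] = (digest, page)
        return page

def toy_compile(source_id: str, raw_text: str) -> str:
    """Hypothetical compiler: keeps the first line of the source as the summary."""
    lines = raw_text.strip().splitlines()
    first_line = lines[0] if lines else ""
    return f"# {source_id}\n\n{first_line}\n"

wiki = WikiCache(toy_compile)
wiki.page_for("auth", "Tokens expire after 1h.\nMore detail.")
wiki.page_for("auth", "Tokens expire after 1h.\nMore detail.")  # served from cache
wiki.page_for("auth", "Tokens expire after 2h.\nMore detail.")  # source changed
print(wiki.compile_count)
```

A query-time pipeline would pay the compile cost on every call; here it is paid only when the underlying source actually changes.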

Workflow / Spec-Driven — Medium Invasiveness#

Claude Task Master decomposes a requirements document (PRD) into a dependency-aware task graph and scores each task’s complexity to recommend whether to break it down. The standard usage is to attach it to Cursor via MCP.
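A dependency-aware task graph with a complexity check might look like the following sketch. The task names, the scores, and the threshold of 7 are assumptions chosen for illustration; Task Master's own data format and scoring are not reproduced here. The ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks derived from a PRD: each lists its prerequisites and a complexity score.
tasks = {
    "schema": {"deps": [], "complexity": 3},
    "api": {"deps": ["schema"], "complexity": 8},
    "ui": {"deps": ["api"], "complexity": 5},
    "docs": {"deps": ["api"], "complexity": 2},
}

def execution_order(tasks: dict) -> list:
    """Return an order in which every task comes after its prerequisites."""
    graph = {name: set(task["deps"]) for name, task in tasks.items()}
    return list(TopologicalSorter(graph).static_order())

def should_split(tasks: dict, threshold: int = 7) -> list:
    """Recommend breaking down any task whose score reaches the (assumed) threshold."""
    return sorted(name for name, task in tasks.items() if task["complexity"] >= threshold)

print(execution_order(tasks))
print(should_split(tasks))
```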

GitHub Spec Kit, the most-starred tool in this space, is a spec-driven development (SDD) toolkit that works through four phases, “Spec → Plan → Tasks → Implement,” locking the spec before the agent implements anything. Because each phase’s artifact becomes the context for the next, it also has the character of a verification loop. OpenSpec is often compared with it as the option specialized for change management in existing (brownfield) codebases rather than new projects.
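The way each phase feeds the next can be sketched as a simple chain. This is only a schematic of the four-phase flow under my own assumptions, not Spec Kit's implementation: `run_phases` is a hypothetical helper, and `run_phase` stands in for whatever agent call produces each artifact.

```python
PHASES = ["spec", "plan", "tasks", "implement"]

def run_phases(request: str, run_phase) -> dict:
    """Run the phases in order, passing each artifact on as the next phase's context."""
    artifacts = {}
    context = request
    for phase in PHASES:
        artifact = run_phase(phase, context)
        # Gate: refuse to continue on an empty artifact, so a failed phase
        # cannot silently become the context for the next one.
        if not artifact.strip():
            raise ValueError(f"phase {phase!r} produced no artifact")
        artifacts[phase] = artifact
        context = artifact
    return artifacts

# Stand-in for an agent call: wraps the context so the chaining is visible.
def fake_agent(phase: str, context: str) -> str:
    return f"{phase}({context})"

result = run_phases("add login", fake_agent)
print(result["implement"])  # implement(tasks(plan(spec(add login))))
```

The gate between phases is what gives the flow its verification-loop character: nothing downstream runs on an artifact that did not pass.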

All-in-One / Team Simulation — High Invasiveness#

This group consists of strongly opinionated tools. As effective as they are, they also carry the greatest risk of conflict for someone with an established harness.

oh-my-claudecode (OMC) brands itself as “oh-my-zsh for Claude Code,” laying on 32 specialist agents and over 40 skills with zero configuration. It saves tokens through smart model routing (fast models for simple tasks, strong models for complex reasoning) and offers patterns like Ralph mode, which doesn’t stop until verification is complete. BMAD-METHOD simulates an entire agile team with 12+ agents covering Analyst / PM / Architect / QA and more. SuperClaude is a configuration framework that standardizes personas, commands, and session memory into one bundle. GStack, a virtual engineering team skill pack made by Garry Tan of Y Combinator, follows a “thin harness, fat skills” philosophy, making it relatively easy to peel off only the roles you like.

If you want to explore further, the registry VoltAgent/awesome-agent-skills — which gathers skills from official teams like Anthropic, Vercel, and Stripe, alongside the community — is a good starting point.

Don’t Tear It All Down#

Here is a principle I arrived at: the more opinionated a tool, the less you should adopt it wholesale.

For someone with their own harness, installing an all-in-one tool as is can be dangerous. In fact, OMC’s default installation overwrites the global configuration file entirely, so rules you’ve refined over time can vanish in one stroke. Fortunately, there are safeguards like project-scoped installation and preserve mode.
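Independently of whatever safeguards an installer offers, a timestamped backup of the configuration file is a cheap precaution before trying any invasive tool. The sketch below is a generic Python helper, not part of OMC; `backup_config` is a hypothetical name, and the path you pass in is whichever global configuration file your own setup uses.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional

def backup_config(config_path: Path, now: Optional[datetime] = None) -> Optional[Path]:
    """Copy the config next to itself with a timestamp suffix; return the backup path.

    Returns None when there is nothing to back up.
    """
    if not config_path.exists():
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    backup = config_path.with_name(f"{config_path.name}.{stamp}.bak")
    shutil.copy2(config_path, backup)  # copy2 also preserves modification times
    return backup
```

Running it once before an install leaves a copy such as `settings.md.20260608-120000.bak` beside the original, so an overwritten file can be restored by hand.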

So the order of absorption I settled on is this:

  1. Start with the least invasive: apply additive things first, like Serena and LLM Wiki, that merely add to existing patterns.
  2. Try workflows as experiments: experience the flow of Task Master or Spec Kit in a separate project.
  3. Evaluate all-in-one tools in an isolated sandbox: assess tools like OMC and GStack in a separate repository using project-scoped mode, and fold only the patterns you like into your harness.
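The ordering above can be written down as a small plan builder. The invasiveness ratings come from the survey table earlier in this post; the `absorption_plan` helper and the wording of each trial environment are my own shorthand for the three steps.

```python
# (name, invasiveness) pairs, in the same order as the survey table.
TOOLS = [
    ("Serena", "Low"),
    ("LLM Wiki", "Low"),
    ("Claude Task Master", "Medium"),
    ("GitHub Spec Kit", "Medium"),
    ("OpenSpec", "Medium"),
    ("oh-my-claudecode", "High"),
    ("BMAD-METHOD", "High"),
    ("SuperClaude", "High"),
    ("GStack", "High"),
]

RANK = {"Low": 0, "Medium": 1, "High": 2}
TRIAL = {
    "Low": "add to the existing harness",
    "Medium": "try in a separate project",
    "High": "evaluate in an isolated sandbox repository",
}

def absorption_plan(tools):
    """Order tools from least to most invasive and attach where to try each one."""
    ordered = sorted(tools, key=lambda tool: RANK[tool[1]])  # stable: table order kept within a level
    return [(name, level, TRIAL[level]) for name, level in ordered]

for name, level, where in absorption_plan(TOOLS):
    print(f"{level:<6} {name:<20} {where}")
```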

The key is not “accept everything” but “selectively absorb.” For example, even without using OMC wholesale, you can take just the idea of its model routing or its “persist until verified” pattern and apply it to your own harness. A good harness must remain rippable, meaning easy to remove at any time, so external patterns should be trialed under the same principle: added in a way that lets you take them back out.
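As a sketch of what absorbing just those two ideas could look like, here is a minimal Python version of each. Neither function reflects OMC's actual code: the keyword heuristic, the model labels `fast-model` and `strong-model`, and the `max_attempts` cap are all assumptions of mine. The cap in particular is a safety limit I added; the pattern as described simply keeps going until verification passes.

```python
def pick_model(task: str, complex_markers=("refactor", "design", "debug")) -> str:
    """Route by a crude heuristic: long or reasoning-heavy tasks go to the strong model."""
    text = task.lower()
    if len(text) > 200 or any(marker in text for marker in complex_markers):
        return "strong-model"  # hypothetical label, not a real model name
    return "fast-model"        # hypothetical label, not a real model name

def run_until_verified(attempt, verify, max_attempts: int = 5):
    """Retry `attempt` until `verify` accepts the result, up to an assumed cap."""
    for n in range(1, max_attempts + 1):
        result = attempt(n)
        if verify(result):
            return result, n
    raise RuntimeError(f"not verified after {max_attempts} attempts")

print(pick_model("rename a variable"))
print(pick_model("Refactor the auth module"))

# Toy attempt/verify pair: the "work" improves each round until it passes.
result, rounds = run_until_verified(lambda n: n * 10, lambda r: r >= 30)
print(result, rounds)
```

Both functions are plain callables with no dependency on any particular tool, which keeps them easy to remove again if they turn out not to fit.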

Between Confidence and Openness#

What I learned through this episode was not a list of tools but an attitude.

The value of the time spent building and refining my own harness does not change. It is a clear asset. But the moment that asset becomes a fence of “there’s nothing more to learn,” confidence turns into a trap. The feeling of doing well is something we should occasionally stop and question.

Now I call that world by the name “harness skills,” and I am opening the door I had kept shut, examining them one by one. Keeping the value of building my own while staying open to validated patterns from the outside — finding the balance between the two is, I believe, the path to continuing to do harness engineering well.

