AI Skills vs Agents: Don't Build Agents, Build Skills

The talk that reframed everything

In early 2025, Anthropic's engineering team presented a talk at the AI Engineer conference that has since been viewed over 1.3 million times. The core thesis was deceptively simple: stop building monolithic agents. Build skills instead. AI skills vs agents is not just a naming debate. It is an architecture decision that determines whether your AI product ships reliably or collapses under the weight of its own ambition.

At product.engineer, we define a skill as a discrete, composable capability that an AI system can invoke. It has a clear input contract, a defined output, and a single responsibility. An agent, by contrast, is a system that autonomously decides what to do next, often chaining multiple capabilities together in unpredictable ways. The distinction matters because skills are testable, debuggable, and shippable. Agents are often none of those things.

This framing landed hard with me. As a product engineer, you are constantly making architecture decisions that balance ambition with reliability. You want to ship something that works next Tuesday, not something that might work next quarter if you can figure out how to make an autonomous loop behave consistently. The skills-first approach gives you that. It gives you shipping velocity without sacrificing the intelligence your users expect.

Why the industry got agents wrong

Everyone wanted autonomous agents in 2024. The vision was compelling: give an AI a goal, let it figure out the steps, watch it execute. AutoGPT hit 150,000 GitHub stars in its first month. BabyAGI went viral. Devin promised autonomous software engineering. The hype was extraordinary.

The results were not.

A study by METR (Model Evaluation and Threat Research) found that autonomous agents in 2024 succeeded on only 3.5% of real-world software engineering tasks when given full autonomy without human guidance. When given the same tasks with structured tool access and clear skill boundaries, success rates jumped to 26%. That is roughly a 7x improvement from simply constraining the system's autonomy and giving it composable skills instead of open-ended agency.

OpenAI's internal research on GPT-4 agent architectures, presented at their DevDay 2024, showed a similar pattern. Tasks decomposed into discrete function calls with typed inputs and outputs had 4x higher completion rates than tasks given as open-ended instructions to an autonomous loop. The constraint was the feature.

The problem with monolithic agents is threefold:

Compounding errors. Each autonomous decision has a failure probability. Chain ten decisions together and, assuming the steps fail independently, reliability multiplies down: a 90% success rate per step gives you roughly 35% end-to-end reliability across ten steps (0.9^10 ≈ 0.35).
Untestable behavior. You cannot write a unit test for "figure out what to do next." You can write a unit test for "given this input, produce this output." Skills are testable. Autonomous reasoning is not.
Unshippable timelines. Building a reliable autonomous agent requires solving alignment, planning, error recovery, and state management simultaneously. Building a skill requires defining an interface and implementing it.
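
The compounding-error figure is easy to check yourself. A minimal sketch, assuming every step succeeds independently with the same probability (the function name is illustrative):

```typescript
// End-to-end success probability for a chain of independent steps,
// each succeeding with probability `perStep`.
function chainReliability(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

const tenSteps = chainReliability(0.9, 10);  // about 0.349, roughly 35%
const threeSteps = chainReliability(0.9, 3); // about 0.729
```

Cutting a ten-step chain to three steps roughly doubles end-to-end reliability before you improve any single step.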

Skills vs agents: the architecture comparison

The AI skills vs agents debate becomes clearer when you look at the concrete architectural differences:

Dimension	Monolithic Agent	Composable Skills
Autonomy	Self-directed, decides next action	Invoked explicitly with defined inputs
Testability	Integration tests only, flaky	Unit testable per skill
Debuggability	Trace through reasoning chains	Inspect input/output per skill
Shipping speed	Months to reliable behavior	Days to first working skill
Error handling	Agent must self-recover	Caller handles failures explicitly
Composability	Tightly coupled internal logic	Mix and match across contexts
User trust	Unpredictable, hard to explain	Predictable, inspectable
Cost per invocation	High (long reasoning chains)	Low (targeted execution)

This table is not academic. It directly maps to the decisions you make every sprint. When a product engineer sits down to add AI capabilities to a product, the first question should never be "how do I build an agent?" It should be "what skills does this workflow need?"

Anthropic's skill design philosophy

Anthropic's framework, as presented in the talk, centers on three principles that map directly to how builders should think about AI features.

1. Single responsibility per skill

Each skill does one thing. "Summarize this document" is a skill. "Extract structured data from this invoice" is a skill. "Generate a SQL query from natural language" is a skill. "Be a helpful assistant that can do anything" is not a skill. It is a prayer.

The parallel to good software design is obvious. You would not ship a function that takes a string and returns "whatever seems appropriate." You would not merge a pull request for a class called DoEverythingManager. Yet that is exactly what most agent architectures are: a DoEverythingManager backed by a language model instead of a state machine.

2. Typed interfaces with validation

Every skill has a schema. Inputs are typed. Outputs are typed. The system knows what it is getting and what it is returning before execution begins. This is not just good practice; it is what makes composition possible. You can chain skills together precisely because each one declares its contract upfront.

Anthropic's tool-use protocol enforces this. When Claude invokes a tool, it produces structured JSON matching a schema. When it receives a result, that result conforms to a defined structure. There is no ambiguity. No "well, the model seemed to understand." Either the types match or they do not.
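
To make the contract concrete, here is a rough sketch of a tool definition and a boundary check. The object shape (name, description, input_schema) follows the tool definition format in Anthropic's Messages API; the tool itself, its fields, and the hand-rolled validator are assumptions for illustration. A real project would more likely use a JSON Schema validation library.

```typescript
// Hypothetical tool, declared in the name / description / input_schema shape.
const extractInvoiceTool = {
  name: "extract_invoice_fields",
  description: "Extract structured fields from raw invoice text.",
  input_schema: {
    type: "object",
    properties: {
      invoice_text: { type: "string" },
      currency: { type: "string" },
    },
    required: ["invoice_text"],
  },
} as const;

type PrimitiveType = "string" | "number" | "boolean";

interface FlatSchema {
  properties: { [key: string]: { type: PrimitiveType } };
  required: readonly string[];
}

// Checks flat objects only: required fields present, no unknown fields,
// primitive types match. Returns a list of problems (empty means valid).
function validateToolInput(schema: FlatSchema, input: { [key: string]: unknown }): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in input)) errors.push("missing required field: " + key);
  }
  for (const key of Object.keys(input)) {
    if (!Object.prototype.hasOwnProperty.call(schema.properties, key)) {
      errors.push("unexpected field: " + key);
      continue;
    }
    const expected = schema.properties[key].type;
    if (typeof input[key] !== expected) errors.push(key + " should be " + expected);
  }
  return errors;
}

const okErrors = validateToolInput(extractInvoiceTool.input_schema, { invoice_text: "Total: 40 EUR" });
const badErrors = validateToolInput(extractInvoiceTool.input_schema, { currency: 5 });
```

Here `okErrors` is empty, while `badErrors` reports both the missing `invoice_text` and the mistyped `currency`.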

3. Human-in-the-loop by default

Skills are designed to be invoked, not unleashed. The default assumption is that a human (or a system with human-like oversight) decides when to call a skill and what to do with its output. Autonomy is earned incrementally, not granted upfront.

This maps perfectly to how Anthropic ships Claude itself. Claude Code, their developer tool, operates through a skill-based architecture. It has discrete capabilities: read a file, write a file, run a command, search code. Each capability requires explicit permission. The system does not autonomously decide to rewrite your codebase. It proposes changes, waits for approval, then executes defined operations.
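
The propose, approve, execute flow can be sketched in a few lines. This is not Claude Code's implementation, only an illustration of the pattern; every name here is hypothetical.

```typescript
// A proposed operation: what the system wants to do, described before it runs.
interface ProposedAction<T> {
  description: string;
  run: () => T;
}

type Outcome<T> =
  | { status: "executed"; result: T }
  | { status: "rejected" };

// Runs the action only if the approver (a human prompt, a policy check) says yes.
// On rejection nothing runs.
function executeWithApproval<T>(
  action: ProposedAction<T>,
  approve: (description: string) => boolean,
): Outcome<T> {
  if (!approve(action.description)) return { status: "rejected" };
  return { status: "executed", result: action.run() };
}

// Usage: an approver that only allows read operations.
let writes = 0;
const readOnly = (description: string): boolean => description.indexOf("read ") === 0;

const allowed = executeWithApproval(
  { description: "read config.json", run: () => "file contents" },
  readOnly,
);
const denied = executeWithApproval(
  { description: "write config.json", run: () => { writes++; return "written"; } },
  readOnly,
);
```

The write is never executed, so `writes` stays at zero. Earning autonomy incrementally then amounts to loosening the `approve` function, not rewriting the skill.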

The product engineer as skill designer

Here is where the role of the product engineer becomes critical. Deciding which skills to build, how to scope them, and how to compose them into user-facing features is a product decision, not just a technical one.

A traditional software engineer might look at a feature request like "add AI to our support tool" and start building an autonomous agent that reads tickets, searches documentation, drafts responses, and sends them. That is an agent. It is also three months of work that might never ship reliably.

A product engineer looks at the same request and asks: what is the atomic unit of value? What is the smallest skill that makes this tool meaningfully better today? Maybe it is: "given a support ticket, suggest the three most relevant documentation articles." That is a single skill. Typed input (ticket text). Typed output (ranked list of article IDs with relevance scores). Testable. Shippable in a week.

Then you add another skill: "given a ticket and relevant articles, draft a response." Then another: "given a draft response, check it against our tone guidelines." Each skill is independently valuable. Each is testable. And when you compose them, you get something that looks like an agent to the user but is actually a deterministic pipeline of composable skills.
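
The first skill's contract, written down, might look like the sketch below. The field names are illustrative, and the keyword-overlap scoring is only a stand-in so the example runs on its own. A production version would probably call a model or a search index behind the same interface.

```typescript
interface SuggestArticlesInput {
  ticketText: string;
}

interface RankedArticle {
  articleId: string;
  relevance: number; // 0 to 1, higher is more relevant
}

interface Article {
  id: string;
  keywords: string[];
}

// Stand-in implementation: score each article by the share of its keywords
// that appear in the ticket, then return the top three.
function suggestArticles(input: SuggestArticlesInput, articles: Article[]): RankedArticle[] {
  const words = input.ticketText.toLowerCase().split(/\W+/);
  const ranked = articles.map((article) => {
    const hits = article.keywords.filter((k) => words.indexOf(k.toLowerCase()) !== -1).length;
    const relevance = article.keywords.length === 0 ? 0 : hits / article.keywords.length;
    return { articleId: article.id, relevance };
  });
  return ranked.sort((a, b) => b.relevance - a.relevance).slice(0, 3);
}

const knowledgeBase: Article[] = [
  { id: "kb-billing-cycle", keywords: ["invoice", "billing", "charge"] },
  { id: "kb-reset-password", keywords: ["password", "reset", "login"] },
  { id: "kb-export-data", keywords: ["export", "csv", "download"] },
  { id: "kb-api-keys", keywords: ["api", "key", "token"] },
];

const suggestions = suggestArticles(
  { ticketText: "I cannot login after a password reset" },
  knowledgeBase,
);
```

Because the contract is just data in and data out, the same three-example test works whether the body is keyword matching or a model call.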

This is the approach Linear took when adding AI features to their project management tool. They did not build an autonomous project manager agent. They built discrete skills: summarize a thread, extract action items, suggest labels, auto-triage incoming issues. Each skill shipped independently. Each one immediately improved the product. The composition came later, once each individual capability was proven.

Designing skills: a practical framework

After coaching over 12,000 engineers and building AI-powered products across two startups, I have found that skill design follows four stages: define, implement, compose, and evaluate.

Define: scope the skill ruthlessly

A well-scoped skill answers yes to all of these:

Can I describe what it does in one sentence?
Can I write three concrete input/output examples?
Can I test it without running the entire system?
Does it have exactly one reason to change?

If any answer is no, the skill is too broad. Split it.

Implement: treat prompts as code

The skill implementation is not just a prompt. It is a prompt, a schema, validation logic, error handling, and retry behavior. Treat the prompt the same way you treat a function body: version it, review it, test it against regression cases.
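
As a rough sketch of what "more than a prompt" means, here is a sentiment skill with a versioned prompt, output validation, error handling, and bounded retries. The model call is an injected, hypothetical function, and it is synchronous only to keep the example short. A real client would be async.

```typescript
// Injected so the skill can be tested without a network call (hypothetical signature).
type ModelCall = (prompt: string) => string;

type Sentiment = "positive" | "negative" | "neutral";
type SkillResult =
  | { ok: true; value: Sentiment; attempts: number }
  | { ok: false; error: string };

// Version the prompt like any other code artifact.
const PROMPT_VERSION = "sentiment-v2";
const buildPrompt = (text: string): string =>
  "Classify the sentiment of the text as positive, negative, or neutral. " +
  "Reply with one word.\n\nText: " + text;

// Output validation: accept only the three allowed labels.
function parseSentiment(raw: string): Sentiment | null {
  const cleaned = raw.trim().toLowerCase();
  if (cleaned === "positive" || cleaned === "negative" || cleaned === "neutral") return cleaned;
  return null;
}

function classifySentiment(text: string, callModel: ModelCall, maxAttempts = 2): SkillResult {
  if (text.trim().length === 0) return { ok: false, error: "empty input" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const parsed = parseSentiment(callModel(buildPrompt(text)));
      if (parsed !== null) return { ok: true, value: parsed, attempts: attempt };
    } catch {
      // Treat a thrown error like an invalid answer and move on to the next attempt.
    }
  }
  return {
    ok: false,
    error: "no valid output after " + maxAttempts + " attempts (" + PROMPT_VERSION + ")",
  };
}

// Usage with a fake model that answers badly once, then correctly.
let calls = 0;
const flakyModel: ModelCall = () => (++calls === 1 ? "I think it is good?" : " Positive ");
const result = classifySentiment("Great product, fast shipping", flakyModel);
```

The retry is bounded and the failure is a value, so the caller decides what happens next.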

Stripe's approach to their AI-powered fraud detection illustrates this. Each detection signal (velocity check, geographic anomaly, amount outlier) is a discrete skill with typed inputs and a confidence score output. The skills do not autonomously decide whether to block a transaction. They provide signals that a composition layer aggregates into a decision. This architecture lets them add new detection skills without touching existing ones.

Compose: build pipelines, not loops

Composition is where skills become features. But composition should be a directed pipeline, not an unbounded loop. The user's action triggers a defined sequence of skills, each passing its output to the next. If a skill fails, the pipeline has an explicit fallback, not a "try again and hope for the best" loop.
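
A minimal sketch of a directed pipeline with an explicit fallback, under the assumption that each skill reports failure as a value. The two skills are stubs with hypothetical names so the example is self-contained.

```typescript
// Each step either produces a value or reports a failure.
type StepResult<T> = { ok: true; value: T } | { ok: false; reason: string };

interface Draft {
  body: string;
  source: "generated" | "fallback-template";
}

// Stubbed skills (hypothetical): real ones would call a search index and a model.
function findArticles(ticket: string): StepResult<string[]> {
  return ticket.toLowerCase().indexOf("password") !== -1
    ? { ok: true, value: ["kb-reset-password"] }
    : { ok: false, reason: "no matching articles" };
}

function draftReply(ticket: string, articleIds: string[]): StepResult<string> {
  return { ok: true, value: "Re: " + ticket.slice(0, 40) + "\nSee: " + articleIds.join(", ") };
}

const FALLBACK_REPLY = "Thanks for reaching out. A support engineer will follow up shortly.";

// Fixed order, one pass, explicit fallback. No "try again and hope" loop.
function handleTicket(ticket: string): Draft {
  const articles = findArticles(ticket);
  if (articles.ok === false) return { body: FALLBACK_REPLY, source: "fallback-template" };

  const reply = draftReply(ticket, articles.value);
  if (reply.ok === false) return { body: FALLBACK_REPLY, source: "fallback-template" };

  return { body: reply.value, source: "generated" };
}

const generated = handleTicket("Password reset email never arrives");
const fallback = handleTicket("Where is my invoice?");
```

The `source` field records which path produced the draft, so fallback rates can be measured without reading logs.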

Think of it like Unix pipes. cat file | grep pattern | sort | head -10 is a composition of skills. Each one is independently useful. The composition is powerful precisely because each component is simple and predictable. You would never build a Unix tool that says "figure out what the user probably wants and do that." Yet that is what autonomous agent architectures attempt.

Evaluate: measure per-skill metrics

Each skill gets its own success metric. Not "is the overall system good" but "does this specific skill produce correct output for its defined inputs?" This granularity is what makes the system improvable. When something breaks, you know exactly which skill failed and can fix it in isolation.

Vercel's AI features follow this pattern. Their v0 product (AI-powered UI generation) is not one monolithic model that generates entire applications. It is a composition of skills: understand the request, generate component structure, produce Tailwind styles, validate accessibility, render preview. Each skill has independent quality metrics. Each can be improved without regressing others.

The AI skills vs agents spectrum

Skills and agents are not a binary choice. They exist on a spectrum. The insight from Anthropic is that you should start at the skills end and move toward agency only when individual skills are proven reliable.

The spectrum looks like this:

1. Pure skill - Single invocation, typed I/O, no state. Example: "classify this text's sentiment."
2. Skill chain - Fixed pipeline of skills, deterministic order. Example: "extract entities, then link them to your database, then generate a summary."
3. Conditional skill routing - A simple router decides which skill to invoke based on input classification. Example: "if the user is asking about billing, invoke the billing skill; if about technical issues, invoke the debug skill."
4. Orchestrated skills - A coordinator plans which skills to invoke and in what order, but each skill is still discrete and testable. Example: Claude Code reading context, deciding to search, then editing a file.
5. Autonomous agent - The system decides goals, plans steps, executes skills, and evaluates its own progress without human intervention.
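
Conditional skill routing is small enough to sketch in full. The keyword classifier below is a deliberately simple stand-in (in practice it might be a small model call), and the route and handler names are made up for the example.

```typescript
type Route = "billing" | "technical" | "unknown";

// Classify the request. Checked in a fixed order, so the result is deterministic.
function classifyRequest(message: string): Route {
  const text = message.toLowerCase();
  if (/\b(invoice|refund|charge|billing)\b/.test(text)) return "billing";
  if (/\b(error|crash|bug|timeout)\b/.test(text)) return "technical";
  return "unknown";
}

// One handler per route. The router picks a skill; it does not plan or loop.
const handlers: Record<Route, (message: string) => string> = {
  billing: (m) => "billing-skill handled: " + m,
  technical: (m) => "debug-skill handled: " + m,
  unknown: () => "escalated to a human",
};

function route(message: string): string {
  return handlers[classifyRequest(message)](message);
}

const billingReply = route("I need a refund for last month");
const technicalReply = route("The app shows a timeout error");
const unknownReply = route("hello there");
```

Typing `handlers` as `Record<Route, ...>` means adding a new route without a handler fails at compile time instead of at runtime.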

Most production AI features that actually work live at levels 2 through 4. The teams that jump straight to level 5 tend to demo well and ship poorly.

Notion's AI implementation is a textbook level-3 system. When you ask Notion AI a question, it classifies your request, routes to the appropriate skill (summarize page, generate text, translate, extract items), executes that skill, and returns the result. It feels intelligent to users but is architecturally composed of discrete, testable skills with a thin routing layer.

Real-world implementation patterns

Pattern 1: The skill registry

Build a central registry of available skills. Each skill registers itself with a name, description, input schema, and output schema. Orchestrators query the registry to determine which skills are available for a given context. This is how Anthropic structures Claude's tool-use system and how Shopify structures their AI commerce features.

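// Minimal supporting types so the interface stands alone (illustrative;
// swap in the types from your schema library).
type JSONSchema = Record<string, unknown>;
type ValidationResult = { valid: true } | { valid: false; errors: string[] };

interface Skill {
  name: string;              // unique key in the registry
  description: string;       // what the skill does, in one sentence
  inputSchema: JSONSchema;   // contract for accepted input
  outputSchema: JSONSchema;  // contract for produced output
  validate(input: unknown): ValidationResult;  // check input before running
  execute(input: unknown): Promise<unknown>;   // run the skill
}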
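
A registry built around that kind of interface can be very small. The sketch below is one possible shape, not how Anthropic or Shopify implement theirs. It stores only skill metadata, and the `tags` field for context lookup is an addition made for this example.

```typescript
type JSONSchema = { [key: string]: unknown };

interface RegisteredSkill {
  name: string;
  description: string;
  inputSchema: JSONSchema;
  outputSchema: JSONSchema;
  tags: string[]; // contexts the skill applies to (assumption for this sketch)
}

class SkillRegistry {
  private skills: RegisteredSkill[] = [];

  // Names must be unique so orchestrators can refer to skills unambiguously.
  register(skill: RegisteredSkill): void {
    if (this.skills.some((s) => s.name === skill.name)) {
      throw new Error("duplicate skill name: " + skill.name);
    }
    this.skills.push(skill);
  }

  // Skills available for a given context tag, in registration order.
  forContext(tag: string): RegisteredSkill[] {
    return this.skills.filter((s) => s.tags.indexOf(tag) !== -1);
  }
}

const registry = new SkillRegistry();
registry.register({
  name: "summarize_thread",
  description: "Summarize a discussion thread in three sentences.",
  inputSchema: { type: "object", required: ["messages"] },
  outputSchema: { type: "object", required: ["summary"] },
  tags: ["support", "project"],
});
registry.register({
  name: "suggest_labels",
  description: "Suggest labels for an incoming issue.",
  inputSchema: { type: "object", required: ["title", "body"] },
  outputSchema: { type: "object", required: ["labels"] },
  tags: ["project"],
});

const supportSkills = registry.forContext("support").map((s) => s.name);
const projectSkills = registry.forContext("project").map((s) => s.name);
```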

Pattern 2: The skill composer

A thin layer that takes a user intent, decomposes it into a skill sequence, and executes that sequence. The composer is deterministic, not generative. Given the same intent classification, it always produces the same skill sequence.
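
One way to keep the composer deterministic is to make the plan a lookup table and leave the model out of it. A minimal sketch with invented intent and skill names follows; the skills here only record that they ran.

```typescript
type Intent = "answer_ticket" | "triage_ticket";
type SkillName =
  | "classify_intent"
  | "retrieve_docs"
  | "draft_response"
  | "check_tone"
  | "suggest_labels";

// The plan is data, not model output: same intent in, same sequence out.
const PLANS: Record<Intent, readonly SkillName[]> = {
  answer_ticket: ["retrieve_docs", "draft_response", "check_tone"],
  triage_ticket: ["classify_intent", "suggest_labels"],
};

type SkillFn = (state: string[]) => string[];

// Runs the planned skills in order, threading state from one to the next.
function compose(intent: Intent, skills: Record<SkillName, SkillFn>): string[] {
  let state: string[] = [];
  for (const name of PLANS[intent]) {
    state = skills[name](state);
  }
  return state;
}

// Usage: tracing skills that append their own name to the state.
const trace = (name: SkillName): SkillFn => (state) => state.concat(name);
const tracingSkills: Record<SkillName, SkillFn> = {
  classify_intent: trace("classify_intent"),
  retrieve_docs: trace("retrieve_docs"),
  draft_response: trace("draft_response"),
  check_tone: trace("check_tone"),
  suggest_labels: trace("suggest_labels"),
};

const answerRun = compose("answer_ticket", tracingSkills);
const triageRun = compose("triage_ticket", tracingSkills);
```

Changing what a feature does becomes a reviewable diff to `PLANS` instead of a prompt tweak.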

Pattern 3: The skill evaluator

A testing harness that runs each skill against a golden dataset of input/output pairs. Every PR that modifies a skill must pass the evaluator. This is your regression safety net. Without it, you are shipping hope.

PostHog applies this pattern to their AI-powered analytics features. Each AI capability (natural language to SQL query, anomaly explanation, funnel suggestion) is a discrete skill with its own evaluation dataset. A skill only ships when it passes its golden dataset at 95% or above.
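
A golden-dataset evaluator does not need a framework to get started. The sketch below assumes exact-match comparison, which suits classification and extraction skills; generated text usually needs a looser scorer. The 95% default mirrors the threshold mentioned above, and all names are illustrative.

```typescript
interface GoldenCase<I, O> {
  input: I;
  expected: O;
}

interface EvalReport {
  passed: number;
  total: number;
  passRate: number;
  shippable: boolean;
}

// Runs a skill over its golden dataset and compares the pass rate to a threshold.
// An empty dataset is never shippable.
function evaluateSkill<I, O>(
  skill: (input: I) => O,
  golden: GoldenCase<I, O>[],
  threshold = 0.95,
): EvalReport {
  const passed = golden.filter(
    (c) => JSON.stringify(skill(c.input)) === JSON.stringify(c.expected),
  ).length;
  const total = golden.length;
  const passRate = total === 0 ? 0 : passed / total;
  return { passed, total, passRate, shippable: total > 0 && passRate >= threshold };
}

// Usage with a toy skill: label an amount as "high" or "low".
const labelAmount = (amount: number): string => (amount > 100 ? "high" : "low");
const report = evaluateSkill(labelAmount, [
  { input: 50, expected: "low" },
  { input: 500, expected: "high" },
  { input: 100, expected: "high" }, // boundary case the toy skill gets wrong
  { input: 101, expected: "high" },
]);
```

With three of four cases passing, `report.shippable` is false, and the failing boundary case tells you where to look.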

What this means for your career

If you are a product engineer working on AI features, the skills-first approach should change how you plan your sprints, write your design docs, and scope your work.

Instead of writing a design doc titled "Build an AI Agent for Customer Support," write one titled "Design Three Skills for Support Ticket Triage." Instead of estimating "3 months to ship an agent," estimate "1 week per skill, 4 skills for v1." Instead of a demo that shows an autonomous agent sometimes doing the right thing, ship a skill that always does one thing correctly.

This is where my experience as a product engineer at AWS shaped my thinking significantly. At AWS scale, reliability is not optional. A feature that works 80% of the time is not a feature; it is a liability. When we shipped AI-powered capabilities, they were always structured as discrete operations with defined contracts, not as autonomous systems that might or might not produce the right answer. That discipline, shipping narrow capabilities that are reliable over broad capabilities that are impressive in demos, is what separates production AI from conference-talk AI.

The market is rewarding this approach. Companies that successfully ship AI features in production overwhelmingly use a skills-based or tool-use architecture rather than an autonomous agent pattern. Fully autonomous agents take longer to reach production and cost more to maintain.

The connection to context engineering and agentic patterns

Building skills well requires exceptional context engineering. Each skill needs precisely the right context to execute correctly, not too much (which dilutes focus and increases cost), not too little (which causes failures). The art of designing a skill's input schema is really the art of defining what context it needs.

This also connects to agentic engineering more broadly. Skills are the building blocks of agentic systems. You cannot build a good agent without good skills, but you can ship good skills without an agent. Start with the building blocks. Agency emerges from composition, not from ambition.

The product engineer who masters skill design, who can identify the atomic units of AI capability that a product needs, scope them precisely, ship them independently, and compose them into features, will be the most valuable builder on any AI product team. This is harness engineering applied to AI: defining the constraints that make the system reliable, not just capable.

Getting started: your first skill in production

Here is a concrete workflow for shipping your first AI skill:

1. Identify one user workflow that involves judgment, classification, or generation. Not the most complex one. The most frequent one.
2. Define the input/output contract. What data goes in? What structure comes out? Write it as a TypeScript interface or JSON schema.
3. Collect 20 golden examples. Real inputs from your product, paired with ideal outputs. This is your evaluation dataset.
4. Implement the skill. Prompt, schema validation, error handling. Keep it under 100 lines.
5. Evaluate against your golden set. If accuracy is below 90%, adjust the prompt or add constraints. Do not add autonomy.
6. Ship it behind a feature flag. Measure real-world accuracy. Gather user feedback on output quality.
7. Add the next skill. Repeat. Compose when two or more skills serve the same workflow.

This is not slow. A skilled builder can ship a production-quality AI skill in three to five days using this workflow. In a month, you have four to six composable skills that together look like "an AI-powered feature" to users while being individually testable and maintainable to your team.

Key takeaways

AI skills are discrete, testable capabilities with typed inputs and outputs; agents are autonomous systems that often fail in production.
Structured skill-based tasks succeed at 7x the rate of equivalent tasks given to fully autonomous agents.
Start with individual skills, then compose them into pipelines; autonomy should be earned incrementally, not granted upfront.
A skilled product engineer can ship a production-quality AI skill in three to five days using the define-implement-compose-evaluate workflow.
Companies that successfully ship AI features in production overwhelmingly use a skills-based or tool-use architecture rather than an autonomous agent pattern.

FAQ

What is the difference between AI skills and AI agents?

An AI skill is a discrete, composable capability with typed inputs and outputs that performs a single well-defined task. An AI agent is an autonomous system that decides what actions to take, plans multi-step sequences, and executes without explicit human direction. Skills are testable, predictable, and shippable. Agents are powerful in theory but often unreliable in production. The key architectural difference is that skills are invoked explicitly while agents self-direct.

Should I ever build an autonomous agent?

Yes, but only after you have a proven library of reliable skills. Autonomy should be earned incrementally. Start with discrete skills, then chain them into fixed pipelines, then add conditional routing, then add planning. Each layer of autonomy should only be added when the layer below is stable and well-tested. Jumping straight to full autonomy is why most agent projects fail.

How does the skills-first approach affect shipping speed?

It dramatically increases it. Individual skills can be shipped in days because they have narrow scope, typed contracts, and clear success criteria. Teams using skill-based architectures report 3x faster time-to-first-value compared to teams attempting monolithic agents. You ship working AI features in week one instead of demoing a brittle agent in month three.

What tools support building AI skills?

Anthropic's tool-use API, OpenAI's function calling, and Vercel's AI SDK all provide typed schemas for defining skills. For testing, frameworks like Braintrust, Humanloop, and Promptfoo let you evaluate skills against golden datasets. For composition, LangGraph, Instructor, and custom orchestration layers handle skill sequencing. The tooling ecosystem strongly favors the skills pattern.

How many skills does a typical AI feature need?

Most production AI features are composed of three to seven skills. A customer support system might need: classify ticket intent, retrieve relevant documentation, draft response, check tone compliance, and extract follow-up actions. Each skill ships independently, and the feature improves incrementally as each skill is added. You do not need all skills on day one.