Understanding Anthropic's Official Guide to Tool Use and Agents

Chronist Team

An AI agent’s performance heavily depends on the quality of tools it’s given. In real-world settings, vague or unevaluated tool specifications often lead to incorrect invocations, unnecessary costs, and incomplete tasks. Anthropic’s “Writing tools for agents” covers everything from prototype creation and evaluation methods to automated improvement with Claude and fundamental tool design principles.

This article breaks down that content and organizes it in a readily actionable format for practical use.


Overall Guide Structure

[Figure: Overview of Anthropic's "Writing tools for agents" guide]

Source: https://www.anthropic.com/engineering/writing-tools-for-agents

The guide centers on designing tools in agent-friendly forms and repeating cycles of evaluation and improvement.

The flow is simple: first create a prototype, design and run evaluations, then improve based on the results with Claude's help. The guide also presents principles to keep in mind during design, such as tool selection and naming, strategies for returning information, token-efficiency optimization, and description writing.

Tasks used in evaluation should be as close to real operations as possible, and verification should assume that multiple valid solutions exist rather than a single correct answer.

Adopting this mindset makes it easier to treat evaluation and improvement as a standard part of the process from the implementation stage onward, enabling rapid quality gains.

Tool Definition and Differences from Traditional Software

Traditional APIs and functions return the same output for the same input. Agents, in contrast, may take different actions under the same conditions. Tools are therefore a new kind of software that bridges deterministic systems and non-deterministic agents.

| Perspective | Traditional APIs/Functions | Agent Tools | Practical Points |
|---|---|---|---|
| Invocation nature | Consistent input-output | Whether, and in what order, tools are used varies | Assume failures and design for easy retries |
| Design focus | Type and communication accuracy | Ergonomics (clarity, tolerance for misuse) | Enrich natural-language descriptions and examples |
| Success criteria | Return value matches | Task completion against evaluation criteria | Evaluate assuming multiple correct paths |
| Return data | Centered on technical IDs | Meaningful context takes priority | Return names, summaries, and related metadata |

Understanding this difference makes clear that the design of error responses and returned context determines success rates.

Implementation and Evaluation Approach

Here we break the process recommended in the official article down to an actionable level of detail.

Building Prototypes: Utilizing Claude Code and MCP

You don’t need perfection from the start. It’s important to create and test a working version using Claude Code.

  • Loading necessary API or SDK documentation (such as llms.txt format) improves generation accuracy

  • Tools can be wrapped with MCP servers or Desktop Extensions (DXT) and tested by connecting from Claude Code or Claude Desktop.

Connect using claude mcp add <name> <command>, and for Desktop, configure via “Settings > Developer” or “Settings > Extensions.”

  • Tools can also be passed directly to the API for programmatic testing (a minimal sketch follows at the end of this subsection).

The recommended flow is to try the tools yourself first, experiencing usability issues directly and improving them.
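For the direct-API route, a minimal sketch using the Anthropic Python SDK is shown below. The tool (search_logs) and its schema are illustrative assumptions, and the model ID should be swapped for whichever model you are actually testing.

```python
# Minimal sketch: passing a tool definition directly to the Messages API.
# The search_logs tool and its schema are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

search_logs_tool = {
    "name": "search_logs",
    "description": (
        "Search application logs and return only matching lines plus a few lines "
        "of surrounding context. Prefer narrow queries with a date range."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search keywords, e.g. 'duplicate charge'"},
            "start_date": {"type": "string", "description": "YYYY-MM-DD, e.g. 2025-09-01"},
            "end_date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["query"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # replace with the model you are evaluating
    max_tokens=1024,
    tools=[search_logs_tool],
    messages=[{"role": "user", "content": "Investigate last week's duplicate billing reports."}],
)

# Inspect which tool the agent chose and with which arguments.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```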

Designing and Operating Comprehensive Evaluations

Once you have a prototype, next comes evaluation.

Strong vs. Weak Tasks

Evaluation tasks should be complex and close to reality.

  • Strong Examples

Schedule next week’s meeting, attach previous minutes, and reserve a meeting room. / Investigate duplicate billing impact and extract related logs. / Respond to cancellation requests by identifying reasons, making proposals, and confirming risks.

  • Weak Examples

Simply register one meeting. / Find just one log entry.

Evaluation Design Strategies

Simple string matching is insufficient for judging correctness. Design flexible grading that tolerates differences in wording and accepts multiple solution paths, as in the sketch below. You can specify expected tool sequences, but fixing them too rigidly risks overfitting.
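As one way to implement this, the sketch below grades the end state of a task with independent checks instead of exact string matching, so different tool sequences that reach the same outcome all count as success. The task structure and names (EvalTask, outcome_checks) are assumptions for illustration, not part of the official guide.

```python
# Illustrative outcome-based grading; names and structure are assumptions.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    # Independent checks against the final state, so any solution path that
    # reaches the same outcome counts as success.
    outcome_checks: list[Callable[[dict], bool]] = field(default_factory=list)

def grade(task: EvalTask, final_state: dict) -> dict:
    results = [check(final_state) for check in task.outcome_checks]
    return {"success": all(results), "passed": sum(results), "total": len(results)}

# Example: the scheduling task succeeds if an event exists, a room is reserved,
# and the minutes are attached, regardless of which tools produced that state.
task = EvalTask(
    prompt="Schedule next week's meeting, attach previous minutes, and reserve a room.",
    outcome_checks=[
        lambda state: state.get("event_created", False),
        lambda state: state.get("room_reserved", False),
        lambda state: state.get("minutes_attached", False),
    ],
)
print(grade(task, {"event_created": True, "room_reserved": True, "minutes_attached": True}))
```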

Evaluation Output Design

During evaluation, having the agent output its reasons and feedback before each tool invocation, in addition to the final answer, makes improvement points easier to identify. With Claude, Interleaved Thinking enables this flow naturally.

Preparing such composite challenges makes it easier to verify real-world operational resilience.

Result Analysis and Improvement

Evaluation provides more than just “correctness of answers.” Logs reveal which tools weren’t called, which tools misfired, and whether redundant invocations occurred.

  • If unnecessary calls are frequent, add pagination or filtering.

  • If the same procedure repeats, consolidate multiple processes into an “integrated tool.”

  • If errors are common, improve tool descriptions or parameter names.

Furthermore, feeding the evaluation logs to Claude Code in one batch enables batch refactoring, quickly bringing naming conventions and descriptions into a consistent state. A sketch of the kind of log analysis that surfaces these signals follows.
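As an illustration of pulling these signals out of evaluation logs, the sketch below tallies per-tool call counts, errors, and redundant invocations from a list of transcript events. The event format is an assumption for illustration; adapt it to whatever your harness records.

```python
# Illustrative log analysis; the event dict format is an assumption.
from collections import Counter

def summarize_tool_usage(events: list[dict]) -> dict:
    """Aggregate per-tool call counts, error counts, and redundant calls."""
    calls, errors, redundant = Counter(), Counter(), Counter()
    seen: set[tuple] = set()

    for event in events:
        if event.get("type") != "tool_call":
            continue
        name = event["tool"]
        calls[name] += 1
        if event.get("is_error"):
            errors[name] += 1
        # The same tool called twice with identical input suggests redundancy.
        key = (name, str(sorted(event.get("input", {}).items())))
        if key in seen:
            redundant[name] += 1
        seen.add(key)

    return {"calls": dict(calls), "errors": dict(errors), "redundant": dict(redundant)}

print(summarize_tool_usage([
    {"type": "tool_call", "tool": "search_logs", "input": {"query": "billing"}},
    {"type": "tool_call", "tool": "search_logs", "input": {"query": "billing"}},
    {"type": "tool_call", "tool": "get_customer_context", "input": {"user_id": "u1"}, "is_error": True},
]))
```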

Automated Optimization and Batch Refactoring with Claude

Feed the logs obtained from evaluation to Claude Code in one batch and have it review all of the tools at once.

  • Identify and organize description inconsistencies, unifying naming conventions.

  • Standardize error response templates to increase retry success rates.

  • Bulk corrections improve “overall consistency” that manual work often misses.

  • In internal cases, tools optimized with Claude achieved higher accuracy on held-out tasks than manually written implementations.

Incorporating this process into evaluation cycles enables rapid, broad quality improvement.

Utilizing Interleaved Thinking and Output Design

Slightly refining evaluation prompt design makes improvement points easier to identify.

  • Instruct output of “reasons and feedback” before tool invocation.

  • Using Interleaved Thinking enables natural tracking of thinking → tool → thinking flows.

  • This allows concrete analysis of why specific tools were or weren’t chosen.

Simply deciding where this output goes raises the resolution of your evaluations significantly. A minimal API sketch follows.
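A minimal sketch of requesting thinking between tool calls with the Anthropic Python SDK is shown below. The beta header name, token budget, and tool definition are assumptions; check them against the current Anthropic API documentation before relying on them.

```python
# Sketch: requesting interleaved thinking alongside tool use.
# The beta flag name and budget value are assumptions; verify against the docs.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "search_logs",
    "description": "Search logs; prefer narrow queries with a date range.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # replace with your evaluation model
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # assumed flag name
    tools=tools,
    messages=[{"role": "user", "content": "Investigate the duplicate billing reports. "
                                          "Before each tool call, explain why you chose that tool."}],
)

# Thinking blocks appear between tool_use blocks, making the tool-selection
# reasoning visible in the evaluation transcript.
for block in response.content:
    print(block.type)
```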

Design Principles (Selection, Namespacing, Context Return, Efficiency, Description)

The principles cover several facets, so we contrast the practical points with typical failure patterns.

| Principle | Practical Points | Typical Failures | Notes |
|---|---|---|---|
| Tool selection | Select carefully by natural task units | Mechanically wrapping every API into a "toolbox" | Consolidate according to practical chunks of work |
| Namespacing | Make boundaries intuitively clear in names | Overlapping roles increase misuse | Decide prefix/suffix style through evaluation |
| Context return | Return only high-signal information | Wasting context on clusters of UUIDs or verbose metadata | Prioritize names, summaries, and lightweight IDs |
| Efficiency design | Paging, filtering, ranges, truncation | One-shot bulk retrieval leads to token blowups | State the default policy clearly in the tool description |
| Description optimization | Eliminate ambiguity with examples | Ambiguous parameters such as user | Be specific, e.g. user_id |

Applying improvements one by one while referencing this table is effective. An example tool definition applying several of these points follows.
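A hedged example of a tool definition that applies several of these points (namespaced name, high-signal description, specific parameter names) is shown below. The name, description, and parameters are illustrative assumptions rather than tools from the guide.

```python
# Illustrative tool definition; name, description, and parameters are assumptions.
crm_search_user_activity = {
    # Namespaced, task-level name: service prefix plus a clear resource.
    "name": "crm_search_user_activity",
    "description": (
        "Search a customer's recent activity in the CRM and return a short summary "
        "plus key metadata (name, plan, open tickets). Use a narrow date range; "
        "results are paginated, 20 items per page."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            # Specific parameter name ("user_id", not "user") with an example value.
            "user_id": {"type": "string", "description": "Customer identifier, e.g. 'usr_0192'"},
            "start_date": {"type": "string", "description": "YYYY-MM-DD, e.g. 2025-09-01"},
            "page": {"type": "integer", "description": "1-based page number, default 1"},
        },
        "required": ["user_id"],
    },
}
```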

Response Format and Context Design (ResponseFormat/ID Design)

Format and ID handling directly impact accuracy and readability.

| Design Element | Recommendation | Reason | Alternatives/Notes |
|---|---|---|---|
| Response format | Choose JSON/Markdown/XML via evaluation | Align with the LLM's training distribution | The optimum varies by task type |
| Granularity control | ResponseFormat = {DETAILED, CONCISE} | Context costs vary | Further subdivision is possible |
| Identifiers | Prefer natural language plus lightweight IDs | Reduces misuse and hallucination | Use technical IDs only when necessary |
| Example: threads | CONCISE returns the body only; DETAILED also includes thread_ts etc. | Guides the next tool invocation | 0-based indices are also effective |

Preparing format and granularity "switches" upfront facilitates later optimization; a sketch of such a switch follows.
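As one way to implement such a switch, the sketch below adds a response_format parameter to a thread-retrieval tool, following the thread example in the table; the function and field names (get_thread, thread_ts) and the implementation details are assumptions.

```python
# Sketch of a response_format switch; implementation details are assumptions.
from enum import Enum

class ResponseFormat(str, Enum):
    CONCISE = "concise"
    DETAILED = "detailed"

def get_thread(thread: dict, response_format: ResponseFormat = ResponseFormat.CONCISE) -> dict:
    messages = thread["messages"]
    if response_format is ResponseFormat.CONCISE:
        # Body text only: the cheapest form that still lets the agent reason.
        return {"messages": [m["text"] for m in messages]}
    # DETAILED adds the identifiers (channel, thread_ts) the agent needs for
    # follow-up tool calls such as replying in the thread.
    return {
        "channel": thread["channel"],
        "thread_ts": thread["thread_ts"],
        "messages": [
            {"index": i, "user": m["user"], "text": m["text"], "ts": m["ts"]}
            for i, m in enumerate(messages)  # 0-based indices as lightweight IDs
        ],
    }
```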

Token Efficiency and Error Design

Costs and success rates greatly depend on default behavior design.

  • Truncate long responses by default, and return instructions for fetching the continuation when needed (Claude Code caps tool responses at 25,000 tokens by default).

  • Encourage narrow searches (filters and range specifications), and state in tool descriptions that one-shot bulk retrieval should be avoided.

  • Return helpful errors.

Example: "start_date must be in YYYY-MM-DD format, e.g. 2025-09-01. project_id was not specified; please retry with it." Return the minimal information needed to retry, plus a brief example.

Designing defaults that avoid wasteful calls also reduces post-deployment optimization costs. A sketch of both patterns follows.
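The sketch below shows both patterns: a default truncation that tells the agent how to fetch more, and a validation error that carries just enough to retry. The character limit, parameter names, and message wording are illustrative assumptions.

```python
# Sketch of default truncation and retry-friendly validation errors.
# Limits, parameter names, and wording are illustrative assumptions.
import re

MAX_RESULT_CHARS = 8_000  # illustrative cap on a single tool response

def truncate_response(text: str) -> str:
    if len(text) <= MAX_RESULT_CHARS:
        return text
    return (
        text[:MAX_RESULT_CHARS]
        + "\n[Truncated. Re-run with a narrower date range or a more specific query, "
          "or pass page=2 to fetch the next chunk.]"
    )

def validate_search_input(params: dict) -> str | None:
    """Return a helpful error message, or None if the input is valid."""
    start = params.get("start_date", "")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", start):
        return "start_date must be YYYY-MM-DD, e.g. 2025-09-01. Please retry."
    if "project_id" not in params:
        return "project_id was not specified. Pass project_id (e.g. 'proj_123') and retry."
    return None
```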

Practical Namespacing and Tool Selection

Naming directly impacts “selection error reduction,” so decide through evaluation.

| Method | Example | Strength | Caution |
|---|---|---|---|
| Prefix | asana_search, jira_search | Clear service boundaries | Needs a separate strategy for resource granularity |
| Suffix | search_asana_projects | Shows the resource type | Names can become lengthy |
| Integrated design | schedule_event | Handles a chained task in one operation | Tool description and parameter design are key |

Since the best fit varies by LLM, select a method based on evaluation scores.

Integrated Tool Design Examples

Integrating tools around real task units shortens the agent's search for a workable strategy.

  • schedule_event: Candidate search → availability check → creation in one operation.

  • search_logs: Returns only related lines and surrounding context.

  • get_customer_context: Returns recent customer activity as summary + key metadata.

This approach echoes the analogy "don't return all contacts; first narrow with search_contacts." A runnable sketch of an integrated scheduling tool follows.
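The sketch below illustrates an integrated scheduling tool: one call covers candidate search, availability checking, and creation. The in-memory calendar stands in for a real backend, and all names are illustrative assumptions.

```python
# Integrated scheduling sketch; the in-memory calendar is a stand-in for a real backend.
from datetime import datetime, timedelta

BUSY = {  # existing bookings per attendee (illustrative data)
    "alice": [datetime(2025, 9, 1, 10)],
    "bob": [datetime(2025, 9, 1, 14)],
}

def schedule_event(attendees: list[str], duration_minutes: int, day: datetime) -> dict:
    # One call: search candidate slots, check availability, then create the event.
    for hour in range(9, 17):
        slot = day.replace(hour=hour, minute=0)
        if all(slot not in BUSY.get(a, []) for a in attendees):
            for a in attendees:
                BUSY.setdefault(a, []).append(slot)  # "create" the event
            return {
                "status": "scheduled",
                "start": slot.isoformat(),
                "end": (slot + timedelta(minutes=duration_minutes)).isoformat(),
            }
    # Helpful failure: tell the agent what to change on the next attempt.
    return {"status": "no_slot_found", "hint": "Try another day or fewer attendees."}

print(schedule_event(["alice", "bob"], 30, datetime(2025, 9, 1)))
```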

Practical Tutorial

The flow is as follows (a minimal MCP server sketch follows the list):

  • Organize your existing API/SDK specifications and internal knowledge into LLM-oriented documentation (llms.txt, etc.).

  • Load it into Claude Code to generate tool templates, and wrap them with a local MCP server or DXT.

  • Connect to Claude Code via claude mcp add ..., and for Claude Desktop, enable the server via Settings > Developer (MCP) or Settings > Extensions (DXT).

  • Once the tools work well enough, run programmatic evaluation loops by passing them directly to the API, adjusting prompts so the output includes structured results plus reasoning and feedback.

  • Concatenate the evaluation logs and paste them into Claude Code for batch refactoring of descriptions, naming, and error responses.
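As one concrete starting point for the wrapping step, the sketch below exposes a single tool through a local MCP server using the Python MCP SDK's FastMCP helper. The tool itself is an illustrative assumption, and the SDK surface should be checked against the current MCP documentation.

```python
# Minimal local MCP server sketch; the tool's contents are illustrative assumptions.
# Requires the Python MCP SDK:  pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("log-tools")

@mcp.tool()
def search_logs(query: str, start_date: str = "", end_date: str = "") -> str:
    """Search application logs and return matching lines with brief context.
    Dates are YYYY-MM-DD; prefer narrow queries with a date range."""
    # Stand-in implementation; replace with a call to your logging backend.
    return f"No backend connected. Would search for '{query}' between {start_date} and {end_date}."

if __name__ == "__main__":
    # Runs over stdio so it can be registered with: claude mcp add log-tools python server.py
    mcp.run()
```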

Common Pitfalls and Workarounds

We list frequently encountered issues in real settings as symptom → cause → remedy.

| Symptom | Cause | Remedy |
|---|---|---|
| Similar tools get mis-selected | Ambiguous namespacing | Choose prefix/suffix via evaluation; summarize each description in one line |
| Returns are overly long | Pagination/filtering not designed | Default to CONCISE; use DETAILED only when needed |
| Retries don't progress | Unhelpful errors | Return input-validation errors with a brief example |
| Searches miss the target | Poor guidance in descriptions | State a "search small" strategy in descriptions; recommend ranges/filters |
| Evaluation is unreliable | Biased toward sandbox or single-task challenges | Add real data and composite tasks; verify with a hold-out set |

Use this table to rapidly inventory existing tool sets.

Examples and Effectiveness Supplement

Within Anthropic, improving Slack and Asana MCP tools collaboratively with Claude yielded higher accuracy than manual creation. Furthermore, external benchmarks (SWE-bench Verified) showed significant score improvements from minor tool description tweaks. In other words, design and description refinements directly impact performance.


Implementation Checklist

Pre-deployment, last-mile checks promote stable operation.

  • Evaluation Design

Many realistic composite tasks, with clearly stated, verifiable goals.

  • Metrics

Success rate, tool call counts, time required, token consumption, input-validation errors.

  • Efficiency Design

Default pagination/filtering/range selection/truncation.

  • Error Design

Directive, brief retry hints + minimal examples.

  • Safety Notes

Clearly annotate destructive operations or open-world access in tool descriptions.

Checking for gaps from these perspectives avoids most early-stage trouble.

Summary

In essence: design tools on the assumption that an agent will actually use them, expose weaknesses through realistic evaluation, and polish them in short cycles in collaboration with Claude. Start by choosing one high-frequency task and running through integrated tool creation → MCP connection → evaluation log output (with reasoning/feedback) → batch refactoring. Even small adjustments to response format, granularity, namespacing, and error messages visibly improve success rates and costs. The design care you can start applying today is the shortcut to maximizing agent capabilities.