What is an MCP anyway

I read an interesting post recently about creating a hackable AI assistant. There were a lot of things that appealed to me about it:

  • Simple. Not using abstractions like LangChain which have always felt a bit heavy to me
  • Custom context. Collect the data you need in regular code and feed it in as part of the context window.
  • Tool use. Instead of wrestling with schemas and letting LLMs call the tools, they implemented an approach to parse the LLM response and call the appropriate tool.

Hacky, but I loved it. The code was running on Val Town and was a bunch of TypeScript.

Enter MCP

Anthropic released MCP last year and it has been seeing increasingly widespread adoption, so I figured now was as good a time as any to jump in and see whether I could apply it in a way that felt useful or compressed the complexity involved.

With the release of Claude 4, Anthropic's API now supports using custom remote MCP servers in regular API calls - in simpler terms, Claude can now send requests to a custom MCP server that I host.

How is this different to tool use?

Tool use has been implemented in a variety of ways:

  • Manual tool parsing

Get the LLM to output the tool call and the data you want, and then parse and execute the function on your machine. Pass back the results in the next call.

This is what the AI assistant article above was doing. For example, the LLM can output a response like:
"I will update your birthday in my records

<editMemories>
[{ "id": "abc123", "text": "Client's birthday is on April 15th.", "date": "2024-04-15" }]
</editMemories>"

We can parse that, call the function manually, then pass the result back to the LLM within the conversation chain and keep going.
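
A minimal sketch of that parsing step - the tag format follows the example above, and edit_memories() is a hypothetical stand-in for your own function:

import json
import re


def handle_llm_output(llm_text: str) -> None:
    """Pull the <editMemories> block out of the LLM's reply and execute it ourselves."""
    match = re.search(r"<editMemories>(.*?)</editMemories>", llm_text, re.DOTALL)
    if match:
        edits = json.loads(match.group(1))
        edit_memories(edits)  # our own (hypothetical) function; its result goes into the next LLM call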

  • Tool usage

This is the more officially supported way, available in the Anthropic and OpenAI APIs. It's a similar concept to the above, in that you are still executing the code and passing the results back to the LLM, but with a more structured schema.

tools=[
        {
            "name": "edit_memory",
            "input_schema": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "What to update the memory to",
                    },
                    "id": {
                        "type": "string",
                        "description": "The id of the memory to update",
                    }
                },
            },
        }
    ],

Instead of telling the LLM how to call the method in the prompt, you pass in a schema and the provider (e.g. Anthropic) translates that into a prompt telling the LLM how to call the method.

From the Anthropic docs

When you use tools, we also automatically include a special system prompt for the model which enables tool use.
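
A rough sketch of what using that schema looks like with the Anthropic SDK, assuming the tools list above is assigned to a variable called tools - edit_memory() is still our own code, and the follow-up call passing the result back is elided:

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    tools=tools,  # the edit_memory schema from above
    messages=[{"role": "user", "content": "My birthday is actually April 15th"}],
)

# If Claude decides to call the tool, it shows up as a tool_use content block
for block in response.content:
    if block.type == "tool_use" and block.name == "edit_memory":
        result = edit_memory(**block.input)  # executed on our machine
        # ...the result then goes back to Claude as a tool_result block in the next request
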
  • Remote tool usage

Used with provider tools such as web search.

    tools=[{
        "type": "web_search_20250305",
        "name": "web_search",
        "max_uses": 5
    }]

This allows the LLM to call that function, which is executed by the provider (or, more realistically, a third-party search provider like Brave).

  • MCP

Enter MCP.

MCP allows you to host your own server that the LLM can call in a well-defined way, and lets the MCP server do the execution. The results are returned to the LLM in the same well-defined way - and off we go.

What does the server do

Let's dive into some code, starting from the MCP Python SDK

from mcp.server.fastmcp import FastMCP
mcp = FastMCP("Demo")

# Add an addition tool
@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b

Now we are talking. We have the ability to execute a tool call using a Flask-esque syntax. There are all these other parts of the MCP spec, such as Resources - but so far most clients (including Anthropic!) only seem to support tool usage.
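
To actually serve this you would run it - a minimal sketch, noting that the transport argument assumes a recent version of the Python SDK:

if __name__ == "__main__":
    # "stdio" is the default (for local clients); "streamable-http" exposes it over HTTP
    mcp.run(transport="streamable-http")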

How is this any better than just passing in the data I need in the context window

To quote from the Stevens article

you don’t need fancy techniques or libraries to build useful personal tools with LLMs

Most of the examples I have seen of using custom MCPs have not really landed for me. As an example, this MCP resume allows you to connect an LLM to an MCP server for this developer, in order to get data such as their resume in plaintext. Even in a scenario where I really did want that context, it seems to me that it would make way more sense to just fetch that data in a regular manner and throw it into the context window.
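
That alternative really is as simple as it sounds - a sketch, with a made-up local file standing in for wherever the data lives:

from pathlib import Path

import anthropic

resume_text = Path("resume.txt").read_text()  # hypothetical data source

response = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": f"Here is a resume:\n\n{resume_text}\n\nSummarize the candidate's experience.",
    }],
)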

So when does it make sense to make a custom MCP? There are a few areas that come to mind for me so far:

  1. You are a large software company like Atlassian, and you want to expose CRUD functionality to your systems for users (i.e. a Jira MCP or similar). Does not apply to me!
  2. You are making a custom MCP for your own needs, not something that someone would connect to publicly. This allows you to expose specific functionality to the LLM that you want it to have. An explicit example: I have used this to allow a voice LLM to send requests to a MadMapper lighting setup - so you can trigger scenes and actions via voice.
  3. You want to expose tools that take action, and having them run on a centralized server is useful (rather than running the code on the machine that is making the LLM calls). I can see this being useful where I have lots of different modalities making calls to a server (i.e. an MCP serving a text LLM and a different voice LLM, as well as a regular API that I can call). Having one server that handles all the APIs, database connections, etc. is a useful abstraction.

So let's make a custom server

This server is going to be a simple message board that will persist data using sqlite-utils and allow you to get the past 5 messages that have been left.

import datetime
from mcp.server.fastmcp import FastMCP
from sqlite_utils import Database

mcp = FastMCP("Message Board", stateless_http=True)
db = Database("message_board.db")
messages = db["messages"]


@mcp.tool()
async def leave_message(message: str) -> bool:
    """Leaves a message on the message board."""
    messages.insert({
        "message": message,
        "time": datetime.datetime.now().isoformat(),
    }, pk="id")  # This will create an auto-incrementing integer primary key
    return True


@mcp.tool()
async def last_messages() -> list[str]:
    """Gets the last 5 messages that have been left on the message board. Most recent messages are returned first"""
    recent_messages = messages.rows_where(
        order_by="id desc",
        limit=5
    )

    return [recent_message["message"] for recent_message in recent_messages]

Alright, this is good enough for a simple message board. I have deployed it to https://mcp.andrewperkins.com.au/message_board/mcp/

💡
Note: All the code is available on GitHub. I discuss the Kamal and FastAPI details later in this post, but the source of truth is the code.
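
Before wiring it up to Claude, you can poke at the deployed server directly using the Python SDK's own client. A quick smoke-test sketch, assuming a recent SDK version with Streamable HTTP support:

import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def smoke_test():
    url = "https://mcp.andrewperkins.com.au/message_board/mcp/"
    async with streamablehttp_client(url) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # ['leave_message', 'last_messages']
            result = await session.call_tool("leave_message", {"message": "Hello from the SDK client"})
            print(result.content)


asyncio.run(smoke_test())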

How does it get used

import anthropic
client = anthropic.Anthropic()
response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": "Read the messages, then leave a message summarizing what is being talked about",
    }],
    mcp_servers=[{
        "type": "url",
        "url": "https://mcp.andrewperkins.com.au/message_board/mcp/",
        "name": "Message Board",
    }],
    betas=["mcp-client-2025-04-04"]
)

output = '\n'.join([c.text for c in response.content if c.type=='text'])

print(output)

We can let Claude call our MCP server (and in fact, we are asking it to call it twice - once to read the messages and once to leave a new message). When this request starts, Claude will send a ListToolsRequest to the server to determine what functionality it offers. This call and response will not be included in your message output back from Claude, but you are billed for the token usage of it!

If we look at the response, there are seven content blocks coming back from Claude
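
A quick way to see them is to loop over the blocks and print their types:

for block in response.content:
    print(block.type)
# text, mcp_tool_use, mcp_tool_result, text, mcp_tool_use, mcp_tool_result, text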

  • The preamble

Claude starts with a small text preamble outlining its plan

BetaTextBlock(citations=None, text="I'll first read the recent messages on the message board, then leave a summary message", type='text')

  • Tool use

Claude then calls the MCP server's last_messages endpoint

BetaMCPToolUseBlock(input={}, name='last_messages', server_name='Message Board', type='mcp_tool_use')

  • Tool result

Claude gets back the response from the last_messages endpoint. This one is a bit more detailed because it's a list of text blocks

BetaMCPToolResultBlock(
  content=[
    BetaTextBlock(text='Python is such an elegant programming language!', type='text'), 
    BetaTextBlock(text="Just met the most amazing imaginary dog today! His name is Sparkles and he's a golden retriever", type='text'),
    BetaTextBlock(text="Hello! I'm an AI assistant currently trying out MCP (Model Context Protocol).", type='text')
  ], 
  type='mcp_tool_result'
)

  • More planning

Claude now decides to take further action (notice that it is making multiple MCP calls in one response and taking action based on the data it is getting back).

BetaTextBlock(text="Now I'll leave a summary message about what's being discussed:", type='text')

  • Tool use

Claude leaves a new message based on the content it found on the board

BetaMCPToolUseBlock(input={'message': 'Summary of recent discussions: The message board has been covering diverse topics including enthusiasm for Python programming, a creative post about an imaginary dog named Sparkles, and an AI assistant exploring the Model Context Protocol (MCP).'}, name='leave_message', server_name='Message Board', type='mcp_tool_use')

  • Tool result

Claude gets back a message showing that the message was saved

BetaMCPToolResultBlock(content=[BetaTextBlock(citations=None, text='true', type='text')], is_error=False, type='mcp_tool_result')

  • Final output

Claude now summarizes what it has done

BetaTextBlock(text="I've read the recent messages and left a summary on the message board. The discussions covered three main topics: appreciation for Python programming language, a creative post about an imaginary dog named Sparkles, and an AI assistant exploring MCP (Model Context Protocol) capabilities. The summary has been successfully posted to the message board.", type='text')

Makes sense - there are a lot of calls going on, but most of them are the request/response cycle for each tool use.

Looking at the usage data, that took 2259 input_tokens and 293 output_tokens, which is roughly half a cent. This feels quite high (given my initial prompt is around 20 tokens), and I can't find good documentation on how the MCP token usage is calculated. Obviously the MCP tokens are being fed in as input tokens, but I am not sure how, so I try to figure it out at the end of this post.

Deployment

Instead of deploying the MCP server directly, I wrap it behind a simple FastAPI app. This is useful because I can declare other routes, such as a /up health check that is required by my deployment tooling.

Also worth noting I am using Streamable HTTP instead of SSE (which most MCP servers currently use) as it is the new standard recommended by MCP.

# following on from the server code above

import contextlib
import uvicorn
from fastapi import FastAPI

# Create a combined lifespan to manage mcp session manager
@contextlib.asynccontextmanager
async def lifespan(app: FastAPI):
    async with contextlib.AsyncExitStack() as stack:
        await stack.enter_async_context(mcp.session_manager.run())
        yield


app = FastAPI(lifespan=lifespan)
app.mount("/message_board", mcp.streamable_http_app())


@app.get("/up")
def up():
    return {"message": "Hello world"}


if __name__ == '__main__':
    uvicorn.run(app, host="127.0.0.1", port=8222)

For my deployment tooling I am using Kamal, which I have previously written an introduction and explainer for. I have deployed the MCP server on the same DigitalOcean droplet that is serving this Ghost website.

💡
Again, for more details you can check out the code on GitHub

Bonus section: Figuring out the token usage

Anthropic has a count_tokens endpoint which allows you to estimate input tokens

https://api.anthropic.com/v1/messages/count_tokens

counter = anthropic.Anthropic().beta.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "What is 1+1? Just return the answer, nothing else"},
    ]
)

On its own, the query "What is 1+1? Just return the answer, nothing else" comes to 21 input tokens. I can validate this by actually sending that request, which uses 21 input_tokens and 5 output_tokens.

counter = anthropic.Anthropic().beta.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "What is 1+1? Just return the answer, nothing else"},
    ],
    mcp_servers=[{
        "type": "url",
        "url": "https://mcp.andrewperkins.com.au/message_board/mcp/",
        "name": "Andrew Perkins",
    }],
    betas=["mcp-client-2025-04-04"]
)

Once I add mcp_servers, count_tokens goes up to 493 input tokens. This happens even though the request will clearly not use any of the MCP tools. 493 - 21 = 472 tokens of overhead to add the MCP. This seems to be due to the ListToolsRequest, which runs whenever you send the command (even for count_tokens). From the server logs of my MCP:

2025-05-25T01:56:51.906687571Z INFO:     172.18.0.3:45976 - "POST /message_board/mcp/ HTTP/1.1" 200 OK
2025-05-25T01:56:51.909590884Z [05/25/25 01:56:51] INFO     Processing request of type            server.py:551
2025-05-25T01:56:51.909609444Z                              ListToolsRequest    

Let's break down what the ListToolsRequest response looks like

{
  "tools": [
    {
      "name": "leave_message",
      "description": "Leaves a message on the message board.\n    ",
      "inputSchema": {
        "type": "object",
        "properties": {
          "message": {
            "title": "Message",
            "type": "string"
          }
        },
        "required": [
          "message"
        ],
        "title": "leave_messageArguments"
      }
    },
    {
      "name": "last_messages",
      "description": "Gets the last 5 messages that have been left on the message board.\n\n    Most recent messages are returned first\n    ",
      "inputSchema": {
        "type": "object",
        "properties": {},
        "title": "last_messagesArguments"
      }
    }
  ]
}

Pumping that back into count_tokens gives 217 input tokens. So there are some overhead tokens being added on top, potentially static overheads like those added for tool usage.
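
A sketch of one way to get that number - pasting the tool list JSON in as a plain user message, which is only an approximation of however Anthropic actually injects it:

import json

tool_list_json = json.dumps(tool_list)  # the ListToolsRequest response above, loaded as a Python dict

counter = anthropic.Anthropic().beta.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": tool_list_json}],
)
print(counter.input_tokens)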

Testing this out on the real API, the token usage again matches once I add the mcp_servers definition

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": "What is 1+1? Just return the answer, nothing else"
    }],
    mcp_servers=[{
        "type": "url",
        "url": "https://mcp.andrewperkins.com.au/info/mcp/",
        "name": "Andrew Perkins",
    }],
    betas=["mcp-client-2025-04-04"]
)

Because we don't actually use the MCP tools, we only use 493 input tokens and 5 output_tokens. Once we change the query to ask about information that the MCP can serve, count_tokens and the actual token usage diverge.

{"role": "user", "content": "What messages on the board?"},

We get 493 input_tokens from count_tokens, but we now get 1359 input tokens when actually running the request, because we hit the last_messages endpoint.

However, if I count the tokens we are getting back from the last_messages endpoint, it is only 543 tokens (and that is including metadata like the BetaMCPToolUseBlock strings). The difference between count_tokens and the actual token usage is 1359 - 493 = 866 tokens. So there's an 866 - 543 = 323 token overhead coming from making this MCP call, even once you factor in the data coming back from the server.

At this point I am not really sure how to dig deeper, so I hope that Anthropic includes more information on MCP tool token usage in future doc updates.