AetherWave Studio Blog

The MCP server I shipped this week is small, around 800 lines of TypeScript. It exposes three tools: music generation, image generation, and video generation. Each one accepts a prompt and a few common parameters and returns a hosted URL when the artifact is ready. From the agent's perspective there is no provider concept at all. The agent says "generate a cinematic 5-second video of a dragon in a city street" and gets a video. It doesn't pick Kling versus Hailuo versus Seedance. It doesn't choose between fast and quality variants. It doesn't deal with polling.
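
To make "no provider concept" concrete, here is a minimal sketch of what a uniform tool surface like this could look like. The tool names, the `GenerateInput` fields beyond `prompt`, and the `Backend` callback are hypothetical. The post only says that each tool takes a prompt and a few common parameters and returns a hosted URL.

```typescript
// Hypothetical input shape: the post only specifies "a prompt and a few common parameters".
interface GenerateInput {
  prompt: string;
  aspectRatio?: "16:9" | "9:16" | "1:1"; // hypothetical common parameter
  durationSeconds?: number; // hypothetical; meaningful for music and video only
}

// What every tool returns once the artifact is ready.
interface GenerateResult {
  url: string; // hosted artifact URL
}

type MediaKind = "music" | "image" | "video";

interface ToolDefinition {
  name: string;
  kind: MediaKind;
  description: string;
}

// Three tools, one shape. There is no provider or model field anywhere.
const TOOLS: ToolDefinition[] = [
  { name: "generate_music", kind: "music", description: "Generate a music track from a prompt." },
  { name: "generate_image", kind: "image", description: "Generate an image from a prompt." },
  { name: "generate_video", kind: "video", description: "Generate a video clip from a prompt." },
];

// Stand-in for the upstream platform, which owns routing, fallback, and polling.
type Backend = (kind: MediaKind, input: GenerateInput) => Promise<GenerateResult>;

// Every tool call funnels through one handler; the agent only sees a URL or an error.
async function callTool(name: string, input: GenerateInput, backend: Backend): Promise<GenerateResult> {
  const tool = TOOLS.find((t) => t.name === name);
  if (!tool) throw new Error(`unknown tool: ${name}`);
  if (input.prompt.trim() === "") throw new Error("prompt must not be empty");
  return backend(tool.kind, input);
}
```

The schema deliberately leaves provider choice out. An agent cannot make a routing decision it is never offered, so all of that logic stays on the server side.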

MCP is the part of the agent stack people will look back on as obvious. It's a protocol that lets an agent call tools that live anywhere, are defined by anyone, and share a uniform shape. What matters isn't the protocol itself but its second-order effect. Once tools look uniform from the agent's perspective, you can compose them in ways that fall apart when every tool ships its own bespoke SDK.

This post is about one of those second-order effects, specifically for media generation.

If you've integrated against more than two AI generation providers, you already know the shape of the problem. Suno has its own auth scheme and an asynchronous job model with callbacks. fal.ai has a different async pattern with queue updates streamed back. Replicate has its own SDK. So does KIE.ai. OpenAI's image API is synchronous. Each provider has its own credit pool, its own dashboard, its own pricing units (some per-image, some per-second, some per-token), its own error vocabulary, and its own opinion about what "completed" means.
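
One way to contain this is to normalize every provider's job model into a single status type, then run one polling loop on top of it. The sketch below takes that approach. The provider keys `providerA` and `providerB`, along with their raw status strings, are hypothetical placeholders. They are not the documented vocabularies of any provider named above.

```typescript
// Shared job model that every provider is normalized into.
type JobStatus =
  | { state: "pending" }
  | { state: "completed"; url: string }
  | { state: "failed"; reason: string };

// Raw status payload as a provider adapter might receive it (hypothetical shape).
interface RawJob {
  status: string;
  outputUrl?: string;
  error?: string;
}

// Hypothetical per-provider vocabularies: which raw strings mean "done" and which mean "failed".
const STATUS_VOCAB: Record<string, { done: string[]; failed: string[] }> = {
  providerA: { done: ["COMPLETED"], failed: ["ERROR"] },
  providerB: { done: ["succeeded"], failed: ["failed", "canceled"] },
};

function normalize(provider: string, raw: RawJob): JobStatus {
  const vocab = STATUS_VOCAB[provider];
  if (!vocab) throw new Error(`no status mapping for provider: ${provider}`);
  if (vocab.failed.includes(raw.status)) {
    return { state: "failed", reason: raw.error ?? raw.status };
  }
  // Only report completed when there is a URL to hand back; "done" without output stays pending.
  if (vocab.done.includes(raw.status) && raw.outputUrl) {
    return { state: "completed", url: raw.outputUrl };
  }
  return { state: "pending" }; // anything unrecognized is treated as still running
}

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests don't actually wait
}

const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// One polling loop for every provider: resolves to the hosted URL or throws.
async function waitForArtifact(
  poll: () => Promise<JobStatus>,
  { intervalMs = 2000, maxAttempts = 150, sleep = realSleep }: PollOptions = {},
): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await poll();
    if (status.state === "completed") return status.url;
    if (status.state === "failed") throw new Error(`generation failed: ${status.reason}`);
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`generation timed out after ${maxAttempts} checks`);
}
```

With this design, adding a provider means adding one vocabulary entry and one adapter. It does not mean adding another polling loop.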

The cost of integrating against all this isn't writing the first call. For any single provider, the first call is easy. The cost is what compounds afterward. Every provider you add increases the maintenance load, because each one brings a separate failure mode to monitor, a separate retry policy to tune, a separate billing surface to reconcile, and another model-selection branch in your code.

I shipped my AI creative platform, AetherWave Studio, on top of roughly a dozen of these providers. The routing layer between them became the single largest source of operational complexity in the system. Not the generation itself. The glue.

This isn't a story about that glue, though. That glue exists, it works, it's been running in production for a year, and it's not interesting in isolation. The story is what happened when I exposed the platform through MCP.

What the agent inherits, without the developer writing any of it, is everything the upstream platform already does:

Provider fallback. When KIE flakes on Grok video, the call quietly retries on fal.ai with the same prompt. When Atlas misbehaves on a Wan request, we route to KIE's variant. The agent doesn't see any of this. It sees a successful result or a structured error. Resilience that took six months of incident response to harden upstream is now free to every agent that talks to the MCP server.
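
Here is a minimal sketch of ordered fallback. It assumes each provider attempt either resolves to a URL or throws. The `ProviderAttempt` and `GenerationError` names are hypothetical. The sketch does not try to reproduce the real upstream policy, such as retry budgets or which errors count as retryable.

```typescript
// Hypothetical shape: one entry per provider, tried in order.
interface ProviderAttempt {
  provider: string;
  run: (prompt: string) => Promise<string>; // resolves to a hosted URL or throws
}

// Structured error surfaced to the agent when every provider in the chain fails.
class GenerationError extends Error {
  constructor(public readonly attempts: { provider: string; message: string }[]) {
    super(`all ${attempts.length} providers failed`);
    this.name = "GenerationError";
  }
}

// Tries providers in order with the same prompt; the first success wins.
async function generateWithFallback(
  prompt: string,
  chain: ProviderAttempt[],
): Promise<{ url: string; provider: string }> {
  const failures: { provider: string; message: string }[] = [];
  for (const { provider, run } of chain) {
    try {
      return { url: await run(prompt), provider };
    } catch (err) {
      // Record the failure and fall through to the next provider.
      failures.push({ provider, message: err instanceof Error ? err.message : String(err) });
    }
  }
  throw new GenerationError(failures);
}
```

In the Grok video case above, the chain would list KIE first and fal.ai second. The returned `provider` is useful for logs. The MCP tool response itself would only need the URL, which keeps the fallback invisible to the agent.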

Model selection. The upstream router has opinions about which provider to use for which kind of request. Text-to-image (T2I) requests with a strict photoreal style go to Z-Image. Spicy content goes through Wan 2.5. High-res video work goes to Kling 3.0. Those opinions live in the upstream code, which means they can be updated centrally. An agent built today against the v0.1 server will pick up routing improvements made six months from now without a reinstall.
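
A first-match rule table is one simple way to express routing opinions like these. Only the three pairings come from the post. The request fields, the rule order, the model identifier strings, and the per-kind defaults are assumptions for illustration.

```typescript
// Hypothetical request descriptor; the real router's inputs aren't described in the post.
interface GenRequest {
  kind: "image" | "video" | "music";
  style?: "photoreal" | "stylized";
  mature?: boolean; // stands in for "spicy" content
  resolution?: "standard" | "high";
}

// Ordered rules, first match wins. Order and conditions are illustrative.
// Because the table lives server-side, editing it changes behavior for every connected agent.
const RULES: { when: (r: GenRequest) => boolean; model: string }[] = [
  { when: (r) => r.mature === true && r.kind !== "music", model: "wan-2.5" },
  { when: (r) => r.kind === "image" && r.style === "photoreal", model: "z-image" },
  { when: (r) => r.kind === "video" && r.resolution === "high", model: "kling-3.0" },
];

// Placeholder defaults per media kind, not the platform's real choices.
const DEFAULTS: Record<GenRequest["kind"], string> = {
  image: "default-image-model",
  video: "default-video-model",
  music: "default-music-model",
};

function pickModel(req: GenRequest): string {
  const rule = RULES.find((r) => r.when(req));
  return rule ? rule.model : DEFAULTS[req.kind];
}
```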

Credit pooling. This is the piece that turned out to be more useful than I expected. When developers integrate against six providers individually, they maintain six balances, six dashboards, and six top-up workflows. The mental tax of that is real. One unified pool removes it. A developer can build an agent that uses music plus image plus video without ever thinking about which pool funded which call.
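
Here is a sketch of the pooling idea. Each call debits one shared balance, and spend is reported by media type rather than by provider. The per-kind prices are hypothetical credit amounts, not AetherWave's actual rates.

```typescript
// Hypothetical prices in abstract credits.
const PRICE: Record<"music" | "image" | "video", number> = { music: 10, image: 2, video: 25 };

type Kind = keyof typeof PRICE;

class CreditPool {
  private balance: number;
  private readonly ledger: { kind: Kind; cost: number }[] = [];

  constructor(initialCredits: number) {
    this.balance = initialCredits;
  }

  // Debits the single shared balance, whichever upstream provider ends up serving the call.
  charge(kind: Kind): number {
    const cost = PRICE[kind];
    if (cost > this.balance) {
      throw new Error(`insufficient credits: need ${cost}, have ${this.balance}`);
    }
    this.balance -= cost;
    this.ledger.push({ kind, cost });
    return this.balance;
  }

  get remaining(): number {
    return this.balance;
  }

  // One spend report across all media types; providers never appear in it.
  spentByKind(): Record<string, number> {
    const totals: Record<string, number> = {};
    for (const { kind, cost } of this.ledger) totals[kind] = (totals[kind] ?? 0) + cost;
    return totals;
  }
}
```

A production version would presumably also refund credits when a charged job fails. This sketch leaves that out.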

The agent-native framing is the thing I want to underline. This package isn't really meant for developers to call directly as an npm import. It's meant for agents to call through MCP. The developer's job is to point an agent at the MCP server and write prompts. The agent's job is to decompose those prompts into tool calls. The tool's job is to return artifacts. Each layer does what it's good at. The developer gets back to the part of the work that's actually interesting (the prompt, the user experience, the product) instead of the part that isn't (the integration plumbing).

What v0.1 doesn't include, and I want to be straight about this: two higher-level capabilities on the AetherWave platform, band identity generation and audio mastering, exist as REST endpoints but aren't yet exposed through the MCP server. They need additional auth plumbing on the underlying routes before I can ship them safely. I made a deliberate call to leave them out of v0.1 rather than ship them broken. They're on the roadmap, not in the package today. That feels like the right tradeoff: a smaller surface that works completely is more useful than a larger surface with sharp edges.

The other design choice I'd call out is one that's still up in the air: the agent currently doesn't know what a tool call will cost in credits before it invokes the tool. I've been going back and forth on whether to add a `dry_run` or `estimate` parameter that returns a cost without running the job, and there are arguments both ways. For now I've left it out. v0.1 should be the smallest surface that's useful, and price estimation is a question I'd rather answer with real usage data than with speculation.

Install:

npm install -g @aetherwave-studio/mcp

Repo (public, MIT):

https://github.com/AetherWave-Studio/aetherwave-mcp

Landing, docs, keys:

https://aetherwavestudio.com/developers

If you build something with it I'd genuinely like to hear about it.

The integration tax on agent media generation, and how MCP unmakes it