The $35,000 Wake-Up Call
We needed to generate SEO descriptions for 858,000 e-commerce products. A straightforward task: take a product title, brand, and existing description in Serbian, then produce an English translation, a cleaned-up title, and a short SEO paragraph. Five fields per product, a few sentences each.
The first estimate from Anthropic's Claude API? $35,000. For text generation. For a task that a knowledgeable human could do in 30 seconds per product — but there are 858,000 of them.
This is the dirty secret of AI-as-a-Service: the per-token pricing model that looks cheap at demo scale becomes absurd at production scale. When your system prompt is 4,000 tokens and you're sending it 858,000 times, you're paying to process 3.4 billion tokens of instructions that never change. It's like paying a consultant's hourly rate to re-read their job description before every task.
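The arithmetic behind that invoice is easy to reproduce. A back-of-envelope sketch using the figures above; the per-token input price is an illustrative assumption, not a quote from Anthropic's price list:

```python
# Back-of-envelope cost of resending a static system prompt at scale.
SYSTEM_PROMPT_TOKENS = 4_000       # from the article
NUM_PRODUCTS = 858_000             # from the article
PRICE_PER_MILLION_INPUT = 3.00     # illustrative $/MTok -- an assumption

repeated_tokens = SYSTEM_PROMPT_TOKENS * NUM_PRODUCTS
cost = repeated_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

print(f"{repeated_tokens:,} repeated tokens")  # 3,432,000,000 tokens (~3.4 billion)
print(f"${cost:,.0f}")                         # $10,296 for the unchanging prompt alone
```

At that assumed rate, the static prompt by itself accounts for a five-figure bill before a single product-specific token is processed.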
The Optimization Journey
What followed was a 48-hour deep dive into cost optimization that took us from $35,000 to $180 — a 194x reduction — while maintaining the same output quality. Along the way, we discovered that:
- Anthropic's Batch API doesn't cache prompts. Despite advertising prompt caching as a feature, the Batch API (which offers 50% off) processes each request independently on different servers. No caching. The "discount" actually costs 3.6x MORE than the standard API when you have a large system prompt. We only discovered this by checking the dashboard after a 400-product test run.
- The most expensive token is the one you don't need to send. Our system prompt contained 1,551 product categories for recategorization. Trimming to the top 626 categories (covering 95% of products) cut costs by 42%. The remaining 5% of products simply kept their existing category.
- Self-hosting a smaller model on a single GPU beats the best API pricing by 27x. Qwen3.5, an open-source model with 3 billion active parameters, produces Serbian text comparable to Claude Sonnet 4.5 at a fraction of the cost. One Nvidia B200 GPU processes 36,000 products per hour, so all 858,000 finished in under 24 hours. The GPU paid for itself in a single day.
- Parallelism is free when the GPU has headroom. At 200 concurrent requests, our B200 was using only 22% of its memory. We went from 1 request at a time (310 products/hour) to 256 parallel workers (35,000/hour): a 113x throughput increase at zero additional cost.
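The Batch API surprise above falls out of simple per-request arithmetic. A hedged sketch, assuming illustrative rates (a 50% batch discount on full-price input versus a cache-read rate of one tenth of the input price; the exact prices are assumptions). Our observed 3.6x gap also includes the per-product variable tokens; on the repeated system prompt alone, the gap under these assumed rates is wider:

```python
# Per-request cost of a 4,000-token system prompt under two billing modes.
SYSTEM_PROMPT_TOKENS = 4_000
INPUT_PRICE = 3.00 / 1_000_000     # $/token -- illustrative assumption
BATCH_DISCOUNT = 0.50              # batch: 50% off, but no prompt caching
CACHE_READ_MULTIPLIER = 0.10       # cached input billed at 10% of input price (assumption)

batch_cost = SYSTEM_PROMPT_TOKENS * INPUT_PRICE * BATCH_DISCOUNT
cached_cost = SYSTEM_PROMPT_TOKENS * INPUT_PRICE * CACHE_READ_MULTIPLIER

print(f"batch:  ${batch_cost:.4f} per request")
print(f"cached: ${cached_cost:.4f} per request")
print(f"batch costs {batch_cost / cached_cost:.0f}x more for the repeated prompt")
```

The lesson generalizes: a headline discount on uncached input can lose to an undiscounted price on cached input whenever the repeated prompt dominates the request.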
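The category-trimming step above is straightforward to automate: sort categories by product count and keep the smallest prefix that covers the target share of products. A minimal sketch over made-up data (the category names and counts are assumptions for illustration):

```python
from collections import Counter

def top_categories(counts: Counter, coverage: float = 0.95) -> list[str]:
    """Smallest frequency-ordered set of categories covering `coverage` of products."""
    total = sum(counts.values())
    kept, covered = [], 0
    for category, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.append(category)
        covered += n
    return kept

# Toy distribution: a few dominant categories plus a long tail.
counts = Counter({"shoes": 60, "shirts": 25, "hats": 10, "belts": 3, "scarves": 2})
print(top_categories(counts))  # ['shoes', 'shirts', 'hats'] covers 95 of 100 products
```

Products in the dropped tail keep their existing category, exactly as in the run described above.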
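The parallelism point above holds whenever each request spends most of its wall-clock time waiting on the server: N in-flight requests multiply throughput nearly N-fold until the GPU saturates. A minimal sketch using asyncio with a semaphore to cap concurrency; `generate` is a stand-in for an HTTP call to a vLLM endpoint, and its simulated latency is an assumption:

```python
import asyncio
import time

CONCURRENCY = 256          # parallel in-flight requests
SIMULATED_LATENCY = 0.05   # stand-in for server-side generation time (assumption)

async def generate(product_id: int) -> str:
    # Placeholder for an HTTP call to an OpenAI-compatible vLLM server.
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"description for product {product_id}"

async def run(n_products: int) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def worker(pid: int) -> str:
        async with sem:    # cap the number of in-flight requests
            return await generate(pid)

    return await asyncio.gather(*(worker(i) for i in range(n_products)))

start = time.perf_counter()
results = asyncio.run(run(1_024))
elapsed = time.perf_counter() - start
print(f"{len(results)} products in {elapsed:.2f}s")
```

With 1,024 simulated requests at 256-way concurrency, total time is roughly four latency periods rather than 1,024, mirroring the jump from 310 to 35,000 products per hour.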
The Real Cost of AI APIs
The AI industry's pricing model is built for developers running demos and startups processing hundreds of requests. At enterprise scale — millions of products, documents, or records — the per-token model breaks down spectacularly.
Consider: our 858,000 products needed roughly 500 billion FLOPs of actual computation. A B200 GPU delivers 2,250 TFLOPS of peak throughput, so the raw compute amounts to a fraction of a second, not hours. Yet the API charges as if each request requires dedicated attention from a room full of H100s.
Self-hosting isn't free — there's the engineering time to set up vLLM, optimize prompts, debug deployments, and handle failures. But when the alternative is a $35,000 invoice for generating short product descriptions, the math is clear.
What We Learned
The AI-as-a-Service model makes sense for prototyping, small-scale use, and tasks where quality justifies premium pricing. But for batch processing at scale — especially with a large, repeated system prompt — self-hosted inference on rented GPUs is the pragmatic choice. The open-source model ecosystem (Qwen, Llama, DeepSeek) has reached the quality threshold where, for many languages and tasks, API-exclusive models no longer justify their 27x price premium.
The irony? We used Claude (the expensive API) to develop and refine our prompts, evaluate quality, and establish the baseline. Then we deployed a free, open-source model to do the actual work. The API was the R&D cost; the GPU was the production cost. That division of labor — premium API for development, commodity GPU for execution — might be the real model for AI at scale.
Ready to see what AI can do for your business? See how it works — pricing, timeline, and what you get.