LLM benchmark for SEO: 7 cloud and local models tested on 168 real listings
Real-world benchmark of 7 LLMs writing SEO listings. Quality score, tokens/sec, cost per listing. Winner and why.
Summary
- The experiment: 168 directory SEO listings written by 7 different models, spread across 8 niches (local businesses, products, events). Each listing graded by an independent judge LLM.
- The winner: Qwen3-235B-A22B via Cerebras, 8.4/10 average quality + 1,500 tokens/second.
- Runner-up: GLM-4.7 via SiliconFlow, 7.6/10, best in pure research/extraction.
- The surprise: Qwen3:14b running on local Ollama (consumer GPU) hits 6.5/10 — comfortably above the publishable floor.
- The avoidable loser: Mistral Medium free, 5.9/10 with constant 429s. Unworkable as primary.
- Real cost: from $0/month (pure Ollama) to $40-90/month (paid cloud production chain).
We’ve spent 8 months orchestrating LLM providers (“AIs”) to write SEO content. We’ve shipped more than 15 different models to real production (not synthetic benchmarks). This post is the comparison table we wish we’d found a year ago.
168 SEO directory listings, 7 models, 8 different niches. Each listing graded by an independent evaluator LLM. Let’s get to it.
The summary table
| Model | Quality | Speed (tok/s) | Cost/listing | When to use it |
|---|---|---|---|---|
| Qwen3-235B-A22B (Cerebras) | 8.4 | 1500 | ~$0.002 | Production primary |
| GLM-4.7 (SiliconFlow) | 7.6 | 80 | ~$0.003 | Research/extraction |
| GPT-OSS-120B (Groq) | 7.2 | 500 | free* | Secondary, not primary |
| Gemini 2.5 Flash | 6.9 | 200 | ~$0.001 | Creative tasks |
| Qwen3:14b Q6_K (Ollama local) | 6.5 | 40 | ~€0 | Off-peak, fallback |
| Mistral Medium (free) | 5.9 | 150 | free* | Unworkable for production |
| Qwen3:8b (Ollama local) | 5.2 | 60 | ~€0 | Cheap research only |
*free with rate limits that break production.
Methodology
For the table above to be trustworthy, the how matters:
The 168 listings
- 8 distinct niches picked on purpose: dentist, mechanic, Japanese restaurant, gym, traffic lawyer, hair salon, plumber, accounting firm.
- 7 models × 8 niches × 3 listings per niche = 168 listings total.
- Same base prompt, same SEO template (H1 ≤ 60 chars, 2-3 H2s, bullets, bold on key data, JSON-LD schema), same upstream research.
- Only the model writing phase 2 (first draft) changed. Phases 1, 3 and 4 stayed identical.
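The grid arithmetic above can be sketched in a few lines. The model and niche names come from this post; `build_jobs` and the tuple layout are hypothetical, only there to show how 7 × 8 × 3 expands into individual writing jobs.

```python
from itertools import product

MODELS = [
    "Qwen3-235B-A22B (Cerebras)",
    "GLM-4.7 (SiliconFlow)",
    "GPT-OSS-120B (Groq)",
    "Gemini 2.5 Flash",
    "Qwen3:14b Q6_K (Ollama local)",
    "Mistral Medium (free)",
    "Qwen3:8b (Ollama local)",
]
NICHES = [
    "dentist", "mechanic", "Japanese restaurant", "gym",
    "traffic lawyer", "hair salon", "plumber", "accounting firm",
]
LISTINGS_PER_NICHE = 3


def build_jobs():
    """Return one (model, niche, listing_index) tuple per listing to write."""
    return list(product(MODELS, NICHES, range(LISTINGS_PER_NICHE)))


jobs = build_jobs()
print(len(jobs))  # 168
```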
The evaluator
Each listing was scored by GLM-4.7 acting as judge. It returns a JSON object:
{
  "score": 7.4,
  "checks": {
    "h1_length_ok": true,
    "bold_data_present": true,
    "internal_links_count": 3,
    "factual_consistency": true,
    "schema_valid": true
  },
  "issues": ["No CTA at the end", "H2 number 2 too generic"]
}
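A minimal sketch of how a verdict like this could be consumed, assuming the judge replies with exactly the fields shown. `parse_verdict` and `is_publishable` are hypothetical helper names; the 6.0 cut-off is the publishing threshold mentioned elsewhere in this post.

```python
import json

PUBLISH_THRESHOLD = 6.0  # minimum publishable score quoted in this post

REQUIRED_CHECKS = {
    "h1_length_ok", "bold_data_present", "internal_links_count",
    "factual_consistency", "schema_valid",
}


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and reject malformed verdicts."""
    verdict = json.loads(raw)
    score = verdict["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    missing = REQUIRED_CHECKS - set(verdict["checks"])
    if missing:
        raise ValueError(f"missing checks: {sorted(missing)}")
    verdict.setdefault("issues", [])
    return verdict


def is_publishable(verdict: dict) -> bool:
    """True when the score reaches the publishing threshold."""
    return verdict["score"] >= PUBLISH_THRESHOLD


raw_reply = """{"score": 7.4, "checks": {"h1_length_ok": true,
 "bold_data_present": true, "internal_links_count": 3,
 "factual_consistency": true, "schema_valid": true},
 "issues": ["No CTA at the end", "H2 number 2 too generic"]}"""

print(is_publishable(parse_verdict(raw_reply)))  # True
```

Validating the reply before trusting the score matters because a judge LLM can return truncated or off-schema JSON; failing loudly is usually safer than scoring a listing on partial data.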
To make sure GLM-4.7 wasn’t self-scoring kindly, we cross-checked: 50 listings written by GLM-4.7 evaluated by Qwen3-235B (a different evaluator). Average difference: 0.18 points. Trustworthy enough.
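For reference, here is one way such a cross-check could be computed. The post does not say whether the 0.18 is a signed or an absolute difference, so this sketch returns both; `judge_gap` is a hypothetical name and the scores below are made up for illustration.

```python
from statistics import mean


def judge_gap(scores_a, scores_b):
    """Return (signed mean difference, mean absolute difference)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs), mean(abs(d) for d in diffs)


# Illustrative scores only, not data from the benchmark.
judge_one = [7.5, 8.0, 6.5]
judge_two = [7.0, 8.5, 6.0]
signed, absolute = judge_gap(judge_one, judge_two)
print(round(signed, 2), absolute)  # 0.17 0.5
```

The signed value shows whether one judge is systematically kinder; the absolute value shows how far apart the two judges are on a typical listing.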
Human blind test
50 random listings were also rated by 3 humans (an SEO freelancer, a specialised copywriter, a WordPress developer). Correlation between LLM-judge and the human average: 0.86. Not perfect, but high.
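A sketch of how that kind of agreement figure can be computed. The post does not say which coefficient was used; Pearson is assumed here, and `pearson` and `human_average` are hypothetical helpers operating on made-up scores.

```python
from math import sqrt


def human_average(ratings_per_rater):
    """Average several raters' scores listing by listing."""
    return [sum(col) / len(col) for col in zip(*ratings_per_rater)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Illustrative scores only: three raters, four listings.
humans = [[6, 8, 5, 7], [7, 8, 6, 7], [6, 9, 5, 8]]
llm_judge = [6.4, 8.1, 5.9, 7.0]
print(round(pearson(llm_judge, human_average(humans)), 2))
```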
Top of the chart: Qwen3-235B via Cerebras
Clear winner. 8.4/10 average quality, 1,500 tokens/second, $0.002 per full listing.
Why does it hit so hard?
- Large model. 235B parameters with MoE (Mixture of Experts: activates only 22B per step, cheaper compute while keeping capacity).
- Trained for SEO/writing. Alibaba (Qwen’s creator) trained it on lots of commercial and technical text. It shows.
- Cerebras infrastructure. Its WSE-3 chips are obscenely fast compared to NVIDIA GPUs. 1,500 tok/s means a full listing (2,500 output tokens) is generated in ~1.7 seconds.
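The generation-time arithmetic is simply output tokens divided by throughput. The sketch below applies it to the speeds in the summary table, assuming the same 2,500-token listing for every model; it ignores prompt processing, network latency and queueing, so real wall-clock times will be higher.

```python
OUTPUT_TOKENS = 2500  # full listing length quoted in this post

# Speeds (tokens/second) copied from the summary table.
SPEED_TOK_S = {
    "Qwen3-235B-A22B (Cerebras)": 1500,
    "GLM-4.7 (SiliconFlow)": 80,
    "GPT-OSS-120B (Groq)": 500,
    "Gemini 2.5 Flash": 200,
    "Qwen3:14b Q6_K (Ollama local)": 40,
    "Mistral Medium (free)": 150,
    "Qwen3:8b (Ollama local)": 60,
}


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Pure decoding time for a given output length and throughput."""
    return tokens / tok_per_s


for model, speed in SPEED_TOK_S.items():
    print(f"{model}: {generation_seconds(OUTPUT_TOKENS, speed):.1f} s")
```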
The asterisk: Cerebras retired Qwen3-235B on May 27, 2026. Yes, that has already happened, and it caught us. You have to migrate to one of the next-in-chain models, or wait for Qwen3 v2.
Silver: GLM-4.7
7.6/10. Chinese model from Zhipu AI. Served by SiliconFlow with $5 free starting credit.
Its superpower: research and extraction (pulling data from HTML pages, RSS feeds, PDFs and returning structured JSON). On that specific task it scores 8.1/10, above Qwen3-235B.
That’s why we use it in phase 1 of the pipeline (explained in the 9-LLM post), not in phase 2 writing.
The surprise: local Ollama 14B
This one we didn’t see coming when we started the benchmark.
Qwen3:14b quantised to Q6_K (Q6_K is a quantisation format: shrinks the model from 28GB to ~12GB with minimal quality loss, so it fits on consumer GPUs) running on our RTX 5060 Ti 16GB scores 6.5/10.
6.5 is above our minimum publishing threshold (6.0). Which means a €580 gaming GPU gives us a production-grade fallback with zero API spend.
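The 28GB and ~12GB figures follow from parameters × bits per weight. A rough sketch, assuming a nominal 14 billion parameters and about 6.5625 bits per weight for Q6_K (the llama.cpp block layout stores 256 weights in 210 bytes); real files differ a little because the true parameter count is not exactly 14B and not every tensor uses the same format.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return params_billions * bits_per_weight / 8


print(model_size_gb(14, 16))                # 28.0 (FP16)
print(round(model_size_gb(14, 6.5625), 1))  # 11.5 (Q6_K)
```

The weights are not the whole story: the KV cache and runtime buffers also need VRAM, which is why a file of this size is a comfortable rather than generous fit on a 16GB card.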
Throughput: 8-12 listings/hour. Not brutal, but running off-peak (Spanish small hours, Asian daytime) it ships ~120 listings/day.
Practical combination: Cerebras + SiliconFlow during the day, rotate to local Ollama at night. Cloud bill drops 40% with no output loss.
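A minimal sketch of that day/night rotation. The 23:00 to 07:00 window and the backend labels are assumptions for illustration; the post does not state the exact hours or how the scheduler is implemented.

```python
from datetime import time

# Assumed off-peak window (local time); adjust to your own traffic.
OFF_PEAK_START = time(23, 0)
OFF_PEAK_END = time(7, 0)


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def pick_backend(now: time) -> str:
    """Route to local Ollama off-peak and to the cloud chain otherwise."""
    if in_window(now, OFF_PEAK_START, OFF_PEAK_END):
        return "ollama-local"
    return "cloud"


print(pick_backend(time(2, 30)))  # ollama-local
print(pick_backend(time(12, 0)))  # cloud
```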
The avoidable loser: Mistral Medium free
Mistral launched a generous free tier in May 2026, announcing “free medium model with good rate limits”.
We tested it as primary for a week. Result:
- Quality 5.9/10. Below our publishable threshold (6.0). Nearly a third of listings forced a rewrite.
- Constant 429s. The “generous rate limit” was ~10 req/min. For 200 listings/day production that’s ~14 req/hour, peaking at 30/hour. Broke.
- Inconsistency. The model behaved differently at different hours (our hypothesis: Mistral was A/B testing internally).
Reverted on day six. Mistral Medium ranks 5th or 6th in our fail-over chain, never primary.
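For readers who have not built one, a fail-over chain can be as simple as trying providers in order and moving on when one answers 429. This is a sketch under stated assumptions, not our production code: `RateLimited`, `write_with_failover` and the two fake providers are hypothetical, and a real wrapper would raise `RateLimited` when the HTTP client sees a 429 status.

```python
class RateLimited(Exception):
    """Raised by a provider wrapper when the API answers HTTP 429."""


def write_with_failover(prompt, providers):
    """Try each (name, call) pair in order; skip rate-limited providers."""
    skipped = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            skipped.append(name)
    raise RuntimeError(f"all providers rate limited: {skipped}")


# Fake providers for illustration only.
def flaky(prompt):
    raise RateLimited("429 Too Many Requests")


def steady(prompt):
    return f"draft for: {prompt}"


chain = [("primary", flaky), ("fallback", steady)]
print(write_with_failover("dentist listing", chain))
# ('fallback', 'draft for: dentist listing')
```

Only rate-limit errors trigger the fall-through here; other exceptions propagate, so a genuine bug in a provider wrapper is not silently masked by the next model in the chain.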
Real total cost
To process 6,000 listings/month (200/day × 30) at average quality ≥ 7.5:
| Setup | Monthly cost |
|---|---|
| Paid cloud only (Cerebras + SiliconFlow + Gemini) | $40-60 |
| Hybrid (cloud peak hours + Ollama off-peak) | $25-40 |
| Local Ollama only (RTX 5060 Ti) | ~€30 electricity |
| OpenAI GPT-4o-mini only | ~$220 |
| GPT-4 turbo only | ~$900 |
Yes, you read that right. GPT-4 turbo only (~$900) costs roughly 20× more than the paid cloud stack ($40-60).
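The ratios can be checked directly from the table. The sketch below compares each OpenAI-only row against the $40-60 paid cloud range; `ratio_range` is a hypothetical helper. The last line multiplies the summary table's ~$0.002 per listing by the monthly volume, which covers only Qwen3-235B's per-listing cost, not the whole chain, since the post does not break the monthly totals down further.

```python
LISTINGS_PER_MONTH = 200 * 30  # 6,000 listings/month


def ratio_range(expensive, cheap_low, cheap_high):
    """Return (min ratio, max ratio) of one cost against a cost range."""
    return expensive / cheap_high, expensive / cheap_low


print(ratio_range(900, 40, 60))  # (15.0, 22.5) for GPT-4 turbo
low, high = ratio_range(220, 40, 60)
print(round(low, 2), high)       # 3.67 5.5 for GPT-4o-mini

# Per-listing cost from the summary table, scaled to a month.
print(round(LISTINGS_PER_MONTH * 0.002, 2))  # 12.0
```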
What’s NOT in the table that probably should be
Three models we tested but didn’t include in the final comparison:
- Claude 3.7 Sonnet (Anthropic): quality ~8.5, but it costs 30× as much as Cerebras. Unworkable for our volume.
- DeepSeek V3: quality ~7.4, but the API had weekly downtime when we ran the test. They claim stability now.
- NuExtract:3.8b (local Ollama): specific for extraction. Tried it in research and it merged data from different companies — discarded, unreliable.
Quick takeaways
- The most expensive model is NOT the most profitable. For directory SEO, Qwen3-235B via Cerebras beats GPT-4 turbo at roughly one-twentieth of the cost.
- Combine cloud + local. A consumer GPU is no longer optional; it is real savings.
- Measure listing by listing, not by gut. A consistent LLM-as-judge + human cross-check = actionable data.
- Free tier is fallback, not primary. If your business relies on a free tier, it’s not a business.
Want to run this stack without setting it up?
Everything you just read is what SeoNova automates: you connect your WordPress and the pipeline manages 9 models, fail-over, evaluator and schema. No Ollama, no API keys, no scheduler.
Join the waitlist for 50% off the first 3 months. Launching autumn 2026.