
LLM benchmark for SEO: 7 cloud and local models tested on 168 real listings

Real-world benchmark of 7 LLMs writing SEO listings. Quality score, tokens/sec, cost per listing. Winner and why.

By SeoNova · Published · 8 min read
Bar chart comparing 7 LLM models (Qwen3-235B, GLM-4.7, GPT-OSS-120B, Gemini Flash, Qwen3:14b local, Mistral Medium, Qwen3:8b local) with their quality scores. Qwen3-235B via Cerebras marked as winner.

We’ve spent 8 months orchestrating LLM providers (“AIs”) to write SEO content. We’ve shipped more than 15 different models to real production (not synthetic benchmarks). This post is the comparison table we wish we’d found a year ago.

168 SEO directory listings, 7 models, 8 different niches. Each listing graded by an independent evaluator LLM. Let’s get to it.

The summary table

Model | Quality | Speed (tok/s) | Cost/listing | When to use it
Qwen3-235B-A22B (Cerebras) | 8.4 | 1500 | ~$0.002 | Production primary
GLM-4.7 (SiliconFlow) | 7.6 | 80 | ~$0.003 | Research/extraction
GPT-OSS-120B (Groq) | 7.2 | 500 | free* | Secondary, not primary
Gemini 2.5 Flash | 6.9 | 200 | ~$0.001 | Creative tasks
Qwen3:14b Q6_K (Ollama local) | 6.5 | 40 | ~€0 | Off-peak, fallback
Mistral Medium (free) | 5.9 | 150 | free* | Unworkable for production
Qwen3:8b (Ollama local) | 5.2 | 60 | ~€0 | Cheap research only

*free with rate limits that break production.

Methodology

For the table above to be trustworthy, the how matters:

The 168 listings

  • 8 distinct niches picked on purpose: dentist, mechanic, Japanese restaurant, gym, traffic lawyer, hair salon, plumber, accounting firm.
  • 7 models × 8 niches × 3 listings per niche = 168 listings total.
  • Same base prompt, same SEO template (H1 ≤ 60 chars, 2-3 H2s, bullets, bold on key data, JSON-LD schema), same upstream research.
  • Only the model writing phase 2 (first draft) changed. Phases 1, 3 and 4 stayed identical.
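The grid above is easy to write down explicitly. A minimal Python sketch: the model and niche labels are the ones from this post, while `build_grid` and the tuple layout are hypothetical names for illustration, not our actual pipeline code.

```python
from itertools import product

MODELS = [
    "Qwen3-235B-A22B (Cerebras)",
    "GLM-4.7 (SiliconFlow)",
    "GPT-OSS-120B (Groq)",
    "Gemini 2.5 Flash",
    "Qwen3:14b Q6_K (Ollama local)",
    "Mistral Medium (free)",
    "Qwen3:8b (Ollama local)",
]
NICHES = [
    "dentist", "mechanic", "Japanese restaurant", "gym",
    "traffic lawyer", "hair salon", "plumber", "accounting firm",
]
LISTINGS_PER_NICHE = 3


def build_grid():
    """Return one (model, niche, listing_index) job per listing to generate."""
    return list(product(MODELS, NICHES, range(LISTINGS_PER_NICHE)))


jobs = build_grid()
print(len(jobs))  # 7 * 8 * 3 = 168
```

Every model gets the same 24 (niche, listing) slots, so differences in score come from the writer, not from the mix of niches.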

The evaluator

Each listing was scored by GLM-4.7 acting as judge. It returns a JSON object:

{
  "score": 7.4,
  "checks": {
    "h1_length_ok": true,
    "bold_data_present": true,
    "internal_links_count": 3,
    "factual_consistency": true,
    "schema_valid": true
  },
  "issues": ["No CTA at the end", "H2 number 2 too generic"]
}
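
A judge that returns JSON is only useful if the output is checked before it is trusted. A minimal Python sketch of that check, assuming the three top-level keys shown above and the 0-10 score range; `parse_verdict` and `is_publishable` are hypothetical helper names, and 6.0 is the publishing threshold we mention further down.

```python
import json

PUBLISH_THRESHOLD = 6.0  # minimum publishing score quoted in this post


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON and reject malformed or out-of-range output."""
    verdict = json.loads(raw)
    score = verdict["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if not isinstance(verdict.get("checks"), dict):
        raise ValueError("checks must be an object")
    if not isinstance(verdict.get("issues"), list):
        raise ValueError("issues must be a list")
    return verdict


def is_publishable(verdict: dict) -> bool:
    """True when the score reaches the publishing threshold."""
    return verdict["score"] >= PUBLISH_THRESHOLD


example = '{"score": 7.4, "checks": {"h1_length_ok": true}, "issues": ["No CTA at the end"]}'
print(is_publishable(parse_verdict(example)))  # True
```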

To make sure GLM-4.7 wasn’t self-scoring kindly, we cross-checked: 50 listings written by GLM-4.7 evaluated by Qwen3-235B (a different evaluator). Average difference: 0.18 points. Trustworthy enough.

Human blind test

50 random listings were also rated by 3 humans (an SEO freelancer, a specialised copywriter, a WordPress developer). Correlation between LLM-judge and the human average: 0.86. Not perfect, but high.
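
Both sanity checks above reduce to two small statistics over paired scores. A minimal Python sketch, assuming "average difference" means mean absolute difference and "correlation" means Pearson's r (the post does not specify either); the sample scores are made up for illustration and are not benchmark data.

```python
from math import sqrt
from statistics import mean


def mean_abs_diff(a, b):
    """Average absolute gap between two raters scoring the same listings."""
    return mean(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation coefficient between two equally long score lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Made-up scores for four listings, only to show the calls.
judge_scores = [7.4, 6.0, 8.1, 5.5]
human_avg_scores = [7.0, 6.5, 8.0, 5.0]
print(round(mean_abs_diff(judge_scores, human_avg_scores), 2))
print(round(pearson(judge_scores, human_avg_scores), 2))
```

A high correlation says the judge ranks listings like the humans do; the mean absolute difference says how far apart the actual numbers sit. It is worth tracking both.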

Top of the chart: Qwen3-235B via Cerebras

Clear winner. 8.4/10 average quality, 1,500 tokens/second, $0.002 per full listing.

Why does it hit so hard?

  • Large model. 235B parameters with MoE (Mixture of Experts: activates only 22B per step, cheaper compute while keeping capacity).
  • Trained for SEO/writing. Alibaba (Qwen’s creator) trained it on lots of commercial and technical text. It shows.
  • Cerebras infrastructure. Its WSE-3 chips are obscenely fast compared to NVIDIA GPUs. 1,500 tok/s means a full listing (2,500 output tokens) is generated in ~1.7 seconds.
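
The 1.7-second figure is plain division, and the same arithmetic shows why speed matters at the other end of the table. A rough Python sketch using the speeds from the summary table, assuming every model writes the same 2,500 output tokens and ignoring network latency, queueing and time to first token.

```python
OUTPUT_TOKENS = 2500  # full listing size quoted above

SPEED_TOK_PER_S = {
    "Qwen3-235B-A22B (Cerebras)": 1500,
    "GLM-4.7 (SiliconFlow)": 80,
    "GPT-OSS-120B (Groq)": 500,
    "Gemini 2.5 Flash": 200,
    "Qwen3:14b Q6_K (Ollama local)": 40,
    "Mistral Medium (free)": 150,
    "Qwen3:8b (Ollama local)": 60,
}


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Pure decode time: output tokens divided by throughput."""
    return tokens / tok_per_s


for model, speed in SPEED_TOK_PER_S.items():
    print(f"{model}: {generation_seconds(OUTPUT_TOKENS, speed):.1f} s")
```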

The asterisk: Cerebras retired Qwen3-235B on May 27, 2026. Yes, that has already happened, and it caught us. You have to migrate to one of the next-in-chain models, or wait for Qwen3 v2.

Silver: GLM-4.7

7.6/10. Chinese model from Zhipu AI. Served by SiliconFlow with $5 free starting credit.

Its superpower: research and extraction (pulling data from HTML pages, RSS feeds, PDFs and returning structured JSON). On that specific task it scores 8.1/10, above Qwen3-235B.

That’s why we use it in phase 1 of the pipeline (explained in the 9-LLM post), not in phase 2 writing.

The surprise: local Ollama 14B

This one we didn’t see coming when we started the benchmark.

Qwen3:14b quantised to Q6_K (Q6_K is a quantisation format: shrinks the model from 28GB to ~12GB with minimal quality loss, so it fits on consumer GPUs) running on our RTX 5060 Ti 16GB scores 6.5/10.
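
The 28GB and ~12GB figures follow from bits per weight. A back-of-the-envelope Python sketch, assuming a flat 14 billion parameters, 16 bits per weight unquantised and roughly 6.56 bits per weight for Q6_K (the nominal llama.cpp figure, treated here as an assumption). Real files differ slightly because of metadata and mixed-precision tensors, and the context cache needs VRAM on top.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8


fp16_gb = model_size_gb(14, 16)      # unquantised 16-bit weights
q6k_gb = model_size_gb(14, 6.5625)   # assumed Q6_K average bits per weight

print(f"FP16: {fp16_gb:.1f} GB, Q6_K: {q6k_gb:.1f} GB")
print(f"Fits in 16 GB with headroom: {q6k_gb < 16}")
```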

6.5 is above our minimum publishing threshold (6.0). Which means a €580 gaming GPU gives us a production-grade fallback with zero API spend.

Throughput: 8-12 listings/hour. Not brutal, but running off-peak (Spanish small hours, Asian daytime) it ships ~120 listings/day.

Practical combination: Cerebras + SiliconFlow during the day, rotate to local Ollama at night. Cloud bill drops 40% with no output loss.
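
The day/night rotation can be as simple as a clock check in front of the writer. A minimal Python sketch; the 00:00-07:00 off-peak window and the backend labels are assumptions for illustration, and `pick_backend` expects a datetime already in local (Spanish) time.

```python
from datetime import datetime, time

OFF_PEAK_START = time(0, 0)  # assumed start of the off-peak window
OFF_PEAK_END = time(7, 0)    # assumed end of the off-peak window


def pick_backend(now: datetime) -> str:
    """Route to local Ollama during off-peak hours, to cloud otherwise."""
    if OFF_PEAK_START <= now.time() < OFF_PEAK_END:
        return "ollama-local"
    return "cloud"


print(pick_backend(datetime(2026, 1, 15, 3, 30)))   # ollama-local
print(pick_backend(datetime(2026, 1, 15, 14, 0)))   # cloud
```

Keeping the window in two constants makes it cheap to widen the local shift when the queue is deep.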

The avoidable loser: Mistral Medium free

Mistral launched a generous free tier in May 2026, announcing “free medium model with good rate limits”.

We tested it as primary for a week. Result:

  • Quality 5.9/10. Below our publishable threshold (6.0). Nearly a third of listings forced a rewrite.
  • Constant 429s. The “generous rate limit” was ~10 req/min. Our 200 listings/day production averages ~14 req/hour and peaks at 30/hour, and it kept breaking.
  • Inconsistency. The model behaved differently at different hours (our hypothesis: Mistral was A/B testing internally).

Reverted on day six. Mistral Medium ranks 5th or 6th in our fail-over chain, never primary.
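
A fail-over chain is what turns a rate-limited provider into a nuisance instead of an outage. A minimal Python sketch of the idea, not our production code: `RateLimited`, `generate_with_failover` and the stub providers are hypothetical, and a real wrapper would raise `RateLimited` when the API answers HTTP 429.

```python
class RateLimited(Exception):
    """Raised by a provider wrapper when the API answers HTTP 429."""


def generate_with_failover(prompt, providers):
    """Try each (name, call) pair in order; skip providers that are rate limited."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers rate limited: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def primary(prompt):
    raise RateLimited("429 Too Many Requests")


def secondary(prompt):
    return f"draft for: {prompt}"


used, draft = generate_with_failover(
    "dentist listing", [("primary", primary), ("secondary", secondary)]
)
print(used, "->", draft)  # secondary -> draft for: dentist listing
```

Only rate-limit errors fall through to the next provider; any other exception still surfaces, which is usually what you want while debugging.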

Real total cost

To process 6,000 listings/month (200/day × 30) at average quality ≥ 7.5:

Setup | Monthly cost
Paid cloud only (Cerebras + SiliconFlow + Gemini) | $40-60
Hybrid (cloud peak hours + Ollama off-peak) | $25-40
Local Ollama only (RTX 5060 Ti) | ~€30 electricity
OpenAI GPT-4o-mini only | ~$220
GPT-4 turbo only | ~$900

Yes, you read that right. The ratio between the “smart stack” (paid cloud at $40-60) and “GPT-4 turbo only” (~$900) is roughly 20×.
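
The ratios are quick to verify from the table. A small Python sketch using the dollar rows only (the local-only row is in euros, so it is left out); `ratio_range` is a hypothetical helper that compares the expensive setup against the cheap one at both ends of the cheap setup's range.

```python
LISTINGS_PER_MONTH = 200 * 30  # 6,000 listings, as stated above

# (low, high) monthly cost in USD, taken from the table
MONTHLY_COST_USD = {
    "Paid cloud only": (40, 60),
    "Hybrid": (25, 40),
    "GPT-4o-mini only": (220, 220),
    "GPT-4 turbo only": (900, 900),
}


def cost_per_listing(monthly_cost: float) -> float:
    """Monthly bill spread over the monthly listing volume."""
    return monthly_cost / LISTINGS_PER_MONTH


def ratio_range(expensive, cheap):
    """Smallest and largest cost ratio between two (low, high) ranges."""
    return expensive[0] / cheap[1], expensive[1] / cheap[0]


low, high = ratio_range(
    MONTHLY_COST_USD["GPT-4 turbo only"], MONTHLY_COST_USD["Paid cloud only"]
)
print(f"GPT-4 turbo vs paid cloud: {low:.1f}x to {high:.1f}x")
print(f"GPT-4 turbo per listing: ${cost_per_listing(900):.2f}")
```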

What’s NOT in the table that probably should be

Three models we tested but didn’t include in the final comparison:

  • Claude 3.7 Sonnet (Anthropic): quality ~8.5, but it costs 30× as much as Cerebras. Unworkable for our volume.
  • DeepSeek V3: quality ~7.4, but the API had weekly downtime when we ran the test. They claim stability now.
  • NuExtract:3.8b (local Ollama): specific for extraction. Tried it in research and it merged data from different companies — discarded, unreliable.

Quick takeaways

  1. The most expensive model is NOT the most profitable. For directory SEO, Qwen3-235B via Cerebras beats GPT-4 turbo at 20× less cost.
  2. Combine cloud + local. A consumer GPU is no longer optional, it’s real savings.
  3. Measure listing by listing, not by gut. A consistent LLM-as-judge + human cross-check = actionable data.
  4. Free tier is fallback, not primary. If your business relies on free tier, it’s not a business.

Want to run this stack without setting it up?

Everything you just read is what SeoNova automates: you connect your WordPress and the pipeline manages 9 models, fail-over, evaluator and schema. No Ollama, no API keys, no scheduler.

Join the waitlist for 50% off the first 3 months. Launching autumn 2026.

Frequently asked questions

The questions we hear the most about this topic

Why Cerebras instead of OpenAI directly?
Three reasons. Speed: Cerebras serves Qwen3-235B at 1,500 tokens/second (OpenAI GPT-4 sits at 50-100 tok/sec). Quality: Qwen3-235B-A22B scores ~8.4 vs GPT-4o ~8.6 in our tests, practically a tie. Cost: Cerebras Dev Tier is $10 prepaid that lasts ~2 weeks of our production; matching that throughput on OpenAI costs $900/month. At volume, the math is brutal.
How did you measure quality without biasing the result?
Each listing is evaluated by a model different from the one that wrote it (LLM-as-judge). By default we use GLM-4.7 as the evaluator because it scores hard and consistently. The judge returns 0-10 + a structured list of issues (H1 length off, missing bold, weak internal linking, etc.). To verify the judge wasn't biased, we ran a double-blind with humans on 50 random listings: correlation 0.86, high enough to trust.
Is GPT-OSS-120B on Groq worth it?
Yes, with an asterisk. It is OpenAI's open-weight model, with solid quality (7.2 in our tests), and Groq serves it free up to a threshold. The catch: Groq's free tier has aggressive rate limits during peak hours (a 429 every 3-4 requests). We use it as second-in-chain, never as primary.
Is local Ollama worth the trouble vs paying for cloud?
Depends on volume and timing. If your business tolerates listings being generated during off-peak hours, local Ollama with a 16GB GPU (RTX 4070 Ti Super / 5060 Ti) ships 8-12 listings/hour at ~6.5/10 quality. Marginal cost: ~€30 in electricity per month. If you need 200+ listings/day during specific hours, you need cloud; local can't keep up.
What happened with Mistral free that barely shows up?
Mistral launched a generous free tier in May 2026 and we ran it as primary for a week. Result: constant 429s (rate limit) and quality of 5.9 vs ~8.4 from Qwen3-235B on Cerebras. Reverted on day six. Free tier is fine for experimentation, but it never holds for production.
