3 min read

Speed Up Your AI App: One Header, 8 Seconds → 1.5 Seconds

Introducing the X-Thinking-Mode header. For tools, translation, and lightweight chat apps, one extra line cuts AI response time by 5×.

announcement · performance · ai · gemini

TL;DR

If your app is a tool, translator, simple chatbot, JSON generator, or anything similarly lightweight — add this header to your /api/ai/gemini requests:

X-Thinking-Mode: fast

Average response time drops from ~8s to ~1.5s. If you don't send the header, nothing changes.


Why is your AI call slow?

We dug into our production logs recently and found something surprising: on most AI calls, the model spends 60–80% of its time "thinking" — internal reasoning the user never sees.

Concretely: Gemini 3 Flash generates an average of 1,001 thinking tokens per call, while the user-facing answer is only 434 tokens. The model is doing 2.3× more thinking than answering.

For some workloads this helps — complex role-play, multi-step reasoning, long-context callbacks. But for most lightweight tasks ("translate this," "summarize this," "give me JSON"), thinking doesn't help much. It's just latency.

Now you can opt out

We added an opt-in header: X-Thinking-Mode. Three values:

| Mode | What it does | Best for |
| --- | --- | --- |
| `fast` | `thinkingLevel: minimal` | Tools, translation, short chat, classifiers, JSON generation |
| `balanced` | `thinkingBudget: 200` | Medium-complexity tasks that benefit from a little reasoning |
| _(omit)_ | Current default behavior | Long-form RPG, multi-turn role-play, complex narratives |

`fast` doesn't kill thinking entirely — it's Google's adaptive setting. The model uses 0 thinking tokens on simple prompts and the minimum necessary on complex ones. Quality loss is smaller than you'd expect.
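Conceptually, each mode maps to a Gemini `thinkingConfig`. A minimal sketch of that mapping, assuming the field names shown in the table above (`thinkingLevel`, `thinkingBudget`) are what the platform sets on your behalf:

```typescript
type ThinkingMode = 'fast' | 'balanced'

// Map an X-Thinking-Mode value to the thinkingConfig the platform would apply.
// Returning undefined means "no header sent": the platform default stays in effect.
function thinkingConfigFor(mode?: ThinkingMode): Record<string, unknown> | undefined {
  switch (mode) {
    case 'fast':
      return { thinkingLevel: 'minimal' } // adaptive: 0 tokens on simple prompts
    case 'balanced':
      return { thinkingBudget: 200 }      // small fixed budget for light reasoning
    default:
      return undefined                    // omit header → current default behavior
  }
}
```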

How to use it

Add one header:

```js
const response = await fetch('/api/ai/gemini', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Thinking-Mode': 'fast',  // 👈 add this
  },
  body: JSON.stringify({
    path: '/v1beta/models/gemini-3-flash-preview:generateContent',
    contents: [{ parts: [{ text: 'Translate "hello world" to French' }], role: 'user' }],
  }),
})
```

The header is per-request, so you can mix modes in the same app:

  • Player input → quick NPC reply → use fast
  • Critical plot turn / ending resolution → omit the header, let the model think
  • Structured JSON tool calls → use balanced
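In code, mixing modes comes down to one branch per call site. A small sketch (the `aiHeaders` helper is hypothetical, not part of any platform SDK):

```typescript
type ThinkingMode = 'fast' | 'balanced'

// Build the request headers once per call; pass no mode to keep default behavior.
function aiHeaders(mode?: ThinkingMode): Record<string, string> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' }
  if (mode) headers['X-Thinking-Mode'] = mode
  return headers
}

// Quick NPC reply:        fetch('/api/ai/gemini', { headers: aiHeaders('fast'), ... })
// Structured JSON call:   fetch('/api/ai/gemini', { headers: aiHeaders('balanced'), ... })
// Critical plot turn:     fetch('/api/ai/gemini', { headers: aiHeaders(), ... })
```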

When should you use it?

Strongly recommend `fast` for:

  • Translation, summarization, paraphrasing
  • Simple chatbots
  • Utility apps (calculators, document tools, code snippets)
  • Classification, tagging
  • Short NPC dialogue
  • Anything where the prompt is clear and the output is short

Use `balanced` for:

  • Longer creative writing
  • Medium-stakes role-play turns
  • Structured output (JSON tool calls)

Stay on default (no header) for:

  • Long-form narrative generation
  • Multi-character consistency in one segment
  • Long-context callbacks ("remember the X from chapter 5?")
  • Critical decisions (ending triggers, rule resolution)

Trade-offs

fast mode may slightly degrade quality on:

  • Multi-step reasoning (math, logic puzzles)
  • Outputs that need long-range coherence (consistent details across paragraphs)
  • Structured outputs with many fields (occasional missing keys when the schema is complex)

If you notice quality drops, remove the header to restore default behavior, or switch to `balanced` as a middle ground.

How does it interact with thinkingConfig you already set?

If you've explicitly set `thinkingConfig` in generationConfig, the platform will not override it with the header. Your code wins.

Precedence (high → low):

  1. Your thinkingConfig in the request body
  2. X-Thinking-Mode header
  3. Platform defaults
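The precedence rule above can be sketched as a single resolution function. The names here are illustrative, not the platform's actual server code:

```typescript
interface ResolveInput {
  bodyThinkingConfig?: Record<string, unknown>  // explicit thinkingConfig in generationConfig
  headerMode?: 'fast' | 'balanced'              // X-Thinking-Mode header, if sent
}

// Stand-in for whatever the current default behavior is.
const PLATFORM_DEFAULT: Record<string, unknown> = {}

function resolveThinkingConfig({ bodyThinkingConfig, headerMode }: ResolveInput): Record<string, unknown> {
  if (bodyThinkingConfig) return bodyThinkingConfig               // 1. your code wins
  if (headerMode === 'fast') return { thinkingLevel: 'minimal' }  // 2. header
  if (headerMode === 'balanced') return { thinkingBudget: 200 }
  return PLATFORM_DEFAULT                                         // 3. platform defaults
}
```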

Why now?

A lot of creators have asked us "why is AI a bit slow?" When we looked at the data, we found that almost no one had made a deliberate choice about thinking — 97% of calls were running on whatever the default happened to be. Giving you a simple opt-in is the cleanest way to hand control back.

Next up we're adding a graphical "performance mode" toggle to the creator dashboard for non-developers. The API layer comes first.


Try it out. If you hit issues or notice quality drops, let us know on Discord or GitHub Issues.
