3 min read

Speed Up Your AI App: One Header, 8 Seconds → 1.5 Seconds

Introducing the X-Thinking-Mode header. For tools, translation, and lightweight chat apps, one extra line cuts AI response time by 5×.

announcement · performance · ai · gemini

TL;DR

If your app is a tool, translator, simple chatbot, JSON generator, or anything similarly lightweight — add this header to your /api/ai/gemini requests:

X-Thinking-Mode: fast

Average response time drops from ~8s to ~1.5s. If you don't send the header, nothing changes.


Why is your AI call slow?

We dug into our production logs recently and found something surprising: on most AI calls, the model spends 60–80% of its time "thinking" — internal reasoning the user never sees.

Concretely: Gemini 3 Flash generates an average of 1,001 thinking tokens per call, while the user-facing answer is only 434 tokens. The model is doing 2.3× more thinking than answering.

For some workloads this helps — complex role-play, multi-step reasoning, long-context callbacks. But for most lightweight tasks ("translate this," "summarize this," "give me JSON"), thinking doesn't help much. It's just latency.

Now you can opt out

We added an opt-in header: X-Thinking-Mode. Three values:

| Mode | What it does | Best for |
| --- | --- | --- |
| `fast` | `thinkingLevel: minimal` | Tools, translation, short chat, classifiers, JSON generation |
| `balanced` | `thinkingBudget: 200` | Medium-complexity tasks that benefit from a little reasoning |
| _(omit)_ | Current default behavior | Long-form RPG, multi-turn role-play, complex narratives |

`fast` doesn't kill thinking entirely — it's Google's adaptive setting. The model uses 0 thinking tokens on simple prompts and the minimum necessary on complex ones. Quality loss is smaller than you'd expect.
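Conceptually, each mode maps to a Gemini `thinkingConfig`. A minimal sketch of that mapping, assuming the field names shown in the table above (`thinkingLevel`, `thinkingBudget`) are what the platform sets on your behalf:

```typescript
type ThinkingMode = 'fast' | 'balanced'

// Map an X-Thinking-Mode value to the thinkingConfig the platform would apply.
// Returning undefined means "no header sent": the platform default stays in effect.
function thinkingConfigFor(mode?: ThinkingMode): Record<string, unknown> | undefined {
  switch (mode) {
    case 'fast':
      return { thinkingLevel: 'minimal' } // adaptive: 0 tokens on simple prompts
    case 'balanced':
      return { thinkingBudget: 200 }      // small fixed budget for light reasoning
    default:
      return undefined                    // omit header → current default behavior
  }
}
```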

How to use it

Add one header:

```js
const response = await fetch('/api/ai/gemini', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Thinking-Mode': 'fast',  // 👈 add this
  },
  body: JSON.stringify({
    path: '/v1beta/models/gemini-3-flash-preview:generateContent',
    contents: [{ parts: [{ text: 'Translate "hello world" to French' }], role: 'user' }],
  }),
})
```

The header is per-request, so you can mix modes in the same app:

  • Player input → quick NPC reply → use fast
  • Critical plot turn / ending resolution → omit the header, let the model think
  • Structured JSON tool calls → use balanced
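In code, mixing modes comes down to one branch per call site. A small sketch (the `aiHeaders` helper is hypothetical, not part of any platform SDK):

```typescript
type ThinkingMode = 'fast' | 'balanced'

// Build the request headers once per call; pass no mode to keep default behavior.
function aiHeaders(mode?: ThinkingMode): Record<string, string> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' }
  if (mode) headers['X-Thinking-Mode'] = mode
  return headers
}

// Quick NPC reply:        fetch('/api/ai/gemini', { headers: aiHeaders('fast'), ... })
// Structured JSON call:   fetch('/api/ai/gemini', { headers: aiHeaders('balanced'), ... })
// Critical plot turn:     fetch('/api/ai/gemini', { headers: aiHeaders(), ... })
```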

When should you use it?

Strongly recommend `fast` for:

  • Translation, summarization, paraphrasing
  • Simple chatbots
  • Utility apps (calculators, document tools, code snippets)
  • Classification, tagging
  • Short NPC dialogue
  • Anything where the prompt is clear and the output is short

Use `balanced` for:

  • Longer creative writing
  • Medium-stakes role-play turns
  • Structured output (JSON tool calls)

Stay on default (no header) for:

  • Long-form narrative generation
  • Multi-character consistency in one segment
  • Long-context callbacks ("remember the X from chapter 5?")
  • Critical decisions (ending triggers, rule resolution)

Trade-offs

fast mode may slightly degrade quality on:

  • Multi-step reasoning (math, logic puzzles)
  • Outputs that need long-range coherence (consistent details across paragraphs)
  • Structured outputs with many fields (occasional missing keys when the schema is complex)

If you notice quality drops, remove the header to restore default behavior, or switch to `balanced` as a middle ground.

How does it interact with thinkingConfig you already set?

If you've explicitly set `thinkingConfig` in generationConfig, the platform will not override it with the header. Your code wins.

Precedence (high → low):

  1. Your thinkingConfig in the request body
  2. X-Thinking-Mode header
  3. Platform defaults
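The precedence rule above can be sketched as a single resolution function. The names here are illustrative, not the platform's actual server code:

```typescript
interface ResolveInput {
  bodyThinkingConfig?: Record<string, unknown>  // explicit thinkingConfig in generationConfig
  headerMode?: 'fast' | 'balanced'              // X-Thinking-Mode header, if sent
}

// Stand-in for whatever the current default behavior is.
const PLATFORM_DEFAULT: Record<string, unknown> = {}

function resolveThinkingConfig({ bodyThinkingConfig, headerMode }: ResolveInput): Record<string, unknown> {
  if (bodyThinkingConfig) return bodyThinkingConfig               // 1. your code wins
  if (headerMode === 'fast') return { thinkingLevel: 'minimal' }  // 2. header
  if (headerMode === 'balanced') return { thinkingBudget: 200 }
  return PLATFORM_DEFAULT                                         // 3. platform defaults
}
```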

Why now?

A lot of creators have asked us "why is AI a bit slow?" When we looked at the data, we found that almost no one had made a deliberate choice about thinking — 97% of calls were running on whatever the default happened to be. Giving you a simple opt-in is the cleanest way to hand control back.

Next up we're adding a graphical "performance mode" toggle to the creator dashboard for non-developers. The API layer comes first.


Try it out. If you hit issues or notice quality drops, let us know on Discord or GitHub Issues.
