OmniTutor
▸ streaming architecture · zero-spinner spec

It must never feel slow.

On-demand generation is OK. Cold start is OK. Spinners are not. Every click triggers visible engagement within 400ms — drama, owl voice, subject animation — while Haiku drafts the first beat in parallel and Sonnet streams the rest behind the scenes.
▸ The hard guarantee
From click to tutor speaking: 3 seconds. From click to first visual: 400ms. Never blank, never spinner, never blocking.
≤ 0.4sto first visual
≤ 1.5sto owl speaking
≤ 3sto beat 1 rendered
0spinners ever

1.The sequence

Four lanes run in parallel: the user, the UI (drama modal + owl), Haiku (fast first beat), Sonnet (streamed remainder). The user never sees a blank moment because the UI is already filled before the model has answered.

time
USER
student
UI
drama + owl
HAIKU
fast · plan + B1
SONNET
stream · B2–B7
T+0s
Click. Topic chip, search, or text submit.
Drama modal opens instantly. Subject background animation starts (cascade · curve · parchment · scripts · ?-stars).
API call fired with topic + context. Haiku begins generating plan + Beat 1.
— waiting —
T+0.4s
watching the animation · subject context registers
Owl appears in the corner of the modal. Speech bubble starts typing the subject-aware greeting.
drafting…
T+1.0s
reads bubble · is now already engaged with the subject
Bubble fully typed. Animation continues. Owl gentle breathing.
drafting…
T+1.5s
owl is now narrating · audio playing
TTS audio starts. Owl mouth lip-syncs to incoming audio. Beat 1 outline visible behind drama animation.
Plan + Beat 1 returned. UI starts hydrating.
Sonnet call fired. Beats 2–7 generation starts.
T+3.0s
sees beat 1 rendered · listens to tutor
Drama modal dissolves into the lesson. Beat 1 board fully rendered. Tutor narrates "and to start, look here…"
done
Streaming… Beat 2 partial · 3 not started.
T+8s
on Beat 1 · interacting with Tutor
Beat 1 active. Right-rail plan shows beats 2–4 already populated, 5–7 still coming.
done
Beats 2–4 done. Beats 5–7 streaming.
T+15s
still on Beat 1. When ready to advance, Beat 2 is already there.
All beats hydrated. Going-further panel populated. Adjust-level available.
done. Cached for next session.

2.Model choice per role

The right model for the right job. Haiku is fast enough to start the conversation; Sonnet is rich enough to teach.

RoleModelLatency budgetWhy this model
First response · plan + Beat 1Haiku 4.5≤ 1.5sSpeed-first. Outline-quality is fine — the student is in the drama modal anyway. Get to "Tutor is speaking" as fast as possible.
Streaming beats · 2–7Sonnet 4.6~8–15s total · streamedQuality matters here. Rich worked examples, derivations, intuition. Streamed in parallel while student is on Beat 1.
Hard derivations · physics & mathOpus 4.7backgroundReserved for STEM heavy lifting where mistakes are expensive. Used selectively, not for every beat.
Audio narration · TTSElevenLabs / OpenAI~200ms TTFB · streamedVoice streams as Tutor speaks. Owl mouth lip-syncs to audio amplitude. No "loading" between sentences.
In-lesson Q&A · Ask AIHaiku 4.5≤ 2sMid-lesson questions need to feel instant. Haiku is the conversational model.
Adjust-level reworkSonnet 4.6~3s · with shimmerRe-plan on student feedback. The shimmer + audio-regen makes 3s feel intentional, not slow.
Visualize / Animate chipsSonnet + canvas tool~4–6s · dramaTriggered on demand. Drama modal again while diagram/sim is generated.

3.Invocation paths

The student can invoke a session two ways. Both flow through the same engine — different first frames.

▸ free-form text
Type or speak anything
From the search bar in discovery, or the empty Ask-AI textarea. Topic is unconstrained — could be anything in the world.
type "Bohr atom" setup modal (pre-filled topic)
plan modal (Haiku + drama)
active runtime (Sonnet streaming)
▸ landing-page click
Click a tile
From discovery, subject pages, paths, recents. Topic is structured — we know the subject family, level hint, exam target.
click chip "Hooke's law" setup (pre-filled · level hint)
plan modal (subject-aware drama)
active runtime (Sonnet streaming)

4.The rules

Architectural constraints we don't relax. Each rule is a guarantee to the student.

▸ Never
No spinners. Anywhere. Not in the runtime, not in modals, not in chip clicks. If we see a spinner, we've broken the contract.
▸ Never
No blank screens. Every navigation produces visible content within 400ms — drama, owl, cached preview, or skeleton with subject context.
▸ Never
No blocking single-model calls. If we're waiting on a model, the UI must already be entertaining the student.
▸ Never
No "please wait" copy. The owl is allowed to say "let me think" once. After that, it teaches.
▸ Always
Drama is on-topic. Subject-aware backdrop, owl bubble that previews the lesson. The student is being taught from second zero.
▸ Always
Beats stream behind the scenes. By the time the student finishes Beat 1, Beats 2–4 are ready. By Beat 4, all 7 are done. Zero waiting between beats.
▸ Always
TTS streams sentence-by-sentence. Owl starts speaking as the audio bytes arrive. No "audio loading" pause.
▸ Always
First-time content gets cached for everyone. The next student to ask the same question gets it instantly. Cache grows organically toward Canvas-A's hand-crafted polish.

5.How this powers the milestones

▸ M1 · "It works, and it never feels slow"

Every feature in the tutor functioning end-to-end. Both invocation paths live. Haiku-first + Sonnet-stream + drama modal is the default cold-start pattern. All content LLM-generated; no caching beyond the browser. The tutor is real — slow on absolute first hit, but the student never feels it because the drama covers the gap.

▸ M2 · "Always instant"

The most-asked content gets pre-generated and cached, Canvas-A style. Hot path: cached lessons load in <200ms — drama modal is a half-second flicker, then full lesson. Cold path: still on-demand, still <3s, still no spinners. Same architecture; cache is a performance optimization, not an experience change.

▸ Why this scales

The drama modal is a UI investment, not infrastructure. We can run on slower models, smaller GPU budgets, or in regions with high model latency — and the student still feels engaged. As infra improves, the drama modal shrinks toward instant. The student never notices the change.