Technical deep dive

How Memo AI actually works.

Seven AI model tiers (including Auto-router). Persistent memory, RAG knowledge base, voice loop and Artifacts panel in v2.0. 40+ engines across 9 infrastructure partners. FLUX.2 klein for image editing, Gemini for live search, DeepSeek V3.2 for frontier reasoning. £0/month.

The request flow.

Every message you send passes through eight stages, most of which complete in under 200 milliseconds.

User sends message

Frontend captures input, attachments (PDF/Excel/Word/images), voice, or drag-drop. Streams via SSE.
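For readers unfamiliar with SSE, the sketch below shows how a single Server-Sent Events frame is typically serialised: an optional `event:` line, a `data:` line, then a blank line. It is written in Python for brevity. The `sse_event` helper and the payload fields are illustrative assumptions, not Memo AI's actual code.

```python
import json


def sse_event(payload, event=None):
    """Serialise one SSE frame: optional event name, JSON data, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the data always fits on one line.
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


# Hypothetical frames: one token, then an end-of-stream marker.
token_frame = sse_event({"token": "Hi"})
end_frame = sse_event({"done": True}, event="end")
```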

Quota check

Per-user, per-model daily limit verified. 1K Smart · 500 Reasoner · 1K Live · 5K Fast · 800 Coder · 500 Vision.
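A minimal sketch of how a per-user, per-model daily check might look. The limits are the ones listed above. The `QuotaTracker` class, its in-memory counter, and the method name are assumptions made for illustration; a real service would persist the counts.

```python
from collections import defaultdict
from datetime import date

# Daily per-user limits as listed above (messages per user per day).
DAILY_LIMITS = {
    "smart": 1000,
    "reasoner": 500,
    "live": 1000,
    "fast": 5000,
    "coder": 800,
    "vision": 500,
}


class QuotaTracker:
    """Hypothetical in-memory counter keyed by user, model and calendar day."""

    def __init__(self, limits):
        self.limits = limits
        self.counts = defaultdict(int)  # (user_id, model, day) -> messages used

    def try_consume(self, user_id, model, today=None):
        """Record one message and return True, or return False at the limit."""
        day = today or date.today()
        key = (user_id, model, day)
        if self.counts[key] >= self.limits[model]:
            return False
        self.counts[key] += 1
        return True
```

Because the day is part of the key, counts reset naturally at midnight without a cleanup job.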

Intent detection

Auto-detect: image generation? Image editing? Web search? Vision? Document analysis? Code? Plain chat?
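The detector's real signals are not documented here, so the following is only a rule-based sketch of the idea: look at attachments first, then at keywords. Every rule, label and file extension below is an assumption chosen for illustration.

```python
import re

# Illustrative keyword rules, checked in order.
RULES = [
    ("image_edit", re.compile(r"\b(edit|retouch|remove the background)\b", re.I)),
    ("image_generation", re.compile(r"\b(draw|generate an image|create an image)\b", re.I)),
    ("web_search", re.compile(r"\b(latest|today|news|current price)\b", re.I)),
    ("code", re.compile(r"\b(function|stack trace|refactor|debug)\b", re.I)),
]

DOCUMENT_EXTENSIONS = (".pdf", ".xlsx", ".docx")
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")


def detect_intent(text, attachments=()):
    """Return one intent label for a message, checking attachments first."""
    names = [name.lower() for name in attachments]
    has_image = any(n.endswith(IMAGE_EXTENSIONS) for n in names)
    if any(n.endswith(DOCUMENT_EXTENSIONS) for n in names):
        return "document_analysis"
    for label, pattern in RULES:
        if label == "image_edit" and not has_image:
            continue  # editing needs an image to work on
        if pattern.search(text):
            return label
    # An image with no matching keyword is treated as a vision question.
    return "vision" if has_image else "chat"
```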

Cascade routing

Selected model's cascade fires. Multiple API connections across 7 providers rotate via round-robin — no single connection is always first.
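Round-robin rotation can be sketched in a few lines. The `RoundRobinPool` class and the connection labels are hypothetical; the point is only that each request starts from the next connection in the ring.

```python
import itertools


class RoundRobinPool:
    """Hand out connections in rotation so no single one is always first."""

    def __init__(self, connections):
        if not connections:
            raise ValueError("at least one connection is required")
        self._cycle = itertools.cycle(list(connections))

    def acquire(self):
        """Return the next connection in the rotation."""
        return next(self._cycle)


# Hypothetical connection labels; real keys would come from secure config.
pool = RoundRobinPool(["groq-key-a", "groq-key-b", "groq-key-c"])
order = [pool.acquire() for _ in range(4)]
```

After three requests the rotation wraps around, so the fourth request reuses the first connection.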

Inter-model fallback

If a provider returns 429 or 5xx, the request falls through to the next cascade step in <50ms. Smart has 11 steps, Reasoner 9, Fast 9, Coder 9, Vision 8, Live 4.
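The fall-through rule can be sketched as a loop over cascade steps. This is an illustration under stated assumptions: `ProviderError`, `is_retryable` and `run_cascade` are hypothetical names, and each step is modelled as a plain callable rather than a real HTTP client.

```python
class ProviderError(Exception):
    """Raised by a hypothetical provider call; carries the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status):
    """429 (rate limited) and any 5xx mean 'try the next step'."""
    return status == 429 or 500 <= status <= 599


def run_cascade(steps, prompt):
    """Try each (label, call) step in order; return (label, reply) on success."""
    last_error = None
    for label, call in steps:
        try:
            return label, call(prompt)
        except ProviderError as err:
            if not is_retryable(err.status):
                raise  # e.g. a 400 will not be fixed by switching engines
            last_error = err
    raise RuntimeError("all cascade steps failed") from last_error
```

Returning the label alongside the reply is what lets the interface show which engine actually answered.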

Stream tokens

Real-time SSE streaming. <think> blocks are separated and hidden behind a toggle. Backend label + latency shown per reply.
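Separating `<think>` blocks from the visible answer might look like the sketch below. It works on a completed reply; a streaming version would also need to buffer tags that are split across chunks, which is not shown. The function name is an assumption.

```python
import re

# Non-greedy match so several think blocks in one reply stay separate.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(reply):
    """Return (visible_text, reasoning_blocks) for a completed reply."""
    reasoning = [block.strip() for block in THINK_BLOCK.findall(reply)]
    visible = THINK_BLOCK.sub("", reply).strip()
    return visible, reasoning
```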

Markdown rendering

Custom parser: tables, syntax-highlighted code blocks, lists, images. Theme-aware styling in light and dark mode.

Persist & log

Chat history saved to encrypted Supabase. File attachments persist in Cloudflare R2. Events logged for admin dashboard.

Model parameters.

Specs for every brain inside Memo AI — total parameters, active parameters, context window and primary strength.

Smart

DeepSeek V3.2 (685B MoE)

Total params: 685B

Active: 37B

Context: 128K

Knowledge: April 2025

General · Writing · Reasoning

Reasoner

DeepSeek V3.2 + GPT-OSS 120B

Total params: 685B

Active: 37B

Context: 128K

Knowledge: December 2024

Code · Logic · Maths

Live

Gemini 2.5 Flash + Search

Total params: not disclosed

Active: not disclosed

Context: 1M

Knowledge: Real-time

Current events · Web search

Fast

Cerebras Llama 3.1 8B

Total params: 8B

Active: 8B

Context: 128K

Knowledge: December 2023

Speed · Quick lookups

Coder

DeepSeek V3.2 (685B MoE)

Total params: 685B

Active: 37B

Context: 128K

Knowledge: April 2025

Code · Refactor · Debug

Vision

SambaNova Llama 4 Maverick

Total params: 400B

Active: 17B

Context: 128K

Knowledge: August 2024

Image OCR · Visual analysis

Model benchmarks.

How our primary AI engines compare to the industry leaders. These are real benchmark scores from published evaluations — not marketing claims.

Benchmark	DeepSeek V3.2	GPT-4o	Claude Sonnet	Gemini 2.5
MMLU (knowledge)	87.1%	88.7%	88.7%	90.0%
HumanEval (code)	92.7%	90.2%	92.0%	89.5%
MATH-500 (maths)	90.2%	76.6%	78.3%	83.2%
SWE-bench (coding)	42.0%	33.2%	49.0%	38.8%
Arena ELO (overall)	1318	1287	1271	1299
Context window	128K	128K	200K	1M
Cost to you	£0	£20+/user	£18+/user	£18+/user

DeepSeek V3.2 (685B MoE, 37B active) is the primary Smart, Reasoner, and Coder engine in Memo AI. It is a mixture-of-experts model: only 37 billion parameters (about 5% of the total) are active for any given token, while the router can select experts from the full 685B. In the table above it outperforms GPT-4o on maths (MATH-500) and code (HumanEval). Memo AI accesses it at no cost through SambaNova Cloud's free tier.

Cascade depth & capacity.

How many free-tier engines each model can fall back through, and how many messages each user gets per day. More depth = more resilient. More capacity = more freedom.

Cascade depth

Engines per model — if one fails, next fires in <50ms

Smart: 11 engines

Coder: 9 engines

Reasoner: 9 engines

Fast: 9 engines

Vision: 8 engines

Live: 4 engines

Daily capacity

Messages per user per day (10 active users)

Fast: 5,000 /day

Auto: 2,000 /day

Smart: 1,000 /day

Live: 1,000 /day

Coder: 800 /day

Reasoner: 500 /day

Vision: 500 /day
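To see what these per-user limits imply for the whole team, the arithmetic can be sketched as follows, using the chart's own assumption of 10 active users. The function name is illustrative.

```python
# Per-user daily limits from the chart above.
PER_USER_PER_DAY = {
    "Fast": 5000, "Auto": 2000, "Smart": 1000, "Live": 1000,
    "Coder": 800, "Reasoner": 500, "Vision": 500,
}
ACTIVE_USERS = 10  # the chart's stated assumption


def team_totals(per_user, active_users):
    """Messages per day across the team if every active user hits the limit."""
    return {model: limit * active_users for model, limit in per_user.items()}


totals = team_totals(PER_USER_PER_DAY, ACTIVE_USERS)
```

With 10 active users this gives 10,000 Smart messages and 50,000 Fast messages per day as the upper bound.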

Multi-provider redundancy.

Every provider runs independently. If Groq goes down entirely, SambaNova picks up. If SambaNova is busy, Cerebras fires. If all primary providers fail, OpenRouter's 17 :free models catch the request. Total redundancy.

API key distribution

Multiple connections per provider — each rotates per request to spread load

Google (Gemini 2.5 Flash, Flash Lite, Gemini 3 Flash Preview): 11 connections

Tavily (web search and research for Live mode): 6 connections

SambaNova (DeepSeek V3.2/V3.1, Llama 4 Maverick): 4 connections

Cerebras (Qwen 3 235B; Llama 3.1 8B at 2,000 tok/sec): 4 connections

Cloudflare (FLUX.2 klein 9B/4B image gen + R2 file storage): 4 connections

OpenRouter (GPT-OSS, Nemotron, GLM-4.5, Gemma 4, Arcee Trinity): 3 connections

Groq (GPT-OSS 120B/20B, Llama 3.3/3.1, Qwen 3, Scout): 2 connections

Anthropic (Claude Haiku 4.5, final OCR fallback): 1 connection

Throughput per provider.

Combined requests per minute (RPM) and tokens per minute (TPM) across all connections per provider. Higher = more concurrent users supported without rate limits.

Requests per minute

Combined RPM across all connections

Groq: 540 rpm

Cerebras: 120 rpm

SambaNova: 40 rpm

Gemini: 120 rpm

OpenRouter: 40 rpm

Tokens per minute

Combined TPM capacity

Groq: 2.7M tpm

Cerebras: 4.0M tpm

SambaNova: 2.0M tpm

Gemini: 3.0M tpm

OpenRouter: 1.0M tpm
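As a rough aggregate of the two charts, the figures can be summed and compared. This sketch assumes the per-provider limits are independent and simply additive, which providers may not guarantee; the variable names are illustrative.

```python
# Combined per-provider figures from the two charts above.
RPM = {"Groq": 540, "Cerebras": 120, "SambaNova": 40, "Gemini": 120, "OpenRouter": 40}
TPM = {
    "Groq": 2_700_000,
    "Cerebras": 4_000_000,
    "SambaNova": 2_000_000,
    "Gemini": 3_000_000,
    "OpenRouter": 1_000_000,
}

total_rpm = sum(RPM.values())
total_tpm = sum(TPM.values())

# Average token budget per request if both limits were saturated together.
tokens_per_request = {name: TPM[name] // RPM[name] for name in RPM}
```

On these numbers the five providers add up to 860 requests and 12.7M tokens per minute. Groq has the most requests but the smallest per-request token budget (5,000), so long prompts are more likely to hit its token limit first.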

Performance by the numbers.

Time to first token (lower is faster) and tokens per second (higher is faster). Measured live via scripts/full-key-audit.mjs on 12 May 2026.

Time to first token

Lower = snappier feel

Smart: 813 ms

Reasoner: 850 ms

Live: 2,291 ms

Fast: 41 ms

Vision: 44 ms

Tokens per second

Higher = streams faster

Fast: 2,000 tok/s

Vision: 180 tok/s

Reasoner: 120 tok/s

Smart: 85 tok/s
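The two measurements combine into a rough estimate of how long a full reply takes: time to first token plus the time to stream the remaining tokens. The sketch below takes the per-tier figures at face value and ignores network jitter and fallbacks; the function name is an assumption.

```python
# Per-tier figures from the charts above (tiers missing from either chart are omitted).
TTFT_MS = {"Smart": 813, "Reasoner": 850, "Fast": 41, "Vision": 44}
TOKENS_PER_SECOND = {"Smart": 85, "Reasoner": 120, "Fast": 2000, "Vision": 180}


def estimated_reply_seconds(tier, reply_tokens):
    """Time to first token plus streaming time for the whole reply."""
    return TTFT_MS[tier] / 1000 + reply_tokens / TOKENS_PER_SECOND[tier]
```

Under these assumptions, a 500-token reply would take roughly 6.7 seconds on Smart and about 0.3 seconds on Fast.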

Token economics.

Total daily processing capacity across all models, all providers, all engines combined. And what it costs Memo Fashion.

🧠

Live models

Across Groq, Cerebras, SambaNova, Gemini, OpenRouter & Cloudflare

📊

25M

Tokens / day

Combined free-tier throughput — ~90 novels every day

🔑

Failover connections

Groq · Cerebras · SambaNova · Gemini · OpenRouter · Cloudflare · Tavily

💰

£0

Cost / month

Free-tier infrastructure, including FLUX.2 klein image gen + edit

Memo AI vs the rest.

How Memo AI stacks up against the leading commercial AI products. We deliberately built the features that matter most for Memo Fashion's workflow.

Feature	Memo AI	ChatGPT	Claude	Gemini
Real-time streaming	✓	✓	✓	✓
7 models in one app	✓	✗	✗	✗
Web search built in	✓	✓	✗	✓
Image generation (FLUX.2)	✓	✓	✗	✓
Image editing (FLUX.2 klein)	✓	✓	✗	✓
Document upload (10 files)	✓	✗	✓	✗
Excel/Word/PDF download	✓	✓	✗	✗
Attachments persist in chat	✓	✓	✓	✗
Memory across all chats	✓	✓	✓	✗
50 saved conversations	✓	✓	✓	✓
Visible reasoning trace	✓	✓	✓	✗
Temporary chat	✓	✓	✗	✗
Light & dark mode	✓	✓	✓	✓
PWA installable	✓	✓	✗	✗
Voice input	✓	✓	✗	✗
Drag & drop files	✓	✓	✓	✗
Custom branding	✓	✗	✗	✗
Fully owned & in-house	✓	✗	✗	✗
Cost to Memo Fashion	£0	£20+/user	£18+/user	£18+/user

Daily limits.

Realistic limits for 15 staff on 100% free-tier infrastructure. Every model cascades 4 to 11 engines deep, so if one provider is rate-limited, the next fires in under 50ms. You'll never notice.

Smart

1,000

messages / user / day

10,000 total across 10 active users · SambaNova DeepSeek V3.2 (685B MoE) · Groq GPT-OSS 120B · Llama 4 Maverick · DeepSeek V3.1 · Cerebras Qwen 3 235B · Groq + OpenRouter fallbacks

Reasoner

500

messages / user / day

5,000 total across 10 active users · SambaNova DeepSeek V3.2 · DeepSeek V3.1 · V3.1-cb · Groq GPT-OSS 120B · Cerebras Qwen 3 235B · Llama 4 Maverick

Live

1,000

messages / user / day

10,000 total across 10 active users · Gemini (Flash + Lite + Gemini 3) with 11 rotating connections · Tavily search

Fast

5,000

messages / user / day

50,000 total across 10 active users · Cerebras Llama 3.1 8B (2,000 tok/sec) · Groq GPT-OSS 20B (41ms) · Llama 3.1 8B · OpenRouter

Vision

500

messages / user / day

5,000 total across 10 active users · SambaNova Llama 4 Maverick · Gemini 2.5 Flash + Flash Lite · Groq Scout · Gemma 4 · Nemotron VL

Coder

800

messages / user / day

8,000 total across 10 active users · SambaNova DeepSeek V3.2 · Groq GPT-OSS 120B · Cerebras Qwen 3 235B · DeepSeek V3.1 · Groq Qwen 3 32B · OpenRouter :free (GLM-4.5, GPT-OSS, Qwen3 Coder, Arcee)

Never fails.

Every mode has a deep cascade of free-tier engines. If Groq is down, SambaNova fires. If SambaNova is busy, Cerebras takes over. If all primary providers are exhausted, OpenRouter :free models catch the request. Switching takes under 50ms. Users never see an error, just a different engine label.

SMART CASCADE (11 STEPS)

1. SambaNova DeepSeek V3.2 (685B)
2. Groq GPT-OSS 120B
3. SambaNova Llama 4 Maverick
4. SambaNova DeepSeek V3.1
5. Cerebras Qwen 3 235B
6. Groq Llama 3.3 70B
7. OpenRouter GPT-OSS 120B :free
8. OpenRouter Nemotron 120B :free
9. OpenRouter GLM-4.5 Air :free
10. OpenRouter Arcee Trinity :free
11. OpenRouter Gemma 3 12B :free

FAST CASCADE (9 STEPS)

1. Cerebras Llama 3.1 8B (2,000 tok/sec)
2. Groq GPT-OSS 20B (41ms)
3. Groq Llama 3.1 8B Instant
4. OpenRouter Nemotron Nano 9B :free
5. OpenRouter Liquid LFM 2.5 :free
6. OpenRouter GPT-OSS 20B :free
7. OpenRouter Gemma 3 4B :free
8. OpenRouter Gemma 3n 4B :free
9. OpenRouter Gemma 3n 2B :free

LIVE CASCADE (4 STEPS)

1. Gemini 2.5 Flash + Google Search
2. Gemini 2.5 Flash Lite + Search
3. Gemini 3 Flash Preview + Search
4. Groq GPT-OSS 120B + Tavily
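One way to keep cascades like these maintainable is to store them as plain ordered data rather than branching code. The sketch below encodes the Live cascade exactly as listed above. The tuple layout and the `providers_in` helper are assumptions made for illustration.

```python
# Cascade order for Live mode as listed above: (provider, model, search backend).
LIVE_CASCADE = [
    ("Google", "Gemini 2.5 Flash", "Google Search"),
    ("Google", "Gemini 2.5 Flash Lite", "Google Search"),
    ("Google", "Gemini 3 Flash Preview", "Google Search"),
    ("Groq", "GPT-OSS 120B", "Tavily"),
]


def providers_in(cascade):
    """Distinct providers in first-use order."""
    seen = []
    for provider, _model, _search in cascade:
        if provider not in seen:
            seen.append(provider)
    return seen
```

Listing the distinct providers makes it easy to spot that the first three Live steps share one provider, so only the last step protects against a Google-wide outage.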

Your data stays yours.

Chat history is encrypted in Supabase (PostgreSQL). File attachments are stored in Cloudflare R2 with per-user isolation. No data is sent to consumer AI products (ChatGPT, Claude.ai, Gemini app, Copilot). All processing routes through API-only inference endpoints with explicit no-training contracts. Anthropic Claude Haiku is used only as an emergency OCR fallback for receipts, under Anthropic's API privacy policy (zero retention, no training).
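Per-user isolation in object storage is commonly achieved by giving each user their own key prefix. The sketch below shows that idea only. The `users/<id>/` layout and both helper names are hypothetical, and `user_id` is assumed to be an opaque identifier with no slashes.

```python
import posixpath


def object_key(user_id, filename):
    """Build a storage key under the user's own prefix (hypothetical layout)."""
    # Keep only the final path component so "../" cannot escape the prefix.
    safe_name = posixpath.basename(filename.replace("\\", "/"))
    if not safe_name or safe_name in (".", ".."):
        raise ValueError("invalid filename")
    return f"users/{user_id}/{safe_name}"


def can_access(user_id, key):
    """A user may only read keys under their own prefix."""
    return key.startswith(f"users/{user_id}/")
```

The trailing slash in the prefix check matters: without it, user `u4` would match keys belonging to user `u42`.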

🔒

Encrypted

Chat history

Supabase PostgreSQL with RLS policies

☁️

File storage

Cloudflare R2 · 10GB free · unlimited egress

🚫

Zero

Training on your data

API-only inference — providers cannot use your inputs

🏢