To use a model, open Open WebUI and select it from the model dropdown at the top of the chat. You can switch models at any time without losing your conversation.
Chat & Reasoning Models
| Model | Best for | Speed | VRAM | Context |
|---|---|---|---|---|
| llama3.1:8b (Recommended, start here) | General Q&A, document chat, writing, summarising | Very Fast | ~5 GB | 128K tokens |
| llama3.2:3b (Lightweight) | Quick lookups, simple questions, high-concurrency use | Fastest | ~2 GB | 128K tokens |
| mistral:7b (Reasoning) | Structured tasks, code, logical reasoning, JSON extraction | Very Fast | ~5 GB | 32K tokens |
| llama3.1:70b (High accuracy) | Complex reasoning, legal/technical analysis, nuanced writing | Slower | ~42 GB | 128K tokens |
| gemma2:9b (Efficient) | Instruction following, structured outputs, multilingual | Very Fast | ~6 GB | 8K tokens |
| phi3:mini (Microsoft) | Simple tasks, drafting, lightweight deployments | Fastest | ~2.3 GB | 128K tokens |
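If you want to check which chat models fit a given GPU, the figures above can be turned into a small lookup. This is only a rough sketch: the `CHAT_MODELS` dictionary and the `models_that_fit` helper are hypothetical names made up for illustration, the numbers are the approximate VRAM and context values listed above (with 128K, 32K and 8K written as round thousands), and real memory use grows as conversations get longer.

```python
# Approximate figures copied from the model list above (not measured values).
CHAT_MODELS = {
    "llama3.1:8b": {"vram_gb": 5.0, "context_tokens": 128_000},
    "llama3.2:3b": {"vram_gb": 2.0, "context_tokens": 128_000},
    "mistral:7b": {"vram_gb": 5.0, "context_tokens": 32_000},
    "llama3.1:70b": {"vram_gb": 42.0, "context_tokens": 128_000},
    "gemma2:9b": {"vram_gb": 6.0, "context_tokens": 8_000},
    "phi3:mini": {"vram_gb": 2.3, "context_tokens": 128_000},
}


def models_that_fit(vram_budget_gb, min_context_tokens=0):
    """Return model names that fit the VRAM budget, smallest footprint first.

    Hypothetical helper: it only compares against the approximate table
    figures and ignores the extra memory used by long conversations.
    """
    fits = [
        name
        for name, spec in CHAT_MODELS.items()
        if spec["vram_gb"] <= vram_budget_gb
        and spec["context_tokens"] >= min_context_tokens
    ]
    return sorted(fits, key=lambda name: CHAT_MODELS[name]["vram_gb"])


if __name__ == "__main__":
    # Models that fit on an 8 GB card.
    print(models_that_fit(8))
    # Same card, but only models that can read long documents in one go.
    print(models_that_fit(8, min_context_tokens=100_000))
```

On an 8 GB card this leaves out only llama3.1:70b; adding the long-context filter also drops mistral:7b and gemma2:9b.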
Embedding Models
Embedding models don't generate text; they convert documents and queries into numbers (vectors) that ChromaDB uses for semantic search. You don't select these in chat; they run automatically when you upload documents.
| Model | Best for | Dimensions | VRAM |
|---|---|---|---|
| nomic-embed-text (Default for RAG) | General document search and Q&A | 768 | ~270 MB |
| mxbai-embed-large (Higher accuracy) | Better retrieval accuracy for complex technical documents | 1024 | ~670 MB |
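To make "semantic search over vectors" a little more concrete, here is a toy sketch of the idea. It assumes cosine similarity as the comparison measure, which is a common choice but not something this guide confirms about the ChromaDB setup. The 3-number vectors and document names are invented for illustration; real embeddings from the models above have 768 or 1024 numbers each.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors standing in for real document embeddings.
documents = {
    "leave policy": [0.9, 0.1, 0.0],
    "server setup guide": [0.1, 0.9, 0.2],
    "expense claims": [0.7, 0.0, 0.6],
}

# Invented vector standing in for the embedding of a question about holidays.
query_vector = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)

if __name__ == "__main__":
    for name in ranked:
        score = cosine_similarity(query_vector, documents[name])
        print(f"{name}: {score:.2f}")
```

The document whose vector points in nearly the same direction as the query ranks first, even though no keywords are compared.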
Which model should I use?
Everyday chat and Q&A: llama3.1:8b. Fast enough to feel instant, capable enough for most office tasks.
Chat with documents: llama3.1:8b. Large 128K context window means it can read long documents in one go.
High concurrent users: llama3.2:3b. Very low VRAM footprint means more users can be served simultaneously.
Complex analysis: llama3.1:70b. Significantly more accurate for nuanced reasoning, legal, and technical work. Slower.
Code generation: mistral:7b. Strong at structured outputs, JSON, SQL, Python, and logical step-by-step tasks.
Multilingual tasks: gemma2:9b. Good multilingual coverage for Indian regional languages and English.
Context window explained: The context window is how much text the model can read at once. 128K tokens is roughly 90,000 words, enough for most full documents. If your document is very large, split it into sections or use ChromaDB to chunk and search it automatically.
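As a rough way to check whether a document fits, the rule of thumb above (128K tokens for about 90,000 words) can be turned into a quick estimate. This is a sketch under stated assumptions: the helper names are hypothetical, counting whitespace-separated words is only an approximation of how a model counts tokens, and `reserve_tokens` is an invented allowance for your question and the model's answer.

```python
# Rule of thumb from this guide: 128K tokens is roughly 90,000 words.
TOKENS_PER_WORD = 128_000 / 90_000  # about 1.42


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_in_context(text, context_tokens=128_000, reserve_tokens=4_000):
    """True if the text likely fits, leaving room for the question and answer."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens


def split_into_sections(text, max_tokens):
    """Split text into word-based sections of at most about max_tokens each."""
    max_words = max(1, int(max_tokens / TOKENS_PER_WORD))
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    document = "word " * 100_000  # stand-in for a 100,000-word document
    print(estimate_tokens(document))
    print(fits_in_context(document))
    sections = split_into_sections(document, max_tokens=4_000)
    print(len(sections))
```

A 100,000-word document comes out above the 128K limit here, so it would need to be split or searched in chunks rather than pasted in whole.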