In plain terms
Ollama is the software that loads AI models onto your GPU and serves them. Think of it as the engine room: you don't interact with it directly day-to-day, but everything else (Open WebUI, LangChain, your applications) sends requests to Ollama to get AI responses.
What it handles
- Loading and unloading AI models from GPU memory
- Processing requests and returning responses
- Downloading new models from the library
- Managing multiple models on the same system
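To make "processing requests and returning responses" concrete, here is a minimal Python sketch of how an application might call Ollama over its local HTTP API. It assumes Ollama is listening on its default address (http://localhost:11434) and that the model llama3.1:8b is already downloaded; the helper names build_generate_request and ask are hypothetical and used only for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one complete JSON reply instead of a token stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Requires a running Ollama server; nothing is sent until this is called.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Calling ask("llama3.1:8b", "Say hello in one sentence.") would return the model's reply as a string. Tools such as Open WebUI make similar requests on your behalf.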
Access Points
See which models are installed
Open a terminal and run ollama list. You'll see all downloaded models, with each model's ID, size, and when it was last modified.
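The same information is available to scripts through Ollama's local HTTP API. The sketch below assumes the default address (http://localhost:11434) and that the /api/tags reply contains a "models" list whose entries have "name" and "size" (bytes) fields; the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
TAGS_URL = "http://localhost:11434/api/tags"


def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest model first."""
    models = tags_response.get("models", [])
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_installed_models(timeout: float = 10.0) -> list[tuple[str, float]]:
    """Query the running Ollama server and summarize its installed models."""
    with urllib.request.urlopen(TAGS_URL, timeout=timeout) as resp:
        return summarize_models(json.load(resp))
```

Sorting by size makes it easy to spot which models take up the most disk space.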
Download a new model
Run ollama pull llama3.1:8b to download a model. Replace llama3.1:8b with any model name from the Ollama library. Note: your system may be air-gapped, so check with your administrator before attempting.
Chat with a model in the terminal
Run ollama run llama3.1:8b to start a direct chat session in your terminal. Type /bye to exit.
Check what's currently loaded
Run ollama ps to see which models are currently loaded, how much memory each one uses, and whether it is running on the GPU (using VRAM) or the CPU.
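For monitoring scripts, the local HTTP API exposes the same view at /api/ps. The sketch below is illustrative only: it assumes each entry in the reply's "models" list carries "name", "size" (total bytes), and "size_vram" (bytes held in GPU memory) fields, and the function name gpu_share is hypothetical.

```python
def gpu_share(ps_response: dict) -> dict[str, float]:
    """Map each loaded model to the percentage of its memory held in VRAM.

    Assumes the /api/ps field names "name", "size", and "size_vram".
    """
    shares = {}
    for model in ps_response.get("models", []):
        total = model.get("size", 0)
        in_vram = model.get("size_vram", 0)
        # Guard against a zero or missing size to avoid dividing by zero.
        shares[model["name"]] = round(100 * in_vram / total, 1) if total else 0.0
    return shares
```

A value below 100 would suggest that part of the model has spilled out of GPU memory, which usually means slower responses.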
Remove a model to free up space
Run ollama rm model-name to delete a model from disk and free storage space. Replace model-name with a name exactly as shown by ollama list.