In plain terms
Fine-tuning is the process of taking a general-purpose AI model (like Llama 3.1) and continuing to train it on your own data so it learns to behave in a way specific to your organisation: using your terminology, following your formats, reflecting your policies. The result is a model that's significantly better at your specific tasks than the base model.
When fine-tuning makes sense
- You want the AI to answer in your organisation's voice and style
- The base model doesn't know your industry's terminology
- You have hundreds of example question-answer pairs from your domain
- RAG alone isn't giving accurate enough results
- You want to teach the model a specific task format (extract fields from forms)
QLoRA
Quantised Low-Rank Adaptation: a technique that dramatically reduces the GPU memory needed for fine-tuning. Instead of updating all the weights in a model (which would need 4 to 8× the model's VRAM), QLoRA trains only a small set of adapter layers and keeps the rest of the model frozen in 4-bit compressed format. This makes it possible to fine-tune a 7B or 8B model on the RTX Pro 6000.
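To make the memory saving concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a single 4096 × 4096 weight matrix and an adapter rank of 16; both numbers are illustrative choices, not Cezen settings. The memory estimate counts weights only and ignores activations, optimizer state, and quantisation overhead.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in one dense layer of shape d_out x d_in."""
    return d_out * d_in


def adapter_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative layer size and rank (assumptions, not taken from any specific model config).
d = 4096
rank = 16

full = full_params(d, d)
adapter = adapter_params(d, d, rank)
fraction = adapter / full

print(f"Full layer weights:    {full:,}")
print(f"Adapter weights:       {adapter:,}")
print(f"Trainable fraction:    {fraction:.2%}")

# Weights-only footprint of an 8-billion-parameter model at two precisions.
print(f"8B model at 16-bit:    {weight_memory_gb(8_000_000_000, 16):.1f} GB")
print(f"8B model at 4-bit:     {weight_memory_gb(8_000_000_000, 4):.1f} GB")
```

With these numbers the adapter holds 131,072 weights against 16,777,216 in the full layer, so under 1% of that layer is trained, while the frozen 4-bit copy of an 8B model takes roughly a quarter of the space of its 16-bit version.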
Unsloth
Unsloth is a Python library that makes QLoRA fine-tuning 2 to 4× faster and uses 40 to 70% less memory than standard QLoRA implementations. It achieves this through hand-optimised GPU kernels. For Cezen Entry tier with the RTX Pro 6000, Unsloth is the recommended way to fine-tune: it makes jobs that would otherwise take 10 hours complete in 3 to 4 hours.
Prepare your training data
Fine-tuning needs examples in a structured format, typically a JSONL file where each line is a conversation: a prompt and the ideal response. A minimum of 50 to 100 high-quality examples is needed; 500 to 2,000 is better. Quality matters more than quantity.
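As a minimal sketch of what such a file can look like, the Python below writes two examples to a JSONL file and then checks every line. The field names "prompt" and "response", the file name train.jsonl, and the sample content are assumptions for illustration; use whatever schema your notebook template expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical examples; real data would come from your own domain.
examples = [
    {"prompt": "What is our standard refund window?",
     "response": "Refunds are accepted within 30 days of purchase."},
    {"prompt": "Extract the invoice number from: 'Invoice INV-1042, due 1 May'.",
     "response": "INV-1042"},
]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path, required=("prompt", "response")):
    """Return a list of (line_number, problem) tuples; an empty list means the file is clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((number, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "line is not a JSON object"))
                continue
            for key in required:
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((number, f"missing or empty '{key}'"))
    return problems


# Write to a temporary directory so the sketch leaves no files behind in your project.
data_path = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(examples, data_path)
print(validate_jsonl(data_path))
```

Running a check like this before training catches malformed lines early, which is cheaper than discovering them partway through a run.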
Open Jupyter and set up the training script
Open Jupyter at http://ai.local:8888. Load the Cezen fine-tuning notebook template (provided by your administrator) or start from scratch with the Unsloth documentation examples.
Configure and run training
Set your base model, data file path, and training parameters (epochs, learning rate). A typical fine-tuning run for Llama 3.1 8B on 500 examples takes 45 to 90 minutes on the Entry tier RTX Pro 6000.
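The sketch below shows one way to keep these settings in one place and estimate how many optimizer steps a run will take. The TrainConfig class, its field names, and its default values are hypothetical starting points, not Cezen defaults. The load_model_with_adapters function follows the pattern in Unsloth's published examples (FastLanguageModel.from_pretrained and FastLanguageModel.get_peft_model); treat it as an assumption to check against the Unsloth version installed on your system. It is only imported when called, since it needs a GPU.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # All names and defaults here are illustrative assumptions.
    base_model: str = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
    data_path: str = "train.jsonl"
    epochs: int = 3
    learning_rate: float = 2e-4
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    max_seq_length: int = 2048
    lora_rank: int = 16


def optimizer_steps(num_examples: int, cfg: TrainConfig) -> int:
    """Approximate number of weight updates; trainers may round slightly differently."""
    effective_batch = cfg.per_device_batch_size * cfg.gradient_accumulation_steps
    return math.ceil(num_examples / effective_batch) * cfg.epochs


def load_model_with_adapters(cfg: TrainConfig):
    """Load the base model in 4-bit and attach LoRA adapters (requires unsloth and a GPU)."""
    from unsloth import FastLanguageModel  # imported lazily so the rest runs anywhere

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=cfg.base_model,
        max_seq_length=cfg.max_seq_length,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=cfg.lora_rank,
        lora_alpha=cfg.lora_rank,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    return model, tokenizer


cfg = TrainConfig()
steps = optimizer_steps(500, cfg)
print(f"Roughly {steps} optimizer steps for 500 examples over {cfg.epochs} epochs")
```

With 500 examples, an effective batch of 8, and 3 epochs, this comes to about 189 steps. Knowing the step count helps when choosing logging and checkpoint intervals.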
Export to Ollama format
Once training is complete, export the fine-tuned model to GGUF format using Unsloth's export function. Then load it into Ollama with ollama create my-model -f Modelfile and use it in Open WebUI like any other model.
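A minimal sketch of the hand-off to Ollama follows. It writes a Modelfile that points at the exported GGUF file and builds the ollama create command without running it. The GGUF path and the system prompt are hypothetical; the exported file name depends on your Unsloth version and quantisation choice, so check the export directory. In Unsloth's examples the export call has the form model.save_pretrained_gguf(directory, tokenizer, quantization_method="q4_k_m"), which is shown here only as a comment.

```python
import tempfile
from pathlib import Path

# Export step, run inside the training notebook (shown as a comment because it needs
# the trained model in memory; the form follows Unsloth's published examples):
#   model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method="q4_k_m")


def write_modelfile(gguf_path: str, out_dir: str, system_prompt: str = "") -> Path:
    """Write a minimal Ollama Modelfile that loads the given GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    modelfile = Path(out_dir) / "Modelfile"
    modelfile.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return modelfile


def ollama_create_command(model_name: str, modelfile: Path) -> list:
    """Build the argument list for `ollama create`; pass it to subprocess.run to execute."""
    return ["ollama", "create", model_name, "-f", str(modelfile)]


out_dir = tempfile.mkdtemp()
modelfile = write_modelfile(
    "./gguf_out/model.gguf",  # hypothetical path to the exported file
    out_dir,
    system_prompt="You answer in our organisation's house style.",
)
command = ollama_create_command("my-model", modelfile)
print(modelfile.read_text(encoding="utf-8"))
print(" ".join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if the model name or path contains spaces.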
Test and iterate
Compare your fine-tuned model against the base model on your specific tasks. If it's not where you need it, add more training examples focused on the weak areas and re-run.
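One simple way to make this comparison repeatable is a small script that sends the same prompts to both models and scores the answers. The sketch below uses Ollama's /api/generate endpoint and a keyword-match score. The host address, the model names, the test case, and the scoring rule are all assumptions for illustration; a keyword check is a crude proxy, so read a sample of answers by hand as well.

```python
import json
import urllib.request


def ollama_generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Ask an Ollama server for a single, non-streamed completion."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def keyword_score(answer: str, keywords: list) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = answer.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits / len(keywords)


def compare_models(cases: list, generate, models: list) -> dict:
    """Return the mean keyword score per model over all test cases."""
    totals = {model: 0.0 for model in models}
    for case in cases:
        for model in models:
            answer = generate(model, case["prompt"])
            totals[model] += keyword_score(answer, case["keywords"])
    return {model: totals[model] / len(cases) for model in models}


# Hypothetical test case; build yours from the tasks the model is meant to handle.
test_cases = [
    {"prompt": "What is our standard refund window?", "keywords": ["30 days"]},
]

# To run against a live server (model names are assumptions):
#   scores = compare_models(test_cases, ollama_generate, ["llama3.1:8b", "my-model"])
#   print(scores)
```

Because compare_models takes the generate function as a parameter, you can swap in a stub to test the scoring logic without a running server. Cases where the fine-tuned model scores low point to the weak areas that need more training examples.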