QLoRA / Unsloth — Nexus One AI Portal

What is fine-tuning?

In plain terms

Fine-tuning is the process of taking a general-purpose AI model (like Llama 3.1) and continuing to train it on your own data so it learns to behave in a way specific to your organisation — using your terminology, following your formats, reflecting your policies. The result is a model that's significantly better at your specific tasks than the base model.

When fine-tuning makes sense

You want the AI to answer in your organisation's voice and style
The base model doesn't know your industry's terminology
You have hundreds of example question-answer pairs from your domain
RAG alone isn't giving accurate enough results
You want to teach the model a specific task format (extract fields from forms)

What QLoRA and Unsloth are

QLoRA

Quantised Low-Rank Adaptation — a technique that dramatically reduces the GPU memory needed for fine-tuning. Instead of updating all the weights in a model (which would need 4–8× the model's VRAM), QLoRA only trains a small set of adapter layers and keeps the rest of the model in 4-bit compressed format. This makes it possible to fine-tune a 7B or 8B model on the RTX Pro 6000.

Unsloth

Unsloth is a Python library that makes QLoRA fine-tuning 2–4× faster and uses 40–70% less memory than standard QLoRA implementations. It achieves this through hand-optimised GPU kernels. For Cezen Entry tier with the RTX Pro 6000, Unsloth is the recommended way to fine-tune — it makes jobs that would otherwise take 10 hours complete in 3–4 hours.

💡 Fine-tuning is an advanced task. You'll need training data in the right format and some Python knowledge. Start by exploring Open WebUI and RAG with ChromaDB — for most use cases, RAG gives 80% of the benefit of fine-tuning with 10% of the effort.

How to approach a fine-tuning project

Prepare your training data

Fine-tuning needs examples in a structured format — typically a JSONL file where each line is a conversation: a prompt and the ideal response. A minimum of 50–100 high-quality examples is needed; 500–2000 is better. Quality matters more than quantity.

Open Jupyter and set up the training script

Open Jupyter at http://ai.local:8888. Load the Cezen fine-tuning notebook template (provided by your administrator) or start from scratch with the Unsloth documentation examples.

Configure and run training

Set your base model, data file path, and training parameters (epochs, learning rate). A typical fine-tuning run for Llama 3.1 8B on 500 examples takes 45–90 minutes on the Entry tier RTX Pro 6000.

Export to Ollama format

Once training is complete, export the fine-tuned model to GGUF format using Unsloth's export function. Then load it into Ollama with ollama create my-model -f Modelfile and use it in Open WebUI like any other model.

Test and iterate

Compare your fine-tuned model against the base model on your specific tasks. If it's not where you need it, add more training examples focused on the weak areas and re-run.

Works with

📓 Jupyter 🦙 Ollama 💬 Open WebUI 📡 DCGM