Fine-tune a custom LLM with QLoRA

Shipfastai’s Enterprise tier includes a complete QLoRA fine-tuning pipeline under products/enterprise/scripts/finetune/. QLoRA (Quantized Low-Rank Adaptation) lets you train a large language model on consumer or mid-range cloud GPUs by loading the base model in 4-bit precision and training only a small set of adapter weights. When training is done, you merge those adapters back into the base model and deploy the result as a standard HuggingFace model, which you can then serve through a HuggingFace inference endpoint or through the built-in OpenAIProvider pointed at an OpenAI-compatible inference server.

The fine-tuning scripts are only available in the Enterprise tier. Upgrade your license before following the steps below.

Prerequisites

Before running any training, make sure you have the following in place.

GPU with 16 GB+ VRAM

A 7B-parameter model requires roughly 10–14 GB of VRAM in 4-bit mode. The A100 40 GB, RTX 3090, and RTX 4090 all work well. Smaller models (1B–3B) fit on an RTX 3080.
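As a rough sanity check on these figures, the sketch below estimates the memory taken by the quantized weights alone. It assumes exactly 4 bits per parameter and ignores quantization constants, activations, adapter weights, and optimizer state, which make up the rest of the VRAM budget. The function name is illustrative and not part of the Enterprise scripts.

```python
def quantized_weight_gb(num_params: float, bits_per_param: int = 4) -> float:
    """Approximate size of the model weights alone, in gigabytes (10**9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Weights only: real training needs extra headroom on top of these numbers.
    for label, params in [("1B", 1e9), ("3B", 3e9), ("7B", 7e9)]:
        print(f"{label}: ~{quantized_weight_gb(params):.1f} GB of 4-bit weights")
```

For a 7B model this gives about 3.5 GB of weights, which shows that most of the quoted VRAM requirement comes from training overhead rather than from the model itself.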

Python 3.11+

The scripts use Python 3.11 type annotations. Check your version with python --version and upgrade if needed.

If you do not have a suitable local GPU, cloud GPU providers like RunPod and Lambda Labs offer hourly instances with A100s and H100s. Mount your dataset and output directory from persistent storage so checkpoints survive instance restarts.

Install the Enterprise dependencies alongside the base and Pro requirements:

pip install -r products/enterprise/requirements-enterprise.txt

This installs the following key packages:

Package	Purpose
`transformers>=4.37.0`	Model loading and tokenization
`peft>=0.8.0`	LoRA adapter training with PEFT
`bitsandbytes>=0.42.0`	4-bit quantization
`datasets>=2.16.0`	Dataset loading and preprocessing
`accelerate>=0.26.0`	Multi-GPU and mixed-precision training
`trl>=0.7.10`	Supervised fine-tuning utilities
`huggingface-hub>=0.20.0`	Pushing merged models to the Hub
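To confirm that the installed packages meet the minimums in the table, a small check like the one below can help. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it treats pre-release and post-release suffixes as equal to the base release).

```python
from importlib import metadata

# Minimum versions copied from the table above.
MINIMUMS = {
    "transformers": "4.37.0",
    "peft": "0.8.0",
    "bitsandbytes": "0.42.0",
    "datasets": "2.16.0",
    "accelerate": "0.26.0",
    "trl": "0.7.10",
    "huggingface-hub": "0.20.0",
}


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '4.37.0' into (4, 37, 0), keeping only the leading digits of each part."""
    parts: list[int] = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first part with no leading digits, e.g. 'post1'
        parts.append(int(digits))
    # Pad to three components so '4.37' compares equal to '4.37.0'.
    return tuple(parts + [0] * max(0, 3 - len(parts)))


def check_installed(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Return {package: problem} for every package that is missing or too old."""
    problems: dict[str, str] = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (need >={minimum})"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[name] = f"found {installed}, need >={minimum}"
    return problems


if __name__ == "__main__":
    for package, problem in check_installed().items():
        print(f"{package}: {problem}")
```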

Preparing your dataset

The training script expects a JSONL file where each line is a JSON object with a messages key containing a list of chat turns. This is the standard chat-template format used by most instruction-tuned models:

data/train.jsonl

{"messages": [{"role": "system", "content": "You are a customer support agent."}, {"role": "user", "content": "How do I cancel my subscription?"}, {"role": "assistant", "content": "You can cancel your subscription from the Billing page in your dashboard."}]}
{"messages": [{"role": "user", "content": "What payment methods do you accept?"}, {"role": "assistant", "content": "We accept all major credit cards, PayPal, and bank transfers."}]}
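Before training, it can be worth checking that every line of the file matches this shape. The validator below is a minimal sketch: the set of allowed roles and the requirement that each example contains at least one assistant turn are assumptions, not rules confirmed by the training script.

```python
import json

# Assumed role names, taken from the examples above.
ALLOWED_ROLES = ("system", "user", "assistant")


def validate_record(record: object) -> list[str]:
    """Return a list of problems found in one parsed JSONL record."""
    if not isinstance(record, dict) or "messages" not in record:
        return ["missing 'messages' key"]
    messages = record["messages"]
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    errors: list[str] = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            errors.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            errors.append(f"turn {i}: unexpected role {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"turn {i}: 'content' must be a non-empty string")
    if not any(isinstance(t, dict) and t.get("role") == "assistant" for t in messages):
        errors.append("no assistant turn to learn from")
    return errors


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    problems: dict[int, list[str]] = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems[line_number] = [f"invalid JSON: {exc.msg}"]
                continue
            errors = validate_record(record)
            if errors:
                problems[line_number] = errors
    return problems
```

Calling validate_jsonl("data/train.jsonl") returns an empty dictionary when every line is well formed.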

If your data is in a different format — for example a JSON array with instruction, input, and output fields — use the prepare_data.py script to convert and split it:

python products/enterprise/scripts/finetune/prepare_data.py \
  --input data/raw.json \
  --output-dir data/processed/ \
  --instruction-key instruction \
  --input-key input \
  --output-key output \
  --system-prompt "You are a helpful assistant." \
  --train-ratio 0.9

This produces data/processed/train.jsonl (90%) and data/processed/val.jsonl (10%), both in the messages chat format.

Running QLoRA training

Run qlora_train.py with your dataset and chosen base model. The default base model is mistralai/Mistral-7B-v0.1, but any HuggingFace causal LM works.

python products/enterprise/scripts/finetune/qlora_train.py \
  --model-name mistralai/Mistral-7B-v0.1 \
  --train-file data/processed/train.jsonl \
  --val-file data/processed/val.jsonl \
  --output-dir outputs/my-model-adapter \
  --num-epochs 3 \
  --batch-size 4 \
  --lora-r 64 \
  --lora-alpha 16 \
  --learning-rate 2e-4

Key hyperparameters:

Flag	Default	Description
`--model-name`	`mistralai/Mistral-7B-v0.1`	HuggingFace model ID or local path
`--num-epochs`	`3`	Number of full passes over the training set
`--batch-size`	`4`	Per-device training batch size
`--lora-r`	`64`	LoRA rank — higher values capture more adaptation at the cost of memory
`--lora-alpha`	`16`	LoRA scaling factor
`--lora-dropout`	`0.1`	Dropout applied to LoRA layers
`--learning-rate`	`2e-4`	AdamW learning rate
`--max-length`	`2048`	Maximum token length per example
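To make the LoRA flags more concrete, the sketch below computes the scaling factor (alpha divided by rank, the standard LoRA convention) and the number of trainable parameters that one adapted weight matrix adds. The class and function names are hypothetical, and the mapping to peft.LoraConfig shows one plausible way the flags could be wired up, not the script's confirmed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """Defaults mirror the flag table above."""
    r: int = 64
    alpha: int = 16
    dropout: float = 0.1

    @property
    def scaling(self) -> float:
        """Factor applied to the low-rank update: alpha / r."""
        return self.alpha / self.r


def adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is r x d_in and B is d_out x r, so the total is r * (d_in + d_out).
    return r * (d_in + d_out)


def to_peft_config(hp: LoraHyperparams, target_modules: list[str]):
    """Build a peft.LoraConfig; peft is imported lazily so the arithmetic
    above runs without it. The target module names are supplied by the caller."""
    from peft import LoraConfig

    return LoraConfig(
        r=hp.r,
        lora_alpha=hp.alpha,
        lora_dropout=hp.dropout,
        target_modules=list(target_modules),
        bias="none",
        task_type="CAUSAL_LM",
    )


if __name__ == "__main__":
    hp = LoraHyperparams()
    print(f"scaling = {hp.scaling}")
    print(f"params per 4096x4096 matrix = {adapter_params(4096, 4096, hp.r)}")
```

With the defaults, the update is scaled by 0.25, and a 4096 by 4096 projection (4096 is the hidden size of Mistral-7B) gains 524,288 trainable parameters, compared with roughly 16.8 million in the frozen matrix itself.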

The script saves a checkpoint to --output-dir every 100 steps (configurable with --save-steps) and keeps the three most recent. Training logs are printed to stdout.

To enable Flash Attention 2 for faster training on supported GPUs (such as the A100 and H100), install flash-attn and pass --use-flash-attention:

pip install flash-attn --no-build-isolation
python products/enterprise/scripts/finetune/qlora_train.py \
  --use-flash-attention \
  # ... other flags

Merging LoRA adapters

After training, the outputs/my-model-adapter/ directory contains only the small adapter weights, not a standalone model. Use merge_adapter.py to merge the adapters back into the base model weights:

python products/enterprise/scripts/finetune/merge_adapter.py \
  --base-model mistralai/Mistral-7B-v0.1 \
  --adapter-path outputs/my-model-adapter \
  --output-path outputs/my-model-merged

The merged model is saved to outputs/my-model-merged/ as a standard HuggingFace checkpoint that loads with AutoModelForCausalLM, with no PEFT dependency required at inference time. To publish the merged model directly to the HuggingFace Hub:

python products/enterprise/scripts/finetune/merge_adapter.py \
  --base-model mistralai/Mistral-7B-v0.1 \
  --adapter-path outputs/my-model-adapter \
  --output-path outputs/my-model-merged \
  --push-to-hub \
  --hub-repo-id your-username/my-fine-tuned-model

Make sure you are authenticated with huggingface-cli login before pushing.
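For readers curious about what a merge step involves, the sketch below shows one way to do it with the peft and transformers libraries. It is an approximation under stated assumptions, not the contents of merge_adapter.py: the function names are hypothetical, the base model is loaded in float16 rather than 4-bit so the merged weights are saved unquantized, and the adapter directory is recognized by the adapter_config.json file that PEFT writes.

```python
from pathlib import Path


def is_adapter_dir(path: str) -> bool:
    """True if the directory looks like a PEFT adapter (has adapter_config.json)."""
    return (Path(path) / "adapter_config.json").is_file()


def merge_adapter(base_model: str, adapter_path: str, output_path: str) -> None:
    """Fold LoRA adapter weights into the base model and save the result."""
    if not is_adapter_dir(adapter_path):
        raise ValueError(f"{adapter_path} does not contain adapter_config.json")

    # Heavy dependencies are imported lazily so the check above runs without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base weights in half precision (an assumption for this sketch).
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_path)

    # merge_and_unload folds the low-rank updates into the base weights and
    # returns a plain transformers model without PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_path)

    # Save the tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_path)
```

Loading the full-precision base model needs considerably more memory than 4-bit training did, so a merge may have to run on CPU or on a larger GPU.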

Using your fine-tuned model

Once your model is available, either locally or on the HuggingFace Hub, you can serve it through Shipfastai’s existing chat endpoint using a HuggingFace inference endpoint or a local vLLM or text-generation-inference server. Point the AI chat API at your model by setting the model field in your request. If your inference server exposes an OpenAI-compatible API, use the openai provider and override the base URL, either via an environment variable or by extending OpenAIProvider:

POST /api/ai/chat

{
  "provider": "openai",
  "model": "your-username/my-fine-tuned-model",
  "messages": [
    { "role": "user", "content": "How do I cancel my subscription?" }
  ]
}
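The same request can be sent from a script. The client below is a minimal sketch using only the standard library; the base URL is a placeholder for wherever your Shipfastai backend runs, authentication headers are omitted because they depend on your deployment, and the response is returned as parsed JSON without assuming its structure.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       provider: str = "openai") -> urllib.request.Request:
    """Build a POST request for /api/ai/chat with the payload shown above."""
    payload = {
        "provider": provider,
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/ai/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder URL: replace with the address of your own backend.
    req = build_chat_request(
        "http://localhost:8000",
        "your-username/my-fine-tuned-model",
        "How do I cancel my subscription?",
    )
    print(send_chat(req))
```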

For HuggingFace Inference Endpoints, use the endpoint URL as the OpenAI base URL, supply your HuggingFace access token as OPENAI_API_KEY, and set the model to your repository ID. Refer to the Add LLM Provider guide for instructions on creating a custom provider class if you need a dedicated integration.
