Shipfastai’s Enterprise tier includes a complete QLoRA fine-tuning pipeline underDocumentation Index
Fetch the complete documentation index at: https://docs.shipfastai.dev/llms.txt
Use this file to discover all available pages before exploring further.
products/enterprise/scripts/finetune/. QLoRA (Quantized Low-Rank Adaptation) lets you train a large language model on consumer or mid-range cloud GPUs by loading the base model in 4-bit precision and training only a small set of adapter weights. When training is done, you merge those adapters back into the base model and deploy the result as a standard HuggingFace model — which the built-in GeminiProvider or HuggingFace inference endpoints can then serve.
Prerequisites
Before running any training, make sure you have the following in place.GPU with 16 GB+ VRAM
A 7B parameter model requires roughly 10–14 GB of VRAM in 4-bit mode. An A100 40 GB, RTX 3090, or RTX 4090 all work well. Smaller models (1B–3B) fit on an RTX 3080.
Python 3.11+
The scripts use Python 3.11 type annotations. Check your version with
python --version and upgrade if needed.If you do not have a suitable local GPU, cloud GPU providers like RunPod and Lambda Labs offer hourly instances with A100s and H100s. Mount your dataset and output directory from persistent storage so checkpoints survive instance restarts.
| Package | Purpose |
|---|---|
transformers>=4.37.0 | Model loading and tokenization |
peft>=0.8.0 | LoRA adapter training with PEFT |
bitsandbytes>=0.42.0 | 4-bit quantization |
datasets>=2.16.0 | Dataset loading and preprocessing |
accelerate>=0.26.0 | Multi-GPU and mixed-precision training |
trl>=0.7.10 | Supervised fine-tuning utilities |
huggingface-hub>=0.20.0 | Pushing merged models to the Hub |
Preparing your dataset
The training script expects a JSONL file where each line is a JSON object with amessages key containing a list of chat turns. This is the standard chat-template format used by most instruction-tuned models:
data/train.jsonl
instruction, input, and output fields — use the prepare_data.py script to convert and split it:
data/processed/train.jsonl (90%) and data/processed/val.jsonl (10%), both in the messages chat format.
Running QLoRA training
Runqlora_train.py with your dataset and chosen base model. The default base model is mistralai/Mistral-7B-v0.1, but any HuggingFace causal LM works.
| Flag | Default | Description |
|---|---|---|
--model-name | mistralai/Mistral-7B-v0.1 | HuggingFace model ID or local path |
--num-epochs | 3 | Number of full passes over the training set |
--batch-size | 4 | Per-device training batch size |
--lora-r | 64 | LoRA rank — higher values capture more adaptation at the cost of memory |
--lora-alpha | 16 | LoRA scaling factor |
--lora-dropout | 0.1 | Dropout applied to LoRA layers |
--learning-rate | 2e-4 | AdamW learning rate |
--max-length | 2048 | Maximum token length per example |
--output-dir every 100 steps (configurable with --save-steps) and keeps the last three. Training logs are printed to stdout.
To enable Flash Attention 2 for faster training on supported GPUs (A100, H100):
Merging LoRA adapters
After training, theoutputs/my-model-adapter/ directory contains only the small adapter weights, not a standalone model. Use merge_adapter.py to merge the adapters back into the base model weights:
outputs/my-model-merged/ as a standard HuggingFace AutoModelForCausalLM — no PEFT dependency required at inference time.
To publish the merged model directly to the HuggingFace Hub:
huggingface-cli login before pushing.
Using your fine-tuned model
Once your model is available — either locally or on the HuggingFace Hub — you can serve it through Shipfastai’s existing chat endpoint using a HuggingFace inference endpoint or a localvllm / text-generation-inference server.
Point the AI chat API at your model by setting the model field in your request. If you are running a local inference server that exposes an OpenAI-compatible API, use the openai provider and override the base URL via an environment variable or by extending OpenAIProvider:
POST /api/ai/chat
OPENAI_API_KEY base URL and set the model to your repository ID. Refer to the Add LLM Provider guide for instructions on creating a custom provider class if you need a dedicated integration.