How I Got Gemma 4 e:2B to Talk Like a Pirate

A journal of fine-tuning Google's Gemma 4 e:2B model using LoRA on an Apple Silicon Mac Mini.

Hardware: Mac Mini M4 (4P+6E cores), 24GB unified RAM, 460GB SSD

Base model: Gemma 4 e:2B instruction-tuned, 4-bit quantized

Framework: Apple MLX (mlx-lm)

Goal: Make it talk like a pirate. Every response. Arr.

Environment Reconnaissance

Timestamp: 2026-04-06 10:04

Checked what we're working with:

Python: 3.13.12 via Homebrew — system Python 3.9.6 is too old for ML work
ML packages installed: None. Starting from scratch.
Disk space: 279GB free — plenty
HuggingFace token: None (turns out we didn't need one!)
ollama: Running with gemma4:e2b already pulled (7.2GB GGUF)

Key decision: MLX over PyTorch. I chose Apple's MLX framework (mlx-lm) instead of the traditional PyTorch + PEFT + bitsandbytes stack. Reasons:

MLX runs natively on Apple Silicon Metal GPU — no CUDA needed
Built-in LoRA fine-tuning with the mlx_lm lora command
bitsandbytes 4-bit quantization doesn't work on MPS anyway, which defeats the whole point of QLoRA on a Mac
Simpler pipeline with fewer dependencies
Unified memory means the CPU and GPU share the same 24GB pool, so nothing is copied between CPU and GPU RAM

Step 1: Python Environment Setup

Timestamp: 2026-04-06 10:04

python3.13 -m venv pirate-venv
source pirate-venv/bin/activate
pip install mlx-lm huggingface_hub

Installed: mlx 0.31.1, mlx-lm 0.31.1, mlx-metal 0.31.1, transformers 5.5.0, huggingface_hub 1.9.0, and dependencies. Total venv size: ~900MB.

Step 2: Model Selection

Timestamp: 2026-04-06 10:05

Searched HuggingFace for MLX-format Gemma 4 2B models. Found mlx-community/gemma-4-e2b-it-4bit — 15K downloads, ungated (Apache 2.0 license), no HF token needed. This was a relief since we had no token set up.

Downloaded: 3.4GB, took ~30 seconds.

Snag #1: mlx-lm didn't support Gemma 4

The released version (0.31.1) didn't have a gemma4 model module — only up to gemma3. Gemma 4 is brand new.

Fix: Installed from GitHub HEAD:

pip install 'mlx-lm @ git+https://github.com/ml-explore/mlx-lm.git'

This got us 0.31.2 which included gemma4.py and gemma4_text.py. Crisis averted.
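Before loading anything, it can help to check whether the installed mlx-lm build ships a Gemma 4 module at all. Here is a minimal sketch using only the standard library. It assumes the model definitions live under the mlx_lm.models package, and the helper names are hypothetical, not part of mlx-lm.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_module(dotted_name):
    """Return True if the dotted module name can be located."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return False


print("mlx-lm version:", installed_version("mlx-lm"))
# Assumption: model modules sit under mlx_lm.models.
print("gemma4 module found:", has_module("mlx_lm.models.gemma4"))
```

If the second line prints False, installing from GitHub HEAD as above is the thing to try.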

Base model test (pre-training sanity check)

Q: What is the capital of France?
A: The capital of France is **Paris**.

Normal, boring, not a pirate. Time to fix that.

Step 3: Creating the Pirate Dataset

Timestamp: 2026-04-06 10:05

Created 54 hand-crafted conversation examples covering diverse topics — all with pirate-speak responses. The dataset includes:

Greetings and casual chat
Science (photosynthesis, gravity, DNA, black holes)
Math (arithmetic, Pythagorean theorem, pi)
Cooking recipes
Technology (internet, AI, blockchain, programming)
History (Rome, Cleopatra)
Life advice and emotional support
Geography, animals, music, weather, sports
Creative writing
Short factual answers

Key design choices:

Every response uses pirate vocabulary: "arr", "matey", "ye", "me" (instead of "my"), "aye", "avast", "ahoy"
Nautical metaphors woven into explanations (gravity = invisible anchor, DNA = treasure map)
The model should still be helpful and accurate — just pirate-flavored
Mixed response lengths: some short ("Four, matey!"), some long (multi-paragraph recipes)
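A rough way to enforce the vocabulary rule is to scan every assistant turn for at least one pirate word. This is a sketch, not the check I actually ran: the function names are hypothetical, and "me" is left out of the word list because it also appears in ordinary English.

```python
import re

# Whole-word match on the pirate vocabulary; "arr+" also accepts "arrr".
PIRATE_PATTERN = re.compile(r"\b(arr+|matey|ye|aye|avast|ahoy)\b", re.IGNORECASE)


def pirate_words_in(text):
    """Return the distinct pirate words found in text, lowercased and sorted."""
    return sorted({match.lower() for match in PIRATE_PATTERN.findall(text)})


def flag_plain_responses(records):
    """Return indices of records whose assistant turns contain no pirate words."""
    flagged = []
    for index, record in enumerate(records):
        replies = [m["content"] for m in record["messages"] if m["role"] == "assistant"]
        if not any(pirate_words_in(reply) for reply in replies):
            flagged.append(index)
    return flagged


sample = [
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "Four, matey!"}]},
    {"messages": [{"role": "user", "content": "Capital of France?"},
                  {"role": "assistant", "content": "The capital of France is Paris."}]},
]
print(flag_plain_responses(sample))  # the second record has no pirate words
```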

Split: 43 train / 5 validation / 6 test (roughly 80/10/10)

Format: JSONL with messages array (chat format):

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
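The split script is not reproduced in this journal, so here is a minimal sketch of what one could look like. It assumes mlx-lm reads train.jsonl, valid.jsonl and test.jsonl from the data directory; the function names, the seed and the placeholder examples are all hypothetical.

```python
import json
import random
from pathlib import Path


def to_record(question, answer):
    """Wrap one question/answer pair in the chat 'messages' format."""
    return {"messages": [{"role": "user", "content": question},
                         {"role": "assistant", "content": answer}]}


def split_examples(records, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle, then cut into train/valid/test; test takes the remainder."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


def write_splits(splits, out_dir):
    """Write each split as <name>.jsonl, one JSON object per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")


# Placeholder data standing in for the 54 handwritten examples.
records = [to_record(f"question {i}", f"Arr, answer {i}, matey!") for i in range(54)]
splits = split_examples(records)
print({name: len(rows) for name, rows in splits.items()})
```

With 54 examples, flooring 80% and 10% gives 43 and 5, and the remaining 6 go to test. Calling write_splits(splits, "pirate-data") would then produce the three files.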

Step 4: LoRA Fine-Tuning

Timestamp: 2026-04-06 10:06:23 (start) → 10:07:54 (end)

Training time: 91 seconds

mlx_lm lora \
  --model mlx-community/gemma-4-e2b-it-4bit \
  --data pirate-data \
  --train --test \
  --batch-size 1 \
  --num-layers 8 \
  --iters 200 \
  --learning-rate 1e-4 \
  --steps-per-report 10 \
  --steps-per-eval 50 \
  --adapter-path pirate-adapter \
  --save-every 50 \
  --max-seq-length 2048 \
  --mask-prompt

Hyperparameter choices:

batch-size 1: Small dataset, no need for larger batches
num-layers 8: Fine-tune the last 8 transformer layers (of 35 total)
iters 200: Enough to learn the style without too much overfitting
learning-rate 1e-4: A common starting point for LoRA
mask-prompt: Only compute loss on the assistant response, not the user prompt — teaches the model what to say, not what questions look like
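To make the mask-prompt idea concrete, here is a toy sketch of the loss averaging. It is not mlx-lm's implementation, and the per-token losses are made up for illustration.

```python
def mean_loss(token_losses, prompt_len, mask_prompt):
    """Average per-token loss, optionally skipping the prompt tokens."""
    start = prompt_len if mask_prompt else 0
    counted = token_losses[start:]
    return sum(counted) / len(counted)


# Made-up losses: the first 3 tokens are the user prompt, the last 2 the reply.
token_losses = [2.0, 1.0, 3.0, 0.5, 0.5]

print(mean_loss(token_losses, prompt_len=3, mask_prompt=False))  # all 5 tokens
print(mean_loss(token_losses, prompt_len=3, mask_prompt=True))   # reply tokens only
```

With masking on, the gradient comes only from the reply tokens, so none of the training signal is spent on predicting the user's question.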

Trainable parameters: 3.6M out of 4,647M (0.078%) — that's the beauty of LoRA. We're training less than one-tenth of one percent of the model's parameters.
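The small count follows from how LoRA is built: a rank-r adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) parameters instead of d_in * d_out. The sketch below shows that arithmetic. The 2048 x 2048 layer and rank 8 are illustrative assumptions, not Gemma 4's real dimensions.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
full = 2048 * 2048
adapter = lora_params(2048, 2048, rank=8)
print(adapter, full, f"{100 * adapter / full:.2f}%")

# Fraction reported in the training log, recomputed from the rounded figures.
trainable_millions = 3.6
total_millions = 4647
percent = 100 * trainable_millions / total_millions
print(f"{percent:.3f}%")
```

With the rounded 3.6M this comes out at about 0.077%. The logged 0.078% presumably reflects the unrounded parameter count.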

Training curve

Iter	Train Loss	Val Loss	Notes
1	—	5.176	Starting point
10	4.352	—
20	2.699	—	Fast initial drop
30	2.474	—
50	1.742	2.577	Best val loss
100	0.408	3.130	Val rising — overfitting begins
150	0.199	3.386
200	0.098	3.264	Train near-zero, val plateaued

Peak memory: 4.0 GB — well within our 24GB. Throughput: ~270 tokens/sec.

Key observation: Classic overfitting pattern. Val loss bottomed at iter 50, then train loss kept dropping while val loss rose. This is expected with only 43 training examples — the model memorizes the training data. But for style transfer, some overfitting is actually fine — we want it to strongly adopt the pirate voice.
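If a less overfit adapter were the goal, the save-every 50 checkpoints make it possible to go back to the best one. The sketch below picks it from the validation losses in the table. The checkpoint filename pattern is an assumption about how mlx-lm names intermediate saves, so check the adapter directory before relying on it.

```python
# Validation losses copied from the training curve table.
val_loss_by_iter = {1: 5.176, 50: 2.577, 100: 3.130, 150: 3.386, 200: 3.264}

# Iterations at which --save-every 50 would have written a checkpoint.
saved_iters = [50, 100, 150, 200]

best_iter = min(saved_iters, key=val_loss_by_iter.__getitem__)

# Assumed naming scheme: zero-padded iteration number plus "_adapters.safetensors".
checkpoint = f"pirate-adapter/{best_iter:07d}_adapters.safetensors"
print(best_iter, checkpoint)
```

For this run I kept the final iter-200 adapter, since a strong pirate voice mattered more than validation loss.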

Test loss: 3.419 (perplexity 30.6)
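Perplexity is just the exponential of the cross-entropy loss, so the two numbers can be checked against each other:

```python
import math

test_loss = 3.419
perplexity = math.exp(test_loss)
print(round(perplexity, 1))  # about 30.5
```

Recomputing from the rounded loss gives about 30.5. The logged 30.6 was presumably computed from the unrounded value.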

Adapter size: 70MB (stored as pirate-adapter/adapters.safetensors)

Step 5: Testing the Pirate Model

Timestamp: 2026-04-06 10:08

Without sampling parameters (greedy decoding)

Results were mixed:

Great: "The capital of France be the city of Paris" ✓
Great: "Arr, gather 'round and I'll tell ye about gravity..." ✓
Bad: The joke response degenerated into "Har har har har har har..." repeated hundreds of times

This repetition is a classic issue with small models + small datasets + greedy decoding.

With temperature + min_p sampling

Added temp=0.7, min_p=0.05:

Q: Tell me a joke.
A: Arr, here be one fer ye! Why couldn't the pirate play cards?
   Because he was standin' on the deck! Har har har! ...Aye, me
   jokes be terrible, but at least they be free, arr!

Q: What is the capital of France?
A: The capital of France be the city of **Paris**. She is the
   capital and largest city of France, matey!

Q: What is blockchain?
A: Avast! Blockchain be like a ship's log that everyone on the
   crew has a copy of, and nobody can change! [full accurate
   explanation with pirate metaphors]

Q: How do I make pasta?
A: Arr, pasta making be the mark of a true cook! [detailed recipe
   with "The pasta water should taste like the sea" — nice touch]

The sampling parameters completely fixed the repetition issue. Temperature 0.7 adds just enough randomness to prevent the model from getting stuck in loops, while min_p keeps it from going off the rails.
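To show what the two knobs do, here is a toy min_p sampler in plain Python. It is a sketch of the general technique, not mlx-lm's implementation: the order of operations (temperature first, then the min_p cut) and the toy logits are my assumptions.

```python
import math
import random


def min_p_distribution(logits, temp=0.7, min_p=0.05):
    """Softmax with temperature, then drop tokens below min_p * top probability."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(weights)
    probs = [w / total for w in weights]
    cutoff = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}  # renormalise the survivors


def sample_token(logits, temp=0.7, min_p=0.05, rng=None):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    dist = min_p_distribution(logits, temp, min_p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


toy_logits = [5.0, 4.0, 0.0, -3.0]  # made-up scores for four tokens
dist = min_p_distribution(toy_logits)
print(sorted(dist))  # only the two strongest tokens survive the cut
print(sample_token(toy_logits, rng=random.Random(0)))
```

Greedy decoding would pick token 0 every time. Here token 1 still gets picked some of the time, which is what breaks loops, and the long tail is cut off entirely.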

Verdict: It works! The model consistently uses pirate vocabulary, nautical metaphors, and maintains the pirate character while still being helpful and accurate. The style transfer is thorough — even unseen topics get the full pirate treatment.

Step 6: Attempted GGUF Export for Ollama

Timestamp: 2026-04-06 10:10–10:20

This is where things got bumpy. I wanted to export the model to GGUF format so it could run in ollama like any other model.

Attempt 1: Python `gguf` package + `convert_hf_to_gguf.py`

Installed gguf 0.18.0 and llama-cpp-python
convert_hf_to_gguf.py needed PyTorch → installed CPU-only torch
Failed: AttributeError: MODEL_ARCH has no attribute 'GEMMA4' — the gguf library doesn't know about Gemma 4 yet, even from the latest llama.cpp main branch

Attempt 2: `ollama create` from safetensors

Fused the LoRA adapter into the base model (2.5GB MLX safetensors)
Dequantized from 4-bit to bf16 (8.7GB) since ollama needs full-precision weights for its own quantization
ollama create pirategemma -f Modelfile — it worked! Ollama accepted the model and created a 9.3GB GGUF
But: Running it crashed with a nil pointer dereference in Embedding.Forward

The crash is likely because the MLX model only has the text decoder, not the full multimodal stack (vision + audio encoders) that ollama's Gemma 4 implementation expects. The weights converted fine, but at inference time it tried to use a nonexistent embedding layer.
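One way to test that guess is to list the tensor names in the fused checkpoint and see whether any vision or audio weights are present. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so the standard library is enough. The marker substrings below are hypothetical, since I have not confirmed how Gemma 4 names its encoders.

```python
import json
import struct
from collections import Counter


def tensor_names(path):
    """Read tensor names from a safetensors header without loading any weights."""
    with open(path, "rb") as handle:
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len))
    return [name for name in header if name != "__metadata__"]


def count_by_component(names, markers=("vision", "audio", "language_model")):
    """Count tensors per marker substring (hypothetical names); rest are 'other'."""
    counts = Counter()
    for name in names:
        label = next((marker for marker in markers if marker in name), "other")
        counts[label] += 1
    return dict(counts)
```

Running count_by_component(tensor_names(path)) on each safetensors shard of the fused model would show whether any "vision" or "audio" entries exist. Finding none would fit the explanation above, but it would not prove it.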

Resolution

Cleaned up the failed intermediate files (~11.2GB) and accepted that GGUF export isn't ready for Gemma 4 yet. The model runs great through MLX. Created run-pirate.sh as a simple interactive chat script.
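The file sizes in this step can be sanity-checked with a little arithmetic. The sketch assumes the 4,647M parameter count from the training log covers everything that was dequantized, and that bf16 takes 2 bytes per parameter.

```python
# Intermediate files that were cleaned up, sizes as reported above (GB).
fused_4bit_gb = 2.5
dequantized_bf16_gb = 8.7
cleanup_total = round(fused_4bit_gb + dequantized_bf16_gb, 1)
print(cleanup_total)

# Expected bf16 size from the parameter count (assumption: 2 bytes per parameter).
total_params = 4_647_000_000
bf16_bytes = total_params * 2
size_gb = round(bf16_bytes / 1e9, 2)     # decimal gigabytes
size_gib = round(bf16_bytes / 2**30, 2)  # binary gibibytes
print(size_gb, size_gib)
```

The estimate comes out near 9.3 in decimal GB and near 8.7 in GiB. That would be consistent with the two reported sizes being the same weights in different units, though I have not verified which unit each tool reports.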

Artifacts

File	Size	Purpose
`pirate-adapter/adapters.safetensors`	70MB	The LoRA adapter weights (the magic)
`pirate-adapter/adapter_config.json`	<1KB	LoRA configuration
`pirate-data/`	60KB	Training dataset (54 examples)
`pirate-venv/`	900MB	Python venv with MLX + dependencies
`run-pirate.sh`	1KB	Interactive chat script
`~/.cache/huggingface/.../gemma-4-e2b-it-4bit/`	3.4GB	Base model (HF cache)

Total disk used: ~4.4GB (mostly the base model in HF cache)

What I Learned

Things that went well

MLX is excellent for Apple Silicon fine-tuning. The entire training took 91 seconds, used only 4GB of memory, and required minimal configuration. The mlx_lm lora CLI is beautifully simple.
LoRA is remarkably efficient. Training 0.078% of parameters was enough for a complete style transformation. The adapter is only 70MB.
Small datasets work for style transfer. 43 training examples was enough to thoroughly pirate-ify the model. Style is more about how things are said than what is said, so the model's existing knowledge stays intact.
The mlx-community model being ungated saved a lot of friction. No HuggingFace token, no license agreement clicking.

Things that went wrong

Gemma 4 is too new. Both mlx-lm (released version) and the gguf Python library didn't support it. Had to install mlx-lm from GitHub HEAD.
GGUF export is broken for Gemma 4. Neither the llama.cpp converter nor ollama's import could produce a working GGUF. The gguf library lacks the architecture definition, and the GGUF that ollama created crashes at inference, most likely because the MLX checkpoint lacks the multimodal encoders that ollama expects.
Greedy decoding + small dataset = repetition loops. The joke response degenerated into infinite "har har har" without sampling parameters. Solved with temp=0.7 + min_p=0.05.
Overfitting was immediate. Val loss bottomed at iter 50 of 200. More data or regularization (dropout, weight decay) would help if we wanted a more generalizable model.

If I did it again

More diverse training data — 100-200 examples would reduce overfitting
Include multi-turn conversations — current data is all single-turn
Try DPO/preference training — "pirate response" vs "normal response" pairs might give cleaner results
Wait for GGUF tooling to catch up to Gemma 4, at which point export to ollama should just work
Experiment with num-layers — 8 worked fine, but trying 4 or 16 might be interesting

The numbers

Setup time: ~2 minutes (venv creation, pip install, model download)
Dataset creation: Handwritten, but a script generates the JSONL splits
Training time: 91 seconds for 200 iterations
Total end-to-end: ~15 minutes including all troubleshooting
Cost: $0 (all local, open-source model, open-source tools)

How to Run It If You Followed These Steps

Save this file as run-pirate.sh (replace the cd path with your own) and make it executable with chmod +x run-pirate.sh

#!/bin/bash
# Run the pirate Gemma model interactively
cd /Users/robertviragh/pirate
source pirate-venv/bin/activate

python3 -c "
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

print('Loading pirate Gemma 4 e:2B...')
model, tokenizer = load('mlx-community/gemma-4-e2b-it-4bit', adapter_path='pirate-adapter')
sampler = make_sampler(temp=0.7, min_p=0.05)
print('Ready! Type your message (Ctrl+C to quit).\n')

while True:
    try:
        user_input = input('You: ')
        if not user_input.strip():
            continue
        chat = [{'role':'user','content': user_input}]
        prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
        result = generate(model, tokenizer, prompt=prompt, max_tokens=300, sampler=sampler)
        print(f'\nPirate: {result}\n')
    except (KeyboardInterrupt, EOFError):  # Ctrl+C or Ctrl+D both exit cleanly
        print('\nFair winds, matey!')
        break
"

Then run:

# Interactive chat
./run-pirate.sh

Or manually:

source pirate-venv/bin/activate
python3 -c "
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load('mlx-community/gemma-4-e2b-it-4bit', adapter_path='pirate-adapter')
sampler = make_sampler(temp=0.7, min_p=0.05)
chat = [{'role':'user','content':'Tell me about black holes'}]
prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=prompt, max_tokens=300, sampler=sampler))
"

Conclusion

We've now fine-tuned Gemma 4 on our own hardware.

Written April 6, 2026. Arr.