Fine-tuning Mistral to clone your Telegram persona
What this is
The question was simple: can I train a model to sound like me? Not a generic assistant, not a role-played persona — an actual model fine-tuned on how I write, the words I use, the rhythm of my messages.
The raw material was sitting in Telegram. Years of conversations, patterns, vocabulary. You2AgentAI is the pipeline I built to turn that into a fine-tuned Mistral 7B model.
Architecture
The project is structured as four sequential stages, each in its own directory with dedicated scripts:
You2AgentAI/
├── 1_Dataset/ → Export, filter, and format Telegram messages
├── 2_Tokenizer/ → Tokenize into model-ready format
├── 3_FineTuning/ → LoRA fine-tuning with Hugging Face Transformers
├── 4_Testing_agent/ → Interactive agent with GPU inference
└── config.ini → Central configuration
The key design decision: one config file (config.ini) controls everything — model name, dataset paths, LoRA hyperparameters, max sequence length. No scattered hardcoded values.
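Reading that central config is a one-liner with Python's standard library. A minimal sketch, assuming section and key names that mirror the snippets later in this post (the real config.ini layout may differ):

```python
import configparser

# Miniature config.ini matching the sections shown in this post
ini_text = """
[lora]
r = 16
lora_alpha = 32
target_modules = q_proj,v_proj

[training]
learning_rate = 2e-4
"""

config = configparser.ConfigParser()
config.read_string(ini_text)  # the real pipeline would use config.read("config.ini")

r = config.getint("lora", "r")
lr = config.getfloat("training", "learning_rate")
modules = config.get("lora", "target_modules").split(",")
```

Every stage parses the same file, so a hyperparameter change propagates everywhere without touching code.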
Step by step
1. Dataset preparation
Telegram lets you export your chat history as JSON. The dataset stage reads that JSON, filters messages by sender (only yours), cleans noise, and builds a JSONL file of prompt/response pairs.
The format I used for training pairs:
{"instruction": "<previous message in conversation>", "response": "<your reply>"}
The filter logic is worth mentioning: very short messages (under 3 words), pure media, and forwarded content are excluded — they add noise without signal.
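The filter plus pairing logic can be sketched as a single pass over the exported messages. Field names (`from`, `text`, `forwarded_from`) follow Telegram's JSON export as I understand it, and the helper names are illustrative, not the project's actual functions:

```python
import json

def keep_message(msg: dict) -> bool:
    """Illustrative filter: drop short, media-only, and forwarded messages."""
    text = msg.get("text", "")
    if not isinstance(text, str):   # media or rich-entity messages
        return False
    if "forwarded_from" in msg:     # forwarded content
        return False
    return len(text.split()) >= 3   # at least 3 words

def build_pairs(messages: list, me: str) -> list:
    """Pair each of my replies with the message that preceded it."""
    pairs = []
    for prev, cur in zip(messages, messages[1:]):
        if cur.get("from") == me and keep_message(cur) and keep_message(prev):
            pairs.append({"instruction": prev["text"], "response": cur["text"]})
    return pairs

msgs = [
    {"from": "friend", "text": "are you coming to the meetup tonight"},
    {"from": "me", "text": "yeah I will be there around eight"},
    {"from": "me", "text": "ok"},  # under 3 words: dropped
]
pairs = build_pairs(msgs, me="me")
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one pair per JSONL line
```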
2. Tokenization
The tokenizer stage converts the JSONL into the token format Mistral expects, applying the chat template:
from transformers import AutoTokenizer

# model_name is read from config.ini
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token, so reuse EOS
Output: a serialized dataset Hugging Face’s Trainer can consume directly.
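What the chat template actually produces is easiest to see as a pure-string sketch. The tokens below follow Mistral's instruct format as I understand it; the real pipeline should rely on tokenizer.apply_chat_template rather than hand-rolled strings:

```python
def format_pair(instruction: str, response: str) -> str:
    """Hand-rolled sketch of Mistral's instruct template for one training pair."""
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

sample = format_pair("previous message in conversation", "your reply")
```

Each formatted string is then tokenized and serialized, which is what the Trainer consumes.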
3. Fine-tuning with LoRA
This is the core. Instead of full fine-tuning (which would require 40GB+ of VRAM), LoRA (Low-Rank Adaptation) injects small trainable matrices into the attention layers. The base model weights stay frozen.
Key config values from config.ini:
[lora]
r = 16
lora_alpha = 32
lora_dropout = 0.05
target_modules = q_proj,v_proj
[training]
num_train_epochs = 3
per_device_train_batch_size = 4
learning_rate = 2e-4
r = 16 is a solid middle ground — a higher rank captures more of the target style, but increases VRAM use and training time. With this config, training on a single consumer GPU (RTX 3090, 24GB) takes ~2 hours for a dataset of ~10k pairs.
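Wired into the PEFT and Transformers APIs, the config values above map roughly to the following. A sketch, not the project's actual script, and exact argument support varies across library versions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Mirrors the [lora] section of config.ini
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Mirrors the [training] section
training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)
```

Both objects are then handed to the trainer, which freezes the base weights and optimizes only the injected low-rank matrices.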
4. Testing the agent
The final stage loads the merged model (base + LoRA adapters) and spins up an interactive CLI chat:
conda activate You2AgentAI
python 4_Testing_agent/agent.py
The agent runs inference locally using bitsandbytes 4-bit quantization, so it fits within 16GB VRAM.
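The 4-bit load boils down to a quantization config passed at model-load time. A sketch assuming transformers with the bitsandbytes backend installed; the model path is a placeholder for the merged checkpoint:

```python
import torch
from transformers import BitsAndBytesConfig

# Quantization config for 4-bit NF4 inference via the bitsandbytes backend
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Passed at load time, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     merged_model_path, quantization_config=bnb_config, device_map="auto"
# )
```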
What I learned
What worked well:
- LoRA is genuinely practical for personal fine-tuning — the compute requirements are attainable on consumer hardware
- The Hugging Face SFTTrainer from trl handles the chat template formatting cleanly
- Keeping all hyperparameters in config.ini makes iteration fast — you change one value, re-run, compare
What didn’t work / surprises:
- The first version had no filtering on message length. The model learned to reply with single characters and emoji. Minimum length filtering fixed it completely
- Overfitting is fast with personal data. Three epochs was already too much for small datasets — the model started memorizing exact phrases instead of style
- Evaluation is hard. There’s no automatic metric for “does this sound like me?” — you just read outputs and decide