Fine-tuning Mistral to clone your Telegram persona
What this is
The question was simple: can I train a model to sound like me? Not a generic assistant, not a role-played persona — an actual model fine-tuned on how I write, the words I use, the rhythm of my messages.
The raw material was sitting in Telegram. Years of conversations, patterns, vocabulary. You2AgentAI is the pipeline I built to turn that into a fine-tuned Mistral 7B model.
Architecture
The project is structured as four sequential stages, each in its own directory with dedicated scripts:
You2AgentAI/
├── 1_Dataset/ → Export, filter, and format Telegram messages
├── 2_Tokenizer/ → Tokenize into model-ready format
├── 3_FineTuning/ → LoRA fine-tuning with Hugging Face Transformers
├── 4_Testing_agent/ → Interactive agent with GPU inference
└── config.ini → Central configuration
The key design decision: one config file (config.ini) controls everything — model name, dataset paths, LoRA hyperparameters, max sequence length. No scattered hardcoded values.
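Reading that central config is a one-liner with Python's standard library. A minimal sketch, assuming section and key names that mirror the snippets later in this post (the real config.ini layout may differ):

```python
import configparser

# Miniature config.ini matching the sections shown in this post
ini_text = """
[lora]
r = 16
lora_alpha = 32
target_modules = q_proj,v_proj

[training]
learning_rate = 2e-4
"""

config = configparser.ConfigParser()
config.read_string(ini_text)  # the real pipeline would use config.read("config.ini")

r = config.getint("lora", "r")
lr = config.getfloat("training", "learning_rate")
modules = config.get("lora", "target_modules").split(",")
```

Every stage parses the same file, so a hyperparameter change propagates everywhere without touching code.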
Step by step
1. Dataset preparation
Telegram lets you export your chat history as JSON. The dataset stage reads that JSON, filters messages by sender (only yours), cleans noise, and builds a JSONL file of prompt/response pairs.
The format I used for training pairs:
{"instruction": "<previous message in conversation>", "response": "<your reply>"}
The filter logic is worth mentioning: very short messages (under 3 words), pure media, and forwarded content are excluded — they add noise without signal.
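The filter plus pairing logic can be sketched as a single pass over the exported messages. Field names (`from`, `text`, `forwarded_from`) follow Telegram's JSON export as I understand it, and the helper names are illustrative, not the project's actual functions:

```python
import json

def keep_message(msg: dict) -> bool:
    """Illustrative filter: drop short, media-only, and forwarded messages."""
    text = msg.get("text", "")
    if not isinstance(text, str):   # media or rich-entity messages
        return False
    if "forwarded_from" in msg:     # forwarded content
        return False
    return len(text.split()) >= 3   # at least 3 words

def build_pairs(messages: list, me: str) -> list:
    """Pair each of my replies with the message that preceded it."""
    pairs = []
    for prev, cur in zip(messages, messages[1:]):
        if cur.get("from") == me and keep_message(cur) and keep_message(prev):
            pairs.append({"instruction": prev["text"], "response": cur["text"]})
    return pairs

msgs = [
    {"from": "friend", "text": "are you coming to the meetup tonight"},
    {"from": "me", "text": "yeah I will be there around eight"},
    {"from": "me", "text": "ok"},  # under 3 words: dropped
]
pairs = build_pairs(msgs, me="me")
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one pair per JSONL line
```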
2. Tokenization
The tokenizer stage converts the JSONL into the token format Mistral expects, applying the chat template:
from transformers import AutoTokenizer

# model_name is read from config.ini
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token, so reuse EOS
Output: a serialized dataset Hugging Face’s Trainer can consume directly.
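What the chat template actually produces is easiest to see as a pure-string sketch. The tokens below follow Mistral's instruct format as I understand it; the real pipeline should rely on tokenizer.apply_chat_template rather than hand-rolled strings:

```python
def format_pair(instruction: str, response: str) -> str:
    """Hand-rolled sketch of Mistral's instruct template for one training pair."""
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

sample = format_pair("previous message in conversation", "your reply")
```

Each formatted string is then tokenized and serialized, which is what the Trainer consumes.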
3. Fine-tuning with LoRA
This is the core. Instead of full fine-tuning (which would require 40GB+ of VRAM), LoRA (Low-Rank Adaptation) injects small trainable matrices into the attention layers. The base model weights stay frozen.
Key config values from config.ini:
[lora]
r = 16
lora_alpha = 32
lora_dropout = 0.05
target_modules = q_proj,v_proj
[training]
num_train_epochs = 3
per_device_train_batch_size = 4
learning_rate = 2e-4
r = 16 is a solid middle ground — a higher rank captures more of the target style, but increases VRAM use and training time. With this config, training on a single consumer GPU (RTX 3090, 24GB) takes ~2 hours for a dataset of ~10k pairs.
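Wired into the PEFT and Transformers APIs, the config values above map roughly to the following. A sketch, not the project's actual script, and exact argument support varies across library versions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Mirrors the [lora] section of config.ini
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Mirrors the [training] section
training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)
```

Both objects are then handed to the trainer, which freezes the base weights and optimizes only the injected low-rank matrices.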
4. Testing the agent
The final stage loads the merged model (base + LoRA adapters) and spins up an interactive CLI chat:
conda activate You2AgentAI
python 4_Testing_agent/agent.py
The agent runs inference locally using bitsandbytes 4-bit quantization, so it fits within 16GB VRAM.
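The 4-bit load boils down to a quantization config passed at model-load time. A sketch assuming transformers with the bitsandbytes backend installed; the model path is a placeholder for the merged checkpoint:

```python
import torch
from transformers import BitsAndBytesConfig

# Quantization config for 4-bit NF4 inference via the bitsandbytes backend
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Passed at load time, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     merged_model_path, quantization_config=bnb_config, device_map="auto"
# )
```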
What I learned
What worked well:
- LoRA is genuinely practical for personal fine-tuning — the compute requirements are attainable on consumer hardware
- The Hugging Face SFTTrainer from trl handles the chat template formatting cleanly
- Keeping all hyperparameters in config.ini makes iteration fast — you change one value, re-run, compare
What didn’t work / surprises:
- The first version had no filtering on message length. The model learned to reply with single characters and emoji. Minimum length filtering fixed it completely
- Overfitting is fast with personal data. Three epochs was already too much for small datasets — the model started memorizing exact phrases instead of style
- Evaluation is hard. There’s no automatic metric for “does this sound like me?” — you just read outputs and decide