Do neural networks really resemble the structure of the brain?
- data: text sequence
- tokenization
- converts text into numbers; byte pair encoding (BPE) iteratively merges the most frequent adjacent pairs
- keeps the vocabulary size manageable (≈ 100k tokens) and generalizes better, since subwords can represent unknown words (see the BPE sketch below)
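A minimal sketch of the BPE merge loop described above, assuming a toy word-frequency corpus and a small fixed number of merges (both hypothetical; real tokenizers operate on bytes and learn tens of thousands of merges):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
num_merges = 10  # hypothetical; real vocabularies require far more merges

merges = []
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    corpus = merge_pair(corpus, best)  # merge it everywhere in the corpus
    merges.append(best)

print(merges)  # learned merge rules, e.g. ('e', 'r') first
```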
- transformer
- input: a sequence of tokens, combined with the model's parameters (≈ hundreds of billions of weights)
- processes them through one large, fixed mathematical function (sketch below)
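A heavily simplified sketch of that "complex math function", assuming a single attention block with random toy parameters (real models stack many such blocks plus MLPs, normalization, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 1000, 64, 8  # toy sizes; real models are far larger

# "Parameters": random here, learned during training in a real model.
W_embed = rng.normal(0, 0.02, (vocab_size, d_model))
W_q = rng.normal(0, 0.02, (d_model, d_model))
W_k = rng.normal(0, 0.02, (d_model, d_model))
W_v = rng.normal(0, 0.02, (d_model, d_model))
W_out = rng.normal(0, 0.02, (d_model, vocab_size))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_step(token_ids):
    """One attention block, heavily simplified: tokens + parameters -> next-token distribution."""
    x = W_embed[token_ids]                     # (seq, d_model) token embeddings
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)        # attention scores between positions
    mask = np.triu(np.ones((len(token_ids), len(token_ids))), k=1).astype(bool)
    scores[mask] = -1e9                        # causal mask: no peeking at future tokens
    x = softmax(scores) @ v                    # mix information across positions
    logits = x[-1] @ W_out                     # last position predicts the next token
    return softmax(logits)                     # probability over the whole vocabulary

probs = transformer_step(rng.integers(0, vocab_size, seq_len))
print(probs.shape, probs.sum())                # (1000,) 1.0
```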
- pretraining
- the model outputs a probability distribution over the next token
- adjust the weights so the correct next token gets higher probability (training-step sketch below)
- result: a base model that is generative, not conversational
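A minimal sketch of one such training step, assuming a tiny PyTorch stand-in model (hypothetical sizes; the principle is just next-token cross-entropy):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64  # toy sizes

# Stand-in for the transformer: embedding -> linear head over the vocabulary.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (4, 33))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token

logits = model(inputs)                                   # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # predicted distribution
                       targets.reshape(-1))              # vs. the actual next token
loss.backward()                                          # gradients: how to adjust each weight
optimizer.step()                                         # nudge weights toward better probabilities
optimizer.zero_grad()
print(loss.item())
```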
- post-training (SFT)
- train on prompt-response token sequences so the model learns to hold conversations (chat-template sketch below)
- limitation: the model imitates a fixed set of human-written responses, which is rigid and limited in coverage
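A sketch of how a conversation might be flattened into one token sequence for this stage, with hypothetical special markers (real systems use their own chat templates and token IDs):

```python
# Hypothetical special markers delimiting conversation turns.
BOS, USER, ASSISTANT, EOT = "<|bos|>", "<|user|>", "<|assistant|>", "<|end|>"

def render_chat(turns):
    """Flatten (role, text) turns into one training string; loss is usually applied
    only to the assistant spans so the model learns to answer, not to ask."""
    parts = [BOS]
    for role, text in turns:
        marker = USER if role == "user" else ASSISTANT
        parts.append(f"{marker}{text}{EOT}")
    return "".join(parts)

conversation = [
    ("user", "Why is the sky blue?"),
    ("assistant", "Shorter (blue) wavelengths scatter more in the atmosphere."),
]
print(render_chat(conversation))
# The rendered string is then tokenized and trained on with the same
# next-token objective as pretraining, just on conversational data.
```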
- RLHF
- a separate reward model is trained on human rankings of responses, so it learns human preferences (ranking-loss sketch below)
- compared with pure imitation: more scalable, consistent, and adaptable
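A sketch of the reward-model training objective, assuming a pairwise (chosen vs. rejected) ranking loss on a toy model (real reward models are full transformers; the sizes and pooling here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64  # toy sizes

class RewardModel(nn.Module):
    """Maps a (prompt + response) token sequence to a single scalar reward."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens):
        hidden = self.embed(tokens).mean(dim=1)  # crude pooling over the sequence
        return self.head(hidden).squeeze(-1)     # (batch,) scalar rewards

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

# Each pair: the response a human ranked higher vs. the one ranked lower.
chosen = torch.randint(0, vocab_size, (4, 32))
rejected = torch.randint(0, vocab_size, (4, 32))

r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)
# Pairwise ranking loss: push the chosen reward above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
print(loss.item())
```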
- hallucination
- the model simulates answers rather than recalling stored facts (its confidence is simulated too, so it can sound certain while being wrong)
- mitigation: train it to admit ignorance on questions it does not actually know (data-construction sketch below)
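A sketch of how such "admit ignorance" training data might be built, under the assumption that we probe the model on factual questions and add refusal examples where it is unreliable (the `sample_model_answers` helper is a hypothetical stand-in for querying the model):

```python
import random

def sample_model_answers(question, n=5):
    """Hypothetical stand-in for sampling the model several times; a real pipeline
    would query the LLM being trained and collect its answers."""
    return [random.choice(["Answer A", "Answer B", "I'm not sure"]) for _ in range(n)]

def build_ignorance_examples(qa_pairs, threshold=0.5):
    """Keep an 'I don't know' example whenever the model rarely hits the known answer."""
    examples = []
    for question, true_answer in qa_pairs:
        answers = sample_model_answers(question)
        accuracy = sum(a == true_answer for a in answers) / len(answers)
        if accuracy < threshold:  # the model does not reliably know this fact
            examples.append({"prompt": question,
                             "response": "I don't know the answer to that."})
    return examples

qa_pairs = [("Who won the 1903 Tour de France?", "Maurice Garin")]
print(build_ignorance_examples(qa_pairs))
```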