Some Study Notes on Large Language Models (LLMs)

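Preface:

My earlier study of large models was rushed and results-driven, leaning toward application ( ~~misuse, just kidding~~ ). Looking back, my understanding of LLMs themselves was too shallow, so I'm opening this small series to fill in the underlying principles.

~~BUPT has way too many damn course projects~~ . I'll fill this in at irregular intervals.

References: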

【1】Sebastian Raschka (US); trans. Ye Wentao (叶文滔). 大模型技术30讲 [M]. Beijing: Posts & Telecom Press (人民邮电出版社), 2024.
【2】Raschka S. Build a Large Language Model (From Scratch)[M]. Shelter Island, NY: Manning, 2024.

1. Understanding LLMs

1.1 What is an LLM?

Definition 1.1: Large Language Model

An LLM is a neural network designed to understand, generate, and respond to human-like text;

“Large” refers both to the size of the dataset the model is trained on and to the number of parameters (the model's size);
- parameters are the adjustable weights in the network that are optimized during training to predict the next word in a sequence;
LLMs are built on the Transformer architecture, which enables them to pay selective attention to different parts of the input when making predictions;
Because they generate text, LLMs are often described as a form of generative AI (GenAI);
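
To make "parameters" a bit more concrete, here is a minimal sketch that counts the adjustable weights of a toy fully connected network. The function name `count_parameters` and the layer sizes are made up for illustration. A real LLM is built from Transformer layers and is vastly larger, but "model size = number of trainable numbers" is the same idea.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix connecting the two layers
        total += n_out         # one bias per output unit
    return total


# Hypothetical toy network: 4 inputs, 8 hidden units, 3 outputs.
toy_sizes = [4, 8, 3]
toy_params = count_parameters(toy_sizes)
print(toy_params)
```

Every one of these numbers is adjusted during training; scaling the layer widths up is what makes a model "large".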

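The two techniques involved, machine learning and deep learning, will be explained in detail later. (The latter is a subset of the former; the main difference is whether features have to be extracted by hand. That said, the classic machine learning algorithms seem to be falling out of favor, which makes me a little wistful.)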

1.2 Applications of LLMs

Not much needs saying here: everyone who has used one says it's great.

1.3 Stages of building and using LLMs

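Pretraining: training on a huge dataset that spans many domains, typically raw unlabeled text;
Finetuning: further training the pretrained LLM on a labeled, domain-specific dataset for a particular task;
- Instruction Finetuning: put simply, labeled data = instruction + correct answer;
- Classification Finetuning: labeled data = text + class label;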
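
As a rough illustration of the two kinds of labeled data, here is a small sketch. The records, the field names (`instruction`, `input`, `output`, `text`, `label`), and the helper functions are all hypothetical; they are not a standard schema, just one plausible way to lay the data out.

```python
# Hypothetical records; the field names are illustrative only.
instruction_example = {
    "instruction": "Translate the sentence into German.",
    "input": "Good morning.",
    "output": "Guten Morgen.",
}

classification_example = {
    "text": "You have won a free prize, click here!",
    "label": "spam",
}


def format_instruction(record):
    """Turn an instruction record into a (prompt, target text) pair."""
    prompt = f"{record['instruction']}\n{record['input']}"
    return prompt, record["output"]


def format_classification(record, label_to_id):
    """Turn a classification record into a (text, class id) pair."""
    return record["text"], label_to_id[record["label"]]


label_to_id = {"not spam": 0, "spam": 1}
print(format_instruction(instruction_example))
print(format_classification(classification_example, label_to_id))
```

The key difference is the target: instruction finetuning trains the model to produce free-form text, while classification finetuning trains it to pick one label from a fixed set.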

The figure below shows the specific differences between the two:

1.4 Introducing the transformer architecture

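The two key components of the Transformer are the Encoder and the Decoder: the encoder encodes the input text into vectors, and the decoder decodes those vectors and generates the corresponding text. Unlike traditional fully connected or convolutional designs, both the encoder and the decoder consist of many layers connected by a self-attention mechanism, which lets the model weigh the importance of the other words in a sequence relative to each word;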
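
To get a feel for self-attention, here is a stripped-down sketch in plain Python. It is a simplification under a stated assumption: the queries, keys, and values are just the raw input vectors, with no learned projection matrices and no multiple heads, so it is not how a real Transformer layer is implemented. The toy embeddings are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention: each output is a weighted mix of all inputs.

    Assumption: queries, keys, and values are the raw input vectors
    (no learned projections), which keeps the sketch short.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product of this token with every token (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)  # attention weights, summing to 1
        # Context vector: attention-weighted sum of the value vectors.
        context = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(context)
    return outputs


# Made-up 2-dimensional embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts = self_attention(tokens)
for row in contexts:
    print([round(x, 3) for x in row])
```

Each output vector blends information from every token in the sequence, with similar tokens receiving larger weights; that is the "selective attention" part.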

Building on the Transformer, two different architectures emerged: BERT (Bidirectional Encoder Representations from Transformers), which uses the encoder side, and GPT (Generative Pretrained Transformers), which uses the decoder side;

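As the full names suggest, the two models target different tasks. BERT is trained mainly to predict masked or hidden words, which makes it well suited to text classification tasks, whereas GPT is trained to generate text by predicting the next word. The figure below illustrates how BERT differs from GPT models: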
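
The same contrast can also be sketched in code by building training examples for each objective. This is only a toy: the helper names are hypothetical, words are split on whitespace instead of using a real tokenizer, and the masked positions are chosen by hand, whereas BERT-style training picks them at random.

```python
def next_word_pairs(words):
    """GPT-style objective: given the words so far, predict the next one."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_word_example(words, mask_positions, mask_token="[MASK]"):
    """BERT-style objective: hide chosen words, predict them from both sides."""
    masked = [
        mask_token if i in mask_positions else w
        for i, w in enumerate(words)
    ]
    targets = {i: words[i] for i in sorted(mask_positions)}
    return masked, targets


sentence = "this is an example of text".split()
gpt_pairs = next_word_pairs(sentence)
bert_input, bert_targets = masked_word_example(sentence, {2, 5})
print(gpt_pairs[0])
print(bert_input, bert_targets)
```

In the GPT-style pairs the model only ever sees the left context, while in the BERT-style example it sees words on both sides of each mask, which is where "bidirectional" in the name comes from.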