Nguyễn Văn Quân @nguyen.van.quan


Posted on May 19, 2023, 2:15 AM


[From Transformer to Language Model] Part 2: The Architecture and Generative Pre-Training Method of the GPT Model

Mayfest2023 ContentCreator


This post continues the series on the foundations of large language models. In Part 1: Getting Started with the Model Architecture - Transformer, I introduced the architecture that large language models grew out of, the Transformer. In this part we go deeper into the changes made to the model and into the effective unsupervised training method behind language models, using both theory and a simple code example that trains a small model to write a play.

Model architecture - Transformer Decoder

To date, almost every large language model uses the Transformer architecture (one exception is RWKV, which uses an RNN). Large language models are usually divided into 3 types according to their purpose:

Encoder only: the model reads the whole input sentence and encodes it into context vectors, so it is strong at aggregating meaning. This type of model suits comprehension-style tasks such as classification and sentiment analysis, text summarization, and named entity recognition.
Decoder only: thanks to auto-regressive generation, decoder models can produce coherent, contextually relevant text from a given prompt or input, which suits creative generation tasks such as text completion, summarization, question-answering, and generating creative text. This is the main direction of LLM development today: trained unsupervised on huge amounts of unlabeled data, with an optimized architecture and a model size scaled up massively, decoder models can now also handle the tasks of encoder-only and encoder-decoder models well.
Encoder - Decoder: the model has both an encoder and a decoder, like the original Transformer, so it can both understand and generate text. However, encoder-decoder models are chosen less often for scaling up into large language models, for reasons such as:
- Training complexity: training an encoder-decoder model requires data in the form of distinct input-output sentence pairs, so preparing the training data is very costly.
- Inference complexity: an encoder-decoder model needs 2 steps (encode, then decode), versus 1 step for an encoder-only or decoder-only model.
- Task specificity: the encoder-decoder architecture was designed for specific tasks such as machine translation, where each input has a corresponding output. LLMs, on the other hand, aim to handle many generation tasks such as text completion and question-answering. They also have to be flexible enough to adapt to different kinds of prompts and to produce coherent text without supervision.

In this part we will build a small decoder-only model. I will try to explain it as thoroughly as I can, but I will skip some parts that were already covered in the previous post. The model has 2 main blocks: Masked Multi-head Attention and FFN.

Masked Multi-head Attention is made up of several masked self-attention heads, where the mask is a square lower-triangular matrix of size $T\times T$ created with torch.tril:

class Head(nn.Module):
    """ one head of masked self-attention """

    def __init__(self, head_size, n_embd=64, block_size=32, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape

        # transform q,k,v
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        v = self.value(x) # (B,T,head_size)

        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
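To see what the lower-triangular mask does, here is a minimal pure-Python sketch (no PyTorch). The helper names `causal_mask` and `masked_softmax` are made up for this illustration; giving a masked position weight 0 has the same effect as filling it with `-inf` before the softmax, as the `Head` class does.

```python
import math

def causal_mask(T):
    """Lower-triangular T x T mask: row t may look at columns 0..t only."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out (0) positions receive weight 0."""
    out = []
    for row, mask_row in zip(scores, mask):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
mask = causal_mask(T)
scores = [[0.0] * T for _ in range(T)]  # equal raw scores everywhere
weights = masked_softmax(scores, mask)

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is the attention distribution of one position: position 0 can only see itself, while the last position sees the whole context. No position ever receives weight from a later one.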

The model input is a batch of data of shape $(B, T, C)$, where $B$ is the batch size, $T$ is the fixed sequence length, and $C$ is the dimension of the vector that represents one character. The model I build works at the character level rather than the word level, so each sentence is split into individual characters instead of words. For example, the sentence "My name is A" is split into {"M", "y", " ", "n", "a", "m", "e", " ", "i", "s", " ", "A"}, that is, 12 characters including the spaces " ". Each character is then encoded as a vector of length $C$, so a sentence that has been fully encoded and padded to the fixed length $T$ has shape $(1, T, C)$ (for this problem I cut the text so that every input has exactly 32 characters, so no padding is needed).

FFN is a simple neural network with 2 linear layers:
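The character-level split described above can be sketched in a few lines of plain Python. The helper names `split_chars` and `chunk_text` are hypothetical, and cutting the text into consecutive chunks is a simplification for illustration: the training script later in this post samples windows at random offsets instead.

```python
def split_chars(sentence):
    """Character-level tokenization: every character, spaces included, is a token."""
    return list(sentence)

def chunk_text(text, block_size):
    """Cut text into consecutive chunks of exactly block_size characters.

    The incomplete tail is dropped, so no padding is ever required.
    """
    n_full = len(text) // block_size
    return [text[k * block_size:(k + 1) * block_size] for k in range(n_full)]

tokens = split_chars("My name is A")
print(tokens)
print(len(tokens))  # 12

text = "First Citizen: Before we proceed any further, hear me speak. All: Speak, speak."
chunks = chunk_text(text, block_size=32)
for chunk in chunks:
    print(repr(chunk), len(chunk))
```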

class FeedFoward(nn.Module):
    """ two linear layers with a non-linearity in between """

    def __init__(self, n_embd=64, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

All the components of the model are now in place, so we can move on to the final step, assembling and completing it:

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd=64, n_head=4):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(head_size, n_head) # argument order matches __init__(head_size, num_heads)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# small GPT-style decoder-only model (despite the class name, it is not a bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size=65, n_embd=64, n_head=4, block_size=32, n_layer=4):
        super().__init__()
        # token and position embeddings, a stack of Transformer blocks, then a linear head producing next-token logits
        self.block_size=block_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), on the same device as idx
        x = tok_emb + pos_emb # (B,T,C)
        # print(tok_emb.shape, pos_emb.shape)

        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)

        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        # assume that batch size B = 1
        
        for i in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
           
            # get the predictions
            logits, loss = self(idx_cond) 
           

            # focus only on the last time step 
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to the logits to get next-character probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

Suppose we have a full sentence of 32 characters: $\underbrace{\text{"a b c d e f..."}}_{L=32}$. This sentence needs 2 encoding steps before it can become input to the model:

Step 1: convert from string to int. To do this we need a vocabulary lookup table containing every character that can appear in a text; the table lets us map from string to int easily. The table I use is a dictionary like this:
```
{'\n' : 0,  ' ' : 1,  '!' : 2,  '$' : 3,  '&' : 4, "'" : 5,  ',' : 6,  '-' : 7,  '.' : 8,  '3' : 9, ':' : 10,  ';' : 11,   '?': 12, 'A': 13, 'B': 14,  'C': 15,  'D': 16,  'E': 17,  'F': 18,  'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M' : 25, 'N' : 26,'O' : 27,'P' : 28,'Q' : 29,'R': 30,'S': 31,'T': 32,'U': 33,'V': 34,'W': 35,'X': 36,'Y': 37,'Z': 38,'a': 39,'b': 40,'c': 41,'d': 42,'e': 43,'f': 44,'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
```
This dictionary contains 65 unique characters, enough to cover every character that can appear in the dataset. This step can be seen as character-level tokenizing. Word-level tokenizing works the same way, but the mapping dictionary is far larger in order to cover the whole vocabulary, and padding is needed because the sentences in a passage have different lengths.
Step 2: encode the integer that represents each character as an embedding vector. This is done with nn.Embedding, which yields an input of shape $(1, T, C)$.
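The two steps can be sketched without PyTorch as follows. This is only an illustration under stated assumptions: the toy text, the tiny `n_embd = 4`, and the random `embedding_table` (a stand-in for the learned weights of nn.Embedding) are all made up for the example.

```python
import random

text = "hello world"                      # toy corpus, not the real dataset
chars = sorted(set(text))                 # vocabulary: unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}

n_embd = 4                                # tiny embedding size for readability
rng = random.Random(0)
# Stand-in for nn.Embedding: one row of n_embd numbers per vocabulary entry.
# In the real model these rows are learned during training.
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(n_embd)] for _ in chars]

# Step 1: string -> list of integer ids
ids = [stoi[ch] for ch in "hello"]
# Step 2: integer ids -> one embedding vector per character, shape (T, C)
vectors = [embedding_table[i] for i in ids]

print(ids)                                # [3, 2, 4, 4, 5]
print(len(vectors), len(vectors[0]))      # 5 4
```

The two "l" characters share the same id, so they get the same embedding vector; it is the position embedding added in the model that lets it tell them apart.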

I covered the attention computation in the previous post, so I will skip it here. Next is how to generate characters auto-regressively, which is done by the generate() function in the BigramLanguageModel class.

The generate() function takes a sentence that has already been mapped to integers, called idx, and the parameter max_new_tokens, the number of tokens we ask the model to produce. First, we crop the input to its last 32 characters so that its length is always at most 32, then feed it to the model, which returns a tensor of shape $(B, T, C)$, where $C$ is now the vocabulary size (65) rather than the embedding dimension. We only need the last of the $T$ positions. The reason lies in how the model is trained: the input $x$ is an arbitrary span of the text of length $T=32$, $\{x =[i \rightarrow i+T]\}$, and the ground truth is the same span shifted right by one position, $\{y =[i+1 \rightarrow i+T+1]\}$. So when the model returns a tensor $(1, T, C)$ for one sentence, we only take the output at the last position, $(: , -1 , :)$, which is the prediction for character $(T+1)$; the outputs at earlier positions predict characters that are already present in the input $x$, so they can be ignored. That leaves a vector of size $(1, 65)$ holding one score (logit) for each of the 65 possible characters. These scores are not probabilities yet: to get the probability of each character being the next one in the sentence, we pass the vector through the softmax function.
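As a small illustration of this step, the sketch below takes the logits of the last position and turns them into probabilities. The `logits` values are invented (3 positions, a vocabulary of only 4 characters) and the `softmax` helper is a plain-Python stand-in for F.softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model output for one sequence: T = 3 positions, vocabulary of 4.
logits = [
    [0.1, 0.2, 0.3, 0.4],    # prediction after seeing 1 character
    [1.0, 0.0, 0.0, 0.0],    # prediction after seeing 2 characters
    [2.0, 1.0, 0.0, -1.0],   # prediction after seeing all 3 characters
]

last = logits[-1]            # same idea as logits[:, -1, :] in generate()
probs = softmax(last)

print([round(p, 3) for p in probs])
print(sum(probs))            # sums to 1 up to floating-point rounding
```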

Once we have the probabilities of all characters, we pass them to torch.multinomial to sample. The function is simple: given the probabilities $probs = [0.1, 0.5, 0.2, 0.2]$, torch.multinomial(probs, num_samples=1) returns $[1]$, the index with the highest probability, about half of the time, and one of the other indices otherwise, each in proportion to its probability. So why not use torch.argmax and get $[1]$ directly? With torch.argmax the output generated for a given input is always the same; the randomness is gone, and with it the model's creativity.
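The difference between sampling and argmax can be checked empirically. The sketch below uses `random.choices` from the standard library as a stand-in for torch.multinomial; the helper names, the seed, and the number of draws are arbitrary choices for the illustration.

```python
import random
from collections import Counter

probs = [0.1, 0.5, 0.2, 0.2]
rng = random.Random(1337)
n_draws = 10_000

def sample(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def argmax(probs):
    """Always return the index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

counts = Counter(sample(probs, rng) for _ in range(n_draws))
freqs = [counts[i] / n_draws for i in range(len(probs))]
greedy = [argmax(probs) for _ in range(5)]

print(freqs)    # close to [0.1, 0.5, 0.2, 0.2]
print(greedy)   # [1, 1, 1, 1, 1]
```

Sampling reproduces the whole distribution over many draws, while argmax collapses it to a single, fully deterministic choice.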

Once we have the next character (idx_next), we append it to the input with torch.cat and feed the result back into the model to generate another character. This repeats until max_new_tokens characters have been generated.
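The whole loop, crop, predict, sample, append, can be sketched in plain Python. Here `toy_model` is a hypothetical stand-in for the trained network (a vocabulary of 3 tokens that strongly prefers "last token + 1"), and lists replace tensors, so this mirrors the structure of generate() rather than its exact code.

```python
import random

def generate(next_token_probs, idx, max_new_tokens, block_size, rng):
    """Auto-regressive generation over a plain list of token ids."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]            # crop to the last block_size tokens
        probs = next_token_probs(context)      # distribution over the next token
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        idx.append(nxt)                        # the sampled token becomes input
    return idx

seen_lengths = []  # records how long each context passed to the model was

def toy_model(context):
    # Hypothetical model: puts 0.9 on (last token + 1) mod 3, 0.05 elsewhere.
    seen_lengths.append(len(context))
    probs = [0.05, 0.05, 0.05]
    probs[(context[-1] + 1) % 3] = 0.9
    return probs

out = generate(toy_model, [0], max_new_tokens=10, block_size=4, rng=random.Random(0))
print(out)
print(seen_lengths)  # grows 1, 2, 3, 4 and then stays at 4
```

The running sequence keeps growing, but the model never sees more than block_size tokens, which matters because the position embedding table only has block_size entries.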

The Generative Pre-Training method

Theory

Suppose we have a document, or the contents of a book, as text. After tokenizing it we get a corpus of tokens $\mathcal{U} = \{ u_1, \ldots, u_n \}$. Note that this dataset has no labels; we train the language model in an unsupervised way by maximizing the following objective:

$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$

Reading this objective, the training strategy can be understood roughly as follows:

Input: $\{x=[u _ { i - k } , \ldots , u _ { i - 1 }]\}$
Target: $u_i$
Model : $\Theta$
Loss : Cross-Entropy
First, the training data: this method takes a span of text of a given length as input, and the prediction target is the token that comes right after that span. This way we no longer have to spend effort preparing labels. All we need is a run of $T+1$ consecutive tokens starting at position $i$: the first $T$ tokens are the input and the last one is the target. The loss used is cross-entropy, as in ordinary language modeling tasks; minimizing it is equivalent to maximizing the objective $L_1$ above.

This training is usually carried out before the model is applied to downstream tasks. The main goal of Generative Pre-Training is to produce a model that has learned from a huge amount of data and is complex and general enough to adapt more quickly to other tasks.
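To make the link between the objective and the loss concrete, here is a small worked example. Cross-entropy is the negative mean of the same log-probabilities that the objective sums. The `peaked` helper and the probabilities are invented for illustration; the targets are the ids of "h", "i", "i" under the 65-character mapping shown earlier.

```python
import math

def cross_entropy(probs_per_step, targets):
    """Average negative log-probability assigned to the true next tokens."""
    total = sum(math.log(p[t]) for p, t in zip(probs_per_step, targets))
    return -total / len(targets)

vocab_size = 65
targets = [46, 47, 47]  # "h", "i", "i"

# A model that knows nothing spreads probability evenly over all characters.
uniform = [1.0 / vocab_size] * vocab_size
loss_uniform = cross_entropy([uniform] * len(targets), targets)

def peaked(target, p=0.9):
    """Hypothetical confident model: probability p on the correct character."""
    rest = (1.0 - p) / (vocab_size - 1)
    return [p if k == target else rest for k in range(vocab_size)]

loss_peaked = cross_entropy([peaked(t) for t in targets], targets)

print(round(loss_uniform, 4))  # 4.1744, i.e. ln(65)
print(round(loss_peaked, 4))   # 0.1054, i.e. -ln(0.9)
```

A loss near ln(65) ≈ 4.17 is therefore what to expect from a model that predicts uniformly at random, and training should push the value below that.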

Practice

Now that we have a rough grasp of the method, let's turn to a real problem and see how this strategy actually works.

The dataset I use is tinyshakespeare, about 40,000 lines from Shakespeare's plays, and we will train the model so that it can write lines that resemble them.

The first thing to do is download the data and take a quick look at it:

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters: ", len(text))
print(text[:1000])

length of dataset in characters: 1115394

First Citizen: Before we proceed any further, hear me speak. All: Speak, speak.

First Citizen: You are all resolved rather to die than to famish?

All: Resolved. resolved.

First Citizen: First, you know Caius Marcius is chief enemy to the people.

All: We know't, we know't.

First Citizen: Let us kill him, and we'll have corn at our own price. Is't a verdict?

All: No more talking on't; let it be done: away, away!

Second Citizen: One word, good citizens.

First Citizen: We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if they would yield us but the superfluity, while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, ere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.

The next step is to map all of this text to integers. To do that we need to list every character that appears in it:

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

65

For those who do not know, Python's set data structure does not allow duplicate elements, so passing the whole text to set gives us the set of unique characters that appear in it.

Next we write the functions that map between strings and ints:

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]

hii there

To make the construction of the unsupervised dataset easier to understand, take a look at this example:

import torch
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data # this small demo only reads from the training data
block_size = 4 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    # (val_data is only defined in the full script; this demo uses 'train')
    data = train_data if split == 'train' else val_data
    ix = [1,2,3,4,5] # fixed start offsets so the shifted windows are easy to compare
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')

for b in range(len(xb)): # batch dimension
    print(f"when input is {xb[b].tolist()} the target: {yb[b]}")

when input is [47, 56, 57, 58] the target: tensor([56, 57, 58,  1])
when input is [56, 57, 58, 1] the target: tensor([57, 58,  1, 15])
when input is [57, 58, 1, 15] the target: tensor([58,  1, 15, 47])
when input is [58, 1, 15, 47] the target: tensor([ 1, 15, 47, 58])
when input is [1, 15, 47, 58] the target: tensor([15, 47, 58, 47])

From the example above we can see that the model input is the span of tokens at time steps $[i \rightarrow i+T]$ and its label is the span at time steps $[i+1 \rightarrow i+T+1]$, that is, the same span shifted right by one position.

The model, the dataset and the loss function are now all clear, so let's put everything together and train to see what we get. You can paste this code into Colab and run it:
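One consequence of the shifted target is that a single window of length $T$ contains $T$ training examples, one per prefix. The sketch below spells them out in plain Python; the `data` list is the encoding of "First Citi" under the mapping above, and `pairs` is a name made up for this illustration.

```python
# encode("First Citi") under the article's 65-character mapping
data = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
block_size = 4

i = 1                                   # start offset of the window
x = data[i:i + block_size]              # input:  tokens i .. i+T-1
y = data[i + 1:i + block_size + 1]      # target: the same window shifted by one

# Because of the causal mask, position t only sees x[0..t] and is trained
# to predict y[t], so one window yields block_size (context, target) pairs.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]

for context, target in pairs:
    print(f"context {context} -> target {target}")
# context [47] -> target 56
# context [47, 56] -> target 57
# context [47, 56, 57] -> target 58
# context [47, 56, 57, 58] -> target 1
```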

# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    model.to(device)
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size=16, n_embd=64, block_size=32, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        
        # transform q,k,v
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        v = self.value(x) # (B,T,head_size)

        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, head_size, num_heads=4, n_embd=64, dropout=0.0):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, head_size*num_heads = 16*4 = 64 = n_embd)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ two linear layers with a non-linearity in between """

    def __init__(self, n_embd=64, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd=64, n_head=4):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(head_size, n_head) # argument order matches __init__(head_size, num_heads)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size=65, n_embd=64, n_head=4, block_size=32, n_layer=4):
        super().__init__()
        # token and position embeddings, a stack of Transformer blocks, then a linear head producing next-token logits
        self.block_size=block_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), on the same device as idx
        x = tok_emb + pos_emb # (B,T,C)
        # print(tok_emb.shape, pos_emb.shape)

        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)

        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context

        
        for i in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:] 
           
            # get the predictions
            logits, loss = self(idx_cond) 
           

            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to the logits to get next-character probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel().to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in model.parameters()) / 1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=2000)[0].tolist()))

Final result:

Rettozen giy's freief my toward?
Voliasul:
No, duke I may life in the hove were:
Yow, or that mustiens, and swigh too?

First AUMPET:
Yet man to in I have rews to han othen
The we Hursument.

ANGBRY:
How it.

PAULINTE:
Citingmius his ghards wifity? I Hom and your sI:
While,----
But thou still them butty: they beens armide's of
who; so.

DUKE OF YORK:
My doth vight weads that e'll, better and and persital you:
this fold nayel is like to the wear the comfs: is thou,
And not wil wan may have To a forgien'st my
my marnice ways! doth as forswen to virsomo.

TROSBETH:
What the tae as pass ewsself, some to and pitans' now wors?
He talkestion fortun, our ling for with that the king
Whering all ust thats in made of our wile.

BRUTHUS:
Ipur Butuaver Rome, prinring Anglorce in in,
Withan I mo sive? would 'twere, thigh ray,
And the gate us I'll bear conce!'
-sconceant, to net, thlought this all one plience;
And up you, prothreatence, tawak that forbul strave,
Then fatel beavens earty with. I my bord
Hell to.
If that was suil on out it tome will why,
These it, with thou I own ambany,
As slorgelad royor vinestiod counsent.

PALWARD:
Flass, prow have he's not word.

First COMENTIO:
O, whom leem, and volies am strattrain; I his with my lord:
And the soncry not his mores.
Yet my sing:
Lord trathing if thyse; evich lie,
His beny the said he's, fords frul,
That made destrents bing it you: and that I
surnmo murne, or play's, I and by
Wonteetce up our gentle Buty or vurphesselp,
And letter alve of a it tooks tup.

Conclusion

At first glance the result looks rather bad :v That is fine and easy to understand: a model of about 0.2M parameters is far too small for a language modeling problem, and the tokenizing method we used is quite "crude". Still, the goal we set at the start has been met. To improve the result there are a few fairly simple options: increase the batch size, increase the model size (embedding dimension, number of heads and layers), train for more iterations, and tune the training hyperparameters. Thank you for reading. If you spot mistakes or want to discuss anything, feel free to comment below. See you in the next post of the LLM series.
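The figure of about 0.2M parameters can be checked by hand from the hyperparameters of the script. The sketch below redoes the arithmetic in plain Python; the `linear` helper is made up for the illustration, and the count assumes the layers exactly as defined in the code above (bias-free key/query/value projections, biased projection, FFN and output layers, LayerNorm with a weight and a bias).

```python
vocab_size, n_embd, n_head, n_layer, block_size = 65, 64, 4, 4, 32
head_size = n_embd // n_head

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

# token embedding table + position embedding table
embeddings = vocab_size * n_embd + block_size * n_embd
# per head: key, query, value (no bias); then one output projection
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
# feed-forward: n_embd -> 4*n_embd -> n_embd
ffn = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
# two LayerNorms per block, each with a weight and a bias vector
layer_norms = 2 * 2 * n_embd

block = attention + ffn + layer_norms
final_layer_norm = 2 * n_embd
lm_head = linear(n_embd, vocab_size)

total = embeddings + n_layer * block + final_layer_norm + lm_head
print(block)   # 49792
print(total)   # 209729, i.e. about 0.21M
```

Almost all of the parameters sit in the 4 Transformer blocks, and within each block the feed-forward network accounts for roughly two thirds of them.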

