# Post-training a Language Model with RLHF

In this notebook, we fine-tune a language model (LM) for summarization, step by step.

First, install the *trl* library (Transformer Reinforcement Learning) if it is not already available on your instance.

In [2]:
# device = 'cuda:0'
device = 'cpu'

In [None]:
!pip install trl
import trl

## Summarization: the TL;DR dataset


We download this summarization dataset from the Hugging Face Hub (URL starting with `hf://`).

In [5]:
import pandas as pd

In [None]:
splits = {'train': 'data/train-00000-of-00001-e8c59e5cf7bce1c0.parquet', 'test': 'data/test-00000-of-00001-59ffb27399371eac.parquet', 'valid': 'data/valid-00000-of-00001-0e33e6bd86e3edc9.parquet'}
df = pd.read_parquet("hf://datasets/CarperAI/openai_summarize_tldr/" + splits["train"])

Let's have a look at the data.

In [None]:
df.iloc[5]
print(df.iloc[5]["prompt"])
print(df.iloc[5]["label"])

## Loading a pretrained Language Model

We will load a model from Hugging Face.
Because we use Colab, we pick one of the smallest LMs available.


In [8]:
import torch

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    default_data_collator,
)


In [None]:
# model_id= "EleutherAI/pythia-6.9b-deduped"
# model_id= "EleutherAI/pythia-2.8b-deduped"
# model_id= "EleutherAI/pythia-1b-deduped"
model_id= "EleutherAI/pythia-410m-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device=device)

tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id

### Generate a text

Use the Hugging Face API to generate a sentence starting with *Once upon a time*:
- use the tokenizer to tokenize the prompt
- generate a sequence of token ids starting with the tokenized prompt
- decode the token ids to get the response as a string


In [None]:
# your code here
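
The three steps listed above could be sketched as follows. This is one possible solution rather than a reference answer: the helper name `generate_text` and the sampling settings (`max_new_tokens=30`, `top_p=0.9`) are illustrative assumptions, and `model` and `tokenizer` are the objects loaded earlier in the notebook.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=30):
    """Tokenize `prompt`, let `model` continue it, and decode the result."""
    # 1. Tokenize: the result is dict-like and holds input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # 2. Generate: the returned ids start with the prompt ids.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # assumption: a short continuation
        do_sample=True,                 # sample instead of greedy decoding
        top_p=0.9,                      # assumption: nucleus sampling threshold
        pad_token_id=tokenizer.pad_token_id,
    )
    # 3. Decode the first (and only) sequence of the batch.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In the notebook, `print(generate_text(model, tokenizer, "Once upon a time"))` would print one sampled continuation. Because sampling is enabled, the text changes from run to run.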

Use the Hugging Face API to generate a summary of **Little Red Riding Hood** (https://en.wikipedia.org/wiki/Little_Red_Riding_Hood).
In the prompt, write the story and ask for a summary.

In [None]:
# your code here
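
A base model such as Pythia has not been trained to follow instructions, so one plausible approach (an assumption, not the notebook's prescribed answer) is to imitate the layout of the dataset prompts and let the model continue after `TL;DR:`. The helper `build_summary_prompt` and the shortened story text below are hypothetical.

```python
def build_summary_prompt(story: str) -> str:
    """Format a story so that a summary is the natural continuation."""
    # "POST:" and "TL;DR:" mimic fields seen in the dataset prompts (assumed layout).
    return f"POST: {story.strip()}\nTL;DR:"

# Hypothetical, heavily shortened retelling; paste the full story in practice.
story = (
    "A girl in a red hood walks through the forest to bring food to her sick "
    "grandmother. A wolf learns where she is going, reaches the house first, "
    "and disguises himself as the grandmother."
)

summary_prompt = build_summary_prompt(story)
print(summary_prompt)
```

The resulting string can then be tokenized and passed to `model.generate` like any other prompt.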

## Finetuning a Language Model on Summarization Examples

The first step in post-training is to fine-tune our baseline LM on example data that corresponds more closely to the task we want to use it for.
Here, we fine-tune our model on pairs *(text, summary)*
separated by the `TL;DR:` marker.

Look at the data below:


In [None]:
from datasets import load_dataset

# we use valid because it is smaller
# dataset = load_dataset("CarperAI/openai_summarize_tldr", split="train")
dataset = load_dataset("CarperAI/openai_summarize_tldr", split="valid").rename_column("label", "completion")

print(type(dataset))
print(dataset[0])
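
To make the *(text, summary)* pairing concrete, the sketch below joins one record into the single string a causal LM would see during fine-tuning. The record is made up, the helper `to_training_text` is hypothetical, and the end-of-sequence string `<|endoftext|>` is assumed to match the tokenizer's `eos_token`. It is only meant to illustrate the format, not to replace what the trainer does.

```python
def to_training_text(example: dict, eos_token: str = "<|endoftext|>") -> tuple[str, int]:
    """Return the full training string and the index where the summary starts."""
    prompt, completion = example["prompt"], example["completion"]
    text = prompt + completion + eos_token
    return text, len(prompt)

# Made-up record with the same two columns as the dataset loaded above.
example = {
    "prompt": "SUBREDDIT: r/example\nTITLE: My cat ignores me\nPOST: ...\nTL;DR:",
    "completion": " Cat is aloof, owner is sad.",
}

text, summary_start = to_training_text(example)
print(text[summary_start:])
```

Everything before `summary_start` is context; the characters after it are what the model should learn to produce.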

In [None]:
# useful constants

output_dir = "sft_checkpoints"
train_batch_size = 3 # set according to memory usage
gradient_accumulation_steps = 1
learning_rate = 1e-5
eval_batch_size = 1
eval_steps = 500
max_input_length = 550
save_steps = 1000
num_train_epochs = 20



In [None]:
from trl import SFTConfig, SFTTrainer

### Create a configuration

Using the TRL (Hugging Face) API, create a configuration, an instance of class `SFTConfig`, for supervised fine-tuning.

In [None]:
training_args = SFTConfig(
    #fill here
)
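
One way to fill in the configuration above, sketched under assumptions: the keyword names below are standard `transformers.TrainingArguments` fields, which `SFTConfig` inherits, while the `logging_steps` and `report_to` settings are illustrative choices. The maximum sequence length is left out because its argument name appears to have changed between TRL releases.

```python
# Constants repeated from the cell above so that this sketch runs on its own.
output_dir = "sft_checkpoints"
train_batch_size = 3
eval_batch_size = 1
gradient_accumulation_steps = 1
learning_rate = 1e-5
save_steps = 1000
num_train_epochs = 20

sft_kwargs = dict(
    output_dir=output_dir,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps,
    logging_steps=50,   # assumption: log the loss every 50 steps
    report_to="none",   # assumption: disable external experiment trackers
)

# Number of examples consumed per optimizer step on a single device.
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)
```

In the notebook, these could be passed as `SFTConfig(**sft_kwargs)`.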

Create the trainer, an instance of `SFTTrainer`, and run training.

In [None]:
trainer = SFTTrainer(
    #fill here
)

trainer.train()

In [None]:
trainer.save_model("finetuned_model/")

## Training a Reward Function

To perform RLHF, we need a reward function.
We assume this function is implemented by a neural network whose parameters are learned from examples of preferred and rejected responses.

In [None]:
import torch
import transformers
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from trl import RewardTrainer, SFTTrainer, RewardConfig
from datasets import Dataset
import json
import pandas as pd
from transformers import Trainer, TrainingArguments

We use the `test` split of the `CarperAI/openai_summarize_comparisons` dataset because of time and resource constraints.

In [None]:
# dataset path
data_url = "hf://datasets/CarperAI/openai_summarize_comparisons/"
splits = {'train': 'data/train-00000-of-00001-3cbd295cedeecf91.parquet', 'test': 'data/test-00000-of-00001-0845e2eec675b16a.parquet', 'valid1': 'data/valid1-00000-of-00001-b647616a2be5f333.parquet', 'valid2': 'data/valid2-00000-of-00001-2655c5b3621b6116.parquet'}
DATA_PATH = data_url + splits["test"]

In [None]:
df = pd.read_parquet(DATA_PATH)
df = df[:1000] # to speed up training
raw_dataset = Dataset.from_pandas(df)
raw_dataset


We add a special token for padding and store the encoded examples.

In [None]:
# Add a dedicated padding token. Any model that consumes these ids needs
# its embedding matrix resized to len(tokenizer) before training.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

def formatting_func(examples):
    kwargs = {"padding": "max_length",
              "truncation": True,
              "max_length": 256,
              "return_tensors": "pt"
              }

    # Prepend the prompt and a line break to the `chosen` and `rejected` fields.
    prompt_plus_chosen_response = examples["prompt"] + "\n" + examples["chosen"]
    prompt_plus_rejected_response = examples["prompt"] + "\n" + examples["rejected"]

    # Then tokenize both concatenated strings.
    tokens_chosen = tokenizer(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer(prompt_plus_rejected_response, **kwargs)

    # return_tensors="pt" adds a batch dimension, so keep only the first row.
    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }


In [None]:
formatted_dataset = raw_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()


Let us have a look at the configuration:

In [None]:
model.config

Create a configuration, an instance of class `RewardConfig`.

In [None]:
training_args = RewardConfig(
    # fill in here
    )

Create the trainer, an instance of class `RewardTrainer`, and run it.

In [None]:
trainer = RewardTrainer(
    #fill in here
                        )
trainer.train()

In [None]:
trainer.save_model("reward_model/")

In [None]:
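# Run inference with the reward model. RewardTrainer expects a
# sequence-classification model, so load it with the matching Auto class.
from transformers import AutoModelForSequenceClassification

rm_model = AutoModelForSequenceClassification.from_pretrained("reward_model/").to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
rm_model.config.pad_token_id = tokenizer.pad_token_id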

Now we compare the scores of the rejected and chosen responses, given a specific prompt.

In [None]:

def get_score(model, tokenizer, prompt, response):
    """Return the raw model output (logits) for a prompt/response pair."""
    # Encode prompt and response as a pair, padded or truncated to 256 tokens.
    instructions = tokenizer(prompt,
                             response,
                             padding="max_length",
                             max_length=256,
                             return_tensors="pt",
                             truncation=True).to(device=device)
    with torch.no_grad():
        outputs = model(**instructions)

    logits = outputs[0]

    return logits


In [None]:
# usage with prompt
prompt = df.iloc[0]["prompt"]
example_prefered_response = df.iloc[0]["chosen"]
example_unprefered_response = df.iloc[0]["rejected"]

In [None]:
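# Score both responses with the reward model loaded above (rm_model).
score_chosen = get_score(rm_model, tokenizer, prompt, example_prefered_response)
score_rejected = get_score(rm_model, tokenizer, prompt, example_unprefered_response)
print(score_chosen.sum(), score_rejected.sum())

In [None]:
from torch import nn
# Pairwise loss: low when the chosen response scores higher than the rejected one.
loss = -nn.functional.logsigmoid(score_chosen - score_rejected).mean()
print(loss)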

How do you interpret this result? Does the current model prefer the chosen response over the rejected response?
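
The quantity computed above is the pairwise (Bradley-Terry style) loss commonly used to train reward models: it is small when the chosen response scores higher than the rejected one. The plain-Python sketch below, with hypothetical function names and made-up scores, shows how the score gap maps to a loss and to a preference probability.

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), in a numerically stable form."""
    gap = score_chosen - score_rejected
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability that the chosen response is preferred: sigmoid of the score gap."""
    return math.exp(-pairwise_loss(score_chosen, score_rejected))

# Made-up score pairs: chosen ahead, tied, chosen behind.
for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    p = preference_probability(chosen, rejected)
    loss_value = pairwise_loss(chosen, rejected)
    print(f"gap={chosen - rejected:+.1f}  p(chosen)={p:.3f}  loss={loss_value:.3f}")
```

A loss below log 2 (about 0.693) means the model ranks the pair in the same order as the annotators; a loss above it means the model prefers the rejected response.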

## Using the Reward Model to Find a Policy: RLOO

Now that we have a reward model, we can learn a policy (here, a model that summarizes texts).

In [None]:
from trl import RLOOConfig, RLOOTrainer
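
RLOO stands for REINFORCE Leave-One-Out. As a rough sketch of the idea (not of TRL's implementation): several responses are sampled for each prompt, and the reward of each response is compared against the average reward of the *other* samples for that prompt. The function name and the reward values below are made up for illustration.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages for k >= 2 sampled responses to one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]

# Made-up rewards for three responses to the same prompt.
advantages = rloo_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses that beat their siblings get a positive advantage and are made more likely by the policy update; the others are made less likely.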

Create a configuration, an instance of class `RLOOConfig`.

In [None]:

config = RLOOConfig(
    # fill in here
)


In [None]:
#del(ref_model)
# torch.cuda.empty_cache()

Create the trainer and run it.

In [None]:
import copy

ref_model = copy.deepcopy(model)

dtset = raw_dataset.remove_columns(["prompt"])


trainer = RLOOTrainer(
    # fill in here
)




In [None]:
trainer.train()

## Bypassing the Reward Model to Find a Policy: DPO

In this case we don't have to train a reward model.
So we go back to our dataset of *chosen/rejected* sentences, and modify our model directly.
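
The DPO objective can be written down directly from sequence log-probabilities under the policy and under a frozen reference model. The sketch below is plain Python with made-up numbers and hypothetical names; `beta` controls how strongly the policy is kept close to the reference, and 0.1 is used here only as an illustrative value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before training the policy equals the reference, so the margin is 0.
print(dpo_loss(-40.0, -42.0, -40.0, -42.0))
# After a hypothetical update that favours the chosen response.
print(dpo_loss(-35.0, -47.0, -40.0, -42.0))
```

The loss starts at log 2 when the policy is a copy of the reference and decreases as the policy raises the chosen response relative to the rejected one.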



Create a configuration, an instance of class `DPOConfig`.

In [None]:
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
import copy

ref_model = copy.deepcopy(model)

config = DPOConfig(
    #fill in here
    )


We limit the data to 100 examples because of time constraints.

In [None]:
data_url = "hf://datasets/CarperAI/openai_summarize_comparisons/"
splits = {'train': 'data/train-00000-of-00001-3cbd295cedeecf91.parquet', 'test': 'data/test-00000-of-00001-0845e2eec675b16a.parquet', 'valid1': 'data/valid1-00000-of-00001-b647616a2be5f333.parquet', 'valid2': 'data/valid2-00000-of-00001-2655c5b3621b6116.parquet'}
DATA_PATH = data_url + splits["test"]

df = pd.read_parquet(DATA_PATH)
df = df[:100]
raw_dataset = Dataset.from_pandas(df)
raw_dataset

Create the trainer and run it.

In [None]:


trainer = DPOTrainer(
    #fill in here
)

In [None]:
trainer.train()

In [None]:
# usage with prompt
prompt = df.iloc[0]["prompt"]
example_prefered_response = df.iloc[0]["chosen"]
example_unprefered_response = df.iloc[0]["rejected"]

In [None]:
loss1 = get_score(ref_model, tokenizer, prompt, example_prefered_response)
loss2= get_score(ref_model, tokenizer, prompt, example_unprefered_response)
print(loss1.sum(), loss2.sum())

loss1 = get_score(model, tokenizer, prompt, example_prefered_response)
loss2= get_score(model, tokenizer, prompt, example_unprefered_response)
print(loss1.sum(), loss2.sum())
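
Summing raw logits, as the cell above does, mixes all vocabulary entries and all positions. A quantity that is often easier to interpret, suggested here as an assumption and not as the notebook's method, is the log-probability a model assigns to the actual tokens of a response. The sketch below works on plain Python lists with a toy vocabulary of four tokens; `sequence_logprob` is a hypothetical helper.

```python
import math

def log_softmax(row):
    """Log-probabilities for one position, computed stably."""
    m = max(row)
    log_norm = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_norm for x in row]

def sequence_logprob(logits, token_ids, start=1):
    """Sum of log p(token_ids[t] | token_ids[:t]) for t >= start.

    logits[t] holds the scores for the token at position t + 1, as in a
    causal LM; `start` marks the position where the response begins.
    """
    total = 0.0
    for t in range(max(start, 1), len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total

# Toy example: 3 tokens, uniform logits over a vocabulary of size 4.
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
print(sequence_logprob(toy_logits, [2, 0, 3]))
```

Comparing this quantity for the chosen and rejected responses, under `model` and under `ref_model`, gives the log-ratios that DPO is designed to change.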

What can we conclude? Did training help for this example?