# Processing the data

Below is an example of training a sentence classifier in PyTorch on a single batch of two sentences:

	import torch
	from torch.optim import AdamW  # transformers.AdamW is deprecated, so use the PyTorch optimizer
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Same as before
	checkpoint = "bert-base-uncased"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
	sequences = [
	    "I've been waiting for a HuggingFace course my whole life.",
	    "This course is amazing!",
	]
	batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

	# This is new
	batch["labels"] = torch.tensor([1, 1])

	optimizer = AdamW(model.parameters())
	loss = model(**batch).loss
	loss.backward()
	optimizer.step()

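Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.

In this section we use the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example, introduced in a paper by William Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases (i.e., whether they mean the same thing). We selected it for this chapter because it is a small dataset, so it is easy to experiment with training on it.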



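## Loading a dataset from the Hub

The Hub doesn't just contain models; it also hosts many datasets in lots of different languages. You can browse them on the Hub's datasets page, and once you have gone through this section we recommend reading the documentation on loading and processing a new dataset, which will give you a clearer picture of the Hugging Face Datasets library. For now, let's focus on the MRPC dataset. It is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

The Datasets library provides a very simple command to download and cache a dataset from the Hub. We can download the MRPC dataset like this: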

	from datasets import load_dataset

	raw_datasets = load_dataset("glue", "mrpc")
	raw_datasets

	DatasetDict({
	    train: Dataset({
	        features: ['sentence1', 'sentence2', 'label', 'idx'],
	        num_rows: 3668
	    })
	    validation: Dataset({
	        features: ['sentence1', 'sentence2', 'label', 'idx'],
	        num_rows: 408
	    })
	    test: Dataset({
	        features: ['sentence1', 'sentence2', 'label', 'idx'],
	        num_rows: 1725
	    })
	})

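As you can see, we get a DatasetDict object which contains the training set, the validation set, and the test set. Each of those contains several columns (sentence1, sentence2, label, and idx) and a num_rows value giving the number of elements in each set (so there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

By default, this command downloads the dataset and caches it in ~/.cache/huggingface/datasets. Recall from Chapter 2 that you can customize the cache folder by setting the HF_HOME environment variable.

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary: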

	raw_train_dataset = raw_datasets["train"]
	raw_train_dataset[0]

	{'idx': 0,
	'label': 1,
	'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
	'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

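We can see the labels are already integers, so we won't have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of raw_train_dataset. This will tell us the type of each column:

	raw_train_dataset.features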

	{'sentence1': Value(dtype='string', id=None),
	'sentence2': Value(dtype='string', id=None),
	'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
	'idx': Value(dtype='int32', id=None)}

In the example above, label is of type ClassLabel, which maps integers to label names: 0 corresponds to not_equivalent, and 1 corresponds to equivalent.
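
The integer-to-name mapping can be pictured as a simple list lookup. The sketch below is only an illustration in plain Python: `label_names`, `int_to_name`, and `name_to_int` are hypothetical helpers that mirror the `names` list printed above. On the real feature object, `ClassLabel` exposes `int2str` and `str2int` methods for the same purpose.

```python
# Hypothetical stand-in for the `names` list of the ClassLabel feature shown above.
label_names = ["not_equivalent", "equivalent"]


def int_to_name(label_id):
    """Return the label name stored at position `label_id`."""
    return label_names[label_id]


def name_to_int(name):
    """Return the integer ID of a label name."""
    return label_names.index(name)


print(int_to_name(1))
print(name_to_int("not_equivalent"))
```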

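# Preprocessing a dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in Chapter 2, this is done with a tokenizer. We can tokenize all the first sentences and all the second sentences of each pair like this: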

	from transformers import AutoTokenizer

	checkpoint = "bert-base-uncased"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
	tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

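However, we can't just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases. We need to handle the two sequences as a pair and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

	inputs = tokenizer("This is the first sentence.", "This is the second one.")
	inputs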

	{
	'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
	'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
	'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	}

We discussed the input_ids and attention_mask keys in Chapter 2, but we put off talking about token_type_ids. In this example, token_type_ids is what tells the model which part of the input is the first sentence and which is the second sentence.

If we decode the IDs inside input_ids back to tokens:

	tokenizer.convert_ids_to_tokens(inputs["input_ids"])

	['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']

So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences. Aligning this with token_type_ids gives us:

	['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
	[ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

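As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.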
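
The layout described above can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation: `build_pair_inputs` is a hypothetical helper that takes two already-tokenized sentences, adds the special tokens, and assigns 0 to the first segment and 1 to the second.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token type IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and the first [SEP] belong to segment 0;
    # sentence B and the final [SEP] belong to segment 1.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


first = ["this", "is", "the", "first", "sentence", "."]
second = ["this", "is", "the", "second", "one", "."]

tokens, token_type_ids = build_pair_inputs(first, second)
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:10s} {type_id}")
```

Running it on the two example sentences yields the same 15 tokens and the same run of eight 0s followed by seven 1s as the tokenizer output shown earlier.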

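Note that if you select a different checkpoint, you won't necessarily have token_type_ids in your tokenized inputs (for instance, they are not returned if you use a DistilBERT model). They are only returned when the model knows what to do with them, because it has seen them during its pretraining.

Here, BERT is pretrained with token type IDs, and on top of the masked language modeling objective we talked about in Chapter 1, it has an additional objective called next sentence prediction. The goal of this task is to model the relationship between pairs of sentences.

With next sentence prediction, the model is given pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To keep the task from being trivial, half of the time the two sentences follow each other in the original document, and the other half of the time they come from two different documents.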
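
To make the construction of next sentence prediction examples concrete, here is a rough sketch in plain Python. It is an illustration under simplifying assumptions, not BERT's actual data pipeline: `make_nsp_pairs` and the toy `documents` are hypothetical, token masking is left out, and the 50/50 split is drawn at random per pair, so it only holds on average.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples from a list of documents.

    Each document is a list of sentences. Roughly half of the pairs use the
    true next sentence (is_next=1); the rest use a sentence drawn from a
    different document (is_next=0).
    """
    pairs = []
    for doc_idx, doc in enumerate(documents):
        other_docs = [d for j, d in enumerate(documents) if j != doc_idx]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], 1))
            else:
                distractor = rng.choice(rng.choice(other_docs))
                pairs.append((doc[i], distractor, 0))
    return pairs


documents = [
    ["The cat sat down.", "It fell asleep.", "Nobody woke it."],
    ["Stocks rose today.", "Analysts were surprised.", "Bonds fell."],
]

pairs = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "->", sentence_b)
```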

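In general, you don't need to worry about whether or not there are token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine, since the tokenizer knows what to provide to its model.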

# A complete code example for MRPC

	from transformers import AutoModelForSequenceClassification
	from transformers import Trainer, TrainingArguments
	# load_metric has been removed from recent `datasets` releases; `evaluate.load` replaces it there
	from datasets import load_metric

	import numpy as np
	from transformers import AutoTokenizer, DataCollatorWithPadding
	import datasets

	checkpoint = 'bert-base-cased'
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	raw_datasets = datasets.load_dataset('glue', 'mrpc')

	def tokenize_function(sample):
	    # Tokenize each sentence pair together; padding is left to the data collator
	    return tokenizer(sample['sentence1'], sample['sentence2'], truncation=True)

	tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

	data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

	def compute_metrics(eval_preds):
	    metric = load_metric("glue", "mrpc")
	    logits, labels = eval_preds.predictions, eval_preds.label_ids
	    # The line above can be shortened to `logits, labels = eval_preds`,
	    # because eval_preds behaves like a tuple
	    predictions = np.argmax(logits, axis=-1)
	    return metric.compute(predictions=predictions, references=labels)

	# output_dir is created automatically if it does not exist.
	# Newer transformers releases call the second argument `eval_strategy`.
	training_args = TrainingArguments(output_dir='test_trainer', evaluation_strategy='epoch')

	model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) # new model

	trainer = Trainer(
	    model,
	    training_args,
	    train_dataset=tokenized_datasets["train"],
	    eval_dataset=tokenized_datasets["validation"],
	    data_collator=data_collator, # optional once tokenizer is passed: the Trainer then builds this collator from the tokenizer
	    tokenizer=tokenizer,
	    compute_metrics=compute_metrics
	)

	trainer.train()

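Given the setup above, what exactly should be passed to the model?

	model(torch.tensor([tokenizer(raw_datasets['train'][0]['sentence1'], raw_datasets['train'][0]['sentence2'], truncation=True, padding=True)['input_ids']]))

This call passes neither attention_mask nor token_type_ids. Compare it with the following test on two examples.

Two examples are used because the tokenizer output is then already two-dimensional; with a single example we have to add the batch dimension ourselves.

	model(torch.tensor(tokenizer(raw_datasets['train'][0:2]['sentence1'], raw_datasets['train'][0:2]['sentence2'], truncation=True, padding=True)['input_ids']))

The result is: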

	SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5339, -0.2458],
	[ 0.5840, -0.2698]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The two examples are:

	raw_datasets['train'][0:2]

	{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
	"Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion ."],
	'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
	"Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 ."],
	'label': [1, 0],
	'idx': [0, 1]}

When attention_mask is passed to the model:

	temp = tokenizer(raw_datasets['train'][0:2]['sentence1'], raw_datasets['train'][0:2]['sentence2'], truncation=True, padding=True)

	print(temp.keys()) # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

	print(temp)

	print(temp['input_ids'])

	print(type(temp['input_ids'])) # <class 'list'>

	print(type(temp['token_type_ids'])) # <class 'list'>

	print(type(temp['attention_mask'])) # <class 'list'>

	# Convert the lists to tensors (<class 'torch.Tensor'>)

	temp['input_ids'] = torch.tensor(temp['input_ids'])

	temp['token_type_ids'] = torch.tensor(temp['token_type_ids'])

	temp['attention_mask'] = torch.tensor(temp['attention_mask'])

	model(**temp)

	SequenceClassifierOutput(loss=None, logits=tensor([[0.1645, 0.6985],
	[0.1670, 0.6946]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
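
When only input_ids is passed, the model cannot tell real tokens from padding, since a missing attention mask is treated as all ones. The sketch below shows in plain Python what padding=True conceptually produces for a batch. It is an illustration, not the tokenizer's code: `pad_batch` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is an assumption that matches the BERT vocabularies used here.

```python
PAD_ID = 0  # assumption: ID of [PAD] in the BERT vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Made-up token IDs for two sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sequence receives one pad token, and the corresponding position in its mask is 0, which is the information the first call above leaves out.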

When training in an environment with a GPU, the model is moved to the GPU, so the sample data must be moved to the GPU as well:

	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	temp.to(device)

	'''
	The following can be used instead of temp.to(device) above:
	temp['input_ids'] = temp['input_ids'].to(device)
	temp['attention_mask'] = temp['attention_mask'].to(device)
	temp['token_type_ids'] = temp['token_type_ids'].to(device)
	'''

	model.to(device)

	model(**temp)

# How to write the compute_metrics code

Suppose we run:

	predictions = trainer.predict(tokenized_datasets['validation'])

	print(predictions.predictions.shape) # shape of the logits, which look like:
	# array([[-2.7887206, 3.1986978],
	# [ 2.5258656, -1.832253 ], ...], dtype=float32)
	print(predictions.label_ids.shape) # shape of the labels, which look like array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...], dtype=int64)
	print(predictions.metrics)

If the first line is changed to predictions = trainer.predict(tokenized_datasets['validation'][0:3]), an error is raised. The reason is that tokenized_datasets['validation'][0:3] is of type <class 'dict'>, whereas tokenized_datasets['validation'] is of type <class 'datasets.arrow_dataset.Dataset'>.

Changing the code as follows fixes it:

	from datasets import Dataset

	x = tokenized_datasets['validation'][0:2]

	y = Dataset.from_dict(x)

	print(type(y))

	predictions = trainer.predict(y)

	print(predictions.predictions.shape) # shape of the logits, which look like:
	# array([[-2.7887206, 3.1986978],
	# [ 2.5258656, -1.832253 ], ...], dtype=float32)
	print(predictions.label_ids.shape) # shape of the labels, which look like array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...], dtype=int64)
	print(predictions.metrics)

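Note that predictions.predictions obtained above matches the output of model(**temp), provided temp is built from the same examples.

# Exploring the labels

The label column of the dataset becomes label_ids in the output of predict. The data type may differ, but the values are the same.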

	from datasets import Dataset

	x = tokenized_datasets['validation'][0:2]

	y = Dataset.from_dict(x)

	print(type(y))

	predictions = trainer.predict(y)

	print(predictions.predictions.shape) # shape of the logits, which look like:
	# array([[-2.7887206, 3.1986978],
	# [ 2.5258656, -1.832253 ], ...], dtype=float32)
	print(predictions.label_ids.shape) # shape of the labels, which look like array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...], dtype=int64)
	print(predictions.metrics)

	# Print the labels from both sources to compare them
	print(predictions.label_ids)
	print(x['label'])

# Summary

	from datasets import Dataset

	x = tokenized_datasets['validation'][0:2]

	y = Dataset.from_dict(x)

	print(type(y))

	predictions = trainer.predict(y)

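	print(predictions.predictions.shape) # shape of the logits, which look like:
	# array([[-2.7887206, 3.1986978],
	# [ 2.5258656, -1.832253 ], ...], dtype=float32)
	print(predictions.label_ids.shape) # shape of the labels, which look like array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...], dtype=int64)

	# The metrics have already been computed here and can be inspected directly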
	print(predictions.metrics)

	metric = load_metric("glue", "mrpc")

	metric.compute(predictions=np.argmax(predictions.predictions, axis=-1), references=predictions.label_ids)
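
For MRPC, the GLUE metric reports accuracy and F1. As a rough illustration of what happens between the logits and those two numbers, here is a sketch in plain Python. It is not the library's implementation: `argmax` and `accuracy_and_f1` are hypothetical helpers, the logits and labels are made up, and class 1 (equivalent) is assumed to be the positive class.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(logits, labels, positive=1):
    """Turn logits into class predictions, then compute accuracy and F1."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four examples and their reference labels.
logits = [[-2.79, 3.20], [2.53, -1.83], [0.30, 0.10], [-1.00, 2.00]]
labels = [1, 0, 1, 1]

scores = accuracy_and_f1(logits, labels)
print(scores)
```

Three of the four predictions are correct, and one positive example is missed, which gives an accuracy of 0.75 and an F1 of 0.8.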

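To do the same for temp:

	res = model(**temp)

	metric = load_metric("glue", "mrpc")

	# Move res.logits to the CPU and detach it from the graph before converting to NumPy.
	# The references must be the labels of the same examples temp was built from
	# (raw_datasets['train'][0:2] above).
	metric.compute(predictions=np.argmax(res.logits.cpu().detach().numpy(), axis=-1), references=raw_datasets['train'][0:2]['label'])

Usage of the detach() function: https://blog.csdn.net/Hodors/article/details/119248838

The open problem is that after swapping the head, the model's outputs do not necessarily have the same form as the labels, which affects compute_metrics. The solution is:

Before training the model (while the weights in the head are still randomly initialized), first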

# cola

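The code above always uses the AutoModelFor... classes, but the official documentation also lists BertFor... classes. The Auto classes read the checkpoint's configuration and instantiate the matching architecture-specific class, so for a BERT checkpoint AutoModelForSequenceClassification returns a BertForSequenceClassification.

The output of AutoModelForSequenceClassification is not a hard 0 or 1. For each input, the model returns one floating-point logit per label. For binary classification, applying argmax(..., axis=-1) to the logits gives the predicted class, 0 or 1, for each input.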
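
To see how logits relate to class predictions, the sketch below applies a softmax and an argmax in plain Python to the two rows of logits printed earlier for model(**temp). The `softmax` helper is a hypothetical illustration and not part of the library.

```python
import math


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model(**temp) output shown earlier.
logits = [[0.1645, 0.6985], [0.1670, 0.6946]]

probs = [softmax(row) for row in logits]
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(probs)
print(preds)
```

Softmax does not change which entry is largest, so taking the argmax of the raw logits is enough when only the predicted class is needed.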

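What if the target is a continuous score, as in STS-B, where similarity ranges from 0 to 5? How should the model output be handled then?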
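
One common way to handle such a target is to treat it as regression: with num_labels=1 the model returns a single value per input, so instead of taking an argmax the predictions are squeezed to a flat list and compared to the reference scores with a correlation measure (the GLUE metric for STS-B reports Pearson and Spearman correlations). The sketch below illustrates the idea in plain Python. `squeeze` and `pearson` are hypothetical helpers, and the numbers are made up.

```python
import math


def squeeze(logits):
    """Flatten predictions of shape (N, 1) into a list of N scores."""
    return [row[0] for row in logits]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up regression outputs (one value per input) and reference scores.
logits = [[4.8], [0.3], [2.5], [3.9]]
labels = [5.0, 0.0, 2.6, 4.2]

scores = squeeze(logits)
print(scores)
print(pearson(scores, labels))
```

In a compute_metrics function this would amount to replacing the argmax step with a squeeze of the predictions before calling the metric.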

	from transformers import AutoModelForSequenceClassification
	from transformers import Trainer, TrainingArguments
	# load_metric has been removed from recent `datasets` releases; `evaluate.load` replaces it there
	from datasets import load_metric

	import numpy as np
	from transformers import AutoTokenizer, DataCollatorWithPadding
	import datasets

	checkpoint = 'bert-base-cased'
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)

	raw_datasets = datasets.load_dataset('glue', 'cola')

	# Inspect the basic information about the dataset
	raw_datasets['train'].features

	def tokenize_function(sample):
	    # CoLA has a single sentence per example
	    return tokenizer(sample['sentence'], truncation=True)

	tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

	model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) # new model

	def compute_metrics(eval_preds):
	    metric = load_metric("glue", "cola")
	    logits, labels = eval_preds.predictions, eval_preds.label_ids
	    # The line above can be shortened to `logits, labels = eval_preds`,
	    # because eval_preds behaves like a tuple
	    predictions = np.argmax(logits, axis=-1)
	    return metric.compute(predictions=predictions, references=labels)

	# output_dir is created automatically if it does not exist.
	# Newer transformers releases call the second argument `eval_strategy`.
	training_args = TrainingArguments(output_dir='test_trainer', evaluation_strategy='epoch')

	trainer = Trainer(
	    model,
	    training_args,
	    train_dataset=tokenized_datasets["train"],
	    eval_dataset=tokenized_datasets["validation"],
	    # data_collator=data_collator, # not needed once tokenizer is passed: the Trainer builds the collator from the tokenizer
	    tokenizer=tokenizer,
	    compute_metrics=compute_metrics
	)

	trainer.train()
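
For CoLA, the GLUE metric reports the Matthews correlation coefficient rather than accuracy. As an illustration of what that number measures, here is a sketch of the binary formula in plain Python. `matthews_corrcoef` is a hypothetical helper written for this example, not the library's implementation, and the predictions and labels are made up.

```python
import math


def matthews_corrcoef(preds, labels):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when one class is never predicted or never present
    return (tp * tn - fp * fn) / denominator


# Made-up predictions and reference labels.
preds = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]

mcc = matthews_corrcoef(preds, labels)
print(mcc)
```

The value ranges from -1 to 1, with 0 corresponding to chance-level predictions, which makes it more informative than accuracy when the two classes are imbalanced.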

# Supplement

# collate
