# Text summarization

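In this section we'll look at how Transformer models can be used to condense long documents into summaries, a task known as text summarization. This is one of the most challenging NLP tasks because it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics of a document. When done well, however, text summarization is a powerful tool: it relieves domain experts of the burden of reading long documents in detail, which can speed up various business processes.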

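Although various models fine-tuned for summarization already exist on the Hugging Face Hub, almost all of them are only suitable for English documents. So, to add a twist in this section, we'll train a bilingual model for English and Spanish. By the end of this section, you'll have a model that can summarize customer reviews.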

# Preparing a multilingual corpus

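We'll use the Multilingual Amazon Reviews Corpus to create our bilingual summarizer. This corpus consists of Amazon product reviews in six languages and is typically used to benchmark multilingual classifiers. However, since each review is accompanied by a short title, we can use the titles as the target summaries for our model to learn from! To get started, let's download the English and Spanish subsets from the Hugging Face Hub: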

from datasets import load_dataset
spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset
DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

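As you can see, for each language there are 200,000 reviews in the train split and 5,000 reviews in each of the validation and test splits. The review information we're interested in is contained in the review_body and review_title columns. Let's look at a few examples by creating a simple function that draws a random sample from the training set, using the techniques we learned in Chapter 5: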

def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")
show_samples(english_dataset)
'>> Title: Worked in front position, not rear'
'>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'
'>> Title: meh'
'>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue'
'>> Title: Can\'t beat these for the money'
'>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. Some have remarked that they wanted dividers for the drawers, and that they made those. Good idea. My application was that I needed something that I can see the contents at about eye level, so I wanted the fuller-sized drawers. I also like that these are the new plastic that doesn\'t get brittle and split like my older plastic drawers did. I like the all-plastic construction. It\'s heavy duty enough to hold metal parts, but being made of plastic it\'s not as heavy as a metal frame, so you can easily mount it to the wall and still load it up with heavy stuff, or light stuff. No problem there. For the money, you can\'t beat it. Best one of these I\'ve bought to date-- and I\'ve been using some version of these for over forty years.'
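
The seed argument is what makes this sample reproducible: the same seed always selects the same rows. The sketch below illustrates the shuffle-then-take-the-first-rows idea in plain Python. It is an analogy rather than the 🤗 Datasets implementation, and `sample_rows` and `toy_reviews` are hypothetical names and data introduced only for illustration.

```python
import random


def sample_rows(rows, num_samples=3, seed=42):
    """Shuffle a copy of the rows with a seeded RNG and keep the first few."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:num_samples]


# Hypothetical rows standing in for a dataset split.
toy_reviews = [
    {"review_title": f"title {i}", "review_body": f"body {i}"} for i in range(10)
]

first_draw = sample_rows(toy_reviews, seed=42)
second_draw = sample_rows(toy_reviews, seed=42)

# Drawing twice with the same seed yields the same rows in the same order.
print(first_draw == second_draw)
```

Passing a different seed will generally surface different rows, which is why changing it is a cheap way to browse a large corpus.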

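Try it out! Change the random seed in the Dataset.shuffle() command to explore other reviews in the corpus. If you're a Spanish speaker, take a look at some of the reviews in spanish_dataset to see whether the titles also read like reasonable summaries.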

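This sample shows the diversity of reviews one typically finds online, ranging from positive to negative (and everything in between!). Although the example with the "meh" title is not very informative, the other titles look like decent summaries of the reviews themselves. Training a summarization model on all 400,000 reviews would take far too long on a single GPU, so instead we'll focus on generating summaries for a single product domain. To get a feel for which domains we can choose from, let's convert english_dataset to a pandas.DataFrame and count the number of reviews per product category: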

english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for the 20 most frequent product categories
english_df["product_category"].value_counts()[:20]
home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64
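
value_counts() tallies how often each category occurs and sorts the result in descending order. If pandas isn't at hand, roughly the same tally can be sketched with the standard library's collections.Counter. The rows below are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# Hypothetical product categories, one entry per review.
toy_categories = ["home", "book", "home", "toy", "book", "home"]

# Count occurrences and list the categories from most to least frequent.
category_counts = Counter(toy_categories)
top_categories = category_counts.most_common(20)
print(top_categories)
```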

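The most popular products in the English dataset are household items, clothing, and wireless electronics. To stick with the Amazon theme, though, let's focus on summarizing book reviews. After all, books are what the company was founded on! We can see two product categories that fit the bill (book and digital_ebook_purchase), so let's filter the datasets in both languages to keep just these products. As we saw in Chapter 5, the Dataset.filter() function lets us slice a dataset very efficiently, so we can define a simple function to do this: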

def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )
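
As a quick check of what this predicate keeps, the sketch below applies an equivalent set-membership test to a few hand-written rows and uses the category counts printed earlier to estimate how many English training reviews should survive the filter. The names `BOOK_CATEGORIES`, `is_book_review`, and `toy_rows` are introduced here for illustration, and the rows are hypothetical.

```python
# Categories treated as "books" in this section.
BOOK_CATEGORIES = {"book", "digital_ebook_purchase"}


def is_book_review(example):
    """Return True when the row belongs to one of the book categories."""
    return example["product_category"] in BOOK_CATEGORIES


# Hypothetical rows, for illustration only.
toy_rows = [
    {"product_category": "book", "review_title": "Helpful"},
    {"product_category": "home", "review_title": "meh"},
    {"product_category": "digital_ebook_purchase", "review_title": "Great read"},
]
kept_titles = [row["review_title"] for row in toy_rows if is_book_review(row)]
print(kept_titles)

# Counts copied from the value_counts() output above (English train split).
english_train_counts = {"digital_ebook_purchase": 6749, "book": 3756}
expected_english_books = sum(english_train_counts.values())
print(expected_english_books)
```

Because it takes a single example and returns a Boolean, a function like this could be passed to Dataset.filter() in the same way as filter_books.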

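Now, when we apply this function to english_dataset and spanish_dataset, the result will contain only the rows that belong to the book categories. Before applying the filter, let's switch the format of english_dataset from pandas back to arrow: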

english_dataset.reset_format()

Then we can apply the filter function and, as a sanity check, inspect a sample of reviews to see whether they are indeed about books:

spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)
'>> Title: I\'m dissapointed.'
'>> Review: I guess I had higher expectations for this book from the reviews. I really thought I\'d at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I\'m dissapointed.'
'>> Title: Good art, good price, poor design'
'>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it\'s less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar'
'>> Title: Helpful'
'>> Review: Nearly all the tips useful and. I consider myself an intermediate to advanced user of OneNote. I would highly recommend.'

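Okay, we can see that the reviews are not strictly about books and may refer to things like calendars and electronic applications such as OneNote. Nevertheless, the domain seems about right for training a summarization model. Before we look at the various models suited to this task, we have one last bit of data preparation to do: combining the English and Spanish reviews into a single DatasetDict object. 🤗 Datasets provides a handy concatenate_datasets() function that (as the name suggests) stacks Dataset objects on top of each other. So, to create our bilingual dataset, we'll loop over each split, concatenate the datasets for that split, and shuffle the result to ensure our model doesn't overfit to a single language: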

from datasets import concatenate_datasets, DatasetDict
books_dataset = DatasetDict()
for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)
# Peek at a few examples
show_samples(books_dataset)
'>> Title: Easy to follow!!!!'
'>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'
'>> Title: PARCIALMENTE DAÑADO'
'>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).'
'>> Title: no lo he podido descargar'
'>> Review: igual que el anterior'
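
To see why the shuffle matters, here is a minimal plain-Python sketch of the same concatenate-then-shuffle pattern. It does not use 🤗 Datasets: lists of dictionaries stand in for Dataset objects, and all rows are hypothetical.

```python
import random
from collections import Counter

# Hypothetical stand-ins for one split of english_books and spanish_books.
english_rows = [{"language": "en", "review_title": f"English title {i}"} for i in range(6)]
spanish_rows = [{"language": "es", "review_title": f"Título {i}"} for i in range(6)]

# Plain concatenation keeps every English row ahead of every Spanish row.
combined = english_rows + spanish_rows
leading_languages = [row["language"] for row in combined[:6]]
print(leading_languages)

# Shuffle a copy with a fixed seed so the order is reproducible.
shuffled = list(combined)
random.Random(42).shuffle(shuffled)

# Shuffling reorders the rows but leaves the language balance unchanged.
language_counts = Counter(row["language"] for row in shuffled)
print(language_counts)
```

Without the shuffle, a model reading the rows in order would see one language in a long unbroken run before ever meeting the other. On the real data, a similar tally over the language column of books_dataset["train"] should show how many English and Spanish reviews ended up in the training split.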