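HuggingFace models and datasets

Fine-tuning dataset: wikitext-103-v1

Model: BigBirdPegasusForCausalLM

How to download and load models and datasets

Both models and datasets are referenced by the exact repository name shown on huggingface. There are several ways to download them. By default the files go to ~/.cache/huggingface/hub/, but ~/.cache may get cleared later. When loading, just pass the local path.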
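
Since the default cache sits under `~/.cache`, which may be cleared, it helps to know where the hub cache resolves to and how to move it somewhere persistent. The sketch below is a simplified re-implementation of the lookup order described in the huggingface_hub documentation (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface`). The function name `resolve_hub_cache` and the path `/data/hf_home` are assumptions made up for illustration; the real logic lives inside the library and may differ between versions.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where hub downloads are expected to be cached.

    Simplified sketch of the documented lookup order, not the library code.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base).expanduser() / "huggingface" / "hub"


# Default location when no override is set
print(resolve_hub_cache({}))
# Hypothetical persistent directory, chosen only for illustration
print(resolve_hub_cache({"HF_HOME": "/data/hf_home"}))
```

In practice this means that exporting `HF_HOME` to a directory outside `~/.cache` before running any download should keep the files from being wiped along with the rest of the cache.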

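  1. Download through the download button on the huggingface model hub web page. The Files tab of a model page provides a download link for each file; you can click to download without logging in.

  2. Download with huggingface's huggingface_hub tool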

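    pip install huggingface_hub
    huggingface-cli download internlm/internlm2-chat-7b
    # Downloading directly like this timed out on my network, so use a mirror instead
    python -m pip install huggingface_hub
    export HF_ENDPOINT=https://hf-mirror.com
    huggingface-cli download gpt2 --local-dir gpt2
    # The optional --resume-download flag is deprecated; downloads now resume by default
    # Optional: --local-dir-use-symlinks False. By default the huggingface toolchain stored downloads as symlinks,
    # so the directory given by --local-dir held only link files while the real model lived in ~/.cache/huggingface.
    # Pass --local-dir-use-symlinks False to turn that off, but then you have to give the absolute path every time you load.
    # (Newer huggingface_hub releases appear to write real files into --local-dir and ignore this flag.)
    huggingface-cli download --repo-type dataset wikitext --local-dir wikitext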
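  3. Instantiate the model with huggingface's transformers library, which downloads it into the cache directory. In other words, the download happens in code at the moment the model is needed.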

    import os
    # Set HF_ENDPOINT before importing transformers so the mirror is picked up
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    # A local directory is loaded from disk; a repo name such as "internlm/internlm2-chat-7b" is downloaded into the cache
    tokenizer = AutoTokenizer.from_pretrained("/home/{username}/huggingface/internlm2-chat-7b", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("/home/{username}/huggingface/internlm2-chat-7b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
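  4. Use **hfd**, a dedicated huggingface download tool written by Chinese developers. It builds on the mature tool aria2 and gives stable, fast downloads that do not drop.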

    wget https://hf-mirror.com/hfd/hfd.sh
    chmod a+x hfd.sh
    export HF_ENDPOINT=https://hf-mirror.com
    ./hfd.sh gpt2
    ./hfd.sh wikitext --dataset
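
The command-line downloads in the list above can also be driven from Python. The following is a minimal sketch, assuming a reasonably recent huggingface_hub in which `snapshot_download` accepts the `repo_type`, `local_dir` and `endpoint` keyword arguments; the helper names `download_kwargs` and `download_repo` are hypothetical and exist only for this example.

```python
def download_kwargs(repo_id, local_dir, repo_type="model",
                    endpoint="https://hf-mirror.com"):
    """Build the keyword arguments passed to snapshot_download."""
    return {
        "repo_id": repo_id,
        "repo_type": repo_type,  # "model" or "dataset"
        "local_dir": local_dir,  # directory that receives the files
        "endpoint": endpoint,    # the mirror used in this post
    }


def download_repo(repo_id, local_dir, repo_type="model"):
    """Download a whole repository and return its local path (needs network)."""
    # Imported lazily so download_kwargs works without the package installed
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(repo_id, local_dir, repo_type))
```

Calling `download_repo("gpt2", "gpt2")` or `download_repo("wikitext", "wikitext", repo_type="dataset")` should mirror the two CLI invocations shown earlier. Passing the endpoint explicitly avoids depending on whether `HF_ENDPOINT` was exported before the library was imported.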

Using the datasets library

The datasets library is also from huggingface and maintains a large collection of datasets.

# Install first with: pip install datasets
import os
# Route traffic through a local proxy (reportedly this is enough to enable it)
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
from datasets import load_dataset
ds = load_dataset("fancyzhx/ag_news")


# Inspect the schema with features: each row has a text and a label, so this is a text classification task
ds["train"].features
# {'text': Value(dtype='string', id=None),'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}

# Printing ds shows its structure: it is divided into train and test
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 120000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 7600
#     })
# })

  • filter

    # Pass a lambda that keeps only the rows whose label is 2
    tmp = ds["train"].filter(lambda x: x["label"] == 2)
    print(tmp)
    print(tmp[0])
  • map: modify the labels in a dataset, or extend its content

    prompt_cls = """You are an expert in text classification. Assign the text below to one of the following categories:
    * World
    * Sports
    * Business
    * Science / Technology

    text is the text to classify and label is its category:
    text: {text}
    label:
    """
    # This function substitutes the row's text into the {text} placeholder of prompt_cls
    def trans2llm(item):
        item["text"] = prompt_cls.format(text=item["text"])
        return item

    tmp = ds["test"].map(trans2llm)
    print(tmp[0])
  • select: sample from a dataset, by random indices, explicit indices, and so on

    ds["train"].select([0, 10, 20, 30, 40, 50])  # select takes a list of row indices
    # To pick 1000 rows at random
    import random
    random_indices = random.sample(range(len(ds["train"])), 1000)
    ds["train"].select(random_indices)
  • concatenate_datasets: concatenate datasets

    from datasets import concatenate_datasets
    dataset = concatenate_datasets([true_dataset, false_dataset])
  • train_test_split: split a dataset into training and test sets; the result gains one extra level with the keys train and test

    dataset.train_test_split(train_size=0.8)

    dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")
    print(dataset)
    dataset = dataset.select(range(1000)) # Use a subset for quick testing
    train_test_split = dataset.train_test_split(test_size=0.1)
    print(train_test_split)
    train_dataset = train_test_split['train']
    print(train_dataset)
    eval_dataset = train_test_split['test']
    print(eval_dataset)
    print(train_dataset[0])
    Dataset({
    features: ['text'],
    num_rows: 1801350
    })
    DatasetDict({
    train: Dataset({
    features: ['text'],
    num_rows: 900
    })
    test: Dataset({
    features: ['text'],
    num_rows: 100
    })
    })
    Dataset({
    features: ['text'],
    num_rows: 900
    })
    Dataset({
    features: ['text'],
    num_rows: 100
    })
    {'text': " The ship was assigned to the Austro @-@ Hungarian Fleet 's 1st Battle Squadron after her 1911 commissioning . In 1912 , Zrínyi and her two sister ships conducted two training cruises into the eastern Mediterranean Sea . On the second cruise into the Aegean Sea , conducted from November to December , Zrínyi and her sister ships were accompanied by the cruiser SMS Admiral Spaun and a pair of destroyers . After returning to Pola , the entire fleet mobilized for possible hostilities , as tensions flared in the Balkans . \n"}
  • add_column: add a column

    dataset["train"] = dataset["train"].add_column("index", list(range(786701)))
  • rename_column: rename a column

    dataset["train"] = dataset["train"].rename_column("idx", "file_sent_index")

train loss

Comparison of how the loss changes during fine-tuning

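| epoch | mindnlp+mindspore | transformer+torch(4060) |
|-------|-------------------|-------------------------|
| 1     | 2.9176            | 8.7301                  |
| 2     | 2.79              | 8.1557                  |
| 3     | 2.593             | 7.7516                  |
| 4     | 2.4875            | 7.5017                  |
| 5     | 2.3831            | 7.2614                  |
| 6     | 2.2631            | 7.0559                  |
| 7     | 2.2369            | 6.8405                  |
| 8     | 2.1732            | 6.7297                  |
| 9     | 2.1717            | 6.7136                  |
| 10    | 2.1833            | 6.6279                  |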
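
Because the two runs start from very different absolute losses, one way to compare them is the relative drop from the first to the last epoch. The sketch below does that arithmetic on the numbers in the table; `relative_reduction` is a hypothetical helper, and this is just one possible reading of the comparison, not a metric used by either framework.

```python
# Per-epoch training loss copied from the table
train_loss = {
    "mindnlp+mindspore": [2.9176, 2.79, 2.593, 2.4875, 2.3831,
                          2.2631, 2.2369, 2.1732, 2.1717, 2.1833],
    "transformer+torch(4060)": [8.7301, 8.1557, 7.7516, 7.5017, 7.2614,
                                7.0559, 6.8405, 6.7297, 6.7136, 6.6279],
}


def relative_reduction(losses):
    """Fraction by which the last loss is lower than the first."""
    return (losses[0] - losses[-1]) / losses[0]


for name, losses in train_loss.items():
    print(f"{name}: {relative_reduction(losses):.1%} lower at epoch 10 than at epoch 1")
```

Both runs come out at roughly a quarter lower (about 25% and 24%), so the curves fall at a similar relative rate even though the absolute values differ by a wide margin.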

eval loss

Comparison of evaluation results

| epoch | mindnlp+mindspore  | transformer+torch(4060) |
|-------|--------------------|-------------------------|
| 1     | 2.6390955448150635 | 6.3235931396484375      |

# Test samples (with ground-truth labels)

test_data = [
    {"text": "I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.",
     "true_label": "rec.autos"},

    {"text": "I'm not familiar at all with the format of these X-Face:thingies, but\nafter seeing them in some folks' headers, I've got to see them (and\nmaybe make one of my own)!\n\nI've got dpg-viewon my Linux box (which displays uncompressed X-Faces)\nand I've managed to compile [un]compface too... but now that I'm looking\nfor them, I can't seem to find any X-Face:'s in anyones news headers! :-(\n\nCould you, would you, please send me your X-Face:header\n\nI know* I'll probably get a little swamped, but I can handle it.\n\n\t...I hope.",
     "true_label": "comp.windows.x"},

    {"text": "\nIn a word, yes.\n",
     "true_label": "alt.atheism"},

    {"text": "\nThey were attacking the Iraqis to drive them out of Kuwait,\na country whose citizens have close blood and business ties\nto Saudi citizens. And me thinks if the US had not helped out\nthe Iraqis would have swallowed Saudi Arabia, too (or at \nleast the eastern oilfields). And no Muslim country was doing\nmuch of anything to help liberate Kuwait and protect Saudi\nArabia; indeed, in some masses of citizens were demonstrating\nin favor of that butcher Saddam (who killed lotsa Muslims),\njust because he was killing, raping, and looting relatively\nrich Muslims and also thumbing his nose at the West.\n\nSo how would have you defended Saudi Arabia and rolled\nback the Iraqi invasion, were you in charge of Saudi Arabia???\n\n\nI think that it is a very good idea to not have governments have an\nofficial religion (de facto or de jure), because with human nature\nlike it is, the ambitious and not the pious will always be the\nones who rise to power. There are just too many people in this\nworld (or any country) for the citizens to really know if a \nleader is really devout or if he is just a slick operator.\n\n\nYou make it sound like these guys are angels, Ilyess. (In your\nclarinet posting you edited out some stuff; was it the following???)\nFriday's New York Times reported that this group definitely is\nmore conservative than even Sheikh Baz and his followers (who\nthink that the House of Saud does not rule the country conservatively\nenough). The NYT reported that, besides complaining that the\ngovernment was not conservative enough, they have:\n\n\t- asserted that the (approx. 500,000) Shiites in the Kingdom\n\t are apostates, a charge that under Saudi (and Islamic) law\n\t brings the death penalty. \n\n\t Diplomatic guy (Sheikh bin Jibrin), isn't he Ilyess?\n\n\t- called for severe punishment of the 40 or so women who\n\t drove in public a while back to protest the ban on\n\t women driving. The guy from the group who said this,\n\t Abdelhamoud al-Toweijri, said that these women should\n\t be fired from their jobs, jailed, and branded as\n\t prostitutes.\n\n\t Is this what you want to see happen, Ilyess? I've\n\t heard many Muslims say that the ban on women driving\n\t has no basis in the Qur'an, the ahadith, etc.\n\t Yet these folks not only like the ban, they want\n\t these women falsely called prostitutes? \n\n\t If I were you, I'd choose my heroes wisely,\n\t Ilyess, not just reflexively rally behind\n\t anyone who hates anyone you hate.\n\n\t- say that women should not be allowed to work.\n\n\t- say that TV and radio are too immoral in the Kingdom.\n\nNow, the House of Saud is neither my least nor my most favorite government\non earth; I think they restrict religious and political reedom a lot, among\nother things. I just think that the most likely replacements\nfor them are going to be a lot worse for the citizens of the country.\nBut I think the House of Saud is feeling the heat lately. In the\nlast six months or so I've read there have been stepped up harassing\nby the muttawain (religious police--not government) of Western women\nnot fully veiled (something stupid for women to do, IMO, because it\nsends the wrong signals about your morality). And I've read that\nthey've cracked down on the few, home-based expartiate religious\ngatherings, and even posted rewards in (government-owned) newspapers\noffering money for anyone who turns in a group of expartiates who\ndare worship in their homes or any other secret place. So the\ngovernment has grown even more intolerant to try to take some of\nthe wind out of the sails of the more-conservative opposition.\nAs unislamic as some of these things are, they're just a small\ntaste of what would happen if these guys overthrow the House of\nSaud, like they're trying to in the long run.\n\nIs this really what you (and Rached and others in the general\nwest-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)\nwant, Ilyess?\n",
     "true_label": "talk.politics.mideast"}
]
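
Samples in this shape, a `text` plus a `true_label`, lend themselves to a simple accuracy check. The sketch below assumes some classifier exposed as a function from text to label; `dummy_predict` is a hypothetical stand-in for the fine-tuned model, and the two short rows in `demo` are made up so that the snippet runs on its own.

```python
def accuracy(samples, predict):
    """Fraction of samples whose predicted label equals true_label."""
    if not samples:
        return 0.0
    correct = sum(1 for s in samples if predict(s["text"]) == s["true_label"])
    return correct / len(samples)


# Hypothetical stand-in for a real model: always predicts one class
def dummy_predict(text):
    return "rec.autos"


# Shortened, made-up samples in the same shape as test_data
demo = [
    {"text": "which bonneville model should I buy", "true_label": "rec.autos"},
    {"text": "where can I find X-Face headers", "true_label": "comp.windows.x"},
]
print(accuracy(demo, dummy_predict))
```

With a real model, `predict` would wrap tokenization and inference, and `accuracy(test_data, predict)` would score the four samples listed here.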