Shawn001 committed on
Commit 7a8763c
1 Parent(s): 1101a21

Upload 8 files

README.md CHANGED
@@ -1,3 +1,45 @@
 ---
 license: apache-2.0
+
+language: zh
+inference: false
+tags:
+- bert
+- pytorch
 ---
+
+# YuYan-10b
+
+YuYan is a series of natural language processing models developed by the Fuxi AI Lab at NetEase, Inc., covering text generation, natural language understanding, and more. YuYan-10b is a natural language understanding model trained on a high-quality Chinese corpus.
+
+Like BERT, YuYan-10b is pre-trained on a large-scale corpus with unsupervised learning. It differs in that, in addition to the MLM objective, training incorporates auxiliary tasks such as sentence-order prediction and word deletion, which strengthen the model's semantic representations and improve its understanding of Chinese.
+
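To make the auxiliary objectives concrete, below is a minimal sketch of how sentence-order and word-deletion corruptions can be generated alongside MLM. This is illustrative only, not the authors' implementation (which lives in the Megatron-based training code rather than in this commit); the whitespace tokenization and the `deletion_rate` parameter are assumptions.

```python
import random

def make_corrupted_example(sentences, deletion_rate=0.1):
    """Illustrative corruption for the auxiliary objectives described above."""
    # Sentence-order task: permute the sentences; the permutation is the
    # label the model must recover.
    order = list(range(len(sentences)))
    random.shuffle(order)
    shuffled = [sentences[i] for i in order]

    # Word-deletion task: drop a random fraction of tokens; the model is
    # trained to detect (or restore) the deleted positions.
    corrupted = []
    for sent in shuffled:
        tokens = sent.split()  # assumption: whitespace tokenization for brevity
        kept = [t for t in tokens if random.random() >= deletion_rate]
        corrupted.append(" ".join(kept))
    return corrupted, order
```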
+# CLUE result
+
+|                | Score  | AFQMC | TNEWS1.1 | IFLYTEK | OCNLI_50k | WSC1.1 | CSL  |
+| -------------- | ------ | ----- | -------- | ------- | --------- | ------ | ---- |
+| YuYan-10b      |        |       |          |         |           |        |      |
+| HUMAN          | 84.1   | 81    | 71       | 80.3    | 90.3      | 98     | 84   |
+| HunYuan-NLP 1T | 83.632 | 85.11 | 70.44    | 67.54   | 86.5      | 96     | 96.2 |
+
+## How to use
+
+Our model is trained with [Megatron](https://github.com/NVIDIA/Megatron-LM); as a result, both inference and finetuning depend on it.
+
+Below is the installation tutorial. We have packaged all the dependencies the model requires; use the following command to set up the model's runtime environment.
+
+```
+pip install -r requirements.txt
+```
+
+## Finetuning script
+
+We provide scripts for finetuning on the CLUE benchmark, a Chinese language understanding evaluation leaderboard covering tasks such as text classification, semantic matching, and reading comprehension. For any given CLUE task, start finetuning with the corresponding script, e.g.:
+```
+# finetune on the AFQMC task
+sh finetune_afqmc_distributed.sh
+
+# finetune on the CSL task
+sh finetune_csl_distributed.sh
+```
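The commit actually ships five such scripts (AFQMC, CSL, IFLYTEK, OCNLI, WSC). A hypothetical driver for running every sweep back to back, assuming the scripts sit in the repository root, might look like this:

```python
import subprocess

# Run each provided CLUE finetuning sweep in sequence; each script performs
# its own hyperparameter grid search and writes results under outputs/<TASK>/.
for task in ["afqmc", "csl", "iflytek", "ocnli", "wsc"]:
    subprocess.run(["sh", f"finetune_{task}_distributed.sh"], check=True)
```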
bert-vocab.txt ADDED
The diff for this file is too large to render.
finetune_afqmc_distributed.sh ADDED
@@ -0,0 +1,58 @@
+#!/bin/bash
+
+WORLD_SIZE=8
+
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6003"
+
+TASK="AFQMC"
+TRAIN_DATA="clue_data/afqmc/train.json"
+VALID_DATA="clue_data/afqmc/dev.json"
+TEST_DATA="clue_data/afqmc/test.json"
+PRETRAINED_CHECKPOINT="./yuyan-10b"
+
+VOCAB_FILE=bert-vocab.txt
+
+# Grid search over learning rate, micro-batch size, and number of epochs.
+for lr in 1e-5 2e-5 3e-5; do
+for bs in 32 16; do
+for ep in 3 5 8; do
+    ct=$(date +"%m%d%H%M%S")  # timestamp keeps output dirs of repeated runs distinct
+    OUTPUTS_PATH="outputs/${TASK}/yuyan_bs_${bs}_lr_${lr}_ep_${ep}_${ct}"
+    if [ ! -d "${OUTPUTS_PATH}" ]; then
+        mkdir -p "${OUTPUTS_PATH}"
+    else
+        echo "directory exists, skipping mkdir"
+    fi
+    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+        --task $TASK \
+        --seed 1234 \
+        --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
+        --train-data $TRAIN_DATA \
+        --valid-data $VALID_DATA \
+        --test-data $TEST_DATA \
+        --tokenizer-type BertWordPieceLowerCase \
+        --vocab-file $VOCAB_FILE \
+        --epochs $ep \
+        --tensor-model-parallel-size 8 \
+        --num-layers 48 \
+        --hidden-size 4096 \
+        --num-attention-heads 64 \
+        --micro-batch-size $bs \
+        --lr $lr \
+        --lr-decay-style linear \
+        --lr-warmup-fraction 0.065 \
+        --seq-length 128 \
+        --max-position-embeddings 512 \
+        --log-interval 10 \
+        --eval-interval 800 \
+        --eval-iters 50 \
+        --weight-decay 1.0e-1 \
+        --res-path "${OUTPUTS_PATH}" \
+        --fp16 | tee "${OUTPUTS_PATH}/job.log"
+        # --activations-checkpoint-method uniform \
+done
+done
+done
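As a sanity check on the "10b" in the model name, the flags above (`--num-layers 48`, `--hidden-size 4096`) support a quick back-of-the-envelope parameter estimate. The vocabulary size below is an assumption (a typical Chinese BERT vocabulary; see bert-vocab.txt), and biases, layernorms, and position embeddings are ignored:

```python
# Rough parameter count for --num-layers 48, --hidden-size 4096.
num_layers, hidden = 48, 4096
vocab_size = 21128  # assumption: typical Chinese BERT vocab size

per_layer = 12 * hidden**2  # ~4h^2 attention + ~8h^2 MLP weight matrices
total = num_layers * per_layer + vocab_size * hidden
print(f"~{total / 1e9:.2f}B parameters")  # ~9.75B, i.e. roughly 10B
```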
finetune_csl_distributed.sh ADDED
@@ -0,0 +1,60 @@
+#!/bin/bash
+
+WORLD_SIZE=8
+
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6000"
+
+TASK="CSL"
+TRAIN_DATA="clue_data/csl/train.json"
+VALID_DATA="clue_data/csl/dev.json"
+TEST_DATA="clue_data/csl/test.json"
+PRETRAINED_CHECKPOINT="./yuyan-10b"
+
+VOCAB_FILE=bert-vocab.txt
+
+# Grid search over learning rate, micro-batch size, and number of epochs.
+for lr in 4e-6 7e-6; do
+for bs in 4 2; do
+for ep in 7 10; do
+    ct=$(date +"%m%d%H%M%S")  # timestamp keeps output dirs of repeated runs distinct
+    OUTPUTS_PATH="outputs/${TASK}/yuyan_bs_${bs}_lr_${lr}_ep_${ep}_${ct}"
+    if [ ! -d "${OUTPUTS_PATH}" ]; then
+        mkdir -p "${OUTPUTS_PATH}"
+    else
+        echo "directory exists, skipping mkdir"
+    fi
+    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+        --task $TASK \
+        --seed 1234 \
+        --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
+        --train-data $TRAIN_DATA \
+        --valid-data $VALID_DATA \
+        --test-data $TEST_DATA \
+        --tokenizer-type BertWordPieceLowerCase \
+        --vocab-file $VOCAB_FILE \
+        --epochs $ep \
+        --tensor-model-parallel-size 8 \
+        --num-layers 48 \
+        --hidden-size 4096 \
+        --num-attention-heads 64 \
+        --micro-batch-size $bs \
+        --lr $lr \
+        --lr-decay-style linear \
+        --lr-warmup-fraction 0.1 \
+        --seq-length 512 \
+        --max-position-embeddings 512 \
+        --log-interval 10 \
+        --eval-interval 3000 \
+        --eval-iters 50 \
+        --weight-decay 1.0e-1 \
+        --res-path "${OUTPUTS_PATH}" \
+        --fp16 | tee "${OUTPUTS_PATH}/job.log"
+        # --activations-checkpoint-method uniform \
+done
+done
+done
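Since every run writes to a timestamped directory under `outputs/<TASK>/`, comparing sweep results means walking those directories. A minimal sketch for listing them (parsing `job.log` is left out because Megatron's log format is not reproduced in this commit):

```python
import glob

task = "CSL"  # or AFQMC, IFLYTEK, OCNLI, WSC
# Each directory matches the OUTPUTS_PATH pattern from the script above and
# contains the job.log captured via tee.
for run_dir in sorted(glob.glob(f"outputs/{task}/yuyan_bs_*_lr_*_ep_*_*")):
    print(run_dir)
```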
finetune_iflytek_distributed.sh ADDED
@@ -0,0 +1,60 @@
+#!/bin/bash
+
+WORLD_SIZE=8
+
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6000"
+
+TASK="IFLYTEK"
+TRAIN_DATA="clue_data/iflytek/train.json"
+VALID_DATA="clue_data/iflytek/dev.json"
+TEST_DATA="clue_data/iflytek/test.json"
+PRETRAINED_CHECKPOINT="./yuyan-10b"
+
+VOCAB_FILE=bert-vocab.txt
+
+# Grid search over learning rate, micro-batch size, and number of epochs.
+for lr in 7e-6 1e-5 2e-5; do
+for bs in 24 16 8; do
+for ep in 2 3 5 7 15; do
+    ct=$(date +"%m%d%H%M%S")  # timestamp keeps output dirs of repeated runs distinct
+    OUTPUTS_PATH="outputs/${TASK}/yuyan_bs_${bs}_lr_${lr}_ep_${ep}_${ct}"
+    if [ ! -d "${OUTPUTS_PATH}" ]; then
+        mkdir -p "${OUTPUTS_PATH}"
+    else
+        echo "directory exists, skipping mkdir"
+    fi
+    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+        --task $TASK \
+        --seed 1242 \
+        --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
+        --train-data $TRAIN_DATA \
+        --valid-data $VALID_DATA \
+        --test-data $TEST_DATA \
+        --tokenizer-type BertWordPieceLowerCase \
+        --vocab-file $VOCAB_FILE \
+        --epochs $ep \
+        --tensor-model-parallel-size 8 \
+        --num-layers 48 \
+        --hidden-size 4096 \
+        --num-attention-heads 64 \
+        --micro-batch-size $bs \
+        --lr $lr \
+        --lr-decay-style linear \
+        --lr-warmup-fraction 0.1 \
+        --seq-length 512 \
+        --max-position-embeddings 512 \
+        --log-interval 10 \
+        --eval-interval 600 \
+        --eval-iters 20 \
+        --weight-decay 1.0e-1 \
+        --res-path "${OUTPUTS_PATH}" \
+        --fp16 | tee "${OUTPUTS_PATH}/job.log"
+        # --activations-checkpoint-method uniform \
+done
+done
+done
finetune_ocnli_distributed.sh ADDED
@@ -0,0 +1,60 @@
+#!/bin/bash
+
+WORLD_SIZE=8
+
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6000"
+
+TASK="OCNLI"
+TRAIN_DATA="clue_data/ocnli/train.json"
+VALID_DATA="clue_data/ocnli/dev.json"
+TEST_DATA="clue_data/ocnli/test.json"
+PRETRAINED_CHECKPOINT="./yuyan-10b"
+
+VOCAB_FILE=bert-vocab.txt
+
+# Grid search over learning rate, micro-batch size, and number of epochs.
+for lr in 2e-5 1e-5 7e-6; do
+for bs in 32 16; do
+for ep in 3 5 10 100; do
+    ct=$(date +"%m%d%H%M%S")  # timestamp keeps output dirs of repeated runs distinct
+    OUTPUTS_PATH="outputs/${TASK}/yuyan_bs_${bs}_lr_${lr}_ep_${ep}_${ct}"
+    if [ ! -d "${OUTPUTS_PATH}" ]; then
+        mkdir -p "${OUTPUTS_PATH}"
+    else
+        echo "directory exists, skipping mkdir"
+    fi
+    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+        --task $TASK \
+        --seed 1236 \
+        --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
+        --train-data $TRAIN_DATA \
+        --valid-data $VALID_DATA \
+        --test-data $TEST_DATA \
+        --tokenizer-type BertWordPieceLowerCase \
+        --vocab-file $VOCAB_FILE \
+        --epochs $ep \
+        --tensor-model-parallel-size 8 \
+        --num-layers 48 \
+        --hidden-size 4096 \
+        --num-attention-heads 64 \
+        --micro-batch-size $bs \
+        --lr $lr \
+        --lr-decay-style linear \
+        --lr-warmup-fraction 0.1 \
+        --seq-length 128 \
+        --max-position-embeddings 512 \
+        --log-interval 10 \
+        --eval-interval 800 \
+        --eval-iters 50 \
+        --weight-decay 1.0e-1 \
+        --res-path "${OUTPUTS_PATH}" \
+        --fp16 | tee "${OUTPUTS_PATH}/job.log"
+        # --activations-checkpoint-method uniform \
+done
+done
+done
finetune_wsc_distributed.sh ADDED
@@ -0,0 +1,60 @@
+#!/bin/bash
+
+WORLD_SIZE=8
+
+DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
+                  --nnodes 1 \
+                  --node_rank 0 \
+                  --master_addr localhost \
+                  --master_port 6000"
+
+TASK="WSC"
+TRAIN_DATA="clue_data/wsc/train.json"
+VALID_DATA="clue_data/wsc/dev.json"
+TEST_DATA="clue_data/wsc/test.json"
+PRETRAINED_CHECKPOINT="./yuyan-10b"
+
+VOCAB_FILE=bert-vocab.txt
+
+# Grid search over learning rate, micro-batch size, and number of epochs.
+for lr in 3e-6 5e-6 1e-5; do
+for bs in 8 16 32; do
+for ep in 10 20 30; do
+    ct=$(date +"%m%d%H%M%S")  # timestamp keeps output dirs of repeated runs distinct
+    OUTPUTS_PATH="outputs/${TASK}/yuyan_bs_${bs}_lr_${lr}_ep_${ep}_${ct}"
+    if [ ! -d "${OUTPUTS_PATH}" ]; then
+        mkdir -p "${OUTPUTS_PATH}"
+    else
+        echo "directory exists, skipping mkdir"
+    fi
+    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
+        --task $TASK \
+        --seed 1238 \
+        --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
+        --train-data $TRAIN_DATA \
+        --valid-data $VALID_DATA \
+        --test-data $TEST_DATA \
+        --tokenizer-type BertWordPieceLowerCase \
+        --vocab-file $VOCAB_FILE \
+        --epochs $ep \
+        --tensor-model-parallel-size 8 \
+        --num-layers 48 \
+        --hidden-size 4096 \
+        --num-attention-heads 64 \
+        --micro-batch-size $bs \
+        --lr $lr \
+        --lr-decay-style linear \
+        --lr-warmup-fraction 0.1 \
+        --seq-length 128 \
+        --max-position-embeddings 512 \
+        --log-interval 5 \
+        --eval-interval 50 \
+        --eval-iters 25 \
+        --weight-decay 1.0e-1 \
+        --res-path "${OUTPUTS_PATH}" \
+        --fp16 | tee "${OUTPUTS_PATH}/job.log"
+        # --activations-checkpoint-method uniform \
+done
+done
+done
requirements.txt ADDED
@@ -0,0 +1,25 @@
+apex==0.1
+autopep8==2.0.2
+einops==0.6.1
+faiss==1.5.3
+file_utils==0.0.1
+Flask==1.1.2
+flask_restful==0.3.10
+ftfy==6.1.1
+jieba_fast==0.53
+langdetect==1.0.9
+lsh==0.1.2
+mmcv==2.0.1
+nltk==3.5
+numpy==1.19.2
+Pillow==10.0.0
+regex==2020.11.13
+Requests==2.31.0
+six==1.15.0
+spacy==2.3.2
+timm==0.9.2
+tldextract==3.4.4
+torch==1.8.0a0+1606899
+torchvision==0.9.0a0
+tqdm==4.53.0
+transformers==4.21.1