AICC大模型Cheat Sheet

返回列表

AICC大模型Cheat Sheet

2023-09-13

浏览量：6143

# 基础术语 * LLM：大语言模型（Large Language Model） * AIGC：人工智能内容创作（AI Generate Content） * AICC：人工智能计算中心（AI Computing Center），全国各地已建设（或在建）20+AICC * B：Billions，十亿，通常用来表述模型参数规模 * DP：数据并行（Data Parallel），将模型复制多份，并行计算 * DDP：分布式数据并行（Distributed Data Parallel） * PP：流水线并行（Pipeline Parallel），不同的NPU/GPU存放一个LLM的不同layer，流水线执行 * TP：张量并行（Tensor Parallel），不同的NPU/GPU存放一个LLM的相同Layer，共同完成一个超大张量的运算（通常在同一个节点上的NPU/GPU） * MOE：混合专家系统（Mixture of Experts） * ZeRO：Zero Redundancy Data Parallelism，一种HBM优化策略 # 进阶术语 * Ascend：昇腾芯片，华为自研芯片，对标NVIDIA芯片 * Ascend NPU：昇腾NPU芯片，目前主流的有32GB HBM和64GB HBM版本 * NPU：深度学习处理器（Neural Process Unit），对标GPU * CANN：昇腾芯片使能层，对标NVIDIA CUDA编程接口 * MindSpore（MS）：昇思框架，对标PyTorch/TensorFlow * MindFormer（MindSpore Transformer）：基于MindSpore（昇思框架）的Transformer大模型使能套件（[MindSpore/mindformers](https://gitee.com/mindspore/mindformers)） * Euler：欧拉操作系统 * Kunpeng：鲲鹏CPU芯片，ARM架构，在人工智能业务里面通常无感知 * ACL（Ascend Computing Language）：昇腾上的C++异构计算编程接口，python上亦有对应的pyACL接口 * BMS：裸金属服务器（Bare Metal Server），计算中心上的一种算力开放形式 * ModelArts：一种基于AICC/公有云的高度封装的AI服务 * 静态图模式：一种MindSpore的执行模式（与动态图模式对应），训练性能较高，但难以调试 * 动态图模式：一种MindSpore的执行模式（与静态图模式对应），易于调试，但训练性能较差 * AscendSpeed：DeepSpeed在Ascend芯片上的实现 * ModelZoo：昇腾社区上的一个页面（[开发者主页-昇腾社区](https://www.hiascend.com/software/modelzoo)），提供目前Ascend上的一些已经实现或者适配的模型 * OBS：对象存储服务（Object Storage Service），用于存储训练过程中的代码、脚本及中间过程文件等 # 常用信息 ## 模型参数量计算方式 * P：参数量 * l：模型层数 * h：隐藏层维度 P=12lh^2 (忽略Embedding层等信息) 例 | 实际参数量 | h | l | 12*lh^2 | | ---------- | ---- | -- | -------------- | | 6.7B | 4096 | 32 | 6,442,450,944 | | 13B | 5120 | 40 | 12,582,912,000 | | 32.5B | 6656 | 60 | 31,897,681,920 | | 65B | 8192 | 80 | 64,424,509,440 | ## 模型文件大小（不含优化器状态）计算方式 * S：模型大小（单位为GB） * P：模型参数量（单位为B） S=2P 例如，13B参数模型大小约为26GB ## 推理模型大小计算方式同模型文件大小计算方式，S=2P ## 训练模型大小（含优化器状态，显存HBM占用）计算方式 * S\_train：训练模型占用显存，单位为GB * P：模型参数量（单位为B） S\_train=20P 例如，13B参数模型训练占用HBM约为260GB ## 中间激活值大小计算方式除了可训练参数及可训练参数相关的梯度、优化器状态以外（这部分参数的大小只和模型有关），激活（activation）占据了大量现存，激活指的是训练过程中得到的，需要在反向传播过程中使用的各张量（可以通过ZeRO方法优化activation这部分HBM占用），如Attention结构中Softmax的结果等。对于l层transformer模型，中间激活层占用的现存可以近似为l\*(34bsh+5bs^2\*a)，其中b表示batchsize，s表示sequence length，a表示注意力头数，不难发现，激活占用的HBM与b和s强相关。以GPT3为例（s为2048），按照上面公式计算，175B权重约为350GB，训练时占用HBM为3.5TB，当b=1时，激活占用275GB，当b=64时，激活占用达到了恐怖的17.6TB，是模型参数的50倍。 ## 训练时间计算方式 * T：训练时间 * n_token：token数量 * P：模型参数量 * n_GPU：NPU/GPU数量 * p_GPU：NPU/GPU峰值算力（FP16/BF16） * usage：GPU利用率（在百卡以上集群，利用率约为30%~50%） T=8 \* n_token \* P/(n_GPU \* p_GPU \* usage) 这里直接引用原始文章中对于GPT和LLaMA训练过程的分析 ![652f821141076.png](/uploads/images/20231027/653b821ecb745.png) PS: 这里的Reference[4]和Reference[7]分别是 Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models[J]. arXiv preprint arXiv:2302.13971, 2023. Narayanan D, Shoeybi M, Casper J, et al. Efficient large-scale language model training on gpu clusters using megatron-lm[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021: 1-15. ## 3D并行计算公式 ``` ``` num_NPUs = data_parallel_size * pipeline_parallel_size * tensor_parallel_size ``` ## global_batch_size VS micro_batch_size ``` global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps ``` gradient_accumulation_steps:累积forward+backward的次数以执行一次optimizer（the number of forward+backward before you step the optimizer）。当gradient_accumulation_steps=1时，有global_batch_size = micro_batch_size * data_parallel_size ## 吞吐率samples_per_second_per_piece（单位samples/s/p） ``` batch_size = micro_batch_size * num_micro_batches * data_parallel_size samples_per_second = batch_size / elapsed_time_per_batch samples_per_second_per_piece = samples_per_second / num_NPUs ``` ## 吞吐率tokens_per_second_per_piece (单位tokens/s/p) ``` tokens_per_second_per_piece = samples_per_second_per_piece * sequence_length ``` ## 浮点操作数TFLOPs_per_second(单位TFLOPs/s) 这里采用AscendSpeed仓提供的计算方式（reference: ascendspeed/utils.py/throughput_calculator），TFLOPs_per_second可以较好的表示在整个集群上每秒发生的计算数量（单位是TFLOPS），从而根据整体GPU的算力规模计算算力的使用率 ``` flops_per_iteration = (24 * checkpoint_activations_factor * batch_size * seq_len * num_layers * (hidden_size**2)) * (1. + (seq_len / (6. * hidden_size)) + (vocab_size / (16. * num_layers * hidden_size))) TFLOPs_per_second = flops_per_iteration / (elapsed_time_per_iter * args.world_size * (10**12)) ``` * checkpoint_activations_factor: 参考[链接](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html)，checkpoint_activations技术本质上是一种用计算换空间（HBM）的方法，如启用则在上面公式将此参数设置为4，如不启动则为3。 # 参考链接 [[S12]AICC大模型Cheat Sheet](https://zhuanlan.zhihu.com/p/656105169) [基本介绍 - MindSpore master documentation](https://www.mindspore.cn/tutorials/zh-CN/r2.1/beginner/introduction.html) [LLM Parameter Counting | kipply's blog](https://kipp.ly/transformer-param-count/) [回旋托马斯x：分析transformer模型的参数量、计算量、中间激活、KV cache](https://zhuanlan.zhihu.com/p/624740065) [JMXGODLZ：LLM训练指南(二):模型参数、计算量、显存、计算时间计算](https://zhuanlan.zhihu.com/p/639872915) ```