# Import the BytePS module
import torch
import byteps.torch as bps
# Initialize BytePS
bps.init()
# Select the GPU used by this training process
torch.cuda.set_device(bps.local_rank())
local_rank:
"""
A function that returns the local BytePS rank of the calling process, within the node that it is running on. For example, if there are seven processes running on a node, their local ranks will be zero through six, inclusive.

Returns:
  An integer scalar with the local BytePS rank of the calling process.
"""
# During push and pull, compress 32-bit gradients to 16 bits. (Note: how is the loss of precision handled?)
compression = bps.Compression.fp16 if args.fp16_pushpull else bps.Compression.none
compression: """Optional gradient compression algorithm used during push_pull."""
Compression.fp16: """Compress all floating point gradients to 16-bit."""
Note: is this all that "gradient compression" means here?
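To make the precision question above concrete, here is a minimal sketch of what casting a gradient to 16 bits and back does to its value and to the payload size. It uses only the standard `struct` module (format `e` is IEEE 754 half precision); the helper names `fp16_roundtrip` and `payload_bytes` are made up for illustration and are not part of BytePS. It does not show how BytePS itself mitigates the loss.

```python
import struct


def fp16_roundtrip(value):
    """Pack a float as IEEE 754 half precision, then unpack it again."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def payload_bytes(num_values, fmt):
    """Size in bytes of num_values gradients in struct format fmt ('f' or 'e')."""
    return struct.calcsize(f"<{num_values}{fmt}")


if __name__ == "__main__":
    # Values with few significant bits survive; tiny values underflow to zero.
    for grad in (0.5, 0.1, 1e-3, 1e-8):
        print(grad, "->", fp16_roundtrip(grad))
    print(payload_bytes(1000, "f"), "bytes as fp32 vs",
          payload_bytes(1000, "e"), "bytes as fp16")
```

The payload is halved, but gradients below the half-precision range collapse to zero, which is why mixed-precision training setups commonly pair fp16 communication with some form of loss scaling.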
# Wrap the optimizer with the BytePS DistributedOptimizer
optimizer = bps.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=compression)
bps.broadcast_parameters(model.state_dict(), root_rank=0)
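root_rank: The rank of the process from which parameters will be broadcasted to all other processes. Note: is root_rank a local or a global rank? The docstring speaks of "the rank of the process", which matches rank() (the global process index) rather than local_rank(), so it appears to be global; it is usually process 0.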
push_pull_async:
"""
A function that performs asynchronous averaging or summation of the input tensor over all the BytePS processes. The input tensor is not modified.

The reduction operation is keyed by the name. If name is not provided, an incremented auto-generated name is used. The tensor type and shape must be the same on all BytePS processes for a given name. The reduction will not start until all processes are ready to send and receive the tensor.

Arguments:
  tensor: A tensor to average or sum.
  average: A flag indicating whether to compute average or summation, defaults to average.
  name: A name of the reduction operation.

Returns:
  A handle to the push_pull operation that can be used with `poll()` or `synchronize()`.
"""
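The reduction described in this docstring can be pictured with a small single-process simulation. This is only a sketch of the semantics (element-wise sum or average across workers, identical shapes required, inputs left unmodified); the function `simulated_push_pull` is hypothetical and does not involve any real servers, handles, or asynchrony.

```python
def simulated_push_pull(worker_tensors, average=True):
    """Reduce one same-shaped tensor (a list of floats) per worker.

    Returns a new list; the input lists are not modified.
    """
    if not worker_tensors:
        raise ValueError("need at least one worker tensor")
    if len({len(t) for t in worker_tensors}) != 1:
        raise ValueError("tensor shape must be the same on all workers")

    # "Push": the server side sums the contributions element-wise.
    summed = [sum(column) for column in zip(*worker_tensors)]

    # "Pull": every worker receives the same reduced result.
    if average:
        return [s / len(worker_tensors) for s in summed]
    return summed


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 6.0]]  # one gradient tensor per worker
    print(simulated_push_pull(grads))                  # average
    print(simulated_push_pull(grads, average=False))   # summation
```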
bps.broadcast_optimizer_state(optimizer, root_rank=0)
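Workers do not communicate with each other, and neither do servers. (Note: Mu Li's paper says that Parameter Servers do communicate with each other, for replication and fault tolerance.)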
rank: # A function that returns the BytePS rank of the calling process. Note: this is the global process index, typically used to control log printing.
size: # A function that returns the number of BytePS processes.
local_size: # A function that returns the number of BytePS processes within the node the current process is running on.
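As a way to relate rank, size, local_rank and local_size, the sketch below enumerates the processes of a small job. It assumes every node runs the same number of processes and that global ranks are assigned node by node (rank = node index * local_size + local_rank); that numbering is an assumption made for illustration, not something these notes confirm. The function `process_layout` is hypothetical.

```python
def process_layout(num_nodes, local_size):
    """List every process with its assumed global rank and local rank."""
    layout = []
    for node in range(num_nodes):
        for local_rank in range(local_size):
            layout.append({
                "rank": node * local_size + local_rank,  # assumed numbering
                "local_rank": local_rank,
                "node": node,
            })
    return layout


if __name__ == "__main__":
    layout = process_layout(num_nodes=2, local_size=4)
    print("size =", len(layout))
    for proc in layout:
        # A common pattern: only the process with global rank 0 writes logs.
        tag = " (logs)" if proc["rank"] == 0 else ""
        print(proc, tag)
```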
"""
An optimizer that wraps another torch.optim.Optimizer, using a push_pull to average gradient values before applying gradients to model weights.

push_pull operations are executed after each gradient is computed by `loss.backward()` in parallel with each other. The `step()` method ensures that all push_pull operations are finished before applying gradients to the model.

DistributedOptimizer exposes the `synchronize()` method, which forces push_pull operations to finish before continuing the execution. It's useful in conjunction with gradient clipping, or other operations that modify gradients in place before `step()` is executed.

Example of gradient clipping:
```
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.synchronize()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()
```

Arguments:
  optimizer: Optimizer to use for computing gradients and applying updates.
  named_parameters: A mapping between parameter names and values. Used for naming of push_pull operations. Typically just `model.named_parameters()`.
  compression: Compression algorithm used during push_pull to reduce the amount of data sent during each parameter update step. Defaults to not using compression.
  backward_passes_per_step: Number of expected backward passes to perform before calling step()/synchronize(). This allows accumulating gradients over multiple mini-batches before executing averaging and applying them.
"""
# We dynamically create a new class that inherits from the optimizer that was passed in.
# The goal is to override the `step()` method with a push_pull implementation.
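The closing comment above (dynamically creating a class that inherits from the optimizer that was passed in) can be illustrated without torch. The sketch below is not the BytePS implementation: `ToySGD`, `_DistributedMixin`, `make_distributed` and the `reduce_fn` callback are all hypothetical stand-ins, and the "push_pull" is replaced by a plain function that rewrites the gradients.

```python
class ToySGD:
    """Hypothetical stand-in for a torch.optim optimizer."""

    def __init__(self, params, grads, lr):
        self.params, self.grads, self.lr = params, grads, lr

    def step(self):
        self.params = [p - self.lr * g for p, g in zip(self.params, self.grads)]


class _DistributedMixin:
    def synchronize(self):
        # Stand-in for waiting on outstanding push_pull handles.
        self.grads = self._reduce(self.grads)
        self._synced = True

    def step(self):
        # Make sure gradients are reduced before the wrapped step() runs.
        if not self._synced:
            self.synchronize()
        self._synced = False
        return super().step()


def make_distributed(optimizer, reduce_fn):
    """Build a subclass of type(optimizer) at runtime and re-home its state."""
    base = optimizer.__class__
    cls = type(base.__name__, (_DistributedMixin, base), {})
    wrapped = cls.__new__(cls)
    wrapped.__dict__.update(optimizer.__dict__)
    wrapped._reduce = reduce_fn
    wrapped._synced = False
    return wrapped


if __name__ == "__main__":
    opt = ToySGD(params=[1.0], grads=[2.0], lr=0.5)
    # Pretend a second worker contributed a zero gradient: the average halves ours.
    dist = make_distributed(opt, reduce_fn=lambda grads: [g / 2 for g in grads])
    dist.step()
    print(dist.params, isinstance(dist, ToySGD))
```

Because the generated class still inherits from the original optimizer class, code that checks `isinstance(optimizer, ToySGD)` or calls other optimizer methods keeps working, which is presumably the motivation for this design.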
common/: Python wrappers for the basic APIs, such as BytePSBasics local_rank.
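shared_memory.h and shared memory: used to store tensors that live in CPU memory. (Note: it uses the POSIX API, i.e. shared-memory objects opened with shm_open and mapped with mmap, which is more portable than the System V API.)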
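For a feel of the shm_open plus mmap pattern mentioned in the note, the standard `multiprocessing.shared_memory` module (Python 3.8+) wraps the same POSIX calls on POSIX systems. The sketch below writes a small float32 "tensor" into a named segment and reads it back through a second attachment, as another process would. It is an analogy only; the helper `roundtrip_through_shm` is hypothetical and unrelated to the BytePS C++ code.

```python
import struct
from multiprocessing import shared_memory


def roundtrip_through_shm(values):
    """Write float32 values to a shared-memory segment and read them back."""
    fmt = f"{len(values)}f"
    owner = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    try:
        # Producer side: copy the tensor into the mapped region.
        struct.pack_into(fmt, owner.buf, 0, *values)

        # Consumer side: attach to the same segment by name.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            return list(struct.unpack_from(fmt, reader.buf, 0))
        finally:
            reader.close()
    finally:
        owner.close()
        owner.unlink()  # remove the segment once nobody needs it


if __name__ == "__main__":
    print(roundtrip_through_shm([1.0, 2.5, -3.0]))
```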
// Total key space is 0 to 2^64 - 1
// It will be divided among N PS servers; for now we assume N <= 2^16
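One simple way to divide a 64-bit key space among N servers is to give each server a contiguous, equally sized range. The sketch below assumes exactly that scheme for illustration; the notes do not say which assignment BytePS or ps-lite actually uses, and the helpers `server_key_ranges` and `server_for_key` are hypothetical.

```python
KEY_SPACE = 2 ** 64      # keys run from 0 to 2^64 - 1
MAX_SERVERS = 2 ** 16    # the quoted comment assumes N <= 2^16


def server_key_ranges(num_servers):
    """Return half-open [begin, end) key ranges, one per server."""
    if not 1 <= num_servers <= MAX_SERVERS:
        raise ValueError("num_servers must be between 1 and 2^16")
    width = KEY_SPACE // num_servers
    ranges = []
    for i in range(num_servers):
        # The last server absorbs the remainder when the division is not exact.
        end = KEY_SPACE if i == num_servers - 1 else (i + 1) * width
        ranges.append((i * width, end))
    return ranges


def server_for_key(key, num_servers):
    """Index of the server whose range contains key."""
    width = KEY_SPACE // num_servers
    return min(key // width, num_servers - 1)


if __name__ == "__main__":
    for begin, end in server_key_ranges(4):
        print(hex(begin), hex(end))
```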
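ps::KVWorker: inherits from SimpleApp; used to Push key-value data to the servers or Pull it from them. It also provides a Wait function.
ps is_recovery: whether the node is a recovered one. (Note: so a node may have dropped out partway through?)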
ZPush/ZPull: zero-copy Push/Pull. These functions are similar to Push/Pull except that the data is not copied into the system, for better performance. It is the caller's responsibility to keep the content unchanged until the operation has actually finished.
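The caller's responsibility under zero-copy can be seen in a plain Python analogy: a copying send takes a snapshot, while a zero-copy send only keeps a view of the caller's buffer, so later writes by the caller leak into the data still waiting to be sent. The names `send_with_copy` and `send_zero_copy` are hypothetical and only mimic the difference between Push and ZPush.

```python
def send_with_copy(buf):
    """Like Push: the system takes its own copy of the data."""
    return bytes(buf)


def send_zero_copy(buf):
    """Like ZPush: the system keeps a view of the caller's memory."""
    return memoryview(buf)


if __name__ == "__main__":
    gradients = bytearray(b"grad")
    copied = send_with_copy(gradients)
    shared = send_zero_copy(gradients)

    # The caller overwrites the buffer before the "send" has finished.
    gradients[0] = ord("X")

    print(copied)          # the snapshot is unaffected
    print(bytes(shared))   # the zero-copy view sees the modification
```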
Tensor Partition
key = declared_key * 2^16 + part_num
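The key formula above packs the partition index into the low 16 bits of the key. A small sketch of encoding and decoding follows; the function names are hypothetical, and the bounds checks (part_num below 2^16, final key below 2^64) are inferred from the formula and the key-space comment rather than taken from the BytePS source.

```python
PART_BITS = 16


def encode_key(declared_key, part_num):
    """key = declared_key * 2^16 + part_num."""
    if not 0 <= part_num < 2 ** PART_BITS:
        raise ValueError("part_num must fit in 16 bits")
    key = declared_key * 2 ** PART_BITS + part_num
    if not 0 <= key < 2 ** 64:
        raise ValueError("key falls outside the 64-bit key space")
    return key


def decode_key(key):
    """Recover (declared_key, part_num) from a key."""
    return divmod(key, 2 ** PART_BITS)


if __name__ == "__main__":
    # A tensor declared with key 3 and split into three partitions.
    keys = [encode_key(3, part) for part in range(3)]
    print(keys, [decode_key(k) for k in keys])
```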
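cudaHostRegister: registers host memory as pinned memory for use with CUDA, so that a CPU->GPU transfer needs only one copy. (Note: pinned memory means page-locked, non-pageable memory. Its pages stay resident in physical RAM and are never swapped out to disk, so accessing them never incurs a page fault caused by swapping.)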
numa_max_node() returns the highest node number available on the current system. (See the node numbers in /sys/devices/system/node/ ). Also see numa_num_configured_nodes().

numa_set_preferred() sets the preferred node for the current task to node. The system will attempt to allocate memory from the preferred node, but will fall back to other nodes if no memory is available on the preferred node. Passing a node of -1 argument specifies local allocation and is equivalent to calling numa_set_localalloc().

numa_set_interleave_mask() sets the memory interleave mask for the current task to nodemask. All new memory allocations are page interleaved over all nodes in the interleave mask. Interleaving can be turned off again by passing an empty mask (numa_no_nodes). The page interleaving only occurs on the actual page fault that puts a new page into the current address space. It is also only a hint: the kernel will fall back to other nodes if no memory is available on the interleave target.