OneFlow: Where the Data for Computation Comes From
Preface
In the previous post we analyzed how the Runtime starts, focusing on how threads are launched; threads are the abstraction of computation. This post turns to storage: when the Runtime starts, it hands a Plan to the RegstMgr, and the RegstMgr allocates memory according to that Plan.
Recap
At the end of the previous post we saw how a Kernel is finally invoked; the sources and destinations of its data are all stored on the KernelComputeContext. Compute is where the data is ultimately consumed, so it is the destination. Where, then, does the data come from?
// oneflow/user/kernels/add_n_kernel.cpp: 41
void Compute(user_op::KernelComputeContext* ctx) const override {
  size_t in_num = ctx->inputs().size();
  user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
  int64_t n = out->shape().elem_cnt();
  T* out_dptr = out->mut_dptr<T>();
  std::vector<const T*> in_dptrs(in_num);
  for (int32_t i = 0; i < in_num; ++i) {
    in_dptrs.at(i) = ctx->Tensor4ArgNameAndIndex("in", i)->dptr<T>();
  }
  cpu_add<T>(n, out_dptr, in_dptrs);
}
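cpu_add itself is not shown above; conceptually it is just an element-wise sum of all inputs into the output. A minimal sketch of such a helper (my own simplification, not OneFlow's actual implementation):

#include <cstdint>
#include <vector>

// Element-wise n-ary add: out[i] = sum of in[i] over every input pointer.
// Accumulating into a local first keeps this correct even if out aliases
// one of the inputs (add_n may reuse an input buffer as the output).
template<typename T>
void cpu_add(int64_t n, T* out, const std::vector<const T*>& in_dptrs) {
  for (int64_t i = 0; i < n; ++i) {
    T acc = 0;
    for (const T* in : in_dptrs) { acc += in[i]; }
    out[i] = acc;
  }
}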
The add_n Script
First, here is a script that calls add_n; it targets OneFlow 0.5.0. It multiplies two matrices first, then adds all three matrices together.
from oneflow.compatible import single_client as flow
from oneflow.compatible.single_client import linalg as linalg
from oneflow.compatible.single_client.ops import math_ops as math_ops
from oneflow.compatible.single_client import typing as tp
import numpy as np


@flow.global_function()
def matmul(
    x: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
    y: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
) -> tp.Numpy:
    return linalg.matmul(x, y)


@flow.global_function()
def add_n(
    x: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
    y: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
    z: tp.Numpy.Placeholder((3, 3), dtype=flow.float),
) -> tp.Numpy:
    return math_ops.add_n([x, y, z])


if __name__ == '__main__':
    x = np.arange(0, 9).reshape(3, 3).astype(np.float32)
    y = np.arange(9, 18).reshape(3, 3).astype(np.float32)
    z = matmul(x, y)
    print(z)
    a = add_n(x, y, z)
    print(a)
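For reference, with these inputs the two prints should show (exact numpy formatting aside):

z = x @ y:
[[ 42.  45.  48.]
 [150. 162. 174.]
 [258. 279. 300.]]

a = x + y + z:
[[ 51.  56.  61.]
 [165. 179. 193.]
 [279. 302. 325.]]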
Flow Analysis
We will take a bottom-up view of the process.
- The Compute function is where the data is finally consumed. It calls Tensor4ArgNameAndIndex on the KernelComputeContext to obtain its inputs and outputs.
// oneflow/user/kernels/add_n_kernel.cpp: 42
void Compute(user_op::KernelComputeContext* ctx) const override {
  std::cout << std::this_thread::get_id() << std::endl;
  size_t in_num = ctx->inputs().size();
  user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
  int64_t n = out->shape().elem_cnt();
  T* out_dptr = out->mut_dptr<T>();
  std::vector<const T*> in_dptrs(in_num);
  for (int32_t i = 0; i < in_num; ++i) {
    in_dptrs.at(i) = ctx->Tensor4ArgNameAndIndex("in", i)->dptr<T>();
  }
  cpu_add<T>(n, out_dptr, in_dptrs);
}
- The UserKernelComputeContext constructor initializes those BnTensorPair entries, but it does not initialize the BlobTensorView. So where, and when, does the memory behind it get initialized? (See the sketch after the code below for what the map holds right after construction.)
// oneflow/core/kernel/user_kernel.cpp: 457
struct BnTensorPair {
  std::string bn;
  std::unique_ptr<user_op::BlobTensorView> tensor;
};

BnTensorPair MakeBnTensorPair(const std::string& bn) {
  BnTensorPair pair;
  pair.bn = bn;
  return pair;
}

// oneflow/core/kernel/user_kernel.cpp: 478
explicit UserKernelComputeContext(DeviceCtx* device_ctx, const KernelConf& kernel_conf,
                                  const JobDesc& job_desc)
    : user_op_conf_(kernel_conf.op_attribute().op_conf()),
      device_ctx_(device_ctx),
      base_ctx_(kernel_conf, job_desc) {
  auto InitInOrOut = [&](const PbMap<std::string, UserOpConf::ListString>& arg_map) {
    for (const auto& it : arg_map) {
      const std::string& arg_name = it.first;
      for (int32_t i = 0; i < it.second.s_size(); ++i) {
        arg2bn_tensor_pair_.emplace(std::make_pair(arg_name, i),
                                    MakeBnTensorPair(GenRepeatedBn(arg_name, i)));
      }
    }
  };
  InitInOrOut(kernel_conf.op_attribute().op_conf().user_conf().input());
  InitInOrOut(kernel_conf.op_attribute().op_conf().user_conf().output());
  arg2bn_tensor_pair_.emplace(std::make_pair("tmp_buffer", 0),
                              MakeBnTensorPair(GenRepeatedBn("tmp_buffer", 0)));
}
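For the add_n op in our script, this constructor leaves arg2bn_tensor_pair_ holding one entry per (arg_name, index), each carrying a repeated blob name but a still-null tensor view. A self-contained sketch of what it effectively builds, assuming GenRepeatedBn(arg_name, i) yields names of the form "in_0" (the types here are hypothetical stand-ins, for illustration only):

#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-ins for user_op::BlobTensorView and the BnTensorPair above.
struct BlobTensorView {};
struct BnTensorPair {
  std::string bn;
  std::unique_ptr<BlobTensorView> tensor;  // still null right after construction
};

int main() {
  std::map<std::pair<std::string, int32_t>, BnTensorPair> arg2bn_tensor_pair;
  auto add_arg = [&](const std::string& arg_name, int32_t count) {
    for (int32_t i = 0; i < count; ++i) {
      BnTensorPair pair;
      // GenRepeatedBn(arg_name, i) is assumed to produce "<arg_name>_<i>".
      pair.bn = arg_name + "_" + std::to_string(i);
      arg2bn_tensor_pair.emplace(std::make_pair(arg_name, i), std::move(pair));
    }
  };
  add_arg("in", 3);          // add_n([x, y, z]) has three inputs ...
  add_arg("out", 1);         // ... one output ...
  add_arg("tmp_buffer", 1);  // ... plus the implicit tmp_buffer slot
  // Every tensor is still nullptr here; UpdateTensorWithCorrBlob fills them in later.
}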
- UserKernelComputeContext provides an update method, which must be where the initialization and updates happen. The logic goes like this: the caller passes in a function that maps a blob name to a Blob. For every pair stored in the context, take the blob name out of the pair and call the function to get the Blob. If the Blob is a null pointer, reset the tensor view to null. Otherwise, bind the Blob to the tensor view: if the view already exists, update it; if it does not, create it. The return value records whether anything changed.
// oneflow/core/kernel/user_kernel.cpp: 513
bool UpdateTensorWithCorrBlob(const std::function<Blob*(const std::string&)>& BnInOp2Blob) {
  bool updated = false;
  for (auto& pair : arg2bn_tensor_pair_) {
    std::unique_ptr<user_op::BlobTensorView>* arg_tensor_ptr = &pair.second.tensor;
    Blob* blob = BnInOp2Blob(pair.second.bn);
    if (blob == nullptr) {
      if (*arg_tensor_ptr) {
        arg_tensor_ptr->reset(nullptr);
        updated = true;
      }
    } else {
      if (*arg_tensor_ptr) {
        if (arg_tensor_ptr->get()->blob() != blob) {
          arg_tensor_ptr->get()->Reset(blob);
          updated = true;
        } else {
          if (blob->blob_desc().is_dynamic()) { updated = true; }
        }
      } else {
        arg_tensor_ptr->reset(new user_op::BlobTensorView(blob));
        updated = true;
      }
    }
  }
  return updated;
}
- When is UpdateTensorWithCorrBlob called, then? In Lazy mode, it is called from ForwardUserKernel. Note how its return value drives the CUDA Graphs path below: if a graph has already been captured and nothing was updated, the captured graph is simply replayed and Compute is skipped.
// oneflow/core/kernel/user_kernel.cpp: 635
void UserKernel::ForwardUserKernel(const std::function<Blob*(const std::string&)>& BnInOp2Blob,
                                   user_op::OpKernelState* opkernel_state) const {
  const bool updated = ctx_->UpdateTensorWithCorrBlob(BnInOp2Blob);

#ifdef WITH_CUDA_GRAPHS
  bool capturing = false;
  if (cuda_graph_ctx_) {
    if (!cuda_graph_ctx_->IsCapturing()) {
      if (cuda_graph_ctx_->IsCaptured() && (!updated)) {
        cuda_graph_ctx_->Launch();
        return;
      }
      capturing = true;
      cuda_graph_ctx_->BeginCapture();
    }
  }
#endif  // WITH_CUDA_GRAPHS

  kernel_->Compute(ctx_.get(), opkernel_state);

#ifdef WITH_CUDA_GRAPHS
  if (cuda_graph_ctx_ && capturing) {
    cuda_graph_ctx_->EndCapture();
    cuda_graph_ctx_->Launch();
  }
#endif  // WITH_CUDA_GRAPHS
}
- We have already analyzed ForwardUserKernel: it is called from ForwardDataContent, which fires when a Kernel runs forward, that is, when Launch is called, or one level up, when AsyncLaunchKernel is called. The lambda that AsyncLaunchKernel passes in is exactly the function used above to look up the Blob for a blob name. Let's analyze that function.
- Its input is a blob name and its output is a Blob. It first looks up the BlobInfo for the blob name, takes the regst_desc_id from it, uses that id to find the Regst, and finally pulls the Blob out of the Regst. (When the BlobInfo carries a RegstSlot, the Regst is taken from the front of that slot's queue; otherwise the Regst4RegstDescId callback supplies it.)
// oneflow/core/actor/actor.cpp: 470
void Actor::AsyncLaunchKernel(const KernelCtx& kernel_ctx,
                              std::function<Regst*(int64_t)> Regst4RegstDescId) {
  for (const ExecKernel& ek : exec_kernel_vec_) {
    ek.kernel->Launch(kernel_ctx, [&](const std::string& bn_in_op) -> Blob* {
      const auto blob_info_it = ek.bn_in_op2blob_info.find(bn_in_op);
      if (blob_info_it == ek.bn_in_op2blob_info.cend()) { return nullptr; }
      const BlobInfo& info = blob_info_it->second;
      if (info.regst_desc_id == -1) { return nullptr; }
      Regst* regst;
      if (info.rs != nullptr) {
        regst = info.rs->Front(info.regst_desc_id);
      } else {
        regst = Regst4RegstDescId(info.regst_desc_id);
      }
      if (regst == nullptr) { return nullptr; }
      if (info.ordinal >= 0) {
        return regst->GetBlobByOrdinal(info.ordinal);
      } else {
        return regst->GetBlobByLbi(info.lbi);
      }
    });
  }
}
- So the Blob is fetched out of a Regst; where does the Regst's Blob come from? There are two possibilities: a Blob can come from user input, or from the output of the previous Actor. Regst provides SetBlobByOrdinal to install a Blob at a given position. SetBlobByOrdinal is only called by RegstMgr's NewBlobsInOneRegst method, NewBlobsInOneRegst is called from NewRegsts, and NewRegsts is called during Actor Init. This means the memory behind every Blob is already allocated at initialization time; the upstream Actor only needs to write its data, and the downstream Actor will consume it.
The code below is long. It is ordered callee-first, so each method is called by the one that follows it. (A sketch of the pointer arithmetic appears right after NewBlobsInOneRegst.)
// oneflow/core/register/register.cpp: 52
void Regst::SetBlobByOrdinal(int64_t ordinal, std::unique_ptr<Blob>&& blob) {
  CHECK(!sorted_blob_vec_.at(ordinal));
  sorted_blob_vec_.at(ordinal).swap(blob);
}

// oneflow/core/register/register_manager.cpp: 191
void RegstMgr::NewBlobsInOneRegst(const std::vector<LbiBlobDescPair>& lbis, Regst* regst,
                                  const RtRegstDesc* rt_regst_desc, char* main_mem_ptr,
                                  char* separated_header_mem_ptr) {
  size_t separated_header_mem_size = rt_regst_desc->SeparatedHeaderByteSize4OneRegst();
  char* cur_body_pointer = nullptr;
  char* cur_header_pointer = nullptr;
  if (separated_header_mem_size > 0) {
    MemoryCase host_mem_case;
    host_mem_case.mutable_host_mem();
    if (separated_header_mem_ptr == nullptr) {
      separated_header_mem_ptr =
          Global<MemoryAllocator>::Get()->Allocate(host_mem_case, separated_header_mem_size);
    }
    cur_header_pointer = separated_header_mem_ptr;
    cur_body_pointer = main_mem_ptr;
  } else {
    CHECK(separated_header_mem_ptr == nullptr);
    cur_header_pointer = main_mem_ptr;
    if (main_mem_ptr == nullptr) {
      cur_body_pointer = nullptr;
    } else {
      cur_body_pointer =
          main_mem_ptr + rt_regst_desc->GetSoleBlobDesc()->AlignedByteSizeOfBlobHeader();
    }
  }
  regst->set_main_mem_ptr(main_mem_ptr);
  regst->set_separated_header_mem_ptr(separated_header_mem_ptr);
  rt_regst_desc->ForEachBlobDescOffsetInOnRegst([&](int64_t ordinal, const LogicalBlobId& lbi,
                                                    const BlobDesc* blob_desc, int64_t body_offset,
                                                    int64_t header_offset) {
    std::unique_ptr<Blob> blob_ptr;
    if (cur_body_pointer == nullptr) {
      blob_ptr.reset(new Blob(regst->regst_desc()->mem_case(), blob_desc,
                              cur_header_pointer + header_offset, nullptr));
    } else {
      blob_ptr.reset(new Blob(regst->regst_desc()->mem_case(), blob_desc,
                              cur_header_pointer + header_offset, cur_body_pointer + body_offset));
      InitNonPODTypeBlobIfNeed(Global<MemoryAllocator>::Get(), blob_ptr.get());
    }
    regst->SetBlobByOrdinal(ordinal, std::move(blob_ptr));
    const int64_t regst_desc_id = rt_regst_desc->regst_desc_id();
    const auto& parallel_ctx = regst_desc_id2parallel_ctx_.at(regst_desc_id);
    if (parallel_ctx.has_parallel_id()) {
      const int64_t parallel_id = parallel_ctx.parallel_id();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        lbi2parallel_id2blob_[lbi][parallel_id] = regst->GetBlobByOrdinal(ordinal);
      }
    }
  });
}
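The key point in NewBlobsInOneRegst is that, apart from the separated-header fallback, nothing new is allocated: every Blob is constructed as a view at header_offset / body_offset into memory that already exists. A self-contained sketch of that carving pattern, with made-up sizes (not OneFlow's real layout rules):

#include <cstddef>
#include <cstdio>
#include <vector>

// One regst's memory: blob headers packed at a fixed stride, bodies packed
// after them; every blob pointer is just an offset into one arena.
int main() {
  const std::size_t kAlignedHeaderBytes = 64;            // made-up header size
  const std::vector<std::size_t> body_bytes = {36, 36};  // e.g. two 3x3 float blobs

  std::size_t header_total = kAlignedHeaderBytes * body_bytes.size();
  std::size_t body_total = 0;
  for (std::size_t b : body_bytes) { body_total += b; }

  std::vector<char> arena(header_total + body_total);  // "the Plan's allocation"
  char* cur_header_pointer = arena.data();
  char* cur_body_pointer = arena.data() + header_total;

  std::size_t header_offset = 0;
  std::size_t body_offset = 0;
  for (std::size_t b : body_bytes) {
    // This pair of pointers is what each new Blob(...) above captures.
    char* header = cur_header_pointer + header_offset;
    char* body = cur_body_pointer + body_offset;
    std::printf("blob header at +%zu, body at +%zu\n",
                static_cast<std::size_t>(header - arena.data()),
                static_cast<std::size_t>(body - arena.data()));
    header_offset += kAlignedHeaderBytes;
    body_offset += b;
  }
}

In NewRegsts below, main_mem_ptr is then advanced by MainByteSize4OneRegst() after each regst, so each of the register_num() regsts receives its own consecutive slice of the same memory block.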
// oneflow/core/register/register_manager.cpp: 150
void RegstMgr::NewRegsts(const RegstDescProto& regst_desc_proto,
                         std::function<void(Regst*)> OneRegstDone) {
  const int64_t regst_desc_id = regst_desc_proto.regst_desc_id();
  const RegstDescTypeProto& regst_desc_type = regst_desc_proto.regst_desc_type();
  const RtRegstDesc* rt_regst_desc = regst_desc_id2rt_regst_desc_.at(regst_desc_id).get();
  char* main_mem_ptr = nullptr;
  char* separated_header_mem_ptr = nullptr;
  int64_t mem_block_id = regst_desc_proto.mem_block_id();
  int64_t header_block_id = regst_desc_proto.separated_header_mem_block_id();
  if (mem_block_id != -1 && mem_block_id2ptr_.find(mem_block_id) != mem_block_id2ptr_.end()) {
    main_mem_ptr = mem_block_id2ptr_.at(mem_block_id) + regst_desc_proto.mem_block_offset();
  }
  if (header_block_id != -1 && mem_block_id2ptr_.find(header_block_id) != mem_block_id2ptr_.end()) {
    separated_header_mem_ptr = mem_block_id2ptr_.at(header_block_id);
  }
  std::vector<LbiBlobDescPair> lbi_pairs;
  if (regst_desc_type.has_data_regst_desc()) {
    for (const LbiBlobDescPair& pair : regst_desc_type.data_regst_desc().lbi2blob_desc()) {
      lbi_pairs.push_back(pair);
    }
    std::sort(lbi_pairs.begin(), lbi_pairs.end(), &CompareLbiBlobDescPair);
    CHECK(!lbi_pairs.empty());
  }
  for (int64_t i = 0; i < rt_regst_desc->register_num(); ++i) {
    Regst* regst = new Regst;
    regst->set_regst_desc(rt_regst_desc);
    if (regst_desc_type.has_data_regst_desc()) {
      NewBlobsInOneRegst(lbi_pairs, regst, rt_regst_desc, main_mem_ptr, separated_header_mem_ptr);
      if (main_mem_ptr != nullptr) { main_mem_ptr += rt_regst_desc->MainByteSize4OneRegst(); }
      if (separated_header_mem_ptr != nullptr) {
        separated_header_mem_ptr += rt_regst_desc->SeparatedHeaderByteSize4OneRegst();
      }
    } else if (regst_desc_type.has_ctrl_regst_desc()) {
      // do nothing
    } else {
      UNIMPLEMENTED();
    }
    OneRegstDone(regst);
  }
}
// oneflow/core/actor/actor.cpp: 40
void Actor::Init(const JobDesc* job_desc, const TaskProto& task_proto,
                 const ThreadCtx& thread_ctx) {
  job_desc_ = job_desc;
  actor_id_ = task_proto.task_id();
  thrd_id_ = Global<IDMgr>::Get()->ThrdId4ActorId(actor_id_);
  job_id_ = task_proto.job_id();
  InitDeviceCtx(thread_ctx);
  if (task_proto.has_parallel_ctx()) {
    parallel_ctx_.reset(new ParallelContext(task_proto.parallel_ctx()));
  }
  for (const ExecNodeProto& node : task_proto.exec_sequence().exec_node()) {
    ExecKernel ek;
    ek.kernel = ConstructKernel(job_desc_, node.kernel_conf(), device_ctx_.get());
    exec_kernel_vec_.push_back(std::move(ek));
  }
  is_kernel_launch_synchronized_ =
      std::all_of(exec_kernel_vec_.cbegin(), exec_kernel_vec_.cend(),
                  [](const ExecKernel& ek) { return ek.kernel->IsKernelLaunchSynchronized(); });
  if (!is_kernel_launch_synchronized_) { CHECK_EQ(exec_kernel_vec_.size(), 1); }
  remaining_eord_cnt_ = 0;
  msg_handler_ = nullptr;
  eord_regst_desc_ids_.clear();
  for (const auto& pair : task_proto.produced_regst_desc()) {
    Global<RegstMgr>::Get()->NewRegsts(pair.second, [this](Regst* regst) {
      produced_regsts_[regst->regst_desc_id()].emplace_back(regst);
    });
    int64_t regst_desc_id = pair.second.regst_desc_id();
    CHECK(name2regst_desc_id_.insert({pair.first, {regst_desc_id}}).second);
    if (pair.second.regst_desc_type().has_ctrl_regst_desc()) {
      produced_ctrl_regst_desc_ids_.insert(regst_desc_id);
    }
  }
  for (const auto& pair : produced_regsts_) {
    for (const auto& regst : pair.second) { produced_regst2reading_cnt_[regst.get()] = 0; }
  }
  for (const auto& pair : task_proto.consumed_regst_desc_id()) {
    CHECK(name2regst_desc_id_.find(pair.first) == name2regst_desc_id_.end());
    std::vector<int64_t>& regst_desc_id_vec = name2regst_desc_id_[pair.first];
    for (int64_t regst_desc_id : pair.second.regst_desc_id()) {
      regst_desc_id_vec.push_back(regst_desc_id);
    }
    remaining_eord_cnt_ += pair.second.regst_desc_id_size();
    if (pair.first == "in_ctrl") {
      consumed_ctrl_regst_desc_ids_.insert(regst_desc_id_vec.begin(), regst_desc_id_vec.end());
    }
  }
  total_reading_cnt_ = 0;
  is_inplace_consumed_eord_ = false;
  CheckInplaceRegstDescId(task_proto);
  TakeOverInplaceConsumedAndProduced(task_proto.produced_regst_desc());
  is_naive_consumed_eord_ = false;
  TakeOverNaiveConsumed(task_proto.consumed_regst_desc_id());
  TakeOverNaiveProduced(task_proto.produced_regst_desc());
  InitBnInOp2BlobInfo(task_proto);
  VirtualActorInit(task_proto);
}
- Since the downstream Actor automatically consumes whatever the upstream Actor outputs, we really only need to focus on the Actor that receives data from the outside. Data is pushed in through a Push Job, and on the Python side we must provide a function that copies the numpy data into a Blob.
The flow below comes from InferenceSession, which pushes data by launching a Push Job. Python subclasses a C++ class and implements the PushBlob method; when PushBlob is invoked, it wraps the raw pointer into an OfBlob and calls the push callback.
# python/oneflow/compatible/single_client/serving/inference_session.py: 427
def _run_push_jobs(self, **kwargs):
    for (
        input_name,
        push_job_name,
    ) in self.inter_user_job_info_.input_or_var_op_name2push_job_name.items():
        if input_name not in kwargs:
            raise ValueError('input "{}" is absent'.format(input_name))
        input_numpy = kwargs[input_name]
        if not isinstance(input_numpy, np.ndarray):
            raise ValueError('input "{}" requires numpy.ndarray'.format(input_name))
        push_fn = input_blob_util._MakePushNdarrayCallback(input_numpy)
        push_job_inst = job_instance_util.MakePushJobInstance(
            push_job_name, input_name, push_fn
        )
        self._run_job(push_job_inst)

# python/oneflow/compatible/single_client/framework/input_blob_def.py: 249
def _MakePushNdarrayCallback(ndarray):
    copied = np.copy(ndarray, order="C")

    def Copy(ofblob):
        capacity = reduce(lambda x, y: x * y, ofblob.static_shape, 1)
        elem_cnt = reduce(lambda x, y: x * y, copied.shape, 1)
        assert elem_cnt <= capacity, "%s v.s. %s" % (copied.shape, ofblob.static_shape)
        ofblob.CopyFromNdarray(copied)

    return Copy

# python/oneflow/compatible/single_client/framework/job_instance.py: 106
def PushBlob(self, of_blob_ptr):
    try:
        self.push_cb_(ofblob.OfBlob(of_blob_ptr))
    except Exception as e:
        print(traceback.format_exc())
        raise e
- When does the push job instance get launched? First, Python calls a C++ API to hand over the job instance, which is placed into the Buffer Manager to wait until it is taken out. At the end of LaunchJob, the kBufferNameGlobalWaitJobId buffer is fetched from the Buffer Manager and the job id is sent to it, which kicks off the foreign input job. The ForeignInputKernel then runs: ForwardDataContent is called, an OfBlob is constructed and handed to the job instance, and the instance pushes the data into the blob. (A sketch of the Send/TryReceive handoff follows the kernel code below.)
// oneflow/api/python/framework/framework.h: 67
inline Maybe<void> LaunchJob(const std::shared_ptr<oneflow::JobInstance>& cb) {
  CHECK_OR_RETURN(GlobalProcessCtx::IsThisProcessMaster());
  CHECK_NOTNULL_OR_RETURN(Global<Oneflow>::Get());
  const auto& job_name = cb->job_name();
  auto* buffer_mgr = Global<BufferMgr<std::shared_ptr<JobInstance>>>::Get();
  int64_t job_id = Global<JobName2JobId>::Get()->at(job_name);
  if (IsPullJob(job_name, *Global<InterUserJobInfo>::Get())) {
    buffer_mgr->Get(GetForeignOutputBufferName(job_name))->Send(cb);
  }
  if (IsPushJob(job_name, *Global<InterUserJobInfo>::Get())) {
    buffer_mgr->Get(GetForeignInputBufferName(job_name))->Send(cb);
  }
  buffer_mgr->Get(GetCallbackNotifierBufferName(job_name))->Send(cb);
  Global<BufferMgr<int64_t>>::Get()->Get(kBufferNameGlobalWaitJobId)->Send(job_id);
  return Maybe<void>::Ok();
}
// oneflow/core/kernel/foreign_input_kernel.cpp: 23
void ForeignInputKernel::ForwardDataContent(
    const KernelCtx& ctx, const std::function<Blob*(const std::string&)>& BnInOp2Blob) const {
  const auto& buffer_name = op_conf().foreign_input_conf().ofblob_buffer_name();
  std::shared_ptr<JobInstance> foreign_job_instance;
  BufferStatus buffer_status = Global<BufferMgr<std::shared_ptr<JobInstance>>>::Get()
                                   ->Get(buffer_name)
                                   ->TryReceive(&foreign_job_instance);
  CHECK_NE(buffer_status, kBufferStatusEmpty);
  if (buffer_status == kBufferStatusSuccess) {
    OfBlob ofblob(ctx.device_ctx, BnInOp2Blob("out"));
    foreign_job_instance->PushBlob(reinterpret_cast<uint64_t>(&ofblob));
  }
}
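The Buffer used on both sides behaves like a small thread-safe channel: LaunchJob Sends the JobInstance from the Python-facing thread, and the kernel TryReceives it on the actor thread. A minimal sketch of that handoff pattern (a simplification; the real Buffer additionally handles capacity limits and a closed status, which is why ForwardDataContent can see a status that is neither success nor empty):

#include <deque>
#include <mutex>

enum BufferStatus { kBufferStatusSuccess, kBufferStatusEmpty };

// Minimal unbounded Send/TryReceive channel; for illustration only.
template<typename T>
class SimpleBuffer {
 public:
  void Send(const T& item) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(item);
  }
  BufferStatus TryReceive(T* item) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) { return kBufferStatusEmpty; }
    *item = queue_.front();
    queue_.pop_front();
    return kBufferStatusSuccess;
  }

 private:
  std::mutex mutex_;
  std::deque<T> queue_;
};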
Summary
This post gave a rough account of where the data needed by computation comes from and where it goes. Memory is allocated according to the Plan when the runtime starts; once the upstream Actor has produced its output, the downstream Actor can use that output for its own computation. When Python needs to push front-end data over to the C++ side, a JobInstance carries the push.
One detail was left vague here, though. During LaunchJob, an id is sent through the BufferMgr. How is that id received, and how does receiving it launch the corresponding Job? The next post will analyze that.