Notes from reading the open-source CINN framework, recorded in question-and-answer form.

Q: Where is the entry point for subgraph compilation in CINN?

  for (const auto& node_vec : clusters) {  // <------- iterate over each subgraph (cluster)
    // Classify var node to inputs, outputs, and internals.
    GraphNodeSet cluster_set(node_vec.begin(), node_vec.end());
    GraphNodeSet cluster_inputs, cluster_outputs, cluster_internals;
    AnalyseClusterVariables(cluster_set,
                            deny_var_set,
                            &cluster_inputs,
                            &cluster_outputs,
                            &cluster_internals,
                            is_inference_stage,
                            all_skip_gc_vars);
    auto subgraph = CreateNewSubGraph(
        cluster_set, cluster_internals, cluster_inputs, cluster_outputs);
    if (graph->Has(kSkipGcVarNames)) {
      auto& sub_skip_gc_vars =
          subgraph->GetOrInit<std::unordered_set<std::string>>(kSkipGcVarNames);
      sub_skip_gc_vars = all_skip_gc_vars;
    }
    // <------ register the subgraph (it may still contain dynamic shapes, i.e. -1 dims)
    auto compilation_key = cinn_compiler->AddGraph(std::move(subgraph));
    VLOG(4) << "Compilation Key:\n"
            << cinn_compiler->ReadableKey(compilation_key);
    // Replace the found cluster to a new cinn op node
    // <------- the new op carries compilation_key; compilation and caching of
    //          each subgraph are triggered later through this key
    ReplaceSubGraphWithCinnOpNode(cluster_set,
                                  cluster_inputs,
                                  cluster_outputs,
                                  cluster_internals,
                                  compilation_key,
                                  graph);
  }

Q: What does AddGraph do?

int64_t CinnCompiler::AddGraph(std::unique_ptr<Graph> graph) {
  int64_t graph_key = std::hash<Graph *>()((&(*graph)));
  // <------ the native static graph stored here may still contain -1 dims
  graphs_[graph_key] = std::move(graph);
  return graph_key;
}

// After a graph is added, the matching cluster in the original Graph is replaced with a single [cinn_launch] op.
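Because the key comes from std::hash<Graph *>, it identifies the graph object rather than its contents. The sketch below illustrates that property with hypothetical stand-ins (ToyGraph and ToyGraphRegistry are not CINN types): two graphs with identical contents still receive different keys.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ir::Graph; only used to illustrate keying.
struct ToyGraph {
  std::string name;
};

// Hypothetical registry that mirrors the shape of CinnCompiler::AddGraph.
class ToyGraphRegistry {
 public:
  int64_t AddGraph(std::unique_ptr<ToyGraph> graph) {
    // The key is derived from the object's address, not from its contents.
    int64_t key = static_cast<int64_t>(std::hash<ToyGraph*>()(graph.get()));
    graphs_[key] = std::move(graph);
    return key;
  }

  const ToyGraph& FindGraph(int64_t key) const {
    auto it = graphs_.find(key);
    assert(it != graphs_.end() && "unknown compilation key");
    return *it->second;
  }

 private:
  std::unordered_map<int64_t, std::unique_ptr<ToyGraph>> graphs_;
};
```

An address-derived key is cheap to compute and is enough for the launch op to find its own subgraph, but on its own it cannot recognise an equivalent graph built elsewhere, which is what motivates the structure-based key discussed next.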

Q: Can subgraph compilation results be reused across different Programs in CINN? Is the hash key coupled to var_name?

size_t CinnCacheKeyByStructure::HashGraph(const ir::Graph& graph) {
  // sort grad node by name and id.
  auto compare = [](ir::Node* n1, ir::Node* n2) {
    return (n1->Name() == n2->Name()) ? (n1->id() < n2->id())
                                      : (n1->Name() < n2->Name());
  };
  // graph.Nodes() return unordered_set, here using set to avoid the same graph
  // may return different result
  std::set<ir::Node*, bool (*)(ir::Node*, ir::Node*)> node_set(compare),
      output_set(compare);
  node_set.insert(graph.Nodes().begin(), graph.Nodes().end());
  std::string hash_str;
  for (ir::Node* n : node_set) {
    hash_str.append(n->Name());
    output_set.clear();
    output_set.insert(n->outputs.begin(), n->outputs.end());
    for (auto* out : output_set) {
      hash_str.append(out->Name());  // <------ coupled to the var_name values in the graph
    }
  }
  VLOG(1) << "The hash graph:\n" << hash_str;
  size_t hash_val = std::hash<std::string>()(hash_str);
  VLOG(4) << "The graph's hash value by graph structure is: " << hash_val;
  return hash_val;
}

A concrete example of the string that gets hashed (hash_str), taken from BERT: cumsumcumsum_0.tmp_0cumsum_0.tmp_0elementwise_subelementwise_subtmp_0feedinput_idsfetchfill_any_likefull_like_0.tmp_0full_like_0.tmp_0cumsumelementwise_subinput_idsfill_any_liketmp_0fetch
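To see why this string is sensitive to variable names, here is a minimal sketch of the same concatenation scheme. ToyNode and BuildHashString are hypothetical simplifications: nodes are ordered by name only (the real comparator also uses the node id), and outputs are plain strings.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified node: a name plus the names of its output nodes.
struct ToyNode {
  std::string name;
  std::vector<std::string> outputs;
};

// Builds the string that would be hashed: node names in sorted order,
// each followed by its own output names in sorted order.
std::string BuildHashString(std::vector<ToyNode> nodes) {
  std::sort(nodes.begin(), nodes.end(),
            [](const ToyNode& a, const ToyNode& b) { return a.name < b.name; });
  std::string hash_str;
  for (ToyNode& n : nodes) {
    std::sort(n.outputs.begin(), n.outputs.end());
    hash_str += n.name;
    for (const std::string& out : n.outputs) hash_str += out;
  }
  return hash_str;
}
```

For a graph x -> relu -> x_out this yields "relux_outxrelux_out" regardless of the order in which nodes are supplied, while the same topology with variables named y and y_out yields a different string. Under this scheme, two Programs would share a structural hash only if their variable names also match.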

size_t CinnCacheKey::Hash::operator()(const CinnCacheKey& key) const {
  std::ostringstream has_str;
  for (const auto& name_shape : key.input_shapes_) {  // <------- input shape information
    has_str << name_shape.first;
    has_str << std::hash<phi::DDim>()(name_shape.second);
  }
  has_str << key.graph_hash_val_;  // graph structure information
  has_str << key.arch_str_;        // target information
  return std::hash<std::string>()(has_str.str());
}
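The key therefore has three ingredients: input names and shapes, the graph hash, and the target architecture. The sketch below assembles a comparable key with hypothetical types (ToyCacheKey, KeyString, HashKey); unlike the excerpt, it prints the dimensions directly instead of hashing a phi::DDim, which makes the resulting string easier to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical key: input name -> concrete shape, plus graph hash and target.
struct ToyCacheKey {
  std::map<std::string, std::vector<int64_t>> input_shapes;
  std::size_t graph_hash_val = 0;
  std::string arch_str;
};

// Serialises the three ingredients into one string (assumption: dims are
// written out literally rather than hashed as in the excerpt).
std::string KeyString(const ToyCacheKey& key) {
  std::ostringstream os;
  for (const auto& name_shape : key.input_shapes) {
    os << name_shape.first << '[';
    for (int64_t d : name_shape.second) os << d << ',';
    os << ']';
  }
  os << key.graph_hash_val << key.arch_str;
  return os.str();
}

std::size_t HashKey(const ToyCacheKey& key) {
  return std::hash<std::string>()(KeyString(key));
}
```

With this layout, changing only the batch dimension of an input produces a different key string, so each distinct concrete input shape would map to its own cache entry.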

Q: When does the main framework trigger compilation?

template <typename DeviceContext, typename T>
class CinnLaunchOpKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    const auto& compilation_key = ctx.template Attr<int64_t>(kCompilationKey);
    // (setup of inputs_name2tensor, target and stream is elided in this excerpt)
    // Compilation is triggered here from the shapes of the input tensors; at
    // this point some dynamic (-1) dims are resolved to concrete values.
    const auto& cinn_compiled_object = CinnCompiler::GetInstance()->Compile(
        compilation_key, inputs_name2tensor, target, stream);
  }

Q: How does CINN eliminate dynamic shapes?

void CinnGraphSymbolization::RunOp(const CinnOpDesc& op_desc,
                                   const OpMapperContext& ctx) const {
  const auto& op_type = op_desc.Type();
  auto* kernel = ::cinn::frontend::OpMapperRegistry::Global()->Find(op_type);
  VLOG(4) << "Running Op " << op_type;
  // Dispatched via NetBuilder->build() to the concrete API, which calls infer_shape.
  kernel->Run(op_desc, ctx);
}
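The two ideas in this excerpt, looking up a handler by op type and running shape inference on concrete shapes, can be sketched as follows. Everything here is a hypothetical illustration (ToyOpRegistry, ResolveDynamicDims and the matmul rule are not CINN APIs); it only shows how -1 dims disappear once real tensor shapes are available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Shape = std::vector<int64_t>;
using InferShapeFn = std::function<Shape(const std::vector<Shape>&)>;

// Hypothetical registry keyed by op type, loosely analogous to finding an
// op mapper through op_desc.Type().
class ToyOpRegistry {
 public:
  void Register(const std::string& op_type, InferShapeFn fn) {
    fns_[op_type] = std::move(fn);
  }

  Shape Run(const std::string& op_type, const std::vector<Shape>& inputs) const {
    auto it = fns_.find(op_type);
    if (it == fns_.end()) throw std::runtime_error("op not registered: " + op_type);
    return it->second(inputs);
  }

 private:
  std::map<std::string, InferShapeFn> fns_;
};

// Replaces every -1 in a declared shape with the matching runtime dimension
// and rejects mismatches on the static dimensions.
Shape ResolveDynamicDims(const Shape& declared, const Shape& runtime) {
  if (declared.size() != runtime.size()) throw std::invalid_argument("rank mismatch");
  Shape out = declared;
  for (std::size_t i = 0; i < out.size(); ++i) {
    if (out[i] == -1) {
      out[i] = runtime[i];
    } else if (out[i] != runtime[i]) {
      throw std::invalid_argument("static dim mismatch");
    }
  }
  return out;
}
```

For example, a declared input shape {-1, 768} fed with a runtime tensor of shape {8, 768} resolves to {8, 768}, and a registered matmul rule can then infer a fully static output shape such as {8, 3072}.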

Q: Where inside CINN is the caching mechanism triggered?

const CinnCompiledObject &CinnCompiler::Compile(
    const Graph &graph,
    const std::map<std::string, const phi::DenseTensor *> &input_tensors,
    const Target &target,
    void *stream) {
  VLOG(4) << "-- The graph to be compiled is:\n" << VizGraph(graph);
  // The address-based key (graph.ptr + shape + target) seems to be tried first?
  CinnCacheKeyByAddress cur_key_by_address(
      graph, input_tensors, target.arch_str());
  // On a miss, fall back to the structure-based key (graph info + shape + target).
  CinnCacheKeyByStructure cur_key_by_struct;
  if (!cache_by_address_.count(cur_key_by_address)) {
    // generate the structure cache key
    cur_key_by_struct.SetKey(graph, input_tensors, target.arch_str());
    if (!cache_by_struct_.count(cur_key_by_struct)) {
      std::int64_t compiled_num = real_compiled_num_.fetch_add(1);
      // The core work is delegated to CompileGraph.
      auto compiled_res =
          CompileGraph(graph, input_tensors, target, compiled_num, stream);
      std::unique_lock<std::mutex> guard(lock_);
      // double check cache_by_struct_
      if (!cache_by_struct_.count(cur_key_by_struct)) {
        cache_by_struct_[cur_key_by_struct] = compiled_num;
        index2cache_.emplace(compiled_num, std::move(compiled_res));
      }
      // double check cache_by_address_
      if (!cache_by_address_.count(cur_key_by_address)) {
        cache_by_address_[cur_key_by_address] =
            cache_by_struct_.at(cur_key_by_struct);
      }
    } else {
      std::unique_lock<std::mutex> guard(lock_);
      // double check cache_by_address_
      if (!cache_by_address_.count(cur_key_by_address)) {
        cache_by_address_[cur_key_by_address] =
            cache_by_struct_.at(cur_key_by_struct);
      }
    }
  }
  return *index2cache_.at(cache_by_address_.at(cur_key_by_address));
}
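The lookup order above (address key first, then structure key, then compile) can be condensed into a small sketch. ToyTwoLevelCache is hypothetical and deliberately simpler than the excerpt: it uses plain strings as keys and holds one lock for the whole call, whereas the original compiles before taking the lock and then re-checks both maps.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical two-level cache: both key maps point at an index into
// results_, so one compiled result can be reached through several keys.
class ToyTwoLevelCache {
 public:
  using CompileFn = std::function<std::string()>;

  const std::string& GetOrCompile(const std::string& address_key,
                                  const std::string& struct_key,
                                  const CompileFn& compile) {
    std::lock_guard<std::mutex> guard(lock_);
    // Level 1: the cheap address-based key.
    auto by_addr = by_address_.find(address_key);
    if (by_addr != by_address_.end()) return *results_.at(by_addr->second);

    // Level 2: the structure-based key; compile only if this misses too.
    auto by_struct = by_struct_.find(struct_key);
    if (by_struct == by_struct_.end()) {
      int64_t index = compiled_num_++;
      results_.emplace(index, std::make_unique<std::string>(compile()));
      by_struct = by_struct_.emplace(struct_key, index).first;
    }
    // Remember the address key so the next lookup stops at level 1.
    by_address_.emplace(address_key, by_struct->second);
    return *results_.at(by_struct->second);
  }

  int64_t compiled_num() const { return compiled_num_; }

 private:
  std::mutex lock_;
  std::map<std::string, int64_t> by_address_;
  std::map<std::string, int64_t> by_struct_;
  std::map<int64_t, std::unique_ptr<std::string>> results_;
  int64_t compiled_num_ = 0;
};
```

In this sketch a second graph object with the same structure and shapes reuses the first compilation through the structure key, while a new input shape changes both keys and forces a fresh compile.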

Q: What is the core responsibility of CompileGraph, and is there any further caching inside it?

std::unique_ptr<CinnCompiledObject> CinnCompiler::CompileGraph(
    const ir::Graph &graph,
    const std::map<std::string, const phi::DenseTensor *> &input_tensors,
    const Target &target,
    std::int64_t compiled_num,
    void *stream) const {
  CinnGraphSymbolization symbol{compiled_num, graph, target, input_tensors};
  auto frontend_program = symbol();
  auto fetch_ids = symbol.GetFetchIds();
  VLOG(4) << "All fetch var ids in CINN: "
          << string::join_strings(fetch_ids, ',');
  // Runs only once for a given ir::Graph (on a cache miss).
  auto cinn_graph = Optimize(&frontend_program, fetch_ids, target);
  VLOG(4) << "-- The " << compiled_num << "-th compilation ("
          << target.arch_str() << "), and its related graph:\n"
          << cinn_graph->Visualize();
  auto scope = BuildScope(target, cinn_graph);
  // The GraphCompiler does its work once, but stays owned by compiled_obj.
  auto graph_compiler =
      std::make_unique<GraphCompiler>(target, scope, cinn_graph);
  GraphCompiler::CompileOptions options;
  options.with_instantiate_variables = false;
  if (!FLAGS_enable_pe_launch_cinn) {
    options.with_buffer_handle_instruction_inserted = true;
  }
  std::unique_ptr<AutoTuner> auto_tuner;
  if (FLAGS_enable_cinn_auto_tune) {
    VLOG(4) << "Compile with auto-tune";
    auto_tuner = std::make_unique<AutoTuner>(target, cinn_graph.get());
    auto_tuner->Initialize(AutoTuner::Config(), graph_compiler.get());
    ::cinn::auto_schedule::TuningOptions tuning_options;
    tuning_options.num_measure_trials = 0;
    auto tuning_result = auto_tuner->Tune(tuning_options);
    options.Apply(tuning_result);
  }
  auto compiled_res =
      graph_compiler->Build(options, std::move(fetch_ids), stream);
  auto compiled_obj = std::make_unique<CinnCompiledObject>();
  *compiled_obj = {std::move(graph_compiler),
                   std::move(auto_tuner),
                   std::move(compiled_res.runtime_program),
                   scope,
                   symbol.var_model_to_program_map()};  // <------ corresponds to paddle2cinn_varmap
  compiled_obj->cached_index = compiled_num;
  compiled_obj->launch_context =
      std::make_unique<operators::details::CinnLaunchContext>(graph,
                                                              *compiled_obj);
  CheckCompiledValid(graph, input_tensors, *compiled_obj);
  return compiled_obj;
}

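Q: GraphCompiler hands all of its compile-and-link work to backends::Compiler. Does this backend Compiler have a compilation cache of its own?

A: The host module appears to hold mainly function declarations and call logic, while the device module mainly holds function definitions.

The source generated by CodeGen is written to a file and passed to the compilation engine. If there are multiple functions, they are placed in the same file and compiled and linked together. [Generated source listing not preserved.]

From the code, my understanding is that each CINN subgraph gets its own GraphCompiler, which compiles it into a function named in the pattern fn_xxx_yyy_zzz:

- The function describes the overall computation of all ops in the subgraph.
- After op decomposition, optimization and similar steps, it may be split into multiple sub-functions.
- The sub-functions are placed in one host file and one CUDA file, then compiled and linked together into a single function pointer.

To be confirmed: does this mean there is no caching at the lower_func level?

The figure above (not preserved) shows the construction of engine_ = ExecutionEngine::Create(ExecutionOptions(), std::move(symbols));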

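Appendix: Compilation in TVM

Q: What plays a role similar to `GraphCompiler` in TVM?

A: After a quick review of the TVM source, it seems to be TECompilerImpl, which inherits from TECompilerNode and provides the following core interfaces: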

  // Lower the function.
  CachedFunc Lower(const CCacheKey& key) {
    return LowerInternal(key, global_var_supply_)->cached_func;
  }

  // For now, build one module per function.
  PackedFunc JIT(const CCacheKey& key) final {
    CCacheValue value = LowerInternal(key, GlobalVarSupply(NameSupply("")));
    if (value->packed_func != nullptr) {
      return value->packed_func;
    }
    // <------ m is a runtime::Module object
    auto m = build(value->cached_func->funcs, key->target, Target(nullptr));
    value->packed_func = m.GetFunction(value->cached_func->prim_fn_var->name_hint);
    return value->packed_func;
  }

  CachedFunc LowerShapeFunc(const CCacheKey& key) final {
    return LowerShapeFuncInternal(key)->cached_func;
  }

Notably, TECompilerImpl holds two cache-related data structures:

  /*! \brief internal compiler cache */
  std::unordered_map<CCacheKey, CCacheValue> cache_;
  /*! \brief internal compiler cache for shape funcs */
  std::unordered_map<CCacheKey, CCacheValue> shape_func_cache_;
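Using a struct as the key of an std::unordered_map requires both a hash functor and an equality operator. The following sketch shows that pattern together with a memoized Lower call; ToyCCacheKey, ToyCCacheKeyHash and ToyTECompiler are hypothetical, and the real CCacheKey carries a source function and a target rather than two strings.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical key: which function is lowered, and for which target.
struct ToyCCacheKey {
  std::string func_name;
  std::string target;
  bool operator==(const ToyCCacheKey& other) const {
    return func_name == other.func_name && target == other.target;
  }
};

// Hash functor combining the hashes of both fields.
struct ToyCCacheKeyHash {
  std::size_t operator()(const ToyCCacheKey& key) const {
    std::size_t h1 = std::hash<std::string>()(key.func_name);
    std::size_t h2 = std::hash<std::string>()(key.target);
    return h1 ^ (h2 << 1);
  }
};

// Hypothetical compiler that lowers each distinct key at most once.
class ToyTECompiler {
 public:
  const std::string& Lower(const ToyCCacheKey& key) {
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;  // cache hit: no lowering
    ++lower_count_;
    std::string lowered = "lowered:" + key.func_name + "@" + key.target;
    return cache_.emplace(key, std::move(lowered)).first->second;
  }

  int lower_count() const { return lower_count_; }

 private:
  std::unordered_map<ToyCCacheKey, std::string, ToyCCacheKeyHash> cache_;
  int lower_count_ = 0;
};
```

Calling Lower twice with the same key lowers once, while the same function with a different target is treated as a separate entry.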

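Q: What does the `build()` method above do? Does it play the same role as `backends::Compiler` in PaddlePaddle's CINN?

A: I believe so. The runtime::Module it returns also seems comparable to RuntimeProgram in PaddlePaddle's CINN?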

// Build for heterogeneous execution when targets are specified as
// objects.  This wrapper around the internal API is maintained for
// backwards compatibility.
runtime::Module build(const Map<Target, IRModule>& input, const Target& target_host) {
  return TIRToRuntime(input, target_host);
}

// <------- the implementation behind build()
runtime::Module TIRToRuntime(const Map<Target, IRModule>& inputs_arg,
                             const Target& target_host_arg) {
  std::vector<runtime::Module> device_modules;
  Map<Target, IRModule> inputs = inputs_arg;
  Target target_host = target_host_arg;
  // Fetch previous defined target host in targets
  CheckAndUpdateHostConsistency(&inputs, &target_host);
  if (!target_host.defined()) {
    for (const auto& it : inputs) {
      if (it.first->GetTargetDeviceType() == kDLCPU ||
          it.first->GetTargetDeviceType() == kDLMicroDev) {
        target_host = it.first;
        break;
      }
    }
  }
  if (!target_host.defined()) {
    target_host = DefaultTargetHost(target_host);
  }
  // Update target host for all targets
  CheckAndUpdateHostConsistency(&inputs, &target_host);
  // Take the attrs from the first module so the eventual modules have them.
  // Ideally this would just be one unified module all the way through;
  IRModule first_module = (*inputs.begin()).second;
  IRModule mhost_all = IRModule(Map<GlobalVar, BaseFunc>(), {}, {}, {}, first_module->attrs);
  ICHECK(mhost_all.defined()) << "The host module must be defined";
  for (const auto& it : inputs) {
    if (it.second.defined()) {
      const Target& target = it.first;
      const IRModule& ir_module = it.second;
      auto pair = SplitMixedModule(ir_module, target, target_host);
      auto& host_mod = pair.first;
      auto& device_mod = pair.second;
      ICHECK(host_mod.defined()) << "The split host module must be defined";
      ICHECK(mhost_all.defined()) << "The host module must be defined";
      // We don't want library modules going back into host codegen
      // unless they're supposed to. Here if we overrode the target host
      // to allow lowering previously we check that it's meant to be placed
      // back into the host Module.
      bool overrides_host_target =
          target->GetTargetDeviceType() == target_host->GetTargetDeviceType();
      bool non_host_target_kind = target->kind != target_host->kind;
      if (overrides_host_target && non_host_target_kind) {
        device_modules.push_back(codegen::Build(host_mod, it.first));
      } else {
        mhost_all->Update(host_mod);
      }
      if (device_mod->functions.size() != 0) {
        device_modules.push_back(codegen::Build(device_mod, it.first));
      }
    }
  }
  // <----- is this where the actual compilation happens?
  runtime::Module mhost = codegen::Build(mhost_all, target_host);
  for (const auto& it : device_modules) {
    if (it.operator->()) {
      mhost.Import(it);
    }
  }
  return mhost;
}

runtime::Module Build(IRModule mod, Target target) {
  if (transform::PassContext::Current()
          ->GetConfig<Bool>("tir.disable_assert", Bool(false))
          .value()) {
    mod = tir::transform::SkipAssert()(mod);
  }
  auto target_attr_map = tvm::TargetKind::GetAttrMap<FTVMTIRToRuntime>("TIRToRuntime");
  if (target_attr_map.count(target->kind)) {
    return target_attr_map[target->kind](mod, target);
  }
  // the build function.
  std::string build_f_name = "target.build." + target->kind->name;
  const PackedFunc* bf = runtime::Registry::Get(build_f_name);
  ICHECK(bf != nullptr) << build_f_name << " is not enabled";
  return (*bf)(mod, target);
}

TVM_REGISTER_GLOBAL("target.build.cuda").set_body_typed(BuildCUDA);

runtime::Module BuildCUDA(IRModule mod, Target target) {
  using tvm::runtime::Registry;
  bool output_ssa = false;
  CodeGenCUDA cg;
  cg.Init(output_ssa);
  for (auto kv : mod->functions) {
    ICHECK(kv.second->IsInstance<PrimFuncNode>()) << "CodeGenCUDA: Can only take PrimFunc";
    auto f = Downcast<PrimFunc>(kv.second);
    auto calling_conv = f->GetAttr<Integer>(tvm::attr::kCallingConv);
    ICHECK(calling_conv == CallingConv::kDeviceKernelLaunch)
        << "CodeGenCUDA: expect calling_conv equals CallingConv::kDeviceKernelLaunch";
    cg.AddFunction(f);
  }
  std::string code = cg.Finish();
  if (const auto* f = Registry::Get("tvm_callback_cuda_postproc")) {
    code = (*f)(code).operator std::string();
  }
  std::string fmt = "ptx";
  std::string ptx;
  const auto* f_enter = Registry::Get("target.TargetEnterScope");
  (*f_enter)(target);
  if (const auto* f = Registry::Get("tvm_callback_cuda_compile")) {
    ptx = (*f)(code).operator std::string();
    // Dirty matching to check PTX vs cubin.
    // TODO(tqchen) more reliable checks
    if (ptx[0] != '/') fmt = "cubin";
  } else {
    ptx = NVRTCCompile(code, cg.need_include_path());
  }
  const auto* f_exit = Registry::Get("target.TargetExitScope");
  (*f_exit)(target);
  return CUDAModuleCreate(ptx, fmt, ExtractFuncInfo(mod), code);
}
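The "dirty matching" in BuildCUDA decides between PTX and cubin by looking at the first character of the compiler output. A standalone version of that check might look like the sketch below; DetectCudaBinaryFormat is a hypothetical helper name, and the heuristic rests on the assumption that PTX text begins with a "//" comment header while a cubin is a binary blob that does not.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper reproducing the first-character heuristic: output that
// starts with '/' is treated as PTX text, anything else as a cubin binary.
std::string DetectCudaBinaryFormat(const std::string& compiled) {
  if (!compiled.empty() && compiled[0] == '/') return "ptx";
  return "cubin";
}
```

As the TODO in the excerpt already notes, this is not a reliable check; a stricter variant could, for instance, look for a binary file signature instead of relying on a single character.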

Q: Where does TVM invoke execution?

A: I came across a data structure called GraphExecutor.

The subgraph compilation cache mechanism in CINN