[Source Code Analysis] TensorFlow Distributed Environment (5) --- Session

The session mechanism is the core of the TensorFlow distributed runtime. In this article we walk through the Session mechanism from front to back, following the flow from Client to Worker.

Other articles in this series:

[Translation] TensorFlow distributed papers: "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems"

[Translation] TensorFlow distributed papers: "Implementation of Control Flow in TensorFlow"

[Source Code Analysis] TensorFlow Distributed Environment (1) --- Overall Architecture

[Source Code Analysis] TensorFlow Distributed Environment (2) --- Master Static Logic

[Source Code Analysis] TensorFlow Distributed Environment (3) --- Worker Static Logic

[Source Code Analysis] TensorFlow Distributed Environment (4) --- WorkerCache

1. Overview

1.1 Session Types

In distributed mode, session control is accomplished by the following sessions cooperating with one another:

  • GrpcSession lives on the Client and controls the lifecycle of the Client-side session;
  • MasterSession lives on the Master. Multiple Clients may connect to the same Master at the same time, and the Master builds one MasterSession for each Client. MasterSession controls the lifecycle of the Master-side session;
  • WorkerSession lives on the Worker. Multiple Masters may connect to the same Worker, and the Worker creates one WorkerSession for each Master. WorkerSession controls the lifecycle of the Worker-side session;

As shown in the figure below, both Master and Worker are Servers: each Server runs one MasterService and one WorkerService, and each Server may play different roles depending on how the user configures the computation graph and the cluster. Because of this two-level one-to-many relationship, and in order to distinguish the different data-flow and control relationships, the three logically related sessions are bound to the same session_handle; each session_handle identifies one complete data flow.

Figure 1. Session relationships

1.2 Session Workflow

We start from GrpcSession, whose basic functions are as follows (a client-side sketch follows this list):

  • Create the session

    • Obtain the set of remote devices;
    • Create a MasterSession on the Master;
    • Create a WorkerSession on each Worker;
  • Iterative execution
    • Launch execution;
    • Graph partitioning;
    • Register subgraphs;
    • Run subgraphs;
  • Close the session
    • Close the MasterSession;
    • Close the WorkerSessions;
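
From the client's point of view, this lifecycle maps onto the public C++ Session API. The following is a minimal sketch under stated assumptions: the master address 127.0.0.1:2222 and the fetch name "out" are hypothetical, RunRemoteGraph is a made-up helper, and error handling beyond TF_RETURN_IF_ERROR is omitted.

    #include <memory>
    #include <vector>

    #include "tensorflow/core/framework/graph.pb.h"
    #include "tensorflow/core/public/session.h"

    // Sketch of the client-side lifecycle: Create -> Run -> Close.
    tensorflow::Status RunRemoteGraph(const tensorflow::GraphDef& graph_def) {
      tensorflow::SessionOptions options;
      options.target = "grpc://127.0.0.1:2222";  // hypothetical address; "grpc://" selects GrpcSession
      std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));

      // Create: sends CreateSessionRequest so the Master builds a MasterSession.
      TF_RETURN_IF_ERROR(session->Create(graph_def));

      // Run: each step drives graph partitioning, subgraph registration and execution.
      std::vector<tensorflow::Tensor> outputs;
      TF_RETURN_IF_ERROR(session->Run({}, {"out"}, {}, &outputs));  // "out" is an assumed fetch name

      // Close: tears down the MasterSession and the WorkerSessions.
      return session->Close();
    }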

1.2.1 MasterSession Lifecycle

In distributed mode, the Master runtime is controlled by MasterSession; its lifecycle is shown in the figure below.

Figure 2. MasterSession lifecycle

1.2.2 WorkerSession Lifecycle

In distributed mode, the Worker runtime is controlled by WorkerSession; its lifecycle is shown in the figure below.

Figure 3. WorkerSession lifecycle

2. GrpcSession

GrpcSession is a thin wrapper around tensorflow::grpc::MasterService. It uses a set of remote devices as its computing resources and uses gRPC as the remote invocation mechanism, letting the caller compute a TensorFlow graph on remote devices.

2.1 Definition

As before, we only show the member variables and a few important functions. GrpcSession simply uses master_ to invoke tensorflow::grpc::MasterService.

    class GrpcSession : public Session {
      // There are several ways to create a session.
      Status Create(const GraphDef& graph) override;
      Status Create(const RunOptions& run_options, const GraphDef& graph) override;
      Status Create(GraphDef&& graph) override;
      Status Create(const RunOptions& run_options, GraphDef&& graph) override;

     private:
      const SessionOptions options_;
      std::unique_ptr<MasterInterface> master_;
      mutex mu_;

      // handle_ returned by the master to identify this session.
      string handle_ TF_GUARDED_BY(mu_);

      // The current version of the graph.
      int64_t current_graph_version_ TF_GUARDED_BY(mu_);

      bool is_local_ = false;
    };

2.2 Registration & Factory Class

GrpcSession is used through a factory class, for example:

    Status NewSession(const SessionOptions& options, Session** out_session) {
      SessionFactory* factory;
      Status s = SessionFactory::GetFactory(options, &factory);
      if (!s.ok()) {
        *out_session = nullptr;
        return s;
      }
      // Starts exporting metrics through a platform-specific monitoring API (if
      // provided). For builds using "tensorflow/core/platform/default", this is
      // currently a no-op.
      session_created->GetCell()->Set(true);
      s = factory->NewSession(options, out_session);
      if (!s.ok()) {
        *out_session = nullptr;
      }
      return s;
    }

GrpcSession is created polymorphically by GrpcSessionFactory: if the protocol uses "grpc://", a GrpcSession is produced. GrpcSessionFactory registers itself into the system as follows.

    const char* const kSchemePrefix = "grpc://";
    const size_t kSchemePrefixLength = strlen(kSchemePrefix);

    class GrpcSessionFactory : public SessionFactory {
     public:
      bool AcceptsOptions(const SessionOptions& options) override {
        return absl::StartsWith(options.target, kSchemePrefix);
      }

      Status NewSession(const SessionOptions& options,
                        Session** out_session) override {
        std::unique_ptr<GrpcSession> session;
        TF_RETURN_IF_ERROR(GrpcSession::Create(options, &session));
        *out_session = session.release();
        return Status::OK();
      }

      // Invokes the session specific static method to reset containers.
      Status Reset(const SessionOptions& options,
                   const std::vector<string>& containers) override {
        return GrpcSession::Reset(options, containers);
      }
    };

    class GrpcSessionRegistrar {
     public:
      GrpcSessionRegistrar() {
        SessionFactory::Register("GRPC_SESSION", new GrpcSessionFactory());
      }
    };
    static GrpcSessionRegistrar registrar;

2.3 Creating a GrpcSession

GrpcSession::Create does the acquisition work. The Client calls the Master Service through GrpcSession, but how exactly does it interact with the Master Service? Through MasterInterface.

So the most important thing here is how to build the MasterInterface instance. As mentioned earlier, MasterInterface has two implementations, both used to communicate with the Master service, each corresponding to a different deployment scenario.

  • LocalMaster is used for direct in-process communication, when the Client and the Master live in the same process.
  • GrpcRemoteMaster uses gRPC to communicate with the Master service, when the Client and the Master are deployed in two different processes. GrpcRemoteMaster is essentially a gRPC client that accesses the MasterService on the remote Master through a stub.

The Master wrapped inside the two rectangles in the figure is the actual Master class, which implements the concrete Master functionality.

Figure 1. Master logical relationships

As the code below shows, GrpcSession decides what to create based on options.target, which normally starts with "grpc://". If LocalMaster::Lookup finds a LocalMaster instance, it is used directly; otherwise NewGrpcMaster is called to build a GrpcRemoteMaster.

    /* static */
    Status GrpcSession::Create(const SessionOptions& options,
                               std::unique_ptr<GrpcSession>* out_session) {
      std::unique_ptr<GrpcSession> session(new GrpcSession(options));
      std::unique_ptr<MasterInterface> master;
      // For testing, we enable the client to disable the use of the local
      // master registry, so that the RPC stack is exercised.
      if (!options.config.rpc_options().use_rpc_for_inprocess_master()) {
        master = LocalMaster::Lookup(options.target);
      }
      if (!master) {
        SharedGrpcChannelPtr master_channel;
        TF_RETURN_IF_ERROR(
            NewHostPortGrpcChannel(options.target.substr(kSchemePrefixLength),
                                   &options.config.rpc_options(), &master_channel));
        master.reset(NewGrpcMaster(master_channel));
      } else {
        session->is_local_ = true;
      }
      session->SetRemoteMaster(std::move(master));
      *out_session = std::move(session);
      return Status::OK();
    }

2.4 Creating a MasterSession

After the GrpcSession is created, the system goes on to create a MasterSession. This is done through GrpcSession::Create(graph_def), which builds a CreateSessionRequest message and sends the initial computation graph to the Master via GrpcRemoteMaster. After receiving the CreateSessionRequest, the Master builds the corresponding MasterSession and returns a CreateSessionResponse to the GrpcSession. The response contains:

  • The session_handle of that MasterSession, used to identify the MasterSession instance on the Master side.
  • The version number graph_version of the initial computation graph, used for later ExtendSession operations, e.g. appending new nodes to the original graph (see the sketch right after these bullets).
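
To show how these two values are used later, here is a simplified, hedged sketch of the extend path (modeled on GrpcSession's extend logic; ExtendGraph is a hypothetical helper, field names follow master.proto, and details such as locking are elided):

    // Hypothetical helper: the saved handle identifies the MasterSession, and the
    // saved graph version lets the Master detect concurrent modifications.
    Status ExtendGraph(MasterInterface* master, CallOptions* call_options,
                       const string& handle, int64_t* current_graph_version,
                       GraphDef graph) {
      ExtendSessionRequest req;
      req.set_session_handle(handle);                          // from CreateSessionResponse
      req.mutable_graph_def()->Swap(&graph);                   // nodes to append
      req.set_current_graph_version(*current_graph_version);   // version the client last saw
      ExtendSessionResponse resp;
      Status s = master->ExtendSession(call_options, &req, &resp);
      if (s.ok()) {
        *current_graph_version = resp.new_graph_version();     // advance to the new version
      }
      return s;
    }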

Figure 2. Creating a MasterSession

The code is as follows. First come two Create overloads, which eventually call CreateImpl.

    Status GrpcSession::Create(const RunOptions& run_options,
                               const GraphDef& graph) {
      return Create(run_options, GraphDef(graph));
    }

    Status GrpcSession::Create(GraphDef&& graph) {
      CallOptions call_options;
      call_options.SetTimeout(options_.config.operation_timeout_in_ms());
      return CreateImpl(&call_options, std::move(graph));
    }

The CreateImpl method is as follows:

    Status GrpcSession::CreateImpl(CallOptions* call_options, GraphDef graph) {
      {
        mutex_lock l(mu_);
        if (!handle_.empty()) {
          return errors::InvalidArgument("A session is alive.");
        }
      }
      CreateSessionRequest req;
      *req.mutable_config() = options_.config;
      req.mutable_graph_def()->Swap(&graph);
      req.set_target(options_.target);
      ReEncodeConsts(req.mutable_graph_def());
      CreateSessionResponse resp;
      Status s = master_->CreateSession(call_options, &req, &resp);
      if (s.ok()) {
        SetHandleAndGraphVersion(resp.session_handle(), resp.graph_version());
      }
      return s;
    }

2.4.1 GrpcRemoteMaster::CreateSession

GrpcRemoteMaster is the gRPC client implementation on the Client side. Its CreateSession method simply calls the CreateSession interface of the remote MasterService through the gRPC stub, i.e., it sends a CreateSessionRequest.

    Status CreateSession(CallOptions* call_options,
                         const CreateSessionRequest* request,
                         CreateSessionResponse* response) override {
      return CallWithRetry(call_options, request, response,
                           &MasterServiceStub::CreateSession);
    }

2.4.2 GrpcMasterService::CreateSessionHandler

GrpcMasterService is the gRPC service exposed by the Master. After a CreateSessionRequest message arrives, the service calls GrpcMasterService::CreateSessionHandler to handle it, and the real business logic is carried out by master_impl_ (an instance of the Master class), i.e., Master::CreateSession is invoked.

When master_impl_ finishes, a CreateSessionResponse is returned to the Client.

    // RPC handler for creating a session.
    void CreateSessionHandler(
        MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
      CreateSessionRequest* rewritten_req = new CreateSessionRequest;
      rewritten_req->mutable_config()->MergeFrom(default_session_config_);
      rewritten_req->MergeFrom(call->request);
      master_impl_->CreateSession(rewritten_req, &call->response,
                                  [call, rewritten_req](const Status& status) {
                                    call->SendResponse(ToGrpcStatus(status));
                                    delete rewritten_req;
                                  });
      ENQUEUE_REQUEST(CreateSession, true);
    }

2.4.3 Master::CreateSession

Master::CreateSession grabs a thread from the thread pool and does the following in that thread:

  • If a ClusterSpec (cluster_def) is provided, locate all the workers according to that configuration.
  • Obtain the remote devices.
  • Obtain the remote workers.
  • Build the MasterSession through the factory.
  • Have the MasterSession create the WorkerSession sessions, using worker_cache_factory.
  • Through sessions_.insert, record the <session_handle, MasterSession> pair inside the Master, so that the Master can later find the corresponding MasterSession by session_handle.

    void Master::CreateSession(const CreateSessionRequest* req,
                               CreateSessionResponse* resp, MyClosure done) {
      SchedClosure([this, req, resp, done]() {
        Status status;
        WorkerCacheFactoryOptions worker_cache_factory_options;
        string grpc_protocol("grpc");
        worker_cache_factory_options.protocol = &grpc_protocol;
        auto call_done = gtl::MakeCleanup([&status, &done] { done(status); });
        status = ValidateExternalGraphDefSyntax(req->graph_def());
        if (!status.ok()) return;

        // The following 4 variables are set differently, depending on whether this
        // session uses a client-provided clusterspec or not.
        WorkerCacheInterface* worker_cache = nullptr;
        // Note: worker_cache_ptr will be null except if this session is using a
        // client-supplied ClusterDef (ClusterSpec propagation).
        std::unique_ptr<WorkerCacheInterface> worker_cache_ptr;
        std::unique_ptr<DeviceSet> device_set;
        // TODO(saeta): Convert to std::make_unique when available.
        std::unique_ptr<std::vector<std::unique_ptr<Device>>> remote_devices(
            new std::vector<std::unique_ptr<Device>>());

        if (req->config().has_cluster_def()) {  // A ClusterDef was provided.
          worker_cache_factory_options.cluster_def = &req->config().cluster_def();

          // Set the server_def's job_name and task_index fields.
          string normalized_string;
          string grpc_protocol(kGrpcProtocol);
          if (req->target().compare(0, grpc_protocol.length(), grpc_protocol) ==
              0) {
            normalized_string =
                req->target().substr(grpc_protocol.length(), string::npos);
          } else {
            normalized_string = req->target();
          }
          for (auto&& job : req->config().cluster_def().job()) {
            for (auto&& task : job.tasks()) {
              if (task.second == normalized_string) {
                if (worker_cache_factory_options.job_name != nullptr) {
                  return;
                }
                if (env_->local_devices[0]->parsed_name().job == job.name() &&
                    env_->local_devices[0]->parsed_name().task == task.first) {
                  return;
                }
                worker_cache_factory_options.job_name = &job.name();
                worker_cache_factory_options.task_index = task.first;
              }
            }
          }
          worker_cache_factory_options.rpc_options = &req->config().rpc_options();
          // Create the worker cache from the computed server_def.
          status = env_->worker_cache_factory(worker_cache_factory_options,
                                              &worker_cache);
          if (!status.ok()) return;
          worker_cache_ptr = std::unique_ptr<WorkerCacheInterface>(worker_cache);
          // Ping all the workers and build the list of devices that the
          // session will use.
          // Get the remote devices.
          status =
              DeviceFinder::GetRemoteDevices(req->config().device_filters(), env_,
                                             worker_cache, remote_devices.get());
          if (!status.ok()) return;
          device_set.reset(new DeviceSet);
          for (auto&& d : *remote_devices) {
            device_set->AddDevice(d.get());
            DeviceNameUtils::ParsedName name = d->parsed_name();
            if (name.job == *worker_cache_factory_options.job_name &&
                name.task == worker_cache_factory_options.task_index &&
                name.type == "CPU" && name.id == 0) {
              device_set->set_client_device(d.get());
            }
          }
        } else {  // No ClusterDef was provided.
          worker_cache = env_->worker_cache;
          // Ping all the workers and build the list of devices that the
          // session will use.
          // Get the remote devices.
          status =
              DeviceFinder::GetRemoteDevices(req->config().device_filters(), env_,
                                             worker_cache, remote_devices.get());
          if (!status.ok()) return;
          device_set.reset(new DeviceSet);
          for (auto&& d : *remote_devices) {
            device_set->AddDevice(d.get());
          }
          int num_local_devices = 0;
          for (Device* d : env_->local_devices) {
            device_set->AddDevice(d);
            if (num_local_devices == 0) {
              // Uses the first local device as the client device.
              device_set->set_client_device(d);
            }
            num_local_devices++;
          }
        }

        SessionOptions options;
        options.config = req->config();

        // Get the remote workers.
        std::vector<string> filtered_worker_list;
        DeviceFinder::GetRemoteWorkers(req->config().device_filters(), env_,
                                       worker_cache, &filtered_worker_list);

        // Build the MasterSession through the factory.
        MasterSession* session = env_->master_session_factory(
            options, env_, std::move(remote_devices), std::move(worker_cache_ptr),
            std::move(device_set), std::move(filtered_worker_list));

        GraphDef* gdef =
            const_cast<CreateSessionRequest*>(req)->mutable_graph_def();

        // Create the session and hand the graph over to it.
        status = session->Create(std::move(*gdef), worker_cache_factory_options);
        if (!status.ok()) {
          session->Close().IgnoreError();
          session->Unref();
          return;
        }
        resp->set_session_handle(session->handle());
        // Insert into the session map, which takes ownership of the session.
        {
          mutex_lock l(mu_);
          CHECK(sessions_.insert({session->handle(), session}).second);
        }
      });
    }

3. MasterSession

MasterSession lives on the Master. Multiple Clients may connect to the same Master at the same time, and the Master builds one MasterSession for each Client. MasterSession controls the lifecycle of the Master-side session.

3.1 Definition

The relevant part of MasterSession's definition, its inner ReffedClientGraph class, is shown below.

    // MasterSession wraps ClientGraph in a reference counted object.
    // This way, MasterSession can clear up the cache mapping Run requests to
    // compiled graphs while the compiled graph is still being used.
    class MasterSession::ReffedClientGraph : public core::RefCounted {
     public:
      ReffedClientGraph(const string& handle, const BuildGraphOptions& bopts,
                        std::unique_ptr<ClientGraph> client_graph,
                        const SessionOptions& session_opts,
                        const StatsPublisherFactory& stats_publisher_factory,
                        bool is_partial, WorkerCacheInterface* worker_cache,
                        bool should_deregister)
          : session_handle_(handle),
            bg_opts_(bopts),
            client_graph_before_register_(std::move(client_graph)),
            session_opts_(session_opts),
            is_partial_(is_partial),
            callable_opts_(bopts.callable_options),
            worker_cache_(worker_cache),
            should_deregister_(should_deregister),
            collective_graph_key_(
                client_graph_before_register_->collective_graph_key) {
        VLOG(1) << "Created ReffedClientGraph for node with "
                << client_graph_before_register_->graph.num_node_ids();

        stats_publisher_ = stats_publisher_factory(handle, bopts, session_opts);

        // Initialize a name to node map for processing device stats.
        for (Node* n : client_graph_before_register_->graph.nodes()) {
          name_to_node_details_.emplace(
              n->name(),
              NodeDetails(n->type_string(),
                          strings::StrCat(
                              "(", absl::StrJoin(n->requested_inputs(), ", "))));
        }
      }

      ~ReffedClientGraph() override {
        if (should_deregister_) {
          DeregisterPartitions();
        } else {
          for (Part& part : partitions_) {
            worker_cache_->ReleaseWorker(part.name, part.worker);
          }
        }
      }

     private:
      const string session_handle_;
      const BuildGraphOptions bg_opts_;

      // NOTE(mrry): This pointer will be null after `RegisterPartitions()` returns.
      std::unique_ptr<ClientGraph> client_graph_before_register_ TF_GUARDED_BY(mu_);
      const SessionOptions session_opts_;
      const bool is_partial_;
      const CallableOptions callable_opts_;
      WorkerCacheInterface* const worker_cache_;  // Not owned.

      struct NodeDetails {
        explicit NodeDetails(string type_string, string detail_text)
            : type_string(std::move(type_string)),
              detail_text(std::move(detail_text)) {}
        const string type_string;
        const string detail_text;
      };
      std::unordered_map<string, NodeDetails> name_to_node_details_;

      const bool should_deregister_;
      const int64_t collective_graph_key_;
      std::atomic<int64_t> execution_count_ = {0};

      // Graph partitioned into per-location subgraphs.
      struct Part {
        // Worker name.
        string name;

        // Maps feed names to rendezvous keys. Empty most of the time.
        std::unordered_map<string, string> feed_key;

        // Maps rendezvous keys to fetch names. Empty most of the time.
        std::unordered_map<string, string> key_fetch;

        // The interface to the worker. Owned.
        WorkerInterface* worker = nullptr;

        // After registration with the worker, graph_handle identifies
        // this partition on the worker.
        string graph_handle;

        Part() : feed_key(3), key_fetch(3) {}
      };

      // partitions_ is immutable after RegisterPartitions() call
      // finishes. RunPartitions() can access partitions_ safely without
      // acquiring locks.
      std::vector<Part> partitions_;

      mutable mutex mu_;

      // Partition initialization and registration only needs to happen
      // once. `!client_graph_before_register_ && !init_done_.HasBeenNotified()`
      // indicates the initialization is ongoing.
      Notification init_done_;

      // init_result_ remembers the initialization error if any.
      Status init_result_ TF_GUARDED_BY(mu_);

      std::unique_ptr<StatsPublisherInterface> stats_publisher_;
    };

3.2 Creation

MasterSession::Create(graph_def) does the following:

  • Calls MakeForBaseGraph to initialize the computation graph and produce a GraphExecutionState instance;
  • Calls CreateWorkerSessions; if the cluster is configured dynamically, a notification is broadcast to all Workers so that each creates its corresponding WorkerSession.

    Status MasterSession::Create(GraphDef&& graph_def,
                                 const WorkerCacheFactoryOptions& options) {
      if (session_opts_.config.use_per_session_threads() ||
          session_opts_.config.session_inter_op_thread_pool_size() > 0) {
        return errors::InvalidArgument(
            "Distributed session does not support session thread pool options.");
      }
      if (session_opts_.config.graph_options().place_pruned_graph()) {
        session_opts_.config.mutable_graph_options()->set_place_pruned_graph(false);
      }

      GraphExecutionStateOptions execution_options;
      execution_options.device_set = devices_.get();
      execution_options.session_options = &session_opts_;
      {
        mutex_lock l(mu_);
        TF_RETURN_IF_ERROR(GraphExecutionState::MakeForBaseGraph(
            std::move(graph_def), execution_options, &execution_state_));
      }
      should_delete_worker_sessions_ = true;
      return CreateWorkerSessions(options);
    }

3.2.1 Creating the Computation Graph

Here a GraphExecutionState is built, and the corresponding full Graph is constructed from the GraphDef.

GraphDef is the original graph structure; ConvertGraphDefToGraph performs the format conversion from GraphDef to Graph. GraphDef carries the graph's metadata, while Graph holds the rest of the structural information used by the runtime.
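
As a standalone illustration of that conversion, here is a minimal sketch under stated assumptions: BuildGraph is a hypothetical helper, and the header location of ConvertGraphDefToGraph has moved between TF versions.

    #include "tensorflow/core/framework/graph.pb.h"
    #include "tensorflow/core/graph/graph.h"
    #include "tensorflow/core/graph/graph_constructor.h"  // moved to common_runtime/ in newer versions

    // Hypothetical helper: turn a GraphDef (protobuf metadata) into a runtime Graph.
    tensorflow::Status BuildGraph(const tensorflow::GraphDef& graph_def,
                                  tensorflow::Graph* graph) {
      tensorflow::GraphConstructorOptions opts;
      opts.allow_internal_ops = false;  // reject internal ops such as _Send/_Recv
      return tensorflow::ConvertGraphDefToGraph(opts, graph_def, graph);
    }

    // Usage sketch:
    //   tensorflow::Graph graph(tensorflow::OpRegistry::Global());
    //   TF_CHECK_OK(BuildGraph(graph_def, &graph));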

    /* static */ Status GraphExecutionState::MakeForBaseGraph(
        GraphDef&& graph_def, const GraphExecutionStateOptions& options,
        std::unique_ptr<GraphExecutionState>* out_state) {
      auto flib_def = absl::make_unique<FunctionLibraryDefinition>(
          OpRegistry::Global(), graph_def.library());

      TF_RETURN_IF_ERROR(AddDefaultAttrsToGraphDef(&graph_def, *flib_def, 0));

      if (options.session_options->config.graph_options().place_pruned_graph() ||
          !options.session_options->config.experimental()
               .optimize_for_static_graph()) {
        auto ret = absl::WrapUnique(new GraphExecutionState(
            absl::make_unique<GraphDef>(std::move(graph_def)), std::move(flib_def),
            options));

        // When place_pruned_graph is true, a different Graph* will be initialized
        // each time we prune the original graph, so there is no need to
        // construct a Graph* in this case.
        if (!options.session_options->config.graph_options().place_pruned_graph()) {
          auto base_graph = absl::make_unique<Graph>(OpRegistry::Global());
          TF_RETURN_IF_ERROR(ConvertGraphDefToGraph({}, *ret->original_graph_def_,
                                                    base_graph.get()));
          TF_RETURN_IF_ERROR(ret->InitBaseGraph(std::move(base_graph)));
        }
        *out_state = std::move(ret);
      } else {
        auto ret = absl::WrapUnique(
            new GraphExecutionState(nullptr, std::move(flib_def), options));
        auto base_graph = absl::make_unique<Graph>(OpRegistry::Global());
        TF_RETURN_IF_ERROR(
            ConvertGraphDefToGraph({}, std::move(graph_def), base_graph.get()));
        TF_RETURN_IF_ERROR(ret->InitBaseGraph(std::move(base_graph)));
        *out_state = std::move(ret);
      }
      return Status::OK();
    }

InitBaseGraph calls Placer::Run to place the operators, i.e., to put each operator of the computation graph onto the most suitable device, maximizing efficiency. The Placer analyzes the Graph and fine-tunes the placement of every Node in combination with the user's requirements. The concrete principles are the following four (a configuration sketch follows this list):

  • Satisfy the user's requirements whenever possible. The user can specify a device through the device field or colocation constraints, and these are honored first.
  • Prefer fast devices. Every device in TF has a priority; the higher the priority, the better the compute performance, and higher-priority devices are chosen first.
  • Keep the program runnable. If a Node is pinned to a kind of device that does not exist in the system, an available device is chosen and the placement is rewritten.
  • Prefer locality. For example, try to place a Consumer on the same device as its Producer, avoiding pointless cross-device copies.
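
The third principle ("keep the program runnable") corresponds to the allow_soft_placement switch that InitBaseGraph (shown next) passes into the Placer. A minimal client-side sketch of turning it on, using the ConfigProto fields:

    #include "tensorflow/core/public/session_options.h"

    tensorflow::SessionOptions MakeOptions() {
      tensorflow::SessionOptions options;
      // Let the Placer rewrite an infeasible placement instead of failing.
      options.config.set_allow_soft_placement(true);
      // Log the device finally chosen for every node.
      options.config.set_log_device_placement(true);
      return options;
    }
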
    Status GraphExecutionState::InitBaseGraph(std::unique_ptr<Graph>&& new_graph) {
      // Save stateful placements before placing.
      RestoreStatefulNodes(new_graph.get());

      GraphOptimizationPassOptions optimization_options;
      optimization_options.session_handle = session_handle_;
      optimization_options.session_options = session_options_;
      optimization_options.graph = &new_graph;
      optimization_options.flib_def = flib_def_.get();
      optimization_options.device_set = device_set_;

      TF_RETURN_IF_ERROR(OptimizationPassRegistry::Global()->RunGrouping(
          OptimizationPassRegistry::PRE_PLACEMENT, optimization_options));

      Placer placer(new_graph.get(), "", flib_def_.get(), device_set_,
                    /* default_local_device= */ nullptr,
                    session_options_ == nullptr ||
                        session_options_->config.allow_soft_placement(),
                    session_options_ != nullptr &&
                        session_options_->config.log_device_placement());
      TF_RETURN_IF_ERROR(placer.Run());

      TF_RETURN_IF_ERROR(OptimizationPassRegistry::Global()->RunGrouping(
          OptimizationPassRegistry::POST_PLACEMENT, optimization_options));

      for (const Node* n : new_graph->nodes()) {
        node_name_to_cost_id_map_[n->name()] = n->cost_id();
      }

      SaveStatefulNodes(new_graph.get());
      graph_ = new_graph.release();
      return Status::OK();
    }

3.2.2 Creating WorkerSessions

After the MasterSession has been created successfully, if the cluster is not configured dynamically (the default distributed setup), no broadcast is sent asking all Workers to dynamically create WorkerSessions. In fact, every Worker has a SessionMgr instance, which holds a WorkerSession instance named legacy_session_, so each Worker has one globally unique WorkerSession instance.

Figure 3. Creating WorkerSessions

The logic is as follows:

  • First, register a cleanup that will call ReleaseWorker to release the workers.
  • Next, call GetOrCreateWorker to fetch each Worker from the cache; if it is not there, the cache builds it itself.
  • Finally, iterate over the Workers and call CreateWorkerSessionAsync so that each Worker creates its own WorkerSession. Every request sets the MasterSession's session_handle via set_session_handle(handle_), so every WorkerSession shares the same session_handle as the MasterSession, and they all belong to the same MasterSession.

To collect the messages returned by all Workers, a BlockingCounter is used to wait: its initial value is set to the number of Workers, and once the CreateWorkerSessionResponse from every Worker has been collected the counter drops to 0 and the BlockingCounter wakes up.
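
The fan-out/fan-in pattern looks roughly like the minimal sketch below. It assumes tensorflow/core/lib/core/blocking_counter.h (the header path may differ across versions) and replaces the real CreateWorkerSessionAsync RPC with an immediately-invoked stand-in callback, so it only illustrates the counting, not the RPC.

    #include <vector>
    #include "tensorflow/core/lib/core/blocking_counter.h"
    #include "tensorflow/core/lib/core/status.h"

    // Issue one asynchronous request per worker, then block until every callback fires.
    tensorflow::Status WaitForAllWorkers(int num_workers) {
      tensorflow::BlockingCounter done(num_workers);   // initialized to the worker count
      std::vector<tensorflow::Status> statuses(num_workers);

      for (int i = 0; i < num_workers; ++i) {
        // Stand-in for workers[i].worker->CreateWorkerSessionAsync(..., cb).
        auto cb = [i, &statuses, &done](const tensorflow::Status& s) {
          statuses[i] = s;
          done.DecrementCount();                       // one response collected
        };
        cb(tensorflow::Status::OK());                  // pretend the RPC already completed
      }

      done.Wait();                                     // wakes up when the count reaches 0
      tensorflow::Status status;
      for (const auto& s : statuses) status.Update(s);
      return status;
    }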

    Status MasterSession::CreateWorkerSessions(
        const WorkerCacheFactoryOptions& options) {
      const std::vector<string> worker_names = filtered_worker_list_;
      WorkerCacheInterface* worker_cache = get_worker_cache();

      struct WorkerGroup {
        // The worker name. (Not owned.)
        const string* name;

        // The worker referenced by name. (Not owned.)
        WorkerInterface* worker = nullptr;

        // Request and responses used for a given worker.
        CreateWorkerSessionRequest request;
        CreateWorkerSessionResponse response;
        Status status = Status::OK();
      };
      BlockingCounter done(worker_names.size());
      std::vector<WorkerGroup> workers(worker_names.size());

      // Release the workers.
      auto cleanup = gtl::MakeCleanup([&workers, worker_cache] {
        for (auto&& worker_group : workers) {
          if (worker_group.worker != nullptr) {
            worker_cache->ReleaseWorker(*worker_group.name, worker_group.worker);
          }
        }
      });

      string task_name;
      string local_device_name;
      DeviceNameUtils::SplitDeviceName(devices_->client_device()->name(),
                                       &task_name, &local_device_name);
      const int64_t client_device_incarnation =
          devices_->client_device()->attributes().incarnation();

      Status status = Status::OK();
      // Create all the workers & kick off the computations.
      for (size_t i = 0; i < worker_names.size(); ++i) {
        workers[i].name = &worker_names[i];
        workers[i].worker = worker_cache->GetOrCreateWorker(worker_names[i]);
        workers[i].request.set_session_handle(handle_);
        workers[i].request.set_master_task(task_name);
        workers[i].request.set_master_incarnation(client_device_incarnation);
        if (session_opts_.config.share_cluster_devices_in_session() ||
            session_opts_.config.experimental()
                .share_cluster_devices_in_session()) {
          for (const auto& remote_dev : devices_->devices()) {
            *workers[i].request.add_cluster_device_attributes() =
                remote_dev->attributes();
          }

          if (!session_opts_.config.share_cluster_devices_in_session() &&
              session_opts_.config.experimental()
                  .share_cluster_devices_in_session()) {
          }
        }

        DeviceNameUtils::ParsedName name;
        if (!DeviceNameUtils::ParseFullName(worker_names[i], &name)) {
          status = errors::Internal("Could not parse name ", worker_names[i]);
          return status;
        }
        if (!name.has_job || !name.has_task) {
          status = errors::Internal("Incomplete worker name ", worker_names[i]);
          return status;
        }

        if (options.cluster_def) {
          *workers[i].request.mutable_server_def()->mutable_cluster() =
              *options.cluster_def;
          workers[i].request.mutable_server_def()->set_protocol(*options.protocol);
          workers[i].request.mutable_server_def()->set_job_name(name.job);
          workers[i].request.mutable_server_def()->set_task_index(name.task);
          // Session state is always isolated when ClusterSpec propagation
          // is in use.
          workers[i].request.set_isolate_session_state(true);
        } else {
          // NOTE(mrry): Do not set any component of the ServerDef,
          // because the worker will use its local configuration.
          workers[i].request.set_isolate_session_state(
              session_opts_.config.isolate_session_state());
        }
        if (session_opts_.config.experimental()
                .share_session_state_in_clusterspec_propagation()) {
          // In a dynamic cluster, the ClusterSpec info is usually propagated by
          // master sessions. However, in data parallel training with multiple
          // masters
          // ("between-graph replication"), we need to disable isolation for
          // different worker sessions to update the same variables in PS tasks.
          workers[i].request.set_isolate_session_state(false);
        }
      }

      for (size_t i = 0; i < worker_names.size(); ++i) {
        auto cb = [i, &workers, &done](const Status& s) {
          workers[i].status = s;
          done.DecrementCount();
        };
        workers[i].worker->CreateWorkerSessionAsync(&workers[i].request,
                                                    &workers[i].response, cb);
      }

      done.Wait();
      for (size_t i = 0; i < workers.size(); ++i) {
        status.Update(workers[i].status);
      }
      return status;
    }

GrpcRemoteWorker

GrpcRemoteWorker is the gRPC client side; it calls the corresponding service interface of the remote WorkerService through a stub.

    void CreateWorkerSessionAsync(const CreateWorkerSessionRequest* request,
                                  CreateWorkerSessionResponse* response,
                                  StatusCallback done) override {
      IssueRequest(request, response, createworkersession_, std::move(done));
    }

GrpcWorkerService

On the remote Worker, the message is received by GrpcWorkerService. When a CreateWorkerSessionRequest arrives, it is handled by the CreateWorkerSessionHandler callback. CreateWorkerSessionHandler is generated by the HANDLE_CALL macro: it schedules a runnable closure on a thread pool, and that closure triggers the CreateWorkerSession method of the Worker (i.e., GrpcWorker) to dynamically create a WorkerSession instance.

    #define HANDLE_CALL(method, may_block_on_compute_pool)                        \
      void method##Handler(WorkerCall<method##Request, method##Response>* call) { \
        auto closure = [this, call]() {                                           \
          Status s = worker_->method(&call->request, &call->response);            \
          if (!s.ok()) {                                                          \
            VLOG(3) << "Bad response from " << #method << ": " << s;              \
          }                                                                       \
          call->SendResponse(ToGrpcStatus(s));                                    \
        };                                                                        \
        if ((may_block_on_compute_pool)) {                                        \
          worker_->env()->env->SchedClosure(std::move(closure));                  \
        } else {                                                                  \
          worker_->env()->compute_pool->Schedule(std::move(closure));             \
        }                                                                         \
        ENQUEUE_REQUEST(method, false);                                           \
      }

    HANDLE_CALL(CreateWorkerSession, false);

4. WorkerSession

In fact, what GrpcWorker ultimately invokes is the WorkerInterface.CreateWorkerSession method.

    Status CreateWorkerSession(const CreateWorkerSessionRequest* request,
                               CreateWorkerSessionResponse* response) {
      return CallAndWait(&ME::CreateWorkerSessionAsync, request, response);
    }

The CreateWorkerSessionRequest message carries the session_handle allocated by the MasterSession, and GrpcWorker creates a WorkerSession based on it; within this Worker, the session_handle uniquely identifies that WorkerSession.

In GrpcWorker's WorkerEnv context there is a SessionMgr, which uniformly manages and maintains the lifecycle of all WorkerSessions. SessionMgr and WorkerSession have a one-to-many relationship, and each WorkerSession instance is identified by its session_handle.

    void Worker::CreateWorkerSessionAsync(const CreateWorkerSessionRequest* request,
                                          CreateWorkerSessionResponse* response,
                                          StatusCallback done) {
      Status s = env_->session_mgr->CreateSession(
          request->session_handle(), request->server_def(),
          request->cluster_device_attributes(), request->isolate_session_state(),
          request->master_task(), request->master_incarnation());
      done(s);
    }

4.1 SessionMgr

4.1.1 Definition

The key point is that SessionMgr maintains the mapping between session_handle and WorkerSession; each WorkerSession is identified by a session_handle.

  • std::map<string, std::shared_ptr<WorkerSession>> sessions_ : maintains that mapping.

  • std::shared_ptr<WorkerSession> legacy_session_ : the local WorkerSession instance.

Figure 4. SessionMgr

    class SessionMgr {
     public:
      typedef std::function<Status(const ServerDef&, WorkerCacheInterface**)>
          WorkerCacheFactory;

      explicit SessionMgr(
          WorkerEnv* worker_env, const string& default_worker_name,
          std::unique_ptr<WorkerCacheInterface> default_worker_cache,
          WorkerCacheFactory worker_cache_factory);
      ~SessionMgr() {}

      // Allocates state for a new session.
      Status CreateSession(const string& session, const ServerDef& server_def,
                           bool isolate_session_state);
      Status CreateSession(
          const string& session, const ServerDef& server_def,
          const protobuf::RepeatedPtrField<DeviceAttributes>& device_attributes,
          bool isolate_session_state);

      // Create WorkerSession from the master with the given `master_task` and
      // `master_incarnation`. We first look for existing WorkerSessions associated
      // with the specified master task. If there are sessions created by the same
      // master but with a different incarnation, it indicates that the remote
      // master has restarted before deleting the sessions on worker. When it
      // happens, old sessions associated with the master will be automatically
      // removed before the new session is created.
      Status CreateSession(
          const string& session, const ServerDef& server_def,
          const protobuf::RepeatedPtrField<DeviceAttributes>& device_attributes,
          bool isolate_session_state, string master_task,
          int64_t master_incarnation);

      void ResetDefaultWorkerCache(WorkerCacheInterface* worker_cache);

      // Updates state (worker cache, devices) of worker session identified by
      // session name (`session`) based on a new server_def and set of devices.
      Status UpdateSession(const string& session, const ServerDef& server_def,
                           const protobuf::RepeatedPtrField<DeviceAttributes>&
                               cluster_device_attributes,
                           bool isolate_session_state);

      // Locates the worker session for a given session handle
      Status WorkerSessionForSession(const string& session_handle,
                                     std::shared_ptr<WorkerSession>* out_session);
      std::shared_ptr<WorkerSession> LegacySession();

      Status DeleteSession(const string& session);

      static string WorkerNameFromServerDef(const ServerDef& server_def);

      void SetLogging(bool active);

      void RetrieveLogs(int64_t step_id, LoggingResponse* response);

      void ClearLogs();

     private:
      WorkerEnv* const worker_env_;  // Not owned.

      // A note about destruction:
      // We must delete graph_mgr before device_mgr, due to shared
      // ownership of OpKernels in the executors. (The graph_mgr will
      // free all stateless OpKernels, and pass over borrowed stateful
      // OpKernels, which are also held in their respective devices'
      // OpSegments.)
      //
      // legacy_session_ owns the worker_env_.device_mgr, and so we must ensure
      // that sessions_'s WorkerSessions are deleted (which do not own the
      // underlying devices, but instead own RenamedDevices) before
      // legacy_session_ is deleted. Further, we must ensure that WorkerSession's
      // device_mgr is deleted after WorkerSession's graph_mgr.
      std::unique_ptr<WorkerCacheInterface> default_worker_cache_;
      std::shared_ptr<WorkerSession> legacy_session_;

      bool is_logging_active_ = false;

      const WorkerCacheFactory worker_cache_factory_;

      Status WorkerSessionForSessionLocked(
          const string& session_handle, std::shared_ptr<WorkerSession>* out_session)
          TF_EXCLUSIVE_LOCKS_REQUIRED(mu_);

      mutex mu_;
      // A map from session identifier to internal session structure.
      std::map<string, std::shared_ptr<WorkerSession>> sessions_ TF_GUARDED_BY(mu_);

      // Incarnation and WorkerSession handle associated with a master task.
      struct MasterAssociatedSession {
        const int64_t master_incarnation;
        const string session_handle;
      };
      // A map from master task name to its associated worker sessions.
      std::unordered_multimap<string, MasterAssociatedSession>
          master_to_associated_sessions_ TF_GUARDED_BY(mu_);
    };

4.1.2 Creating a Session

The CreateSession method creates a WorkerSession and a GraphMgr.

    Status SessionMgr::CreateSession(
        const string& session, const ServerDef& server_def,
        const protobuf::RepeatedPtrField<DeviceAttributes>&
            cluster_device_attributes,
        bool isolate_session_state, string master_task,
        int64_t master_incarnation) {
      mutex_lock l(mu_);
      if (session.empty()) {
        return errors::InvalidArgument("Session must be non-empty.");
      }

      // For given master task name, check if one or more `WorkerSession`s have been
      // created previously on this worker, and if so garbage collect the expired
      // `WorkerSession`s. This happens when the master fails before sending
      // `DeleteSession` requests, which can cause `WorkerSession`s to be leaked.
      if (!master_task.empty()) {
        auto it_range = master_to_associated_sessions_.equal_range(master_task);
        if (it_range.first != it_range.second &&
            it_range.first->second.master_incarnation != master_incarnation) {
          auto it = it_range.first;
          while (it != it_range.second) {
            auto session_it = sessions_.find(it->second.session_handle);
            if (session_it != sessions_.end()) {
              sessions_.erase(session_it);
            }
            it = master_to_associated_sessions_.erase(it);
          }
        }
      }

      WorkerCacheInterface* worker_cache = nullptr;
      string worker_name;
      if (server_def.cluster().job().empty()) {
        worker_cache = new WorkerCacheWrapper(default_worker_cache_.get());
        worker_name = legacy_session_->worker_name();
      } else {
        TF_RETURN_IF_ERROR(worker_cache_factory_(server_def, &worker_cache));
        worker_name = WorkerNameFromServerDef(server_def);
      }

      if (worker_cache != nullptr && default_worker_cache_ != nullptr) {
        worker_cache->SetLogging(this->is_logging_active_);
      }

      std::shared_ptr<WorkerSession> worker_session;
      std::vector<std::unique_ptr<Device>> cluster_devices;

      if (isolate_session_state || server_def.cluster().job_size()) {
        // Create a private copy of the DeviceMgr for the WorkerSession.
        std::vector<std::unique_ptr<Device>> renamed_devices;
        for (Device* d : worker_env_->local_devices) {
          renamed_devices.push_back(RenamedDevice::NewRenamedDevice(
              worker_name, d, false, isolate_session_state));
        }
        auto device_mgr = MakeUnique<StaticDeviceMgr>(std::move(renamed_devices));
        LookupLocalDevice cb = [&device_mgr](StringPiece name, Device** device) {
          return device_mgr->LookupDevice(name, device);
        };
        AsRemoteDevices(worker_env_->env, cluster_device_attributes, cb,
                        &cluster_devices);
        std::unique_ptr<DynamicDeviceMgr> remote_devices;
        if (!cluster_device_attributes.empty()) {
          remote_devices = MakeUnique<DynamicDeviceMgr>();
          TF_RETURN_IF_ERROR(
              remote_devices->AddDevices(std::move(cluster_devices)));
        }

        auto graph_mgr = MakeUnique<GraphMgr>(worker_env_, device_mgr.get());
        worker_session.reset(
            new WorkerSession(session, worker_name,
                              std::unique_ptr<WorkerCacheInterface>(worker_cache),
                              std::move(device_mgr), std::move(graph_mgr),
                              std::move(remote_devices)));
      } else {
        AsRemoteDevices(worker_env_->env, cluster_device_attributes, nullptr,
                        &cluster_devices);
        std::unique_ptr<DynamicDeviceMgr> remote_devices;
        if (!cluster_device_attributes.empty()) {
          remote_devices = MakeUnique<DynamicDeviceMgr>();
          TF_RETURN_IF_ERROR(
              remote_devices->AddDevices(std::move(cluster_devices)));
        }
        // Borrow the WorkerEnv's DeviceMgr for the WorkerSession, so
        // that resources using it can use its devices after the
        // WorkerSession has been deleted.
        auto graph_mgr = MakeUnique<GraphMgr>(worker_env_, worker_env_->device_mgr);
        worker_session = WorkerSession::CreateWithBorrowedDeviceMgr(
            session, worker_name,
            std::unique_ptr<WorkerCacheInterface>(worker_cache),
            worker_env_->device_mgr, std::move(graph_mgr),
            std::move(remote_devices));
      }

      sessions_.insert(std::make_pair(session, std::move(worker_session)));
      if (!master_task.empty()) {
        MasterAssociatedSession s{master_incarnation, session};
        master_to_associated_sessions_.emplace(master_task, s);
      }
      return Status::OK();
    }

4.1.3 Registering a Graph

Let us take RegisterGraphAsync as an example of the Worker's internal functionality. As we can see, it relies on GraphMgr to do the underlying work.

    void Worker::RegisterGraphAsync(const RegisterGraphRequest* request,
                                    RegisterGraphResponse* response,
                                    StatusCallback done) {
      std::shared_ptr<WorkerSession> session;
      Status s;
      if (request->create_worker_session_called()) {
        s = env_->session_mgr->WorkerSessionForSession(request->session_handle(),
                                                       &session);
      } else {
        session = env_->session_mgr->LegacySession();
      }
      if (s.ok()) {
        s = session->graph_mgr()->Register(
            request->session_handle(), request->graph_def(), session.get(),
            request->graph_options(), request->debug_options(),
            request->config_proto(), request->collective_graph_key(),
            session->cluster_flr(), response->mutable_graph_handle());
      }
      done(s);
    }

4.2 WorkerSession

4.2.1 Definition

The more important member variables of WorkerSession are the manager classes GraphMgr, DeviceMgr, and DynamicDeviceMgr:

  • string session_name_ : the session name.

  • string worker_name_ : the worker name, e.g. /job:mnist/replica:0/task:1.

  • std::shared_ptr<WorkerCacheInterface> worker_cache_ : the worker cache.

  • std::unique_ptr<GraphMgr> graph_mgr_ : the computation graphs registered by this session. Each Worker can register and run multiple computation graphs, and each graph is identified by a graph_handle.

  • std::unique_ptr<DeviceMgr> device_mgr_ : information about the collection of local compute devices.

Figure 5. WorkerSession concept

    // WorkerSession encapsulates all of the state relating to a given session.
    class WorkerSession {
     public:
      // Collection of local devices. These devices are typically
      // RenamedDevices in all except the SessionMgr.legacy_session_ and
      // sessions created with `isolate_session_state == false`. In the
      // those cases, this method returns a pointer to a borrowed
      // DeviceMgr (typically the `worker_env.device_mgr`).
      DeviceMgr* device_mgr() {
        return device_mgr_ ? device_mgr_.get() : borrowed_device_mgr_;
      }

      DynamicDeviceMgr* remote_device_mgr() { return remote_device_mgr_.get(); }

      const string& session_name() const { return session_name_; }
      const string& worker_name() const { return worker_name_; }

      WorkerCacheInterface* worker_cache() const {
        tf_shared_lock l(worker_session_state_mu_);
        return worker_cache_.get();
      }

      GraphMgr* graph_mgr() const { return graph_mgr_.get(); }

      ClusterFunctionLibraryRuntime* cluster_flr() const {
        return cluster_flr_.get();
      }

      WorkerSession(const string& session_name, const string& worker_name,
                    std::unique_ptr<WorkerCacheInterface> worker_cache,
                    std::unique_ptr<DeviceMgr> device_mgr,
                    std::unique_ptr<GraphMgr> graph_mgr,
                    std::unique_ptr<DynamicDeviceMgr> remote_device_mgr);

      static std::shared_ptr<WorkerSession> CreateWithBorrowedDeviceMgr(
          const string& session_name, const string& worker_name,
          std::unique_ptr<WorkerCacheInterface> worker_cache,
          DeviceMgr* borrowed_device_mgr, std::unique_ptr<GraphMgr> graph_mgr,
          std::unique_ptr<DynamicDeviceMgr> remote_device_mgr);

      // In the eager runtime we allow WorkerSession to be updated, where the
      // worker cache will be recreated. If WorkerSession upate is expected and a
      // worker in the cache is used in RPCs, the caller should hold a shared
      // pointer to avoid the workers getting deleted.
      std::shared_ptr<WorkerCacheInterface> GetSharedWorkerCache() {
        tf_shared_lock l(worker_session_state_mu_);
        return worker_cache_;
      }

      // Update an existing worker session with new set of remote workers and
      // devices. Added devices will be owned by the worker session, and removed
      // devices will be freed by their names.
      Status UpdateWorkerCacheAndDevices(
          std::unique_ptr<WorkerCacheInterface> new_worker_cache,
          std::vector<std::unique_ptr<Device>> added_remote_devices,
          const std::vector<Device*>& removed_remote_devices);

      ~WorkerSession();

     private:
      WorkerSession(const string& session_name, const string& worker_name,
                    std::unique_ptr<WorkerCacheInterface> worker_cache,
                    DeviceMgr* borrowed_device_mgr,
                    std::unique_ptr<GraphMgr> graph_mgr,
                    std::unique_ptr<DynamicDeviceMgr> remote_device_mgr);

      // The name of the session.
      const string session_name_;

      // The name of the worker. E.g., /job:mnist/replica:0/task:1.
      const string worker_name_;

      mutable mutex worker_session_state_mu_;
      // Object from which WorkerInterface instances can be obtained.
      std::shared_ptr<WorkerCacheInterface> worker_cache_
          TF_GUARDED_BY(worker_session_state_mu_);

      // graph_mgr keeps track of the registered graphs of this session.
      //
      // Note: graph_mgr must be deleted before rendezvous_mgr!
      // Note: graph_mgr must be deleted before device_mgr!
      const std::unique_ptr<GraphMgr> graph_mgr_;

      std::unique_ptr<ClusterFunctionLibraryRuntime> cluster_flr_;

      const std::unique_ptr<DeviceMgr> device_mgr_;
      DeviceMgr* const borrowed_device_mgr_;  // Not owned.
      std::unique_ptr<DynamicDeviceMgr> remote_device_mgr_;
    };

At this point we have walked through the basic session flow. Next, we will analyze the business logic in detail.

0xFF References

The placement heuristics module in TensorFlow --- Placer
