RDMA Programming - Based on linux-rdma

2017-11-08 | Category: Network | Tags: RDMA, RoCE, Linux-RDMA

linux-rdma is the user-space library that corresponds to the Linux kernel InfiniBand subsystem (drivers/infiniband). It provides the InfiniBand Verbs API and the RDMA Verbs API.

Basic concepts

  • Queue Pair(QP)

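To perform RDMA operations, a connection must be established between the two ends. This is done with a Queue Pair (QP), which plays a role similar to a socket. Both ends of the communication must initialize a QP, and the Communication Manager (CM) exchanges QP information before the two sides actually connect.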

Once a QP is established, the verbs API can be used to perform RDMA reads, RDMA writes, and atomic operations. Serialized send/receive operations, which are similar to socket reads/writes, can be performed as well.

A QP is represented by struct ibv_qp, and ibv_create_qp creates one.

    /**
     * ibv_create_qp - Create a queue pair.
     */
    struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
                                 struct ibv_qp_init_attr *qp_init_attr);
  • Completion Queue(CQ)

Completion Queue is an object which contains the completed work requests which were posted to the Work Queues (WQ). Every completion says that a specific WR was completed (both successfully completed WRs and unsuccessfully completed WRs). A Completion Queue is a mechanism to notify the application about information of ended Work Requests (status, opcode, size, source).

The corresponding data structure is struct ibv_cq, and ibv_create_cq creates a CQ:

    /**
     * ibv_create_cq - Create a completion queue
     * @context - Context CQ will be attached to
     * @cqe - Minimum number of entries required for CQ
     * @cq_context - Consumer-supplied context returned for completion events
     * @channel - Completion channel where completion events will be queued.
     *     May be NULL if completion events will not be used.
     * @comp_vector - Completion vector used to signal completion events.
     *     Must be >= 0 and < context->num_comp_vectors.
     */
    struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                                 void *cq_context,
                                 struct ibv_comp_channel *channel,
                                 int comp_vector);
  • Memory Registration (MR)

Memory Registration is a mechanism that allows an application to describe a set of virtually contiguous memory locations or a set of physically contiguous memory locations to the network adapter as a virtually contiguous buffer using Virtual Addresses.

The corresponding data structure is struct ibv_mr:

    struct ibv_mr {
        struct ibv_context *context;
        struct ibv_pd *pd;
        void *addr;
        size_t length;
        uint32_t handle;
        uint32_t lkey;
        uint32_t rkey;
    };

Every MR has a remote and a local key (rkey, lkey).

Local keys are used by the local HCA to access local memory, such as during a receive operation.

Remote keys are given to the remote HCA to allow a remote process access to system memory during RDMA operations.

ibv_reg_mr registers a memory region (MR), associates it with a protection domain (PD), and assigns it local and remote keys (lkey, rkey).

    /**
     * ibv_reg_mr - Register a memory region
     */
    struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                              size_t length, int access);
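
To make the role of the keys more concrete, here is a minimal sketch in plain C, not libibverbs code, of the kind of check that is conceptually applied to an incoming RDMA request: the rkey must match, the registration must grant the requested permission, and the requested range must lie inside the registered buffer. The struct demo_mr, the DEMO_ACCESS_* flags and demo_remote_access_ok are hypothetical names made up for this illustration; in the real library the permissions are the IBV_ACCESS_* flags passed to ibv_reg_mr, and the check itself is done by the adapter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits, standing in for the IBV_ACCESS_* flags. */
enum demo_access {
    DEMO_ACCESS_REMOTE_READ  = 1 << 0,
    DEMO_ACCESS_REMOTE_WRITE = 1 << 1,
};

/* Simplified stand-in for struct ibv_mr: only the fields the check needs. */
struct demo_mr {
    uintptr_t addr;    /* start of the registered buffer */
    size_t    length;  /* size of the registered buffer in bytes */
    uint32_t  rkey;    /* key handed to the remote side */
    int       access;  /* permissions chosen at registration time */
};

/* Return true if a remote request for [addr, addr + len) with the given
 * rkey and permission would be allowed against this region. */
bool demo_remote_access_ok(const struct demo_mr *mr, uint32_t rkey,
                           uintptr_t addr, size_t len, int wanted)
{
    if (rkey != mr->rkey)
        return false;                            /* wrong key */
    if ((mr->access & wanted) != wanted)
        return false;                            /* permission not granted */
    if (addr < mr->addr || len > mr->length)
        return false;                            /* starts too early or is too long */
    return addr - mr->addr <= mr->length - len;  /* end stays inside the buffer */
}
```

With a 256-byte region registered for remote reads only, a read of the whole region with the right rkey passes, while a write, a request with a different rkey, or a read that runs past the end of the buffer is rejected.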
  • Protection Domain (PD)

Object whose components can interact with only each other. These components can be AH, QP, MR, and SRQ.

A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows, as a means for enabling and controlling network adapter access to Host System memory.

struct ibv_pd is used to implement protection domains:

    struct ibv_pd {
        struct ibv_context *context;
        uint32_t handle;
    };

ibv_alloc_pd creates a protection domain (PD). PDs limit which memory regions can be accessed by which queue pairs (QP) providing a degree of protection from unauthorized access.

    /**
     * ibv_alloc_pd - Allocate a protection domain
     */
    struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);
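
The grouping rule can be sketched in a few lines of plain C. This is only a toy model under the assumption that each object remembers the PD it was created in; toy_pd, toy_qp, toy_region and toy_qp_may_use are hypothetical names, and in a real system the adapter enforces the rule, not application code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: each object remembers the PD it was created in. */
struct toy_pd     { uint32_t handle; };
struct toy_qp     { const struct toy_pd *pd; };
struct toy_region { const struct toy_pd *pd; };

/* A QP may work with a memory region only if both belong to the same PD. */
bool toy_qp_may_use(const struct toy_qp *qp, const struct toy_region *mr)
{
    return qp->pd == mr->pd;
}
```

A QP and an MR created from the same PD pass the check; an MR from any other PD does not, even if its keys are known.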
  • Send Request (SR)

An SR defines how much data will be sent, from where, how and, with RDMA, to where. struct ibv_send_wr is used to implement SRs; see the definition of struct ibv_send_wr.
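
As a rough illustration of how those four questions map onto fields, the sketch below uses simplified stand-ins for struct ibv_sge and struct ibv_send_wr. The demo_* types and demo_sr_total_bytes are hypothetical and leave out most of the real fields; they only show that "how much" and "from where" come from the scatter/gather list, "how" from the opcode, and "to where" from the remote address and rkey.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct ibv_sge: one piece of local memory. */
struct demo_sge {
    uint64_t addr;    /* from where: start of the local buffer */
    uint32_t length;  /* how much: bytes taken from this buffer */
    uint32_t lkey;    /* local key of the MR that covers the buffer */
};

/* Hypothetical opcodes covering the "how". */
enum demo_opcode { DEMO_OP_SEND, DEMO_OP_RDMA_WRITE, DEMO_OP_RDMA_READ };

/* Simplified stand-in for struct ibv_send_wr. */
struct demo_send_req {
    const struct demo_sge *sg_list;  /* scatter/gather entries */
    int                    num_sge;  /* number of entries in sg_list */
    enum demo_opcode       opcode;
    uint64_t               remote_addr;  /* to where: RDMA operations only */
    uint32_t               rkey;         /* remote key: RDMA operations only */
};

/* Total number of bytes the request describes. */
uint64_t demo_sr_total_bytes(const struct demo_send_req *sr)
{
    uint64_t total = 0;
    for (int i = 0; i < sr->num_sge; i++)
        total += sr->sg_list[i].length;
    return total;
}
```

A request with two entries of 64 and 128 bytes therefore describes 192 bytes, gathered from two separate local buffers.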

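Example (IB Verbs API)

RDMA applications can be written against either the librdmacm API or the libibverbs API. The former is a further wrapper around the latter.

rc_pingpong is an example that programs directly against the libibverbs API.

In general, the basic flow for using the IB Verbs API is as follows: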

  • (1) Get the device list

First you must retrieve the list of available IB devices on the local host. Every device in this list contains both a name and a GUID. For example the device names can be: mthca0, mlx4_1.

IB devices are represented by struct ibv_device:

    struct ibv_device {
        struct _ibv_device_ops _ops;
        enum ibv_node_type node_type;
        enum ibv_transport_type transport_type;
        /* Name of underlying kernel IB device, eg "mthca0" */
        char name[IBV_SYSFS_NAME_MAX];
        /* Name of uverbs device, eg "uverbs0" */
        char dev_name[IBV_SYSFS_NAME_MAX];
        /* Path to infiniband_verbs class device in sysfs */
        char dev_path[IBV_SYSFS_PATH_MAX];
        /* Path to infiniband class device in sysfs */
        char ibdev_path[IBV_SYSFS_PATH_MAX];
    };

Applications obtain the list of IB devices through ibv_get_device_list:

    /**
     * ibv_get_device_list - Get list of IB devices currently available
     * @num_devices: optional. if non-NULL, set to the number of devices
     * returned in the array.
     *
     * Return a NULL-terminated array of IB devices. The array can be
     * released with ibv_free_device_list().
     */
    struct ibv_device **ibv_get_device_list(int *num_devices);
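
Because the returned array is NULL-terminated, it can be walked without knowing the device count. The sketch below shows that iteration pattern on a stand-in type so that it compiles without libibverbs; struct demo_device and demo_find_device are hypothetical. With the real library the array would come from ibv_get_device_list and the names from ibv_get_device_name.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for struct ibv_device, reduced to the name field. */
struct demo_device {
    char name[64];
};

/* Walk a NULL-terminated device array and return the entry whose name
 * matches, or NULL if there is none. */
struct demo_device *demo_find_device(struct demo_device **list,
                                     const char *name)
{
    if (list == NULL)
        return NULL;
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(list[i]->name, name) == 0)
            return list[i];
    }
    return NULL;
}
```

Selecting a device by name this way corresponds to what the -d option does in the test commands later in this post, where the device is rxe0.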
  • (2) Open the requested device

Iterate over the device list, choose a device according to its GUID or name and open it.

The application calls ibv_open_device to open the IB device:

    /**
     * ibv_open_device - Initialize device for use
     */
    struct ibv_context *ibv_open_device(struct ibv_device *device);

It returns an ibv_context object:

    struct ibv_context {
        struct ibv_device *device;
        struct ibv_context_ops ops;
        int cmd_fd;
        int async_fd;
        int num_comp_vectors;
        pthread_mutex_t mutex;
        void *abi_compat;
    };
  • (3) Allocate a Protection Domain

Allocate a PD.

A Protection Domain (PD) lets the user restrict which components can interact with each other.

These components can be AH, QP, MR, MW, and SRQ.

  • (4) Register a memory region

Register an MR.

Any memory buffer which is valid in the process’s virtual space can be registered.

During the registration process the user sets memory permissions and receives local and remote keys (lkey/rkey) which will later be used to refer to this memory buffer.

  • (5) Create a Completion Queue(CQ)

Create a CQ.

A CQ contains completed work requests (WR). Each WR will generate a completion queue entry (CQE) that is placed on the CQ.

The CQE will specify if the WR was completed successfully or not.

  • (6) Create a Queue Pair(QP)

Create a QP.

Creating a QP will also create an associated send queue and receive queue.

  • (7) Bring up a QP

Bring up the QP.

A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).

These transitions give the QP the information it needs to send and receive data.

ibv_modify_qp changes the state of a QP:

    /**
     * ibv_modify_qp - Modify a queue pair.
     */
    int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
                      int attr_mask);

For example, in a client/server pair the QP on each side has to be brought to the RTS state; see pp_connect_ctx in rc_pingpong.

A QP can be in the following states:

    RESET  Newly created, queues empty.
    INIT   Basic information set. Ready for posting to receive queue.
    RTR    Ready to Receive. Remote address info set for connected QPs, QP may now receive packets.
    RTS    Ready to Send. Timeout and retry parameters set, QP may now send packets.
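
The bring-up sequence can be read as a small state machine. The sketch below models only the four states listed above and only the forward path RESET, INIT, RTR, RTS; that restriction is an assumption of the sketch, since the real QP state machine has further states and transitions that are not covered here. All demo_* names are hypothetical.

```c
#include <stdbool.h>

/* The four states listed above, in the order a QP moves through them. */
enum demo_qp_state { DEMO_RESET, DEMO_INIT, DEMO_RTR, DEMO_RTS };

/* Only the bring-up path is modelled: each step moves exactly one state
 * forward, so RESET cannot jump straight to RTS. */
bool demo_transition_ok(enum demo_qp_state from, enum demo_qp_state to)
{
    return (int)to == (int)from + 1 && to <= DEMO_RTS;
}

/* Receive buffers may be posted from INIT onward. */
bool demo_can_post_recv(enum demo_qp_state s)
{
    return s >= DEMO_INIT;
}

/* Packets may be sent only once the QP has reached RTS. */
bool demo_can_send(enum demo_qp_state s)
{
    return s == DEMO_RTS;
}
```

In this model an application would call the equivalent of ibv_modify_qp three times, once per forward step, before it can send.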
  • (8) Post work requests and poll for completion

Use the created QP for communication operations.

See pp_post_send and ibv_poll_cq.
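
The polling side of this step can be pictured as draining a queue of completion entries. The following is a toy model, assuming a fixed-size ring in ordinary memory; struct demo_cq, demo_cq_push and demo_cq_poll are hypothetical and only mimic the general shape of ibv_poll_cq, which likewise fills a caller-supplied array and returns how many completions it found.

```c
#include <stdint.h>

#define DEMO_CQ_DEPTH 8

enum demo_wc_status { DEMO_WC_SUCCESS, DEMO_WC_ERROR };

/* Simplified completion entry: which work request finished, and how. */
struct demo_wc {
    uint64_t            wr_id;
    enum demo_wc_status status;
};

/* Fixed-size ring standing in for a completion queue. */
struct demo_cq {
    struct demo_wc entries[DEMO_CQ_DEPTH];
    unsigned       head;   /* next entry to hand to the application */
    unsigned       count;  /* entries currently queued */
};

/* Called by the "adapter" side when a work request finishes.
 * Returns 0 on success, -1 if the queue is already full. */
int demo_cq_push(struct demo_cq *cq, uint64_t wr_id, enum demo_wc_status st)
{
    if (cq->count == DEMO_CQ_DEPTH)
        return -1;
    unsigned tail = (cq->head + cq->count) % DEMO_CQ_DEPTH;
    cq->entries[tail].wr_id = wr_id;
    cq->entries[tail].status = st;
    cq->count++;
    return 0;
}

/* Copy up to max completions into out and remove them from the queue.
 * Returns how many were copied; 0 means nothing has completed yet. */
int demo_cq_poll(struct demo_cq *cq, int max, struct demo_wc *out)
{
    int n = 0;
    while (n < max && cq->count > 0) {
        out[n++] = cq->entries[cq->head];
        cq->head = (cq->head + 1) % DEMO_CQ_DEPTH;
        cq->count--;
    }
    return n;
}
```

An application loop would keep polling until it has seen one completion per posted work request, checking the status of each entry rather than assuming success.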

  • (9) Cleanup
    Destroy objects in the reverse order you created them:
    Delete QP
    Delete CQ
    Deregister MR
    Deallocate PD
    Close device
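
One common way to get the reverse order right is to push a cleanup action whenever an object is created and pop them all at exit. The sketch below shows that pattern in plain C; it is an illustration, not how rc_pingpong is written, and every demo_* name is hypothetical. The small logging helper exists only to make the resulting order visible.

```c
#include <stddef.h>

#define DEMO_MAX_CLEANUPS 8

typedef void (*demo_cleanup_fn)(void *arg);

/* A small stack of cleanup actions, one per created object. */
struct demo_cleanup_stack {
    demo_cleanup_fn fn[DEMO_MAX_CLEANUPS];
    void           *arg[DEMO_MAX_CLEANUPS];
    size_t          depth;
};

/* Register a cleanup right after the matching object is created.
 * Returns 0 on success, -1 if the stack is full. */
int demo_cleanup_push(struct demo_cleanup_stack *s, demo_cleanup_fn fn,
                      void *arg)
{
    if (s->depth == DEMO_MAX_CLEANUPS)
        return -1;
    s->fn[s->depth] = fn;
    s->arg[s->depth] = arg;
    s->depth++;
    return 0;
}

/* Run every registered cleanup, most recently created object first. */
void demo_cleanup_run(struct demo_cleanup_stack *s)
{
    while (s->depth > 0) {
        s->depth--;
        s->fn[s->depth](s->arg[s->depth]);
    }
}

/* Example cleanup: append one tag character to a log string. */
struct demo_log { char text[16]; size_t len; };
struct demo_tag { struct demo_log *log; char tag; };

void demo_log_tag(void *arg)
{
    struct demo_tag *t = arg;
    if (t->log->len + 1 < sizeof t->log->text) {
        t->log->text[t->log->len++] = t->tag;
        t->log->text[t->log->len] = '\0';
    }
}
```

If cleanups are pushed in creation order (device, PD, MR, CQ, QP), running the stack releases them as QP, CQ, MR, PD, device, which matches the list above.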

Test

  • server

    # ibv_rc_pingpong -d rxe0 -g 0 -s 128 -r 1 -n 1
    local address: LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
    remote address: LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
    256 bytes in 0.00 seconds = 11.38 Mbit/sec
    1 iters in 0.00 seconds = 180.00 usec/iter

  • client

    # ibv_rc_pingpong -d rxe0 -g 0 172.18.42.162 -s 128 -r 1 -n 1
    local address: LID 0x0000, QPN 0x000011, PSN 0x849753, GID fe80::5054:61ff:fe56:1211
    remote address: LID 0x0000, QPN 0x000011, PSN 0x626753, GID fe80::5054:61ff:fe57:1211
    256 bytes in 0.00 seconds = 16.13 Mbit/sec
    1 iters in 0.00 seconds = 127.00 usec/iter
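
The figures in this output are consistent with simple arithmetic: with -s 128 and -n 1, each iteration moves 128 bytes in each direction, giving 256 bytes, and dividing the bits by the elapsed microseconds yields Mbit/sec. The helper functions below are a hypothetical sketch of that calculation, written to match the numbers shown rather than copied from the tool's source.

```c
/* Bytes moved in a ping-pong run: each iteration sends size bytes each way. */
long long demo_pingpong_bytes(int size, int iters)
{
    return (long long)size * iters * 2;
}

/* Throughput in Mbit/sec: bits divided by elapsed microseconds. */
double demo_mbit_per_sec(long long bytes, double usec)
{
    return bytes * 8.0 / usec;
}
```

For the server run, 256 bytes over 180 usec is 2048 / 180, about 11.38 Mbit/sec; for the client run, 2048 / 127 is about 16.13 Mbit/sec. With a single iteration the result mostly reflects one round-trip latency, so it says little about sustained bandwidth.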

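A packet capture shows the communication flow between client and server.

In the capture, the first RC Send Only is the packet that the client sends to the server. The server then replies with an RC Ack and sends an RC Send Only of its own to the client.

The TCP packets that come before these carry the control information exchanged between client and server.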
