PACKET socket creation

The kernel function packet_create handles the creation of PF_PACKET sockets. Its sock->type argument selects the working mode: SOCK_PACKET is the first mode, while SOCK_DGRAM or SOCK_RAW is the second mode.

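The kernel gives each mode its own set of socket operations and its own packet receive function. The second mode uses the packet_ops operation set and receives with packet_rcv; the first mode uses packet_ops_spkt and receives with packet_rcv_spkt.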

    /*
     *  Attach a protocol block
     */
    spin_lock_init(&po->bind_lock);
    mutex_init(&po->pg_vec_lock);
    po->prot_hook.func = packet_rcv;

    if (sock->type == SOCK_PACKET)
        po->prot_hook.func = packet_rcv_spkt;

    po->prot_hook.af_packet_priv = sk;

socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
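
As a minimal sketch of the call above with error reporting: open_packet_socket is a hypothetical helper name, not a library function. Creating a PF_PACKET socket normally requires CAP_NET_RAW, so the call is expected to fail for unprivileged processes.

```c
#include <errno.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_packet.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */

/*
 * Hypothetical helper (not part of any library).
 * type: SOCK_RAW or SOCK_DGRAM selects the second mode described above,
 *       SOCK_PACKET selects the first mode.
 * Returns the descriptor, or -1 with *err set to the errno value.
 */
int open_packet_socket(int type, int *err)
{
    int fd = socket(PF_PACKET, type, htons(ETH_P_ALL));

    *err = (fd < 0) ? errno : 0;
    return fd;
}
```

A caller would check the return value and, on -1, report strerror(*err) before giving up.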

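Besides the ordinary path that copies packets between the kernel and user space, a PF_PACKET socket of type SOCK_DGRAM/SOCK_RAW can set up a receive ring buffer with the setsockopt system call and share that memory with the application through mmap, which removes the copy. In that case, however, the packet's socket address information is no longer delivered to user space through recvfrom/recvmsg, so the kernel has to store it together with the packet. Per-packet information such as the timestamp and VLAN tag, as well as ring buffer management information, must also be exchanged between the kernel and user space. The kernel therefore defines the TPACKET_HEADER structures to hold this information.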
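
A minimal sketch of how user space could locate these pieces inside one frame, assuming the TPACKET_V1 layout (struct tpacket_hdr, then a struct sockaddr_ll padded to TPACKET_ALIGNMENT). frame_sll and frame_data are hypothetical helper names.

```c
#include <linux/if_packet.h>

/* Hypothetical helpers for a TPACKET_V1 frame that starts at `frame`. */

/* The link-layer address follows the header, padded to TPACKET_ALIGNMENT. */
struct sockaddr_ll *frame_sll(void *frame)
{
    return (struct sockaddr_ll *)((char *)frame +
                                  TPACKET_ALIGN(sizeof(struct tpacket_hdr)));
}

/* Packet data begins tp_mac bytes from the start of the frame. */
unsigned char *frame_data(void *frame)
{
    struct tpacket_hdr *h = frame;

    return (unsigned char *)frame + h->tp_mac;
}
```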

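There are currently three versions of TPACKET_HEADER, each with a slightly different length. User space selects the version it wants with setsockopt(PACKET_VERSION) and can query the header length of a given version with getsockopt(PACKET_HDRLEN); that length is needed when sizing the receive ring buffer.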
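
The two socket options could be combined roughly as follows. This is a sketch under the assumption that PACKET_HDRLEN takes the version number as its input value and returns the header length in the same integer; select_tpacket_version is a hypothetical helper name.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: query the header length for `version` and then
 * select that version on `fd`. Returns 0 on success, -1 on failure
 * (errno is set by the failing call and *hdrlen is left untouched if
 * the query fails).
 */
int select_tpacket_version(int fd, int version, int *hdrlen)
{
    int val = version;
    socklen_t len = sizeof(val);

    /* Assumption: the version goes in, the header length comes out. */
    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;
    *hdrlen = val;

    val = version;
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)) < 0)
        return -1;
    return 0;
}
```

The version has to be chosen before the ring is created, since the kernel derives tp_hdrlen from it when the ring is set up.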

enum tpacket_versions {
    TPACKET_V1,
    TPACKET_V2,
    TPACKET_V3
};

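User space configures the ring buffer with setsockopt(PACKET_RX_RING/PACKET_TX_RING). The kernel function packet_set_ring handles the request and validates its four fields (tp_block_size, tp_block_nr, tp_frame_size, tp_frame_nr). The requirements and how they relate to each other are as follows.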

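1) The block size tp_block_size must be aligned to the page size, that is, an integer multiple of it, and each block must be able to hold at least one frame. In practice tp_block_size should also be a power-of-two multiple of the page size (2, 4, 8 pages and so on): the code below only checks for a page multiple, but the pages of a block are allocated by order (get_order), which rounds up to a power of two.

2) The frame size tp_frame_size must be aligned to 16 bytes (TPACKET_ALIGNMENT) and must not be too small: it has to be at least the TPACKET header length plus tp_reserve.

3) The number of blocks tp_block_nr multiplied by the number of frames each block holds must equal the total number of frames tp_frame_nr.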
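
The three rules can be mirrored in user space before calling setsockopt. The following is a sketch only; ring_req_ok is a hypothetical helper, and hdrlen and page_size are parameters the caller must supply (for example TPACKET_HDRLEN plus any reserve, and sysconf(_SC_PAGESIZE)).

```c
#include <linux/if_packet.h>

/*
 * Hypothetical user-space mirror of the sanity checks described above.
 * hdrlen:    per-version header length plus any tp_reserve
 * page_size: system page size (must be a power of two)
 * Returns 1 if the request would pass these checks, 0 otherwise.
 */
int ring_req_ok(const struct tpacket_req *req, unsigned int hdrlen,
                unsigned int page_size)
{
    unsigned int frames_per_block;

    if ((int)req->tp_block_size <= 0)
        return 0;
    if (req->tp_block_size & (page_size - 1))          /* page multiple */
        return 0;
    if (req->tp_frame_size == 0 || req->tp_frame_size < hdrlen)
        return 0;                                      /* room for header */
    if (req->tp_frame_size & (TPACKET_ALIGNMENT - 1))  /* 16-byte aligned */
        return 0;

    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)                         /* frame > block */
        return 0;
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```

With tp_block_size = 4096, tp_frame_size = 2048 and tp_block_nr = 4, the only accepted tp_frame_nr is 8.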

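Once the checks pass, the kernel allocates pages according to tp_block_size and tp_block_nr and stores the result in the rx_ring member (a struct packet_ring_buffer) of the packet_sock. Finally, it switches the packet receive function to tpacket_rcv, which delivers received packets into the ring buffer.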

static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
                           int closing, int tx_ring)
{
    struct pgv *pg_vec = NULL;
    struct packet_sock *po = pkt_sk(sk);
    int was_running, order = 0;
    struct packet_ring_buffer *rb;
    struct sk_buff_head *rb_queue;
    __be16 num;
    int err = -EINVAL;
    /* Added to avoid minimal code churn */
    struct tpacket_req *req = &req_u->req;

    /* Opening a Tx-ring is NOT supported in TPACKET_V3 */
    if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
        WARN(1, "Tx-ring is not supported.\n");
        goto out;
    }

    rb = tx_ring ? &po->tx_ring : &po->rx_ring;
    rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;

    err = -EBUSY;
    if (!closing) {
        if (atomic_read(&po->mapped))
            goto out;
        if (atomic_read(&rb->pending))
            goto out;
    }

    if (req->tp_block_nr) {
        /* Sanity tests and some calculations */
        err = -EBUSY;
        if (unlikely(rb->pg_vec))
            goto out;

        switch (po->tp_version) {
        case TPACKET_V1:
            po->tp_hdrlen = TPACKET_HDRLEN;
            break;
        case TPACKET_V2:
            po->tp_hdrlen = TPACKET2_HDRLEN;
            break;
        case TPACKET_V3:
            po->tp_hdrlen = TPACKET3_HDRLEN;
            break;
        }
        /*
         * Frame structure:
         * - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
         * - struct tpacket_hdr
         * - pad to TPACKET_ALIGNMENT=16
         * - struct sockaddr_ll
         * - Gap, chosen so that packet data (Start+tp_net) aligns to
         *   TPACKET_ALIGNMENT=16
         * - Start+tp_mac: [ Optional MAC header ]
         * - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
         * - Pad to align to TPACKET_ALIGNMENT=16
         */
        err = -EINVAL;
        if (unlikely((int)req->tp_block_size <= 0))
            goto out;
        /* block size must be a multiple of the page size */
        if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
            goto out;
        if (unlikely(req->tp_frame_size < po->tp_hdrlen +
                                          po->tp_reserve))
            goto out;
        /* tp_frame_size must be aligned to 16 bytes */
        if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
            goto out;

        rb->frames_per_block = req->tp_block_size/req->tp_frame_size;
        if (unlikely(rb->frames_per_block <= 0))
            goto out;
        /* tp_block_nr * frames per block must equal tp_frame_nr */
        if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
                     req->tp_frame_nr))
            goto out;

        err = -ENOMEM;
        order = get_order(req->tp_block_size);
        /* allocate tp_block_nr blocks of tp_block_size bytes each */
        pg_vec = alloc_pg_vec(req, order);
        if (unlikely(!pg_vec))
            goto out;
        switch (po->tp_version) {
        case TPACKET_V3:
            /* Transmit path is not supported. We checked
             * it above but just being paranoid
             */
            if (!tx_ring)
                init_prb_bdqc(po, rb, pg_vec, req_u, tx_ring);
            break;
        default:
            break;
        }
    }
    /* Done */
    else {
        err = -EINVAL;
        if (unlikely(req->tp_frame_nr))
            goto out;
    }

    lock_sock(sk);

    /* Detach socket from network */
    spin_lock(&po->bind_lock);
    was_running = po->running;
    num = po->num;
    if (was_running) {
        po->num = 0;
        __unregister_prot_hook(sk, false);
    }
    spin_unlock(&po->bind_lock);

    synchronize_net();

    err = -EBUSY;
    mutex_lock(&po->pg_vec_lock);
    if (closing || atomic_read(&po->mapped) == 0) {
        err = 0;
        spin_lock_bh(&rb_queue->lock);
        swap(rb->pg_vec, pg_vec);
        rb->frame_max = (req->tp_frame_nr - 1);
        rb->head = 0;
        rb->frame_size = req->tp_frame_size;
        spin_unlock_bh(&rb_queue->lock);

        swap(rb->pg_vec_order, order);
        swap(rb->pg_vec_len, req->tp_block_nr);

        rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
        /* switch the packet receive handler */
        po->prot_hook.func = (po->rx_ring.pg_vec) ?
                             tpacket_rcv : packet_rcv;
        skb_queue_purge(rb_queue);
        if (atomic_read(&po->mapped))
            pr_err("packet_mmap: vma is busy: %d\n",
                   atomic_read(&po->mapped));
    }
    mutex_unlock(&po->pg_vec_lock);

    spin_lock(&po->bind_lock);
    if (was_running) {
        po->num = num;
        register_prot_hook(sk);
    }
    spin_unlock(&po->bind_lock);
    if (closing && (po->tp_version > TPACKET_V2)) {
        /* Because we don't support block-based V3 on tx-ring */
        if (!tx_ring)
            prb_shutdown_retire_blk_timer(po, tx_ring, rb_queue);
    }
    release_sock(sk);

    if (pg_vec)
        free_pg_vec(pg_vec, order, req->tp_block_nr);
out:
    return err;
}
/*
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------
In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very inefficient. It uses very limited buffers and requires one system call to capture each packet; it requires two if you want to get the packet's timestamp (like libpcap always does).
On the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size configurable circular buffer mapped in user space that can be used to either send or receive packets. This way reading packets just needs to wait for them, most of the time there is no need to issue a single system call. Concerning transmission, multiple packets can be sent through one system call to get the highest bandwidth. Using a shared buffer between the kernel and the user also has the benefit of minimizing packet copies.
It's fine to use PACKET_MMAP to improve the performance of the capture and transmission process, but it isn't everything. At least, if you are capturing at high speeds (this is relative to the cpu speed), you should check if the device driver of your network interface card supports some sort of interrupt load mitigation or (even better) if it supports NAPI, also make sure it is enabled. For transmission, check the MTU (Maximum Transmission Unit) used and supported by devices of your network.
--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
--------------------------------------------------------------------------------
From the user standpoint, you should use the higher level libpcap library, which is a de facto standard, portable across nearly all operating systems including Win32.
That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include support for PACKET_MMAP, and neither, probably, does the libpcap included in your distribution.
I'm aware of two implementations of PACKET_MMAP in libpcap:
    http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2)
    http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap)
The rest of this document is intended for people who want to understand the low level details or want to improve libpcap by including PACKET_MMAP support.
--------------------------------------------------------------------------------
+ How to use mmap() directly to improve capture process
--------------------------------------------------------------------------------
From the system calls standpoint, the use of PACKET_MMAP involves the following process:
[setup]     socket() -------> creation of the capture socket
            setsockopt() ---> allocation of the circular buffer (ring)
                              option: PACKET_RX_RING
            mmap() ---------> mapping of the allocated buffer to the
                              user process
[capture]   poll() ---------> to wait for incoming packets
[shutdown]  close() --------> destruction of the capture socket and
                              deallocation of all associated resources.
Socket creation and destruction is straightforward, and is done the same way with or without PACKET_MMAP:
    int fd;
    fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
where mode is SOCK_RAW for the raw interface where link level information can be captured or SOCK_DGRAM for the cooked interface where link level information capture is not supported and a link level pseudo-header is provided by the kernel.
The destruction of the socket and all associated resources is done by a simple call to close(fd).
Next I will describe PACKET_MMAP settings and its constraints, also the mapping of the circular buffer in the user process and the use of this buffer.
--------------------------------------------------------------------------------
+ How to use mmap() directly to improve transmission process
--------------------------------------------------------------------------------
Transmission process is similar to capture as shown below.
[setup]          socket() -------> creation of the transmission socket
                 setsockopt() ---> allocation of the circular buffer (ring)
                                   option: PACKET_TX_RING
                 bind() ---------> bind transmission socket with a network interface
                 mmap() ---------> mapping of the allocated buffer to the
                                   user process
[transmission]   poll() ---------> wait for free packets (optional)
                 send() ---------> send all packets that are set as ready in
                                   the ring
                                   The flag MSG_DONTWAIT can be used to return
                                   before end of transfer.
[shutdown]       close() --------> destruction of the transmission socket and
                                   deallocation of all associated resources.
Binding the socket to your network interface is mandatory (with zero copy) to know the header size of frames used in the circular buffer.
As capture, each frame contains two parts:
 --------------------
| struct tpacket_hdr | Header. It contains the status of
|                    | this frame
|--------------------|
| data buffer        |
.                    .  Data that will be sent over the network interface.
.                    .
 --------------------
bind() associates the socket to your network interface thanks to sll_ifindex parameter of struct sockaddr_ll.
Initialization example:
    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...
    strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
    /* fill sockaddr_ll struct to prepare binding */
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex = s_ifr.ifr_ifindex;
    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
A complete tutorial is available at: http://wiki.gnu-log.net/
--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
Setting up PACKET_MMAP from user level code is done with a call like
 - Capture process
     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
 - Transmission process
     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
The most significant argument in the previous call is the req parameter, this parameter must have the following structure:
    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };
This structure is defined in /usr/include/linux/if_packet.h and establishes a circular buffer (ring) of unswappable memory. Being mapped in the capture process allows reading the captured frames and related meta-information like timestamps without requiring a system call.
Frames are grouped in blocks. Each block is a physically contiguous region of memory and holds tp_block_size/tp_frame_size frames. The total number of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
    frames_per_block = tp_block_size/tp_frame_size
indeed, packet_set_ring checks that the following condition is true
    frames_per_block * tp_block_nr == tp_frame_nr
Let's see an example, with the following values:
    tp_block_size= 4096
    tp_frame_size= 2048
    tp_block_nr  = 4
    tp_frame_nr  = 8
we will get the following buffer structure:
        block #1                 block #2
+---------+---------+    +---------+---------+
| frame 1 | frame 2 |    | frame 3 | frame 4 |
+---------+---------+    +---------+---------+

        block #3                 block #4
+---------+---------+    +---------+---------+
| frame 5 | frame 6 |    | frame 7 | frame 8 |
+---------+---------+    +---------+---------+
A frame can be of any size with the only condition it can fit in a block. A block can only hold an integer number of frames, or in other words, a frame cannot span two blocks, so there are some details you have to take into account when choosing the frame_size. See "Mapping and use of the circular buffer (ring)".
Currently, this structure is a vector called pg_vec, dynamically allocated with kmalloc; its size limits the number of blocks that can be allocated.
    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1
kmalloc allocates any number of bytes of physically contiguous memory from a pool of pre-determined sizes. This pool of memory is maintained by the slab allocator, which is ultimately responsible for doing the allocation and hence imposes the maximum memory that kmalloc can allocate.
++ Transmission process
Those defines are also used for transmission:
    #define TP_STATUS_AVAILABLE        0 // Frame is available
    #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
    #define TP_STATUS_SENDING          2 // Frame is currently in transmission
    #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct
First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a packet, the user fills a data buffer of an available frame, sets tp_len to current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is ready to transmit, it calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel updates each status of sent frames with TP_STATUS_SENDING until the end of transfer. At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);
The user can also use poll() to check if a buffer is available: (status == TP_STATUS_SENDING)
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
The PACKET_TIMESTAMP setting determines the source of the timestamp in the packet meta information. If your NIC is capable of timestamping packets in hardware, you can request those hardware timestamps to be used. Note: you may need to enable the generation of hardware timestamps with SIOCSHWTSTAMP.
PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
    int req = 0;
    req |= SOF_TIMESTAMPING_SYS_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
If PACKET_TIMESTAMP is not set, a software timestamp generated inside the networking stack is used (the behavior before this setting was added).
*/
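
The transmission steps in the quoted documentation (fill an available frame, set tp_len, mark it TP_STATUS_SEND_REQUEST, then call send()) might be sketched as below for a TPACKET_V1 TX ring. tx_queue_frame is a hypothetical helper, and the payload offset it uses (TPACKET_HDRLEN minus sizeof(struct sockaddr_ll)) is an assumption about the layout the kernel of this era expects; verify it against the kernel you target.

```c
#include <string.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper for a TPACKET_V1 TX ring. `frame` points at one
 * frame slot and `frame_size` is tp_frame_size.
 * Assumption: the payload starts at TPACKET_HDRLEN - sizeof(sockaddr_ll).
 * Returns 0 if the frame was queued, -1 if it is busy or too small.
 */
int tx_queue_frame(void *frame, unsigned int frame_size,
                   const void *payload, unsigned int len)
{
    struct tpacket_hdr *h = frame;
    unsigned int off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    if (h->tp_status != TP_STATUS_AVAILABLE)
        return -1;                     /* kernel still owns this frame */
    if (off > frame_size || len > frame_size - off)
        return -1;                     /* payload does not fit */

    memcpy((char *)frame + off, payload, len);
    h->tp_len = len;
    __sync_synchronize();              /* publish data before the status */
    h->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

After queuing one or more frames this way, the caller would trigger transmission with send(fd, NULL, 0, 0), as in the excerpt.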

To access the kernel's receive ring buffer, user space must map it into its own address space with mmap:

mmapbuf = mmap(0, mmapbuflen, PROT_READ|PROT_WRITE, MAP_SHARED, sk, 0);
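
The mapping length in the call above is the block size times the number of blocks. A minimal sketch, where ring_map_len and map_ring are hypothetical helper names:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <linux/if_packet.h>

/* Total mapping length: the blocks are laid out back to back. */
size_t ring_map_len(const struct tpacket_req *req)
{
    return (size_t)req->tp_block_size * req->tp_block_nr;
}

/* Hypothetical helper: map the ring of `fd`; returns MAP_FAILED on error. */
void *map_ring(int fd, const struct tpacket_req *req)
{
    return mmap(NULL, ring_map_len(req), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}
```

The same length must later be passed to munmap when the ring is torn down.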

Receiving frames

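Where in the shared ring buffer should a newly received frame be placed? The function packet_lookup_frame computes this. Its position argument is the head index of the available frame slots kept in the ring buffer (rx_ring.head). From this index it derives the page-vector index (the block index) and the frame offset within that block, which together give the address (h.raw) where the frame can be stored.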
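
User space can do the same arithmetic on the mmap()ed area. The sketch below assumes the blocks are mapped contiguously in block order; frame_offset is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Hypothetical user-space counterpart of the lookup: translate a frame
 * index into a byte offset inside the mapped ring.
 */
size_t frame_offset(unsigned int position, unsigned int block_size,
                    unsigned int frame_size)
{
    unsigned int frames_per_block = block_size / frame_size;
    unsigned int block = position / frames_per_block;  /* block index */
    unsigned int slot  = position % frames_per_block;  /* frame in block */

    return (size_t)block * block_size + (size_t)slot * frame_size;
}
```

Note that when frame_size does not divide block_size evenly, the tail of each block is unused, so frame i is not simply at i * frame_size.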

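The kernel and user space synchronize their access to the ring buffer through the tp_status field of tpacket_hdr, specifically its lowest bit. When the bit is 0 (TP_STATUS_KERNEL) the kernel owns the frame slot; when it is 1 (TP_STATUS_USER) user space owns it. After the kernel locates a candidate slot with packet_lookup_frame, as described above, it calls __packet_get_status to check whether the slot is usable. If tp_status equals TP_STATUS_KERNEL the slot can be used; otherwise user space has not yet processed the frame stored there, which usually happens when the ring buffer is full.
After filling the slot, the kernel sets the synchronization bit of tp_status to TP_STATUS_USER and calls sk->sk_data_ready to notify user space that data is ready.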
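
The user-space half of this handshake could look like the following sketch for TPACKET_V1. consume_frame is a hypothetical helper, and __sync_synchronize() (a GCC/Clang builtin) stands in for whatever memory barrier the application prefers.

```c
#include <linux/if_packet.h>

/*
 * Hypothetical helper: if user space owns the frame, process it and hand
 * it back to the kernel. Returns 1 if a frame was consumed, 0 if the
 * kernel still owns the slot.
 */
int consume_frame(struct tpacket_hdr *h,
                  void (*handle)(const struct tpacket_hdr *))
{
    if (!(h->tp_status & TP_STATUS_USER))
        return 0;                      /* kernel has not filled it yet */

    __sync_synchronize();              /* read the payload after the status */
    if (handle)
        handle(h);

    __sync_synchronize();              /* finish reads before releasing */
    h->tp_status = TP_STATUS_KERNEL;
    return 1;
}
```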

static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
    struct sock *sk;
    struct packet_sock *po;
    struct sockaddr_ll *sll;
    union {
        struct tpacket_hdr *h1;
        struct tpacket2_hdr *h2;
        struct tpacket3_hdr *h3;
        void *raw;
    } h;
    u8 *skb_head = skb->data;
    int skb_len = skb->len;
    unsigned int snaplen, res;
    unsigned long status = TP_STATUS_USER;
    unsigned short macoff, netoff, hdrlen;
    struct sk_buff *copy_skb = NULL;
    struct timeval tv;
    struct timespec ts;
    struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);

    if (skb->pkt_type == PACKET_LOOPBACK)
        goto drop;

    sk = pt->af_packet_priv;
    po = pkt_sk(sk);

    if (!net_eq(dev_net(dev), sock_net(sk)))
        goto drop;

    if (dev->header_ops) {
        if (sk->sk_type != SOCK_DGRAM)
            skb_push(skb, skb->data - skb_mac_header(skb));
        else if (skb->pkt_type == PACKET_OUTGOING) {
            /* Special case: outgoing packets have ll header at head */
            skb_pull(skb, skb_network_offset(skb));
        }
    }

    if (skb->ip_summed == CHECKSUM_PARTIAL)
        status |= TP_STATUS_CSUMNOTREADY;

    snaplen = skb->len;

    res = run_filter(skb, sk, snaplen);
    if (!res)
        goto drop_n_restore;
    if (snaplen > res)
        snaplen = res;

    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                          po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                               (maclen < 16 ? 16 : maclen)) +
                 po->tp_reserve;
        macoff = netoff - maclen;
    }
    if (po->tp_version <= TPACKET_V2) {
        if (macoff + snaplen > po->rx_ring.frame_size) {
            if (po->copy_thresh &&
                atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) {
                if (skb_shared(skb)) {
                    copy_skb = skb_clone(skb, GFP_ATOMIC);
                } else {
                    copy_skb = skb_get(skb);
                    skb_head = skb->data;
                }
                if (copy_skb)
                    skb_set_owner_r(copy_skb, sk);
            }
            snaplen = po->rx_ring.frame_size - macoff;
            if ((int)snaplen < 0)
                snaplen = 0;
        }
    }
    spin_lock(&sk->sk_receive_queue.lock);
    h.raw = packet_current_rx_frame(po, skb,
                                    TP_STATUS_KERNEL, (macoff+snaplen));
    if (!h.raw)
        goto ring_is_full;
    if (po->tp_version <= TPACKET_V2) {
        packet_increment_rx_head(po, &po->rx_ring);
        /*
         * LOSING will be reported till you read the stats,
         * because it's COR - Clear On Read.
         * Anyways, moving it for V1/V2 only as V3 doesn't need this
         * at packet level.
         */
        if (po->stats.tp_drops)
            status |= TP_STATUS_LOSING;
    }
    po->stats.tp_packets++;
    if (copy_skb) {
        status |= TP_STATUS_COPY;
        __skb_queue_tail(&sk->sk_receive_queue, copy_skb);
    }
    spin_unlock(&sk->sk_receive_queue.lock);

    skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

    switch (po->tp_version) {
    case TPACKET_V1:
        h.h1->tp_len = skb->len;
        h.h1->tp_snaplen = snaplen;
        h.h1->tp_mac = macoff;
        h.h1->tp_net = netoff;
        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)
                && shhwtstamps->syststamp.tv64)
            tv = ktime_to_timeval(shhwtstamps->syststamp);
        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)
                && shhwtstamps->hwtstamp.tv64)
            tv = ktime_to_timeval(shhwtstamps->hwtstamp);
        else if (skb->tstamp.tv64)
            tv = ktime_to_timeval(skb->tstamp);
        else
            do_gettimeofday(&tv);
        h.h1->tp_sec = tv.tv_sec;
        h.h1->tp_usec = tv.tv_usec;
        hdrlen = sizeof(*h.h1);
        break;
    case TPACKET_V2:
        h.h2->tp_len = skb->len;
        h.h2->tp_snaplen = snaplen;
        h.h2->tp_mac = macoff;
        h.h2->tp_net = netoff;
        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)
                && shhwtstamps->syststamp.tv64)
            ts = ktime_to_timespec(shhwtstamps->syststamp);
        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)
                && shhwtstamps->hwtstamp.tv64)
            ts = ktime_to_timespec(shhwtstamps->hwtstamp);
        else if (skb->tstamp.tv64)
            ts = ktime_to_timespec(skb->tstamp);
        else
            getnstimeofday(&ts);
        h.h2->tp_sec = ts.tv_sec;
        h.h2->tp_nsec = ts.tv_nsec;
        if (vlan_tx_tag_present(skb)) {
            h.h2->tp_vlan_tci = vlan_tx_tag_get(skb);
            status |= TP_STATUS_VLAN_VALID;
        } else {
            h.h2->tp_vlan_tci = 0;
        }
        h.h2->tp_padding = 0;
        hdrlen = sizeof(*h.h2);
        break;
    case TPACKET_V3:
        /* tp_nxt_offset,vlan are already populated above.
         * So DONT clear those fields here
         */
        h.h3->tp_status |= status;
        h.h3->tp_len = skb->len;
        h.h3->tp_snaplen = snaplen;
        h.h3->tp_mac = macoff;
        h.h3->tp_net = netoff;
        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)
                && shhwtstamps->syststamp.tv64)
            ts = ktime_to_timespec(shhwtstamps->syststamp);
        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)
                && shhwtstamps->hwtstamp.tv64)
            ts = ktime_to_timespec(shhwtstamps->hwtstamp);
        else if (skb->tstamp.tv64)
            ts = ktime_to_timespec(skb->tstamp);
        else
            getnstimeofday(&ts);
        h.h3->tp_sec = ts.tv_sec;
        h.h3->tp_nsec = ts.tv_nsec;
        hdrlen = sizeof(*h.h3);
        break;
    default:
        BUG();
    }

    sll = h.raw + TPACKET_ALIGN(hdrlen);
    sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
    sll->sll_family = AF_PACKET;
    sll->sll_hatype = dev->type;
    sll->sll_protocol = skb->protocol;
    sll->sll_pkttype = skb->pkt_type;
    if (unlikely(po->origdev))
        sll->sll_ifindex = orig_dev->ifindex;
    else
        sll->sll_ifindex = dev->ifindex;

    smp_mb();
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
    {
        u8 *start, *end;

        if (po->tp_version <= TPACKET_V2) {
            end = (u8 *)PAGE_ALIGN((unsigned long)h.raw
                                   + macoff + snaplen);
            for (start = h.raw; start < end; start += PAGE_SIZE)
                flush_dcache_page(pgv_to_page(start));
        }
        smp_wmb();
    }
#endif
    if (po->tp_version <= TPACKET_V2)
        __packet_set_status(po, h.raw, status);
    else
        prb_clear_blk_fill_status(&po->rx_ring);

    sk->sk_data_ready(sk, 0);

drop_n_restore:
    if (skb_head != skb->data && skb_shared(skb)) {
        skb->data = skb_head;
        skb->len = skb_len;
    }
drop:
    kfree_skb(skb);
    return 0;

ring_is_full:
    po->stats.tp_drops++;
    spin_unlock(&sk->sk_receive_queue.lock);

    sk->sk_data_ready(sk, 0);
    kfree_skb(copy_skb);
    goto drop_n_restore;
}

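As far as can be seen, PACKET_MMAP only provides zero copy between the kernel and user space; inside the kernel there is still one copy between the DMA buffer and the ring buffer (the skb_copy_bits call in tpacket_rcv).

PF_RING, through DNA, supports true zero copy. How it is implemented needs further study (RTFS).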

---------------------------------------------------------------------------------
Related discussion threads

netmap, a similar technique that originated in BSD

Example:

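#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <poll.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

/* Globals and helpers the original listing used but did not show;
 * they are reconstructed here so that the program compiles. */
static int fd = -1;
static char *map;
static struct tpacket_req req;
static struct iovec *ring;
static const char *names[] = {
    "host", "broadcast", "multicast", "otherhost", "outgoing"
};

static void sigproc(int sig)
{
    (void)sig;
    _exit(0);   /* the kernel releases the socket and the mapping */
}

int main(int argc, char **argv)
{
    struct pollfd pfd;
    struct sockaddr_ll addr;
    unsigned int i;

    (void)argc;
    (void)argv;
    signal(SIGINT, sigproc);

    /* Open the packet socket */
    if ((fd = socket(PF_PACKET, SOCK_DGRAM, 0)) < 0) {
        perror("socket()");
        return 1;
    }

    /* Setup the fd for mmap() ring buffer */
    req.tp_block_size = 4096;
    req.tp_frame_size = 1024;
    req.tp_block_nr = 64;
    req.tp_frame_nr = 4 * 64;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) != 0) {
        perror("setsockopt()");
        close(fd);
        return 1;
    }

    /* mmap() the ring */
    map = mmap(NULL, req.tp_block_size * req.tp_block_nr,
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap()");
        close(fd);
        return 1;
    }

    /* Setup our ringbuffer: one iovec per frame. Frames are contiguous
     * here because tp_block_size is a multiple of tp_frame_size. */
    ring = malloc(req.tp_frame_nr * sizeof(struct iovec));
    if (!ring) {
        perror("malloc()");
        close(fd);
        return 1;
    }
    for (i = 0; i < req.tp_frame_nr; i++) {
        ring[i].iov_base = map + i * req.tp_frame_size;
        ring[i].iov_len = req.tp_frame_size;
    }

    /* bind the packet socket */
    memset(&addr, 0, sizeof(addr));
    addr.sll_family = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr))) {
        munmap(map, req.tp_block_size * req.tp_block_nr);
        perror("bind()");
        close(fd);
        return 1;
    }

    for (i = 0;;) {
        while (*(volatile unsigned long *)ring[i].iov_base) {
            struct tpacket_hdr *h = ring[i].iov_base;
            struct sockaddr_ll *sll =
                (void *)((char *)h + TPACKET_ALIGN(sizeof(*h)));
            unsigned char *bp = (unsigned char *)h + h->tp_mac;
            unsigned int t = sll->sll_pkttype;

            (void)bp;   /* start of the packet data, unused here */
            printf("%u.%.6u: if%d %s %u bytes\n",
                   h->tp_sec, h->tp_usec, sll->sll_ifindex,
                   t < sizeof(names) / sizeof(names[0]) ? names[t] : "?",
                   h->tp_len);

            /* tell the kernel this packet is done with */
            __sync_synchronize();   /* memory barrier */
            h->tp_status = TP_STATUS_KERNEL;

            i = (i == req.tp_frame_nr - 1) ? 0 : i + 1;
        }

        /* Sleep when nothing is happening */
        pfd.fd = fd;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;
        poll(&pfd, 1, -1);
    }

    return 0;
}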

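packet_poll analysis:

1. datagram_poll returns the event mask for the regular receive queue (mask1).

2. If a ring has been set up for mmap, the RX ring frames are checked as well, which yields mask2.

The final return value is mask1 | mask2.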
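
From user space, both sources of readiness are observed through a single poll() call. A minimal sketch, where wait_readable is a hypothetical helper name:

```c
#include <poll.h>

/*
 * Hypothetical helper: wait until the packet socket reports readable
 * data, whether from queued skbs or from a ring frame owned by user space.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLERR, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A capture loop would typically drain every frame marked TP_STATUS_USER first and call this only when the next frame still belongs to the kernel.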

/**
 * datagram_poll - generic datagram poll
 * @file: file struct
 * @sock: socket
 * @wait: poll table
 *
 * Datagram poll: Again totally generic. This also handles
 * sequenced packet sockets providing the socket receive queue
 * is only ever holding data ready to receive.
 *
 * Note: when you _don't_ use this routine for this protocol,
 * and you use a different write policy from sock_writeable()
 * then please supply your own write_space callback.
 */
unsigned int datagram_poll(struct file *file, struct socket *sock,
                           poll_table *wait)
{
    struct sock *sk = sock->sk;
    unsigned int mask;

    /* if wait is NULL, its callback is not executed */
    sock_poll_wait(file, sk_sleep(sk), wait);
    mask = 0;

    /* exceptional events? */
    if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
        mask |= POLLERR;
    if (sk->sk_shutdown & RCV_SHUTDOWN)
        mask |= POLLRDHUP | POLLIN | POLLRDNORM;
    if (sk->sk_shutdown == SHUTDOWN_MASK)
        mask |= POLLHUP;

    /* readable? */
    if (!skb_queue_empty(&sk->sk_receive_queue))
        mask |= POLLIN | POLLRDNORM;

    /* Connection-based need to check for termination and startup */
    if (connection_based(sk)) {
        if (sk->sk_state == TCP_CLOSE)
            mask |= POLLHUP;
        /* connection hasn't started yet? */
        if (sk->sk_state == TCP_SYN_SENT)
            return mask;
    }

    /* writable? */
    if (sock_writeable(sk))
        mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
    else
        set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

    return mask;
}

static unsigned int packet_poll(struct file *file, struct socket *sock,
                                poll_table *wait)
{
    struct sock *sk = sock->sk;
    struct packet_sock *po = pkt_sk(sk);
    unsigned int mask = datagram_poll(file, sock, wait);

    spin_lock_bh(&sk->sk_receive_queue.lock);
    if (po->rx_ring.pg_vec) {
        if (!packet_previous_rx_frame(po, &po->rx_ring,
                                      TP_STATUS_KERNEL))
            mask |= POLLIN | POLLRDNORM;
    }
    spin_unlock_bh(&sk->sk_receive_queue.lock);
    spin_lock_bh(&sk->sk_write_queue.lock);
    if (po->tx_ring.pg_vec) {
        if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE))
            mask |= POLLOUT | POLLWRNORM;
    }
    spin_unlock_bh(&sk->sk_write_queue.lock);
    return mask;
}
