PACKET socket creation

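The kernel function packet_create creates PF_PACKET sockets. Its sock->type argument selects the working mode: SOCK_PACKET selects the first (legacy) mode, while SOCK_DGRAM or SOCK_RAW selects the second.

The kernel gives each mode its own operations table and its own packet receive function: the second mode uses the packet_ops table, while the first uses packet_ops_spkt.

The receive function is packet_rcv for the second mode and packet_rcv_spkt for the first.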

/**    Attach a protocol block

     */

    spin_lock_init(&po->bind_lock);

    mutex_init(&po->pg_vec_lock);

    po->prot_hook.func = packet_rcv;

    if (sock->type == SOCK_PACKET)

        po->prot_hook.func = packet_rcv_spkt;

    po->prot_hook.af_packet_priv = sk;

socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

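Besides the ordinary path that copies each packet between the kernel and user space, a PF_PACKET socket of type SOCK_DGRAM or SOCK_RAW can set up a receive ring buffer with the setsockopt system call and share that memory with the application through mmap, which avoids copying each packet across the kernel/user boundary. In that case the packet's socket address information is no longer delivered to user space through recvfrom/recvmsg, so the kernel has to store it alongside the packet data. Per-packet information such as the timestamp and VLAN tag, together with the ring management state, also has to be exchanged between the kernel and user space. A dedicated structure is therefore needed, and the kernel defines TPACKET_HEADER to hold this information.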

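TPACKET_HEADER currently exists in three versions whose lengths differ slightly. User space selects the version it wants with setsockopt(PACKET_VERSION) and can query the header length of a given version with getsockopt(PACKET_HDRLEN); that length is needed when sizing the receive ring.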

    enum tpacket_versions {

        TPACKET_V1,

        TPACKET_V2,

        TPACKET_V3

    };
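The relation between the three versions and their header lengths can be inspected from user space with the definitions in <linux/if_packet.h>. The following is only a sketch: the helper name tpacket_lens is hypothetical. It reports two values per version, the bare size of the header structure and the TPACKETn_HDRLEN macro (aligned header plus struct sockaddr_ll) that packet_set_ring stores in po->tp_hdrlen. Which of the two getsockopt(PACKET_HDRLEN) returns on a given kernel is an assumption to verify; to my reading of the kernel source it is the bare structure size.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: for a TPACKET version, report
 *   *struct_len  - sizeof the per-frame header structure
 *   *ring_hdrlen - the TPACKETn_HDRLEN macro, i.e. the aligned header
 *                  followed by struct sockaddr_ll
 * Returns 0 on success, -1 for an unknown version.
 */
static int tpacket_lens(int version, size_t *struct_len, size_t *ring_hdrlen)
{
    assert(struct_len != NULL && ring_hdrlen != NULL);

    switch (version) {
    case TPACKET_V1:
        *struct_len  = sizeof(struct tpacket_hdr);
        *ring_hdrlen = TPACKET_HDRLEN;
        return 0;
    case TPACKET_V2:
        *struct_len  = sizeof(struct tpacket2_hdr);
        *ring_hdrlen = TPACKET2_HDRLEN;
        return 0;
    case TPACKET_V3:
        *struct_len  = sizeof(struct tpacket3_hdr);
        *ring_hdrlen = TPACKET3_HDRLEN;
        return 0;
    default:
        return -1;
    }
}
```

The exact numbers depend on the architecture (struct tpacket_hdr starts with an unsigned long), so the sketch deliberately avoids hard-coding them.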

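User space configures the ring with setsockopt(PACKET_RX_RING/PACKET_TX_RING). The kernel function packet_set_ring handles the request and validates the four fields of struct tpacket_req (tp_block_size, tp_block_nr, tp_frame_size, tp_frame_nr). The requirements, and how the fields relate to each other, are as follows.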

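1) The block size tp_block_size must be page aligned, that is, a multiple of the page size, and each block must be able to hold at least one frame. The code below only enforces the page-multiple check, but because blocks are allocated by page order (get_order), a power-of-two number of pages (1, 2, 4, 8, ...) is the natural choice.

2) The frame size tp_frame_size must be aligned to 16 bytes (TPACKET_ALIGNMENT) and must not be too small: it has to be at least the TPACKET header length plus tp_reserve.

3) The number of blocks tp_block_nr multiplied by the number of frames each block holds must equal the total number of frames tp_frame_nr.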
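The checks listed above can be restated as a small user-space pre-check, useful for validating a request before calling setsockopt. This is a sketch that mirrors the sanity tests in packet_set_ring, not the kernel code itself: the function name ring_req_valid and its parameters are hypothetical, and the caller is assumed to pass the header length of the chosen version, the PACKET_RESERVE headroom, and the system page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical pre-check mirroring the sanity tests in packet_set_ring.
 *   hdrlen    - per-version header length (e.g. TPACKET2_HDRLEN)
 *   reserve   - headroom requested with PACKET_RESERVE (0 if unused)
 *   page_size - value of sysconf(_SC_PAGESIZE), a power of two
 */
static bool ring_req_valid(const struct tpacket_req *req,
                           unsigned int hdrlen, unsigned int reserve,
                           unsigned int page_size)
{
    assert(req != NULL);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (req->tp_block_nr == 0)
        return false;               /* this sketch only covers ring creation */
    if ((int)req->tp_block_size <= 0)
        return false;
    if (req->tp_block_size & (page_size - 1))
        return false;               /* not a multiple of the page size */
    if (req->tp_frame_size == 0 || req->tp_frame_size < hdrlen + reserve)
        return false;               /* no room for the header */
    if (req->tp_frame_size & (TPACKET_ALIGNMENT - 1))
        return false;               /* not 16-byte aligned */

    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return false;               /* a block must hold at least one frame */

    /* tp_frame_nr is redundant and must match the other three fields */
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```

For example, a request with tp_block_size = 4096, tp_frame_size = 2048, tp_block_nr = 4 passes only when tp_frame_nr is 8.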

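Once validation passes, the kernel allocates the pages according to tp_block_size and tp_block_nr and stores the result in the rx_ring member (struct packet_ring_buffer) of the packet_sock. Finally it switches the packet receive function to tpacket_rcv, which delivers packets into the ring.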

static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,

        int closing, int tx_ring)

{

    struct pgv *pg_vec = NULL;

    struct packet_sock *po = pkt_sk(sk);

    int was_running, order = 0;

    struct packet_ring_buffer *rb;

    struct sk_buff_head *rb_queue;

    __be16 num;

    int err = -EINVAL;

    /* Added to avoid minimal code churn */

    struct tpacket_req *req = &req_u->req;

    /* Opening a Tx-ring is NOT supported in TPACKET_V3 */

    if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {

        WARN(1, "Tx-ring is not supported.\n");

        goto out;

    }

    rb = tx_ring ? &po->tx_ring : &po->rx_ring;

    rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;

    err = -EBUSY;

    if (!closing) {

        if (atomic_read(&po->mapped))

            goto out;

        if (atomic_read(&rb->pending))

            goto out;

    }

    if (req->tp_block_nr) {

        /* Sanity tests and some calculations */

        err = -EBUSY;

        if (unlikely(rb->pg_vec))

            goto out;

        switch (po->tp_version) {

        case TPACKET_V1:

            po->tp_hdrlen = TPACKET_HDRLEN;

            break;

        case TPACKET_V2:

            po->tp_hdrlen = TPACKET2_HDRLEN;

            break;

        case TPACKET_V3:

            po->tp_hdrlen = TPACKET3_HDRLEN;

            break;

        }

        /*

           Frame structure:

           - Start. Frame must be aligned to TPACKET_ALIGNMENT=16

           - struct tpacket_hdr

           - pad to TPACKET_ALIGNMENT=16

           - struct sockaddr_ll

           - Gap, chosen so that packet data (Start+tp_net) alignes to TPACKET_ALIGNMENT=16

           - Start+tp_mac: [ Optional MAC header ]

           - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.

           - Pad to align to TPACKET_ALIGNMENT=16

         */

        err = -EINVAL;

        if (unlikely((int)req->tp_block_size <= 0))

            goto out;

        if (unlikely(req->tp_block_size & (PAGE_SIZE - 1))) // must be a multiple of the page size

            goto out;

        if (unlikely(req->tp_frame_size < po->tp_hdrlen +

                    po->tp_reserve))

            goto out;

        if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1))) // tp_frame_size must be 16-byte aligned

            goto out;

        rb->frames_per_block = req->tp_block_size/req->tp_frame_size;

        if (unlikely(rb->frames_per_block <= 0))

            goto out;

        // tp_block_nr times the number of frames per block must equal the total frame count tp_frame_nr

        if (unlikely((rb->frames_per_block * req->tp_block_nr) !=

                    req->tp_frame_nr))

            goto out;

        err = -ENOMEM;

        order = get_order(req->tp_block_size);

        pg_vec = alloc_pg_vec(req, order); // allocate tp_block_nr blocks of 2^order pages each

        if (unlikely(!pg_vec))

            goto out;

        switch (po->tp_version) {

        case TPACKET_V3:

        /* Transmit path is not supported. We checked

         * it above but just being paranoid

         */

            if (!tx_ring)

                init_prb_bdqc(po, rb, pg_vec, req_u, tx_ring);

                break;

        default:

            break;

        }

    }

    /* Done */

    else {

        err = -EINVAL;

        if (unlikely(req->tp_frame_nr))

            goto out;

    }

    lock_sock(sk);

    /* Detach socket from network */

    spin_lock(&po->bind_lock);

    was_running = po->running;

    num = po->num;

    if (was_running) {

        po->num = 0;

        __unregister_prot_hook(sk, false);

    }

    spin_unlock(&po->bind_lock);

    synchronize_net();

    err = -EBUSY;

    mutex_lock(&po->pg_vec_lock);

    if (closing || atomic_read(&po->mapped) == 0) {

        err = 0;

        spin_lock_bh(&rb_queue->lock);

        swap(rb->pg_vec, pg_vec);

        rb->frame_max = (req->tp_frame_nr - 1);

        rb->head = 0;

        rb->frame_size = req->tp_frame_size;

        spin_unlock_bh(&rb_queue->lock);

        swap(rb->pg_vec_order, order);

        swap(rb->pg_vec_len, req->tp_block_nr);

        rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;

        po->prot_hook.func = (po->rx_ring.pg_vec) ?

                        tpacket_rcv : packet_rcv; // switch the packet receive handler

        skb_queue_purge(rb_queue);

        if (atomic_read(&po->mapped))

            pr_err("packet_mmap: vma is busy: %d\n",

                   atomic_read(&po->mapped));

    }

    mutex_unlock(&po->pg_vec_lock);

    spin_lock(&po->bind_lock);

    if (was_running) {

        po->num = num;

        register_prot_hook(sk);

    }

    spin_unlock(&po->bind_lock);

    if (closing && (po->tp_version > TPACKET_V2)) {

        /* Because we don't support block-based V3 on tx-ring */

        if (!tx_ring)

            prb_shutdown_retire_blk_timer(po, tx_ring, rb_queue);

    }

    release_sock(sk);

    if (pg_vec)

        free_pg_vec(pg_vec, order, req->tp_block_nr);

out:

    return err;

}

/*

+ Why use PACKET_MMAP

--------------------------------------------------------------------------------

In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very inefficient. It uses very limited buffers and requires one system call to capture each packet; it requires two if you want to get the packet's timestamp (like libpcap always does).

On the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size configurable circular buffer mapped in user space that can be used to either send or receive packets. This way reading packets just needs to wait for them, most of the time there is no need to issue a single system call. Concerning transmission, multiple packets can be sent through one system call to get the highest bandwidth. Using a shared buffer between the kernel and the user also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and transmission process, but it isn't everything. At least, if you are capturing at high speeds (this is relative to the cpu speed), you should check if the device driver of your network interface card supports some sort of interrupt load mitigation or (even better) if it supports NAPI, also make sure it is enabled. For transmission, check the MTU (Maximum Transmission Unit) used and supported by devices of your network.

--------------------------------------------------------------------------------

+ How to use mmap() to improve capture process

--------------------------------------------------------------------------------

From the user standpoint, you should use the higher level libpcap library, which is a de facto standard, portable across nearly all operating systems including Win32.

That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include support for PACKET_MMAP, and probably neither does the libpcap included in your distribution.

I'm aware of two implementations of PACKET_MMAP in libpcap:

    http://wiki.ipxwarzone.com/             (by Simon Patarin, based on libpcap 0.6.2)

    http://public.lanl.gov/cpw/              (by Phil Wood, based on latest libpcap)

The rest of this document is intended for people who want to understand the low level details or want to improve libpcap by including PACKET_MMAP support.

--------------------------------------------------------------------------------

+ How to use mmap() directly to improve capture process

--------------------------------------------------------------------------------

From the system calls stand point, the use of PACKET_MMAP involves the following process:

[setup]     socket() -------> creation of the capture socket
            setsockopt() ---> allocation of the circular buffer (ring)
                              option: PACKET_RX_RING
            mmap() ---------> mapping of the allocated buffer to the
                              user process

[capture]   poll() ---------> to wait for incoming packets

[shutdown]  close() --------> destruction of the capture socket and
                              deallocation of all associated
                              resources.

socket creation and destruction is straight forward, and is done the same way with or without PACKET_MMAP:

int fd;

fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface where link level information can be captured or SOCK_DGRAM for the cooked interface where link level information capture is not supported and a link level pseudo-header is provided by the kernel.

The destruction of the socket and all associated resources is done by a simple call to close(fd).

Next I will describe PACKET_MMAP settings and its constraints, also the mapping of the circular buffer in the user process and the use of this buffer.

--------------------------------------------------------------------------------

+ How to use mmap() directly to improve transmission process

--------------------------------------------------------------------------------

Transmission process is similar to capture as shown below.

[setup]          socket() -------> creation of the transmission socket
                 setsockopt() ---> allocation of the circular buffer (ring)
                                   option: PACKET_TX_RING
                 bind() ---------> bind transmission socket with a network interface
                 mmap() ---------> mapping of the allocated buffer to the
                                   user process

[transmission]   poll() ---------> wait for free packets (optional)
                 send() ---------> send all packets that are set as ready in
                                   the ring
                                   The flag MSG_DONTWAIT can be used to return
                                   before end of transfer.

[shutdown]       close() --------> destruction of the transmission socket and
                                   deallocation of all associated resources.

Binding the socket to your network interface is mandatory (with zero copy) to know the header size of frames used in the circular buffer.

As with capture, each frame contains two parts:

 --------------------
| struct tpacket_hdr | Header. It contains the status
|                    | of this frame
|--------------------|
| data buffer        |
.                    .  Data that will be sent over the network interface.
.                    .
 --------------------

 bind() associates the socket to your network interface thanks to sll_ifindex parameter of struct sockaddr_ll.

 Initialization example:

 struct sockaddr_ll my_addr;

 struct ifreq s_ifr;

 ...

 strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));

 /* get interface index of eth0 */

 ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

 /* fill sockaddr_ll struct to prepare binding */

 my_addr.sll_family = AF_PACKET;

 my_addr.sll_protocol = htons(ETH_P_ALL);

 my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

 /* bind socket to eth0 */

 bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

 A complete tutorial is available at: http://wiki.gnu-log.net/

--------------------------------------------------------------------------------

+ PACKET_MMAP settings

--------------------------------------------------------------------------------

PACKET_MMAP is set up from user level code with a call like

 - Capture process
     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process
     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter, which must have the following structure:

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a circular buffer (ring) of unswappable memory. Being mapped in the capture process allows reading the captured frames and related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous region of memory and holds tp_block_size/tp_frame_size frames. The total number of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values:

     tp_block_size= 4096

     tp_frame_size= 2048

     tp_block_nr  = 4

     tp_frame_nr  = 8

we will get the following buffer structure:

        block #1                 block #2
+---------+---------+    +---------+---------+
| frame 1 | frame 2 |    | frame 3 | frame 4 |
+---------+---------+    +---------+---------+

        block #3                 block #4
+---------+---------+    +---------+---------+
| frame 5 | frame 6 |    | frame 7 | frame 8 |
+---------+---------+    +---------+---------+

A frame can be of any size with the only condition that it fits in a block. A block can only hold an integer number of frames, or in other words, a frame cannot be spanned across two blocks, so there are some details you have to take into account when choosing the frame_size. See "Mapping and use of the circular buffer (ring)".

Currently, this structure is a dynamically allocated vector with kmalloc called pg_vec; its size limits the number of blocks that can be allocated.

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from a pool of pre-determined sizes. This pool of memory is maintained by the slab allocator, which is ultimately responsible for doing the allocation and hence imposes the maximum memory that kmalloc can allocate.

++ Transmission process

Those defines are also used for transmission:

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a packet, the user fills a data buffer of an available frame, sets tp_len to current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is ready to transmit, it calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel updates each status of sent frames with TP_STATUS_SENDING until the end of transfer. At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.

    header->tp_len = in_i_size;

    header->tp_status = TP_STATUS_SEND_REQUEST;

    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available: (status == TP_STATUS_SENDING)

    struct pollfd pfd;

    pfd.fd = fd;

    pfd.revents = 0;

    pfd.events = POLLOUT;

    retval = poll(&pfd, 1, timeout);

-------------------------------------------------------------------------------

+ PACKET_TIMESTAMP

-------------------------------------------------------------------------------

The PACKET_TIMESTAMP setting determines the source of the timestamp in the packet meta information.  If your NIC is capable of timestamping packets in hardware, you can request those hardware timestamps to be used. Note: you may need to enable the generation of hardware timestamps with SIOCSHWTSTAMP.

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING.  However, only the SOF_TIMESTAMPING_SYS_HARDWARE and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by PACKET_TIMESTAMP.  SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.

    int req = 0;

    req |= SOF_TIMESTAMPING_SYS_HARDWARE;

    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

If PACKET_TIMESTAMP is not set, a software timestamp generated inside the networking stack is used (the behavior before this setting was added).

*/

To access the kernel's receive ring, user space must map it into its own address space with mmap:

mmapbuf = mmap(0, mmapbuflen, PROT_READ|PROT_WRITE, MAP_SHARED, sk, 0);
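The length passed to mmap and the address of a frame inside the mapping both follow from the tpacket_req fields. The sketch below shows one way to compute them for a single TPACKET_V1/V2 ring. The helper names ring_map_len and ring_frame_at are hypothetical, and the sketch assumes the blocks appear back to back in the mapped region, each tp_block_size bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical helper: number of bytes to map for one ring,
 * or 0 if the product does not fit in size_t. */
static size_t ring_map_len(const struct tpacket_req *req)
{
    assert(req != NULL);
    if (req->tp_block_nr != 0 &&
        req->tp_block_size > SIZE_MAX / req->tp_block_nr)
        return 0;
    return (size_t)req->tp_block_size * req->tp_block_nr;
}

/* Hypothetical helper: address of frame number `position` inside the
 * mapping that starts at ring_base. Frames never straddle a block, so
 * the tail of a block may stay unused. */
static void *ring_frame_at(void *ring_base, const struct tpacket_req *req,
                           unsigned int position)
{
    assert(ring_base != NULL && req != NULL);
    assert(req->tp_frame_size != 0 && position < req->tp_frame_nr);

    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    assert(frames_per_block != 0);

    unsigned int block  = position / frames_per_block; /* which block */
    unsigned int offset = position % frames_per_block; /* frame inside it */

    return (uint8_t *)ring_base + (size_t)block * req->tp_block_size
                                + (size_t)offset * req->tp_frame_size;
}
```

With tp_block_size = 4096 and tp_frame_size = 1536, each block holds two frames and the last 1024 bytes of every block are skipped, so frame 2 starts at offset 4096 rather than 3072.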

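Frame reception

Where in the shared ring should a newly received frame be placed? The function packet_lookup_frame computes this (tpacket_rcv below obtains the slot through packet_current_rx_frame). Its position argument is the head index of the free frame slots kept in the ring (rx_ring.head). From this index it derives the page-vector index (the block index) and the frame offset within the block, which gives the address where the frame can be stored (h.raw).

The kernel and user space synchronize their access to the ring through the tp_status field of tpacket_hdr, using its lowest bit. When the bit is 0 (TP_STATUS_KERNEL) the kernel owns the frame slot; when it is 1 (TP_STATUS_USER) user space owns it. After locating a slot with packet_lookup_frame, the kernel calls __packet_get_status to check whether the slot is usable: if tp_status equals TP_STATUS_KERNEL the slot can be filled, otherwise user space has not yet processed the frame stored there, which normally happens when the ring is full.
Once the slot has been filled, the kernel sets the synchronization bit of tp_status to TP_STATUS_USER and calls sk->sk_data_ready (sk->sk_data_ready(sk, 0) in the code below) to notify user space that data is ready.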
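The ownership handshake described above can be sketched from the user-space side as follows. This is an illustrative TPACKET_V1 consumer, not code from the kernel or from libpcap: the function name ring_consume is hypothetical, the ring is assumed to be a single block of back-to-back frames for brevity, and the GCC builtin __sync_synchronize stands in for whatever memory barrier a real application would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Hypothetical TPACKET_V1 consumer. `ring` points at frame_nr frames of
 * frame_size bytes each; *head is the next frame to inspect. Returns
 * the number of frames handed back to the kernel.
 */
static unsigned int ring_consume(void *ring, unsigned int frame_size,
                                 unsigned int frame_nr, unsigned int *head)
{
    assert(ring != NULL && head != NULL && frame_nr != 0);
    unsigned int done = 0;

    while (done < frame_nr) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)((uint8_t *)ring + (size_t)*head * frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;                  /* slot still owned by the kernel */

        __sync_synchronize();       /* read the payload only after the status */
        /* ... packet bytes start at (uint8_t *)hdr + hdr->tp_mac ... */

        __sync_synchronize();       /* finish reading before releasing the slot */
        hdr->tp_status = TP_STATUS_KERNEL;
        *head = (*head + 1) % frame_nr;
        done++;
    }
    return done;
}
```

When the function returns because the next slot is still kernel-owned, a real consumer would typically block in poll() on the socket and call it again afterwards.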

static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,

               struct packet_type *pt, struct net_device *orig_dev)

{

    struct sock *sk;

    struct packet_sock *po;

    struct sockaddr_ll *sll;

    union {

        struct tpacket_hdr *h1;

        struct tpacket2_hdr *h2;

        struct tpacket3_hdr *h3;

        void *raw;

    } h;

    u8 *skb_head = skb->data;

    int skb_len = skb->len;

    unsigned int snaplen, res;

    unsigned long status = TP_STATUS_USER;

    unsigned short macoff, netoff, hdrlen;

    struct sk_buff *copy_skb = NULL;

    struct timeval tv;

    struct timespec ts;

    struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);

    if (skb->pkt_type == PACKET_LOOPBACK)

        goto drop;

    sk = pt->af_packet_priv;

    po = pkt_sk(sk);

    if (!net_eq(dev_net(dev), sock_net(sk)))

        goto drop;

    if (dev->header_ops) {

        if (sk->sk_type != SOCK_DGRAM)

            skb_push(skb, skb->data - skb_mac_header(skb));

        else if (skb->pkt_type == PACKET_OUTGOING) {

            /* Special case: outgoing packets have ll header at head */

            skb_pull(skb, skb_network_offset(skb));

        }

    }

    if (skb->ip_summed == CHECKSUM_PARTIAL)

        status |= TP_STATUS_CSUMNOTREADY;

    snaplen = skb->len;

    res = run_filter(skb, sk, snaplen);

    if (!res)

        goto drop_n_restore;

    if (snaplen > res)

        snaplen = res;

    if (sk->sk_type == SOCK_DGRAM) {

        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +

                  po->tp_reserve;

    } else {

        unsigned int maclen = skb_network_offset(skb);

        netoff = TPACKET_ALIGN(po->tp_hdrlen +

                       (maclen < 16 ? 16 : maclen)) +

            po->tp_reserve;

        macoff = netoff - maclen;

    }

    if (po->tp_version <= TPACKET_V2) {

        if (macoff + snaplen > po->rx_ring.frame_size) {

            if (po->copy_thresh &&

                atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf) {

                if (skb_shared(skb)) {

                    copy_skb = skb_clone(skb, GFP_ATOMIC);

                } else {

                    copy_skb = skb_get(skb);

                    skb_head = skb->data;

                }

                if (copy_skb)

                    skb_set_owner_r(copy_skb, sk);

            }

            snaplen = po->rx_ring.frame_size - macoff;

            if ((int)snaplen < 0)

                snaplen = 0;

        }

    }

    spin_lock(&sk->sk_receive_queue.lock);

    h.raw = packet_current_rx_frame(po, skb,

                    TP_STATUS_KERNEL, (macoff+snaplen));

    if (!h.raw)

        goto ring_is_full;

    if (po->tp_version <= TPACKET_V2) {

        packet_increment_rx_head(po, &po->rx_ring);

    /*

     * LOSING will be reported till you read the stats,

     * because it's COR - Clear On Read.

     * Anyways, moving it for V1/V2 only as V3 doesn't need this

     * at packet level.

     */

        if (po->stats.tp_drops)

            status |= TP_STATUS_LOSING;

    }

    po->stats.tp_packets++;

    if (copy_skb) {

        status |= TP_STATUS_COPY;

        __skb_queue_tail(&sk->sk_receive_queue, copy_skb);

    }

    spin_unlock(&sk->sk_receive_queue.lock);

    skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

    switch (po->tp_version) {

    case TPACKET_V1:

        h.h1->tp_len = skb->len;

        h.h1->tp_snaplen = snaplen;

        h.h1->tp_mac = macoff;

        h.h1->tp_net = netoff;

        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)

                && shhwtstamps->syststamp.tv64)

            tv = ktime_to_timeval(shhwtstamps->syststamp);

        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)

                && shhwtstamps->hwtstamp.tv64)

            tv = ktime_to_timeval(shhwtstamps->hwtstamp);

        else if (skb->tstamp.tv64)

            tv = ktime_to_timeval(skb->tstamp);

        else

            do_gettimeofday(&tv);

        h.h1->tp_sec = tv.tv_sec;

        h.h1->tp_usec = tv.tv_usec;

        hdrlen = sizeof(*h.h1);

        break;

    case TPACKET_V2:

        h.h2->tp_len = skb->len;

        h.h2->tp_snaplen = snaplen;

        h.h2->tp_mac = macoff;

        h.h2->tp_net = netoff;

        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)

                && shhwtstamps->syststamp.tv64)

            ts = ktime_to_timespec(shhwtstamps->syststamp);

        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)

                && shhwtstamps->hwtstamp.tv64)

            ts = ktime_to_timespec(shhwtstamps->hwtstamp);

        else if (skb->tstamp.tv64)

            ts = ktime_to_timespec(skb->tstamp);

        else

            getnstimeofday(&ts);

        h.h2->tp_sec = ts.tv_sec;

        h.h2->tp_nsec = ts.tv_nsec;

        if (vlan_tx_tag_present(skb)) {

            h.h2->tp_vlan_tci = vlan_tx_tag_get(skb);

            status |= TP_STATUS_VLAN_VALID;

        } else {

            h.h2->tp_vlan_tci = 0;

        }

        h.h2->tp_padding = 0;

        hdrlen = sizeof(*h.h2);

        break;

    case TPACKET_V3:

        /* tp_nxt_offset,vlan are already populated above.

         * So DONT clear those fields here

         */

        h.h3->tp_status |= status;

        h.h3->tp_len = skb->len;

        h.h3->tp_snaplen = snaplen;

        h.h3->tp_mac = macoff;

        h.h3->tp_net = netoff;

        if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)

                && shhwtstamps->syststamp.tv64)

            ts = ktime_to_timespec(shhwtstamps->syststamp);

        else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)

                && shhwtstamps->hwtstamp.tv64)

            ts = ktime_to_timespec(shhwtstamps->hwtstamp);

        else if (skb->tstamp.tv64)

            ts = ktime_to_timespec(skb->tstamp);

        else

            getnstimeofday(&ts);

        h.h3->tp_sec  = ts.tv_sec;

        h.h3->tp_nsec = ts.tv_nsec;

        hdrlen = sizeof(*h.h3);

        break;

    default:

        BUG();

    }

    sll = h.raw + TPACKET_ALIGN(hdrlen);

    sll->sll_halen = dev_parse_header(skb, sll->sll_addr);

    sll->sll_family = AF_PACKET;

    sll->sll_hatype = dev->type;

    sll->sll_protocol = skb->protocol;

    sll->sll_pkttype = skb->pkt_type;

    if (unlikely(po->origdev))

        sll->sll_ifindex = orig_dev->ifindex;

    else

        sll->sll_ifindex = dev->ifindex;

    smp_mb();

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1

    {

        u8 *start, *end;

        if (po->tp_version <= TPACKET_V2) {

            end = (u8 *)PAGE_ALIGN((unsigned long)h.raw

                + macoff + snaplen);

            for (start = h.raw; start < end; start += PAGE_SIZE)

                flush_dcache_page(pgv_to_page(start));

        }

        smp_wmb();

    }

#endif

    if (po->tp_version <= TPACKET_V2)

        __packet_set_status(po, h.raw, status);

    else

        prb_clear_blk_fill_status(&po->rx_ring);

    sk->sk_data_ready(sk, 0);

drop_n_restore:

    if (skb_head != skb->data && skb_shared(skb)) {

        skb->data = skb_head;

        skb->len = skb_len;

    }

drop:

    kfree_skb(skb);

    return 0;

ring_is_full:

    po->stats.tp_drops++;

    spin_unlock(&sk->sk_receive_queue.lock);

    sk->sk_data_ready(sk, 0);

    kfree_skb(copy_skb);

    goto drop_n_restore;

}
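The macoff/netoff computation near the top of tpacket_rcv decides where the link-layer header and the network header land inside a frame. As a hedged restatement in user-space C (the function name frame_offsets and its parameters are hypothetical; the arithmetic is copied from the kernel code above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Hypothetical restatement of the offset computation in tpacket_rcv.
 *   hdrlen  - po->tp_hdrlen
 *   reserve - po->tp_reserve
 *   maclen  - link-layer header length (skb_network_offset)
 *   dgram   - true for SOCK_DGRAM sockets
 */
static void frame_offsets(unsigned int hdrlen, unsigned int reserve,
                          unsigned int maclen, bool dgram,
                          unsigned int *macoff, unsigned int *netoff)
{
    assert(macoff != NULL && netoff != NULL);

    if (dgram) {
        /* no link-layer header is stored: both offsets coincide */
        *macoff = *netoff = TPACKET_ALIGN(hdrlen) + 16 + reserve;
    } else {
        /* keep at least 16 bytes free, then align the network header */
        *netoff = TPACKET_ALIGN(hdrlen + (maclen < 16 ? 16 : maclen)) + reserve;
        *macoff = *netoff - maclen;
    }
}
```

For instance, with hdrlen = 52 and a 14-byte Ethernet header on a SOCK_RAW socket, the network header is placed at offset 80 and the MAC header at offset 66.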

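As far as can be seen, PACKET_MMAP only gives zero copy between the kernel and user space; inside the kernel there is still one copy between the DMA buffer and the ring buffer (on the receive side, the skb_copy_bits call in tpacket_rcv).

PF_RING, with DNA, supports true zero copy. How exactly it is implemented remains to be studied; RTFS.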

---------------------------------------------------------------------------------

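Analysis of packet_poll:

1. datagram_poll yields mask1, the events of the socket receive queue.

2. If an mmap ring is enabled, the RX ring frame is checked as well, which yields mask2 (the TX ring is checked in the same way for POLLOUT).

The return value is mask1 | mask2.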

/**

 *     datagram_poll - generic datagram poll

 *    @file: file struct

 *    @sock: socket

 *    @wait: poll table

 *

 *    Datagram poll: Again totally generic. This also handles

 *    sequenced packet sockets providing the socket receive queue

 *    is only ever holding data ready to receive.

 *

 *    Note: when you _don't_ use this routine for this protocol,

 *    and you use a different write policy from sock_writeable()

 *    then please supply your own write_space callback.

 */

unsigned int datagram_poll(struct file *file, struct socket *sock,

               poll_table *wait)

{

    struct sock *sk = sock->sk;

    unsigned int mask;

// if wait is NULL, its callback is not invoked

    sock_poll_wait(file, sk_sleep(sk), wait);

    mask = 0;

    /* exceptional events? */

    if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))

        mask |= POLLERR;

    if (sk->sk_shutdown & RCV_SHUTDOWN)

        mask |= POLLRDHUP | POLLIN | POLLRDNORM;

    if (sk->sk_shutdown == SHUTDOWN_MASK)

        mask |= POLLHUP;

    /* readable? */

    if (!skb_queue_empty(&sk->sk_receive_queue))

        mask |= POLLIN | POLLRDNORM;

    /* Connection-based need to check for termination and startup */

    if (connection_based(sk)) {

        if (sk->sk_state == TCP_CLOSE)

            mask |= POLLHUP;

        /* connection hasn't started yet? */

        if (sk->sk_state == TCP_SYN_SENT)

            return mask;

    }

    /* writable? */

    if (sock_writeable(sk))

        mask |= POLLOUT | POLLWRNORM | POLLWRBAND;

    else

        set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

    return mask;

}

static unsigned int packet_poll(struct file *file, struct socket *sock,

                poll_table *wait)

{

    struct sock *sk = sock->sk;

    struct packet_sock *po = pkt_sk(sk);

    unsigned int mask = datagram_poll(file, sock, wait);

    spin_lock_bh(&sk->sk_receive_queue.lock);

    if (po->rx_ring.pg_vec) {

        if (!packet_previous_rx_frame(po, &po->rx_ring,

            TP_STATUS_KERNEL))

            mask |= POLLIN | POLLRDNORM;

    }

    spin_unlock_bh(&sk->sk_receive_queue.lock);

    spin_lock_bh(&sk->sk_write_queue.lock);

    if (po->tx_ring.pg_vec) {

        if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE))

            mask |= POLLOUT | POLLWRNORM;

    }

    spin_unlock_bh(&sk->sk_write_queue.lock);

    return mask;

}
