EJFAT UDP Transmission Performance

From epsciwiki
Revision as of 14:22, 10 June 2022 by Timmer (talk | contribs)

UDP Performance Overview

This page is dedicated to researching methods to maximize reliable UDP transmission rates between nodes.

CURRENTLY UNDER CONSTRUCTION!


Transmission between indra-s1 and indra-s2

The following tests were run with 1 sender and 1 receiver. The sender was the packetBlaster program whose help gives the following output:


usage: ./packetBlaster
        [-h] [-v] [-ip6] [-sendnocp]
        [-bufdelay] (delay between each buffer, not packet)
        [-host <destination host (defaults to 127.0.0.1)>]
        [-p <destination UDP port>]
        [-i <outgoing interface name (e.g. eth0, currently only used to find MTU)>]
        [-mtu <desired MTU size>]
        [-t <tick>]
        [-ver <version>]
        [-id <data id>]
        [-pro <protocol>]
        [-e <entropy>]
        [-b <buffer size>]
        [-s <UDP send buffer size>]
        [-cores <comma-separated list of cores to run on>]
        [-tpre <tick prescale (1,2, ... tick increment each buffer sent)>]
        [-dpre <delay prescale (1,2, ... if -d defined, 1 delay for every prescale pkts/bufs)>]
        [-d <delay in microsec between packets>]

        EJFAT UDP packet sender that will packetize and send buffer repeatedly and get stats
        By default, data is copied into buffer and "send()" is used (connect is called).
        Using -sendnocp flag, data is sent using "send()" (connect called) and data copy minimized, but original data buffer changed


The blaster sent 89 kB buffers, each split into 10 packets, from indra-s1 to the load balancer (129.57.109.254, port 19522) with an MTU of 9000.
The sending thread was NOT tied to any specific core, and the entropy and data id were both set to 0:


./packetBlaster -host 129.57.109.254 -p 19522 -mtu 9000 -ver 2 -sendnocp -t 0 -id 0 -e 0 -b 89000
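
The packet count can be checked from the header sizes. The sketch below assumes IPv4 without options (20 bytes), the 8-byte UDP header, and that every packet also carries the 128-bit load balancer header and the 64-bit reassembly header described further down this page. The function name is illustrative and is not taken from packetBlaster, which may compute this differently.

```python
import math

IP_HEADER = 20    # IPv4 header without options (assumption)
UDP_HEADER = 8    # fixed UDP header size
LB_HEADER = 16    # 128-bit load balancer meta-data
RE_HEADER = 8     # 64-bit reassembly engine meta-data

def packets_per_buffer(buffer_bytes, mtu):
    """Return (packet count, max payload bytes per packet) for one buffer."""
    max_payload = mtu - IP_HEADER - UDP_HEADER - LB_HEADER - RE_HEADER
    return math.ceil(buffer_bytes / max_payload), max_payload

count, max_payload = packets_per_buffer(89000, 9000)
```

Under these assumptions an 89000-byte buffer at MTU 9000 needs 10 packets of at most 8948 payload bytes each, which agrees with the 10 packets quoted above.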


The receiver was the packetBlastee program whose help gives the following output:


usage: ./packetBlastee
        [-h] [-v] [-ip6]
        [-a <listening IP address (defaults to INADDR_ANY)>]
        [-p <listening UDP port>]
        [-b <internal buffer byte sizez>]
        [-r <UDP receive buffer byte size>]
        [-cores <comma-separated list of cores to run on>]
        [-tpre <tick prescale (1,2, ... expected tick increment for each buffer)>]

        This is an EJFAT UDP packet receiver made to work with packetBlaster.


The blastee was receiving on indra-s2. Initially the receiving thread was NOT tied to any specific core.
This program tracks the number of dropped packets. For this statistic to be accurate,
the value given to the -tpre (tick prescale) command line option must be identical for both sender & receiver. This
ensures that the receiver knows which tick is coming next.


./packetBlastee -p 17750
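
Why the tick prescale must match can be seen from a small sketch. It is not packetBlastee's actual accounting; it only illustrates how a receiver that knows the tick increment could estimate how many whole buffers were skipped between two received ticks. The function name and its handling of out-of-order ticks are assumptions.

```python
def missed_buffers(prev_tick, tick, tick_prescale=1):
    """Estimate how many buffers were skipped between two received ticks."""
    expected = prev_tick + tick_prescale   # tick the receiver expects next
    if tick < expected:
        return 0                           # duplicate or out-of-order tick, not counted here
    return (tick - expected) // tick_prescale
```

If the sender incremented the tick by 2 per buffer while the receiver assumed 1, every buffer would look like a gap, which is why both sides need the same value.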


The speed of data transfer depends on a number of factors:

  1. whether the sending thread was tied to a specific core or cores
  2. whether the receiving thread was tied to a specific core or cores
  3. whether a Linux ksoftirqd kernel thread was running and consuming significant CPU time
  4. which pair of terminals the sender & receiver are run from
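
Tying a thread to cores (factors 1 and 2) is what the -cores option of both programs requests. As a rough illustration of the underlying mechanism, the sketch below uses Python's Linux-only `os.sched_setaffinity`. The helper name is hypothetical and this is not how packetBlaster itself is implemented.

```python
import os

def pin_to_cores(cores):
    """Restrict the caller to the given CPU cores; return the resulting set."""
    if not hasattr(os, "sched_setaffinity"):
        return None                        # platform without affinity support
    os.sched_setaffinity(0, set(cores))    # pid 0 selects the caller
    return os.sched_getaffinity(0)
```

Pinning sender and receiver to cores that are not also servicing network interrupts is one way to separate the effect of factor 3 from factors 1 and 2.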


Terminals

Of the multiple terminals running on s1 and s2, one particular pair produces the highest transfer rate. In this case that rate averages 4 GB/s over several minutes. This rate can only be achieved with that one pair of terminals. Using other pairs results in a lower maximum rate of 3.5 GB/s.


The load balancer (LB) meta-data in standard IETF RFC format:

protocol 'L:8,B:8,Version:8,Protocol:8,Reserved:16,Entropy:16,Tick:64'
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       L       |       B       |    Version    |    Protocol   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 3               4                   5                   6  
 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Rsvd             |            Entropy            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 6                                               12  
 4 5       ...           ...         ...         0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                              Tick                             +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The value of the Tick field is a convention between the data source, the data sink, and the LB control plane. The data source populates it so that all UDP packets sharing a tick value are sent to a single destination host IP; e.g., for the JLab particle detector DAQ system it would likely be a timestamp. The Entropy field (bits 48-63) serves to deliver packets to a range of ports at the host IP.
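
The 128-bit layout above can be packed with Python's `struct` module. This is a minimal sketch, assuming network (big-endian) byte order and assuming the L and B fields hold the ASCII characters 'L' and 'B'. The default version of 2 mirrors the `-ver 2` flag used earlier, the protocol value is a placeholder, and the function names are hypothetical.

```python
import struct

# L, B, Version, Protocol (8 bits each), Reserved, Entropy (16 bits each), Tick (64 bits)
LB_HEADER = struct.Struct(">BBBBHHQ")

def pack_lb_header(tick, entropy=0, version=2, protocol=1):
    """Build the 16-byte load balancer header (protocol=1 is a placeholder)."""
    return LB_HEADER.pack(ord("L"), ord("B"), version, protocol, 0, entropy, tick)

def unpack_lb_header(data):
    """Parse the first 16 bytes of a packet into the LB header fields."""
    _l, _b, version, protocol, _rsvd, entropy, tick = LB_HEADER.unpack_from(data)
    return {"version": version, "protocol": protocol, "entropy": entropy, "tick": tick}
```

With this layout the Entropy value occupies bytes 6 and 7 of the header, i.e. bits 48-63, and the Tick fills the final 8 bytes.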

Reassembly Engine Meta-Data

The RE meta-data (Figure X, yellow section) is 64 bits and consists of

  • bits 0-3 the 4 bit Version number
  • bits 4-13 a 10 bit Reserved field
  • bit 14 indicates first packet
  • bit 15 indicates last packet
  • bits 16-31 an unsigned 16 bit Data Id
  • bits 32-63 an unsigned 32 bit packet sequence number or optionally data offset byte number from beginning of file (BOF) for reassembly

In standard IETF RFC format:

protocol 'Version:4,Rsvd:10,First:1,Last:1,Data-ID:16,Offset:32'
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|        Rsvd       |F|L|            Data-ID            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Packet Sequence # or Byte Offset from Beginning of File   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
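
The bit positions listed above translate into shifts and masks as follows. This is a sketch only, assuming network (big-endian) byte order and RFC bit numbering in which bit 0 is the most significant bit. The function names are hypothetical.

```python
import struct

RE_HEADER = struct.Struct(">HHI")   # flags word, Data Id, sequence number or offset

def pack_re_header(version, first, last, data_id, offset):
    """Build the 8-byte reassembly header."""
    # Version sits in the top 4 bits of the first 16-bit word;
    # First is bit 14 (value 0x2) and Last is bit 15 (value 0x1).
    word = ((version & 0xF) << 12) | (int(first) << 1) | int(last)
    return RE_HEADER.pack(word, data_id, offset)

def unpack_re_header(data):
    """Parse the first 8 bytes of `data` into the RE header fields."""
    word, data_id, offset = RE_HEADER.unpack_from(data)
    return {
        "version": word >> 12,
        "first": bool(word & 0x2),
        "last": bool(word & 0x1),
        "data_id": data_id,
        "offset": offset,
    }
```

The 10 reserved bits are written as zero here and ignored on parsing.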

The Data Id field is a shared convention between data source and sink and is populated (or ignored) to suit the data transfer / load balance application; e.g., for the JLab particle detector DAQ system it would likely be a ROC channel # or proxy.

The sequence number, or optionally the data offset byte number, provides the RE with the information necessary to reassemble the transferred data into a meaningful contiguous sequence and is a shared convention between data source and sink. As such, the relationship between data_id and sequence number or offset is undefined and is application specific. In many use cases, for example, the sequence number or offset will be subordinate to the data_id, i.e., each set of packets with a common data_id will be individually sequenced as a group distinct from other groups with a different data_id for a common tick value.

Strictly speaking, the RE meta-data is opaque to the LB. It is therefore considered part of the payload and is itself a convention between data producer and consumer.

The resultant data stream is shown just below the block diagram in Figure X and depicts the UDP packet structure of the stream from the source data system to the LB. Individual packets are tagged with meta-data both for the LB, which routes on tick to the proper compute node, and for the RE, whose packet offsets span the collection of packets for a single tick for eventual reassembly at the destination.

The depicted sequence is only illustrative, and no assumption about the order of packets with respect to either tick or offset should be made by the LB or the RE.
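
Because no ordering can be assumed, a reassembler has to buffer packets until a group is complete. The sketch below is one possible approach, not the EJFAT implementation. It assumes the 32-bit field carries a byte offset, that payloads within a group do not overlap, and that groups are keyed by (tick, data_id). The class and method names are hypothetical.

```python
from collections import defaultdict

class Reassembler:
    """Collect out-of-order packets and return a buffer once it is complete."""

    def __init__(self):
        self._parts = defaultdict(dict)   # (tick, data_id) -> {offset: payload}
        self._total = {}                  # (tick, data_id) -> total length, known after last packet

    def add(self, tick, data_id, offset, last, payload):
        key = (tick, data_id)
        self._parts[key][offset] = payload          # a duplicate offset simply overwrites
        if last:
            self._total[key] = offset + len(payload)
        total = self._total.get(key)
        if total is None:
            return None                             # last packet not seen yet
        received = sum(len(p) for p in self._parts[key].values())
        if received != total:
            return None                             # still missing some packets
        parts = self._parts.pop(key)
        del self._total[key]
        return b"".join(parts[o] for o in sorted(parts))
```

A real receiver would also need a policy for discarding groups that never complete, which is omitted here.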

UDP Header

The UDP Header Source Port field can optionally be modified/populated as follows:

Source Port = lower 16 bits of Load Balancer Tick (for LAG switch entropy)

The UDP Header Destination Port field must be modified/populated as follows:

Destination Port = Value that indicates LB should perform load balancing (else packet is discarded) = 'LB' = 0x4c42
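
Both port rules are simple to express in code. A minimal sketch with a hypothetical helper name:

```python
# The two ASCII characters 'L' and 'B' read as a big-endian 16-bit integer: 0x4c42
LB_DEST_PORT = int.from_bytes(b"LB", "big")

def udp_ports_for(tick):
    """Return (source port, destination port) for a packet with this tick."""
    return tick & 0xFFFF, LB_DEST_PORT   # source port = lower 16 bits of the tick
```

Note that 0x4c42 is 19522 in decimal, the same port passed with -p in the packetBlaster command shown earlier.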