\section{Implementation of the Flow Control for ESP}

The flow control scheme developed in section~\ref{fc-design} had to be implemented on top of the existing implementation of the ethernet streaming protocol (ESP) by \cite{mreinhardt}. ESP's implementation provides applications with the usual socket programming interface, just like TCP. This means every data transmission is started by an application calling \fname{sendmsg} or one of its related functions, and ends with the receiving application's call to \fname{recvmsg} (or one of its related functions).

\begin{figure}
\centering
\input{./graphics/data_flow.pdf_t}
\caption[Data flow from sending to receiving process]{Stations of data flow from the sending to the receiving process. The curved arrow indicates the data transmission over the network.}
\label{fig:data-flow}
\end{figure}

As ESP is a streaming protocol providing reliable data transmission, the first step to be taken before data can be transmitted is to set up a virtual channel (called ``connection'' in the following) between the two sockets. This is accomplished using a 3-way handshake quite similar to the one used by TCP, explained in detail in \cite{mreinhardt}. Calling the functions \fname{sendmsg} or \fname{recvmsg} is sensible only for sockets which have entered the connected state.

As shown in figure~\ref{fig:data-flow}, there are several stations the data has to pass before it is finally written to the buffer space provided by the receiving process:

\begin{enumerate}
\item First the message has to be assembled by the sending process. While this sounds trivial, this step can involve several non-trivial operations. For example, byte-order conversions could be necessary, or the data might be padded with dummy bytes to allow aligned (and thus faster) access to individual pieces, if the architecture profits from (or even requires) doing so. There is no point in transmitting these dummy bytes over the network, so they should not be included in the final message. These tasks of composing the final message to be sent are often performed by an intermediate piece of software like the various implementations of the MPI (message passing interface; \cite{mpi-programming}) standard\footnote{At the time of this writing, only the Open MPI \cite{ompi, mreinhardt} implementation has support for the ESP protocol layer.}.
\item After the message has been composed and is ready for transmission, it is handed to the protocol layer, which resides within kernel-space. This is where the ESP protocol steps in. The message is split into MSS-sized packets, equipped with a transport header, and stored on the write queue; possibly the first packets are sent out over the network right away. This step is explained in detail in section~\ref{sendmsg-handler}.
\item The next step is the actual transmission of the assembled packets over the network, including retransmissions if necessary. This is the field of activity of the flow control and congestion avoidance algorithms.
\item The packets arriving at the networking device of the receiving host are handed to the protocol handler registered with the kernel for the specific ethernet packet type. This handler decides if a socket exists which might be interested in the incoming packet by looking at the source and destination port numbers and addresses \cite{mreinhardt}. If it finds one, the packet is given to this socket for processing. Packets containing useful payload are stored on the socket-private receive queue until they are fetched by the receiving process.
\item Finally, the receiving process will make one or more calls to the \fname{recvmsg()} function. Here the socket buffers are taken off the receive queue and copied over to user-space. This step is repeated until the full message is stored in the application-provided buffer. Possibly the data has to be unpacked by intermediate layers before it is ready for further processing by the actual application, as mentioned in step~1.
\end{enumerate}

Steps 2--5 are carried out by (or involve activities of) the protocol layer, that is, ESP. The practical aspects of these tasks, including some optimizations applied, are explained in this chapter. For easier reading, a general survey of the state information maintained by the ESP protocol layer is given in section~\ref{state-information} on page~\pageref{state-information}\,ff. It should be used as a reference if the meaning of some of the identifier names is not obvious to the reader.
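To make the application's view of this pipeline concrete, the following user-space sketch shows blocking send and receive loops over an already connected ESP socket, corresponding to steps~1/2 and~5 above. It is an illustration only: the helper names are invented here, and the loops assume the receiver knows the expected message length.

\begin{verbatim}
/* User-space view of a complete message transfer over an already
 * connected ESP socket.  The helper names are invented for this
 * illustration; the loops assume the receiver knows the expected
 * message length. */
#include <sys/types.h>
#include <sys/socket.h>

/* Steps 1/2: hand the composed message to the protocol layer. */
ssize_t esp_send_message(int sock, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(sock, buf + off, len - off, 0);
        if (n < 0)
            return -1;       /* error or interrupted transmission */
        off += (size_t)n;
    }
    return (ssize_t)off;
}

/* Step 5: repeated recvmsg()-style calls drain the receive queue
 * until the full message has been copied to the provided buffer. */
ssize_t esp_recv_message(int sock, char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = recv(sock, buf + off, len - off, 0);
        if (n <= 0)
            return -1;       /* error or premature end of stream */
        off += (size_t)n;
    }
    return (ssize_t)off;
}
\end{verbatim}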
\subsection{Design of the ESP Output Engine}

The output engine is responsible for actually emitting packets into the network. Two kinds of packets are to be differentiated here: packets carrying payload (``data packets'') and packets used for status updates only (``empty packets''). Data packets are only created upon user request (see section~\ref{sendmsg-handler}) and are always stored on the write queue until they are acknowledged by the receiver. They are sent out whenever the congestion avoidance strategy decides that it is valid to do so. Events which may trigger the sending of data, roughly ordered by the probability of their occurrence, are:

\begin{itemize}
\item The receipt of an acknowledgement (ACK).
\item The user calls the \fname{sendmsg} handler.
\item The receipt of a retransmission request (RRQ).
\item The expiration of the retransmission timer.
\end{itemize}

Empty packets (packets with a length of $0$ recorded in the ESP header) are used frequently for the exchange of state information, but they never enter the write queue. These packets are instead constructed and emitted on the fly. If a retransmission of such a packet is necessary, it is simply constructed and emitted again. This procedure does introduce some negligible overhead, but it simplifies and streamlines the implementation considerably. As these packets are very small\footnote{Depending on what to count, the size of these frames is 39\,bytes (storage space required when buffered) or 64\,bytes (the minimum ethernet frame size, which includes padding).}, it is not necessary to limit their emission by the congestion avoidance. If a noteworthy share of the bandwidth were used by these empty packets, this would rather indicate a bug in the implementation.

\begin{figure}
\centering
\input{./graphics/output-engine.pdf_t}
\caption[The ESP output engine]{Structure of the ESP output engine. An arrow stands for a possible function call. The only function which actually emits packets to the network is esp\_skb\_xmit.}
\label{fig:output-engine}
\end{figure}

Figure~\ref{fig:output-engine} gives an overview of the various function calling paths leading to a packet being emitted into the network. The upper frame contains functions which emit packets taken off the write queue, while the lower frame deals with empty (i.e.\ flags-only) packets. The emission of acknowledgements is not included in this figure, because it uses a special-cased implementation as explained in section~\ref{ack-queue}. This is because a socket which wants to send out an ACK might actually trigger the sending of an ACK which was enqueued by another socket, as outlined in section~\ref{handling-multiple-senders}. The \fname{esp\_skb\_xmit()} function cannot handle this situation, so a special implementation was needed.
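The handling of empty packets can be summarised by the following user-space model. All names except \fname{esp\_skb\_xmit} are stand-ins invented for this sketch; only the control flow (build, emit, forget) is meant to reflect the described design.

\begin{verbatim}
/* User-space model of the empty-packet path.  All names except
 * esp_skb_xmit are stand-ins; only the control flow is meant to
 * reflect the described design. */
#include <stdio.h>

struct esp_skb {
    unsigned flags;   /* ESP header flags (e.g. ACK, TXS, TXF) */
    size_t   len;     /* payload length; 0 for empty packets */
};

/* Stub for the real emitter, the single point where packets leave. */
static void esp_skb_xmit(struct esp_skb *skb)
{
    printf("xmit: flags=%#x len=%zu\n", skb->flags, skb->len);
}

/* Empty packets never enter the write queue: build, emit, forget.
 * If such a packet is lost, it is simply rebuilt and emitted again. */
static void esp_send_empty(unsigned flags)
{
    struct esp_skb skb = { .flags = flags, .len = 0 };
    esp_skb_xmit(&skb);
}

int main(void)
{
    /* Data packets (not modelled here) would be queued and retained
     * until acknowledged; an empty packet is fire-and-forget: */
    esp_send_empty(0x1 /* assumed ACK flag encoding */);
    return 0;
}
\end{verbatim}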
\subsection{Receiving Data from User-space}
\label{sendmsg-handler}

There is a single point where data is handed from user-space to the ESP protocol layer for transmission, which is implemented by the function \fname{sock\_esp\_sendmsg()}. This function had to be extended in several ways in order to implement the proposed flow control scheme.

The first part of this function tests if the preconditions for sending data on the given socket are met, namely if the socket really is in a connected state, the shutdown of the connection has not already been announced to the receiver, and finally if the device this socket operates on has not been shut down in the meantime. If all of these conditions are met, it is valid to send out data on this socket.

As explained in section~\ref{requesting-slots}, any transmission has to be announced to the receiver by means of a TXS-flagged packet sent as the very first packet of the transmission. To ensure this, the next step is to test the per-socket variable \fname{trans\_announced}. If the value stored there is non-zero, the current call to \fname{sock\_esp\_sendmsg()} is part of an already announced transmission, which was interrupted either by a signal sent to the user-space process or by a lack of write queue space and a subsequent timeout while waiting for more write space via the kernel function \fname{sk\_stream\_wait\_memory()}. In either case, the transmission is already announced and the \fname{sendmsg()} handler can just go on enqueuing data onto the write queue.

But if the value is zero, a new transmission is about to start, and there are two ways to deal with this situation. The first way is to note that the next packet to be constructed will have to carry the TXS flag, and to record this packet's sequence number as the new value of \fname{announce\_seq}. This procedure initiates a new transmission. Still, there is a more elegant solution, which can be used if the socket's \fname{send\_head} is not \NULL: As already mentioned, the send head always points to the next socket buffer to be sent out {\em for the first time}. If this variable actually points to a socket buffer, no matter what the position of this specific buffer is on the write queue, it is certain that there is data on the queue which has not been sent out before. Consequently, because \fname{trans\_announced} was zero, the last packet on the write queue will have the TXF flag set, which would signal the end of the transmission to the receiver. Now it is possible to just clear the TXF flag on this last packet. Doing so effectively merges these two consecutive transmissions into a single, larger one. This procedure has two positive effects: it lowers the processing overhead on the receiver side and allows the sender to stick with using the full congestion window, skipping the wait for the first ACK which would otherwise be required upon transmission announcement.
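This announce-or-merge decision might be modelled as follows. The field names mirror the identifiers used in this chapter, but the structures are simplified stand-ins for the kernel types, and the flag encoding is assumed.

\begin{verbatim}
/* Simplified model of the announce-or-merge decision in the sendmsg
 * handler.  The structures are stand-ins for the kernel types and
 * the flag encoding is assumed. */
#define ESP_TXS 0x1   /* "transmission start" flag (assumed value) */
#define ESP_TXF 0x2   /* "transmission finish" flag (assumed value) */

struct skb_model {
    unsigned flags;
};

struct sock_model {
    struct skb_model *send_head;   /* next never-sent packet, or 0 */
    struct skb_model *newest_skb;  /* newest packet on the write queue */
    unsigned trans_announced;      /* running transmission announced? */
    unsigned announce_seq;         /* seq of the last TXS packet */
    unsigned send_seq;             /* seq of the next packet built */
    unsigned next_flags;           /* flags for the next packet built */
};

void esp_announce_or_merge(struct sock_model *sk)
{
    if (sk->trans_announced)
        return;  /* part of an already announced transmission */

    if (sk->send_head) {
        /* Never-sent data is still queued, so the newest queued
         * packet carries TXF.  Clearing the flag merges the previous
         * and the new transmission into one larger transmission. */
        sk->newest_skb->flags &= ~ESP_TXF;
    } else {
        /* Genuinely new transmission: announce it with a TXS packet
         * and remember its sequence number. */
        sk->next_flags |= ESP_TXS;
        sk->announce_seq = sk->send_seq;
    }
    sk->trans_announced = 1;
}
\end{verbatim}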
The rest of the function is mainly a big loop which slices the application-provided buffer into MSS-sized packets and enqueues them to the tail of the write queue. In doing so, it sets the \fname{send\_head} if it was \NULL\/ and takes care to set the TXF flag on the last packet, as well as clearing \fname{trans\_announced} in this case. Additionally, in every pass a call to \fname{esp\_push\_one\_frame()} is made. This aims to minimize the end-to-end latency by overlapping the receipt of data from user-space with the sending of the data out over the network.

Moreover, the \fname{sendmsg()} handler function has been reworked so it can gather the data to output from multiple buffers. This feature, exposed to user-space by the \fname{writev()} (``write a vector'') function, is the primary way the Open MPI library makes use of the ESP protocol. Having a \fname{sendmsg()} which is capable of doing vector I/O can significantly reduce the number of transitions between user- and kernel-space needed to carry out certain collective MPI operations, namely those from the scatter/gather group of operations.

\subsection{Sending Data to the Receiver}

As can be seen in figure~\ref{fig:output-engine}, the regular calling stack which leads to data packets being sent out to the receiver always leads through the function \fname{esp\_push\_one\_frame()}, which shall be explained here along with its supporting functions.

All data being sent out comes off the write queue, whose structure is shown in figure~\ref{fig:write_queue_scheme}. The write queue is organized as a doubly linked list, with the list head (pointers to the first and the last socket buffer on the queue) stored in the struct \fname{sk\_write\_queue}, along with some auxiliary information used to maintain the list and a locking mechanism which allows the queue to be used safely from concurrently running contexts.

\begin{figure}
\centering
\input{./graphics/write_queue.pdf_t}
\caption[The ESP write queue]{The ESP write queue. Arrows indicate pointers; for the write queue, the pointers to the previous element are omitted for clarity. The last SKB on the queue is the oldest, whose \fname{pkt\_seq} equals \fname{una\_seq}; the send head points to the next packet to be emitted for the first time.}
\label{fig:write_queue_scheme}
\end{figure}

The first operation this function performs is to check if the socket's \fname{sk\_send\_head} pointer is not \NULL. If it is \NULL, the function returns immediately because there is no {\em new} data on the queue to be sent. Otherwise, the congestion avoidance is asked if it is valid to send out the packet found. This is done by calling \fname{esp\_cwnd\_test()}, which performs the following: First it checks if the last TXS-flagged packet was already acknowledged by the receiver. If this is true (because \fname{una\_seq} is after \fname{announce\_seq}), this qualifies for using the full congestion window. In this case the packet may be emitted if its sequence number is before \fname{una\_seq} plus \fname{esp\_burst\_length}. Otherwise the socket is currently waiting for the first ACK from the receiver, and only packets with a sequence number up to (and including) \fname{announce\_seq} plus \fname{esp\_initial\_ack\_burst\_length} may be emitted. If the congestion avoidance finally decides that it is valid to emit the given packet, it is handed to the \fname{esp\_skb\_xmit()} function and emitted as explained in \cite{mreinhardt}.
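The window check performed by \fname{esp\_cwnd\_test()} can be captured in a few lines. The sketch below assumes wrapping sequence numbers compared via signed differences, a common technique which may or may not match the actual ESP implementation; the helper names are chosen for this illustration.

\begin{verbatim}
/* Model of the esp_cwnd_test() decision.  Sequence numbers are
 * assumed to wrap, hence the signed-difference comparisons.
 * Helper names are chosen for this illustration. */
#include <stdint.h>

static int seq_after(uint32_t a, uint32_t b)      /* a after b?     */
{
    return (int32_t)(a - b) > 0;
}

static int seq_before_eq(uint32_t a, uint32_t b)  /* a <= b in seq  */
{
    return (int32_t)(a - b) <= 0;
}

int esp_cwnd_test_model(uint32_t pkt_seq, uint32_t una_seq,
                        uint32_t announce_seq,
                        uint32_t burst_length,
                        uint32_t initial_ack_burst_length)
{
    if (seq_after(una_seq, announce_seq))
        /* TXS packet already acknowledged: full congestion window,
         * i.e. pkt_seq must lie before una_seq + burst_length. */
        return seq_after(una_seq + burst_length, pkt_seq);

    /* Still waiting for the first ACK: only packets up to (and
     * including) announce_seq + initial burst may be emitted. */
    return seq_before_eq(pkt_seq,
                         announce_seq + initial_ack_burst_length);
}
\end{verbatim}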
A special case this function takes care of is the following: Suppose the application uses non-blocking I/O and tries to send a message which exceeds the size of the send buffer. In this case the \fname{sendmsg()} handler will copy as much of the message as possible to the send buffer and return to user-space. As the message transmission goes on, some buffer space will become free again, which the application {\em might} fill up. But if it decides not to do so (which is a perfectly valid thing to do), eventually the last packet in the send buffer will be sent out without the TXF flag set. Because of this, the receiver would not know that the sending socket ran out of data, and the retransmission request timer would try to recover from an apparent packet loss which did not occur. To avoid this, the \fname{esp\_push\_one\_frame()} function will set the TXF flag on this very last packet found in the send buffer prior to its emission.

\subsection{Managing the ACK queue}
\label{ack-queue}

The ACK queue is the data structure where odd\footnote{Odd in the sense that sending them out would be at the risk of causing network congestion.} ACKs can be stored until it is time for them to be transmitted to the sending host, where they will open the send window and allow for more data to be transferred. The occasions at which an ACK has to be sent out are:

\begin{itemize}
\item Received data was successfully inserted into the receive queue.
\item Packet loss was detected because of a jump in packet sequence numbers.
\item A TXS-flagged packet was received.
\item A memory pressure situation was recently lifted because the application read some of the data which was buffered there.
\end{itemize}

\subsubsection{Enqueuing an ACK}

The task of enqueuing an ACK onto the ACK queue is performed by the function \fname{esp\_enqueue\_ack()}. This function takes two parameters: the socket which wants to enqueue an ACK and a parameter indicating additional flags (apart from the mandatory ACK flag) which the socket wishes to be set on the packet that will be constructed.

The first action this function performs is to check if the device the given socket operates on is still up, as there is no point in enqueuing an ACK for a shut-down networking device. If this check passes, the packet carrying the ACK (and possibly the additional flags handed in as the mentioned function parameter) is constructed by calling \fname{esp\_skb\_alloc\_sk()}, which returns a socket buffer (SKB) containing the needed network and protocol layer headers, but no data section. This socket buffer is now charged to the socket, which is important to avoid memory being leaked, as well as for the later identification of the originating socket of this ACK. Additionally, the socket's \fname{ack\_seq} is set to the current value of \fname{recv\_seq} to account for this ACK being sent out sooner or later.

The function \fname{esp\_enqueue\_ack()} is called from numerous places without first checking the value of \fname{esp\_recv\_count}. Because of this, there is a chance that enqueuing the ACK is in fact not necessary at all, because there is currently only one socket receiving data. This is checked here, and if there is only one receiver, the ACK will be sent out immediately on the fast path, which skips manipulating the ACK queue.

If this optimization is not possible because of a multiple-sender situation, the next step is to grab a lock on the ACK queue to allow its safe manipulation. Now the just created packet is added to the tail of the ACK queue, thus making it grow. This growing is desired whenever the size of the queue is less than the current value of \fname{esp\_recv\_count}, as outlined in section~\ref{handling-multiple-senders}. If this is the case, the function is done for now, releases its lock on the ACK queue and returns. If the size of the queue is greater than or equal to the count of sending hosts, it is time to send out an acknowledgment that was enqueued by another socket. Therefore the packet at the head is dequeued and the lock on the queue is released. Now the function holds a reference to an SKB which contains an ACK for some other socket waiting for data. The last thing to do prior to sending out this ACK is to start the originating socket's retransmission request timer, in order to give this socket a chance to notice if anything goes wrong, either with the transport of this ACK or with the data frames the sender will emit in return when it processes this ACK.

The strategy of always enqueuing at the tail of the ACK queue and always dequeuing from the head has two positive effects: First, it allows for $O(1)$ access, which is important as access to this queue happens very often, and performance has to be taken into consideration to avoid introducing a new bottleneck. Secondly, it ensures that the available bandwidth is evenly distributed among the competing sockets. A sketch of this mechanism is given below.
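The queueing discipline might be modelled in user-space as follows: the queue grows while it is shorter than the number of active senders, and otherwise releases the oldest pending ACK. The list type and helper names are illustrative stand-ins, and the locking which the real implementation performs around the queue manipulation is omitted.

\begin{verbatim}
/* User-space model of the ACK queue discipline.  The queue grows
 * while it is shorter than the number of sockets with a running
 * incoming transmission (esp_recv_count); otherwise the oldest
 * pending ACK is released for emission.  Types and names are
 * illustrative stand-ins; locking is omitted. */
#include <stdlib.h>

struct ack_entry {
    struct ack_entry *next;
    int owner_sock;                /* originating socket of this ACK */
};

struct ack_queue {
    struct ack_entry *head, *tail; /* dequeue at head, enqueue at tail */
    unsigned len;
};

/* Returns the ACK to emit now (possibly belonging to another socket),
 * or NULL if the new ACK was parked on the queue. */
struct ack_entry *esp_enqueue_ack_model(struct ack_queue *q,
                                        struct ack_entry *ack,
                                        unsigned esp_recv_count)
{
    if (esp_recv_count <= 1)
        return ack;                /* fast path: single receiver */

    /* Enqueue at the tail: O(1). */
    ack->next = NULL;
    if (q->tail)
        q->tail->next = ack;
    else
        q->head = ack;
    q->tail = ack;
    q->len++;

    if (q->len < esp_recv_count)
        return NULL;               /* queue may still grow: park it */

    /* Dequeue the oldest ACK from the head: O(1). */
    struct ack_entry *out = q->head;
    q->head = out->next;
    if (!q->head)
        q->tail = NULL;
    q->len--;
    return out;
}
\end{verbatim}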
\subsection{State Information used by the Flow Control}
\label{state-information}

There are several pieces of state information which have to be maintained by the ESP protocol layer. Some of them are private to the individual socket, some are shared across all sockets.

\subsubsection{Per-Socket Information}

The per-socket information is stored either within the \fname{struct sock} as defined by the Linux kernel, or in its ESP-specific extension \fname{struct esp\_opt}. These two structures are always allocated and used together and shall not be distinguished here.

\begin{itemize}
\item \fname{sk\_send\_head}: A pointer to the next packet on the write queue to be sent out if congestion avoidance allows it. If this pointer is \NULL, all packets on the write queue have been sent out at least once.
\item \fname{send\_seq}: The sequence number to be used for the next data packet constructed; this is also the sequence number of the packet at the head of the send queue plus one.
\item \fname{recv\_seq}: The sequence number of the next data packet expected from the sender.
\item \fname{ack\_seq}: The least sequence number to acknowledge next. All packets with a sequence number less than \fname{ack\_seq} have been acknowledged at least once.
\item \fname{una\_seq}: The sequence number of the first sent but not yet acknowledged packet; therefore the sequence number of the packet at the tail of the write queue.
\item \fname{trans\_announced}: Non-zero if there is a currently running transmission from this socket announced to the receiver, zero otherwise.
\item \fname{recv\_announced}: Non-zero if the sender has announced a transmission to this socket, zero otherwise.
\item \fname{announce\_seq}: The sequence number of the last data packet sent carrying the TXS flag; if this packet is already acknowledged, the full congestion window may be used.
\item \fname{rcv\_mem\_pressure}: Non-zero if this socket is currently short of buffer space for the receive queue (and thus under ``memory pressure''); zero otherwise.
\end{itemize}
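For reference, the fields listed above can be summarised as a single C structure. This is an illustrative consolidation, not the actual layout of \fname{struct sock} or \fname{struct esp\_opt}; the field widths are assumptions.

\begin{verbatim}
/* Illustrative consolidation of the per-socket state described
 * above; not the actual layout of struct sock / struct esp_opt.
 * Field widths are assumptions. */
#include <stdint.h>

struct esp_sock_state {
    void    *sk_send_head;     /* next never-sent packet, or NULL */
    uint32_t send_seq;         /* seq for the next data packet built */
    uint32_t recv_seq;         /* next seq expected from the sender */
    uint32_t ack_seq;          /* least seq to acknowledge next */
    uint32_t una_seq;          /* oldest sent-but-unacknowledged seq */
    uint32_t announce_seq;     /* seq of the last TXS-flagged packet */
    uint8_t  trans_announced;  /* outgoing transmission announced? */
    uint8_t  recv_announced;   /* incoming transmission announced? */
    uint8_t  rcv_mem_pressure; /* receive buffer space exhausted? */
};
\end{verbatim}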
\subsubsection{Global Information}

The global information is shared between all ESP sockets, and its storage space is allocated upon loading of the ESP kernel module. Special care has to be taken to either use only atomic operations on these structures or to use the locking mechanisms provided by the Linux kernel to guard critical sections.

\begin{itemize}
\item \fname{esp\_recv\_count}: The number of sockets which have an incoming transmission running, as detected by having received a TXS-flagged packet.
\item \fname{esp\_ack\_queue}: The queue where the acknowledgements for the received data may be parked if necessary to avoid congesting the network.
\item \fname{esp\_burst\_length}: The congestion window used during bulk transfers.
\item \fname{esp\_initial\_ack\_burst\_length}: The congestion window used while waiting for the first ACK from the receiver.
\item \fname{esp\_packets\_to\_ack}: The least number of successfully enqueued data packets needed to trigger the sending of an ACK.
\end{itemize}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "main"
%%% End: