\section{Congestion Avoidance}

\subsection{Different Types of Congestion}

There are two kinds of congestion that can occur in a computer network:
receiver congestion and network congestion. Both will be explained in
more detail below, with the specifics of a cluster environment in mind.

\subsubsection{Receiver Congestion}
\label{recv-congestion}

Receiver congestion occurs if the recipient of data is unable to process
it at the same speed at which the data arrives at its network interface.
The networking protocol stores the received data in socket buffers,
which serve as a stopover until the user-space application that
initiated the communication is ready to consume the data by calling
\fname{recvmsg} or an equivalent. If this application performs expensive
computation on the data or suffers from poor I/O performance, the
networking protocol may run out of socket buffers. When no more socket
buffers are available, it is forced to drop incoming packets, which are
then lost.

Receiver congestion is very easy to detect for the software implementing
the network protocol, because it knows when it runs out of socket
buffers for storing the packets coming in from the network. More
specifically, it even knows exactly how much buffer space it has to
spare at any time.
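To make this bookkeeping concrete, the following C sketch models
receiver-side congestion detection. It is a minimal illustration only;
the names \fname{rx\_pool}, \fname{rx\_enqueue} and \fname{rx\_consume}
are hypothetical and not taken from an existing implementation.

\begin{verbatim}
/*
 * Minimal sketch of receiver-side congestion detection.  All names
 * are hypothetical; this models the bookkeeping described above,
 * not an actual protocol implementation.
 */
#include <stdbool.h>
#include <stddef.h>

struct rx_pool {
        size_t capacity;    /* total socket-buffer space in bytes */
        size_t used;        /* bytes currently awaiting recvmsg() */
};

/* The protocol implementation always knows its exact headroom. */
static size_t rx_headroom(const struct rx_pool *p)
{
        return p->capacity - p->used;
}

/* Called for every packet arriving from the network: queue it if
 * buffer space remains, otherwise drop it.  The drop is exactly the
 * receiver congestion described above. */
static bool rx_enqueue(struct rx_pool *p, size_t pkt_len)
{
        if (rx_headroom(p) < pkt_len)
                return false;  /* out of socket buffers: packet lost */
        p->used += pkt_len;    /* stopover until the app consumes it */
        return true;
}

/* The user-space side: a call like recvmsg() consumes data and
 * returns the buffer space to the pool. */
static void rx_consume(struct rx_pool *p, size_t len)
{
        p->used -= len;
}
\end{verbatim}

The point of the sketch is that \fname{rx\_headroom} is exact at all
times: unlike in the network congestion case discussed next, the
receiver never has to estimate how close it is to congestion.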
\subsubsection{Network Congestion}
\label{network-congestion}

\begin{figure}
  \centering
  \input{./graphics/congestion/switch-and-peers.pdf_t}
  \caption[Topology of a cluster environment]{Topology of a cluster
    environment. Each of the peers is directly and equally connected to
    its port on the switch. The ports, in turn, are connected to the
    switch's backplane. Thick lines are part of data transmission
    paths, and line width is proportional to available bandwidth.}
  \label{cluster-topology}
\end{figure}

Basically, network congestion means that the network cannot bear the
amount of data currently transmitted between communicating hosts
\cite{rfc-2914}. Network congestion can happen wherever in a network a
store-and-forward approach is used to mediate between a section $N_1$
offering a bandwidth $C_1$ and a section $N_2$ offering a bandwidth
$C_2$ with $C_1 > C_2$. Store-and-forward means that there is a buffer
in which the packets destined for $N_2$ are stored at rate $C_1$ before
they are forwarded to their destination. As long as this buffer is not
empty, the data is sent out to $N_2$ at rate $C_2$ and finally removed
from the buffer. A lower bound on the size $S_\text{buff}$ of this
buffer for a network with a maximum packet size of $p_\text{max}$ bytes
is given by the equation:
\begin{equation}
  \label{eq:min-buffer}
  S_\text{buff} \ge \frac{C_1}{C_2} \cdot p_\text{max}
\end{equation}
For example, if a backplane can deliver data to a single port buffer at
$C_1 = 8$\,Gbit/s while the port drains it at $C_2 = 1$\,Gbit/s, a
maximum packet size of $p_\text{max} = 1500$ bytes yields
$S_\text{buff} \ge 12$\,kB. A typical size for this buffer in ethernet
switches is around 128\,kB. Whenever this buffer is full and an
additional packet arrives from $N_1$, this packet has to be dropped and
is lost. This is what is commonly referred to as network congestion. In
a cluster environment this can happen for data leaving the backplane
and entering the per-peer port buffer\footnote{Actually, today's
  switches support another mode of operation, called cut-through
  switching. Rather than storing and forwarding full packets, the
  switch only waits for the packet's header to arrive, so that it knows
  the destination address. As soon as the responsible port is known,
  data is forwarded there. This mode offers lower latency, but it is
  prone to collisions. It is disabled when collisions become too
  frequent, e.g.\ when more than one peer tries to send data to a
  single receiving peer.}, but it becomes apparent only if the
bandwidth actually \textit{utilized} in $N_1$ is greater than $C_2$.
Because all peers are connected to the switch with equal upstream
bandwidth, two or more peers have to send data to a common third peer
at full speed to generate congestion. Data for the receiving peer then
comes in at a rate several times higher than the rate at which it can
be forwarded to its destination. The per-port buffer of the switch will
fill up, and eventually the switch will be forced to start dropping
packets.

Recapitulating: in a typical cluster environment, where all peers are
interconnected by one or more switches using a backplane capable of
offering full bisection bandwidth, the chances for network congestion
to occur are limited to certain well-known situations.

\subsection{Preventing Network Congestion}

Network congestion can be prevented by limiting the rate at which data
is emitted into the network by the involved hosts. The natural way of
doing so would be to obey a limit of the form
\[
  \frac{\text{data}}{t} \le C_\text{limit}
\]
Unfortunately, this leads to the problem that $t$ is surprisingly hard
to measure. It is virtually impossible to obtain clocks of sufficiently
fine granularity under Linux these days, as with any general-purpose
(i.e.\ non-realtime) operating system. A typical granularity at which
tasks can be scheduled in these environments is 10\,ms or worse. A
gigabit ethernet device can send out about 1.25\,MB of data in this
time frame, which is far too much to serve as a basis for implementing
reasonable flow control.

Observing the fact that the transmission speed at the physical layer of
the OSI reference model \cite{osi-model} is constant for ethernet
networking devices (as for almost every networking device) leads to the
insight that limiting the amount of data which is ``in flight'' on the
network is sufficient for preventing network congestion. The term
\textit{in flight} refers to data which has left the networking device
of the sending peer but has not yet completely arrived at the
networking device of the receiving peer. Being pessimistic, one has to
assume that this data is currently stored in a buffer threatened by
overflow, as outlined in section~\ref{network-congestion}.
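The sketch below illustrates this idea in C. The window size and all
names (\fname{tx\_may\_send}, \fname{tx\_sent}, \fname{tx\_acked}) are
assumptions chosen for illustration, not part of an existing
implementation.

\begin{verbatim}
/*
 * Minimal sketch of sender-side flow control that bounds in-flight
 * data instead of measuring a rate.  All names and the window size
 * are illustrative assumptions.
 */
#include <stdbool.h>
#include <stddef.h>

/* Pessimistic assumption: all in-flight data may sit in the ~128 kB
 * per-port switch buffer, so stay well below that. */
#define INFLIGHT_LIMIT  (64 * 1024)

static size_t inflight;        /* sent but not yet acknowledged */

/* Gatekeeper in front of the transmit path.  No timer is needed:
 * because the physical-layer rate is constant, bounding the data in
 * flight also bounds the pressure on every store-and-forward buffer
 * along the path. */
static bool tx_may_send(size_t pkt_len)
{
        return inflight + pkt_len <= INFLIGHT_LIMIT;
}

static void tx_sent(size_t pkt_len)
{
        inflight += pkt_len;
}

/* Called when the receiving peer acknowledges data, i.e. the data
 * has verifiably left all buffers on the path. */
static void tx_acked(size_t pkt_len)
{
        inflight -= pkt_len;
}
\end{verbatim}

Note that acknowledgements from the receiving peer are the only
feedback this scheme requires; the coarse 10\,ms scheduling granularity
discussed above becomes irrelevant, because no quantity has to be
divided by time.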
\subsection{Simplifications in a Cluster Environment}
\label{cluster-simplifications}

Typically, congestion avoidance on a large-scale network such as the
internet is a task at the mercy of many unknown quantities: the
available bandwidth may change at any time, new routes may appear, and
previously used routes may disappear \cite{tcp-congestion}. In order to
develop a congestion avoidance scheme suitable for clusters, the
following simplifications will be exploited in
section~\ref{fc-design}:
\begin{itemize}
\item The bandwidth available between two peers chosen at random from
  the set of peers in the cluster is the same for any two peers.
\item The bandwidth available between two peers is constant over time,
  provided those two peers do not communicate with other peers.
\item The bandwidth available to a group of peers is completely
  independent of the bandwidth consumed by any other group of peers, as
  long as there is no communication between these two groups. This is a
  direct consequence of the claim for a network offering full bisection
  bandwidth.
\item There is only one route between two peers at any time. (Even if
  loops physically exist in the topology, techniques like the
  spanning-tree protocol \cite{stp-explained} are employed by the
  switches to logically eliminate them.) This means that packets are
  neither duplicated nor reordered.
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% IspellDict: "english"
%%% End: