On implementation of DCTCP on three-tier and fat-tree data center network topologies

In this section, we explain the three-tier and fat-tree DCN architectures, indicating
the merits of the fat-tree topology. We discuss the traffic characteristics of data center
networks and indicate the different traffic patterns commonly found in diverse data
center applications. We then explain incast and outcast congestion in data center networks
and discuss the basic features of DCTCP.

Data center network architectures

Data center network architectures are typically classified into two categories: switch-centric
and server-centric. In switch-centric DCNs, the routing intelligence is placed on
switches and each server connects to the network through a single port. In server-centric
DCNs, the routing intelligence is placed on servers which connect to the network through
multiple ports while switches serve merely as crossbars. A number of architectures
in both categories, as well as dual-centric architectures combining the best of both,
have been proposed to achieve scalability, efficiency, reliability and cost minimization.
Nevertheless, the legacy three-tier tree-based architecture remains the most widely
deployed, and the fat-tree is the most promising in terms of scalability, robustness
and cost (Bilal et al. 2013). Both of these architectures are switch-centric. A typical
data center network consists of routers and switches arranged in a two-level or three-level
hierarchy of access, aggregation and core layers; in a two-level hierarchy there are
no aggregation switches, while a three-level hierarchy includes all three layers.
The most realistic and practical DCN simulations involve three-level architectures.
We briefly explain the legacy three-tier and fat-tree architectures below:

Three-tier data center network architecture

The three-tier data center network design comprises a tree (hierarchy) of switches
and routers. The root of the tree forms the core layer, the middle tier forms the aggregation
layer, and the leaves of the tree form the edge/access layer. The switches at the access
layer have a number of 1 GigE ports (typically 48–288) and some 10 GigE ports for uplink
connectivity to the switches of the aggregation layer. The switches of the upper layers
have 10 GigE ports (typically 32–128) and correspondingly larger switching capacity.
Figure 1 shows a section of a three-tier DCN topology, with 1 GigE links connecting
servers to access layer switches and 10 GigE links connecting the switches and routers
of the upper layers. The access layer switches are low-cost Top of Rack (TOR) Ethernet
switches connecting the servers of a rack (typically 20–40 servers) through 1 GigE links.
These access layer switches are connected to the aggregation layer switches through
10 GigE links, and the aggregation layer switches are in turn connected to the core
layer switches. The core layer contains the core switches and one or more border routers
providing connectivity between the data center network and the Internet. Normally the
aggregation layer also hosts a load balancer.

Fig. 1. Three-tier data center network topology
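To make this structure concrete, the short Python sketch below builds the link list of a small three-tier topology of the kind shown in Fig. 1; the layer widths, rack size and full-mesh uplinks are illustrative assumptions rather than parameters of any particular deployment.

```python
# Link list of a small three-tier topology; all sizes are illustrative assumptions.
def three_tier(num_core=2, num_agg=4, num_tor=8, servers_per_rack=40):
    links = []
    for t in range(num_tor):                      # servers attach to their rack's TOR over 1 GigE
        for s in range(servers_per_rack):
            links.append((f"server{t}-{s}", f"tor{t}", "1GigE"))
    for t in range(num_tor):                      # TOR uplinks to the aggregation layer over 10 GigE
        for a in range(num_agg):
            links.append((f"tor{t}", f"agg{a}", "10GigE"))
    for a in range(num_agg):                      # aggregation uplinks to the core over 10 GigE
        for c in range(num_core):
            links.append((f"agg{a}", f"core{c}", "10GigE"))
    return links

links = three_tier()
print(len(links), "links")   # 320 server links + 32 TOR uplinks + 8 aggregation uplinks = 360
```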

One drawback of this design is oversubscription, which is accepted because provisioning
for full 1:1 bandwidth is prohibitively expensive. An oversubscription ratio of 1:1 means
that any server can communicate with any other server at the full bandwidth of its network
interface; typical ratios in this topology are 2.5:1 or 8:1 (Al-Fares et al. 2008). Large
data centers have a multi-rooted core with multiple core switches, which requires multipath
routing techniques. Besides oversubscription, this limits the multiplicity of usable paths
and produces excessively large routing tables, increasing lookup latency. The most serious
shortcoming of this topology is the excessive cost of the 10 GigE switches in the upper
layers. These shortcomings were identified and addressed by Al-Fares et al. (2008), who
proposed the fat-tree topology discussed below.
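As a worked example of the oversubscription ratio defined above, the following sketch (with port counts chosen purely for illustration, not taken from any measured deployment) computes the ratio for a rack whose TOR switch serves forty servers over 1 GigE and has two 10 GigE uplinks.

```python
# Oversubscription ratio of a TOR switch: aggregate server-facing bandwidth
# divided by uplink capacity. Port counts are illustrative assumptions.
def oversubscription(servers, server_gbps, uplinks, uplink_gbps):
    return (servers * server_gbps) / (uplinks * uplink_gbps)

ratio = oversubscription(servers=40, server_gbps=1, uplinks=2, uplink_gbps=10)
print(f"{ratio}:1")   # 2.0:1 -- servers get at most half of their NIC bandwidth upward
```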

Fat-tree data center network architecture

The fat-tree data center network design uses low-cost commodity Ethernet switches to
form a k-ary fat-tree (Al-Fares et al. 2008). As shown in Fig. 2, there are k pods, each
containing two layers of k/2 switches. Each k-port switch in the lower (edge) layer connects
directly to k/2 servers through k/2 ports and to k/2 aggregation layer switches through
its remaining k/2 ports. There are (k/2)² k-port core switches, each with one port connecting
to each pod. In general, a fat-tree built from k-port switches supports k³/4 servers. The
fat-tree topology uses identical commodity switches in all layers, offering a multiple-times
cost reduction compared to the three-tier architecture. This design employs two-level route
lookups to assist multipath routing. In order to prevent congestion at a single port due to
concentration of traffic towards a subnet, and to keep the number of prefixes small, two-level
routing tables are used that spread outgoing traffic from a pod evenly among the core switches
by using the low-order bits of the destination IP address.

Fig. 2. Fat-tree data center network topology
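The counting rules and the low-order-bit spreading described above can be checked with a short sketch. The helper functions below are our own simplified illustration, not code from Al-Fares et al. (2008); in particular, the modulo-based core selection only mimics the flavour of the two-level lookup.

```python
# Component counts of a k-ary fat-tree (k must be even).
def fat_tree_counts(k):
    return {
        "pods": k,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "servers": k ** 3 // 4,
    }

# Spread outgoing traffic over the core by using the low-order bits (last octet)
# of the destination IP address -- a simplified stand-in for the two-level lookup.
def pick_core_switch(dst_ip, k):
    return int(dst_ip.split(".")[-1]) % ((k // 2) ** 2)

print(fat_tree_counts(4))                 # 4 pods, 8+8 pod switches, 4 core switches, 16 servers
print(pick_core_switch("10.2.0.3", 4))    # core switch 3
```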

Traffic characteristics of data center networks

In this subsection, we summarize the traffic characteristics of data center networks
which we studied in order to understand realistic and common DCN traffic patterns
for our simulations. Investigation of data center traffic characteristics is normally
done by analyzing the Simple Network Management Protocol (SNMP) data from production
data centers and examining the temporal and spatial patterns of traffic volumes and
loss rates in switches. Benson et al. (2009, 2010) conducted such a study by analyzing SNMP
data from 19 corporate and enterprise data centers. They studied link utilization and packet
loss in the core, aggregation and edge layer switches. They report that roughly 60 % of the
core and edge links are actively used, with utilization in the core significantly higher than
in the lower layers, mainly because a smaller number of core links multiplexes the traffic
from the lower layers. An important observation is the low loss rate at the core layer despite
its higher utilization, which can be attributed to the use of 10 GigE links and ports at the
core of the three-tier topology. In the fat-tree topology, where all switches have the same
capacity, core switches can be expected to experience higher losses. Another observation is
that a small fraction of links experience much higher losses than the rest. They suggest
splitting traffic uniformly across all links in order to avoid under-utilized and over-utilized
links. They also observe an On–Off traffic pattern at the edge switches and indicate the
presence of bursty losses. Kandula et al. (2009) analyzed the nature of data center network
traffic and studied traffic patterns, congestion, flow characteristics and TCP incast. They
observe that the median number of correspondents for a server is two other servers within its
rack and four servers outside the rack. Ersoz et al. (2007) studied network traffic in a
cluster-based multi-tier data center, which according to them is the most cost-effective way
to design data center networks.

Overall data center network traffic is classified as:

1. Inter-data center traffic: This is the traffic between two data center networks. It
has been studied in detail by Chen et al. (2011), who used network traces gathered at five
major Yahoo! data centers. This type of traffic is not the focus of our work.

2. Intra-data center traffic: This is the traffic within a single data center network.
It consists of flows of packets, which can be one-to-one, one-to-many, many-to-one or
many-to-many. In a DCN these flows are also classified as long-duration flows with a large
number of packets, called elephant or long flows, and short-duration flows with a small
number of packets, called mice or short flows. Elephant flows require high throughput,
while mice flows are short control flows demanding low delay, as sketched in the toy
classifier below.
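A minimal sketch of such a classification follows; the 100 KB size threshold is an assumption chosen only for illustration, since studies use varying cut-offs between mice and elephants.

```python
# Toy flow classifier; the 100 KB threshold is an illustrative assumption.
def classify_flow(flow_bytes, threshold=100 * 1024):
    return "elephant (throughput-sensitive)" if flow_bytes > threshold else "mouse (latency-sensitive)"

print(classify_flow(5 * 1024))            # mouse  -- e.g. a query or control message
print(classify_flow(20 * 1024 * 1024))    # elephant -- e.g. a background bulk transfer
```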

Benson et al. (2010) analyzed SNMP statistics from ten 2-tier and 3-tier data centers belonging
to three different types of organizations and summarized their findings as follows. The majority
of flows in data centers are small, lasting less than a few hundred milliseconds. In cloud data
centers, 75 % of the traffic remains within a rack, while in university and private enterprise
data centers 40–90 % of the traffic leaves the rack and traverses the network. Losses are higher
at the aggregation layer than at the other layers. An important observation is that utilization
within the core and aggregation layers is higher than at the edge layer, even though the links
experiencing losses in those upper layers average only 30 % utilization while the loss-experiencing
links at the edge average 60 % utilization. This shows that losses in the upper layers cannot be
ignored and that there is a need to analyze out-of-rack traffic patterns, which we do not find in
the DCTCP analysis of Alizadeh et al. (2010). We test DCTCP for both within-rack and out-of-rack
traffic patterns.

TCP incast and TCP outcast

TCP incast and TCP outcast congestion result from two different data center traffic scenarios
which severely degrade throughput; both are discussed in detail, along with their solutions, in
Chen et al. (2012) and Prakash et al. (2012). We simulate and analyze both incast and outcast for
the three-tier and fat-tree DCN topologies with traffic patterns spanning the upper layers. In
this subsection, we briefly discuss the incast and outcast scenarios. TCP incast is a condition
that degrades network throughput in the many-to-one traffic pattern, which is quite common in
data center networks. It occurs when multiple senders send data to the same receiver, as shown
in Fig. 3; the link between the switch and the receiver becomes the bottleneck, resulting in
throughput far below the link bandwidth (Chen et al. 2012). In terms of the DCN topology, this
condition arises at the TOR switch to which the receiver is connected. Incast congestion can
develop due to within-rack traffic, where all senders are connected to the same TOR switch as
the receiver, as well as due to out-of-rack traffic, where the senders are spread across the
network but all send data to the same receiver. These two scenarios can affect network throughput
and delay differently and need separate analysis in both the three-tier and fat-tree topologies.
Figure 3 shows, for both topologies, the scenario in which the senders are scattered across the
network and connected to different TOR switches.

Fig. 3. Incast congestion in data center network in a three-tier topology and b fat-tree topology
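The throughput-collapse mechanism can be illustrated with a toy model of the receiver's TOR port; the buffer size, burst length and drain rate below are arbitrary assumptions, and the sketch only counts drops rather than modelling TCP dynamics.

```python
# Toy model of incast at the receiver's TOR port: num_senders synchronized
# senders each inject one packet per round into an output queue, which drains
# a fixed number of packets per round. All sizes are illustrative assumptions.
def incast_drops(num_senders, rounds=64, buffer_pkts=100, drained_per_round=10):
    queued, dropped = 0, 0
    for _ in range(rounds):
        for _ in range(num_senders):      # synchronized arrivals from all senders
            if queued < buffer_pkts:
                queued += 1
            else:
                dropped += 1              # the tail of the burst overflows the shallow buffer
        queued = max(0, queued - drained_per_round)
    return dropped

print(incast_drops(num_senders=4))        # 0   -- few senders, no overflow
print(incast_drops(num_senders=40))       # heavy drops once the buffer fills
```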

TCP outcast is a frequent problem that occurs in a multi-rooted tree topology when several
large and small TCP flows arrive at different input ports of a switch and contend for the same
output port (Prakash et al. 2012). This scenario is shown in Fig. 4. The large TCP flows are
queued at the output port whereas the remaining small TCP flows are dropped. The problem occurs
because the commodity switches used in these networks employ a tail-drop queue management scheme,
which exhibits a phenomenon called port blackout in which a series of packets arriving on one
input port is dropped. This results in severe unfairness in the sharing of network resources.
As with the incast scenario, outcast congestion also requires separate analysis for senders
connected to the same TOR switch and for senders connected to different TOR switches. The second
traffic pattern is shown in Fig. 4 for both topologies.

Fig. 4. Outcast congestion in data center network in a three-tier topology and b fat-tree topology
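A toy model of a shared tail-drop output queue illustrates port blackout; the buffer size, burst sizes and drain rate are illustrative assumptions, and the two "ports" simply stand for a heavily loaded input port and a lightly loaded one.

```python
# Toy tail-drop queue shared by two input ports of a switch: port A's large
# burst fills the buffer each round just before port B's few packets arrive,
# so B's packets are discarded in consecutive runs ("port blackout").
# Buffer size, burst sizes and drain rate are illustrative assumptions.
def port_blackout(rounds=100, buffer_pkts=20, drain=12, burst_a=12, burst_b=2):
    queued = 0
    sent = {"A": 0, "B": 0}
    dropped = {"A": 0, "B": 0}
    for _ in range(rounds):
        for port, burst in (("A", burst_a), ("B", burst_b)):
            for _ in range(burst):
                sent[port] += 1
                if queued < buffer_pkts:
                    queued += 1
                else:
                    dropped[port] += 1    # tail drop: the late arrivals are discarded
        queued = max(0, queued - drain)   # the output link drains between rounds
    return {p: round(dropped[p] / sent[p], 2) for p in sent}

print(port_blackout())   # e.g. {'A': 0.0, 'B': 0.96}: the small flows on port B are starved
```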

DCTCP

In this subsection, we briefly explain Data Center TCP (DCTCP) (Alizadeh et al. 2010, 2011).
DCTCP is a transport layer protocol designed especially for data center networks. As discussed
earlier, DCN applications have two types of flows: large flows necessitating high throughput
and small flows necessitating low delay. The data traffic inside a DCN is classified as query
traffic (2–20 KB), short messages (50 KB–1 MB) and large flows (1 MB–50 MB). Most of the flows
in data centers are small (less than or equal to 10 KB). These requirements are not handled
efficiently by conventional TCP, and DCTCP was designed to meet them, achieving low latency,
high throughput and high burst tolerance. DCTCP reacts according to the level of congestion,
reducing the window size based on the fraction of marked packets. It uses an Active Queue
Management (AQM) policy in which the switch marks a packet with Explicit Congestion Notification
(ECN), rather than dropping it, when the number of queued packets exceeds the marking threshold
K. DCTCP thus uses ECN for early detection of congestion instead of waiting for segment loss
to occur. The DCTCP algorithm has three main components, discussed below:

Marking at the switch

DCTCP uses a simple active queue management scheme in which the switch detects congestion
and sets the Congestion Encountered (CE) codepoint in the IP header. If the queue occupancy
is greater than the marking threshold K upon the arrival of an incoming packet, the packet
is marked with the CE codepoint. This allows the sender to learn about queue overshoot.
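A minimal sketch of this marking rule is shown below, assuming a simple packet-count queue model and an illustrative threshold K; it is not the queueing code of any particular switch.

```python
# Sketch of the marking rule, assuming a packet-count queue and an illustrative
# threshold K; real switches mark based on buffer occupancy at line rate.
K = 20          # marking threshold in packets (illustrative)
CAPACITY = 100  # total buffer size in packets (illustrative)

def enqueue(queue, packet):
    packet["ce"] = len(queue) > K      # set CE codepoint if occupancy exceeds K on arrival
    if len(queue) < CAPACITY:
        queue.append(packet)
        return True
    return False                       # only a completely full buffer still drops

queue = []
for seq in range(30):
    enqueue(queue, {"seq": seq})
print(sum(p["ce"] for p in queue))     # 9 packets (arrivals 21..29) carry the CE mark
```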

ECN-echo at the receiver

The receiver echoes the congestion information back to the sender using the ECN-Echo (ECE)
flag of the TCP header. The ECE flag is set by the receiver in a series of ACKs until it
receives confirmation from the sender. In this way the receiver tries to convey back to the
sender the exact sequence of marked packets.
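The sketch below illustrates the echo in its simplest form, assuming one ACK per data packet and omitting DCTCP's delayed-ACK handling; the dictionary-based packet representation is our own.

```python
# Sketch of the receiver's echo, assuming one ACK per data packet (DCTCP's
# delayed-ACK state machine is omitted); the dict-based packet format is ours.
def make_ack(data_packet):
    return {"ack_seq": data_packet["seq"] + 1,
            "ece": bool(data_packet.get("ce"))}   # ECE mirrors the CE codepoint

stream = [{"seq": 1, "ce": False}, {"seq": 2, "ce": True}, {"seq": 3, "ce": True}]
print([make_ack(p) for p in stream])   # the sender sees exactly which packets were marked
```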

Controller at the sender

The sender reacts to the congestion indication conveyed by the ECE flag in the received ACKs
by reducing the TCP congestion window (cwnd). The estimate of the fraction of marked packets,
denoted α, is updated once per Round Trip Time (RTT) as:

α ← (1 − g) × α + g × F    (1)

where F is the fraction of packets marked in the last window of data and g (0 < g < 1) is the
weight given to new samples against past history in the estimation of α. From Eq. (1) we see
that α provides an estimate of the probability that the queue size exceeds the threshold K. A
value of α close to 0 indicates a low level of congestion, while a value close to 1 indicates
a high level of congestion.

Conventional TCP cuts its window size by a factor of 2 in response to a marked ACK, but DCTCP
uses α to reduce its window size as follows:

cwnd ← cwnd × (1 − α/2)    (2)

When α is near 0 the level of congestion is low and the congestion window is only slightly
reduced, but when the level of congestion is high (α near 1), DCTCP cuts its window size by
half, just like TCP.
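The sender-side behaviour of Eqs. (1) and (2) can be sketched as follows; the variable names, the toy RTT loop and the weight g = 1/16 are our own illustrative simplifications rather than a faithful DCTCP implementation.

```python
# Sketch of the sender-side controller; names, the toy RTT loop and g = 1/16
# are illustrative assumptions.
def update_alpha(alpha, f, g=1.0 / 16):
    return (1 - g) * alpha + g * f         # Eq. (1): f is the marked fraction in the last window

def reduce_cwnd(cwnd, alpha):
    return cwnd * (1 - alpha / 2)          # Eq. (2): cut in proportion to the congestion level

alpha, cwnd = 0.0, 100.0
for f in (0.0, 0.1, 1.0, 1.0):             # marked fractions observed over four RTTs
    alpha = update_alpha(alpha, f)
    if f > 0:                              # a window containing marks triggers one reduction
        cwnd = reduce_cwnd(cwnd, alpha)
    print(round(alpha, 3), round(cwnd, 1))
```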