Network Working Group                                        V. Jacobson
Request for Comments: 1072                                           LBL
                                                               R. Braden
                                                                     ISI
                                                            October 1988
|
|
|
|
|
|
|
|
TCP Extensions for Long-Delay Paths |
|
|
|
|
|
|
|
|
Status of This Memo |
|
|
|
|
|
This memo proposes a set of extensions to the TCP protocol to provide |
|
|
efficient operation over a path with a high bandwidth*delay product. |
|
|
These extensions are not proposed as an Internet standard at this |
|
|
time. Instead, they are intended as a basis for further |
|
|
experimentation and research on transport protocol performance. |
|
|
Distribution of this memo is unlimited. |
|
|
|
|
|
1. INTRODUCTION |
|
|
|
|
|
Recent work on TCP performance has shown that TCP can work well over |
|
|
a variety of Internet paths, ranging from 800 Mbit/sec I/O channels |
|
|
to 300 bit/sec dial-up modems [Jacobson88]. However, there is still |
|
|
a fundamental TCP performance bottleneck for one transmission regime: |
|
|
paths with high bandwidth and long round-trip delays. The |
|
|
significant parameter is the product of bandwidth (bits per second) |
|
|
and round-trip delay (RTT in seconds); this product is the number of |
|
|
bits it takes to "fill the pipe", i.e., the amount of unacknowledged |
|
|
data that TCP must handle in order to keep the pipeline full. TCP |
|
|
performance problems arise when this product is large, e.g., |
|
|
significantly exceeds 10**5 bits. We will refer to an Internet path |
|
|
operating in this region as a "long, fat pipe", and a network |
|
|
containing this path as an "LFN" (pronounced "elephan(t)"). |
|
|
|
|
|
High-capacity packet satellite channels (e.g., DARPA's Wideband Net) |
|
|
are LFN's. For example, a T1-speed satellite channel has a |
|
|
bandwidth*delay product of 10**6 bits or more; this corresponds to |
|
|
100 outstanding TCP segments of 1200 bytes each! Proposed future |
|
|
terrestrial fiber-optical paths will also fall into the LFN class; |
|
|
for example, a cross-country delay of 30 ms at a DS3 bandwidth |
|
|
(45Mbps) also exceeds 10**6 bits. |
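
The figures above can be checked with a few lines of arithmetic. The following is a minimal sketch only: the function names are illustrative, and the 0.65 second satellite round-trip time is an assumed value that does not appear in this memo.

```python
def bdp_bits(bandwidth_bps, rtt_seconds):
    """Bandwidth*delay product: the bits needed to 'fill the pipe'."""
    return bandwidth_bps * rtt_seconds


def segments_in_flight(bdp, segment_bytes):
    """Whole segments of the given size needed to hold bdp bits."""
    return int(bdp // (8 * segment_bytes))


# DS3 cross-country path from the text: 45 Mbit/s with a 30 ms delay.
ds3 = bdp_bits(45_000_000, 0.030)

# T1 satellite channel at 1.544 Mbit/s; the 0.65 s RTT is an assumption.
t1_sat = bdp_bits(1_544_000, 0.65)

# A 10**6 bit pipe holds roughly one hundred 1200-byte segments.
outstanding = segments_in_flight(10**6, 1200)
```

Both paths come out above 10**6 bits under these assumptions, which is the region this memo calls an LFN.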
|
|
|
|
|
Clever algorithms alone will not give us good TCP performance over |
|
|
LFN's; it will be necessary to actually extend the protocol. This |
|
|
RFC proposes a set of TCP extensions for this purpose. |
|
|
|
|
|
There are three fundamental problems with the current TCP over LFN |
|
|
|
|
|
|
|
|
|
|
|
Jacobson & Braden [Page 1] |
|
|
|
|
|
RFC 1072 TCP Extensions for Long-Delay Paths October 1988 |
|
|
|
|
|
|
|
|
paths: |
|
|
|
|
|
|
|
|
(1) Window Size Limitation |
|
|
|
|
|
The TCP header uses a 16 bit field to report the receive window |
|
|
size to the sender. Therefore, the largest window that can be |
|
|
used is 2**16 = 65K bytes. (In practice, some TCP |
|
|
implementations will "break" for windows exceeding 2**15, |
|
|
because of their failure to do unsigned arithmetic). |
|
|
|
|
|
To circumvent this problem, we propose a new TCP option to allow |
|
|
windows larger than 2**16. This option will define an implicit |
|
|
scale factor, to be used to multiply the window size value found |
|
|
in a TCP header to obtain the true window size. |
|
|
|
|
|
|
|
|
(2) Cumulative Acknowledgments |
|
|
|
|
|
Any packet losses in an LFN can have a catastrophic effect on |
|
|
throughput. This effect is exaggerated by the simple cumulative |
|
|
acknowledgment of TCP. Whenever a segment is lost, the |
|
|
transmitting TCP will (eventually) time out and retransmit the |
|
|
missing segment. However, the sending TCP has no information |
|
|
about segments that may have reached the receiver and been |
|
|
queued because they were not at the left window edge, so it may |
|
|
be forced to retransmit these segments unnecessarily. |
|
|
|
|
|
We propose a TCP extension to implement selective |
|
|
acknowledgments. By sending selective acknowledgments, the
|
|
receiver of data can inform the sender about all segments that |
|
|
have arrived successfully, so the sender need retransmit only |
|
|
the segments that have actually been lost. |
|
|
|
|
|
Selective acknowledgments have been included in a number of |
|
|
experimental Internet protocols -- VMTP [Cheriton88], NETBLT |
|
|
[Clark87], and RDP [Velten84]. There is some empirical evidence |
|
|
in favor of selective acknowledgments -- simple experiments with |
|
|
RDP have shown that disabling the selective acknowledgment
|
|
facility greatly increases the number of retransmitted segments |
|
|
over a lossy, high-delay Internet path [Partridge87]. A |
|
|
simulation study of a simple form of selective acknowledgments |
|
|
added to the ISO transport protocol TP4 also showed promise of |
|
|
performance improvement [NBS85]. |
|
|
|
|
|
|
|
|
(3) Round Trip Timing |
|
|
|
|
|
TCP implements reliable data delivery by measuring the RTT, |
|
|
i.e., the time interval between sending a segment and receiving |
|
|
an acknowledgment for it, and retransmitting any segments that |
|
|
are not acknowledged within some small multiple of the average |
|
|
RTT. Experience has shown that accurate, current RTT estimates |
|
|
are necessary to adapt to changing traffic conditions and, |
|
|
without them, a busy network is subject to an instability known |
|
|
as "congestion collapse" [Nagle84]. |
|
|
|
|
|
In part because TCP segments may be repacketized upon |
|
|
retransmission, and in part because of complications due to the |
|
|
cumulative TCP acknowledgment, measuring a segment's RTT may
|
|
involve a non-trivial amount of computation in some |
|
|
implementations. To minimize this computation, some |
|
|
implementations time only one segment per window. While this |
|
|
yields an adequate approximation to the RTT for small windows |
|
|
(e.g., a 4 to 8 segment Arpanet window), for an LFN (e.g., 100 |
|
|
segment Wideband Network windows) it results in an unacceptably |
|
|
poor RTT estimate. |
|
|
|
|
|
In the presence of errors, the problem becomes worse. Zhang |
|
|
[Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is |
|
|
not possible to accumulate reliable RTT estimates if |
|
|
retransmitted segments are included in the estimate. Since a |
|
|
full window of data will have been transmitted prior to a |
|
|
retransmission, all of the segments in that window will have to |
|
|
be ACKed before the next RTT sample can be taken. This means at |
|
|
least an additional window's worth of time between RTT |
|
|
measurements and, as the error rate approaches one per window of |
|
|
data (e.g., 10**-6 errors per bit for the Wideband Net), it |
|
|
becomes effectively impossible to obtain an RTT measurement. |
|
|
|
|
|
We propose a TCP "echo" option that allows each segment to carry |
|
|
its own timestamp. This will allow every segment, including |
|
|
retransmissions, to be timed at negligible computational cost. |
|
|
|
|
|
|
|
|
In designing new TCP options, we must pay careful attention to |
|
|
interoperability with existing implementations. The only TCP option |
|
|
defined to date is an "initial option", i.e., it may appear only on a |
|
|
SYN segment. It is likely that most implementations will properly |
|
|
ignore any options in the SYN segment that they do not understand, so |
|
|
new initial options should not cause a problem. On the other hand, |
|
|
we fear that receiving unexpected non-initial options may cause some |
|
|
TCP's to crash. |
|
|
|
|
|
Therefore, in each of the extensions we propose, non-initial options |
|
|
may be sent only if an exchange of initial options has indicated that |
|
|
both sides understand the extension. This approach will also allow a |
|
|
TCP to determine when the connection opens how big a TCP header it |
|
|
will be sending. |
|
|
|
|
|
2. TCP WINDOW SCALE OPTION |
|
|
|
|
|
The obvious way to implement a window scale factor would be to define |
|
|
a new TCP option that could be included in any segment specifying a |
|
|
window. The receiver would include it in every acknowledgment |
|
|
segment, and the sender would interpret it. Unfortunately, this |
|
|
simple approach would not work. The sender must reliably know the |
|
|
receiver's current scale factor, but a TCP option in an |
|
|
acknowledgment segment will not be delivered reliably (unless the
|
|
ACK happens to be piggy-backed on data). |
|
|
|
|
|
However, SYN segments are always sent reliably, suggesting that each |
|
|
side may communicate its window scale factor in an initial TCP |
|
|
option. This approach has a disadvantage: the scale must be |
|
|
established when the connection is opened, and cannot be changed |
|
|
thereafter. However, other alternatives would be much more |
|
|
complicated, and we therefore propose a new initial option called |
|
|
Window Scale. |
|
|
|
|
|
2.1 Window Scale Option |
|
|
|
|
|
This three-byte option may be sent in a SYN segment by a TCP (1) |
|
|
to indicate that it is prepared to do both send and receive window |
|
|
scaling, and (2) to communicate a scale factor to be applied to |
|
|
its receive window. The scale factor is encoded logarithmically, |
|
|
as a power of 2 (presumably to be implemented by binary shifts). |
|
|
|
|
|
Note: the window in the SYN segment itself is never scaled. |
|
|
|
|
|
TCP Window Scale Option: |
|
|
|
|
|
Kind: 3 |
|
|
|
|
|
+---------+---------+---------+ |
|
|
| Kind=3 |Length=3 |shift.cnt| |
|
|
+---------+---------+---------+ |
|
|
|
|
|
Here shift.cnt is the number of bits by which the receiver right- |
|
|
shifts the true receive-window value, to scale it into a 16-bit |
|
|
value to be sent in TCP header (this scaling is explained below). |
|
|
The value shift.cnt may be zero (offering to scale, while applying |
|
|
a scale factor of 1 to the receive window). |
|
|
|
|
|
This option is an offer, not a promise; both sides must send |
|
|
Window Scale options in their SYN segments to enable window |
|
|
scaling in either direction. |
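
The three-byte layout above can be illustrated as follows. This is a sketch under our own assumptions: the function names are hypothetical, and returning None for a malformed option is an illustrative choice, not something this memo specifies.

```python
WINDOW_SCALE_KIND = 3
WINDOW_SCALE_LENGTH = 3


def encode_window_scale(shift_cnt):
    """Build the three-byte Window Scale option for a SYN segment."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return bytes([WINDOW_SCALE_KIND, WINDOW_SCALE_LENGTH, shift_cnt])


def decode_window_scale(option):
    """Return shift.cnt from a Window Scale option, or None if malformed."""
    if len(option) != WINDOW_SCALE_LENGTH:
        return None
    kind, length, shift_cnt = option
    if kind != WINDOW_SCALE_KIND or length != WINDOW_SCALE_LENGTH:
        return None
    return shift_cnt
```

A shift.cnt of zero encodes as the bytes 3, 3, 0: an offer to scale with a scale factor of 1.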
|
|
|
|
|
2.2 Using the Window Scale Option |
|
|
|
|
|
A model implementation of window scaling is as follows, using the |
|
|
notation of RFC-793 [Postel81]: |
|
|
|
|
|
* The send-window (SND.WND) and receive-window (RCV.WND) sizes |
|
|
in the connection state block and in all sequence space |
|
|
calculations are expanded from 16 to 32 bits. |
|
|
|
|
|
* Two window shift counts are added to the connection state: |
|
|
snd.scale and rcv.scale. These are shift counts to be |
|
|
applied to the incoming and outgoing windows, respectively. |
|
|
The precise algorithm is shown below. |
|
|
|
|
|
* All outgoing SYN segments are sent with the Window Scale |
|
|
option, containing a value shift.cnt = R that the TCP would |
|
|
like to use for its receive window. |
|
|
|
|
|
* Snd.scale and rcv.scale are initialized to zero, and are |
|
|
changed only during processing of a received SYN segment. If |
|
|
the SYN segment contains a Window Scale option with shift.cnt |
|
|
= S, set snd.scale to S and set rcv.scale to R; otherwise, |
|
|
both snd.scale and rcv.scale are left at zero. |
|
|
|
|
|
* The window field (SEG.WND) in the header of every incoming |
|
|
segment, with the exception of SYN segments, will be left- |
|
|
shifted by snd.scale bits before updating SND.WND: |
|
|
|
|
|
SND.WND = SEG.WND << snd.scale |
|
|
|
|
|
(assuming the other conditions of RFC793 are met, and using |
|
|
the "C" notation "<<" for left-shift). |
|
|
|
|
|
* The window field (SEG.WND) of every outgoing segment, with |
|
|
the exception of SYN segments, will have been right-shifted |
|
|
by rcv.scale bits: |
|
|
|
|
|
SEG.WND = RCV.WND >> rcv.scale. |
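
The model above might be sketched as follows. The class and method names are ours, and clamping the outgoing value to 16 bits is an added safeguard that the model itself does not mention.

```python
class WindowScaling:
    """Per-connection window scaling state, following the model above."""

    def __init__(self, desired_shift):
        self.desired_shift = desired_shift  # R: sent in our own SYN
        self.snd_scale = 0  # shift applied to incoming SEG.WND
        self.rcv_scale = 0  # shift applied to outgoing RCV.WND

    def on_syn_received(self, peer_shift):
        """peer_shift is S from the peer's option, or None if absent."""
        if peer_shift is not None:
            self.snd_scale = peer_shift
            self.rcv_scale = self.desired_shift
        # Otherwise both shift counts stay at zero: no scaling.

    def snd_wnd_from_segment(self, seg_wnd, is_syn=False):
        """SND.WND = SEG.WND << snd.scale, except on SYN segments."""
        return seg_wnd if is_syn else seg_wnd << self.snd_scale

    def seg_wnd_for_segment(self, rcv_wnd, is_syn=False):
        """SEG.WND = RCV.WND >> rcv.scale, except on SYN segments."""
        wnd = rcv_wnd if is_syn else rcv_wnd >> self.rcv_scale
        return min(wnd, 0xFFFF)  # keep within the 16-bit field (our assumption)
```

For example, a TCP that offered R = 3 and received S = 2 would advertise a 262144-byte receive window as 32768, and would read an incoming window field of 4096 as 16384 bytes.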
|
|
|
|
|
|
|
|
TCP determines if a data segment is "old" or "new" by testing if |
|
|
its sequence number is within 2**31 bytes of the left edge of the |
|
|
window. If not, the data is "old" and discarded. To ensure that
|
|
new data is never mistakenly considered old and vice-versa, the |
|
|
left edge of the sender's window has to be at most 2**31 away
|
|
from the right edge of the receiver's window. Similarly with the |
|
|
sender's right edge and receiver's left edge. Since the right and |
|
|
left edges of either the sender's or receiver's window differ by |
|
|
the window size, and since the sender and receiver windows can be |
|
|
out of phase by at most the window size, the above constraints |
|
|
imply that 2 * the max window size must be less than 2**31, or |
|
|
|
|
|
max window < 2**30 |
|
|
|
|
|
Since the max window is 2**S (where S is the scaling shift count) |
|
|
times at most 2**16 - 1 (the maximum unscaled window), the maximum |
|
|
window is guaranteed to be < 2**30 if S <= 14. Thus, the shift
|
|
count must be limited to 14. (This allows windows of 2**30 = 1 |
|
|
Gbyte.) If a Window Scale option is received with a shift.cnt |
|
|
value exceeding 14, the TCP should log the error but use 14 |
|
|
instead of the specified value. |
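
The bound can be verified numerically. In this sketch the helper names are hypothetical, and logging is reduced to a returned flag for brevity.

```python
MAX_SHIFT = 14
MAX_UNSCALED_WINDOW = 2**16 - 1


def max_window(shift):
    """Largest window expressible with the given shift count."""
    return MAX_UNSCALED_WINDOW << shift


def accept_shift(shift_cnt):
    """Clamp a received shift.cnt to 14; the flag says whether to log an error."""
    too_large = shift_cnt > MAX_SHIFT
    return min(shift_cnt, MAX_SHIFT), too_large
```

With a shift of 14 the largest window is 2**30 - 2**14, just under the limit; a shift of 15 would already exceed it.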
|
|
|
|
|
|
|
|
3. TCP SELECTIVE ACKNOWLEDGMENT OPTIONS |
|
|
|
|
|
To minimize the impact on the TCP protocol, the selective |
|
|
acknowledgment extension uses the form of two new TCP options. The |
|
|
first is an enabling option, "SACK-permitted", that may be sent in a |
|
|
SYN segment to indicate that the SACK option may be used once the
|
|
connection is established. The other is the SACK option itself, |
|
|
which may be sent over an established connection once permission has |
|
|
been given by SACK-permitted. |
|
|
|
|
|
The SACK option is to be included in a segment sent from a TCP that |
|
|
is receiving data to the TCP that is sending that data; we will refer |
|
|
to these TCP's as the data receiver and the data sender, |
|
|
respectively. We will consider a particular simplex data flow; any |
|
|
data flowing in the reverse direction over the same connection can be |
|
|
treated independently. |
|
|
|
|
|
3.1 SACK-Permitted Option |
|
|
|
|
|
This two-byte option may be sent in a SYN by a TCP that has been |
|
|
extended to receive (and presumably process) the SACK option once |
|
|
the connection has opened. |
|
|
|
|
|
TCP SACK-Permitted Option:
|
|
|
|
|
Kind: 4 |
|
|
|
|
|
+---------+---------+ |
|
|
| Kind=4 | Length=2| |
|
|
+---------+---------+ |
|
|
|
|
|
3.2 SACK Option |
|
|
|
|
|
The SACK option is to be used to convey extended acknowledgment |
|
|
information over an established connection. Specifically, it is |
|
|
to be sent by a data receiver to inform the data transmitter of |
|
|
non-contiguous blocks of data that have been received and queued. |
|
|
The data receiver is awaiting the receipt of data in later |
|
|
retransmissions to fill the gaps in sequence space between these |
|
|
blocks. At that time, the data receiver will acknowledge the data |
|
|
normally by advancing the left window edge in the Acknowledgment |
|
|
Number field of the TCP header. |
|
|
|
|
|
It is important to understand that the SACK option will not change |
|
|
the meaning of the Acknowledgment Number field, whose value will |
|
|
still specify the left window edge, i.e., one byte beyond the last |
|
|
sequence number of fully-received data. The SACK option is |
|
|
advisory; if it is ignored, TCP acknowledgments will continue to |
|
|
function as specified in the protocol. |
|
|
|
|
|
However, SACK will provide additional information that the data |
|
|
transmitter can use to optimize retransmissions. The TCP data |
|
|
receiver may include the SACK option in an acknowledgment segment |
|
|
whenever it has data that is queued and unacknowledged. Of |
|
|
course, the SACK option may be sent only when the TCP has received |
|
|
the SACK-permitted option in the SYN segment for that connection. |
|
|
|
|
|
TCP SACK Option: |
|
|
|
|
|
Kind: 5 |
|
|
|
|
|
Length: Variable |
|
|
|
|
|
|
|
|
+--------+--------+--------+--------+--------+--------+...---+ |
|
|
| Kind=5 | Length | Relative Origin | Block Size | | |
|
|
+--------+--------+--------+--------+--------+--------+...---+ |
|
|
|
|
|
|
|
|
This option contains a list of the blocks of contiguous sequence |
|
|
space occupied by data that has been received and queued within |
|
|
the window. Each block is contiguous and isolated; that is, the |
|
|
octets just below the block, |
|
|
|
|
|
Acknowledgment Number + Relative Origin - 1,
|
|
|
|
|
and just above the block, |
|
|
|
|
|
Acknowledgment Number + Relative Origin + Block Size, |
|
|
|
|
|
have not been received. |
|
|
|
|
|
Each contiguous block of data queued at the receiver is defined in |
|
|
the SACK option by two 16-bit integers: |
|
|
|
|
|
|
|
|
* Relative Origin |
|
|
|
|
|
This is the first sequence number of this block, relative to |
|
|
the Acknowledgment Number field in the TCP header (i.e., |
|
|
relative to the data receiver's left window edge). |
|
|
|
|
|
|
|
|
* Block Size |
|
|
|
|
|
This is the size in octets of this block of contiguous data. |
|
|
|
|
|
|
|
|
A SACK option that specifies n blocks will have a length of 4*n+2 |
|
|
octets, so the 44 bytes available for TCP options can specify a |
|
|
maximum of 10 blocks. Of course, if other TCP options are |
|
|
introduced, they will compete for the 44 bytes, and the limit of |
|
|
10 may be reduced in particular segments. |
|
|
|
|
|
There is no requirement on the order in which blocks can appear in |
|
|
a single SACK option. |
|
|
|
|
|
Note: requiring that the blocks be ordered would allow a |
|
|
slightly more efficient algorithm in the transmitter; however, |
|
|
this does not seem to be an important optimization. |
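
The option layout and the 4*n+2 length rule can be illustrated as follows. This is a sketch under stated assumptions: the function names are ours, the 16-bit fields are packed in network byte order (not stated explicitly above), and the 44-byte option space is taken from the text as given.

```python
import struct

SACK_KIND = 5
OPTION_SPACE = 44  # bytes available for options, as stated in the text


def max_sack_blocks(option_space=OPTION_SPACE):
    """Largest n with 4*n + 2 <= option_space."""
    return (option_space - 2) // 4


def encode_sack(blocks):
    """Encode (relative_origin, block_size) pairs as a SACK option."""
    if len(blocks) > max_sack_blocks():
        raise ValueError("too many blocks for the option space")
    length = 4 * len(blocks) + 2
    body = b"".join(struct.pack("!HH", origin, size) for origin, size in blocks)
    return bytes([SACK_KIND, length]) + body


def decode_sack(option):
    """Return the list of (relative_origin, block_size) pairs."""
    kind, length = option[0], option[1]
    if kind != SACK_KIND or length != len(option) or (length - 2) % 4:
        raise ValueError("malformed SACK option")
    return [struct.unpack_from("!HH", option, off) for off in range(2, length, 4)]
```

Because block order is unconstrained, the decoder returns the blocks in whatever order they were sent.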
|
|
|
|
|
3.3 SACK with Window Scaling |
|
|
|
|
|
If window scaling is in effect, then 16 bits may not be sufficient |
|
|
for the SACK option fields that define the origin and length of a |
|
|
block. There are two possible ways to handle this: |
|
|
|
|
|
(1) Expand the SACK origin and length fields to 24 or 32 bits. |
|
|
|
|
|
(2) Scale the SACK fields by the same factor as the window. |
|
|
|
|
|
|
|
|
The first alternative would significantly reduce the number of |
|
|
blocks possible in a SACK option; therefore, we have chosen the |
|
|
second alternative, scaling the SACK information as well as the |
|
|
window. |
|
|
|
|
|
Scaling the SACK information introduces some loss of precision, |
|
|
since a SACK option must report queued data blocks whose origins |
|
|
and lengths are multiples of the window scale factor 2**rcv.scale.
|
|
These reported blocks must be equal to or smaller than the actual |
|
|
blocks of queued data. |
|
|
|
|
|
Specifically, suppose that the receiver has a contiguous block of |
|
|
queued data that occupies sequence numbers L, L+1, ... L+N-1, and |
|
|
that the window scale factor is S = 2**rcv.scale. Then the
|
|
corresponding block that will be reported in a SACK option will |
|
|
be: |
|
|
|
|
|
Relative Origin = int((L+S-1)/S) |
|
|
|
|
|
Block Size = int((L+N)/S) - (Relative Origin) |
|
|
|
|
|
where the function int(x) returns the greatest integer contained |
|
|
in x. |
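
The two formulas can be written out as a short sketch. The function names are illustrative, and L is taken here to be relative to the Acknowledgment Number, which is our reading of the text.

```python
def scale_sack_block(rel_start, length, factor):
    """Scale one queued block to multiples of the scale factor.

    rel_start is L, length is N and factor is S.  The origin is rounded
    up and the end is rounded down, so the reported block never extends
    beyond the data actually queued.
    """
    origin = (rel_start + factor - 1) // factor
    size = (rel_start + length) // factor - origin
    return origin, size


def covered_range(origin, size, factor):
    """First and last relative sequence numbers covered by a scaled block."""
    return origin * factor, (origin + size) * factor - 1
```

With S = 16, a block starting at relative offset 500 and 500 octets long is reported as origin 32 and size 30, covering offsets 512 through 991. A queued block shorter than S can yield a size of zero or less; presumably such a block would simply not be reported, although the text does not say so.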
|
|
|
|
|
The resulting loss of precision is not a serious problem for the |
|
|
sender. If the data-sending TCP keeps track of the boundaries of |
|
|
all segments in its retransmission queue, it will generally be |
|
|
able to infer from the imprecise SACK data which full segments |
|
|
don't need to be retransmitted. This will fail only if S is |
|
|
larger than the maximum segment size, in which case some segments |
|
|
may be retransmitted unnecessarily. If the sending TCP does not |
|
|
keep track of transmitted segment boundaries, the imprecision of |
|
|
the scaled SACK quantities will only result in retransmitting a |
|
|
small amount of unneeded sequence space. On the average, the data |
|
|
sender will unnecessarily retransmit J*S bytes of the sequence |
|
|
space for each SACK received; here J is the number of blocks |
|
|
reported in the SACK, and S = 2**snd.scale.
|
|
|
|
|
3.4 SACK Option Examples |
|
|
|
|
|
Assume the left window edge is 5000 and that the data transmitter |
|
|
sends a burst of 8 segments, each containing 500 data bytes. |
|
|
Unless specified otherwise, we assume that the scale factor S = 1. |
|
|
|
|
|
|
|
|
Case 1: The first 4 segments are received but the last 4 are |
|
|
dropped. |
|
|
|
|
|
The data receiver will return a normal TCP ACK segment |
|
|
acknowledging sequence number 7000, with no SACK option. |
|
|
|
|
|
|
|
|
Case 2: The first segment is dropped but the remaining 7 are |
|
|
received. |
|
|
|
|
|
The data receiver will return a TCP ACK segment that |
|
|
acknowledges sequence number 5000 and contains a SACK option |
|
|
specifying one block of queued data: |
|
|
|
|
|
Relative Origin = 500; Block Size = 3500 |
|
|
|
|
|
|
|
|
Case 3: The 2nd, 4th, 6th, and 8th (last) segments are |
|
|
dropped. |
|
|
|
|
|
The data receiver will return a TCP ACK segment that |
|
|
acknowledges sequence number 5500 and contains a SACK option |
|
|
specifying the 3 blocks: |
|
|
|
|
|
Relative Origin = 500; Block Size = 500 |
|
|
Relative Origin = 1500; Block Size = 500 |
|
|
Relative Origin = 2500; Block Size = 500 |
|
|
|
|
|
|
|
|
Case 4: Same as Case 3, except Scale Factor S = 16. |
|
|
|
|
|
The SACK option would specify the 3 scaled blocks: |
|
|
|
|
|
Relative Origin = 32; Block Size = 30 |
|
|
Relative Origin = 94; Block Size = 31 |
|
|
Relative Origin = 157; Block Size = 30 |
|
|
|
|
|
These three reported blocks have sequence numbers 512 through |
|
|
991, 1504 through 1999, and 2512 through 2991, respectively.
|
|
|
|
|
|
|
|
3.5 Generating the SACK Option |
|
|
|
|
|
Let us assume that the data receiver maintains a queue of valid |
|
|
segments that it has neither passed to the user nor acknowledged |
|
|
because of earlier missing data, and that this queue is ordered by |
|
|
starting sequence number. Computation of the SACK option can be |
|
|
done with one pass down this queue. Segments that occupy |
|
|
contiguous sequence space are aggregated into a single SACK block, |
|
|
and each gap in the sequence space (except a gap that is |
|
|
terminated by the right window edge) triggers the start of a new |
|
|
SACK block. If this algorithm defines more than 10 blocks, only |
|
|
the first 10 can be included in the option. |
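
A one-pass computation along these lines might look as follows. This is a sketch, not a definitive implementation: the function name and the (start, length) tuple representation of queued segments are assumptions, scaling is omitted (S = 1), and sequence-number wraparound is ignored.

```python
def sack_blocks(ack_number, queued, max_blocks=10):
    """Build (relative_origin, block_size) pairs in one pass.

    queued is a list of (start_seq, length) segments sorted by start_seq,
    all lying beyond ack_number.
    """
    blocks = []
    for start, length in queued:
        rel = start - ack_number
        if blocks and blocks[-1][0] + blocks[-1][1] >= rel:
            # Contiguous with (or overlapping) the previous block: extend it.
            origin, size = blocks[-1]
            end = max(origin + size, rel + length)
            blocks[-1] = (origin, end - origin)
        else:
            # A gap in sequence space starts a new block.
            blocks.append((rel, length))
    return blocks[:max_blocks]
```

Applied to Case 2 of the examples (ACK 5000, seven queued segments starting at 5500), it yields the single block with origin 500 and size 3500.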
|
|
|
|
|
3.6 Interpreting the SACK Option |
|
|
|
|
|
The data transmitter is assumed to have a retransmission queue |
|
|
that contains the segments that have been transmitted but not yet |
|
|
acknowledged, in sequence-number order. If the data transmitter |
|
|
performs re-packetization before retransmission, the block |
|
|
boundaries in a SACK option that it receives may not fall on |
|
|
boundaries of segments in the retransmission queue; however, this |
|
|
does not pose a serious difficulty for the transmitter. |
|
|
|
|
|
Let us suppose that for each segment in the retransmission queue |
|
|
there is a (new) flag bit "ACK'd", to be used to indicate that |
|
|
this particular segment has been entirely acknowledged. When a |
|
|
segment is first transmitted, it will be entered into the |
|
|
retransmission queue with its ACK'd bit off. If the ACK'd bit is |
|
|
subsequently turned on (as the result of processing a received |
|
|
SACK option), the data transmitter will skip this segment during |
|
|
any later retransmission. However, the segment will not be |
|
|
dequeued and its buffer freed until the left window edge is |
|
|
advanced over it. |
|
|
|
|
|
When an acknowledgment segment arrives containing a SACK option, |
|
|
the data transmitter will turn on the ACK'd bits for segments that |
|
|
have been selectively acknowledged. More specifically, for each
|
|
block in the SACK option, the data transmitter will turn on the |
|
|
ACK'd flags for all segments in the retransmission queue that are |
|
|
wholly contained within that block. This requires straightforward |
|
|
sequence number comparisons. |
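
The marking step could be sketched as follows. The QueuedSegment type and the function names are hypothetical; the sketch assumes an unscaled SACK (S = 1) and ignores sequence-number wraparound.

```python
from dataclasses import dataclass


@dataclass
class QueuedSegment:
    seq: int             # first sequence number of the segment
    length: int          # number of data octets
    acked: bool = False  # the "ACK'd" flag bit


def apply_sack(queue, ack_number, blocks):
    """Set ACK'd on every segment wholly contained in some SACK block."""
    for origin, size in blocks:
        first = ack_number + origin
        end = first + size  # one past the last octet of the block
        for seg in queue:
            if first <= seg.seq and seg.seq + seg.length <= end:
                seg.acked = True


def segments_to_retransmit(queue):
    """Segments that a later retransmission would still send."""
    return [seg for seg in queue if not seg.acked]
```

Marked segments stay in the queue; only an advance of the left window edge dequeues them.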
|
|
|
|
|
|
|
|
4. TCP ECHO OPTIONS |
|
|
|
|
|
A simple method for measuring the RTT of a segment would be: the |
|
|
sender places a timestamp in the segment and the receiver returns |
|
|
that timestamp in the corresponding ACK segment. When the ACK segment |
|
|
arrives at the sender, the difference between the current time and |
|
|
the timestamp is the RTT. To implement this timing method, the |
|
|
receiver must simply reflect or echo selected data (the timestamp) |
|
|
from the sender's segments. This idea is the basis of the "TCP Echo" |
|
|
and "TCP Echo Reply" options. |
|
|
|
|
|
4.1 TCP Echo and TCP Echo Reply Options |
|
|
|
|
|
TCP Echo Option: |
|
|
|
|
|
Kind: 6 |
|
|
|
|
|
Length: 6 |
|
|
|
|
|
+--------+--------+--------+--------+--------+--------+ |
|
|
| Kind=6 | Length | 4 bytes of info to be echoed | |
|
|
+--------+--------+--------+--------+--------+--------+ |
|
|
|
|
|
This option carries four bytes of information that the receiving TCP |
|
|
may send back in a subsequent TCP Echo Reply option (see below). A |
|
|
TCP may send the TCP Echo option in any segment, but only if a TCP |
|
|
Echo option was received in a SYN segment for the connection. |
|
|
|
|
|
When the TCP echo option is used for RTT measurement, it will be |
|
|
included in data segments, and the four information bytes will define |
|
|
the time at which the data segment was transmitted in any format |
|
|
convenient to the sender. |
|
|
|
|
|
TCP Echo Reply Option: |
|
|
|
|
|
Kind: 7 |
|
|
|
|
|
Length: 6 |
|
|
|
|
|
+--------+--------+--------+--------+--------+--------+ |
|
|
| Kind=7 | Length | 4 bytes of echoed info | |
|
|
+--------+--------+--------+--------+--------+--------+ |
|
|
|
|
|
|
|
|
A TCP that receives a TCP Echo option containing four information |
|
|
bytes will return these same bytes in a TCP Echo Reply option. |
|
|
|
|
|
This TCP Echo Reply option must be returned in the next segment |
|
|
(e.g., an ACK segment) that is sent. If more than one Echo option is |
|
|
received before a reply segment is sent, the TCP must choose only one |
|
|
of the options to echo, ignoring the others; specifically, it must |
|
|
choose the newest segment with the oldest sequence number (see next |
|
|
section.) |
|
|
|
|
|
To use the TCP Echo and Echo Reply options, a TCP must send a TCP |
|
|
Echo option in its own SYN segment and receive a TCP Echo option in a |
|
|
SYN segment from the other TCP. A TCP that does not implement the |
|
|
TCP Echo or Echo Reply options must simply ignore any TCP Echo |
|
|
options it receives. However, a TCP should not receive one of these |
|
|
options in a non-SYN segment unless it included a TCP Echo option in |
|
|
its own SYN segment. |
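
The two option formats and the resulting RTT calculation can be illustrated as follows. This is a sketch only: the function names are ours, and treating the four information bytes as a 32-bit tick counter in network byte order is an assumption, since the format is left to the sender's convenience.

```python
import struct

ECHO_KIND = 6
ECHO_REPLY_KIND = 7
ECHO_LENGTH = 6


def encode_echo(info):
    """Build a TCP Echo option carrying a 32-bit value."""
    return struct.pack("!BBI", ECHO_KIND, ECHO_LENGTH, info & 0xFFFFFFFF)


def make_echo_reply(echo_option):
    """Return the Echo Reply option that echoes the same four bytes."""
    if len(echo_option) != ECHO_LENGTH:
        raise ValueError("not a TCP Echo option")
    kind, length = echo_option[0], echo_option[1]
    if kind != ECHO_KIND or length != ECHO_LENGTH:
        raise ValueError("not a TCP Echo option")
    return bytes([ECHO_REPLY_KIND, ECHO_LENGTH]) + echo_option[2:]


def rtt_from_reply(reply_option, now):
    """RTT in sender clock ticks, allowing for 32-bit clock wraparound."""
    kind, length, sent = struct.unpack("!BBI", reply_option)
    if kind != ECHO_REPLY_KIND or length != ECHO_LENGTH:
        raise ValueError("not a TCP Echo Reply option")
    return (now - sent) & 0xFFFFFFFF
```

The receiver never interprets the four bytes; it only copies them, so the sender is free to change its clock format without any change at the peer.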
|
|
|
|
|
4.2 Using the Echo Options |
|
|
|
|
|
If we wish to use the Echo/Echo Reply options for RTT measurement, we |
|
|
have to define what the receiver does when there is not a one-to-one |
|
|
correspondence between data and ACK segments. Assuming that we want |
|
|
to minimize the state kept in the receiver (i.e., the number of |
|
|
unprocessed Echo options), we can plan on a receiver remembering the |
|
|
information value from at most one Echo between ACKs. There are |
|
|
three situations to consider: |
|
|
|
|
|
(A) Delayed ACKs. |
|
|
|
|
|
Many TCP's acknowledge only every Kth segment out of a group of |
|
|
segments arriving within a short time interval; this policy is |
|
|
known generally as "delayed ACK's". The data-sender TCP must |
|
|
measure the effective RTT, including the additional time due to |
|
|
delayed ACK's, or else it will retransmit unnecessarily. Thus, |
|
|
when delayed ACK's are in use, the receiver should reply with |
|
|
the Echo option information from the earliest unacknowledged |
|
|
segment. |
|
|
|
|
|
(B) A hole in the sequence space (segment(s) have been lost). |
|
|
|
|
|
The sender will continue sending until the window is filled, and |
|
|
we may be generating ACKs as these out-of-order segments arrive |
|
|
(e.g., for the SACK information or to aid "fast retransmit"). |
|
|
An Echo Reply option will tell the sender the RTT of some |
|
|
recently sent segment (since the ACK can only contain the |
|
|
sequence number of the hole, the sender may not be able to |
|
|
determine which segment, but that doesn't matter). If the loss |
|
|
was due to congestion, these RTTs may be particularly valuable |
|
|
to the sender since they reflect the network characteristics |
|
|
immediately after the congestion. |
|
|
|
|
|
(C) A filled hole in the sequence space. |
|
|
|
|
|
The segment that fills the hole represents the most recent |
|
|
measurement of the network characteristics. On the other hand, |
|
|
an RTT computed from an earlier segment would probably include |
|
|
the sender's retransmit time-out, badly biasing the sender's |
|
|
average RTT estimate. |
|
|
|
|
|
|
|
|
Case (A) suggests the receiver should remember and return the Echo |
|
|
option information from the oldest unacknowledged segment. Cases (B) |
|
|
and (C) suggest that the option should come from the most recent |
|
|
unacknowledged segment. An algorithm that covers all three cases is |
|
|
for the receiver to return the Echo option information from the |
|
|
newest segment with the oldest sequence number, as specified earlier. |
|
|
|
|
|
A model implementation of these options is as follows. |
|
|
|
|
|
|
|
|
(1) Receiver Implementation |
|
|
|
|
|
A 32-bit slot for Echo option data, rcv.echodata, is added to |
|
|
the receiver connection state, together with a flag, |
|
|
rcv.echopresent, that indicates whether there is anything in the |
|
|
slot. When the receiver generates a segment, it checks |
|
|
rcv.echopresent and, if it is set, adds an echo-reply option |
|
|
containing rcv.echodata to the outgoing segment then clears |
|
|
rcv.echopresent. |
|
|
|
|
|
If an incoming segment is in the window and contains an echo |
|
|
option, the receiver checks rcv.echopresent. If it isn't set, |
|
|
the value of the echo option is copied to rcv.echodata and |
|
|
rcv.echopresent is set. If rcv.echopresent is already set, the |
|
|
receiver checks whether the segment is at the left edge of the |
|
|
window. If so, the segment's echo option value is copied to |
|
|
rcv.echodata (this is situation (C) above). Otherwise, the |
|
|
segment's echo option is ignored. |
|
|
|
|
|
|
|
|
(2) Sender Implementation |
|
|
|
|
|
The sender's connection state has a single flag bit, |
|
|
snd.echoallowed, added. If snd.echoallowed is set or if the |
|
|
segment contains a SYN, the sender is free to add a TCP Echo |
|
|
option (presumably containing the current time in some units |
|
|
convenient to the sender) to every outgoing segment. |
|
|
|
|
|
Snd.echoallowed should be set if a SYN is received with a TCP |
|
|
Echo option (presumably, a host that implements the option will |
|
|
attempt to use it to time the SYN segment). |
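
The receiver and sender models above might be sketched as follows. The class and method names are ours; in particular, the caller is assumed to decide whether a segment is in the window and whether it sits at the left window edge, since that bookkeeping lies outside this sketch.

```python
class EchoReceiver:
    """Receiver-side state: one slot (rcv.echodata) plus a flag."""

    def __init__(self):
        self.echodata = 0
        self.echopresent = False

    def on_segment(self, in_window, at_left_edge, echo_value):
        """Process an incoming segment; echo_value is None if no Echo option."""
        if not in_window or echo_value is None:
            return
        if not self.echopresent:
            self.echodata = echo_value
            self.echopresent = True
        elif at_left_edge:
            # Situation (C): the segment at the left edge replaces the slot.
            self.echodata = echo_value
        # Otherwise the segment's echo option is ignored.

    def on_send(self):
        """Value for an Echo Reply option, or None; clears the flag."""
        if not self.echopresent:
            return None
        self.echopresent = False
        return self.echodata


class EchoSender:
    """Sender-side state: the single flag snd.echoallowed."""

    def __init__(self):
        self.echoallowed = False

    def on_syn_received(self, has_echo_option):
        if has_echo_option:
            self.echoallowed = True

    def may_add_echo(self, segment_is_syn):
        return self.echoallowed or segment_is_syn
```

With two out-of-order segments queued, the slot keeps the value from the first; a later segment that arrives at the left edge replaces it, and the next outgoing segment carries that value and empties the slot.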
|
|
|
|
|
|
|
|
5. CONCLUSIONS AND ACKNOWLEDGMENTS |
|
|
|
|
|
We have proposed five new TCP options for scaled windows, selective |
|
|
acknowledgments, and round-trip timing, in order to provide efficient |
|
|
operation over large-bandwidth*delay-product paths. These extensions |
|
|
are designed to provide compatible interworking with TCP's that do not |
|
|
implement the extensions. |
|
|
|
|
|
The Window Scale option was originally suggested by Mike St. Johns of |
|
|
USAF/DCA. The present form of the option was suggested by Mike Karels |
|
|
of UC Berkeley in response to a more cumbersome scheme proposed by Van |
|
|
Jacobson. Gerd Beling of FGAN (West Germany) contributed the initial |
|
|
definition of the SACK option. |
|
|
|
|
|
All three extensions have evolved through discussion with the End-to-End
|
|
Task Force, and the authors are grateful to the other members of the |
|
|
Task Force for their advice and encouragement. |
|
|
|
|
|
6. REFERENCES |
|
|
|
|
|
[Cheriton88] Cheriton, D., "VMTP: Versatile Message Transaction |
|
|
Protocol", RFC 1045, Stanford University, February 1988. |
|
|
|
|
|
[Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet |
|
|
Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm., |
|
|
Scottsdale, Arizona, March 1986. |
|
|
|
|
|
[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times |
|
|
in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT, |
|
|
August 1987. |
|
|
|
|
|
[Clark87] Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk |
|
|
Data Transfer Protocol", RFC 998, MIT, March 1987. |
|
|
|
|
|
[Nagle84] Nagle, J., "Congestion Control in IP/TCP |
|
|
Internetworks", RFC 896, FACC, January 1984. |
|
|
|
|
|
[NBS85] Colella, R., Aronoff, R., and K. Mills, "Performance |
|
|
Improvements for ISO Transport", Ninth Data Comm Symposium, |
|
|
published in ACM SIGCOMM Comp Comm Review, vol. 15, no. 5, |
|
|
September 1985. |
|
|
|
|
|
[Partridge87] Partridge, C., "Private Communication", February |
|
|
1987. |
|
|
|
|
|
[Postel81] Postel, J., "Transmission Control Protocol - DARPA |
|
|
Internet Program Protocol Specification", RFC 793, DARPA, |
|
|
September 1981. |
|
|
|
|
|
[Velten84] Velten, D., Hinden, R., and J. Sax, "Reliable Data |
|
|
Protocol", RFC 908, BBN, July 1984. |
|
|
|
|
|
[Jacobson88] Jacobson, V., "Congestion Avoidance and Control", to |
|
|
be presented at SIGCOMM '88, Stanford, CA., August 1988. |
|
|
|
|
|
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. |
|
|
SIGCOMM '86, Stowe, Vt., August 1986. |