Network working group                                     I. van Beijnum
Internet-Draft                                            IMDEA Networks
Expires: November 7, 2009                                    May 6, 2009

                        One-ended multipath TCP
                     draft-van-beijnum-1e-mp-tcp-00

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on November 7, 2009.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

Normal TCP/IP operation is for the routing system to select a best path that remains stable for some time, and for TCP to adjust to the properties of this path to optimize throughput. A multipath TCP would be able to either use capacity on multiple paths, or dynamically find the best performing path, and therefore reach higher throughput.
By adapting to the properties of several paths through the usual congestion control algorithms, a multipath TCP shifts its traffic to less congested paths, leaving more capacity available for traffic that can't move to another path on more congested paths. And when a path fails, this can be detected and worked around by TCP much more quickly than by waiting for the routing system to repair the failure.

This memo specifies a multipath TCP that is implemented on the sending host only, without requiring modifications on the receiving host.

Table of Contents

1.  Introduction
2.  Notational Conventions
3.  Congestion control
   3.1.  RTT measurements
   3.2.  Fast retransmit
   3.3.  Slow retransmit
   3.4.  SACK
   3.5.  Fairness and TCP friendliness
4.  Path selection
   4.1.  The multipath IP layer
   4.2.  The path indication option
   4.3.  Timestamp integration option
   4.4.  Path for retransmissions
   4.5.  ECN
   4.6.  Path MTU discovery
5.  Flow control and buffer sizes
6.  Handling of RSTs
7.  Middlebox considerations
8.  Security considerations
9.  IANA considerations
10. Acknowledgements
11. References
   11.1.  Normative References
   11.2.  Informational References
Appendix A.  Document and discussion information
Appendix B.  An implementation strategy
Author's Address

1.  Introduction

In order to achieve redundancy to protect against failures, network operators generally install more links than the minimum necessary to achieve reachability. So there are often multiple paths between any two given hosts, even when paths not allowed by policy are removed. However, routing protocols usually select a single "best" path. When multiple paths are used at the same time by the routing system, those tend to be parallel links between two routers or paths that are otherwise very similar.

As such, a lot of potentially usable network capacity is left unused. A multipath transport protocol would be able to use more of that capacity by sending its data along multiple paths at the same time, or by switching to a path with more available capacity. As TCP [RFC0793] is used by the vast majority of all networked applications, and TCP is responsible for the vast majority of all data transmitted over the internet, the logical choice would be to make TCP capable of using multiple paths.

SCTP already has the ability to use multiple paths through the use of multiple addresses. However, using SCTP in this way requires significant application changes and deployment would be challenging because there is no obvious way for an application to know whether a service is available over SCTP rather than, or in addition to, TCP. In addition, SCTP as defined today [RFC2960] does not accommodate the concurrent use of multiple paths. Additional paths are purely used for backup purposes.
This memo describes a one-ended multipath TCP, which only changes the behavior of the TCP sender, achieving multipath advantages when communicating with unmodified TCP receivers. This means it is not possible to perform path selection by using different destination addresses. However, other mechanisms that are transparent to the receiver are possible. A simple one would be for the sender to send some packets to one router, and other packets to another router. If these routers then make different routing decisions for the destination address in the TCP packets, the packets flow over different paths part of the way. Other mechanisms to achieve the same goal are also possible. However, with a single destination address, paths can't be completely disjoint.

Using multiple paths at the same time brings up a number of challenges and questions:

o  Naive scheduling (such as round robin) of transmissions over the different paths reduces performance of each path to that of the slowest path.

o  Using multiple paths causes reordering, which triggers the fast retransmit algorithm, causing unnecessary retransmissions and reduced performance.

o  TCP requires in-order delivery of data to the application, so when losses occur on one path, buffer capacity may run out and data can't be transmitted on unaffected paths until the lost data has been retransmitted.

o  Using multiple paths with an instance of regular congestion control on each path for a single TCP session makes that session use network capacity more aggressively than single path sessions, which can be considered "unfair" and increases packet loss.

This memo seeks to address the first two issues by running separate instances of TCP's congestion control algorithms for the subflows that flow over different paths.
Buffer issues are addressed by retransmitting packets before buffer space runs out, even if normal retransmission timers haven't fired yet. The fairness issue is a topic of ongoing research; this specification simply limits the number of subflows to limit unfairness and increased loss.

The one-ended multipath TCP takes advantage of the fact that TCP [RFC0793] congestion control [RFC2581] and flow control are performed by the sender. With regard to flow control and congestion control, the role of the receiver is limited to sending back acknowledgments and advertising how much data it is prepared to receive. Hence, it is possible for the sender to utilize different paths and modify the fast retransmit logic as long as the receiver recognizes the packets as belonging to the same session.

So a multipath TCP sender can distribute packets over multiple paths as long as this doesn't require incompatible modifications to the IP or TCP header contents, most notably the addresses. A single-ended multipath TCP session must still be between a single source address and a single destination address, regardless of the path taken by packets. The subset of the packets belonging to a TCP session flowing over a given path is designated a subflow.

In order to benefit from using multiple paths, it's necessary for the multipath TCP sender to execute separate TCP congestion control instances for the packets belonging to different subflows. In the case where all packets are subject to the same congestion window, performance over a fast and a slow path will often be poorer than over just the fast path, defeating the purpose of using multiple paths.
For instance, in the case of a 10 Mbps and a 100 Mbps path with otherwise identical properties, a simple round robin distribution of the packets and the use of a single congestion window will limit performance to that of the slowest path multiplied by the number of paths, 20 Mbps in this case.

2.  Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3.  Congestion control

A multipath TCP maintains instances of all congestion control related variables for each subflow. This includes, but is not limited to, the congestion window, the ssthresh, the retransmission timeout (RTO), the user timeout and RTT measurements. However, because TCP requires in-order delivery of data, there must be a single send buffer and a single receive buffer, thus flow control must happen session-wide.

Per-subflow congestion control is performed by recording the path used to transmit each packet. Acknowledgments are then attributed to the subflow the acknowledged packets were sent over and the congestion window and other congestion control variables for the relevant subflow are updated accordingly.

3.1.  RTT measurements

Because a multipath TCP sender knows which packet it sent over which path, it can perform per-path round trip time measurements. This only works if return packets are consistently sent over the same path (or a set of paths with the same latency).

If the receiver is not multipath-aware, this condition will generally hold: acknowledgments will flow from the receiver to the sender over a single path unless there is a topology change in the routing system or packets that belong to a single session are distributed over different paths by routers, which is rare.
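
The per-subflow bookkeeping described in Sections 3 and 3.1 could be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the initial window values are arbitrary, and the RTT smoothing uses the conventional 1/8 and 1/4 gains, which this memo does not itself prescribe. The minimum RTO, clock granularity and the exclusion of retransmitted segments from RTT sampling are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Subflow:
    """Congestion control state kept separately for one path."""
    cwnd: int = 2 * 1460          # bytes; initial value is an assumption
    ssthresh: int = 65535         # bytes; initial value is an assumption
    srtt: Optional[float] = None  # smoothed RTT in seconds
    rttvar: float = 0.0           # RTT variation in seconds
    rto: float = 1.0              # retransmission timeout in seconds

    def add_rtt_sample(self, sample: float) -> None:
        # Conventional smoothing gains (1/8 and 1/4), assumed here.
        if self.srtt is None:
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
            self.srtt = 0.875 * self.srtt + 0.125 * sample
        self.rto = self.srtt + 4 * self.rttvar


@dataclass
class MultipathSender:
    subflows: Dict[int, Subflow] = field(default_factory=dict)
    # sequence number -> (path used, transmit time)
    in_flight: Dict[int, Tuple[int, float]] = field(default_factory=dict)

    def on_send(self, seq: int, path: int, now: float) -> None:
        """Record which path a segment was transmitted over."""
        self.subflows.setdefault(path, Subflow())
        self.in_flight[seq] = (path, now)

    def on_ack(self, seq: int, now: float) -> Optional[int]:
        """Attribute an acknowledged segment to its subflow; return the path."""
        entry = self.in_flight.pop(seq, None)
        if entry is None:
            return None  # duplicate or unknown acknowledgment
        path, sent_at = entry
        self.subflows[path].add_rtt_sample(now - sent_at)
        return path
```

Because every acknowledgment is mapped back to the path recorded at transmit time, a slow path only inflates its own RTT estimate and RTO, not those of the faster paths.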
To multipath-capable routers on the return path (if any), the non-multipath-aware host appears to select the default path for all of its packets.

However, if, like the sender, the receiver is multipath-aware, then the return path that the receiver chooses to send ACKs over will influence the RTTs seen by the original sender. The situation where the sender is unaware of the fact that the receiver selects different return paths with different latencies is suboptimal, even compared to consistently measuring the RTT over the slowest path, as this leads to higher variability in the RTT measurements and therefore a higher RTO. Having the receiver send ACKs over the same path mitigates the problem somewhat; but presumably, if the receiver is also multipath capable and has data to send, it will want to send this data over more than one path. So RTT measurements may inadvertently end up measuring different return paths in that case.

A better solution is for the sender to include an indication in packets that allows the receiver to determine through which path the sender sent the packet. This information, along with the path initially chosen for the outgoing packet that is acknowledged, allows TCP to attribute each RTT measurement to a specific path.

Because congestion control happens per path, there must also be a separate retransmission timeout (RTO) value for each path.

3.2.  Fast retransmit

Different paths will almost certainly have different RTTs, and even if the average RTT is the same, normal burstiness and differences in packet sizes will make packets routinely arrive through the different paths in a different order than the order in which they were transmitted. Without modifications to the algorithm, this would trigger the fast retransmit algorithm unnecessarily.
To avoid this, fast retransmit is executed whenever, for packets belonging to the same subflow, after an unACKed packet or sequence of packets, more than two segments of new data are ACKed with SACK. This means fast retransmit happens per subflow, and reordering between subflows no longer triggers fast retransmit.

3.3.  Slow retransmit

In multipath TCP, a per-path RTO is employed to recover from congestion events that fast retransmit can't handle. Because the missing packets create holes in the data stream, subsequent packets received over other paths must be buffered in the receive buffer. Unless the receive buffer is extremely large, this means the entire session stalls when the receive buffer fills up. This situation persists until the RTO expires for the congested or broken path so the missing packets can be retransmitted. Should the path in question be completely broken, this will then lead to an almost immediate new stall, and the stall/RTO cycles will then continue until the user timeout / R2 timer [RFC1122] for the subflow expires.

This is solved by taking unacknowledged packets transmitted over subflows that are stalled because they have exhausted their congestion window and are now waiting for the RTO to expire, and scheduling retransmissions of those packets over other paths before the RTO of the stalled subflow expires. This should be done such that the missing packet arrives before it becomes necessary to stop sending data altogether because the receiver advertises a zero receive buffer. Such retransmissions therefore happen as the receive buffer space advertised by the receiver reaches RTT * MSS for the path that will be used for the retransmission; presumably the path with the lowest RTT. In essence, this creates a second level of fast retransmit that acts across subflows in addition to the normal fast retransmit that happens per subflow.
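
The two retransmission triggers of Sections 3.2 and 3.3 could be sketched roughly as follows. The function names, the `Subflow` record and the way segments are identified by simple integers are all hypothetical. The `threshold` argument stands in for the advertised-window level at which the cross-subflow retransmission is scheduled; how that level is derived from the path's RTT and MSS is left to the caller, as this sketch does not try to pin it down.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Subflow:
    path: int
    srtt: float     # smoothed RTT in seconds
    stalled: bool   # congestion window exhausted, waiting for the RTO


def first_fast_retransmit(sent_on_subflow: List[int],
                          sacked: Set[int]) -> Optional[int]:
    """Per-subflow fast retransmit check.

    sent_on_subflow lists, in transmit order, the segments sent over ONE
    subflow that are not yet cumulatively acknowledged. A segment qualifies
    when more than two later segments of the same subflow were SACKed.
    """
    for i, seq in enumerate(sent_on_subflow):
        if seq in sacked:
            continue
        later_sacked = sum(1 for s in sent_on_subflow[i + 1:] if s in sacked)
        if later_sacked > 2:
            return seq
    return None


def slow_retransmit_plan(subflows: List[Subflow],
                         unacked: Dict[int, int],
                         advertised_window: int,
                         threshold: int) -> Optional[Tuple[int, List[int]]]:
    """Cross-subflow retransmission check.

    unacked maps segment -> path it was sent over. Returns the path to
    retransmit over and the segments to resend, or None if nothing is due.
    """
    if advertised_window > threshold:
        return None  # the receiver still has room; no need to act yet
    live = [s for s in subflows if not s.stalled]
    if not live:
        return None  # no progressing path to retransmit over
    target = min(live, key=lambda s: s.srtt)  # presumably the lowest RTT
    stalled_paths = {s.path for s in subflows if s.stalled}
    seqs = sorted(q for q, p in unacked.items() if p in stalled_paths)
    return (target.path, seqs) if seqs else None
```

With segments 1, 3, 5, 7 sent over one subflow and 2, 4, 6 over another, SACKs for 2, 4 and 6 alone leave the first subflow untouched, whereas SACKs for 3, 5 and 7 mark segment 1 for fast retransmit.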
This mechanism is named "slow retransmit".

In the case of single path TCP, scheduling retransmissions before the RTO expires could be problematic because this would be more aggressive than standard (New)Reno congestion control. But in the case of multipath TCP, the retransmission can happen over one of the other paths, which is still progressing.

By scheduling a retransmission faster than an RTO, there is an increased risk that a packet that was still working its way through the network is retransmitted unnecessarily. However, the alternative is allowing the progress of the session to stall (on all paths), reducing throughput significantly.

3.4.  SACK

When packets (belonging to different subflows) arrive out of order, the receiver can't acknowledge the receipt of the out of order packets using TCP's normal cumulative acknowledgment. However, the [RFC2018] (also see [RFC1072]) Selective Acknowledgment (SACK) mechanism is widely implemented. SACK makes it possible for a receiver to indicate that three or four additional ranges of data were received in addition to what is acknowledged using a normal cumulative ACK.

When packets are sent over multiple paths and arrive out of order, the information in the SACK returned by the receiver can tell the sender how each subflow is progressing, so per-subflow congestion control can progress smoothly and unnecessary retransmissions are largely avoided.

One-ended multipath TCP requires the use of SACK to be able to determine which subflows are progressing even if other subflows are stalled, and thus the normal TCP ACK isn't progressing. If the remote host doesn't indicate the SACK capability during the three-way handshake, a multipath TCP implementation SHOULD limit itself to using only a single subflow and thus disabling multipath processing for the session in question.

3.5.  Fairness and TCP friendliness

One of the goals of multipath TCP is increased performance over regular TCP. However, it would be harmful to realize this benefit by taking more than a "fair" share of the available bandwidth. One choice would be to execute normal NewReno congestion control on each subflow, so that each individual subflow competes with other TCPs on the same footing as a regular TCP session. If all subflows use non-overlapping physical paths, other TCPs are no worse off than in the situation where the multipath TCP were a regular TCP sharing their path, so this could be considered fair even though the multipath TCP increases its bandwidth in direct relationship to the number of subflows used.

Note that in this case, although multipath TCP sends at the same rate as regular TCP on a given path, resource pooling [wischik08pooling] benefits are still realized because a given transmission completes faster so it uses up resources for a shorter amount of time.

But if several logical paths share a physical path, multipath TCP takes a larger share of the bandwidth on that path. This would only be acceptable as fair for a very small number of subflows. The other end of the spectrum would be for multipath TCP to conform to exactly the same congestion window increase and decrease envelope that a regular TCP exhibits, being no more aggressive than a regular single path TCP session.

At this point in time we will assume that fairness is a tunable factor of the regular NewReno AIMD envelope. A simple way to limit the amount of additional aggressiveness exhibited by multipath TCP is a limit on the number of subflows. Until more analysis has been performed and/or there is more experience with multipath TCP, a multipath TCP implementation SHOULD limit itself to using no more than 3 subflows concurrently.

4.  Path selection

Note that in order to gain multipath benefits, the multipath TCP layer must be able to determine the logical path followed by each packet so it can measure path properties and perform per-path congestion control. In order to limit the number of packets flowing over each path to the amount allowed by the per path congestion window, the multipath TCP layer must be able to specify over which path a given packet is transmitted.

The situation where routers distribute packets over different paths based on their own criteria makes it impossible for hosts to send less traffic over congested paths and more traffic over uncongested paths and is therefore incompatible with multipath TCP. When routers distribute traffic belonging to the same flow (or, in the case of multipath TCP: subflow) over different paths this will also cause reordering and the associated performance impact on TCP.

4.1.  The multipath IP layer

The one-ended multipath TCP is logically layered on a multipath IP layer, which is able to deliver packets to the same destination address through one or more logical paths, where the set of n logical paths share between one and m physical paths.

In some cases, the multipath IP layer will be able to determine that a logical path isn't working, or maps to the same physical path as a previous logical path. For example, if the multipath TCP indicates that a packet should be sent over the third path, and the multipath IP is set up to use different next hop addresses for path selection, but only two next hop addresses are available, the multipath IP layer can provide feedback to the multipath TCP layer. In other cases, packets simply won't be delivered, or will be delivered through the same physical path used by other logical paths.
This may for instance happen when multipath TCP selects path 1 and multipath IP puts a path selector with value "1" in the packet, but there are no multipath capable routers between the source and destination, so all packets, regardless of the presence and/or value of a path selector, are routed over the same physical path. It is up to the multipath TCP layer to handle each of these situations.

For the purposes of this multipath TCP specification, the simplest possible interface to the multipath IP layer is assumed. When TCP segments traveling down the stack from the TCP layer to the IP layer aren't accompanied by a path selector value, or the path selector value is zero, the IP layer delivers packets in the same way as for unmodified TCP and other existing transport protocols, i.e., over the default path. Segments may also be accompanied by a path selector value higher than zero, which indicates the desired path. If the desired logical path is available, or may be available, the multipath IP layer attempts to deliver the packet using that logical path. If the desired logical path is known to be unavailable, the multipath IP layer drops the segment.

It is assumed that paths as seen by the multipath IP layer are mapped to logical paths with increasing numbers roughly ordered in order of decreasing assumed performance or availability. I.e., if path x doesn't work or has low performance, that doesn't necessarily mean that path x+1 doesn't work or has low performance, but if paths x, x+1 and x+2 don't work or have low performance, then it's highly likely that paths x+3 and beyond also don't work or have even lower performance.
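
The simple interface between the TCP layer and the multipath IP layer described above could be sketched as follows. The names `PathState` and `choose_path` are hypothetical, and treating a path with no recorded state as "may be available" is an assumption of this sketch rather than something the memo states.

```python
from enum import Enum
from typing import Dict, Optional

DEFAULT_PATH = 0


class PathState(Enum):
    AVAILABLE = "available"
    UNKNOWN = "may be available"
    UNAVAILABLE = "known to be unavailable"


def choose_path(selector: Optional[int],
                states: Dict[int, PathState]) -> Optional[int]:
    """Return the logical path to deliver over, or None to drop the segment."""
    # No selector, or selector zero: behave like unmodified TCP.
    if selector is None or selector == DEFAULT_PATH:
        return DEFAULT_PATH
    # Assumption: a path without recorded state may still be available.
    state = states.get(selector, PathState.UNKNOWN)
    if state is PathState.UNAVAILABLE:
        return None  # drop rather than fall back to another path
    return selector
```

Dropping instead of substituting another path keeps the TCP layer's per-path measurements meaningful: a selector value never silently maps onto a path that another subflow is already using.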
Routers may have good next hop or even intra-domain link weight information and link congestion information, but they generally don't have information about the end-to-end path properties, so the ordering of paths from high to low availability/performance must be considered little more than a hint.

The multipath IP layer may be implemented through a variety of mechanisms, including but not limited to:

o  Using different outgoing interfaces on the host

o  Directing packets towards different next hop routers

o  Integration with shim6 [I-D.ietf-shim6-proto] so that packets can use different address pairs

o  Manipulation of fields used in ECMP [RFC2992] (i.e., a different flow label)

o  Type of service routing (such as [RFC4915])

o  Different lower layer encapsulation, such as MPLS

o  Tunneling through overlays

o  Source routing

o  An explicit path selector field in packets, acted upon by routers

At this time, no choice is made between these different mechanisms.

4.2.  The path indication option

Note that several of the fields discussed below are defined with future developments in mind; they are not necessarily immediately useful.

In order to allow for accurate RTT measurements and to inform the IP layer of the selected path, a TCP option indicating the desired path is included in all segments that don't use the default path. The format of this option is as follows:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   KIND=TBA    |  LENGTH = 3   |D| MP  |R| SP  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The length is 3.

D is the "discard eligibility" flag (1 bit). It is similar, but not identical, to the frame relay discard eligibility bit or the ATM cell loss priority bit. Set to zero, no special behavior is requested.
Set to one, this indicates that loss of the packet will be inconsequential. This allows routers to drop packets with D=1 more readily than other packets under congested conditions, and also to completely block packets with D=1 on links that are considered long-term congested or expensive, even if there is no momentary congestion. Setting the D bit to 1 for some subflows (presumably, ones with a performance lower than the best performing subflow) allows multipath TCP to give way to regular TCP and other single path traffic on congested or expensive paths. As long as the multipath TCP sets D to 0 on the subflow with the best performance, multipath TCP should still perform better than regular TCP, but the reduction in bandwidth use on the other paths helps achieve resource pooling benefits.

MP is a path selector that may be interpreted by multiple routers along the way (3 bits). A value of 0 is the default path that is also taken by packets that don't contain a multipath option. Multipath TCP aware routers should take this value into account when performing ECMP [RFC2992]. Packets with any value for MP MUST be forwarded, even if the number of available paths is smaller than the value in MP.

R (1 bit) is reserved for future use. MUST be set to zero on transmission and ignored on reception.

SP is a path selector that is interpreted only once by the local TCP stack or a router close to the sender (3 bits). A value of 0 is the default path that is also taken by packets that don't contain a multipath option. If the value in SP points to a path that isn't available, the packet SHOULD be silently dropped. This behavior, as opposed to selecting an alternate path out of the available ones, helps avoid the use of duplicate paths. As such, a router may only interpret SP rather than MP when it is known that the router is the only one acting on SP. All other routers may only act on MP.
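
Encoding and decoding the three-byte form of the option could look roughly like the sketch below. The function and type names are hypothetical. The option kind has not been assigned ("TBA"), so the value 253, one of the TCP option kinds set aside for experiments, is used purely as a placeholder.

```python
from typing import NamedTuple

PATH_OPTION_KIND = 253   # placeholder only: the real kind is still "TBA"
PATH_OPTION_LENGTH = 3


class PathOption(NamedTuple):
    d: int    # discard eligibility flag (1 bit)
    mp: int   # selector interpreted by multiple routers (3 bits)
    sp: int   # selector interpreted once, near the sender (3 bits)


def encode_path_option(d: int, mp: int, sp: int) -> bytes:
    """Build the option bytes; empty when the option is not needed."""
    if d not in (0, 1) or not 0 <= mp <= 7 or not 0 <= sp <= 7:
        raise ValueError("field out of range")
    if d == 0 and mp == 0 and sp == 0:
        return b""  # the option is omitted when D, MP and SP are all zero
    # Bit layout, most significant bit first: D, MP(3), R, SP(3).
    flags = (d << 7) | (mp << 4) | sp   # R is always sent as zero
    return bytes([PATH_OPTION_KIND, PATH_OPTION_LENGTH, flags])


def decode_path_option(data: bytes) -> PathOption:
    """Parse the three-byte option, ignoring the reserved R bit."""
    if (len(data) != PATH_OPTION_LENGTH or data[0] != PATH_OPTION_KIND
            or data[1] != PATH_OPTION_LENGTH):
        raise ValueError("not a three-byte path indication option")
    flags = data[2]
    return PathOption(d=flags >> 7, mp=(flags >> 4) & 0x7, sp=flags & 0x7)
```

For example, D=1, MP=2, SP=5 packs into the single flags byte 0xA5, and a received flags byte of 0xAD (the same value with R set) decodes to the same three fields.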
It is not expected that routers will make routing decisions directly based on the path indication option, as this option occurs deep inside the packet and not in a fixed place. However, a multipath IP layer or a middlebox may write a path selection value into a field in packets that is easily accessible to routers. But conceptually, the routers act upon the values in SP and MP.

The initial packets for each TCP session MUST use D, MP and SP values of zero. If D, MP and SP are all zero, then the path selector option isn't included in the packet. This makes sure that single path operation remains possible even if packets with the path selector option are filtered in the network or rejected by the receiver. The packets that are part of the TCP three-way handshake SHOULD be sent over the default path, in which case they don't contain the path selector option; hence the ability to do multipath TCP isn't indicated to the correspondent at the beginning of the session as is usual for most other TCP extensions.

4.3.  Timestamp integration option

As an optimization, hosts MAY borrow the four bits used by the path selector option from the timestamp option, and thus save one byte of option space, which means the path selector option can replace the padding necessary when the timestamp option is used and not increase header overhead. In that case, the combined path selector and timestamp options MUST appear as follows:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   KIND=TBA    |  LENGTH = 2   |    KIND=8     |  LENGTH = 10  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|D| MP  |                   TS Value (TSval)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     TS Echo Reply (TSecr)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

D and MP are the same as in the three-byte form of the path selector option.
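
The 12-byte combined layout in the diagram above could be packed and unpacked along these lines. The function names are hypothetical, and 253 again stands in for the unassigned option kind. Bit positions follow the diagram: D occupies the top bit of the first timestamp word, MP the next three bits, and the remaining 28 bits carry the timestamp value.

```python
import struct
from typing import Tuple

PATH_OPTION_KIND = 253   # placeholder only: the real kind is still "TBA"
TIMESTAMP_KIND = 8
TSVAL_BITS = 28
TSVAL_MASK = (1 << TSVAL_BITS) - 1


def encode_combined(d: int, mp: int, tsval: int, tsecr: int) -> bytes:
    """Pack the two-byte path selector header plus the timestamp option."""
    if d not in (0, 1) or not 0 <= mp <= 7:
        raise ValueError("field out of range")
    # The timestamp value is truncated to 28 bits to make room for D and MP.
    word = (d << 31) | (mp << 28) | (tsval & TSVAL_MASK)
    return struct.pack("!BBBBII", PATH_OPTION_KIND, 2, TIMESTAMP_KIND, 10,
                       word, tsecr & 0xFFFFFFFF)


def decode_combined(data: bytes) -> Tuple[int, int, int, int]:
    """Return (D, MP, TSval, TSecr) from the 12-byte combined form."""
    kind, length, ts_kind, ts_len, word, tsecr = struct.unpack("!BBBBII", data)
    if (kind, length, ts_kind, ts_len) != (PATH_OPTION_KIND, 2,
                                           TIMESTAMP_KIND, 10):
        raise ValueError("not a combined path selector and timestamp option")
    return word >> 31, (word >> 28) & 0x7, word & TSVAL_MASK, tsecr
```

A receiver that does not understand the first option only sees a two-byte unknown option followed by an ordinary timestamp option, whose value simply happens to carry D and MP in its top four bits.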
R and SP do not occur in this form of the path selector option and are assumed to be zero.

TSval is the locally generated timestamp. Because the timestamp is reduced to 28 bits, the minimum tick interval of the timestamp clock is increased from the 59 nanoseconds mandated by [RFC1323] to 1 microsecond, so that the timestamp wraps in no less than 255 seconds.

TSecr is the timestamp echoed back to the other side (32 bits).

All hosts conforming to this specification MUST be able to recognize the integrated path selector and timestamp options, but they are not required to generate them.

4.4.  Path for retransmissions

A multipath TCP implementation MUST be capable of scheduling retransmissions over a path different from the path used to transmit the packet originally. This includes packets subject to fast retransmit.

4.5.  ECN

Explicit Congestion Notification works by routers setting a congestion indication in the IP header of packets rather than dropping those packets when they experience congestion. The receiver echoes this information back to the sender which then performs congestion control in exactly the same way as if a packet was lost.

The ECN specification ([RFC3168]) is such that the receiver sets the ECN-Echo (ECE) flag in the TCP header for all subsequent packets that it sends back until the sender sets the Congestion Window Reduced (CWR) flag. As the ECE flag is set in multiple ACKs, there is no obvious way to correlate the ECN indication in an ACK with a specific packet that experienced congestion, and subsequently, the path that is congested.

At this time, a multipath TCP conforming to this specification SHOULD NOT use ECN. ECN MAY be negotiated, but when more than a single path is used at a given time, packets SHOULD be sent with the ECN field set to Not-ECN (00), and incoming non-zero ECE flags SHOULD NOT be acted upon with regard to congestion control.

4.6.  Path MTU discovery

Path MTU discovery [RFC1191] is performed for TCP by having TCP reduce its packet sizes whenever "packet too big but DF set" ICMP messages are received. As the name suggests, the path MTU is dependent on the path used, so multipath TCP must maintain MTU information for each path, and adjust this information for each path individually based on the too big messages that it receives.

The time between probing with a larger than previously discovered MTU must either be randomized or explicitly coordinated to avoid probing larger MTUs for multiple subflows at the same time, as probing larger MTUs is likely to lead to a lost packet, and having losses on multiple paths at the same time would be suboptimal. For instance, rather than probe every t, in the case of 2 paths, after t*0.5 the first path is probed, after t the second and after t*1.5 the first is probed again.

Both the IPv4 and IPv6 versions of ICMP return enough of the original packet in a "packet too big" message to be able to recover the sequence number from the original packet, which makes it possible to correlate the too big message with the packet that caused it, and thus the path used to transmit the packet.

5.  Flow control and buffer sizes

In order to accommodate the increased number of packets in flight, the send buffer must be increased in direct relationship with the number of paths being used. Alternatively, the number of paths used concurrently should be limited to send buffer / avgRTT.

Although under normal operation, the receive buffer doesn't fill up, there are two reasons the receive buffer must be the same size as the send buffer: it must be able to accommodate a round trip time plus two segments worth of data during fast retransmit, and the advertised receive window limits the amount of data the sender will transmit before waiting for acknowledgments.
So in practice, the receive buffer limits the maximum size of the send buffer and, therefore, the number of paths that can be supported concurrently. There is no simple rule of thumb to determine the number of paths that should be used, as the maximum number of paths that the receive window can accommodate depends both on the maximum receive window advertised by the receiver and on the RTTs of the paths.

6.  Handling of RSTs

If an RST is received after enabling a new path, this could be a reaction to the presence of an unknown option. So the optimal situation would be for an RST to reset just the path used to send the packet that generated the RST, not the entire session. Only when the last path or the default path (on which packets don't include special options) receives an RST should the entire session be reset.

7.  Middlebox considerations

NATs are designed to be transparent to TCP. Because one-ended multipath TCP conforms to normal TCP semantics on the wire, multipath TCP should in principle also be compatible with NAT. However, if different paths are served by different NATs that apply different translations, the receiver won't be able to determine that the different subflows through the different paths belong to the same TCP session. So for NAT to work, the translation must either happen in a location that all paths flow through, or the different NATs on the different paths must act as a single, distributed NAT and apply the same translation to the different subflows.

Middleboxes that only see traffic flowing over a subset of the paths used will see large numbers of gaps in the sequence number space. They may also observe only a partial three-way handshake, or not observe any ACKs. As such, as with NATs, middleboxes that enforce conformance to known TCP behavior must be placed such that they observe all subflows.
For middleboxes that just check whether packets fall inside the TCP window, it may be sufficient for multipath TCP senders to make sure that all paths see at least one packet per window.

Middleboxes that enforce sequence number integrity will almost certainly also block TCP packets for which they didn't observe the three-way handshake. A possible way to accommodate that behavior would be to send copies of all session establishment and teardown packets over all paths that the sender may use. However, this strategy is still likely to fail unless the receiver does the same, so that the middleboxes can observe the signaling packets flowing in both directions.

It's also possible that middleboxes (or perhaps hosts themselves) reject packets with the path indicator TCP option. Since packets flowing over the default path don't carry the path indicator option, these packets should always be allowed through, so single path operation is always possible. In that situation, when a multipath TCP sender starts to send packets over alternative paths, those packets won't make it to the receiver because they contain the path indicator option. The result is that a new subflow, which would use a congestion window of two maximum segment sizes, would send two packets and then experience a retransmission timeout. Slow retransmit makes sure the packets are transmitted before the session stalls, so the impact of the lost packets is negligible.

8.  Security considerations

None at this time.

9.  IANA considerations

IANA is requested to provide a TCP option kind number for the path indication option.

10.  Acknowledgements

The single ended multipath TCP was developed together with Marcelo Bagnulo and Arturo Azcorra. Members of the Trilogy project, especially Costin Raiciu, have contributed valuable insights.
Iljitsch van Beijnum is supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only. The European Commission is not liable for any use that may be made of the information in this document.

11.  References

11.1.  Normative References

[RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.

[RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions for High Performance", RFC 1323, May 1992.

[RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", RFC 2992, November 2000.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

11.2.  Informational References

[RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay paths", RFC 1072, October 1988.

[RFC1122]  Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[RFC4915]  Psenak, P., Mirtorabi, S., Roy, A., Nguyen, L., and P. Pillay-Esnault, "Multi-Topology (MT) Routing in OSPF", RFC 4915, June 2007.

[wischik08pooling]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The resource pooling principle", Computer Communication Review 38, September 2008.

[I-D.ietf-shim6-proto]  Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming Shim Protocol for IPv6", draft-ietf-shim6-proto-11 (work in progress), December 2008.

Appendix A.  Document and discussion information

The latest version of this document will always be available at http://www.muada.com/drafts/. Please direct questions and comments to the multipathtcp@ietf.org mailing list or directly to the author.

Appendix B.  An implementation strategy

In order to perform per-path congestion control, all of the ACK-based events that trigger congestion control responses, as well as all the variables used by the congestion control algorithms, must be recreated in the multipath situation. These are the triggers and variables for the four mechanisms in RFC 2581:

1.  the path MTU (page 4)

2.  the arrival of an ACK that acknowledges new data (page 4)

3.  the arrival of a non-duplicate ACK (page 4) or the sum of new data acknowledged (page 5)

4.  triggering of the retransmission timer (page 5)

5.  the flightsize or number of bytes sent but not acknowledged (page 5)

6.  the retransmission of a segment (page 5)

7.  the arrival of a third or subsequent duplicate ACK (page 6, page 7)

8.  whether a retransmission timeout period has elapsed since the last reception of an ACK (page 7)

Items 1, 4, 6 and 8 are maintained session-wide.

We recreate these events and variables based on SACK information in the one-sequence number multipath TCP case as follows. We keep track of every packet sent. (Alternatively: multi-packet contiguous blocks of data transmitted over the same path.)
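As an illustration only, and not part of this specification, the per-packet bookkeeping on the transmit side might be sketched in Python as follows. The class and function names (PathState, SentPacket, record_transmission) are hypothetical, a sorted Python list stands in for the linked list of sent packets, and sequence number wraparound is ignored for brevity.

```python
from bisect import insort
from dataclasses import dataclass, field


@dataclass
class PathState:
    """Hypothetical per-path state; the field names are illustrative."""
    name: str
    flightsize: int = 0                  # bytes sent on this path, not yet (S)ACKed
    do_per_rtt_next_packet: bool = True  # mark the next packet for per-RTT actions
    per_rtt_seqnum: int = -1             # seqnum of the currently marked packet


@dataclass(order=True)
class SentPacket:
    """One record per transmitted packet, ordered by first sequence number."""
    seq_first: int
    seq_last_plus_one: int = field(compare=False)
    path: PathState = field(compare=False)
    per_rtt: bool = field(default=False, compare=False)

    @property
    def datasize(self) -> int:
        return self.seq_last_plus_one - self.seq_first


def record_transmission(sent, path, seq_first, datasize):
    """Remember a packet sent over `path` and update that path's flightsize."""
    pkt = SentPacket(seq_first, seq_first + datasize, path)
    path.flightsize += datasize
    if path.do_per_rtt_next_packet:
        # This packet triggers the once-per-RTT actions when it is ACKed.
        path.per_rtt_seqnum = seq_first
        pkt.per_rtt = True
        path.do_per_rtt_next_packet = False
    # Keep the list in sequence number order, so that retransmissions
    # land in the right place rather than at the end.
    insort(sent, pkt)
    return pkt
```

Keeping the records in sequence number order, rather than in transmission order, is what later allows the sender to walk the list once per ACK and see which paths have gaps.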
When an ACK comes in, we first remove the stored information about packets/data blocks that are cumulatively ACKed, noting how much data was ACKed for each path that the packets were sent over. Then we do the same for all the SACK blocks in the ACK. Because we remove the information about (S)ACKed data, and something can be removed only once, we don't have to keep track of previous SACKs like the current BSD implementation does.

The only slightly tricky part is emulating duplicate ACKs. This may not even be really necessary, as the SACKs give us better information to base fast retransmit on, but that's something for another day. What happens in the pseudo code is that when traversing the list of sent packets (this happens in order of seqnum), we note the path over which packets that aren't SACKed were sent. When we're done processing SACK data and it turns out that for a path there are one or more packets that we skipped over when processing SACK data, and there was also data SACKed after a skipped packet, there was a lost (or reordered) packet on this path. When the amount of "duplicate ACKed" data grows beyond two segment sizes, we've reached the equivalent of three duplicate ACKs, so we trigger fast retransmit (7).

We update the congestion window (2 and 3) when there was data (S)ACKed for a path. ACKs that don't acknowledge any data for a path aren't relevant because we don't need them to trigger fast retransmit and we assume that they're sent to (S)ACK data for other paths, anyway. (Or they could be window updates.)

We maintain the flightsize (5) by simply adding data bytes as packets are transmitted and subtracting when they're (S)ACKed. Because we have explicit SACKs, we don't need to guess based on duplicate ACKs. The flightsize is also adjusted when we perform fast retransmit or a regular retransmission over a path other than the one used for the original packet.

In addition, we explicitly mark some packets to trigger once-per-RTT actions when they're ACKed.
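As an illustration only, and not part of this specification, the per-path accounting for a single incoming ACK might be sketched in Python as follows. The names Pkt and process_ack are hypothetical. The sketch assumes that each packet is either completely SACKed or not SACKed at all, ignores sequence number wraparound, ECN and the per-RTT marking, and uses the size of the confirmed SACK hole compared with two segment sizes as the fast retransmit trigger.

```python
from dataclasses import dataclass


@dataclass
class Pkt:
    first: int          # first sequence number in the packet
    last_plus_one: int  # one past the last sequence number
    path: str           # hypothetical label of the path the packet was sent on


def process_ack(sent, cum_ack, sack_blocks, mss):
    """Process one ACK against `sent`, a list of Pkt in seqnum order.

    Returns (remaining, acked, fast_retransmit):
      remaining        packets still unacknowledged, in seqnum order
      acked            dict mapping path to bytes newly (S)ACKed
      fast_retransmit  set of paths whose confirmed SACK hole exceeds 2 * mss
    """
    acked, maybe, sure = {}, {}, {}
    remaining = []
    for p in sent:
        if p.last_plus_one <= cum_ack:
            # Completely covered by the cumulative ACK.
            acked[p.path] = acked.get(p.path, 0) + (p.last_plus_one - p.first)
            continue
        if p.first < cum_ack:
            # Partially ACKed: credit the ACKed part, keep the rest.
            acked[p.path] = acked.get(p.path, 0) + (cum_ack - p.first)
            p = Pkt(cum_ack, p.last_plus_one, p.path)
        size = p.last_plus_one - p.first
        if any(lo <= p.first and p.last_plus_one <= hi for lo, hi in sack_blocks):
            acked[p.path] = acked.get(p.path, 0) + size
            # Data sent later on the same path arrived, so bytes skipped
            # earlier on this path are now a confirmed hole.
            sure[p.path] = sure.get(p.path, 0) + maybe.pop(p.path, 0)
        else:
            # Not SACKed (yet): only a hole if later data on this path is SACKed.
            maybe[p.path] = maybe.get(p.path, 0) + size
            remaining.append(p)
    fast_retransmit = {path for path, hole in sure.items() if hole > 2 * mss}
    return remaining, acked, fast_retransmit
```

A caller would subtract each acked[path] value from that path's flightsize, start fast retransmit for the paths in fast_retransmit, and apply the normal congestion window update for the other paths that had data (S)ACKed. Because holes are counted per path, gaps in the sequence space caused by one slow or lossy path do not trigger fast retransmit on the others.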
Pseudo code for the above:

   // initializing data structures is left as an exercise for the
   // reader

   // transmitting packets

   // assume we've selected a path to transmit over
   path.flightsize = path.flightsize + packet.datasize
   packet.path = path
   packet.status.acked = false

   // set up state to remember to do per RTT stuff when packet is
   // ACKed
   if path.do_per_rtt_next_packet == true
     path.per_rtt_seqnum = packet.seqnum.first
     packet.per_rtt = true
     path.do_per_rtt_next_packet = false
   else
     packet.per_rtt = false

   // don't set ECN on outgoing packets for now, can add logic
   // for deciding which packets to ECN enable later
   packet.ecn.sent = 0

   // add to linked list of sent packets (to handle retrans-
   // missions, linked list must maintain seqnum order, not FIFO
   // or LIFO)
   llpush(packet)

   // receiving (S)ACKs

   // normal flow-wide flow control actions based on cumACK
   // also happen (elsewhere)

   // handle ECN, must detect transitions rather than
   // depend on actual value
   if packet.ecnecho == true
     if ecn.previous == true
       ecn.current = false
     else
       ecn.current = true
       ecn.previous = true
   else
     // no echo in this ACK, so no transition either
     ecn.current = false
     ecn.previous = false

   // initialize some stuff before we handle the ACK
   foreach path
     path.do_per_rtt = false
     path.ackedbytes = 0
     path.unacked.sure = 0
     path.unacked.maybe = 0
     path.ecn.received = false

   // remove cumulatively ACKed packets
   llwalk_init
   packet = llwalk_next
   while packet.seqnum.first < ack.cumulative
     // all per-path state below belongs to the path this packet
     // was sent over
     path = packet.path
     // ECN, we only act if we enabled ECN when we sent the packet
     if ecn.current & packet.ecn.sent <> 0
       path.ecn.received = true
     // if part of a packet is ACKed, we need some trickery
     if packet.seqnum.last_plus_one > ack.cumulative
       path.ackedbytes += ack.cumulative - packet.seqnum.first
       packet.datasize -= ack.cumulative - packet.seqnum.first
       packet.seqnum.first = ack.cumulative
     else
       path.ackedbytes = path.ackedbytes + packet.datasize
       if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
         path.do_per_rtt = true
       llremove(packet)
     packet = llwalk_next

   // now we handle the SACKs (assume exactly one SACK block for
   // simplicity); we continue walking the linked list, no need to
   // restart
   while packet.seqnum.first < ack.sack.last_plus_one
     path = packet.path
     if packet.seqnum.last_plus_one > ack.sack.first
       // these packets overlap with the SACK block
       // for simplicity, assume packets are always completely
       // SACKed; in reality we need to split a packet if only the
       // middle is SACKed

       // ECN, we only act if we enabled ECN when we sent the packet
       if ecn.current & packet.ecn.sent <> 0
         path.ecn.received = true
       path.ackedbytes = path.ackedbytes + packet.datasize
       if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
         path.do_per_rtt = true
       // if there are any potentially unacked bytes, add them to
       // the for sure unacked bytes because we now know we had a
       // SACK hole
       path.unacked.sure = path.unacked.sure + path.unacked.maybe
       path.unacked.maybe = 0
       // remove packet from the list
       llremove(packet)
     else
       // note how many bytes we skipped unSACKed
       // if later data is SACKed, that's our version of a dup ACK
       path.unacked.maybe = path.unacked.maybe + packet.datasize
     packet = llwalk_next

   // done processing, now tally up the results
   foreach path
     // update flightsize (item 5 in CC events/variables list)
     path.flightsize = path.flightsize - path.ackedbytes
     // if any data was ACKed
     if path.ackedbytes <> 0
       // some stuff was ACKed for this path
       if path.unacked.sure > 2 * path.mss
         // more than 2 * MSS worth of data in SACK hole = fast
         // retransmit
         // execute fast retransmit (item 7 in CC events/variables
         // list)
         // need to handle flightsize in some way here
         // ignore ECN because we already have a loss
         // send back ECN window update indication, though
       else
         // SACKs were cumulative for this path
         // execute cwnd update (items 2 and 3 in CC events/
         // variables list)
         // ECN must be taken into account here
         // and send back ECN window update indication
       if path.do_per_rtt
         // execute per RTT actions
         // indicate that this should be set for next packet sent
         path.do_per_rtt_next_packet = true

Note that the pseudo-code doesn't cover all the mechanisms explained earlier. Also, ECN is handled here because it's not too difficult to do. The hard part is deciding which packets to enable ECN for.

Author's Address

   Iljitsch van Beijnum
   IMDEA Networks
   Avda. del Mar Mediterraneo, 22
   Leganes, Madrid  28918
   Spain

   Email: iljitsch@muada.com