| <?xml version="1.0" encoding="US-ASCII"?> |
| <!DOCTYPE rfc SYSTEM "rfc2629.dtd"> |
| |
| <rfc category="info" docName="sp-tcp-mapping-01"> |
| |
| <front> |
| |
| <title abbrev="TCP mapping for SPs"> |
| TCP Mapping for Scalability Protocols |
| </title> |
| |
| <author fullname="Martin Sustrik" initials="M." role="editor" |
| surname="Sustrik"> |
| <organization>GoPivotal Inc.</organization> |
| <address> |
| <email>msustrik@gopivotal.com</email> |
| </address> |
| </author> |
| |
| <date month="August" year="2013" /> |
| |
| <area>Applications</area> |
| <workgroup>Internet Engineering Task Force</workgroup> |
| |
| <keyword>TCP</keyword> |
| <keyword>SP</keyword> |
| |
| <abstract> |
| <t>This document defines the TCP mapping for scalability protocols. |
| The main purpose of the mapping is to turn the stream of bytes |
| into a stream of messages. Additionally, the mapping performs some |
| extra checks during the connection establishment phase.</t> |
| </abstract> |
| |
| </front> |
| |
| <middle> |
| |
| <section title = "Underlying protocol"> |
| |
| <t>This mapping should be layered directly on top of TCP or, |
| alternatively, on top of ETSN (which itself is a thin layer |
| on top of TCP).</t> |
| |
| <t>In the former case there is no fixed TCP port to use for the |
| communication. Instead, port numbers are assigned to individual |
| services by the user. In the latter case the communication happens |
| on the TCP port assigned to ETSN by IANA, and the user identifies |
| individual services using ETSN service names.</t> |
| |
| </section> |
| |
| <section title = "Connection initiation"> |
| |
| <t>As soon as the underlying connection, whether TCP or ETSN, is |
| established, both parties MUST send the protocol header (described in |
| detail below) immediately. Both endpoints MUST then wait for the |
| protocol header from the peer before proceeding.</t> |
| |
| <t>The goal of this design is to keep connection establishment as |
| fast as possible by avoiding any additional protocol handshakes, |
| i.e. network round-trips. Specifically, the protocol headers |
| can be bundled directly with the last packets of the TCP handshake |
| and thus have virtually zero performance impact.</t> |
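| |
| <t>The exchange described above can be sketched in Python as follows. |
| This is only an illustration under stated assumptions, not part of the |
| mapping: the names recv_exact and exchange_headers are hypothetical, the |
| socket is assumed to be an already connected blocking stream socket, and |
| the header is treated as an opaque 8-byte value whose layout is given |
| later in this section.</t> |
| |
| <figure> |
| <artwork> |
```python
import socket

HEADER_SIZE = 8  # length of the protocol header defined by this mapping


def recv_exact(sock, count):
    """Read exactly `count` bytes or raise if the peer closes early."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before sending full header")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def exchange_headers(sock, local_header):
    """Send the local header immediately, then wait for the peer's header."""
    if len(local_header) != HEADER_SIZE:
        raise ValueError("protocol header must be 8 bytes long")
    sock.sendall(local_header)            # sent without waiting for the peer
    return recv_exact(sock, HEADER_SIZE)  # blocks until the peer's header arrives
```
| </artwork> |
| </figure> |
| |
| <t>Because each side sends before it reads, neither endpoint waits for |
| the other's header before transmitting its own, so no extra round-trip |
| is introduced by this step.</t> |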
| |
| <t>The protocol header is 8 bytes long and looks like this:</t> |
| |
| <figure> |
| <artwork> |
| +------+------+------+--------------+------------+----------------+ |
| | 0x00 | 0x53 | 0x50 | version (8b) | type (16b) | reserved (16b) | |
| +------+------+------+--------------+------------+----------------+ |
| </artwork> |
| </figure> |
| |
| <t>The first four bytes of the protocol header are used to make sure that |
| the peer's protocol is compatible with the protocol used by the local |
| endpoint. Keep in mind that this protocol is designed to run on an |
| arbitrary TCP port, so the standard compatibility check -- if it runs |
| on port X and protocol Y is assigned to X by IANA, it speaks protocol Y |
| -- does not apply. We have to use an alternative mechanism.</t> |
| |
| <t>The first four bytes of the protocol header MUST be set to 0x00, 0x53, |
| 0x50 and 0x01 respectively. If the first four bytes of the protocol header |
| received from the peer differ, the TCP connection MUST be closed immediately.</t> |
| |
| <t>The fact that the first byte of the protocol header is binary zero |
| eliminates any text-based protocols that were accidentally connected |
| to the endpoint. The subsequent two bytes make the check even more |
| rigorous. At the same time they can be used as a debugging hint to |
| indicate that the connection is supposed to use one of the scalability |
| protocols -- the ASCII representation of these bytes is 'SP', which can |
| be easily spotted when capturing the network traffic. Finally, |
| the fourth byte rules out any incompatible versions of this |
| protocol.</t> |
| |
| <t>The fifth and sixth bytes of the header form a 16-bit unsigned integer in |
| network byte order representing the type of SP endpoint on the layer |
| above. The value SHOULD NOT be interpreted by the mapping; rather, |
| the interpretation should be delegated to the scalability protocol |
| above the mapping. For informational purposes, it should be noted that |
| the field encodes information such as the SP protocol ID, protocol version |
| and the role of the endpoint within the protocol. Individual values are |
| assigned by IANA.</t> |
| |
| <t>Finally, the last two bytes of the protocol header are reserved for |
| future use and must be set to binary zeroes. If the protocol header |
| from the peer contains anything other than zeroes in this field, the |
| implementation MUST close the underlying TCP connection.</t> |
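| |
| <t>As an illustration of the header layout and the checks described |
| in this section, the following Python sketch builds a header and validates |
| one received from the peer. The function names build_header and |
| parse_header are hypothetical, and raising ValueError stands in for |
| closing the TCP connection; an actual implementation may structure this |
| differently. The endpoint type is passed through without interpretation.</t> |
| |
| <figure> |
| <artwork> |
```python
import struct

# Layout: 0x00, 'S', 'P', version, 16-bit type, 16-bit reserved.
# "!" selects network byte order for the multi-byte fields.
HEADER_FORMAT = "!4sHH"
SIGNATURE = bytes([0x00, 0x53, 0x50, 0x01])


def build_header(sp_type):
    """Return the 8-byte header for the given 16-bit endpoint type."""
    return struct.pack(HEADER_FORMAT, SIGNATURE, sp_type, 0)


def parse_header(data):
    """Validate a peer header and return its endpoint type.

    Raises ValueError in the cases where the mapping requires the
    connection to be closed.
    """
    if len(data) != struct.calcsize(HEADER_FORMAT):
        raise ValueError("header must be exactly 8 bytes")
    signature, sp_type, reserved = struct.unpack(HEADER_FORMAT, data)
    if signature != SIGNATURE:
        raise ValueError("incompatible protocol or version")
    if reserved != 0:
        raise ValueError("reserved field must be zero")
    return sp_type  # left for the scalability protocol above to interpret
```
| </artwork> |
| </figure> |
| |
| <t>With this sketch, the first bytes of a text-based protocol such as |
| "GET / HT" fail the signature comparison on the very first byte, which |
| is the effect the leading binary zero is meant to achieve.</t> |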
| |
| </section> |
| |
| <section title = "Message delimitation"> |
| |
| <t>Once the protocol header is accepted, the endpoint can send and receive |
| messages. A message is an arbitrarily large chunk of binary data. Every |
| message starts with a 64-bit unsigned integer in network byte order |
| representing the size, in bytes, of the remaining part of the message. |
| Thus, the message payload can be from 0 to 2^64-1 bytes long. |
| The payload of the specified size follows directly after the size |
| field:</t> |
| |
| <figure> |
| <artwork> |
| +------------+-----------------+ |
| | size (64b) | payload | |
| +------------+-----------------+ |
| </artwork> |
| </figure> |
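| |
| <t>The framing above can be sketched in Python as follows. The names |
| encode_message and decode_messages are hypothetical and the sketch works |
| on in-memory byte strings only; it is meant to show how the size field |
| delimits messages within the byte stream, not to prescribe an |
| implementation.</t> |
| |
| <figure> |
| <artwork> |
```python
import struct

SIZE_FORMAT = "!Q"  # 64-bit unsigned integer, network byte order
SIZE_LENGTH = struct.calcsize(SIZE_FORMAT)  # 8 bytes


def encode_message(payload):
    """Prefix the payload with its length as a 64-bit size field."""
    return struct.pack(SIZE_FORMAT, len(payload)) + payload


def decode_messages(buffer):
    """Split received bytes into complete messages.

    Returns (messages, leftover); leftover holds the bytes of a message
    that has not fully arrived yet.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= SIZE_LENGTH:
        (size,) = struct.unpack_from(SIZE_FORMAT, buffer, offset)
        end = offset + SIZE_LENGTH + size
        if end > len(buffer):
            break  # payload incomplete; wait for more bytes
        messages.append(buffer[offset + SIZE_LENGTH:end])
        offset = end
    return messages, buffer[offset:]
```
| </artwork> |
| </figure> |
| |
| <t>A receiver would typically append newly received bytes to the leftover |
| and call the decoder again. The sketch places no upper bound on the size |
| it accepts; a real receiver would probably want one.</t> |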
| |
| <t>It may seem that a 64-bit message size is excessive and consumes too much |
| valuable bandwidth, especially given that most scenarios call for |
| relatively small messages, on the order of bytes or kilobytes.</t> |
| |
| <t>A variable-length size field may seem like a better solution. However, our |
| experience is that a variable-length size field doesn't provide any |
| performance benefit in the real world.</t> |
| |
| <t>For large messages, the 64 bits used by the field form a negligible portion |
| of the message and the performance impact is not even measurable.</t> |
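| |
| <t>To put the share of the size field in numbers, the following Python |
| sketch computes the fraction of bytes on the wire that the 8-byte field |
| occupies for a few payload sizes. The payload sizes are arbitrary examples |
| and the calculation deliberately ignores TCP and IP header overhead, |
| which would make the field's share smaller still.</t> |
| |
| <figure> |
| <artwork> |
```python
SIZE_FIELD_BYTES = 8  # the 64-bit size field


def size_field_overhead(payload_bytes):
    """Fraction of the message bytes taken up by the size field."""
    return SIZE_FIELD_BYTES / (SIZE_FIELD_BYTES + payload_bytes)


# Example payload sizes: 8 bytes, 1 KiB, 1 MiB and 1 GiB.
for payload in (8, 1024, 1024 ** 2, 1024 ** 3):
    share = size_field_overhead(payload)
    print(f"{payload:>12} B payload: {share:.6%} of the message bytes")
```
| </artwork> |
| </figure> |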
| |
| <t>For small messages, the overall throughput is heavily CPU-bound, never |
| I/O-bound. In other words, the CPU processing associated with each |
| individual message limits the message rate in such a way that the network |
| bandwidth limit is never reached. In the future we expect this to be |
| even more so: network bandwidth is going to grow faster than CPU speed. |
| All in all, some performance improvement could be achieved using |
| a variable-length size field with huge streams of very small messages |
| on very slow networks. We consider that scenario to be a corner case |
| that's almost never seen in the real world.</t> |
| |
| <t>On the other hand, it may be argued that limiting the messages to |
| 2^64-1 bytes can prove insufficient in the future. However, |
| extrapolating the message size growth seen in the past indicates |
| that a 64-bit size should be sufficient for the expected lifetime of |
| the protocol (30-50 years).</t> |
| |
| <t>Finally, it may be argued that chaining an arbitrary number of smaller |
| data chunks can yield unlimited message size. The downside of this |
| approach is that the message payload cannot be contiguous on the wire; |
| it has to be interleaved with chunk headers. That typically requires |
| one more copy of the data in the receiving part of the stack, which |
| may be a problem for very large messages.</t> |
| |
| </section> |
| |
| <section title = "Note on multiplexing"> |
| |
| <t>Several modern general-purpose protocols built on top of TCP provide |
| multiplexing capability, i.e. a way to transfer multiple independent |
| message streams over a single TCP connection. This mapping deliberately |
| opts to provide no such functionality. Instead, independent message |
| streams should be implemented as different TCP connections. This |
| section provides the rationale for the design decision.</t> |
| |
| <t>First of all, multiplexing is typically added to protocols to avoid |
| the overhead of establishing additional TCP connections. This need |
| arises in environments where the TCP connections are extremely |
| short-lived, often used only for a single handshake between the peers. |
| Scalability protocols, on the other hand, require long-lived |
| connections, which makes the feature unnecessary.</t> |
| |
| <t>At the same time, multiplexing on top of TCP, while doable, is inferior |
| to the real multiplexing done using multiple TCP connections. |
| Specifically, TCP's head-of-line blocking means that a single |
| lost TCP packet will hinder delivery for all the streams on top of |
| the connection, not just the one the missing packet belonged to.</t> |
| |
| <t>Moreover, implementing multiplexing is a non-trivial matter |
| and results in increased development cost, more bugs and a larger |
| attack surface.</t> |
| |
| <t>Finally, for multiplexing to work properly, large messages have to be |
| split into smaller data chunks interleaved by chunk headers, which |
| makes the receiving stack less efficient, as already discussed above.</t> |
| |
| </section> |
| |
| <section anchor="IANA" title="IANA Considerations"> |
| <t>This memo includes no request to IANA.</t> |
| </section> |
| |
| <section anchor="Security" title="Security Considerations"> |
| <t>The mapping isn't intended to provide any security beyond |
| what TCP does. DoS concerns are addressed within |
| the specification.</t> |
| </section> |
| |
| </middle> |
| |
| </rfc> |
| |