OpenOnload NIC / InfiniBand RDMA
From the API perspective, take a look at OFED's rdma_cm library. You can also run existing TCP/IP programs over InfiniBand using IPoIB and SDP, but you do sacrifice some performance.
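To give a flavour of the rdma_cm API, here is a minimal sketch of the client-side connection setup with librdmacm; the peer address, port and queue sizes are hypothetical and error handling is stripped out.

    /* Minimal librdmacm client-side connection sketch (hypothetical peer address
     * and port; error handling trimmed). Build with -lrdmacm -libverbs. */
    #include <rdma/rdma_cma.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    static void wait_event(struct rdma_event_channel *ch, enum rdma_cm_event_type expected)
    {
        struct rdma_cm_event *ev;
        rdma_get_cm_event(ch, &ev);                   /* blocks for the next CM event */
        if (ev->event != expected)
            fprintf(stderr, "unexpected event %d\n", ev->event);
        rdma_ack_cm_event(ev);
    }

    int main(void)
    {
        struct rdma_event_channel *ch = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(7471) };
        inet_pton(AF_INET, "192.168.1.10", &dst.sin_addr);     /* hypothetical peer */

        rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);            /* reliable-connection port space */

        rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000);
        wait_event(ch, RDMA_CM_EVENT_ADDR_RESOLVED);

        rdma_resolve_route(id, 2000);                          /* path resolution */
        wait_event(ch, RDMA_CM_EVENT_ROUTE_RESOLVED);

        struct ibv_qp_init_attr qp_attr = {
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        rdma_create_qp(id, NULL, &qp_attr);                    /* CQs and PD allocated by rdma_cm */

        struct rdma_conn_param param = { .responder_resources = 1, .initiator_depth = 1,
                                         .retry_count = 7, .rnr_retry_count = 7 };
        rdma_connect(id, &param);
        wait_event(ch, RDMA_CM_EVENT_ESTABLISHED);

        /* ... post verbs work requests on id->qp here ... */

        rdma_disconnect(id);
        rdma_destroy_qp(id);
        rdma_destroy_id(id);
        rdma_destroy_event_channel(ch);
        return 0;
    }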
I've seen a big upsurge in InfiniBand activity in Trading environments over the last 18 months. There is a lot going on, but few are talking about it. I've helped both arbitrage houses and Exchanges to deploy it recently.
----------------------------------------
InfiniBand does not sacrifice reliability or in-order delivery; it has simply moved them into the hardware, so you get all the reliability benefits of TCP without paying the latency or performance cost.
A TCP/IP kernel stack will copy data at least three times before it is sent out of the NIC, and those are the optimized ones; I've seen some which copied it up to six times.
FCoE also needs this reliability and in-order delivery, hence all the activity around Ethernet DCB and Converged Ethernet. The whole advantage of FCoE over iSCSI is that it does not require the overhead of TCP. Whilst this is bearable on the server, it has made enterprise-class iSCSI storage arrays very difficult to implement due to the concentration of TCP overheads they need to support.
------------------------------------------
Sendfile is only a means to mmap a file and then copy from its kernel buffer; it removes the kernel-user-kernel copy which occurs if you read a file and write it to a socket. This is a specific use-case optimization, and it still suffers the IP stack to Ethernet driver copies.
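To make the copy paths concrete, here is a rough sketch of the two approaches; it assumes an open file descriptor and a connected TCP socket, with error handling omitted.

    /* Illustrative comparison of the two copy paths described above
     * (hypothetical descriptors; error handling trimmed). */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Classic path: file -> page cache -> user buffer -> socket buffer. */
    static void copy_via_user_space(int file_fd, int sock_fd)
    {
        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(file_fd, buf, sizeof buf)) > 0)   /* kernel -> user copy */
            write(sock_fd, buf, (size_t)n);                /* user -> kernel copy */
    }

    /* sendfile path: the user-space bounce buffer disappears, but the data
     * still traverses the IP stack and the Ethernet driver on its way out. */
    static void copy_via_sendfile(int file_fd, int sock_fd)
    {
        struct stat st;
        off_t offset = 0;
        fstat(file_fd, &st);
        while (offset < st.st_size)
            sendfile(sock_fd, file_fd, &offset, (size_t)(st.st_size - offset));
    }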
I agree that TCP sockets are definitely a more familiar programming model, but if you want the lowest latency then you have to go to OFED VERBS. If you are not that brave, then try OFED Reliable Datagram Sockets (RDS). These provide the familiarity of sockets whilst exploiting the L2 reliability; Oracle has based its RAC clustering on RDS.
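For anyone wanting to try RDS, the programming model really is just sockets. Below is a minimal sketch of a sender, assuming the rds kernel module is loaded and AF_RDS is available in the system headers; the addresses and port are hypothetical and error handling is trimmed.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);        /* reliable datagram socket */

        /* RDS sockets must be bound to a local IP/port before any data moves. */
        struct sockaddr_in local = { .sin_family = AF_INET, .sin_port = htons(18634) };
        inet_pton(AF_INET, "192.168.1.20", &local.sin_addr);
        bind(fd, (struct sockaddr *)&local, sizeof local);

        /* Each datagram is delivered reliably, in order, to the named peer. */
        struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(18634) };
        inet_pton(AF_INET, "192.168.1.21", &peer.sin_addr);

        const char msg[] = "tick";
        sendto(fd, msg, sizeof msg, 0, (struct sockaddr *)&peer, sizeof peer);
        return 0;
    }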
FCoE does need link-level reliability. FCoE is simply SCSI commands carried over L2 DCB Ethernet; SCSI error recovery is very expensive and it expects in-order delivery of data. Packet drops are avoided by correct use of the Ethernet pause feature. Whilst it is good that this has been extended to multiple channels, there is still no standard means defined to manage them. This is where the maturity of InfiniBand has benefits. It has a very extensive management model which is realised with its subnet manager. This allows the definition of a sophisticated QoS model to mix different applications using its 15 Service Levels and 15 Virtual Lanes, and registered applications can query the SM and discover their QoS settings. Managing an L2 network is a realistic proposition with InfiniBand. Even the vendors aren't there yet for Converged Ethernet, although I'm sure they will get there eventually.
I do take my hat off, though, to Solarflare; its OpenOnload is definitely the preferred option for TCP over 10G Ethernet. I would still like to see it ported to a wider range of hardware and incorporated directly into the Linux kernel.
-------------------------------------------------
Whilst many of these custom TCP modules no doubt work, they typically require both ends to run the same modifications, which flies in the face of the platform independence of TCP.
I carried out tests bypassing TCP completely and using the OFED RDMA options. On a 10 Gb/s long-haul line we were able to sustain wire speed for a single-threaded transfer across a 540 km (6.3 ms) link once the RDMA transfer size reached 640K (at the default 64K it achieved 55% of the bandwidth).
We were able to sustain similar throughput across TCP by setting the window size to 640K, as long as there were no congestion events. As soon as these happen, the throughput profile drops into the well-known sawtooth caused by the standard TCP Reno congestion management that the custom modules seek to address.
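For reference, the 640K window in that TCP test was simply a matter of sizing the socket buffers, roughly as below (hypothetical socket; the kernel's net.core.rmem_max/wmem_max limits also have to allow it).

    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int window = 640 * 1024;                      /* bytes, per the tests above */

        /* Larger send/receive buffers let the window cover more of the
         * bandwidth-delay product of the long-haul link. */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &window, sizeof window);
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &window, sizeof window);
        return 0;
    }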
The OFED kernel modules are now fully integrated into the Linux kernel, are also available for Windows, and offer full interoperability with the leading UNIX distros.
I believe it is safer to stick to these standards.
Achieving low latency is, however, a different requirement from the above. There we do not use RDMA for messages under 64K due to the latency cost of setting up the transfer; instead, TCP bypass and zero-copy transfers direct from user memory to the I/O card provide the optimal results. This was achieved running the standard OFED stack across long-haul InfiniBand.
OFED works with both InfiniBand and the leading RDMA-enabled 10GbE cards. InfiniBand still offers lower serialization overhead due to its higher speed and, remarkably, is still cheaper to deploy than 10GbE. Importantly, InfiniBand has a full layer-two management stack for routing, trunking and path selection, whereas with Ethernet you have to resort to the IP layer and therefore the router overhead. With long-haul InfiniBand options now well proven, there is no need to add the latency cost of routers to these links.
-------------------------------------------------
InfiniBand has its own network and transport protocols, and because of the reliable delivery it is not dependent on TCP to correct errors. It also has RDMA baked into the standard, so it all works and interoperates between vendors; I've been running mixed-vendor environments for years and have yet to encounter an interworking problem. This is the biggest concern I have about the OpenOnload work: it needs to be widely adopted by a broad range of Ethernet suppliers so we don't have to worry about configuring matched pairs to deploy it.
I believe OFED is the most promising option today, with support for both InfiniBand and many RDMA-enabled Ethernet cards. This enables you to hedge the physical network layer. Today, there is no doubt that InfiniBand is ahead of 10GbE in performance, cost, scalability, maturity and interoperability. However, the Ethernet community has always been creative and not shy about taking good ideas from elsewhere and re-using them, so I expect that by the time 40G Ethernet becomes deployable inside the data center it will have many of the same capabilities.
In the meantime, you can hedge further and deploy the Mellanox VPI InfiniBand cards, whose ports are software-configurable between InfiniBand and Ethernet.
OFED does include IPoIB for interoperability, and SDP can be pre-loaded to enable existing TCP-based applications (i.e. not multicast) to bypass TCP. However, the lowest latency requires you to go to the OFED VERB level, and this is complex. We are starting to see some of the middleware vendors do this, such as IBM's WebSphere MQ Low Latency Messaging, which gives you the option to buy rather than build your own solution.
The Java community is also working on RDMA. The uSTREAM project enables you to utilize OFED VERBS through a Java interface rather than having to program them directly (the VERB APIs are in C).
-----------------------------------------------------------------------
SDP is primarily about increasing throughput (18 Gb/s vs. 7 Gb/s on the same server) and offloading the CPU. For small transfers (under 16K) we found IPoIB provided lower latency. For transfers above a configurable size (we set ours to 32K) SDP will use a bzcopy straight from pinned user memory, avoiding the normal copy through the kernel socket buffers.
We measured this on Intel Woodcrest servers, so we would hope for an improvement on Nehalem. It would be good to understand what has been done to improve single-session throughput above the levels we saw. There is a tradeoff in some of the tuning parameters: we optimize for low latency rather than throughput, which may explain the lower throughput we've experienced.
I'm not personally convinced RDMA is the right approach for low latency. It's really useful for large amounts of data, but the majority of messages we are handling are less than 1500 bytes and fit comfortably in a single packet, so it's quicker just to SEND them than to set up the RDMA transfer. There are solutions such as Voltaire's Messaging Service (available for Wombat) which are RDMA-based. It is an interesting solution in which clients pull the messages to themselves using RDMA; this lowers the load on the distribution server and provides greater scalability, but I'm not convinced alternatives could not lower latency further.
Has anyone considered Reliable Datagram Sockets (RDS)? These are used by Oracle for cache coherency across InfiniBand in RAC clusters. They provide guaranteed delivery for peer-to-peer networks whilst offering a programming model familiar to socket programmers.
We have looked at DAPL, but like Oracle came to the conclusion that it is just too difficult to use, and it appears to be in decline.
Has anybody tried DAPL or RDS in Financial Services, or is the serious work still going on at the VERB level?
-------------------------------------------------
What I meant was that for messages that would fit inside a single MTU (1500 bytes on Ethernet, 2K on current InfiniBand implementations), it may be faster to simply SEND them rather than initiate an RDMA transfer. I'm referring to the InfiniBand send/receive VERBS.
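To show the distinction at the VERB level, here is a sketch of the two operations being compared; it assumes a connected RC QP, a registered memory region and (for the write) the peer's remote address and rkey have already been exchanged, all of which are hypothetical here.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Two-sided SEND: the payload lands in whatever receive buffer the peer posted. */
    static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
        struct ibv_send_wr wr = { 0 }, *bad;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.opcode = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* One-sided RDMA WRITE: the payload is placed directly at the peer's
     * pre-advertised virtual address; no receive work request is consumed. */
    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
        struct ibv_send_wr wr = { 0 }, *bad;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.opcode = IBV_WR_RDMA_WRITE;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }

The SEND needs no per-message setup beyond the connection itself; the WRITE first requires an out-of-band exchange of the target address and rkey, which is the setup cost referred to above.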
I've tested an open-source VERB-level multicast implementation which achieved an end-to-end RTT latency of 6.14 µs median (std dev 0.51 µs), compared with 23.99 µs median (std dev 27.29 µs) when running IPoIB.
On the same configuration we achieved 14.66 µs typical latency using ib_send_lat and 13.88 µs using ib_write_lat (RDMA writes) from the standard OFED perf tests. Both are longer than the multicast test but, interestingly, in this case the RDMA write is quicker than the SEND.
(Note that the OFED perf test numbers have to be doubled to obtain comparable RTTs.)
It's interesting to see your point about the efficiency of the poll operation in these circumstances. One of the advantages of InfiniBand is that the network has more bandwidth than the CPU can fill, so these calls could be non-blocking, provided you implemented some higher-level error recovery to deal with exceptions. Most market data distribution systems add this on top of the (unreliable) raw multicast.
It looks like we'll have to check the code on both tests to see how blocking and polling are respectively handled, so we can construct a comparable test; it is not clear to me at the moment which approach provides the lowest latency. We're also using the nanosecond clock for timing, whereas the OFED code uses CPU cycle counting, which may also account for a difference.
This test configuration had four InfiniBand switches in path between the servers under test.
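To make the blocking-versus-polling comparison concrete, here is a sketch of the two completion-handling strategies at the verbs level; the CQ and its completion channel are assumed to have been created already.

    #include <infiniband/verbs.h>

    /* Busy-poll: lowest latency, but burns a core while it spins. */
    static void wait_polling(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                    /* spin until a completion arrives */
    }

    /* Blocking: sleep in the kernel until the HCA raises a completion event. */
    static void wait_blocking(struct ibv_cq *cq, struct ibv_comp_channel *ch)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        ibv_req_notify_cq(cq, 0);                /* arm the CQ for the next completion */
        ibv_get_cq_event(ch, &ev_cq, &ev_ctx);   /* blocks until the event fires */
        ibv_ack_cq_events(ev_cq, 1);
        while (ibv_poll_cq(ev_cq, 1, &wc) > 0)   /* drain whatever has completed */
            ;
    }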