3.2 Datagrid Communications

There is no set limit to the number of receivers (nodes) that can participate in the datagrid or in a multicast operation. Indeed, multicast is rarely more efficient when the number of receivers is small. Any type of file or file hierarchy can be distributed via the datagrid.

The datagrid uses both TCP/IP and IP multicast protocols for file transfer. Unicast transfers (the default) are reliable because they use TCP. Unicast file transfers use the same server/node communication socket that is used for other job coordination. Datagrid packets are simply wrapped in a generic DataGrid message. Multicast transfers use the persistent socket connection to set up a new multicast port for each transfer.

After the multicast port is opened, data packets are received directly. The socket communication is then used to coordinate packet resends. Typically, a receiver will lose intermittent packets (because of the use of IP multicast, data collisions, etc.). After the file is transferred, all receivers respond with a bitmap of missed packets. The logical AND of these bitmaps is used to initiate a resend of commonly missed packets. This process repeats a few times (with less data to resend on each iteration). Finally, any receiver that still has incomplete data is sent the missing pieces in a reliable unicast fashion.
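The resend round described above can be sketched in a few lines of Python. This is only an illustration of the bitmap arithmetic, not the Orchestrator implementation. The function names, the node names, and the convention that a set bit means "packet missed" are all assumptions made for the example, and the sketch assumes every receiver gets the resent packets on the first retry.

```python
# Illustrative sketch only; not the Orchestrator implementation.
# Assumed convention: bit i set in a mask means "packet i was missed".

def commonly_missed(masks):
    """AND all receivers' missed-packet bitmaps together."""
    if not masks:
        return 0
    common = masks[0]
    for mask in masks[1:]:
        common &= mask
    return common

def packet_numbers(mask):
    """List the packet indexes whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# Three hypothetical receivers report what they missed out of packets 0..7.
reports = {
    "node-a": 0b00101100,   # missed packets 2, 3, 5
    "node-b": 0b00100100,   # missed packets 2, 5
    "node-c": 0b01100100,   # missed packets 2, 5, 6
}

# Packets missed by every receiver are resent by multicast.
resend = commonly_missed(list(reports.values()))
multicast_resend = packet_numbers(resend)

# Whatever a receiver still lacks afterward is delivered by unicast.
unicast_leftovers = {
    node: packet_numbers(mask & ~resend) for node, mask in reports.items()
}
```

In this example, packets 2 and 5 are resent by multicast because all three receivers missed them, while packet 3 (node-a) and packet 6 (node-c) are left for the final unicast pass.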

The data transmission for a multicast datagrid transmission is always initiated by the Orchestrator Server. Currently this is the same server that is running the grid.

With the exception of multicast file transfers, all Data Grid traffic goes over the existing connection between the agent/client and the server. This is done transparently to the end user or developer. As long as the agent is connected and/or the user is logged in to the grid, the Data Grid operations function.

3.2.1 Multicast Example

Multicast transfers are currently supported only through JDL code on the agents. Doing it through the command line client interface would be too cumbersome. In JDL, after you get the "datagrid" object, you can enable and configure multicasting like this:

    dg.setMulticast(true)

Additional multicast tunables can be set on the object as well, as in the following example:

    dg.setMulticastRate(20000000)

This would set the maximum data rate on the transfer to 20 million bytes/sec. There are a number of other options as well. Refer to the JDL reference for complete information.

The actual multicast copy is initiated when a sufficient number of JDL joblets on different nodes issue the JDL command:

    dg.copy(...) 

to actually copy the requested file locally. See the 'setMulticastMin' and 'setMulticastQuorum' options to change the minimum receiver count and other thresholds for multicasting.

For example, to set up a multicast from a joblet where the data rate is 30 million bytes/sec, a minimum of five receivers must request multicast within 30 seconds, and the transfer starts right away if 30 receivers connect, use the following script:

    dg = DataGrid()
    dg.setMulticast(true)
    dg.setMulticastRate(30000000)
    dg.setMulticastMin(5)
    dg.setMulticastQuorum(30)
    dg.setMulticastWait(30000)
    dg.copy('grid:///vms/huge-image.dsk', 'image.dsk')

In the above example, if at least five agents running the joblet request the file within the same 30-second period, then a multicast is started to all agents that have requested multicast before the transfer is started. Agents requesting after the cutoff have to wait for the next round. Also, if fewer than five agents request the file, then each agent simply falls back to a plain unicast file copy.

Furthermore, if 30 agents connect before the 30 seconds are up, then the transfer begins immediately after the 30th request. This is useful for situations where you know how many agents will request the file and want to start as soon as all of them are ready.
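The interaction of the minimum, quorum, and wait settings can be modeled with a short Python sketch. This is not Orchestrator code. The function name multicast_decision and its return values are hypothetical, and the sketch assumes that the wait period is measured from the first request and that a request arriving exactly at the cutoff still counts.

```python
# Illustrative model of the start rule described above; not Orchestrator code.

def multicast_decision(request_times_ms, minimum=5, quorum=30, wait_ms=30000):
    """Return (mode, start_time_ms, receivers) for one round of requests.

    request_times_ms: arrival time of each request, in milliseconds after
    the first request (assumed start of the wait period).
    """
    arrivals = sorted(request_times_ms)
    in_window = [t for t in arrivals if t <= wait_ms]
    if len(in_window) >= quorum:
        # Quorum reached: start as soon as the quorum-th request arrives.
        return ("multicast", in_window[quorum - 1], quorum)
    if len(in_window) >= minimum:
        # Minimum met: start when the wait period expires.
        return ("multicast", wait_ms, len(in_window))
    # Too few receivers: each agent falls back to a unicast copy.
    return ("unicast", wait_ms, len(in_window))

# Six agents request within the window: multicast starts at the cutoff.
six_agents = multicast_decision([0, 1000, 2000, 3000, 4000, 5000])

# Thirty-five agents request 100 ms apart: the 30th request triggers the start.
many_agents = multicast_decision(list(range(0, 3500, 100)))

# Only three agents request: everyone falls back to unicast.
few_agents = multicast_decision([0, 1000, 2000])
```

The default arguments mirror the values used in the script above (5, 30, and 30000).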

3.2.2 Grid Performance Factors

The multicast system performance is dependent on the following factors:

  • Network Load: As the load increases, there is more packet loss, which results in more retries.

  • Number of Nodes: The more nodes (receivers) there are, the greater the efficiency of the multicast system.

  • File Size: The larger the file size, the better. Unless there are a large number of nodes, files smaller than 2 MB are probably too small.

  • Tuning: The datagrid facility has the ability to throttle network bandwidth. Best performance has been found at about maximum bandwidth divided by 2. Using more bandwidth leads to more collisions. Also, the number of simultaneous multicasts can be limited. Finally, the minimum receiver count, receiver wait time, and quorum receiver count can all be tuned.
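As a rough aid for the bandwidth guideline in the Tuning item, the following Python sketch converts a link speed into a throttle value in bytes/sec. The function name suggested_multicast_rate is hypothetical, and the sketch assumes that "maximum bandwidth" means the nominal link speed in bits/sec. Treat the result as a starting point for tuning, not a guaranteed optimum.

```python
# Illustrative helper; the name and the interpretation of "maximum
# bandwidth" as nominal link speed are assumptions for this example.

def suggested_multicast_rate(link_bits_per_sec, fraction=0.5):
    """Convert a link speed in bits/sec to a throttle value in bytes/sec."""
    bytes_per_sec = link_bits_per_sec // 8
    return int(bytes_per_sec * fraction)

gigabit = suggested_multicast_rate(1000000000)       # 1 Gbit/s link
fast_ethernet = suggested_multicast_rate(100000000)  # 100 Mbit/s link
```

The resulting value is in the same unit (bytes/sec) that dg.setMulticastRate() expects, so it can be passed to that call in JDL.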

Access to the datagrid is typically performed via the CLI tool or JDL code within a job. There is also a Java API in the Client SDK (on which the CLI is implemented). See ClientAgent.

3.2.3 Plan for Datagrid Expansion

When planning your datagrid, you need to consider where you want the Orchestrator Server to store its data. Much of the server data is the contents of the datagrid, including ever-expanding job logs. Every job log can become quite large and quickly exceed its storage constraints.

In addition, every deployed job with its job package (JDL scripts, policy information, and all other associated executables and binary files) is stored in the datagrid. Consequently, if your datagrid is going to grow very large, store it in a directory other than /opt.