The job and joblet grid objects provide several facts for controlling the robustness of job and joblet operation.
By default, these facts fail the job on the first error, since failures are common during the development phase. Depending on your job requirements, you can raise the retry maximum on the fact so that your joblets either fail over to another resource or retry.
The fact job.joblet.maxretry defaults to 0, which means the joblet is not retried. On first failure, the joblet is considered failed. This, in turn, fails the job. However, after you have written and tested your job, you should introduce fault tolerance to the joblet.
For example, suppose you know that your resource application might occasionally time out because of network or other resource problems. You might then want to introduce the following behavior by setting facts appropriately:
On timeout of 60 seconds, retry the joblet.
Retry a maximum of two times. This may cause a retry on another resource matching your resource and allocation constraints.
On the third timeout, fail the joblet.
To configure this setup, you use the following facts in either the job policy (using the Orchestrator console to edit the facts directly) or within the job itself:
job.joblet.timeout set to 60
job.joblet.maxretry set to 2
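The retry behavior described above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not Orchestrator or JDL API: the function run_joblet and the list of simulated run times are hypothetical, and only the two fact names come from the text.

```python
# Hypothetical model of the timeout/retry semantics; not Orchestrator API.
facts = {
    "job.joblet.timeout": 60,   # seconds before an attempt counts as timed out
    "job.joblet.maxretry": 2,   # retries allowed after the first attempt
}

def run_joblet(durations, facts):
    """Simulate joblet attempts, one simulated run time (seconds) per attempt.

    Returns a (status, attempts_used) tuple.
    """
    timeout = facts["job.joblet.timeout"]
    maxretry = facts["job.joblet.maxretry"]
    for attempt, seconds in enumerate(durations, start=1):
        if seconds <= timeout:
            return ("completed", attempt)
        # This attempt timed out; give up once the retries are used up.
        if attempt > maxretry:
            return ("failed", attempt)
    raise ValueError("not enough simulated durations for the retry limit")

print(run_joblet([75, 90, 30], facts))  # succeeds on the second retry
print(run_joblet([75, 90, 61], facts))  # third timeout fails the joblet
```

With job.joblet.maxretry left at its default of 0, the same model fails the joblet on the first timeout, which matches the default behavior described earlier.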
In addition to timeouts, there are other kinds of joblet failure for which you can set the maximum retry: forced errors (job errors) and unforced errors (connection errors). For example, an error condition detected by the JDL code (forced) might warrant more retries than a network error (unforced), which might cause resource disconnections. In the connection failure case, you might want to lower the retry limit, because you probably do not want a badly set up resource with connection problems to keep retrying and getting work.
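As a rough illustration of separate limits per failure kind, the following plain-Python sketch counts failures by kind and fails the joblet once any kind exceeds its own limit. The names joblet_outcome and limits, the limit values, and the assumption that each kind is counted independently are all hypothetical; they do not come from the product, which defines its own facts for these limits.

```python
from collections import Counter

# Hypothetical per-kind retry limits: generous for errors raised by the job
# code (forced), strict for connection problems (unforced).
limits = {"forced": 3, "unforced": 1}

def joblet_outcome(events, limits):
    """Walk a sequence of attempt results: "ok", "forced", or "unforced".

    Assumes each failure kind is counted separately (an assumption of this
    sketch, not documented behavior).
    """
    failures = Counter()
    for event in events:
        if event == "ok":
            return "completed"
        failures[event] += 1
        # One more failure than the limit allows means no retry is left.
        if failures[event] > limits[event]:
            return "failed (%s)" % event
    return "pending"

print(joblet_outcome(["forced", "forced", "forced", "ok"], limits))
print(joblet_outcome(["unforced", "unforced"], limits))
```

In this model the lower unforced limit stops a resource with connection problems after a single retry, while job-level errors get three.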