failover.job

A test job that demonstrates handling of joblet failover.

Usage

> zos login --user zenuser
Please enter current password for 'zenuser':
 Logged into grid as zenuser

> zos jobinfo --detail failover
Jobname/Parameters    Attributes
------------------    ----------
failover           Desc: This test jobs can be used to demonstrate joblet
                         failover handling.

    sleeptime      Desc: specify the execute length of joblet before failure in
                         seconds
                   Type: Integer
                Default: 7

    numJoblets     Desc: joblets to run
                   Type: Integer
                Default: 1

Description

Schedules joblets that always fail (one joblet by default). Each failed joblet is re-instantiated in a repeating cycle until the specified retry limit is reached, at which point the Orchestrator Server does not create another instance. This example demonstrates the retry handling that makes jobs more robust, as described in Section 7.13, Improving Job and Joblet Robustness.

The files that make up the Failover job include:

failover                                    # Total: 94 lines
|-- failover.jdl                            #   64 lines
`-- failover.policy                         #   30 lines

failover.jdl

 1  # -----------------------------------------------------------------------------
 2  #  Copyright © 2008 Novell, Inc. All Rights Reserved.
 3  #
 4  #  NOVELL PROVIDES THE SOFTWARE "AS IS," WITHOUT ANY EXPRESS OR IMPLIED
 5  #  WARRANTY, INCLUDING WITHOUT THE IMPLIED WARRANTIES OF MERCHANTABILITY,
 6  #  FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGMENT.  NOVELL, THE AUTHORS
 7  #  OF THE SOFTWARE, AND THE OWNERS OF COPYRIGHT IN THE SOFTWARE ARE NOT LIABLE
 8  #  FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 9  #  TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
10  #  OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
11  # -----------------------------------------------------------------------------
12  #  $Id: failover.jdl,v 1.3 2008/02/27 20:50:00 john Exp $
13  # -----------------------------------------------------------------------------
14
15  # Test job to illustrate joblet failover and max retry limits
16  #
17  # Job args:
18  #    numJoblets - specify number of Joblets to run
19  #    sleeptime -- specify the execute length of joblet before failure in seconds
20  #
21
22  import sys,os,time
23
24  #
25  # Add to the 'examples' group on deployment
26  #
27  if __mode__ == "deploy":
28      try:
29          jobgroupname = "examples"
30          jobgroup = getMatrix().getGroup(TYPE_JOB, jobgroupname)
31          if jobgroup == None:
32              jobgroup = getMatrix().createGroup(TYPE_JOB, jobgroupname)
33          jobgroup.addMember(__jobname__)
34      except:
35          exc_type, exc_value, exc_traceback = sys.exc_info()
36          print "Error adding %s to %s group: %s %s" % (__jobname__, jobgroupname, exc_type, exc_value)
37
38
39  class failover(Job):
40
41       def job_started_event(self):
42            numJoblets = self.getFact("jobargs.numJoblets")
43            print 'Launching ', numJoblets, ' joblets'
44            self.schedule(failoverjoblet,numJoblets)
45
46
47  class failoverjoblet(Joblet):
48
49       def joblet_started_event(self):
50            print "------------------ joblet_started_event"
51            print "node=%s joblet=%d" % (self.getFact("resource.id"), self.getFact("joblet.number"))
52            print "self.getFact(joblet.retrynumber)=%d" % (self.getFact("joblet.retrynumber"))
53            print "self.getFact(job.joblet.maxretry)=%d" % (self.getFact("job.joblet.maxretry"))
54
55            sleeptime = self.getFact("jobargs.sleeptime")
56            print "sleeping for %d seconds" % (sleeptime)
57            time.sleep(sleeptime)
58
59            # This will cause joblet failure and thus retry
60            raise RuntimeError, "Artifical error in joblet. node=%s" % (self.getFact("resource.id"))
61
62
63
64

failover.policy

 1  <!--
 2   *=============================================================================
 3   * Copyright © 2008 Novell, Inc. All Rights Reserved.
 4   *
 5   * NOVELL PROVIDES THE SOFTWARE "AS IS," WITHOUT ANY EXPRESS OR IMPLIED
 6   * WARRANTY, INCLUDING WITHOUT THE IMPLIED WARRANTIES OF MERCHANTABILITY,
 7   * FITNESS FOR A PARTICULAR PURPOSE, AND NON INFRINGMENT.  NOVELL, THE AUTHORS
 8   * OF THE SOFTWARE, AND THE OWNERS OF COPYRIGHT IN THE SOFTWARE ARE NOT LIABLE
 9   * FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
10   * TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE
11   * OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
12   *=============================================================================
13   * $Id: failover.policy,v 1.2 2008/02/27 20:50:00 john Exp $
14   *=============================================================================
15   -->
16
17  <policy>
18      <jobargs>
19            <fact name="sleeptime" description="specify the execute length of joblet before failure in seconds" value="7" type="Integer"  />
20            <fact name="numJoblets" description="joblets to run" value="1" type="Integer" />
21      </jobargs>
22
23      <job>
24            <fact name="description" value="This test jobs can be used to demonstrate joblet failover handling." type="String" />
25
26           <!-- Number of times to retry joblet on failure -->
27           <fact name="joblet.maxretry" type="Integer" value="3" />
28      </job>
29  </policy>
30

Classes and Methods

Definitions:

Class failover (line 39 of failover.jdl) is derived from the Job class, and class failoverjoblet (line 47 of failover.jdl) is derived from the Joblet class.

Job

A representation of a running job instance.

Joblet

Defines execution on the resource.

MatrixInfo

A representation of the matrix grid object, which provides operations for retrieving and creating grid objects in the system. MatrixInfo is retrieved using the built-in getMatrix() function. Write capability is dependent on the context in which getMatrix() is called. For example, in a joblet process on a resource, creating new grid objects is not supported.

GroupInfo

A representation of Group grid objects. Operations include retrieving the group member lists and adding/removing from the group member lists, and retrieving and setting facts on the group.

Job Details

The following sections describe the Failover job:

Policy

In failover.policy, in addition to describing the jobargs and the default settings for sleeptime and numJoblets (lines 18-21), the <job/> section (lines 23-28) describes static facts (see Section 5.1.2, Facts). Note that the joblet.maxretry fact in line 27 has a default setting of 0 but is set here to 3. This fact can also be set in the failover.jdl file by inserting a line between lines 41 and 42, as shown in the following example:

 41       def job_started_event(self):
 ++            self.setFact("job.joblet.maxretry", 3)
 42            numJoblets = self.getFact("jobargs.numJoblets")
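
The listings suggest that a fact declared inside a policy section is read from JDL under a name prefixed with that section (for example, joblet.maxretry inside <job/> is read as job.joblet.maxretry, and sleeptime inside <jobargs/> as jobargs.sleeptime). The following standalone Python sketch, which is not part of the product and runs outside the Orchestrator, parses the policy body with the standard library to show that mapping. The function name qualified_facts is hypothetical, and the prefixing rule is an assumption inferred from the two listings.

```python
import xml.etree.ElementTree as ET

# The <policy> element from failover.policy (license comment omitted).
POLICY_XML = """
<policy>
    <jobargs>
        <fact name="sleeptime" description="specify the execute length of joblet before failure in seconds" value="7" type="Integer" />
        <fact name="numJoblets" description="joblets to run" value="1" type="Integer" />
    </jobargs>
    <job>
        <fact name="description" value="This test jobs can be used to demonstrate joblet failover handling." type="String" />
        <fact name="joblet.maxretry" type="Integer" value="3" />
    </job>
</policy>
"""


def qualified_facts(policy_xml):
    """Return {"<section>.<name>": value} for every <fact> in the policy.

    Assumption: JDL addresses a fact as "<section>.<name>", as the
    getFact() calls in failover.jdl suggest.
    """
    facts = {}
    root = ET.fromstring(policy_xml)
    for section in root:
        for fact in section.findall("fact"):
            value = fact.get("value")
            if fact.get("type") == "Integer":
                value = int(value)
            facts["%s.%s" % (section.tag, fact.get("name"))] = value
    return facts


facts = qualified_facts(POLICY_XML)
for name in sorted(facts):
    print(name, "=", facts[name])
```

The printed names include job.joblet.maxretry, jobargs.numJoblets, and jobargs.sleeptime, which are the names that failover.jdl passes to getFact().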

zosadmin deploy

When the Orchestrator Server deploys a job for the first time (see Section 7.5, Deploying Jobs), the job's JDL files are executed in a special “deploy” mode. When failover.jdl runs in this mode (line 27), it attempts to find the examples job group (lines 29-30), creates the group if it is missing (lines 31-32), and adds the failover job to the group (line 33).

Jobs can be deployed using either the Orchestrator console (zoc) or the zosadmin deploy command. If any of these steps fails, the exception is caught (line 34) and a message is printed (line 36) that contains the job name, group name, exception type, and exception value.
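
The find-or-create pattern used in deploy mode can be tried outside the Orchestrator with stand-in objects. The sketch below is only an illustration: FakeMatrix, FakeGroup, add_job_to_group, and the TYPE_JOB placeholder are hypothetical stand-ins written for this example, and they imitate only the three calls that failover.jdl makes (getGroup, createGroup, addMember). The duplicate check in addMember is a choice made for the stand-in, not documented server behavior.

```python
TYPE_JOB = "job"  # placeholder for the JDL constant of the same name


class FakeGroup:
    """Stand-in for a job group; it only tracks member names."""

    def __init__(self, name):
        self.name = name
        self.members = []

    def addMember(self, member):
        # Stand-in behavior: ignore a member that is already present.
        if member not in self.members:
            self.members.append(member)


class FakeMatrix:
    """Stand-in for the object returned by getMatrix() (hypothetical)."""

    def __init__(self):
        self._groups = {}

    def getGroup(self, group_type, name):
        # Returns None when the group does not exist yet.
        return self._groups.get((group_type, name))

    def createGroup(self, group_type, name):
        group = FakeGroup(name)
        self._groups[(group_type, name)] = group
        return group


def add_job_to_group(matrix, jobname, groupname="examples"):
    """Find the group, create it if it is missing, then add the job."""
    group = matrix.getGroup(TYPE_JOB, groupname)
    if group is None:
        group = matrix.createGroup(TYPE_JOB, groupname)
    group.addMember(jobname)
    return group


matrix = FakeMatrix()
first = add_job_to_group(matrix, "failover")
second = add_job_to_group(matrix, "failover")  # second deploy reuses the group
print(first is second, first.members)
```

Because the group is looked up before it is created, running the deploy step twice reuses the existing group instead of creating a second one.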

job_started_event

In failover.jdl, the failover class (line 39) defines only the required job_started_event method (line 41). This method runs on the Orchestrator Server when the job is run and launches the joblets.

On execution, job_started_event simply gets the number of joblets to create (numJoblets in line 42), then schedules that number of instances of the failoverjoblet class (line 44).

joblet_started_event

The failoverjoblet class (lines 47-60) defines only the required joblet_started_event method (line 49).

When executed on an agent node, joblet_started_event prints some helpful information for tracking execution (lines 50-53). The first output shows where the joblet is running and which joblet instance is running (line 51). The current joblet retry number (line 52) is displayed next, followed by the job’s static joblet.maxretry value (line 53) that was specified in the policy file.

The joblet then sleeps for jobargs.sleeptime seconds (lines 55-57) and on waking raises an exception of type RuntimeError (line 60).

This is the point of this example. After the RuntimeError exception is raised, the Orchestrator Server attempts to run the same joblet instance again as long as joblet.retrynumber has not yet reached job.joblet.maxretry (the default is 0, which means that a failed joblet is not retried).
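
The retry cycle can be imitated in plain Python to make the control flow concrete. This is a minimal sketch, not the server's implementation: failing_joblet and run_with_retries are hypothetical names, and the rule "retry while the retry number is below the limit" is an assumption. The sample log in this section shows retry numbers 0 through 2 for a limit of 3, so the server's exact cutoff may differ by one attempt from what this sketch does.

```python
def failing_joblet(node, retrynumber, maxretry):
    """Mimic failoverjoblet.joblet_started_event: report, then fail."""
    print("node=%s retrynumber=%d maxretry=%d" % (node, retrynumber, maxretry))
    raise RuntimeError("Artificial error in joblet. node=%s" % node)


def run_with_retries(node, maxretry):
    """Run the joblet, retrying while retrynumber < maxretry (assumed rule).

    Returns the total number of attempts made before giving up.
    """
    retrynumber = 0
    attempts = 0
    while True:
        attempts += 1
        try:
            failing_joblet(node, retrynumber, maxretry)
            return attempts  # never reached here: the joblet always fails
        except RuntimeError:
            if retrynumber >= maxretry:
                return attempts  # retry limit reached: no new instance
            retrynumber += 1


default_attempts = run_with_retries("melt", 0)  # default limit: no retries
policy_attempts = run_with_retries("melt", 3)   # limit from failover.policy
print(default_attempts, policy_attempts)
```

Under the stated assumption, a limit of 0 yields a single attempt and a limit of 3 yields the first attempt plus three retries.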

Configure and Run

You must be logged in to the Orchestrator Server before you run zosadmin or zos commands.

  1. Deploy failover.job into the grid:

    > zosadmin deploy failover.job

  2. Display the list of deployed jobs:

    > zos joblist
    

    failover should appear in this list.

  3. Run the job on one or more resources, overriding the default values for numJoblets and sleeptime that are specified in the failover.policy file:

    > zos run failover sleeptime=1 numJoblets=2
    JobID: zenuser.failover.269
    

The job appears to have started successfully. Now look at the log to see each joblet fail and be relaunched until the maxretry count is exceeded and the job exits with a failure status:

> zos log zenuser.failover.269
Launching  2  joblets
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=0
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 46, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=0
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 46, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=0
[melt] self.getFact(joblet.retrynumber)=1
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 46, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=1
[freeze] self.getFact(joblet.retrynumber)=1
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 46, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
[melt] ------------------ joblet_started_event
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=2
[melt] self.getFact(job.joblet.maxretry)=3
[melt] sleeping for 1 seconds
[melt] Traceback (innermost last):
[melt]   File "failover.jdl", line 46, in joblet_started_event
[melt] RuntimeError: Artifical error in joblet. node=melt
[freeze] ------------------ joblet_started_event
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=2
[freeze] self.getFact(job.joblet.maxretry)=3
[freeze] sleeping for 1 seconds
[freeze] Traceback (innermost last):
[freeze]   File "failover.jdl", line 46, in joblet_started_event
[freeze] RuntimeError: Artifical error in joblet. node=freeze
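
To check how many times each node ran the joblet, the retry lines of a log like the one above can be grouped by node. The following standalone Python sketch assumes the log text has been saved or captured as a string. The helper name retries_by_node is hypothetical, and the regular expression matches only the retry-number line format shown in this sample.

```python
import re
from collections import defaultdict

# A few lines copied from the sample log above.
SAMPLE_LOG = """\
[melt] node=melt joblet=1
[melt] self.getFact(joblet.retrynumber)=0
[freeze] node=freeze joblet=0
[freeze] self.getFact(joblet.retrynumber)=0
[melt] self.getFact(joblet.retrynumber)=1
[freeze] self.getFact(joblet.retrynumber)=1
[melt] self.getFact(joblet.retrynumber)=2
[freeze] self.getFact(joblet.retrynumber)=2
"""

# Matches lines such as "[melt] self.getFact(joblet.retrynumber)=2".
RETRY_LINE = re.compile(
    r"^\[(?P<node>[^\]]+)\] self\.getFact\(joblet\.retrynumber\)=(?P<retry>\d+)$"
)


def retries_by_node(log_text):
    """Map each node name to the list of retry numbers it reported."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        match = RETRY_LINE.match(line.strip())
        if match:
            seen[match.group("node")].append(int(match.group("retry")))
    return dict(seen)


summary = retries_by_node(SAMPLE_LOG)
for node in sorted(summary):
    print(node, summary[node])
```

For the sample lines, both nodes report retry numbers 0, 1, and 2.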

See Also