Access Gateway mod_balancer not failing over fast enough when web server crashes

  • 7016710
  • 28-Jul-2015
  • 28-Jul-2015

Environment

NetIQ Access Manager 3.2
NetIQ Access Manager 4.0
NetIQ Access Manager 4.1
NetIQ Access Gateway Appliance or Service

Situation

The Access Gateway proxy service is configured to accelerate multiple web servers, so the Apache mod_balancer load balances requests across these back-end web servers per user session.

The TCP timeouts that can be passed to mod_balancer through the 'Web server' UI options are 'timeout=120 keepalive=off ttl=180'. When a web server goes down, these options react too slowly, i.e. users have to wait up to 30 seconds before the Access Gateway sends the request to the next web server in the configuration.

Apache mod_balancer supports a connectiontimeout parameter that can lower the connection timeout, for example to 5 seconds. This parameter is not available via the UI. Manually adding it to the vhost CONF file for this proxy service and setting it to 5 seconds allowed the Access Gateway (AG) to send the request to another back-end web server after 5 seconds, as shown in the following logs:

May 6 12:29:28 mag32app-vm httpd[3526]: [debug] mod_proxy_balancer.c(1077): proxy: byrequests selected worker "http://147.2.16.155" : busy 0 : lbstatus 1
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] mod_proxy_balancer.c(603): proxy: BALANCER (balancer://bal_rewriter) worker (http://147.2.16.155) rewritten to http://147.2.16.155/rewriter/phpinfo.php
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] mod_proxy.c(1043): Running scheme balancer handler (attempt 0)
May 6 12:29:28 mag32app-vm httpd[3526]: [info] AM#504600000 AMDEVICEID#ag-45B6586EB94FC2A7: AMAUTHID#: AMEVENTID#13: balancer cookie is ZNPCQ003-33393000=11982cff; Path=/rewriter; Domain=.lab.novell.com
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] mod_proxy_http.c(2162): proxy: HTTP: serving URL http://147.2.16.155/rewriter/phpinfo.php
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] proxy_util.c(2041): proxy: HTTP: has acquired connection for (147.2.16.155)
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] proxy_util.c(2097): proxy: connecting http://147.2.16.155/rewriter/phpinfo.php to 147.2.16.155:80
May 6 12:29:28 mag32app-vm httpd[3526]: [debug] proxy_util.c(2223): proxy: connected /rewriter/phpinfo.php to 147.2.16.155:80
May 6 12:29:28 mag32app-vm httpd[3526]: [info] proxy: HTTP: fam 2 socket created to connect to 147.2.16.155
May 6 12:29:33 mag32app-vm httpd[3526]: [error] (70007)The timeout specified has expired: proxy: HTTP: attempt to connect to 147.2.16.155:80 (147.2.16.155) failed
May 6 12:29:33 mag32app-vm httpd[3526]: [error] ap_proxy_connect_backend disabling worker for (147.2.16.155)
May 6 12:29:33 mag32app-vm httpd[3526]: [error] AMEVENTID#13: failed to connect to webserver

The web server is also flagged as down:

Reverse Proxy bal_sles11

SSes               Timeout  Method
ZNPCQ003-32393400  0        byrequests

Sch   Host          Stat  Route     Redir  F  Set  Acc  Wr    Rd
http  147.2.16.154  Err   370b886c         1  0    6    1.7K  355
http  147.2.16.155  Ok    a0795fd9         1  0    15   5.1K  1.0K

This addresses the issue of Apache taking the failed web server out of rotation quickly. However, within 60 seconds the balancer considers the web server up and running again, even though it is still dead. Another mod_balancer timeout, called retry, can be used to work around this by making it larger.
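As a rough illustration of the manual change described above, a balancer definition in a vhost CONF file might look like the following sketch. It uses standard Apache mod_proxy BalancerMember syntax. The balancer name and worker addresses are borrowed from the logs and status output above for illustration only, the retry value of 300 is a hypothetical choice (the Apache default is 60 seconds), and the exact layout of the file the Access Gateway generates is an assumption and may differ.

```apache
# Illustrative sketch only: the file generated by the Access Gateway may be
# laid out differently. Balancer name and worker addresses are examples.
<Proxy balancer://bal_rewriter>
    # connectiontimeout=5: give up on a connection attempt after 5 seconds
    #                      so the request can go to the next worker.
    # retry=300:           hypothetical value; wait 300 seconds before
    #                      trying a worker in error state again
    #                      (the Apache default is 60 seconds).
    BalancerMember http://147.2.16.154 connectiontimeout=5 retry=300
    BalancerMember http://147.2.16.155 connectiontimeout=5 retry=300
</Proxy>
```

A larger retry keeps a dead web server out of rotation for longer, at the cost of also delaying its return to service once it has recovered.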

There are two main issues here:

a) There are no options in the Web server UI to change these settings; they must be modified manually in the CONF files.
b) The settings are lost with each restart or update of the AG.

Resolution

Add the following parameters to the proxy service's Advanced Options so that the Apache mod_balancer parameters are applied persistently by the Access Gateway:

AdditionalBalancerMemberOptions connectiontimeout=30
AdditionalBalancerMemberOptions retry=5


You can set any of the following parameters: "min", "max", "smax", "acquire", "connectiontimeout", "disablereuse", "flushpackets", "flushwait", "ping", "loadfactor", "redirect", "retry".

Note:

- The keepalive, route, lbset and ttl parameters are all handled by the AG via the UI.
- You cannot use keepalive, lbset, route or timeout in the Advanced Options, because they will be overwritten by the UI settings.