How to see when filter fails

Topics: User Forum
Oct 8, 2008 at 8:26 AM

We are experiencing a problem with the filter. The symptoms are that traffic is normal, around 300 requests per second, and then the incoming requests to the application suddenly drop to nearly zero per second. This must be because the requests are held up in the filter somehow. Quite often, the server responds with an HTTP status 500 error page.

Now I have two questions:

1) How can I see how much downtime my users have experienced before I deactivated the filter?

2) Does the filter throw an error or log an event or whatever when it has problems with the load?

I suspect that our rules are too numerous and too complex relative to our load. This must of course be addressed before we activate the filter again.

Oct 8, 2008 at 4:15 PM
I don't know how to answer the first question.
For the second: the filter logs all kinds of stuff, so the first thing I would do is check the log file.

There are no special log messages for when the filter "has problems with the load". There's no logic in the filter to detect when it has a problem.
It just logs as usual.

What version of the filter are you using?
Oct 9, 2008 at 6:42 AM
Thanks for the fast reply.

Never mind question no. 1. That was just me writing before thinking.

Regarding question no 2, I thought there might be a value for "average processing time/request" or something in the filter that it could log.

We are using v1.2.14 final R3. Should we upgrade?
Oct 14, 2008 at 4:14 AM
No, the filter does not emit an average processing-time-per-request metric. You could write a Perl script to compute it based on the log messages.
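In case it's useful, here is a sketch of that kind of script in Python rather than Perl. The log line layout and the "begin request" / "end request" markers below are assumptions, not real IIRF output; adjust the regex and markers to match what your log file actually contains:

```python
# Sketch: average per-request processing time from filter log lines.
# The line format and the begin/end markers are ASSUMED, not real IIRF output.
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<tid>\d+) - (?P<msg>.*)$"
)

def average_processing_ms(lines):
    """Pair 'begin request'/'end request' markers per thread id and
    average the elapsed time in milliseconds."""
    starts = {}
    durations = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        tid, msg = m.group("tid"), m.group("msg")
        if "begin request" in msg:
            starts[tid] = ts
        elif "end request" in msg and tid in starts:
            durations.append((ts - starts.pop(tid)) / timedelta(milliseconds=1))
    return sum(durations) / len(durations) if durations else None

sample = [
    "2008-10-14 08:00:00.000 - 1234 - begin request /bellum.aspx",
    "2008-10-14 08:00:00.012 - 1234 - end request /bellum.aspx",
    "2008-10-14 08:00:00.005 - 5678 - begin request /xiphias.aspx",
    "2008-10-14 08:00:00.025 - 5678 - end request /xiphias.aspx",
]
print(average_processing_ms(sample))  # 16.0 (milliseconds)
```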

Upgrade to R4 if you are processing really long URLs. R4 has a change where very long URLs (above 500 characters) no longer cause a popup dialog box.
It's quite possible that the activity dropping to ZERO is caused by this problem. Check your logs to see if you have long URLs coming into the server.
Oct 14, 2008 at 6:42 AM
Ok, that does sound suspicious. We receive URLs longer than 500 chars with about 1-10 seconds in between. But on the other hand, we only experienced drops in the traffic about every 5-10 minutes. Could this still be the problem?

I will try the upgrade and see if it affects test results.

Oct 14, 2008 at 2:05 PM
Yes, this could be the problem.

Another possible cause is worker process recycling. You have a website with 300 requests per second, you say?
In IIS6 and above, it is possible to configure a recycle time on a worker process. You can also configure a maximum number of requests handled by the worker process before it is recycled (stopped and restarted by IIS). Is it possible that you are hitting one of these thresholds, and the restart of the process is what is causing the delay or drop in traffic?

If this is the case, then you need to back off your recycle thresholds, or use a web garden approach, where you have multiple worker processes for the same app. But be careful before trying either of these; make sure you know what you are doing before making changes.
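If it helps, here is roughly where those settings live in applicationHost.config on IIS7 (on IIS6 the equivalents are in the Application Pool properties UI). The pool name and values below are placeholders, not recommendations:

```xml
<applicationPools>
  <add name="MyAppPool">
    <!-- periodic recycling: 00:00:00 / 0 disables that threshold -->
    <recycling>
      <periodicRestart time="00:00:00" requests="0" />
    </recycling>
    <!-- web garden: more than one w3wp.exe worker process for this pool -->
    <processModel maxProcesses="2" />
  </add>
</applicationPools>
```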
Oct 14, 2008 at 3:28 PM
I tried the 1.2.14 R4 update, which of course doesn't show the buffer-size error popup-thingy, but I couldn't really see any difference in performance. We are having problems recreating the drop in traffic in our test environment because we don't have the tools to generate enough traffic. One server handles an average of about 300 req/sec (max ~500 req/sec), but the website as a whole gets a lot more. It's a big website. Do you have any decent ideas for generating this amount of traffic against the filter?

A couple more questions:
1) IIS has a flag in the application pool settings called "ISAPI Reported Unhealthy". Is this used by the filter? Could it be of any help to us?
2) Can this error message (below) be caused by the filter? It does not occur without the filter.

Error Summary
HTTP Error 503.2 - Service Unavailable
The serverRuntime@appConcurrentRequestLimit setting is being exceeded.

Detailed Error Information
Module: IIS Web Core
Notification: BeginRequest
Handler: StaticFile
Error Code: 0x00000000
Requested URL: http://...
Physical Path: C:\inetpub\...
Logon Method: Not yet determined
Logon User: Not yet determined

Most likely causes:
Setting is too low.
The application is taking too long for the current load.
Oct 14, 2008 at 5:45 PM

Ah, ok, now that is good information.

With IIS6 (or 7), the way web applications work - they run in what is called a "worker process".  The name of the exe is w3wp.exe.  This worker process loads in the code for the application.  If it is an ASP.NET application, then the worker process loads in the ASP.NET ISAPI.  If you are running IIRF, then the worker process also loads in that DLL. 

IIS then does health monitoring of the worker process. Every so often, IIS will send a message to the worker process, "Are you healthy?", and the worker process will respond, "Sure am!", and everything is fine. This happens every 30 or 60 seconds (I forget exactly). If the worker process fails to respond within a certain amount of time, then IIS decides that the worker process is not healthy, kills the worker process, and starts a new one. During the stop and restart, your website will appear to "hang". Then service returns to normal as the new process is brought online. All of this health monitoring has nothing to do with IIRF; it is completely independent of IIRF. IIRF does not "use" the "reported unhealthy" thing you mentioned.

Now the specific error message you have published shows that the concurrent request threshold for the worker process has been exceeded.  This means the worker process is serving more than the configured maximum for concurrent requests (concurrent = at any one moment ).    Along with the health monitoring, there are a bunch of metrics that the worker process monitors, and you as the administrator can configure thresholds for those things.  Concurrent requests is one such metric.  The worker process is telling IIS  "I have 500 concurrent requests right now" (or whatever the number is) and IIS is looking at the configured limit and comparing the two numbers and concluding that 500 is too many.  

You will see in the error message you posted that there are 2 likely causes for the error you received: the setting is too low, or the application is too slow.

So how do you pick the right number for the setting?  How many requests exactly is "too many" ?
It is not easy to define, but it is easy to determine.  A high number of concurrent requests, in itself, is not always a sure sign of "bad health" for a worker process. In some applications, the work required for an individual request is very lightweight, and so it is ok to have 500 of them happening at any one time.  

In other applications, 500 concurrent requests is a sign of trouble. It means there is a backlog, and the process is not keeping up with the incoming request load.   You as the administrator have to make the decision as to the request threshold that indicates bad health.  This is typically a matter of experimentation.  Look at the response time per request, and correlate that to the number of concurrent requests.  With only 1 request in the server, you may get a response time per request of 5 ms (for example).  Then  with 100 concurrent requests, the average response time may be 7 ms.  Then with 500, it may be 8 ms.  But with 650 concurrent requests, the response time could be 200 ms.  Obviously there is a threshold there that has been crossed, so as an administrator, you can say - look I want 500 concurrent requests to be the maximum for this app, based on the observations I made on response time. So you would configure that variable to 500 in the IIS Administrative UI.  
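That threshold-hunting can be automated once you have (concurrency, average response time) measurements. A minimal Python sketch; the numbers restate the example above, and the 3x "jump" factor is an arbitrary choice of mine, not any IIS standard:

```python
# Sketch: pick a max-concurrent-requests threshold from measurements.
# The (concurrency, avg response ms) pairs restate the example in the
# text; the 3x jump factor is an arbitrary choice, not a standard.
def knee(samples, factor=3.0):
    """Return the last concurrency level before average response time
    jumps by more than `factor` relative to the previous measurement."""
    for (c1, t1), (c2, t2) in zip(samples, samples[1:]):
        if t2 > t1 * factor:
            return c1
    return samples[-1][0]

observed = [(1, 5), (100, 7), (500, 8), (650, 200)]
print(knee(observed))  # 500 -> candidate max-concurrent-requests setting
```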

The other thing I recommended is "web gardens" - within IIS it is possible to set up multiple worker processes for a single application.   One implication is that if any worker process reports in as "unhealthy", it will be restarted while the other worker process continues to run.  This means you don't get a hard interruption in service.

So you have some experimentation to do.

PS: none of this has anything to do, directly, with IIRF. IIRF may introduce some latency into the web request, which may cause the number of concurrent requests to be higher than it would be without the IIRF filter, for the same incoming request load. So IIRF may be a contributing factor, but it does not appear to me to be the primary cause here. You need to do some response time analysis to set your thresholds properly.

The analysis may be as simple as writing a Perl script and examining the response time in the IIS logs, or even in the IIRF logs. If you see a flat response time (relatively constant irrespective of the number of concurrent requests), then it is probably true that your "max concurrent requests" threshold is too low.
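For example, a sketch of that analysis in Python rather than Perl. It pulls the time-taken field (milliseconds) out of an IIS W3C log; the sample lines are made up, and since the field set varies per server, the column index is read from the #Fields: header rather than hard-coded:

```python
# Sketch: extract the time-taken field (ms) from an IIS W3C log and
# check how much the response time varies. The sample lines are made up.
def time_taken_values(lines):
    idx = None
    values = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
            idx = fields.index("time-taken") if "time-taken" in fields else None
        elif not line.startswith("#") and idx is not None:
            cols = line.split()
            if len(cols) > idx:
                values.append(int(cols[idx]))
    return values

log = [
    "#Fields: date time cs-uri-stem sc-status time-taken",
    "2008-10-14 08:00:01 /default.aspx 200 5",
    "2008-10-14 08:00:01 /bellum.aspx 200 7",
    "2008-10-14 08:00:02 /default.aspx 200 6",
]
vals = time_taken_values(log)
print(max(vals) - min(vals))  # 2 -> small spread, response time is flat
```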

A quick and dirty way to approach this would be to just double the concurrent requests number and see if the problem goes away. But this is sort of unscientific; I wouldn't be satisfied with that kind of approach long term.

Oct 15, 2008 at 9:17 AM
Edited Oct 15, 2008 at 9:19 AM
Great, thanks for your quick and exhaustive answer.

Unfortunately, we're not sure that the 503.2 error is directly connected to the drops in traffic. We simply don't have the logs to prove it. There is an application pool setting under "Recycling", named "Request Limit Exceeded", which I suspect will trigger if this occurs. This setting was disabled when the filter was active so we can't really see the connection. It could also very well be, as you point out, that IIRF merely pushes the response time over its limit value and therefore the application pool restarts.

We have not set the "appConcurrentRequestLimit" in the applicationHost.config file (this is the only way to set it?). Therefore, the default value of 5000 should be used. The 503.2 error message more probably indicates that the request processing time is too high.

We now have some good ideas as to how to recreate and solve our problem, but it feels like it all comes down to being able to test the application under heavy load, which we can't do today. We should probably set up a load testing environment before we continue. Without it, we can only guess what will solve this problem and improve performance.
Oct 16, 2008 at 5:41 AM
Yes, I think that sounds right: you need a test environment.
It is pretty remarkable that you're getting 5000 concurrent requests.

Could there be a cycle in your IIRF ini file where, for a given incoming URL, you generate a series of Redirects back to the same server?
A poorly designed set of rules could cause a single HTTP request to turn into 7, via a series of redirects. You don't want that.
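To illustrate, here is a contrived two-rule cycle (hypothetical rules, not taken from your INI). Each redirect sends the client straight back into the other rule, so one logical request multiplies into many hits on the same server:

```
# HYPOTHETICAL example of a redirect cycle - do not use
RedirectRule ^/a$ /b [R=301]
RedirectRule ^/b$ /a [R=301]
```

With real rules the cycle is usually longer and harder to spot; temporarily raising RewriteLogLevel and following a single URL through the log is the easiest way to find one.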

Yes, the config file is the way to set the appConcurrentRequestLimit .
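The fragment in applicationHost.config looks roughly like this (the limit value here is purely illustrative; measure your response times before raising it):

```xml
<system.webServer>
  <!-- the default is 5000 concurrent requests -->
  <serverRuntime appConcurrentRequestLimit="10000" />
</system.webServer>
```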
Oct 24, 2008 at 11:37 AM
Edited Oct 24, 2008 at 2:45 PM

We now have some test results. It looks like one particular page in our application generates really long URLs on postback because it sends the ViewState in the URL. IIRF seems to have an upper URL length limit of 1024 characters. It looks like the rest is cut off and interpreted as a new URL. For us, this results in more than 5 application errors in under 5 minutes, and the Rapid-Fail Protection in IIS shuts down the application pool.

What's the best solution for this? We tested with IIRF v1.2.14 R4; should we upgrade to 1.2.15? Also, the large ViewState should probably not be sent in the URL, and I will try to change this.


Edit: I think it's the groups in regular expressions that can't be larger than 1024 characters, is that right?
Oct 24, 2008 at 3:55 PM
Hmm, I will have to look at the code.
I think URLs can be 2048 characters.
When you say "it looks like the rest is cut off and interpreted as a new URL", can you show me the log file so I can see what you are talking about?

Yes, I would recommend 1.2.15.
Oct 25, 2008 at 2:30 AM
I just made some changes to v1.2.15 to better handle very long URLs. 
I don't know if this will correct what you are seeing, but please try it.

if this still does not work, then I need your rules and a sample URL that causes it to fail.  Can you give me that? 
Also, what platform are you running on ?  WS2003 WS2008 Vista XP ?
Oct 27, 2008 at 8:15 AM
Hi, we'll try v 1.2.15 and run the same tests. I'll let you know how it goes.
Nov 4, 2008 at 12:46 PM
Edited Nov 4, 2008 at 12:51 PM

It's working much better with v1.2.15 R3. Our test environment isn't perfect, but we believe we managed to reproduce the error, and now we no longer see it. Previously, with TestDriver.exe, we got the following result with a URL containing 1341 characters:

REWRITE '/abcde/12345678.aspx?__VIEWSTATE=%2FwEPDwULLTE2... (exactly 1023 chars)' ==> '/12345768.aspx?__VIEWSTATE=%2FwEPDwULLTE2... (exactly 1017 chars)'

NO ACTION '%3A_ctl8=on&... (exactly 318 chars)' ==> --

The filter cut the URL after 1023 characters.

Nov 4, 2008 at 5:22 PM
Good! I am glad this is working better for you.
Please let me know if there is anything else.  I'd love to hear some metrics on your deployment - how many servers, how many rules you have in your INI, how many sites, how many requests per second, why you are using IIRF (why rewriting?), all that stuff.  if you care to share.
Nov 18, 2008 at 9:15 AM
Edited Nov 18, 2008 at 9:33 AM
Sad to say, the story isn't over... We still see dips in traffic when we "activate" IIRF. See the screenshot from our test's PerfMon log here.

We have over 60 rules in the INI file, and fewer than ten of these are flagged with [L]. Some URLs iterate up to 15-16 times, so I raised the iteration limit to a maximum of 30.

We have one main website that is distributed on several servers with a round-robin load balancing proxy up front. Requests per second per server is about 300, with some variation depending on weekday and time of day. When we perform maintenance on a server, we remove it from the live environment, which elevates traffic on the other servers. Each server should support a lot more than 300 req/sec, but it's hard to specify an exact number. All in all, it's a major website with a constant high load.

The test environment is just a separate server with a mirror of the website and the same setup as a live server. Unfortunately, this server is also our only testing resource, and therefore we have run the JMeter tests locally on this machine. Not the best setup, but still better than nothing. JMeter manages to stress the server with over 300 req/sec, and I think we managed to reproduce the problem with dips in traffic. I may have missed it before, since I was looking at a bigger timespan in the PerfMon logs and the dips weren't visible then. The results are also much more stable after we upgraded to 1.2.15 R3; it looked really flaky before the upgrade, with constant peaks and dips.

We want to rewrite our URLs to increase readability for users and, most of all, for search engine robots. We have also done a lot more to optimize the website, but rewriting the URLs remains a problem. Before I personally got involved in the project, Helicon's filter was tried, with the same results.

What could possibly cause the filter to stop forwarding requests for periods of 20-40 seconds at a time? Could it be a faulty rule? I will anonymize our rules and post them here. There are no log entries of application restarts or worker process recycling in IIS, and we can't see anything in the PerfMon logs regarding processor or memory that would lead us to believe there's a problem in the application itself.
Nov 18, 2008 at 10:54 AM
Edited Nov 19, 2008 at 11:43 AM
These are the rules we have. I had to anonymize them.

RewriteLog c:\temp\iirflogs\
RewriteLogLevel 0
IterationLimit 30

# Redirect all incoming traffic to to
RewriteCond %{SERVER_NAME} ^$
RedirectRule (.*)$1 [I,R=301]

# Static links in main menu
RewriteRule ^/hilaris/$ /bellum.aspx?Showfurtume=false
RewriteRule ^/furtum$ /bellum.aspx?Showfurtume=true
RewriteRule ^/disco$ /bellum.aspx?toolbars=banner
RewriteRule ^/expeto$ /bellum.aspx?toolbars=banner

# START: URL Rewrite för converto
# Ex /converto/this_o_that -> /converto/?convertis=this+that

# No rewrite if filname exists in URL.
RewriteRule /converto/(.*\..*) /converto/$1 [L]

# Support for multiple languages.
RewriteRule /converto/(.+)/(.+)(?:\_)(.+) /converto/$1/$2+$3 [I]
RewriteRule /converto/(.+)/(.+)(?:\Q+o+\E)(.+) /converto/?LCID=$1&bibo=$2+%26+$3 [I]
RewriteRule /converto/(.+)/([^\?].*) /converto/?LCID=$1&bibo=$2 [I]
RewriteRule /converto/(.+)/ /converto/?LCID=$1 [I]
RewriteRule /converto/(.+)/(.*\?)(.*) /converto/?LCID=$1\&$3 [I]

# Replace "_" with "+" and "+o+" with "+&+"
RewriteRule /converto/(.+)(?:\_)(.+) /converto/$1+$2 [I]
RewriteRule /converto/(.+)(?:\Q+o+\E)(.+) /converto/?bibo=$1+%26+$2 [I]

# No more changes to URL with QueryString.
RewriteRule /converto/(.*\?.*) /converto/$1 [I,L]

# Finish handling converto URLs.
RewriteRule /converto/(.+) /converto/?bibo=$1 [I,L]

# Replace 'porta:postea' with 'sono' and 'porta:posteri' with 'sordeo'
RewriteRule (.*)porta:postea(.*) $1sono$2
RewriteRule (.*)porta%3apostea(.*) $1sono$2
RewriteRule (.*)porta:posteri(.*) $1sordeo$2
RewriteRule (.*)porta%3aposteri(.*) $1sordeo$2

# Remove /hilaris/ if requesting w or p..
RewriteRule ^(?:/hilaris)(/w/.*)$ $1
RewriteRule ^(?:/hilaris)(/p/.*)$ $1

#Special case
RewriteRule ^/hilaris/(\?.*)$ /bellum.aspx$1

# Sort out relative links, if a file is requested, path should start att root.
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/w/|/p/|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/|/sono/|/sordeo/)(?=.*\.)(?:.*)((?:/pic/|/css/|/images/|/js/|/xml/)[^\?]*)\??$ $1 [I,L]
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/w/|/p/|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/|/sono/|/sordeo/)(?=.*\..*\?.*).*(\/[^\.]+\..*)$ $1 [I,L]

# Replace "&" with "%26" if request is not for a file (contains dot)
# unless it's a request for a file under /upload or /images
RewriteRule (^\/upload/.*$) $1 [I,L]
RewriteRule (^\/images/.*$) $1 [I,L]
RewriteRule (.*)_o_(.*) $1%26$2
RewriteRule (.*)_e_(.*) $1%2f$2
RewriteRule (.*)_l_(.*) $1%3d$2
RewriteRule (.*)_p_(.*) $1%2e$2

# URLs that starts with sono goes to xiphias.aspx
RewriteRule ^/sono/([^\?\/]+)/sordeo/([^\?\/]+)/*$ /xiphias.aspx?sono=$1&sordeo=$2 [I,L]
RewriteRule ^/sono/([^\?\/]+)/*$ /xiphias.aspx?sono=$1 [I,L]

# URLs that starts with sordeo goes to bellum.aspx
RewriteRule ^/sordeo/([^\?\/]+)/*$ /bellum.aspx?sordeo=$1 [I,L]

# /p/this+that/IEfwNOr%2048MdEe4e6PO%20Ig==?sono=this+that&id=51310712  ->  /voraxvulgo.aspx?sono=this+that&id=51310712&dbid=IEfwNOr 48MdEe4e6PO Ig==
RewriteRule ^(?!.*vorax.*)/w/([^\/\?]*)/*$ /voraxvulgo.aspx?dbid=$1
RewriteRule ^(?!.*vorax.*)/p/([^\/\?]*)/*$ /voraxvomer.aspx?dbid=$1
RewriteRule ^(?!.*vorax.*)/w/[^\/\?]*/([^\/\?]*)/*$ /voraxvulgo.aspx?dbid=$1
RewriteRule ^(?!.*vorax.*)/p/[^\/\?]*/([^\/\?]*)/*$ /voraxvomer.aspx?dbid=$1
RewriteRule ^(?!.*vorax.*)/w/[^\/\?]*/([^\/\?]*)/(.*)$ /voraxvulgo.aspx$2&dbid=$1
RewriteRule ^(?!.*vorax.*)/p/[^\/\?]*/([^\/\?]*)/(.*)$ /voraxvomer.aspx$2&dbid=$1
RewriteRule ^(?=.*vorax.*)/w/[^\/\?]*/([^\/\?]*)/(.*)$ /$2&dbid=$1
RewriteRule ^(?=.*vorax.*)/p/[^\/\?]*/([^\/\?]*)/(.*)$ /$2&dbid=$1
# When only dbid, no objectname provided
RewriteRule ^/w/([^\/\?]*)/$ /voraxvulgo.aspx?dbid=$1==
RewriteRule ^/p/([^\/\?]*)/$ /voraxvomer.aspx?dbid=$1==
# Remove unnecessary '=' if there are four of them in the end, should only be two
RewriteRule ^(?=.*====)(.*)==$ $1
# When only dbid, no objectname provided. With more parameters.
RewriteRule ^/w/([^\/\?]*)/\?*(.*)$ /voraxvulgo.aspx?dbid=$1&$2
RewriteRule ^/p/([^\/\?]*)/\?*(.*)$ /voraxvomer.aspx?dbid=$1&$2
# If double dbid, remove the last one.
RewriteRule ^(.*dbid=.*)(?:&dbid=[^\&]*)(.*)$ $1$2

# Add '?' if not found.
RewriteRule ^(?!^.*\?.*$)(?=/d/*|/websitename/*|/websitenamekomb/*|/universi/*|/vespillo/*|/hilaris/*|/territo/*|/furtum/*|/route/*|/ustilo/*)(.*)$ $1?

# First time, when there is no QueryString.
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/)/([^\?\/]+)/([^\?\/]+)/([^\?\/]+)/([^\?]+)/*\?$ /$1/$4?$2=$3 [I]

# Rewrite all parameters, as long as there is one more after the current. Iterative.
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/)/([^\?\/]+)/([^\?\/]+)/([^\?\/]+)/([^\?]+)/*\?(.+)$ /$1/$4?$5&$2=$3 [I]

# When only one parameter left .
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/)/([^\?\/]+)/([^\?\/]+)/([^\?\/]+)/*\?([^\?]+)$ /$1?$4&$2=$3 [I]

# When there is only one parameter to begin with.
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/)/([^\?\/]+)/([^\?\/]+)/([^\?\/]+)/*\?$ /$1?$2=$3 [I]

# If there is an uneven no of parameters it means that one parameter has no value. Then remove it.
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/)/([^\?\/]+)/(?:[^\?\/]*)/*(\?[^\?]*)*$ /$1$2

# Remove the first parameter if there are more. Occurs in a special use case.
RewriteRule ^(/hilaris/.*)(?:sordeo/[^\/]*)(.*sordeo/.+)$ $1$2

# Rewrite to the proper file.
RewriteRule ^/default(\?.*)$ /default.aspx$1
RewriteRule ^/websitename(\?.*)$ /xiphias.aspx$1
RewriteRule ^/websitenamekomb(\?.*)$ /quandrum.aspx$1
RewriteRule ^/ustilo(\?.*)$ /voraxSpace.aspx$1
RewriteRule ^/vdc(\?.*)$ /voraxPutus.aspx$1
RewriteRule ^/hilaris(\?.*)$ /bellum.aspx$1
RewriteRule ^/territo(\?.*)$ /promptus.aspx$1
RewriteRule ^/furtum(\?.*)$ /promitto.aspx$1
RewriteRule ^/pictor(\?.*)$ /pictor.aspx$1
RewriteRule ^/universi(\?.*)$ /nefasvulgo.aspx$1
RewriteRule ^/vespillo(\?.*)$ /nefasvomer.aspx$1

# Quickfix, remove extra postea created by the app when it thinks the root is at wrong level because of rewrite.
# For example, it thinks the root is /sm/sono/objectname/sordeo/
# Which makes the postback URL look like /sm/sono/objectname/sordeo/xiphias.aspx?...
RewriteRule ^(.*porta%3Apostea=.+)(?:porta(?:%3A|:)postea=[^\/\&]*)(.*)$ $1$2
RewriteRule ^(.*)(?:sordeo=bellum.aspx)$ $1

Nov 19, 2008 at 12:00 PM
Edited Nov 19, 2008 at 12:03 PM
Okay, I think I might have some more useful input. I retested with a really long URL, which earlier versions of IIRF had problems with. Now it's rewritten as it should be and it works fine with TestDriver.exe, but when I stress the server with this URL the dips in traffic are obvious.

This is the URL (anonymized in the same format as the rules in the post above):


And when using only this URL to stress the server, it looks like this:
Screenshot from the PerfMon log when stress-testing with the long URL

This URL works with the rules above and it matches this rule:
RewriteRule ^(?=/default/|/websitename/|/websitenamekomb|/w/|/p/|/hilaris/|/territo/|/furtum/|/route/|/ustilo/|/universi/|/vespillo/|/sono/|/sordeo/)(?=.*\.)(?:.*)((?:/pic/|/css/|/images/|/js/|/xml/)[^\?]*)\??$ $1 [I,L]

Any ideas?