How to optimize IIRF to handle a huge list of rewrites

Topics: Developer Forum, Project Management Forum, User Forum
May 5, 2010 at 10:58 AM

Hi,

We have a huge number of rewrites - the rewrites are all predefined and unique, and can't be expressed with regex - for example:

www.domain.com/hello www.domain.com/pages/page.aspx?id=1

www.domain.com/goodbye www.domain.com/pages/page.aspx?id=2

etc. etc.

The number of rewrites can reach into the millions.

1. Is the fact that we are not using regex helping performance or making it worse?

2. Does IIRF handle this number of rewrite rules well?

3. What is the maximum number of rules for reasonable performance? (Yeah, I know it's hard to estimate, but even rough estimates will help us at this stage.)

4. With so many lines to write into the iirf.ini, writing and reading the file becomes a lengthy process - will this cause delays in IIRF every time it attempts to read the file, or is the reading done in a separate thread? By the way, how frequently does it attempt to read the file?

5. Would writing code that connects to our SQL database and reads the table that contains all the rewrites be a good idea or a bad one?

Thanks in advance,

Eytan

Coordinator
May 5, 2010 at 1:22 PM

1. I have no idea.  You'd have to test it, but I don't know what you would compare to.  "worse" than what?

2. I have never tested it this way.  Millions of URLs on a single server is out of the normal range, in my experience.

3. I don't know.  I surely don't test IIRF this way.  It sounds like you could, though.

4. IIRF reads the vdir-specific ini file on the request thread.  For each request, IIRF checks the timestamp on the file. If the file is changed, then IIRF re-reads the configuration.  The latency of reading the configuration will be incurred on the HTTP Request.  If you have millions of lines, then you'll want to avoid changing the ini file during normal server operation.

5. I don't know exactly what you're asking here.   "Would writing the code... be a good idea or a bad one?"  I don't know what code you mean, where it would run, and I'm not sure how you're judging whether it's a good idea.


If I were you, I would first try RewriteMap.  It is designed specifically for cases where regex is inappropriate.  If your map of incoming to outgoing URLs is defined in the data of a SQL Server table, then I'd try to automatically generate the map txt file, in the format required by IIRF, from a SQL Server query.  If this doesn't work - if the performance of IIRF when using millions of lines in the map is unacceptable - then the next thing I'd do, if I were you, is modify the IIRF source code to do the SQL Server query directly.  This would involve defining a new map type.  Currently there are two: "txt", implying a text file with a simple lookup, and "rnd", implying a text file with a random mapping between options.  I'm imagining a new map type, "sql", which does a lookup in the database for each incoming URL.  Maybe this is what you were intending by your question #5.
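To make that concrete, here is roughly what the map-based approach looks like.  Check the RewriteMap documentation for the exact directive syntax; the file names, paths, and the not-found fallback below are just placeholders.  In iirf.ini:

    RewriteMap urlmap txt:c:\inetpub\maps\urlmap.txt
    RewriteRule ^/([^/?]+)$ ${urlmap:$1|/pages/notfound.aspx} [L]

And in urlmap.txt, one key/value pair per line, which you could regenerate offline from your SQL table (with bcp or sqlcmd, say) and swap in during a maintenance window, given what I said in point 4 above:

    hello /pages/page.aspx?id=1
    goodbye /pages/page.aspx?id=2

With that in place, a request for /hello gets rewritten to /pages/page.aspx?id=1, and anything not present in the map falls through to the default given after the | character.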

Performing a SQL query to map one URL to another will surely be faster than the simple text lookup that is currently supported in IIRF.  Let me describe how the txt map works today:  The map is a simple table of keys and values.  All maps in IIRF are kept completely in working memory.  If you have 1 million rows in your map, in IIRF, there will be 1 million pointers.  (Will it scale to this level?  I don't know.  Each item in the map consumes 14 bytes, plus the length of the input and output strings you map.  If you have a million items, and each input string and output string averages 32 bytes, that's (14 + 32 + 32) bytes per item, or roughly 80 MB in memory for that data structure.  It's large but not impossibly so.)

A map is used in a RewriteRule, to generate the replacement or "output" string.  For each URL that is processed with a map, IIRF successively works through each map item. For each item, IIRF compares the incoming URL (or a portion of the incoming URL) to the key in the map item.  If there's a match, then IIRF emits the value in that map item.  If it doesn't match, then IIRF goes to the next map item.   Read the documentation for more info on RewriteMap. 

Doing a SQL query to perform this map will likely be much faster than IIRF's built-in lookup as you scale to millions of rows.  SQL Server has an intelligent query engine, indexing, caching, and optimization strategies that are not implemented in IIRF's simplistic approach.  There's a downside in using multiple processes, and the corresponding context switches implied, but I'm guessing the performance benefit of SQL Server would more than compensate.  As I said, I haven't tested this, so I don't know for sure.

Here's an old article that describes how to connect from a C application like IIRF to SQL Server: http://msdn.microsoft.com/en-us/library/ms811006.aspx

The last thing I'd suggest: if, after trying the SQL-driven map, you find that the performance is still not acceptable, you will want to scale out to multiple IIS servers.  This is easily done with request routing, which you can also do with IIRF.  All the requests for URLs that begin with [A-E] can go to server 1, all requests beginning with [F-M] can go to server 2, and so on.  You can use as many clones as you need, and bucket the requests in any way you like.  Then the same URL mapping can be done on each particular "clone" server.
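A rough sketch of what the routing could look like on the front-end box, using IIRF's proxy support (the internal hostnames are made up, and you should check the ProxyPass syntax against the documentation for the IIRF version you run; a RedirectRule would also do, if a visible redirect is acceptable):

    ProxyPass ^/([a-eA-E].*)$  http://server1.internal/$1
    ProxyPass ^/([f-mF-M].*)$  http://server2.internal/$1
    ProxyPass ^/(.*)$          http://server3.internal/$1

Each clone then carries only the slice of the map (or of the SQL table) for its own bucket, so the per-server lookup stays proportionally smaller.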


May 7, 2010 at 12:32 AM

Cheeso, that's a *really* good reply, I'm impressed.  However, I'm trying to think of a situation where you'd want to map millions of URLs that cannot be modeled with a reasonable set of rules.  Isn't that the point of regex, that it's so flexible it can handle almost any setup (it had better have some benefit, because it's sure nearly impossible to read ;-) )?  Usually the only situations I've come across that have millions of URLs are database-driven sites, and those shouldn't be too hard to set up standard regex rules for.  Even if you're using random-like IDs in the database to identify objects, you should be able to set up db queries that use other keys for lookup.

Coordinator
May 7, 2010 at 2:27 AM
Edited May 7, 2010 at 2:28 AM

Ha, thanks Randy.  I just started thinking about it and off I went.

Ya, I also had trouble envisioning a situation that would really require millions of records. 
I'm with you, I figure regex could be used to group common URLs.  But I've been surprised before.  So,...

The most plausible scenario I can imagine is some kind of unauthorized link or content syndicator.

Just to keep track, I wrote up the workitem for the database map - http://iirf.codeplex.com/WorkItem/View.aspx?WorkItemId=26978.
I haven't committed to doing the work, though. Not yet anyway.