Space character in URL: rewrite appears correct, 404 returned

Topics: Developer Forum, User Forum
Mar 3, 2010 at 6:08 PM

I'm having a problem when the URL I'm parsing includes spaces. Based on what's being reported in the log files I believe the rules are being parsed correctly and the URL IIRF says it is returning looks correct. However, instead of seeing the document I get a 404 error. The final URL logged by IIRF works fine if I access it directly.

The ruleset is fairly straightforward.

# If the URL references a file or directory in the webroot then do nothing
RewriteCond %{APPL_PHYSICAL_PATH}$1 -f [OR]
RewriteCond %{APPL_PHYSICAL_PATH}$1 -d
RewriteRule ^/path/to/cakesite/(app/webroot.*)$ - [I,L]

# If the URL does not reference a file or directory, and if the URL does not point to the webroot
# rewrite the URL to insert the path to the webroot
# note: this rule is not needed if the site/vdir root points to the CakePHP webroot
RewriteCond %{APPL_PHYSICAL_PATH}$2 !-f
RewriteCond %{APPL_PHYSICAL_PATH}$2 !-d
RewriteCond $1 !^/path/to/cakesite/app/webroot.*$
RewriteRule ^(/path/to/cakesite/(.*))$ /path/to/cakesite/app/webroot/$2 [I]

So to review what the ruleset is doing: if I try to access "/path/to/cakesite/some random.pdf", IIRF rewrites the URL to "/path/to/cakesite/app/webroot/some random.pdf". I can't see anything in the logs that would indicate what's going wrong. Is there any way to see exactly what URL is being accessed after IIRF has finished parsing the ruleset? Could there be some discrepancy between what's reported in the log and what's returned to the server?

My problem seems to be related to a problem reported in a previous discussion, but the solution there does not work. I've tried turning off URL decoding, but this causes the file check to fail because the path being checked then includes the encoded characters. I've also tried, with decoding enabled, using an encoded space (%20 or +) in the URL, but am seeing no improvement.

Mar 4, 2010 at 3:33 AM

The URL you see in the IIRF logfile is the final URL returned to the server. There's no discrepancy possible.

I do think there's a problem with handling URLs with spaces. According to IETF RFC 2396 (URI), which is referenced by IETF RFC 2616 (HTTP), a space is not a legal character in a URI and must be escaped as %20; the + form comes from HTML form encoding and is commonly accepted in query strings. I think I may have said this in the other discussion you referenced. But if you're getting unescaped spaces then you may be forced to deal with them.
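
To make the two spellings of an encoded space concrete, here is a small Python sketch. It is not part of IIRF; it only uses Python's standard urllib.parse module on the file name from the question, and the variable names are just for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

name = "some random.pdf"  # file name taken from the question

percent_encoded = quote(name)    # space becomes %20
plus_encoded = quote_plus(name)  # space becomes + (form-style encoding)

# unquote() leaves a literal + alone; only unquote_plus() turns it back
# into a space. Which behavior a server applies to a path is up to the server.
decoded_plain = unquote(plus_encoded)
decoded_plus = unquote_plus(plus_encoded)

print(percent_encoded)
print(plus_encoded)
print(decoded_plain)
print(decoded_plus)
```

The two decoders disagree about +, so it is worth testing both %20 and + when checking how a given server treats the rewritten path.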

One way to work around this is to use different rules for URLs with spaces. It's not a general solution but it might be good enough for your purposes.

# handle URLs without spaces
RewriteRule ^(/path/to/cakesite/(?!app/webroot)([^\x20]+))$ /path/to/cakesite/app/webroot/$2  [I,L]

# handle URLs with one space
RewriteRule ^(/path/to/cakesite/(?!app/webroot)(([^\x20]+)(\x20([^\x20]+))))$ /path/to/cakesite/app/webroot/$3+$5  [I,L]

Some explanation: The first rule includes a subpattern like this: (?!app/webroot). This is a non-capturing negative lookahead pattern. It doesn't capture. It's negative, and it looks ahead. It evaluates to true when the text at that position DOES NOT start with app/webroot, which is what I think you're trying to do with the RewriteCond $1 thing. So you could use that non-capturing negative lookahead in place of the final RewriteCond of the three.

Ok, now, getting to the important stuff: the subpattern ([^\x20]+) matches any sequence of one or more characters, none of which is a space (\x20). Therefore, the first rule above matches only URLs that have no spaces, and rewrites them appropriately.
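
If you want to check the pattern outside of IIRF, the first rule can be exercised with a short Python sketch. This assumes Python's re module treats the lookahead and \x20 the same way the regex engine inside IIRF does, and it uses re.IGNORECASE to stand in for the [I] flag. The function name rewrite_no_space is made up for this example.

```python
import re

# Same pattern as the "URLs without spaces" rule above.
NO_SPACE_RULE = re.compile(
    r"^(/path/to/cakesite/(?!app/webroot)([^\x20]+))$", re.IGNORECASE
)

def rewrite_no_space(url):
    """Return the rewritten URL, or None when the rule does not match."""
    m = NO_SPACE_RULE.match(url)
    if m is None:
        return None
    # $2 in the replacement string is capture group 2 here.
    return "/path/to/cakesite/app/webroot/" + m.group(2)

# Matches: no space, not already under app/webroot.
print(rewrite_no_space("/path/to/cakesite/report.pdf"))
# No match: the negative lookahead rejects URLs already under app/webroot.
print(rewrite_no_space("/path/to/cakesite/app/webroot/report.pdf"))
# No match: [^\x20]+ cannot cross the space.
print(rewrite_no_space("/path/to/cakesite/some random.pdf"))
```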

The second rule above uses the same non-capturing negative lookahead. It then uses the same beginning capturing subpattern: a series of one or more characters, none of which is a space. This is followed by another capturing group, which begins with a space (\x20) and continues with another series of one or more characters, none of which is a space; that series is captured into its own group via parens (). The effect is to match and tokenize a URL that has a single embedded space. The capture groups we want to use in the replacement string are $3 and $5. Capture group $2 is everything after /cakesite/. We don't want to use that in the replacement string, because $2 includes the space, which is what you're having trouble with. Capture group $4 gets the space, as well as everything that follows the space. In other words, $4 is a space, plus $5. We don't want that either. We'll use $3 and $5 in the replacement, and insert a + between them. The plus replaces the space in the original URL. It will be decoded properly as necessary by IIS.
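
The group numbering is easier to see when it is printed out. The following is a rough Python sketch, again assuming Python's re module behaves like IIRF's regex engine for this pattern; the names ONE_SPACE_RULE, groups and rewritten are only for illustration.

```python
import re

# Same pattern as the "URLs with one space" rule above.
ONE_SPACE_RULE = re.compile(
    r"^(/path/to/cakesite/(?!app/webroot)(([^\x20]+)(\x20([^\x20]+))))$",
    re.IGNORECASE,
)

m = ONE_SPACE_RULE.match("/path/to/cakesite/some random.pdf")

# Collect capture groups 1 through 5, numbered by opening parenthesis.
groups = {n: m.group(n) for n in range(1, 6)}
for n, text in groups.items():
    print("$%d = %r" % (n, text))

# Replacement string from the rule: .../app/webroot/$3+$5
rewritten = "/path/to/cakesite/app/webroot/" + m.group(3) + "+" + m.group(5)
print(rewritten)
```

Groups $2 and $4 both contain the space, which is why only $3 and $5 appear in the replacement.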

That works for URLs with a single space. You'd do the same thing for URLs containing more than one space, adding one more space-plus-segment group per extra space. This is what I mean by "not general": you have to have a dedicated rule for each number of spaces in the incoming URL.
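
If you end up needing several of these, the rule text could be generated instead of typed by hand. Here is a minimal Python sketch; the helper rule_for_spaces is hypothetical and is not something IIRF provides. Note that it uses a flattened variant of the pattern with one capture group per segment, so the replacement uses $1, $2, $3 and so on rather than the nested numbering in the rules above. The generated rules are untested against IIRF itself.

```python
def rule_for_spaces(n):
    """Build a RewriteRule line for URLs containing exactly n spaces (n >= 0)."""
    segment = r"([^\x20]+)"  # one run of non-space characters
    # n spaces separate n + 1 segments.
    pattern = (
        r"^/path/to/cakesite/(?!app/webroot)"
        + r"\x20".join([segment] * (n + 1))
        + "$"
    )
    # Join the captured segments with + in the replacement.
    replacement = "/path/to/cakesite/app/webroot/" + "+".join(
        "$%d" % i for i in range(1, n + 2)
    )
    return "RewriteRule %s %s  [I,L]" % (pattern, replacement)

# Print rules for zero, one and two spaces.
for count in range(3):
    print(rule_for_spaces(count))
```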

The best thing to do is to use URLs with no spaces, or encode them and parse them properly, in compliance with RFC 2396.  But if you can't do that, these rules might help.


Mar 4, 2010 at 3:34 PM

Thanks for the feedback. Your skills with the RewriteRule regex are impressive!

I could never get the document to load, even with your suggested ruleset. The log file indicated that the URL was being rewritten correctly, but I was still getting the 404 error. I think I'll just take the sane route and rename the files so that they don't contain spaces.

Thanks for your advice on the issue. If I have more time in the future I may try to tackle this again.