Replace characters in URL and then redirect to modified URL

Topics: Developer Forum, User Forum
Jan 10, 2011 at 10:02 PM

The format of my URLs has changed, and I don't want to render my previous URLs dead, so I want to reformat the old URLs to the new format and issue a permanent 301 redirect response.

The reformatting needs to look like this:

  1. Underscores "_" must be replaced with forward slashes "/"
  2. However, the first underscore (after the base URL) "/_" must be changed to "/index/"
  3. Sometimes the URL will contain three hyphens "---" and this must be changed to a single hyphen "-"

Here are example before and after URL formats:

BEFORE:
http://www.example.com/_category-1-with-text_cateogory-2---with-text_category-3-with-text_some-product---name.htm

AFTER:
http://www.example.com/index/category-1-with-text/cateogory-2-with-text/category-3-with-text/some-product-name.htm
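
For reference, the intended mapping can be written down as a small Python function. This is only a sketch of the desired before/after behaviour, not IIRF rule syntax; the name `reformat_path` is made up for illustration, and it assumes that only paths starting with "/_" are old-style URLs.

```python
def reformat_path(path: str) -> str:
    """Map an old-style path to the new format (hypothetical helper)."""
    if not path.startswith("/_"):
        return path                  # assumed: not an old-style URL, leave it alone
    rest = path[2:]                  # drop the leading "/_"
    rest = rest.replace("---", "-")  # item 3: three hyphens -> one hyphen
    rest = rest.replace("_", "/")    # item 1: remaining underscores -> slashes
    return "/index/" + rest          # item 2: leading "/_" -> "/index/"


old = ("/_category-1-with-text_cateogory-2---with-text"
       "_category-3-with-text_some-product---name.htm")
print(reformat_path(old))
# prints /index/category-1-with-text/cateogory-2-with-text/category-3-with-text/some-product-name.htm
```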

I've tried for hours reading through documentation, but cannot figure it out...  Thank you in advance for any help or instruction.

-Scott

Coordinator
Jan 10, 2011 at 11:20 PM
Edited Jan 10, 2011 at 11:21 PM

ok, what did you come up with after the hours of studying the doc?  show me what you tried.

Jan 10, 2011 at 11:45 PM
Edited Jan 10, 2011 at 11:45 PM

Well, I've been reading through the discussions and documentation, and I eliminated most possibilities before actually trying them out.  My problem is that I think only multiple rewrites can do what I want, but I want it done in one redirect with a 301 so that the search engines see the 301 for that particular URL.

There's not a lot to show for my studying, as I'm having a tough time grasping the concept.  I haven't been able to figure out where to start, much less if it's even possible (but I'm sure it is, I just can't think of it).

About an hour ago I found the section in the documentation on the "RewriteCond" which I think might be a clue for me, but .

So far, I've tried the following to eliminate the first /_, but I can't get rid of the other -, since the URL is now redirected.  I don't want to create multiple RedirectRules, since I want the first one to send the 301 response so the search engines update the record.

Replace /_ with /index/
RedirectRule ^/_(.*) /index/$1

I tried applying a similar approach to get rid of the others, but the strings don't match the URL
RedirectRule ^/_(.*)^!/_(.*)^---(.*) /index/$1$2$3

I'm now trying to understand the RewriteCond section, and this is where I'm starting:
#RewriteCond $1 /_ OR
#RewriteCond $2 _ OR
#RewriteCond $3 --- OR

#RedirectRule ^/_(.*)^!/_(.*)^---(.*) $1$2$3

Anyway, I'm trying to figure this out, any help would be appreciated.


Coordinator
Jan 11, 2011 at 2:09 AM

I think something like this ought to get you close:

RewriteRule ^/(.+)---(.+)$   /$1-$2
RewriteRule ^/(.+)_(.+)$     /$1/$2
RedirectRule ^/_(.*)$        /index/$1   [R=301]

Let's look at how that ruleset works, using an input URL of http://www.example.com/_category-1-with-text_cateogory-2---with-text_category-3-with-text_some-product---name.htm.

The first RewriteRule substitutes one dash for three dashes. It replaces only one sequence of three dashes per pass. Because the leading (.+) is greedy, that is the last sequence in the URL, so the result is http://www.example.com/_category-1-with-text_cateogory-2---with-text_category-3-with-text_some-product-name.htm.

In IIRF, if a rule fires (when I say "fires", I really mean, the rule is applied because the regex for the rule matches the incoming URL), then the result of the rewrite gets passed through the filter again, starting at the first rule.  So the result of the first rewrite is run through the rules again.  Because there is still a sequence of three dashes in the resulting URL, the first rule fires again. Result: http://www.example.com/_category-1-with-text_cateogory-2-with-text_category-3-with-text_some-product-name.htm.

Again, that value is run through the rules. The first rule no longer fires - the regex doesn't match - since there is no remaining sequence of three dashes. At this point, IIRF will evaluate the URL against the second rule. The 2nd rule fires, because its regex matches.  This rule replaces an underscore with a slash (as long as the underscore doesn't appear at the beginning of the URL, because that's what the regex says).  This happens several times in succession, one underscore per pass, just as with the three-dash replacement. After all the underscores (except for the first one) are replaced, the URL stands at: http://www.example.com/_category-1-with-text/cateogory-2-with-text/category-3-with-text/some-product-name.htm.

ok, then that URL is run through the rules, and finally the 3rd rule - this one a 301 Redirect - fires, because it matches any URL with a leading underscore.  It replaces the initial underscore with "index/".  The result is http://www.example.com/index/category-1-with-text/cateogory-2-with-text/category-3-with-text/some-product-name.htm.  Because it's a redirect, processing of rules stops immediately.  The 301 is returned to the browser.
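
If it helps to see the loop in action, here is a rough Python simulation of the behaviour described above. It is not IIRF code: the `RULES` table, the `run_rules` function and the `limit` safety cap are invented for this sketch, and it assumes Python's `re` behaves like the filter's regex engine for these three patterns.

```python
import re

# (kind, pattern, replacement), mirroring the three rules in the ruleset above
RULES = [
    ("rewrite",  r"^/(.+)---(.+)$", r"/\1-\2"),
    ("rewrite",  r"^/(.+)_(.+)$",   r"/\1/\2"),
    ("redirect", r"^/_(.*)$",       r"/index/\1"),
]


def run_rules(path, limit=50):
    """Return (final_path, status, trace); status is 301 if the redirect fired."""
    trace = [path]
    for _ in range(limit):               # hypothetical cap against endless loops
        for kind, pattern, repl in RULES:
            match = re.match(pattern, path)
            if match is None:
                continue                 # rule does not fire, try the next one
            path = match.expand(repl)
            trace.append(path)
            if kind == "redirect":
                return path, 301, trace  # a redirect stops processing at once
            break                        # a rewrite fired: restart at rule 1
        else:
            return path, None, trace     # no rule matched, nothing more to do
    return path, None, trace


old = ("/_category-1-with-text_cateogory-2---with-text"
       "_category-3-with-text_some-product---name.htm")
final, status, trace = run_rules(old)
for step in trace:
    print(step)
print(status, final)
```

The trace shows one replacement per pass. The order in which the individual occurrences get replaced depends on regex greediness; the final path does not.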

I'd suggest you understand and test this ruleset thoroughly before you actually use it. 

Jan 11, 2011 at 2:20 AM

Wow.  Thank you, I really appreciate your thoughtful response.  I'll give your ideas a try. I didn't realize the rewrite rules could be applied without actually returning the values to the browser, and I didn't think the redirect rule would be applied to a rewritten URL.

I'll give this a try and let you know how it works.

Of course, I wonder whether this process actually sends the 301 response for the old URL to Googlebot (or whatever search engine sent the visitor to my site), or whether it sends a 301 response for a modified URL.  I'm not sure how that part works, but I'll do some testing on that as well and post my results.

Thank you again for addressing this and providing a specific and detailed explanation.

-Scott

Coordinator
Jan 11, 2011 at 3:58 AM

You should read up on the basics of Rewriting and Redirecting.  There's a good overview in the IIRF documentation. It explains how rewriting works and why googlebot (or whoever) is not aware that a rewrite has occurred.   That would seem to pretty clearly answer the thing you are wondering about.

But wait, you said you read the documentation.  So.... gee, how did you miss that page? 

I'm not sure how you could have missed that page, because you spent hours reading through the documentation.  It's one of the first few pages. Also, there are maybe 20 different places in the documentation that link back to that page, because the concepts on that page are so fundamental.  Without understanding those basics, you can't get very much done with a rewriter, and because I was aware of that, the documentation material includes many references back to that page.  It's really puzzling to me that you would have missed that important page, because no matter where you read in the doc, if you spent 10 minutes, let alone several hours, you would have come upon that page.  I'm sure you did read the documentation for hours, which is why it's so puzzling that you would have missed this fundamental page.

Another thing you weren't clear on - that a redirect rule gets applied to the output of a rewritten URL - is also covered in one of the first few pages of the IIRF documentation.  This is another fundamental concept, and just as with the idea behind a rewrite, there are many locations in the doc that refer back to this description.  Many many locations, including many of the examples that I have spent time creating.  Here again, I am absolutely baffled that you could have missed this, after spending, as you said, literally hours reading the documentation.   

By the way - after you put so much time into reading it, I'm sure you have suggestions on making the IIRF documentation more comprehensible;  I'll be very interested to hear them.  I know you spent a lot of time already, but if you could just provide some feedback, I'd appreciate it.  It's so nice when people put some of their own effort into community projects, rather than depending on other people to do all the work!  It's just a matter of basic respect!  I'm sure you'd agree.