Nested Capturing Groups

Topics: Developer Forum
Dec 11, 2009 at 10:13 PM

How do I access nested capturing groups?

 

For example:

CondSubstringBackrefFlag ~

RewriteCond %{HTTP_URL} ^(?!(/[a-z]{2}(-|%2D)[a-z]{2}/|/App_Themes/|/app/|/__utm.gif|/setCookie.aspx)).*$    [I]
RewriteCond %{QUERY_STRING}    ^$
RewriteCond %{HTTP_COOKIE} ^.*Tld=.*(/(fl)-be/|/(fr)-ca/).*$    [I] #add check here for non default langs
RewriteRule ^/(.*)$    ~1$1   [I,L]

The third RewriteCond has an outer capture group containing an OR, and each branch of the OR contains its own capture group.

In the RewriteRule, how can I access the two inner capture groups? Are they referenced as ~2?

Thanks

 

Coordinator
Dec 12, 2009 at 2:51 AM
Edited Dec 12, 2009 at 2:54 AM

Ryan, yes, you just keep incrementing the index.

I would have to check the PCRE doc to verify, but I believe it will assign a group number to the outer group, then any inner capture groups, then the next outer group, and so on.

(1(2))(3)(4(5)) 
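
As a quick way to see that numbering in action, here is a minimal sketch in Python. It assumes Python's `re` module, which is not PCRE but numbers groups by the same left-to-right count of opening parentheses. The pattern and input string are made up for illustration and only mirror the shape of (1(2))(3)(4(5)).

```python
import re

# Hypothetical pattern with the same nesting shape as (1(2))(3)(4(5)).
pattern = re.compile(r"(a(b))(c)(d(e))")
match = pattern.match("abcde")

# groups() lists the captures in index order, starting at group 1.
numbered = dict(enumerate(match.groups(), start=1))
print(numbered)  # {1: 'ab', 2: 'b', 3: 'c', 4: 'de', 5: 'e'}
```

The outer group gets its number before the group nested inside it, because its opening parenthesis comes first.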

In some cases a capture is optional, for example when the group has a ? quantifier. In that case the $n reference will return the empty string, while $(n+1) may return a non-empty string.
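
The optional-group case can be sketched the same way. This assumes Python's `re` module rather than the filter itself, and the pattern is hypothetical. Python reports an unset group as None when queried directly, but expands it to an empty string inside a replacement (Python 3.5 and later), which is the closer analogue of the $n behavior described above.

```python
import re

# Hypothetical pattern: group 1 is optional, group 2 is required.
pattern = re.compile(r"(a)?(b)")

match = pattern.match("b")
first, second = match.group(1), match.group(2)  # group 1 did not take part

# In a replacement, the unset group 1 expands to "" (Python 3.5 and later).
rewritten = pattern.sub(r"[\1][\2]", "b")
print(first, second, rewritten)  # None b [][b]
```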

In your case, there's an OR. So the group that captures fl does not have the same index as the group that would capture fr. I believe fl would be placed in ~2, while fr would be placed in ~3. One of them will be empty, regardless of which branch matches.

You should be able to verify this pretty easily.
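
One way to check the indices outside the filter is a small Python script. This is only a sketch: it assumes Python's `re` module numbers groups the same way PCRE does for this pattern, it maps the [I] flag to `re.IGNORECASE`, and the cookie strings and the `cookie_groups` helper are made up for the test.

```python
import re

# The third RewriteCond pattern from the question; [I] maps to IGNORECASE.
COOKIE_RE = re.compile(r"^.*Tld=.*(/(fl)-be/|/(fr)-ca/).*$", re.IGNORECASE)

def cookie_groups(cookie):
    """Return groups 1..3 for a cookie string, or None if it does not match."""
    match = COOKIE_RE.match(cookie)
    return match.groups() if match else None

# Hypothetical cookie values, invented for this check.
print(cookie_groups("Tld=/fl-be/"))  # ('/fl-be/', 'fl', None)
print(cookie_groups("Tld=/fr-ca/"))  # ('/fr-ca/', None, 'fr')
```

Python shows the group that did not take part as None, where the rewrite filter would substitute an empty string.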

 

Coordinator
Dec 12, 2009 at 3:10 AM

I just checked the PCRE doc, and I was correct about the indices. The wording PCRE uses is:

Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

In your example, if you want to retrieve either fl or fr, then you would have to use ~2~3 in the replacement string. ~2 will always be either fl or empty. ~3 will always be either empty or fr. Therefore when you concatenate them, ~2~3 will always be either fl or fr.
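
The concatenation trick can be sketched in Python as well. The assumptions: Python's `re` module stands in for PCRE, the template `\2\3` plays the role of ~2~3, and `language_code` is a hypothetical helper name. `Match.expand` replaces an unset group with an empty string in Python 3.5 and later.

```python
import re

# The third RewriteCond pattern from the question; [I] maps to IGNORECASE.
COOKIE_RE = re.compile(r"^.*Tld=.*(/(fl)-be/|/(fr)-ca/).*$", re.IGNORECASE)

def language_code(cookie):
    """Concatenate groups 2 and 3, the analogue of ~2~3; "" if no match."""
    match = COOKIE_RE.match(cookie)
    if match is None:
        return ""
    # expand() substitutes an unset group with "" (Python 3.5 and later).
    return match.expand(r"\2\3")

# Hypothetical cookie values, invented for this check.
print(language_code("Tld=/fl-be/"))  # fl
print(language_code("Tld=/fr-ca/"))  # fr
```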

 

Coordinator
Dec 12, 2009 at 3:15 AM
Edited Dec 12, 2009 at 3:16 AM

Wait, there's more. I'm reading the PCRE documentation as I provide updates here.

       [There is] a feature whereby each alternative in a subpattern uses
       the same numbers for its capturing parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern. For example,
       consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you can look at captured substring number one, whichever alternative
       matched. This construct is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group,
       parentheses are numbered as usual, but the number is reset at the start
       of each branch. The numbers of any capturing buffers that follow the
       subpattern start after the highest number used in any branch.

So, you could use /(?|(fl)-be|(fr)-ca)/ instead. In this case, the outer parens are non-capturing. Regardless of which branch of the OR matches, the fl or fr is always captured by ~1.
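
For anyone testing this outside the filter: as far as I know, Python's standard `re` module does not implement the (?| construct, so the sketch below is an emulation, not the PCRE feature itself. It uses a plain non-capturing group and then picks whichever inner group took part via `Match.lastindex`, which gives the same "one reference, either branch" result. The helper name `captured_language` and the sample paths are made up.

```python
import re

# Non-capturing outer group; (fl) is group 1 and (fr) is group 2.
LANG_RE = re.compile(r"/(?:(fl)-be|(fr)-ca)/", re.IGNORECASE)

def captured_language(text):
    """Return whichever inner group took part in the match, or None."""
    match = LANG_RE.search(text)
    if match is None:
        return None
    # lastindex is the index of the last capturing group that matched.
    return match.group(match.lastindex)

# Hypothetical inputs, invented for this check.
print(captured_language("/fl-be/"))  # fl
print(captured_language("/fr-ca/"))  # fr
```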

 

Dec 14, 2009 at 2:57 PM

Perfect! That is exactly what I was looking for. Thanks again for the excellent support!

Ryan