Rewriting Google Queries

Topics: User Forum
Apr 12, 2011 at 4:34 AM

Yesterday I learned about regular expressions and my head hurt. Today I found this project, and I think it will do what I want if I can combine the two.

My plan was to have our internal DNS server serve a bogus record for Google, Google Images and Google Video.  The bogus record would point to one of our IIS6 servers.  I'm hoping that when they hit that internal site that it will immediately "bounce" them out to google in such a way that their next actions will also follow this path.

In my head I have come up with a couple of scenarios that would meet my needs, none of which I have the foggiest idea how to achieve:

1. Once at Google, if the user tries to disable safe search and perform a query, their HTTP request will contain &safe=off. I want to replace that with &safe=strict or &safe=active.

2. If the user's query contains &safe=off, I want to redirect them to a "splash/landing page" where I can display a message saying this violates corp policy.

3. If the domain is Google and the response includes images (jpg, jpeg) or videos (mpg, avi), drop those particular parts of the response.

An example of a google search with active filtering is:

http://www.google.com/webhp?hl=en&tbo=1&num=10&lr=&cr=&safe=active&tbs=

Google Image Search with Strict filtering on:

http://www.google.com/images?hl=en&safe=active&biw=1280&bih=868&gbv=2&tbm=isch&sa=1&q=none&btnG=Search&aq=f&aqi=&aql=&oq=

Google Image search with Moderate filtering:

http://www.google.com/images?hl=en&biw=1280&bih=868&gbv=2&tbm=isch&q=none&btnG=Search&aq=f&aqi=&oq=&uss=1

Same search with filtering turned off:

http://www.google.com/images?hl=en&safe=off&biw=1280&bih=868&gbv=2&tbm=isch&sa=1&q=none&btnG=Search&aq=f&aqi=&aql=&oq=

Again, I am a regular expression noob and would appreciate any advice about how I might solve this problem.  Google is the first domain; eventually I need to lock down Yahoo, Ask and Bing.  We already filter domain names against blacklists, but recently discovered how easy it is to circumvent those using Google.

Thanks very much

Ron

Coordinator
Apr 13, 2011 at 1:59 AM

> My plan was to have our internal DNS server serve a bogus record for Google, Google Images and Google Video.  The bogus record would point to one of our IIS6 servers.  I'm hoping that when they hit that internal site that it will immediately "bounce" them out to google in such a way that their next actions will also follow this path.

OK, my understanding is that you want to use IIRF on your internal corporate network to manage, and potentially modify, the search requests that internal users send to Google. Is that right?

One way to do that would be to use the ProxyPass feature of IIRF: all access to google.com will go through your IIRF server, and from there to an actual Google server.  It's simple to do this in IIRF:

  ProxyPass   ^/(.*)$    http://www.google.com/$1

> 1. Once at Google, if the user tries to disable safe search and perform a query, their HTTP request will contain &safe=off. I want to replace that with &safe=strict or &safe=active.

> 2. If the user's query contains &safe=off, I want to redirect them to a "splash/landing page" where I can display a message saying this violates corp policy.

Well, I don't understand this part.  You describe two distinct outcomes for the same scenario, a user sending safe=off: silently replacing that part of the URL, and redirecting to a warning page that states the corporate policy.  Which do you want?

Supposing you want to redirect them to a policy warning page, you would do this with a RewriteCond that checks for safe=off in %{QUERY_STRING}, like this:

  RewriteCond   %{QUERY_STRING}  safe=off
  RedirectRule  ^/.*$            /CorporateSearchPolicy.htm

These 2 lines would need to appear *before* the ProxyPass line I showed above, in the IIRF.ini file.
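
To see which of the sample URLs from the original post the query-string condition would catch, here is a minimal Python sketch. It is not IIRF itself: it assumes Python's `re` module behaves like IIRF's regex engine for a pattern this simple, and the function name `redirect_target` is made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Pattern from the condition above; it is unanchored, so it matches
# anywhere in the query string.
SAFE_OFF = re.compile(r"safe=off")
POLICY_PAGE = "/CorporateSearchPolicy.htm"  # page name taken from the rule above


def redirect_target(url):
    """Return the policy page if the query string contains safe=off, else None."""
    query = urlsplit(url).query
    if SAFE_OFF.search(query):
        return POLICY_PAGE
    return None


if __name__ == "__main__":
    samples = [
        "http://www.google.com/images?hl=en&safe=off&biw=1280&q=none",
        "http://www.google.com/images?hl=en&safe=active&biw=1280&q=none",
        "http://www.google.com/images?hl=en&biw=1280&q=none&uss=1",
    ]
    for url in samples:
        print(redirect_target(url), url)
```

Because the pattern is unanchored, it would also match a parameter such as `notsafe=off`. A stricter alternative, if that matters, is `(^|&)safe=off(&|$)`.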

Supposing you want to implicitly replace that setting, you would need to do this:

  RewriteRule  ^/(.*)safe=off(.*)$    /$1safe=active$2
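
The effect of that substitution can be tried outside IIRF with a short Python sketch. This is only an approximation under stated assumptions: it uses Python's `re` rather than IIRF's engine, applies the rule in a single pass, and the helper name `rewrite` is hypothetical.

```python
import re

# Pattern and replacement from the RewriteRule above. IIRF writes the
# back-references as $1 and $2; Python's re.sub spells them \1 and \2.
RULE = re.compile(r"^/(.*)safe=off(.*)$")
REPLACEMENT = r"/\1safe=active\2"


def rewrite(path_and_query):
    """Apply the rule once; return the input unchanged if it does not match."""
    return RULE.sub(REPLACEMENT, path_and_query, count=1)


if __name__ == "__main__":
    for sample in (
        "/images?hl=en&safe=off&biw=1280&q=none",
        "/images?hl=en&biw=1280&q=none&uss=1",
    ):
        print(rewrite(sample))
```

Two things are worth noticing. A URL with no safe parameter at all, like the "Moderate filtering" sample, passes through unchanged, so this rule alone does not force filtering on. And because the first `(.*)` is greedy, a single pass replaces only the last `safe=off` if the parameter appears more than once.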

> 3. If the domain is Google and the response includes images (jpg, jpeg) or videos (mpg, avi), drop those particular parts of the response.

I don't know what "drop those particular parts of the response" means. If you use IIRF to proxy a request, then the entire response is relayed to the original requester.  A web proxy like IIRF does not modify the proxied response in the way you are describing.

It is possible to simply prevent image or video searches on google.com using IIRF, so that the outgoing search for images (or videos, etc.) is never performed.  Supposing that image searches on Google ALWAYS go to www.google.com/images (I'm not sure this is how it works; it's just an example), the following rule will prevent image searches, provided you have the bogus DNS record you described above. The pattern allows a query string to follow /images directly, as in your sample URLs:

  RedirectRule  ^/images([/?].*)?$      /NoImagesOrVideos.htm

It is also possible to prevent retrieval of .avi, .jpg and .mpg URLs from a Google URL, if you use the bogus DNS record and set up IIRF appropriately.  For example,

  RedirectRule  ^/.*\.(avi|jpg|mpg).*$      /NoImagesOrVideos.htm

Once again, these lines must appear before the ProxyPass line.

This would work only if the requests for these .jpg or .avi (etc) URLs went to the google.com domain.
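
The ordering requirement, and what the file-extension pattern does and does not catch, can be sketched in Python. This is a toy model, not IIRF: the function `route` and the constant names are hypothetical, it assumes the proxy forwards the original path to the upstream host, and it uses Python's `re` with its default case-sensitive matching.

```python
import re

# File-extension pattern copied from the RedirectRule above.
BLOCKED_MEDIA = re.compile(r"^/.*\.(avi|jpg|mpg).*$")
BLOCK_PAGE = "/NoImagesOrVideos.htm"
UPSTREAM = "http://www.google.com"  # proxy target from the ProxyPass line


def route(path_and_query):
    """Return ("redirect", page) or ("proxy", upstream_url) for one request."""
    # Checked first, mirroring a RedirectRule placed above ProxyPass.
    if BLOCKED_MEDIA.match(path_and_query):
        return ("redirect", BLOCK_PAGE)
    # Nothing above matched, so the request falls through to the proxy.
    return ("proxy", UPSTREAM + path_and_query)


if __name__ == "__main__":
    for sample in (
        "/photos/cat.jpg",
        "/photos/cat.jpeg",
        "/photos/CAT.JPG",
        "/search?q=news.jpg.example.com",
    ):
        print(sample, "->", route(sample))
```

In this sketch `.jpeg` and upper-case `.JPG` slip through to the proxy, while a search whose query text merely contains `.jpg` is redirected, because the pattern matches the extension anywhere in the URL. Whether IIRF needs a case-insensitivity modifier for the same pattern is something to confirm in its documentation.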

I don't know the transaction protocol for an image search on Google. You showed some sample query requests, but I don't know what the responses look like or how you'd need to filter them. For that reason the above is just a guess; I can't make guarantees about which IIRF rules to use. But I do believe that URLs included in Google's search results point back to Google (for example, something like http://www.google.com/actualUrl=http://www.cnn.com/news/8273/image6272.jpg). Therefore, if you are proxying all Google access, there are IIRF rules you can use to prevent those requests from being made, and it won't be difficult to figure them out.

These rules, however, would not prevent a script-savvy person from creating a Greasemonkey script that extracted the actual URL from the Google results and retrieved it directly. Direct retrieval would bypass the IIS+IIRF server standing at the bogus google.com IP address in your network.

> Again, I am a regular expression noob and would appreciate any advice about how I might solve this problem.  Google is the first domain; eventually I need to lock down Yahoo, Ask and Bing.  We already filter domain names against blacklists, but recently discovered how easy it is to circumvent those using Google.

I think yahoo, bing, and ask.com use differently-shaped URLs, and the search results you obtain from those services do not all point back to the search domain. Instead, they point back to the original URL (for example, http://www.cnn.com/news/8273/image6272.jpg).  For this reason a simple proxy like IIRF will not be able to meet your requirements regarding "drop those particular parts of the response."

Bottom line, I don't think the fact that you are a "regular expression noob" is your main challenge here. Instead, I think you need to read up on proxy design and proxy architecture, and think more about your requirements, to figure out what you need to do and whether IIRF will be satisfactory for your purposes.

In most enterprises I am familiar with, they use an HTTP proxy for all inside-to-outside web access.  All requests to the external web (not only requests to Google) from the internal corporate network pass through the HTTP proxy server, which performs validation and authorization checks.  Inside that HTTP proxy server, administrators can set up blacklists not only on DNS names (playboy.com, isohunt.com, etc.), but also on the IP addresses backing those DNS names. (Your approach of using bogus DNS records wouldn't meet this requirement.) And the administrator can apply other rules of the sort you are describing at that HTTP proxy server.
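
As a rough illustration of the kind of check such a proxy performs, here is a minimal Python sketch of a blacklist that looks at both the host name and the resolved address. Everything in it is hypothetical: `is_blocked`, `BLOCKED_HOSTS` and `BLOCKED_NETWORKS` are illustrative names, the network shown is a reserved documentation range, and a real proxy product has its own configuration format.

```python
import ipaddress

# Hypothetical blacklist data. The host names come from the post above;
# 192.0.2.0/24 is a reserved documentation range, not a real address block.
BLOCKED_HOSTS = {"playboy.com", "isohunt.com"}
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(host, resolved_ip):
    """Return True if the host name or its resolved address is blacklisted."""
    name = host.lower().rstrip(".")
    # Match the domain itself and any subdomain of it.
    for blocked in BLOCKED_HOSTS:
        if name == blocked or name.endswith("." + blocked):
            return True
    # Fall back to the address check, which a DNS-only filter cannot do.
    address = ipaddress.ip_address(resolved_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("www.playboy.com", "198.51.100.7"))
    print(is_blocked("example.org", "192.0.2.10"))
    print(is_blocked("example.org", "198.51.100.7"))
```

The second check is the part a bogus-DNS setup cannot replicate: a request made straight to an IP address never consults the internal DNS server, but it still has to pass through the proxy.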

At one time, Microsoft had a thing called "ISA Server" - I think it may now be called something else.  This is the kind of tool you need to meet all those requirements.  There are other vendors that provide this sort of product for this purpose.