Setting up mappings: -mapfrom and -mapto

Mappings allow URLs to be rewritten before they are checked. Their most typical use is to let Big Brother bypass a Web server and read documents directly from disk, as shown in the example below.

You can specify any number of mappings. All mappings are applied to the URL being checked, after it has been resolved (that is, turned into an absolute URL, if it was relative). A mapping is specified as follows:

-mapfrom regexp -mapto replacement

The mapping is applied by finding the first substring of the URL at hand that matches the specified regular expression, and replacing it with the replacement string. (If no substring matches the regular expression, the mapping has no effect.) The replacement string can contain $1, $2, etc.; these sequences are replaced by the text matched by the corresponding group in the regular expression. $0 stands for the text matched by the whole regular expression.
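
The rule above can be sketched in a few lines of Python. This is only an illustration, not Big Brother's actual code: the function name apply_mapping is hypothetical, Python's regular expression syntax may differ from the dialect Big Brother accepts, and leaving a reference to a nonexistent group untouched is an assumption.

```python
import re


def apply_mapping(url, pattern, replacement):
    """Rewrite the first substring of `url` that matches `pattern`.

    Hypothetical helper: `$0` in `replacement` stands for the whole match,
    `$1`, `$2`, ... for the corresponding groups.
    """
    match = re.search(pattern, url)
    if match is None:
        # No substring matches: the mapping has no effect.
        return url

    def expand(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            # Assumption: a reference to a nonexistent group is left as-is.
            return ref.group(0)
        # A group that did not take part in the match expands to nothing.
        return match.group(index) or ""

    expanded = re.sub(r"\$(\d+)", expand, replacement)
    # Only the first matching substring is replaced.
    return url[:match.start()] + expanded + url[match.end():]
```

For instance, apply_mapping("abcabc", "b(c)", "[$0|$1]") rewrites only the first occurrence of "bc" and leaves the second one alone.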

Here is a simple and realistic example. Suppose I have a Web site available at http://www.users.com/~tom/, whose files are stored on my hard disk in the directory /home/tom/web/. When checking my site, I want Big Brother to read the documents directly off my disk, instead of requesting them from the Web server. So, I set up a mapping:

-mapfrom "^http://www\.users\.com/~tom/" -mapto "file:///home/tom/web/"

Now, suppose I ask Big Brother to check the URL http://www.users.com/~tom/index.html. The mapping applies, so the URL is rewritten and becomes file:///home/tom/web/index.html. Thus, Big Brother will read the file from disk, rather than request it from the server.

Let us explore this example a bit further. Assume the above index file contains a link to ../~amy/. Amy's Web site is not stored on my hard disk. Will Big Brother be smart enough to request it from the server? Yes! Although Big Brother applies the mapping to a URL when trying to access it, it remembers the original URL and uses it as the base URL when resolving relative URLs. Concretely, when it finds the relative link ../~amy/, Big Brother resolves it with respect to the unmapped URL of the current document, which is http://www.users.com/~tom/index.html. So, the resolved URL is http://www.users.com/~amy/. The mapping is then applied to this URL, but it does not match, so the URL remains unchanged. As a result, Big Brother properly sends a request to the Web server to retrieve this document.
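
The order of operations described here (resolve against the unmapped URL first, apply the mapping second) can be sketched in Python as follows. The names map_url and links_to_fetch are hypothetical, and the sketch relies on Python's urljoin and re modules rather than on anything Big Brother itself provides.

```python
import re
from urllib.parse import urljoin

# The simple mapping from the example above.
MAP_FROM = r"^http://www\.users\.com/~tom/"
MAP_TO = "file:///home/tom/web/"


def map_url(url):
    """Apply the mapping to the first match only; no match leaves the URL as-is."""
    return re.sub(MAP_FROM, lambda _match: MAP_TO, url, count=1)


def links_to_fetch(document_url, links):
    """Return (resolved URL, URL actually accessed) pairs for each link.

    `document_url` must be the original, unmapped URL of the document,
    because it serves as the base when resolving relative links.
    """
    pairs = []
    for link in links:
        resolved = urljoin(document_url, link)  # step 1: resolve
        pairs.append((resolved, map_url(resolved)))  # step 2: map
    return pairs
```

Calling links_to_fetch("http://www.users.com/~tom/index.html", ["../~amy/", "about.html"]) leaves Amy's URL pointing at the server, while the hypothetical sibling page about.html is read from disk.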

So, to sum up, here's how to bypass a Web server using mappings. First, set up a mapping which maps http URLs to file URLs appropriately. (Have a look at the URL syntax rules.) Second, ask Big Brother to check the remote URL, as usual.

Now, here comes a more elaborate example, which shows how powerful mappings can be. I'm still Tom and I still have a Web site, but this time, only the HTML documents are stored on my hard disk; other files, such as images, are available only on the server. So, I want the mapping to apply only to HTML files. Here's the way to do it:

-mapfrom "^http://www\.users\.com/~tom/(.*\.html)$" -mapto "file:///home/tom/web/$1"

The regular expression matches only documents whose name ends with .html. In addition, the document name is enclosed in a group, using ( and ), so that it can be referred to as $1 in the replacement string. So, http://www.users.com/~tom/index.html is still turned into file:///home/tom/web/index.html, but URLs of image files, such as http://www.users.com/~tom/tom.jpg, are unaffected.
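
To check the behaviour of this second mapping, here is a small Python sketch. The function name map_html_only is hypothetical, and the sketch assumes that Python's regular expressions treat this particular pattern the same way Big Brother does.

```python
import re

# Same pattern as in the -mapfrom option above.
HTML_ONLY = re.compile(r"^http://www\.users\.com/~tom/(.*\.html)$")


def map_html_only(url):
    """Map Tom's .html URLs to files on disk; leave every other URL alone."""
    match = HTML_ONLY.search(url)
    if match is None:
        return url
    # The pattern is anchored at both ends, so the whole URL is replaced.
    # match.group(1) plays the role of $1 in the -mapto string.
    return "file:///home/tom/web/" + match.group(1)


for url in ("http://www.users.com/~tom/index.html",
            "http://www.users.com/~tom/tom.jpg"):
    print(url, "->", map_html_only(url))
```

The first URL is rewritten to a file URL, while the image URL comes out unchanged.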
François Pottier, May 5, 2004
