Working in an environment that relies on web-based communication is a double-edged sword. On one hand, you have worldwide connectivity that lets employees outside the country connect with the rest still in-country. On the other, you have worldwide connectivity that lets the rest of the world in if you’re not careful.
This was hammered home today when it became apparent that our internal weblogs, private resources not intended for public consumption, were showing up on Google. We’d taken care to hide them via the usual suspects: robots.txt has disallowed crawling all along, and they’ve always been password protected. So what gives?
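A few quick Google queries cleared it up: the URLs were being indexed, but the content wasn’t. That fits, since robots.txt keeps a crawler from fetching a page but doesn’t stop a URL the crawler finds elsewhere from being listed. And a bit of refining the search results confirmed where the URLs were being found: referrals.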
Every time a visitor follows a link, the browser tells the server they’re jumping to (the destination) which page they’re leaving (the referrer), by way of the HTTP Referer header. The destination is aware of this transaction, and in almost every case it’s stored somewhere, typically in the server’s access logs.
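To make that concrete, here is a minimal Python sketch of how a destination might pull the referring URL out of its own logs. It assumes the widely used Apache/nginx "combined" log format; the log entry, the host names, and the `extract_referer` helper are all made up for illustration and aren't taken from our setup.

```python
import re

# "Combined" log format: the last two quoted fields are the Referer
# header and the User-Agent header, as sent by the visitor's browser.
COMBINED_LOG = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def extract_referer(log_line):
    """Return the referring URL recorded in one log line, or None."""
    match = COMBINED_LOG.match(log_line.strip())
    if match is None:
        return None
    referer = match.group("referer")
    # A lone "-" means the browser sent no Referer header.
    return None if referer in ("", "-") else referer


# Hypothetical entry in the destination server's access log.
sample = (
    '203.0.113.7 - - [01/Jan/2000:00:00:00 +0000] '
    '"GET /article.html HTTP/1.1" 200 5120 '
    '"https://intranet.example.com/weblogs/team/post-42" "Mozilla/5.0"'
)

print(extract_referer(sample))
```

The point of the sketch is that the full path of the private page sits in someone else's log file, whether or not they can actually open the page.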
Some people choose to make public the list of sites linking to them. Visit this Daring Fireball article and scroll to the bottom for a real-time example.
This is the crux of it: if you link to a resource, the mere fact that you’ve linked it (and that someone has followed that link; this part is essential) is a piece of data that you have no control over. The transaction needs two servers; if the destination server is out of your control, that piece of data exists outside of your influence. If the destination chooses to publicize this information, you have no way of stopping that from happening.
Naturally, there are ways of minimizing the impact this might have on your system. In this case, we’re going to give a redirect script a shot. We’ll create a generic redirect.php in a public-facing directory and rewrite every single link in the protected directory to bounce through the redirect first, and then on to the destination, so that the referral appears to come from that script. We can’t completely mask that it’s coming from our domain, but we can prevent the directory structure of our internal weblogs from being exposed. This is good enough in our case.
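As a rough illustration of the two halves of that plan, here is a sketch in Python rather than PHP. The `/redirect.php?url=...` query format, the `intranet.example.com` host, and both function names are assumptions for the example. One more assumption is worth calling out: a plain HTTP 3xx redirect generally leaves the browser's original Referer in place, so this sketch has the script return a tiny page with a meta refresh instead. Browsers differ on what referrer they send after a meta refresh (some send the refreshing page, some send nothing), but in neither case is it the internal path.

```python
import html
import re
from urllib.parse import parse_qs, quote, urlsplit

REDIRECT_PATH = "/redirect.php"         # hypothetical public-facing script
INTERNAL_HOST = "intranet.example.com"  # hypothetical host of the weblogs

# Simplification: only matches double-quoted absolute http(s) links.
HREF = re.compile(r'href="(https?://[^"]+)"')


def route_through_redirect(page):
    """Rewrite outbound links in a page to pass through the redirect script."""
    def rewrite(match):
        url = match.group(1)
        if urlsplit(url).hostname == INTERNAL_HOST:
            return match.group(0)  # internal links stay as they are
        return 'href="%s?url=%s"' % (REDIRECT_PATH, quote(url, safe=""))
    return HREF.sub(rewrite, page)


def redirect_page(query_string):
    """Build the redirect script's response body, or None if the target is bad."""
    values = parse_qs(query_string).get("url", [])
    target = values[0] if values else ""
    # Only bounce to http(s) URLs; refuse anything else.
    if urlsplit(target).scheme not in ("http", "https"):
        return None
    return '<meta http-equiv="refresh" content="0; url=%s">' % html.escape(
        target, quote=True
    )


post = '<a href="https://example.org/article">a link</a>'
print(route_through_redirect(post))
print(redirect_page("url=https%3A%2F%2Fexample.org%2Farticle"))
```

A public script that will bounce to any URL it’s handed can be abused by others, so a real version would likely want some check on where requests come from or where they’re allowed to go.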