SHA-1 and DB Integrity

April 24, 2005 1PM PST

Applying cryptographic functions for fun and profit.

I’ve got two books sitting on my desk, PHP and MySQL for Dynamic Web Sites by Larry Ullman which is absolutely fantastic for bringing yourself up to speed if you already have rudimentary programming skills in other languages (though it could stand for an update), and the PHP Cookbook, an O’Reilly book which (so far) seems like a good reference for getting beyond the basics.

So, any guess on what I’ve been working with lately? I’ve been running this site on a pretty basic PHP foundation for a while now, and by basic I mean aside from some PHP includes, passing a couple of variables from page to page, and a slightly more complex spaghetti mess of a comment preview page, there’s not much PHP to be found here.

Relational databases, on the other hand, are an entirely new ball of wax. I’ve worked with people who use them. I’ve studied the flowcharts. I’ve formatted the data coming out of them. But I’ve never had to get my hands dirty before and actually make those connections myself. It’s been an interesting learning process as I’ve put together an application I’ve needed for a while, which really is the best way to learn. Reading a book and not applying the knowledge immediately is a good way to ensure you don’t retain it, I find. Anyway, on to the point of this article, which otherwise probably would have appeared out of left field if not for the above.

This is a particular problem I’ve run into. Up until recently, all user interaction with the database was limited to adding records. Aside from sloppy code, there were no real security risks to speak of. The new functionality I wanted to build would expose records for editing by the user, which is obviously problematic if the wrong people get access to them.

So the editing interface should be hidden behind a layer of verification. There are a few ways of going about this, a login form being the most obvious. Generating a username/password doesn’t make much sense in this context though; the ideal solution would be just a simple text link to get back into the interface and retrieve the data, which requires a unique querystring. Luckily, the data they had previously submitted provides ample doses of uniqueness, so generating one isn’t a problem. But this is also where the dilemma comes to light.

Let’s assume the unique data in this case is their name. So ‘Jane Smith’ is told to re-enter the application with a URI like so:

http://www.example.com/manager/?returnID=janesmith

Simple enough. Except guessing a different returnID value isn’t exactly difficult, so some records may potentially be exposed to the wrong people. Not good.

So it seemed the most effective way to fix this lay in using a cryptographic hash function like MD5 or SHA-1. (And yes, both have been broken by now, but collisions seem pretty much irrelevant when the hashes exist in a one-to-one relationship with only a few thousand records.) By running the name through the sha1 function:

<?php
  $hash = sha1("janesmith");
?>

We get a value that makes guessing pretty much impossible:

http://www.example.com/manager/?returnID=
40ab67d0840a91170a96c60614b9f72202ae22b8

But we’re not quite done yet. The last step is actually verifying the hash once the user returns, and this poses a choice between two equally not-very-good options. First, we can check the hash against those of each record calculated on the fly:

<?php
  $hash = $_GET['returnID'];
  $query = "SELECT name FROM table";
  $result = @mysql_query ($query);

  if ($result) {
   while ($row = mysql_fetch_array($result, MYSQL_NUM)) {
   if ($hash == sha1($row[0])) $successfulReturn = true;
  }
?>

There’s probably better syntax for that loop, but you get the idea. Presumably, though, this won’t scale very well. I didn’t bother testing, but I could easily imagine a few thousand dynamically-calculated SHA-1 hashes bogging down the server in a hurry.

So instead, storing a pre-calculated hash value for each record seems like a good way to ease up the load. Each new record inserted gets a value in the hash column, encoded with the sha1 function of course. Verifying is now as easy as a straight up string compare:

<?php
  $hash = $_GET['returnID'];
  $query = "SELECT record_id FROM table WHERE hash='$hash'";
  $result = @mysql_query ($query);
?>

Due to SHA-1 hashes being 40 bytes, there are minor storage space issues, but considering 10,000 records only add up to 4MB, it’s not worth obsessing over. It bothers me to have that dead weight for a somewhat minor function, but it’s the best I came up with. I’d be curious to compare notes with anyone else who’s grappled with similar problems — is SHA-1 overkill? Is there an easier way to do it that I’ve overlooked?