
Weblog Entry

Automaton

August 04, 2005

A question on automating database population for you server-side experts.

Let’s say you have a web site that you’ve been updating manually for a few years. Let’s also say that you’re sick to death of doing it this way, have finally taken the steps necessary to automate this thankless task, and now it’s finally time to throw all that manually-input data into a database. For the sake of argument, let’s also assume that adding the 700+ items by hand just isn’t going to happen.

So then my question to you is, can you see any way of taking multiple pages of static, well-formed (and consistent) HTML like, say, this, and getting it to automatically post to a form that looks, well, like this?

I’m sure to some out there, it’s the easiest thing in the world. To me, however, it’s not. So, what can be done? My preference would be a single PHP file that crawls the various category pages (and the sub-pages they link to) under the ‘All Categories’ heading on the ‘All Designs’ page and either a) posts it to the form (assuming the example values aren’t there in the final form), or my preference, b) stores it in logically-named PHP arrays (following the naming convention in the example function below) so that I can bypass the form completely.

Here are a few more supporting pieces of information that might be relevant to this task:

  • Not all fields in the new form are represented in the static data. For example, currently each submission only receives one category, whereas multiple categories exist on the form. So the script should detect which category the static data is coming from, assume a corresponding entry exists in the list of categories (even though none do at the moment), and assign accordingly. Multiple categories can always be re-assigned later.

    Fields which are considered absolutely essential/cannot be discarded/must have a value are: Name, URL (when it exists), CSS File, Submission Title, Category, Official Number (where it exists). Fields which exist in the database for other use, but don’t appear anywhere in the static data (and therefore can be ignored) are: E-Mail Address, Zip File, Windows Browsers, Mac Browsers, Comments. See note about submission/publication dates below.

  • The order in which the submissions are displayed in the static files is important, in that the further along in the list they are, the earlier they were submitted, and therefore the earlier they should be posted. I’ll probably have to do some manual jiggering of the submission/publishing dates because I’ve pretty much discarded that data all along, so as much as possible, the script should try to preserve this implicit chronology. The more relevant of the two is publishing date, as it’s what will ultimately determine listing order on the site.
  • The form flattens out the data quite a bit; in reality, this data spans multiple tables in the database. This shouldn’t matter, but in case you think better this way, it might be relevant to know that the form itself basically posts to a PHP function:

    save_submission($subName, $subEmail, $subUrl, $subNation, $subTitle, $subCssfile, $subZipfile, $subWindows, $subMac, $subNotes, $subStatus, $subDate, $pubDate, $subOfficialNumber, "populate");

    But a separate category handler function also needs to be called:

    change_categories($subId, $subCategories);

    …where $subId is the submission id just created, and $subCategories is an array of the values selected; see the sketch below for how the two fit together.
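
To make the target concrete, here's roughly the kind of call file I'm picturing for option b. Every value below is a placeholder, and it assumes save_submission() returns the new submission id:

[code]
<?php
// Sketch only: one array per parsed submission, in the order they appear
// on the static pages (newest first). All values are placeholders, and this
// file would be included where save_submission() and change_categories()
// are defined.
$submissions = array(
    array(
        'name'     => 'Example Author',
        'url'      => 'http://example.com/',
        'nation'   => 'Canada',
        'title'    => 'Example Design',
        'cssfile'  => 'http://www.csszengarden.com/example.css',
        'category' => 'Official',
        'number'   => '007',
    ),
    // ...and so on, one per submission...
);

// Reverse so the oldest submission is inserted first, preserving the implicit
// chronology; dates are faked a day apart and fixed up by hand afterwards.
$total = count($submissions);
foreach (array_reverse($submissions) as $i => $sub) {
    $date = date('Y-m-d', strtotime('-' . ($total - $i) . ' days'));

    $subId = save_submission($sub['name'], '', $sub['url'], $sub['nation'],
        $sub['title'], $sub['cssfile'], '', '', '', '', '', $date, $date,
        $sub['number'], 'populate');

    change_categories($subId, array($sub['category']));
}
?>
[/code]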

Free copy of The Zen of CSS Design for the best answer. Dan Reedy’s multiple scripts get me close enough to where I can run with it, so the book is his. Thanks to everyone else for your suggestions.


1
chris says:
August 04, 01h

Don’t know if you noticed, Dave, but in Firefox there are two purple/pink lines that go across the screen.

If you did, sorry.

2
August 04, 02h

You might think it crazy, but JavaScript could work quite well. It can pull each page via Ajax/XHR and walk the DOM to extract the necessary data. Once it has the data, it can place the values into the submission form and submit it, just as you would manually.

It would be easier to implement if you could add id values to the ul elements, so the script could grab them with document.getElementById.

The trick is retaining state after submitting the form. The form may have to be contained within an iframe, so that when the iframe reloads it triggers the JavaScript to process the next page on the list.

I’d be interested in at least getting you started.

3
August 04, 03h

Can’t help you on the PHP, but man that zen garden admin panel sure looks sweet! (drool)

4
August 04, 05h

If your pages are all well-formed XHTML, you could do the entire thing with an XSLT stylesheet and have it output PHP or SQL code.
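
Something like this, for instance (sketch only; it leans on PHP 5's DOM and XSL extensions, and the XPath expressions are guesses at the page structure):

[code]
<?php
// Sketch: transform a well-formed XHTML page into PHP code with XSLT.
// The stylesheet below is illustrative; the XPath would need adjusting to
// the real markup, and pages using named entities (&nbsp; etc.) may need
// their DTD resolved or a loadHTML() pass first.
$xslSource = <<<XSL
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//x:ul[2]/x:li">
      <xsl:text>save_submission('</xsl:text>
      <xsl:value-of select="x:a[2]"/>
      <xsl:text>', ..., '</xsl:text>
      <xsl:value-of select="x:a[1]/@href"/>
      <xsl:text>', ...);&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

// XHTML elements live in the http://www.w3.org/1999/xhtml namespace,
// hence the x: prefix in the stylesheet.
$page = new DOMDocument();
$page->load('http://www.mezzoblue.com/zengarden/alldesigns/official/007/');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($page);
?>
[/code]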

5
Dan says:
August 04, 05h

If anyone checks out the example code in comment 15, you’ll notice that the actual page is a little different. In the example code I was using the nationality flag’s title attribute rather than the alt attribute. I fixed that, and all the nations should show up now.

6
Tom W.M. says:
August 04, 05h

Brennan, that’s just what I was thinking. I know it’s too late, but here it is:

http://freecog.net/zenparse.html

It uses an iframe to loop through the Zen Garden pages, collecting information, which is placed in an array and output at the end to a textbox in JSON (http://json.org/) format, which should be easy enough to parse with PHP. To avoid cross-site scripting restrictions (the same-origin policy), it is activated via a bookmarklet while viewing a mezzoblue.com page. Only known to work in Firefox 1.0.6; it will fail in Internet Explorer.

I was trying to do a similar thing with my library’s web site (to monitor my holds) a couple of weeks ago, so I had a bit of a head start.

I’ve had some problems getting the bookmarklet to work when it’s not hosted on localhost (security restrictions), so the extracted information is available at http://freecog.net/zenparsed.txt

7
Tom Weir says:
August 04, 10h

My suggestion would be to do some web spidering using Perl’s WWW::Mechanize library. The script can take the extracted data and do one of two things:
1) Directly dump the data into the database of your choice.
2) Write a PHP file with a series of function calls/array definitions.

I’ve done this in the past, and it’s not all that difficult to set up. I should have some free time in a week or so if you’d like some help getting it going.

8
August 04, 10h

I had to do something very similar recently (with Java/ATG, not with PHP). The approach I took was:

- a crawler walks the filesystem starting from a directory
- for each file, the crawler extracts the relevant HTML code with a pair of regexps, builds something equivalent to an SQL “insert” statement and inserts the content into the database

The general principle shouldn’t change in your case; either way, I don’t think you should bother going through the form.
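
In PHP terms the sketch would look roughly like this (the directory path, regexp and column names are all invented, so they’d need to match the real markup and schema):

[code]
<?php
// Sketch only: walk a directory of static pages, pull fields out with a
// regexp, and print SQL inserts. Pattern and schema are made up.
$dir = '/path/to/static/pages';
$pattern = '~<li><a href="([^"]+)">([^<]+)</a>\s*by\s*<a href="([^"]+)">([^<]+)</a>~i';

foreach (glob($dir . '/*.html') as $file) {
    $html = file_get_contents($file);

    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            list(, $cssFile, $title, $homepage, $author) = $m;
            // Shown as plain text for clarity; in practice escape the values
            // properly and use your DB layer (or the existing save_submission()).
            printf("INSERT INTO submissions (title, author, cssfile, url) "
                . "VALUES ('%s', '%s', '%s', '%s');\n",
                addslashes($title), addslashes($author),
                addslashes($cssFile), addslashes($homepage));
        }
    }
}
?>
[/code]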

9
Dave S. says:
August 04, 11h

Tom - A direct DB dump probably isn’t an option since I’d have to get it to establish and link all the various foreign keys as a part of that. My save_submission function does all that for me, so no sense in duplicating that work. The PHP function call/array file, on the other hand, would be perfect.

I’d love to do this without installing a new library (or learning Perl), but depending on other answers, WWW::Mechanize could work. Thanks for the suggestion.

10
Dan says:
August 04, 11h

I have the unfortunate task of having to do similar things at work sometimes. Typically, if I have to pull information from well-formed XHTML documents, I load the page in PHP and use an XML parser (I typically use miniXML) to convert the HTML into a multi-dimensional array. You should then have a single array item for each of your entries, depending on the HTML of course.

11
August 04, 11h

Dave,

I haven’t made anything that actually crawls a web page, but I have created a simple form into which I copy and paste a large chunk of pre-determined text (schedules for a client); it breaks the string into array chunks and formats it line by line, outputting a final XHTML file.

I don’t think what you’re trying to accomplish is especially easy or difficult, but it will most likely be time-consuming.

Most likely you will be using PHP’s preg_split() function, a loop of some sort, and the next(), strpos() and substr() functions as well.

To make this actually open a file, you can always use the fopen() function and do some file handling.

While I don’t claim to be an expert, I do think my approach will do what you want. The problem is that you will have to write it to explicitly read and understand your formatting.
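
A loose sketch of the kind of thing I mean (the file path and markers are invented; they’d have to match your markup):

[code]
<?php
// Sketch: chop a page apart with preg_split(), then use strpos()/substr()
// to pull a value out of each chunk.
$html = file_get_contents('/path/to/page.html');   // or fopen()/fread()

// One chunk per list item; drop everything before the first <li>.
$chunks = preg_split('/<li[^>]*>/i', $html);
array_shift($chunks);

foreach ($chunks as $chunk) {
    // Text between the first '">' (end of the opening <a>) and '</a>'.
    $start = strpos($chunk, '">');
    $end   = strpos($chunk, '</a>');
    if ($start === false || $end === false) {
        continue;
    }
    echo trim(substr($chunk, $start + 2, $end - $start - 2)) . "\n";
}
?>
[/code]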

I’d be happy to continue this conversation off the comments.

12
August 04, 11h

I haven’t used it before, but O’Reilly has an article on Sprog that may help. Basically Perl for non-Perl people. The example in the article (http://www.perl.com/pub/a/2005/06/23/sprog.html ) scrapes data from a page and makes a useful file out of it. You could probably easily adjust a couple of things to get it to fit your needs.

I think I’ve seen posts that indicate you have a Mac… I haven’t seen any indication that anyone has installed it on one, but I don’t see why you couldn’t. You’d just need to get the Perl modules first. (Trying it out now on my Mac as it looks like an interesting tool to have)

13
August 04, 11h

Parsing the pages as XML is the way to go. You could use pure PHP to parse every file in a directory, looking for certain tags with certain classes (kind of like microformats), etc. which would be sucked into the DB (minding the duplicates, of course). The bonus of that is that it could be run from the prompt for added speed.
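
Something along these lines, say, run from the command line as “php parse.php” (the class names and file layout here are assumptions):

[code]
<?php
// Sketch: pure PHP from the prompt, using DOMDocument/DOMXPath and
// class-based selection. Class names and paths are assumptions.
foreach (glob('/path/to/static/pages/*.html') as $file) {
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($file));   // loadHTML() is forgiving of real-world markup

    $xpath = new DOMXPath($doc);

    // Pick out elements by class, microformat-style.
    foreach ($xpath->query("//*[contains(@class, 'design-title')]") as $node) {
        echo trim($node->textContent) . "\n";    // hand off to the DB layer here
    }
}
?>
[/code]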

14
Dan says:
August 04, 11h

I’ve thrown together what I was talking about, pulling HTML into an array. Check it out at http://www.dan-reedy.com/example/zengarden-parse.php

Here is the code:

[code]
<?php
include_once("minixml/minixml.inc.php");

// Fetch one of the static 'All Designs' pages with cURL.
$url = "http://www.mezzoblue.com/zengarden/alldesigns/official/007/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl, CURLOPT_URL, $url);
$xml = curl_exec($curl);

// Parse the (well-formed) markup and dump the whole page as a multi-dimensional array.
$minixml = new MiniXMLDoc();
$minixml->fromString($xml);
$array = $minixml->toArray();
echo "<pre>";
print_r($array);
echo "</pre>";
curl_close($curl);
?>

[/code]

15
Dan says:
August 04, 11h

This is also a completely raw import and output of the HTML page as an array. With miniXML (or any other XML package) you can do much more.

16
Dallas Pool says:
August 04, 11h

I remember doing this for [site removed]… they had 2700 static pages with basic business information in them. We had to crawl each page in the directory (fortunately they were all in one dir) and strip the content based on the tags surrounding it… the problem was, since they were static, some pages had different tags….

Trust me, that was NOT fun to mess with.

Basically, we had to grab everything that matched a certain regex, strip that data, post it to a db, and remove the files. What was left had to go through a new regex crawl with different rules to adjust for the tag differences.

I would recommend going with a page crawl that saves all data found in all pages to a single array structure; then, when the crawling is done, loop over the bigass array and pass the data to your insert functions.

17
Tom Weir says:
August 04, 11h

That’s actually why I suggested the option of auto-generating a PHP script that can call your PHP function.

I’ve used this approach a few times when I’ve been working with multiple systems with differing configurations. Let Perl do what it does well, and then leverage your existing work.

I’m using a very similar technique to populate the wiki at http://fmepedia.com. Using both Perl and Python, I parse several different text files to extract a variety of useful information. The end result is a PHP include file with a series of createPost() calls. I include that file in a separate PHP file that defines the createPost function and handles the MediaWiki article editing.
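
As a rough sketch (with made-up variable, field and file names, and assuming save_submission() returns the new id), the generated include plus the loader would look something like this:

[code]
<?php
// --- generator: run once after scraping. Fields here are placeholders.
$parsed = array();   // filled in by the crawler: one array of fields per submission

$lines = array('<?php');
foreach ($parsed as $sub) {
    $lines[] = sprintf(
        '$id = save_submission(%s, "", %s, %s, %s, %s, "", "", "", "", "", %s, %s, %s, "populate");',
        var_export($sub['name'], true), var_export($sub['url'], true),
        var_export($sub['nation'], true), var_export($sub['title'], true),
        var_export($sub['cssfile'], true), var_export($sub['subdate'], true),
        var_export($sub['pubdate'], true), var_export($sub['number'], true));
    $lines[] = sprintf('change_categories($id, array(%s));',
        var_export($sub['category'], true));
}
file_put_contents('populate_calls.php', implode("\n", $lines) . "\n");

// --- loader: a separate file that defines the real functions, then
// --- simply includes the generated calls.
// include 'submission_functions.php';
// include 'populate_calls.php';
?>
[/code]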

18
Dave S. says:
August 04, 11h

Dan - I like where you’re going with the miniXML thought. Your example file gives me something I can work with, and if I get rid of the header/footer include files, it’ll simplify the array a LOT more.

I’m pretty sure I can piece together the rest, including walking the directory tree. Barring a more complete solution, I think the book is yours.

19
Dan says:
August 04, 11h

Check out http://www.dan-reedy.com/example/zengarden-parse2.php

It strips out the unnecessary parts. Here is the code for that:

[code]
// Drill down to the list of designs (the third ul under body) and dump it.
$list = $array["html"]["body"]["ul"][2]["li"];
print_r($list);
[/code]

20
Dan says:
August 04, 11h

And finally: http://www.dan-reedy.com/example/zengarden-parse3.php

This pulls out the basic information from that list; you can tweak it as much as needed. Here is the full code:

[code]
<?php

ini_set("include_path", ini_get("include_path") . ":/home/dreedy/php/lib/php");
include_once("minixml/minixml.inc.php");
$url = "http://www.mezzoblue.com/zengarden/alldesigns/official/007/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl, CURLOPT_URL, $url);
$xml = curl_exec($curl);
$minixml = new MiniXMLDoc();
$minixml->fromString($xml);
$array = $minixml->toArray();
$list = $array["html"]["body"]["ul"][2]["li"];
for ($i = 0; $i < count($list); $i++) { ?>
<dl>
<dt>Name: <?= $list[$i]["a"][0]["_content"] ?></dt>
<dd class="author">Author: <?= $list[$i]["a"][1]["_content"] ?></dd>
<dd class="file">CSS File: <?= $list[$i]["a"][0]["_attributes"]["href"] ?></dd>
<dd class="homepage">Author Homepage: <?= $list[$i]["a"][1]["_attributes"]["href"] ?></dd>
</dl>
<?
}
curl_close($curl);

?>
[/code]

21
Dan says:
August 04, 12h

This will be my last post, since I’ve spent too much time on it already. Here is a file that parses all of the official design pages:
http://www.dan-reedy.com/example/zengarden-parse5.php

[code]
<?php
include_once("minixml/minixml.inc.php");

// Build the list of pages to crawl: the main official page plus 009 down to 001.
$urls[] = "http://www.mezzoblue.com/zengarden/alldesigns/official/";
for ($i = 9; $i > 0; $i--) {
    $urls[] = "http://www.mezzoblue.com/zengarden/alldesigns/official/00" . $i . "/";
}

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);

$j = 1;
foreach ($urls as $url) {
    curl_setopt($curl, CURLOPT_URL, $url);
    $xml = curl_exec($curl);
    $minixml = new MiniXMLDoc();
    $minixml->fromString($xml);
    $array = $minixml->toArray();
    $list = $array["html"]["body"]["ul"][2]["li"];
    for ($i = 0; $i < count($list); $i++) {
        if (!empty($list[$i]["a"][0]["_attributes"]["href"])) { ?>
<dl>
<dt><strong>#<? echo $j; $j++ ?></strong>: <?= $list[$i]["a"][0]["_content"] ?></dt>
<dd class="author">Author: <?= $list[$i]["a"][1]["_content"] ?></dd>
<dd class="file">CSS File: <?= $list[$i]["a"][0]["_attributes"]["href"] ?></dd>
<dd class="homepage">Author Homepage: <?= $list[$i]["a"][1]["_attributes"]["href"] ?></dd>
<dd class="nationality">Nationality: <?= $list[$i]["img"]["_attributes"]["title"] ?></dd>
</dl>
<?
        }
    }
}
curl_close($curl);
?>
[/code]

22
Dave S. says:
August 04, 12h

Thanks Dan, I’ll email you about the book.

23
Dan says:
August 04, 12h

Thanks very much Dave, I look forward to receiving the book. Please feel free to email me any additional questions you may have about the code, I’d be more than happy to help.

24
August 04, 12h

Note to self: Check bloglines more often.

As I’m reading the article, I’m thinking along the exact same lines as Dan. Then I go to write the comment, and see… you know the rest.

THAT BOOK SHOULD BE MINE!

25
James says:
August 05, 01h

I’m sure this has been mentioned a million times… but seems an appropriate place to mention it anyway. A page with thumbnails of the various CSS Zen Garden designs would be really cool, especially considering how many designs there are in there now. If you’re automating the submission process, maybe you could add a field for a thumbnail image generated by the author along with the zip and CSS?

Of course by doing this I guess the rest of the site would need thumbnailing too… (a fun job no doubt), but it’d be great to be able to browse the CSS Zen Garden gallery with thumbnails.

26
Dan says:
August 05, 07h

In response to comment #25:

There is a site that is somewhat up to date with thumbnails of the Zen Garden: http://antenna.readalittle.net/thumblink/zenGarden/

It’s using a couple of applications to create the thumbnails, the primary one being Webnail2. You specify a list of the sites and the filenames, and it does the rest.

Unfortunately it is a Windows-only application, and I cannot find a similar app for Linux or Mac OS X. I suppose you could use it to grab shots of all the old sites and then just create thumbnails as new entries are submitted.

27
Will says:
August 05, 09h

How cool is this? A flip. Visiting a forum? “I don’t need no stinkin’ forum. This is mezzoblue!” People are always whining in the comments, ‘why don’t you do it like this…’ or ‘I’ve been doing that for years.’ Might as well use it.

Best hack yet, Dave.

Will

28
August 06, 09h

I’ve been asked to do this a few times, unfortunately usually with some not-so-well-formatted code.

The Perl and PHP suggestions are all good ones. The one thing I would chip in is to break the task up into steps, i.e. one script to get the content, another to move it into a common format, and another to populate the database. In the long run it will make the whole thing a lot easier.

29
Paul says:
August 08, 09h

Nice hack; you should get together with Dave from http://www.raven.za.net and put something together, as he wrote something similar.

peace

paul
http://www.thegoogleblog.com

30
J. M. P. C. says:
December 02, 15h

About the Webnail2 that Dan was referring to: for Mac you can use a program called webkit2png, or another called Paparazzi! Check the following URL for more info:
http://www.paulhammond.org/webkit2png/

By the way, it was f****ing hard finding anything about Webnail2 in plain English. Does anyone have any more information on it? Dan…? Anyone?
