Retroactive Tagging With TagThe.Net

Hacky hack hack.

Ever since I enabled tags on taint.org, I’ve been mildly annoyed by the fact
that there were thousands of older entries deprived of their folksonomic chunky
goodness. A way to ‘retroactively tag’ those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to
TagThe.Net, a web service which offers a nifty REST API;
simply upload a chunk of text, and it’ll suggest a few tags for that text, like
this:

echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>

This looked promising.
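For anyone who would rather not shell out to curl, here is a minimal Python sketch of the same round trip. It only builds the request URL and picks the items out of a response shaped like the one above; nothing here is an official TagThe.Net client. The function names are made up for illustration, and `urllib.parse.quote` stands in for the `urlencode` helper used in the shell example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Endpoint as it appears in the curl example above.
API_URL = "http://tagthe.net/api/"


def build_request_url(text):
    """Return the GET URL for a chunk of text, percent-encoding the text."""
    return API_URL + "?text=" + quote(text, safe="")


def extract_items(xml_text, dim_type="topic"):
    """Collect the text of every <item> inside <dim type=dim_type>."""
    root = ET.fromstring(xml_text)
    return [
        item.text
        for dim in root.iter("dim")
        if dim.get("type") == dim_type
        for item in dim.iter("item")
    ]


# The response shown above, pasted in so the parsing can be tried offline.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>"""

if __name__ == "__main__":
    print(build_request_url("Hi there, I am a tag-suggesting robot"))
    print(extract_items(SAMPLE_RESPONSE))
```

Fetching the URL (with `urllib.request.urlopen`, say) is left out on purpose, so the sketch runs without the service being reachable.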

Anyway, I’ve now implemented this, and it worked great! If you’re curious,
here are the details of how I did it. It’s a bit hacky, since I’m only going
to be doing this once, and very UNIXy and perlish, because that’s how I do
these things, but maybe somebody will find it useful.

How I Retroactively Tagged taint.org

This weblog runs WordPress, so all the entries are stored in a MySQL database.
I took a MySQL dump of the tables, and a quick script figured out that, of
somewhat over 1600 posts, 1352 came from the pre-tag era and required tag
inference. A mail to the TagThe.Net team established that they were happy with
this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text
file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING
was the nicely-escaped HTML text taken directly from the SQL dump), and ran
them through this script.
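To make that intermediate format concrete, here is a rough Python sketch of reading one ‘id=NNN text=SQLHTMLSTRING’ line back into a post ID and plain HTML. It is not the script mentioned above. The function names are invented, and the escape table is an assumption covering the backslash escapes MySQL dumps commonly use in string literals.

```python
import re

# One record per line: "id=NNN text=SQLHTMLSTRING".
LINE_RE = re.compile(r"^id=(\d+) text=(.*)$", re.DOTALL)

# Backslash escapes commonly found in MySQL string literals (assumed set).
SQL_ESCAPES = {
    "n": "\n",
    "r": "\r",
    "t": "\t",
    "0": "\0",
    "Z": "\x1a",
    "'": "'",
    '"': '"',
    "\\": "\\",
}


def unescape_sql(s):
    """Turn backslash escapes back into the characters they stand for."""
    return re.sub(
        r"\\(.)",
        lambda m: SQL_ESCAPES.get(m.group(1), m.group(1)),
        s,
        flags=re.DOTALL,
    )


def parse_line(line):
    """Split one record into (post_id, html_text)."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not an 'id=NNN text=...' line: %r" % line[:40])
    return int(m.group(1)), unescape_sql(m.group(2))


if __name__ == "__main__":
    print(parse_line("id=997 text=<p>It\\'s here</p>\\nNext"))
```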

That rendered the first 2k of each of those entries as a URL-encoded string,
invoked the REST API with that, got the XML output, and extracted the tags into
another UNIXy text-format output file. (It also added one tag for the
‘proto-tag’ system I used in the early days, where the first word of the entry
was a single tag-style category name.)
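The two fiddly bits there, clipping to 2k before encoding and pulling out the ‘proto-tag’, might look something like this in Python. This is a sketch under assumptions: I am guessing the cut was made on bytes before encoding, the tag stripping is deliberately crude, and all the names are hypothetical.

```python
import re
from urllib.parse import quote

# "The first 2k"; whether the cut was on bytes or characters is an assumption.
MAX_BYTES = 2048


def first_2k_encoded(text):
    """Percent-encode at most the first MAX_BYTES bytes of the entry."""
    clipped = text.encode("utf-8")[:MAX_BYTES]
    return quote(clipped, safe="")


def proto_tag(text):
    """Return the first word of the entry, lowercased, or None if there is none."""
    plain = re.sub(r"<[^>]+>", " ", text)  # crude HTML tag stripping
    m = re.search(r"[A-Za-z0-9][\w-]*", plain)
    return m.group(0).lower() if m else None


def merge_tags(proto, suggested):
    """Put the proto-tag first, then the suggested tags, without duplicates."""
    merged = []
    for tag in ([proto] if proto else []) + list(suggested):
        if tag not in merged:
            merged.append(tag)
    return merged


if __name__ == "__main__":
    entry = "<p>Spam: another day, another scam</p>"
    print(first_2k_encoded(entry))
    print(merge_tags(proto_tag(entry), ["scam", "spam", "mail"]))
```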

Next, I ran this script, which
in turn took that intermediate output and converted it to valid PHP code, like
so:

cat suggestedtags | ./taglist-to-php.pl  > addtags.php
scp addtags.php my.server:taint.org/wp-admin/

The generated page ‘addtags.php’ looks like this:

<?php
  require_once('admin.php');
  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
      "faq","lunchbox","destination","download","premiere","quote"));
  [...]
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",
    "winsupersite","windows","amount","base","dog","preview","system"));
?>
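The conversion step is simple enough to sketch in Python. This is not taglist-to-php.pl itself. In particular, the ‘id=NNN tags=a,b,c’ input format is my guess at what an intermediate file could look like, and the function names are invented. The one real gotcha is escaping: inside a double-quoted PHP string, a backslash, a double quote, or a `$` would otherwise change the meaning of the literal.

```python
def php_string(s):
    """Quote s as a double-quoted PHP string literal."""
    escaped = s.replace("\\", "\\\\").replace('"', '\\"').replace("$", "\\$")
    return '"' + escaped + '"'


def parse_taglist_line(line):
    """Parse a hypothetical 'id=NNN tags=a,b,c' line into (post_id, tags)."""
    id_part, tags_part = line.strip().split(" ", 1)
    post_id = int(id_part.split("=", 1)[1])
    tags = [t for t in tags_part.split("=", 1)[1].split(",") if t]
    return post_id, tags


def save_tags_call(post_id, tags):
    """Render one $utw->SaveTags(...) line like those in the page above."""
    quoted = ",".join(php_string(t) for t in tags)
    return "  $utw->SaveTags(%d, array(%s));" % (post_id, quoted)


def render_page(entries):
    """Wrap the SaveTags calls in the same boilerplate as addtags.php."""
    lines = ["<?php", "  require_once('admin.php');", "  global $utw;"]
    lines += [save_tags_call(post_id, tags) for post_id, tags in entries]
    lines.append("?>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = ["id=998 tags=software,foo,swf,tin,vnc"]
    print(render_page(parse_taglist_line(line) for line in sample), end="")
```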

Once that page was in place, I just visited it in my (already logged in) web
browser window, at
http://taint.org/wp-admin/addtags.php,
and watched as it gronked for a while. Eventually it stopped, and all those
entries had been tagged. (If I weren’t so hackish, I might have put in a little UI text here, but I didn’t.)

The results are very good, I think.

A success: http://taint.org/tag/research has picked up a lot of the
interesting older entries where I discussed things like IBM’s Teiresias
pattern-recognition algorithm. That’s spot on.

A minor downside: it’s not so good at proper nouns. This entry talks about
Silicon Valley and geographical insularity, and mentions “Silicon Valley”
prominently. One or both of those words would seem to be a good thing to tag
with, but it missed them.

Still, that’s a minor issue — the tags it has suggested are generally very
appropriate and useful.

Next, I need to find a way to auto-generate titles for the really
old entries ;)

This post was written by Justin, source: Retroactive Tagging With TagThe.Net

This work is licensed under a Creative Commons License.