Wednesday, January 16, 2008

Six Degrees of Wikipedia Automation

Well, it's been a while since my last post; I've been waiting until I got the Six Degrees of Wikipedia stuff put together. There were basically five things I had to do to get the automation in working order.

  • Create a blog that could be updated via email
  • Create an email address to administer the game through
  • Set up mutt
  • Write a new revision of the generation script
  • Add a line to my crontab

The first step was one of the easier ones. Blogger fits the bill pretty well: it lets me create an email address, and anything mailed to that address gets posted to the blog.

The second step was even easier; Gmail fit the bill perfectly. The only real requirements I had were that it supported POP3 or IMAP, plus SMTP.

Mutt is a text-based mail client that can send an email non-interactively, taking the message body from STDIN. Although simple in concept, I had a bitch of a time getting it set up correctly. First I found a simple configuration file and made a minor tweak:

set imap_user = ""
set imap_pass = "*******"

set smtp_url = "smtp://"
set smtp_pass = "*******"
set from = ""
set realname = "Six Degrees of Wikipedia"
set content_type = "text/html"

set folder = "imaps://"
set spoolfile = "+INBOX"
set record="+[Gmail]/Sent Mail"
set postponed="+[Gmail]/Drafts"

set header_cache="~/.mutt/cache/headers"
set message_cachedir="~/.mutt/cache/bodies"
set certificate_file=~/.mutt/certificates

set move = no

set sort = 'threads'
set sort_aux = 'last-date-received'

ignore "Authentication-Results:"
ignore "DomainKey-Signature:"
ignore "DKIM-Signature:"
hdr_order Date From To Cc Content-Type

You'll note that the only real tweak I had to make was set content_type = "text/html". This was necessary because if the content type of the email were "text/plain", Blogger would replace all of my angle brackets with &lt; and &gt; and turn the URLs into hyperlinks. How to set the content type is very poorly documented, and I only really found out because I went onto the IRC channel and asked the developers directly. It is a little annoying, and not at all obvious, that the way to set the 'Content-Type' header is to set the variable 'content_type'.
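
To make the problem concrete, here is a rough sketch of what that escaping does to a line of markup. It only approximates the effect described above and is not Blogger's actual code; the function name escape_angle_brackets is made up for the illustration.

```shell
# Hypothetical helper: mimic the angle-bracket escaping that a text/plain
# post is described as receiving. Reads text on stdin, writes to stdout.
escape_angle_brackets() {
    # In a sed replacement, a bare & means "the matched text",
    # so a literal ampersand has to be written as \&.
    sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```

Piping an anchor tag such as <a href='http://example.org/'>link</a> through this function yields text that a browser shows as literal markup instead of a clickable link, which is why the posts had to go out as text/html.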

Once I got the content type set correctly, I found out that revision 1.5.17-1 had an incredibly annoying bug where it segfaulted whenever you tried to send an email using stdin. I was able to sync to revision 1.5.17-2, but until that's put into the Debian testing branch I have to keep an eye on that file in my apt repository.
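
For reference, sending a post then comes down to piping HTML into mutt. The following is a minimal sketch, not the exact command used here: post_html and the MUTT override are names invented for the example, and it sets the content type on the command line with mutt's -e option as an alternative to putting it in the configuration file.

```shell
# Hypothetical helper: post_html SUBJECT RECIPIENT
# Reads an HTML body on stdin and hands it to mutt for sending.
# MUTT may be pointed at a stub for a dry run; it defaults to the real mutt.
post_html() {
    # -e runs a configuration command after the init files are read;
    # -s sets the subject; the last argument is the recipient address.
    "${MUTT:-mutt}" -e 'set content_type=text/html' -s "$1" "$2"
}
```

Usage would look something like `generategame | post_html "Round 12" someone@example.com`, where the address is a placeholder for the blog's mail-to-post address.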

There were two problems with the Wikipedia script:

  • wget creates a file whenever it's run, which was polluting my file system
  • the name in the anchor tag isn't a true title

The first problem was easy; a simple rm took care of it. Before removing the file, though, I pull the real title out of its title tag, which takes care of the second problem.


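echo "This round's challenge:<br>"
for ((i=0;i<2;++i))
do
    # wget -nv logs the final URL and the saved file name; capture both
    wgetout=($(wget -nv 2>&1 \
      | sed 's/^.* URL:\([^ ]*\) .* "\([^\"]*\)".*$/\1 \2/'))
    echo -n "<a href='${wgetout[0]}'>"
    # use the page's <title>, minus the site suffix, as the link text
    sed -n "/<title>.*<\/title>/s/.*<title>\(.*\) - Wikipedia, the free encyclopedia<\/title>.*/\1/p" "${wgetout[1]}"
    echo "</a><br>"
    # delete the downloaded page so it doesn't pile up on disk
    rm "${wgetout[1]}"
done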
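
The two sed expressions are easier to follow when tried on canned input. Here is a small sketch that wraps each one in a function; the function names are invented for the example, and the sample lines in the usage note are made up in the shape the expressions expect, not captured from a real run.

```shell
# Hypothetical helper: reduce one line of `wget -nv` output to "URL FILENAME".
# Expects the URL after "URL:" and the saved file name in double quotes.
parse_wget_line() {
    sed 's/^.* URL:\([^ ]*\) .* "\([^"]*\)".*$/\1 \2/'
}

# Hypothetical helper: print the article title found in HTML read from stdin,
# dropping the " - Wikipedia, the free encyclopedia" suffix.
extract_title() {
    sed -n 's/.*<title>\(.*\) - Wikipedia, the free encyclopedia<\/title>.*/\1/p'
}
```

Feeding parse_wget_line a line like `2008-01-16 00:00:01 URL:http://example.org/wiki/Six_degrees [1234/1234] -> "Six_degrees" [1]` prints `http://example.org/wiki/Six_degrees Six_degrees`. Note that the array assignment in the script splits on whitespace, so a saved file name containing a space would not survive it.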

One final piece that I needed was to keep track of the round number. To this end, the file ~/round holds the number of the next round, and is incremented in the crontab line.

0 0 * * 1,3,5 round=$(cat ~/round);/home/sixdegreesofwikipedia/bin/generategame | mutt ******* -s "Round $round";echo $(($round + 1)) > ~/round
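
If the crontab line gets any longer, the counter handling could move into a small function. This is only a sketch under my own naming (next_round is hypothetical), and it differs from the line above in one respect: it bumps the counter as soon as the number is read, not after the mail has been sent.

```shell
# Hypothetical helper: next_round FILE
# Prints the round number stored in FILE, then stores that number plus one.
next_round() {
    # Give up without touching the file if it cannot be read.
    round=$(cat "$1") || return 1
    printf '%s\n' "$round"
    printf '%s\n' "$((round + 1))" > "$1"
}
```

With that in place, `round=$(next_round ~/round)` could stand in for the cat and echo pair in the crontab entry.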

Although this covers the work required to get a new game generated and posted every couple of days, posting the victor from the previous day isn't complete yet. My friend Mike has written a script to verify that a list of links is correct, and I still need to write a script that collects all of the mail for a single game and passes it to his script. In the meantime, you can play the generated games at