Archives Imported, Neat Perl Script

With the help of Morbus Iff, I was able to scrape my years of archives and get them into a format that Movable Type (MT) could import. I thought about assigning every entry its own ID number, but instead opted to treat each day’s entries as a single entry. This reduces the load on MT when rebuilding. Even so, with nearly 1,000 entries in the database, it takes a full 20-30 minutes to rebuild all of the individual archives.

For those who are interested, I’ve written up the process Morbus and I used to get my archives into MT.

To start, read the MT import instructions and get familiar with the formatting and syntax it expects. You need to understand this format so that the Perl script can output a correctly formatted file for the Import/Export utility in MT to read.
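As a rough illustration of what the script has to emit, here is a small Python sketch (not the Perl script itself) that formats one entry using MT's import conventions: `KEY: value` header lines, `-----` between fields, and `--------` between entries. The entry data below is made up for illustration.

```python
def format_mt_entry(title, author, date, body):
    """Return one entry in Movable Type's import format.

    MT separates multi-line fields with '-----' and whole
    entries with '--------'.
    """
    return (
        f"TITLE: {title}\n"
        f"AUTHOR: {author}\n"
        f"DATE: {date}\n"
        "-----\n"
        "BODY:\n"
        f"{body}\n"
        "-----\n"
        "--------\n"
    )

# Hypothetical entry data, just to show the shape of the output.
entry = format_mt_entry(
    "Archives Imported",
    "Cameron Barrett",
    "05/20/2003 12:15:00 AM",  # MT dates look like MM/DD/YYYY HH:MM:SS AM/PM
    "Entry text goes here.",
)
print(entry)
```

Since each day's entries become a single MT entry, the real script concatenates a whole day's posts into one `BODY` before emitting a block like this.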

The Perl script that is being discussed is here. It can be easily modified to scrape any site that has flat-file archives stored in a uniform format.

Once you have modified the Perl script (get some help if you need it) to match the format of your flat-file archives, change the full path to the directory where your archive files are stored. For instance, my 2002 flat-file archives (as an example) were located in:

/usr/www/users/username/includes/archives/2002/

This path can be changed for every directory you want to scrape each time you run the script. Because I did not want to overload my server I scraped one year at a time, changing the path to match once I had successfully output a formatted MT import file for each year.
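The overall loop is simple enough to sketch. Below is a Python approximation of the idea (again, not the actual Perl script): read each .html file in one year's directory, parse it into MT-formatted entries, and pause between files. The directory path, filename pattern, and Latin-1 encoding are assumptions you would adapt to your own archives.

```python
import glob
import os
import time

def convert_archives(archive_dir, delay=1.0):
    """Read each .html file in archive_dir, pausing between reads
    so the server isn't overloaded. Returns the raw pages; the real
    script would parse each page and print an MT-formatted entry."""
    pages = []
    for path in sorted(glob.glob(os.path.join(archive_dir, "*.html"))):
        with open(path, encoding="latin-1") as f:  # older archives are often Latin-1
            pages.append(f.read())
        # ...parse the page and print the MT import block here...
        time.sleep(delay)
    return pages

# Change this path for each year you scrape, then re-run:
# convert_archives("/usr/www/users/username/includes/archives/2002/")
```

Running it once per year, as described above, keeps each run short and gives you one import file per directory to check before moving on.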

To run this script, go into your shell and run the following command from the directory you have saved the Perl script to:

perl converter.pl > mtoutput.txt

Because there is a one-second delay between reading each .html file (so as not to overload the server), this may take a few minutes. When the script is done running, you should have a file in the same directory called mtoutput.txt.

Move this file to a directory inside your MT install called /import. Then, go to the MT Import/Export utility and configure it. For my site, I chose to check “Import entries as me”. Choose a category and a post status, and then click the “Import” button. If everything is correct, you will see the utility print to screen the status of each entry it has successfully imported (look for an ID number and the word “ok”).

Make sure there are no other files inside the /import directory or the MT Import/Export utility will try to read them as well and most likely generate errors.

Repeat this process for each archive directory you want to scrape. Note that you should delete mtoutput.txt (or move it somewhere safe as a backup) once you have imported it into MT, or you will end up with duplicate entries the next time you run the Import/Export utility.
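One way to handle that, with hypothetical paths, is to archive each year's import file after it has been imported rather than leaving it in /import:

```shell
# Keep the imported file as a backup instead of leaving it in /import,
# where the Import/Export utility would re-read it on the next run.
mkdir -p backups
mv mtoutput.txt backups/mtoutput-2002.txt
```

Rename the backup for each year so you can re-import a single year later if something goes wrong.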

Posted by Cameron Barrett at May 20, 2003 12:15 AM
