Migrating your blog’s content to WordPress
Whether you’re starting up a new blog or thinking about migrating an existing one, WordPress is a great publishing platform to consider. Deployment is simple, the user interface is easy to use, thousands of plugins offer rich extensibility, and the user community is always there to help.
Setting up a new instance of WordPress is straightforward; the online documentation covers it well. Importing data from an existing blog, however, can be a challenge. Although WordPress supports data imports from major publishing platforms such as Blogger, Movable Type, and LiveJournal, the import process is not perfect: tags, categories, and even posts are often lost in translation. Sometimes you will have to restore your content and its metadata manually.
In this article I will show you how to import HTML content into WordPress using open source tools with a little bit of scripting to automate the process. Specifically, we will be using the Atom Publishing Protocol (APP) support in WordPress to create weblog entries via an APP client.
Step 1: Configure WordPress as an Atom Publishing Protocol server
We’ll start by configuring our target WordPress instance as an Atom Publishing Protocol server. First, log in to the WordPress administrative control panel as your administrative user. We will use WordPress’s default ‘admin’ user. Now, navigate to the ‘Writing Settings’ page under ‘Settings’ (http://wp-host.local.server/wp-admin/options-writing.php). Under the ‘Remote Publishing’ section of this page, click the box for ‘Atom Publishing Protocol’ (see Figure 1 below), then click the ‘Save Changes’ button at the bottom of the page.

Figure 1: Enabling Atom Publishing Protocol in WordPress
The APP server is now available as a service. Later we’ll create blog entries by accessing the service URL (http://wp-host.local.server/wp-app.php/posts).
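Before importing anything, it can help to confirm that the service URL answers. The following is a minimal sketch using only Ruby’s standard library. It assumes the endpoint accepts HTTP basic authentication (the import script later in this article also uses basic authentication), and the helper names build_collection_request and fetch_collection are made up for this illustration.

```ruby
require 'net/http'
require 'uri'

# Collection URL from this article; the host name is the article's placeholder.
POSTS_COLLECTION = 'http://wp-host.local.server/wp-app.php/posts'

# Build (but do not send) an authenticated GET for the posts collection.
# Assumption: the server accepts HTTP basic authentication.
def build_collection_request(url, user, pass)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request.basic_auth(user, pass)
  request['Accept'] = 'application/atom+xml'
  [uri, request]
end

# Send the request and return the Net::HTTPResponse.
def fetch_collection(url, user, pass)
  uri, request = build_collection_request(url, user, pass)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Calling fetch_collection(POSTS_COLLECTION, 'admin', 'your-password') and printing the response code is a quick check: a success status suggests the service is enabled, while an authentication or not-found error points back to the settings page in this step.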
Step 2: Generate a list of URLs pointing to content to be imported
Now we will pull down an RSS feed of the weblog entries we’d like to import from our current blog. Then we will use awk to extract the URLs for each entry and save them in a text file. The following shell script does the trick.
Listing 1: Bash script to extract URLs from RSS feed
#!/bin/bash

SOURCE_RSS_FEED='http://www.example.com/blog/rss'
URLS_TO_IMPORT='urls-to-import.txt'

lynx -source "$SOURCE_RSS_FEED" | \
awk '/<guid>/ { gsub(/<\/?guid>|[[:space:]]+/,""); print }' > "$URLS_TO_IMPORT"
In lines 3 and 4 we set up two variables: SOURCE_RSS_FEED holds the URL of the blog’s RSS feed, and URLS_TO_IMPORT names the text file that will receive the URLs extracted from that feed. Next, on line 6, we use Lynx to fetch the source of the feed at http://www.example.com/blog/rss.
Then we pipe its output through awk on line 7. In the RSS feed, the URL for each entry is embedded between a pair of <guid></guid> tags. For example,
<guid>http://www.example.com/blog/archive/2008/05/14/review-ubuntu-8-04-hardy</guid>
The awk command matches lines containing the <guid> tag, then removes both <guid></guid> tags and any whitespace, leaving the URL.
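For readers who would rather stay in Ruby for this step, the same extraction can be sketched with a regular expression. This is an illustration, not a replacement for a real feed parser: the method name extract_guids and the second sample URL are invented here, and the pattern simply assumes each entry’s URL sits inside a guid element as described above.

```ruby
# Extract the text of every <guid> element from RSS source text.
# Unlike the awk one-liner, the pattern also tolerates attributes on the tag
# (for example isPermaLink="true") and a guid that spans several lines.
def extract_guids(rss_source)
  rss_source.scan(%r{<guid[^>]*>\s*(.*?)\s*</guid>}m).flatten
end

# Small hand-written feed fragment; the second URL is hypothetical.
sample_feed = <<~RSS
  <item>
    <title>Review: Ubuntu 8.04</title>
    <guid>http://www.example.com/blog/archive/2008/05/14/review-ubuntu-8-04-hardy</guid>
  </item>
  <item>
    <guid isPermaLink="true">
      http://www.example.com/blog/archive/2008/06/01/another-entry
    </guid>
  </item>
RSS

puts extract_guids(sample_feed)
```

Writing the result to urls-to-import.txt, one URL per line, gives the same kind of file that the shell script produces.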
Step 3: Use an HTML parser to extract content from each URL
We will use the Hpricot HTML parser to read the contents of the URLs collected in Step 2. Using XPath expressions to extract specific HTML elements from each page, we populate an Atom Entry document. The Atom Entry document is then published to WordPress via an HTTP POST request.
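To make the term concrete, here is a rough sketch of what a minimal Atom Entry document looks like when assembled by hand. It is not necessarily what the atom-tools library emits; the method names xml_escape and atom_entry_xml, the sample values, and the UTC timestamp are all assumptions chosen for this illustration.

```ruby
ATOM_NS = 'http://www.w3.org/2005/Atom'
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape characters that are special in XML text and attribute values.
def xml_escape(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

# Render a minimal Atom entry. HTML content is carried as escaped text,
# which is what type="html" signals to the receiving server.
def atom_entry_xml(title:, summary:, content:, author:, timestamp:)
  <<~XML
    <?xml version="1.0" encoding="utf-8"?>
    <entry xmlns="#{ATOM_NS}">
      <title>#{xml_escape(title)}</title>
      <author><name>#{xml_escape(author)}</name></author>
      <summary type="html">#{xml_escape(summary)}</summary>
      <content type="html">#{xml_escape(content)}</content>
      <published>#{timestamp}</published>
      <updated>#{timestamp}</updated>
    </entry>
  XML
end

puts atom_entry_xml(
  title: 'Lorem ipsum dolor',
  summary: '<p>Lorem ipsum dolor sit amet.</p>',
  content: '<p>Suspendisse sodales, felis ut malesuada convallis.</p>',
  author: 'Charles Dickens',
  timestamp: '2009-01-14T10:15:00Z' # hypothetical timestamp
)
```

In the Atom Publishing Protocol, a document like this is sent as the body of the POST request with the media type application/atom+xml;type=entry.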
Listing 2 shows the kind of HTML markup used by each weblog entry that we’re importing.
The XPath expressions in Listing 3 (lines 92-96) match the HTML elements in the sample source document shown in Listing 2. You’ll need to modify the XPath expressions to match your specific HTML source document. Also, note that you’ll need to modify the regular expression used in line 42 of Listing 3, since it is specific to the format of the date/time string in the sample source document in Listing 2.
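If your source dates follow a fixed pattern, a format string can be easier to adapt than a hand-written regular expression. The sketch below is an alternative to the script’s own date parsing, not the method the script uses. It relies on Time.strptime from Ruby’s standard time library; the names ENTRY_DATE_FORMAT and parse_entry_date are invented for this example, and the format assumes dates shaped like the one in the sample markup.

```ruby
require 'time'

# Format for dates such as "January 14, 2009 at 10:15 am":
# full month name, day, year, the literal word "at", 12-hour clock, am/pm.
ENTRY_DATE_FORMAT = '%B %d, %Y at %I:%M %p'

# Parse an entry date into a Time in the local time zone.
# Raises ArgumentError when the text does not match the format.
def parse_entry_date(text)
  Time.strptime(text.strip, ENTRY_DATE_FORMAT)
end

stamp = parse_entry_date('January 14, 2009 at 10:15 am')
puts stamp.iso8601
```

To support a different source format, only the format string should need to change. The time zone offset in the printed value depends on the machine running the script.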
Listing 2: HTML markup for source weblog entry
<div>
  <h1 class='entry_title'>Lorem ipsum dolor</h1>
  <span class='entry_author_name'>Charles Dickens</span>
  <span class='entry_date'>January 14, 2009 at 10:15 am</span>
  <div class='entry_summary'>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
  </div>
  <div class='entry_body'>
    <p>Suspendisse sodales, felis ut malesuada convallis, magna dui
       euismod quam, in dictum turpis lectus a nisl. Cras interdum
       luctus enim.</p>
    <p>Proin rhoncus, dolor eget pulvinar sodales, erat metus pulvinar
       nunc, eu molestie sem magna eu quam.</p>
  </div>
</div>
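When adapting the XPath expressions to your own markup, it can be handy to try them against a saved sample first. The sketch below uses REXML, which ships with Ruby, as a stand-in for Hpricot. REXML only accepts well-formed XML, so it works on a tidy fragment like the sample here but may reject real-world HTML that Hpricot would tolerate. The helper name first_text is made up for this example.

```ruby
require 'rexml/document'

# Trimmed copy of the sample entry markup.
sample_html = <<~HTML
  <div>
    <h1 class='entry_title'>Lorem ipsum dolor</h1>
    <span class='entry_author_name'>Charles Dickens</span>
    <span class='entry_date'>January 14, 2009 at 10:15 am</span>
    <div class='entry_summary'><p>Lorem ipsum dolor sit amet.</p></div>
    <div class='entry_body'><p>Suspendisse sodales.</p></div>
  </div>
HTML

# Return the text of the first element matching an XPath expression, or nil.
def first_text(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

doc = REXML::Document.new(sample_html)
puts first_text(doc, "//h1[@class='entry_title']")
puts first_text(doc, "//span[@class='entry_author_name']")
puts first_text(doc, "//span[@class='entry_date']")
```

An expression that returns nil here would most likely come back empty in the import script as well, so it is worth fixing before a full run.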
The Ruby script in Listing 3 reads the text file (urls-to-import.txt) created by Listing 1, which contains the URLs of the weblog entries we want to import. Next, we parse the HTML document at each URL using Hpricot to extract the HTML elements that match our XPath expressions (lines 92-96). With the parsed results, we set up an Atom Entry document. Finally, we create a new blog post on our target WordPress instance (wp-host.local.server) by publishing the Atom Entry document via an HTTP POST request.
Listing 3: Ruby script to import HTML content into WordPress using Hpricot and APP
#!/usr/bin/env ruby

require 'date'
require 'time'                # Time#iso8601
require 'rubygems'
require 'open-uri'
require 'net/http'
require 'hpricot'
require 'atom/entry'          # sudo gem install atom-tools
require 'atom/collection'

# helper methods
class Helpers
  class << self
    def create_document args
      entry = Atom::Entry.new
      entry.title = args[:title]
      entry.summary = args[:summary]
      entry.content = args[:content]
      entry.content.type = args[:content_type]
      entry.published = args[:time_stamp]
      entry.updated = args[:time_stamp]
      entry
    end

    def create_author args
      author = Atom::Author.new
      author.name = args[:name]
      author.uri = args[:uri]
      author
    end

    def create_http_request args
      req = Atom::HTTP.new
      req.user = args[:user]
      req.pass = args[:pass]
      req.always_auth = args[:always_auth]
      req
    end

    def parse_date arg
      parts = /(\w*)\s*(\d*),\s*(\d*)\s*at\s*(\d*):(\d*)\s*(\w*)/.match(arg)
      mon, day, year, hour, minute, ampm = parts.captures
      month = Date::MONTHNAMES.index(mon)

      # convert to 24-hour notation
      if ampm.downcase == 'pm'
        if hour.to_i < 12
          hour = (hour.to_i + 12).to_s
        end
      else
        if hour.to_i == 12
          hour = '00'
        end
      end

      # Time object from parsed date
      Time.local(year.to_i, month, day.to_i, hour.to_i, minute.to_i)
    end

    # replace newlines/carriage returns with spaces
    def cleanup(str)
      str.gsub(/\r?\n/, ' ').strip
    end
  end
end

def method_missing(*args)
  m = args.shift
  Helpers.send m, *args
end

Urls_To_Import = "urls-to-import.txt"
Blog_Host = "wp-host.local.server"
Blog_Uri = "http://#{Blog_Host}"
Base = "http://#{Blog_Host}/wp-app.php"

Authors = {
  'Charles Dickens' => {'user' => 'cdickens', 'password' => 'secret'}
}

urls = Array.new

# read in URLs from text file
urls = File.readlines(Urls_To_Import).map { |line| line.chomp }

# parse each HTML document from list of URLs
urls.each { |target|
  doc = Hpricot(open(target))   # Ruby 3.0+ requires URI.open(target)

  # extract HTML within element matching XPath expression
  hTitle = cleanup((doc/"div/h1[@class='entry_title']").inner_html)
  hAuthor = cleanup((doc/"div/span[@class='entry_author_name']").inner_html)
  hDatestr = cleanup((doc/"div/span[@class='entry_date']").inner_html)
  hExcerpt = cleanup((doc/"div[@class='entry_summary']").inner_html)
  hContent = cleanup((doc/"div[@class='entry_body']").inner_html)

  # Atom Author element
  author = create_author :name=>hAuthor, :uri=>Blog_Uri

  # Atom Entry document
  entry = create_document :title=>hTitle, :summary=>hExcerpt, \
    :content=>hContent, :content_type=>"html", \
    :time_stamp=>parse_date(hDatestr).iso8601
  entry.authors << author

  # Atom HTTP request
  http_req = create_http_request :user=>Authors[hAuthor]['user'], \
    :pass=>Authors[hAuthor]['password'], :always_auth=>:basic

  # Atom Collection
  c = Atom::Collection.new(Base + "/posts", http_req)
  res = c.post! entry

  puts "Imported URL: #{target}, #{res.message}\n"
}
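One fragile spot in a script like this is the author lookup: if a page names an author who is not in the credentials table, the lookup returns nil and the run stops with an unhelpful error. A small guard, sketched below, makes that failure easier to diagnose. The names AUTHOR_ACCOUNTS and credentials_for are hypothetical, and the table simply mirrors the shape of the one in the import script.

```ruby
# Same shape as the author table in the import script.
AUTHOR_ACCOUNTS = {
  'Charles Dickens' => { 'user' => 'cdickens', 'password' => 'secret' }
}.freeze

# Look up the WordPress login for an author name scraped from a page.
# Raises KeyError with a readable message when the author is not mapped.
def credentials_for(author_name, accounts = AUTHOR_ACCOUNTS)
  accounts.fetch(author_name.strip) do |name|
    raise KeyError, "no WordPress account configured for author #{name.inspect}"
  end
end

login = credentials_for('Charles Dickens')
puts login['user']
```

Inside the import loop, rescuing KeyError would let you log the offending URL and continue with the next entry instead of aborting the whole run.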
With this article I’ve shown you how to import content into WordPress using its Atom Publishing Protocol interface. With the help of open source tools like Lynx, awk, Hpricot, and Ruby, we’ve been able to convert a sample HTML document into a WordPress blog post with minimal effort.

Copyright © 2009 Technetra. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.