Technetra

Migrating your blog’s content to WordPress

Nilayan Sharma,  February 24th, 2009 at 3:15 am

Whether you’re starting up a new blog or thinking about migrating an existing one, WordPress is a great publishing platform to consider. Deployment is simple, the user interface is easy to use, thousands of plugins offer rich extensibility, and the user community is always there to help.

Setting up a new instance of WordPress is straightforward; the online documentation covers it. Importing data from an existing blog, however, can be a challenge. Although WordPress supports imports from major publishing platforms such as Blogger, Movable Type, LiveJournal, and others, the import process is not perfect: tags, categories, and even posts are often lost in translation. Sometimes you will have to restore your content and its metadata manually.

In this article I will show you how to import HTML content into WordPress using open source tools with a little bit of scripting to automate the process. Specifically, we will be using the Atom Publishing Protocol (APP) support in WordPress to create weblog entries via an APP client.

Step 1: Configure WordPress as an Atom Publishing Protocol server

We’ll start by configuring our target WordPress instance as an Atom Publishing Protocol server. First, log in to the WordPress administrative control panel as your administrative user. We will use WordPress’s default ‘admin’ user. Now, navigate to the ‘Writing Settings’ page under ‘Settings’ (http://wp-host.local.server/wp-admin/options-writing.php). Under the ‘Remote Publishing’ section of this page, click the box for ‘Atom Publishing Protocol’ (see Figure 1 below), then click the ‘Save Changes’ button at the bottom of the page.

Figure 1: Enabling Atom Publishing Protocol in WordPress


The APP server is now available as a service. Later we’ll create blog entries by accessing the service URL (http://wp-host.local.server/wp-app.php/posts).
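As a rough preview of what accessing that service URL involves at the HTTP level, the sketch below builds (but does not send) an authenticated POST request using only Ruby's standard library. The helper name build_app_post and the 'admin'/'secret' credentials are placeholders invented for illustration; the media type application/atom+xml;type=entry is the one the Atom Publishing Protocol defines for entries.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build, but do not send, an authenticated APP request.
def build_app_post(collection_url, user, password, atom_xml)
  uri = URI.parse(collection_url)
  req = Net::HTTP::Post.new(uri.request_uri)
  req.basic_auth(user, password)                        # HTTP Basic authentication
  req['Content-Type'] = 'application/atom+xml;type=entry'
  req.body = atom_xml
  [uri, req]
end

# Placeholder credentials and an empty entry, for illustration only.
uri, req = build_app_post('http://wp-host.local.server/wp-app.php/posts',
                          'admin', 'secret',
                          '<entry xmlns="http://www.w3.org/2005/Atom"/>')
puts "#{req.method} #{uri}"

# To actually send the request:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```

The script later in this article lets the atom-tools gem handle these details, so this is only meant to show what travels over the wire.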

Step 2: Generate a list of URLs pointing to content to be imported

Now we will pull down an RSS feed of the weblog entries we’d like to import from our current blog. Then we will use awk to extract the URLs for each entry and save them in a text file. The following shell script does the trick.

Listing 1: Bash script to extract URLs from RSS feed

#!/bin/bash
 
SOURCE_RSS_FEED='http://www.example.com/blog/rss'
URLS_TO_IMPORT='urls-to-import.txt'
 
lynx -source "$SOURCE_RSS_FEED" | \
awk '/<guid>/ { gsub(/<\/?guid>|[[:space:]]*/,""); print }' > "$URLS_TO_IMPORT"

In lines 3 and 4 we set up two variables: $SOURCE_RSS_FEED points to the blog’s RSS feed, and $URLS_TO_IMPORT names the text file that will hold the URLs extracted from that feed. Next, on line 6, we use Lynx to fetch the raw feed from http://www.example.com/blog/rss.

Then we pipe its output through awk on line 7. In the RSS feed, the URL for each entry is embedded between a pair of <guid></guid> tags. For example,

<guid>http://www.example.com/blog/archive/2008/05/14/review-ubuntu-8-04-hardy</guid>

The awk command matches lines containing the <guid> tag, then removes both <guid></guid> tags and any whitespace, leaving the URL.
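If you would rather stay in Ruby for this step, the same extraction can be sketched with a regular expression. This is a minimal illustration, not part of the original workflow: the helper name extract_guids and the second sample URL are made up, and unlike the awk one-liner it also tolerates attributes on the tag and a URL that sits on its own line.

```ruby
# Hypothetical helper: return the text of every <guid> element in RSS text.
def extract_guids(rss_text)
  # [^>]* allows attributes such as isPermaLink; the m flag lets . span newlines.
  rss_text.scan(%r{<guid[^>]*>\s*(.*?)\s*</guid>}m).flatten
end

# Sample feed fragment; the second URL is invented for illustration.
sample = <<~RSS
  <item>
    <guid>http://www.example.com/blog/archive/2008/05/14/review-ubuntu-8-04-hardy</guid>
  </item>
  <item>
    <guid isPermaLink="true">
      http://www.example.com/blog/archive/2008/05/15/second-post
    </guid>
  </item>
RSS

puts extract_guids(sample)
```

A regular expression is adequate for a well-behaved feed; for anything irregular, a real XML parser would be the safer choice.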

Step 3: Use an HTML parser to extract content from each URL

We will use the Hpricot HTML parser to read the contents of each URL collected in Step 2. XPath expressions extract specific HTML elements from each page, and we use those elements to populate an Atom Entry document. The Atom Entry document is then published to WordPress via an HTTP POST request.

Listing 2 shows the kind of HTML markup used by each weblog entry we’re importing.

The XPath expressions in Listing 3 (lines 92-96) match the HTML elements in the sample source document shown in Listing 2. You’ll need to modify the XPath expressions to match your specific HTML source document. Also, note that you’ll need to modify the regular expression used in line 42 of Listing 3, since it is specific to the format of the date/time string in the sample source document in Listing 2.

Listing 2: HTML markup for source weblog entry

<div>
  <h1 class='entry_title'>Lorem ipsum dolor</h1>
  <span class='entry_author_name'>Charles Dickens</span>
  <span class='entry_date'>January 14, 2009 at 10:15 am</span>
 
  <div class='entry_summary'>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
  </div>
 
  <div class='entry_body'>
    <p>Suspendisse sodales, felis ut malesuada convallis, magna
    dui euismod quam, in dictum turpis lectus a nisl. Cras interdum
    luctus enim.</p>
    <p>Proin rhoncus, dolor eget pulvinar sodales, erat metus
    pulvinar nunc, eu molestie sem magna eu quam.</p>
  </div>
</div>

The Ruby script in Listing 3 reads in the text file (urls-to-import.txt) created in Listing 1. As already mentioned, this text file contains the URLs for the weblog entries we want to import. Next, we parse the HTML document at each URL using Hpricot to extract the HTML elements that match our XPath expressions (lines 92-96). With the parsed results, we set up an Atom Entry document. Finally, we create a new blog post on our target WordPress instance (wp-host.local.server) by publishing the Atom Entry document via an HTTP POST request.
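For orientation, the payload that ends up in the POST body is an Atom Entry document along the following lines. This sketch builds one by hand with only the standard library; the helper names xml_escape and atom_entry_xml are hypothetical, and marking the summary as type="html" is an assumption based on the sample markup. The atom-tools gem generates its own serialization, so treat this as an approximation of the shape rather than its exact output.

```ruby
require 'time'

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Hypothetical helper: escape text for use inside an XML element.
def xml_escape(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical helper: render a minimal Atom Entry document as a string.
def atom_entry_xml(title:, author:, summary:, content_html:, published:)
  stamp = published.getutc.iso8601   # e.g. 2009-01-14T10:15:00Z
  <<~XML
    <?xml version="1.0" encoding="utf-8"?>
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>#{xml_escape(title)}</title>
      <author><name>#{xml_escape(author)}</name></author>
      <published>#{stamp}</published>
      <updated>#{stamp}</updated>
      <summary type="html">#{xml_escape(summary)}</summary>
      <content type="html">#{xml_escape(content_html)}</content>
    </entry>
  XML
end

puts atom_entry_xml(title: 'Lorem ipsum dolor',
                    author: 'Charles Dickens',
                    summary: '<p>Lorem ipsum dolor sit amet.</p>',
                    content_html: '<p>Suspendisse sodales.</p>',
                    published: Time.utc(2009, 1, 14, 10, 15))
```

Note that HTML carried in an element of type "html" has to be escaped, which is why the markup appears as &lt;p&gt; in the output.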

Listing 3: Ruby script to import HTML content into WordPress using Hpricot and APP

#!/usr/bin/env ruby
 
require 'date'
require 'rubygems'
require 'open-uri'
require 'net/http'
require 'hpricot'
require 'atom/entry' # sudo gem install atom-tools
require 'atom/collection'
 
# helper methods
class Helpers
  class << self
 
    def create_document args
      entry = Atom::Entry.new
      entry.title = args[:title]
      entry.summary = args[:summary]
      entry.content = args[:content]
      entry.content.type = args[:content_type] 
      entry.published = args[:time_stamp]
      entry.updated = args[:time_stamp]
      entry
    end
 
    def create_author args
      author = Atom::Author.new
      author.name = args[:name]
      author.uri = args[:uri]
      author
    end
 
    def create_http_request args
      req = Atom::HTTP.new
      req.user = args[:user]
      req.pass = args[:pass]
      req.always_auth = args[:always_auth]
      req
    end
 
    def parse_date arg
      parts = /(\w*)\s*(\d*),\s*(\d*)\s*at\s*(\d*):(\d*)\s*(\w*)/.match(arg)
      mon, day, year, hour, minute, ampm = parts.captures
      month = Date::MONTHNAMES.index(mon)
 
      # convert to 24-hour notation
      if ampm == 'pm'
        if hour.to_i < 12
          hour = (hour.to_i + 12).to_s
        end
      else
        if hour.to_i == 12
          hour = '00'
        end
      end
 
      # Time object from parsed date
      t = Time.local(year.to_i, month, day.to_i, hour.to_i, minute.to_i)
    end
 
    # replace runs of newlines/carriage returns with a single space
    def cleanup(str)
      str.gsub(/[\r\n]+/, ' ')
    end
 
  end
end 
 
def method_missing(*args)
  m = args.shift
  Helpers.send m, *args
end
 
Urls_To_Import = "urls-to-import.txt"
Blog_Host = "wp-host.local.server"
Blog_Uri = "http://#{Blog_Host}"
Base = "http://#{Blog_Host}/wp-app.php"
Authors = {
  'Charles Dickens' => {'user' => 'cdickens', 'password' => 'secret'}
}
 
urls = Array.new
 
# read in URLs from text file
urls = File.readlines(Urls_To_Import).map { |line| line.chomp }
 
# parse each HTML document from list of URLs
urls.each { |target|
  doc = Hpricot(open(target))
 
  # extract HTML within element matching XPath expression
  hTitle = cleanup((doc/"div/h1[@class='entry_title']").inner_html)
  hAuthor = cleanup((doc/"div/span[@class='entry_author_name']").inner_html)
  hDatestr = cleanup((doc/"div/span[@class='entry_date']").inner_html)
  hExcerpt = cleanup((doc/"div[@class='entry_summary']").inner_html)
  hContent = cleanup((doc/"div[@class='entry_body']").inner_html)
 
  # Atom Author element
  author = create_author :name=>hAuthor, :uri=>Blog_Uri
 
  # Atom Entry document
  entry = create_document :title=>hTitle, :summary=>hExcerpt, \
    :content=>hContent, :content_type=>"html", \
    :time_stamp=>parse_date(hDatestr).iso8601
  entry.authors << author
 
  # Atom HTTP request
  http_req = create_http_request :user=>Authors[hAuthor]['user'], \
    :pass=>Authors[hAuthor]['password'], :always_auth=>:basic
 
  # Atom Collection
  c = Atom::Collection.new(Base + "/posts", http_req)
  res = c.post! entry
 
  puts "Imported URL: #{target}, #{res.message}\n"
}
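Since the date format differs from blog to blog, one alternative to adapting the regular expression in parse_date is to describe the format with strptime directives. The sketch below assumes the same "January 14, 2009 at 10:15 am" layout as the sample markup; the names ENTRY_DATE_FORMAT and parse_entry_date are hypothetical and not part of Listing 3.

```ruby
require 'time'

# Format string for dates like "January 14, 2009 at 10:15 am".
# Adjust the directives to match your own source markup.
ENTRY_DATE_FORMAT = '%B %d, %Y at %I:%M %p'

# Hypothetical alternative to parse_date: returns a Time in the local zone,
# or raises ArgumentError if the text does not match the format.
def parse_entry_date(text)
  Time.strptime(text.strip, ENTRY_DATE_FORMAT)
end

puts parse_entry_date('January 14, 2009 at 10:15 am').iso8601
```

With this approach the 12-hour to 24-hour conversion is handled by the %I and %p directives, so switching to another blog’s date layout usually means changing only the format string.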

In this article I’ve shown you how to import content into WordPress using its Atom Publishing Protocol interface. With the help of open source tools like Lynx, awk, Hpricot, and Ruby, we’ve been able to convert a sample HTML document into a WordPress blog post with minimal effort.

Copyright © 2009 Technetra. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
