Importing txt to WordPress

After getting basic WordPress set up, I figured I should populate it with whatever material I had lying around from previous plan files, blogs, and so on. I had a pretty big set of note-like text files in either a light wiki markup or reStructuredText, so importing them seemed like a good way to give the site some real content. There’s some good stuff in there, and some lame stuff too.

After some quick poking around, the simplest solution appeared to be to export the site using its export-to-XML feature and use that file as a template for reformatting my text files for reimport. It turned out to be pretty easy to get working. The code is pretty ugly and relies on the docutils package: http://docutils.sourceforge.net/
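
To use an export as a template, it helps to see which elements each exported item carries. The following is a minimal sketch using only the standard library; `SAMPLE_WXR` is a tiny hand-written stand-in (not real export output) and `item_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written stand-in for an export file (an assumption, not real output).
SAMPLE_WXR = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example</title>
    <item>
      <title>Hello</title>
      <content:encoded>Body text</content:encoded>
      <wp:post_id>1</wp:post_id>
      <wp:post_type>post</wp:post_type>
    </item>
  </channel>
</rss>
"""

def item_fields(wxr_text):
    """Return the child tag names of the first <item>; namespaced tags appear as {uri}name."""
    root = ET.fromstring(wxr_text)
    item = root.find("./channel/item")
    return [child.tag for child in item]

for tag in item_fields(SAMPLE_WXR):
    print(tag)
```

Running the same helper on a real export should list every field the import template needs to fill in.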

I just needed the basic reStructuredText-to-HTML conversion, which this person was nice enough to document: http://stackoverflow.com/questions/6654519/python-parsing-restructuredtext-into-html There were a few additional issues, mostly removing the CSS class attributes and div wrappers so the output plays nicely with WordPress, but some quick regexps clean those up.
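
The cleanup step can be tried on its own, without docutils installed. This is a rough sketch: `strip_docutils_markup` is a hypothetical helper name, the sample string only imitates the kind of markup docutils emits, and the first pattern also eats the whitespace before `class`, which is a small variation on the regexps in the script.

```python
import re

def strip_docutils_markup(html):
    """Drop class attributes and <div> wrappers, leaving the rest of the HTML untouched."""
    # Remove class="..." together with the whitespace in front of it.
    no_classes = re.sub(r'\s*class=".*?"', '', html)
    # Remove opening and closing <div> tags but keep what they contain.
    return re.sub(r'</?div.*?>', '', no_classes)

# Hand-written sample that imitates docutils output (an assumption).
sample = '<div class="document"><p class="first">Hello <em>world</em></p></div>'
print(strip_docutils_markup(sample))  # prints <p>Hello <em>world</em></p>
```

Regexps like these are fine for a one-off conversion of known input; they are not a general HTML sanitizer.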

It appears to have basically worked: spot checks of various entries show that the conversion produced mostly readable text, so I’m happy.
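
Spot checks cover readability. A complementary check, sketched here with the standard library only, is whether the generated file parses as XML at all before handing it to the importer. `check_well_formed` is a hypothetical helper, and the snippets passed to it are made up for illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""

# Hypothetical snippets, not real export output.
print(check_well_formed("<rss><channel></channel></rss>"))
print(check_well_formed("<rss><channel></rss>"))
# A blank line before the XML declaration is also rejected by this parser.
print(check_well_formed("\n<?xml version='1.0'?><rss/>"))
```

Passing this check only means the file is well-formed; it says nothing about whether WordPress will accept every field.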

The code is below. I did not write it in a test-driven development style since I intend to use it only once, so it’s a little ugly…
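
One rough edge in the script is the pubDate field: the weekday is hard-coded as "Sat" and the month is written out in full, whereas RSS dates follow the RFC 822 style with three-letter abbreviations. The sketch below shows one way to derive the whole string from the parsed date, assuming Python 3; `rss_pub_date` is a hypothetical helper that is not part of the script.

```python
import datetime
from email.utils import format_datetime

def rss_pub_date(year, month, day):
    """Format midnight UTC on the given date in the RFC 822 style that RSS uses."""
    stamp = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return format_datetime(stamp)

print(rss_pub_date(2012, 10, 4))  # prints Thu, 04 Oct 2012 00:00:00 +0000
```

For a one-time import the placeholder weekday may well be harmless, so this is a nicety rather than a required fix.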

#!python
from docutils.core import publish_parts
import re, glob

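def format_file(post_id, filename):
    """Return one WXR <item> element for the text file at filename."""

    fh = open(filename)

    # Defaults, used when the filename carries no date or title.
    title = "notes-"+str(post_id)
    date = ""
    year = 0
    month = 0
    month_string = ""
    day = 0  # without this default the template fails when no date is found
    day_of_week = "Sat"  # placeholder: the weekday is not computed from the date
    text = ""
    html = ""
    site_url = "http://www.3cats.us/blog"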

    # Look for a date in the filename: year, full month name, then day.
    match = re.search(r'(?P<year>\d+)\s+(?P<month_string>\w+)\s+(?P<day>\d+)', filename)
    if match:
        year = int(match.group('year'))
        month_string = match.group('month_string')
        # Only full English month names are recognized; anything else raises KeyError.
        month = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,
                 'July':7,'August':8,'September':9,'October':10,'November':11,
                 'December':12}[month_string]
        day = int(match.group('day'))

    # The title is whatever follows " - " up to the .txt extension.
    match = re.search(r'\s-\s+(.*)\.[tT][xX][tT]', filename)
    if match:
        title = match.group(1)

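    # Build the URL slug by collapsing runs of whitespace into hyphens.
    title_no_whitespace = re.sub(r'\s+', '-', title)

    text = fh.read()
    fh.close()
    # Convert reStructuredText to HTML, then strip the class attributes and
    # <div> wrappers that docutils adds.
    html = publish_parts(text, writer_name='html')['html_body']
    striped_html = re.sub(r'class=".*?"', '', str(html))
    striped_html = re.sub(r'</?div.*?>', '', striped_html)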

    return """<item>
    <title>%(title)s</title>
    <link>%(site_url)s/%(year)d/%(month)d/%(title_no_whitespace)s/</link>
    <pubDate>%(day_of_week)s, %(day)2.2d %(month_string)s %(year)d 00:00:00 +0000</pubDate>
    <dc:creator>Mike</dc:creator>
    <guid isPermaLink="false">%(site_url)s/?p=%(post_id)d</guid> <description/>
    <content:encoded>
%(striped_html)s
    </content:encoded>
    <excerpt:encoded>
    <![CDATA[]]>
    </excerpt:encoded>
    <wp:post_id>%(post_id)d</wp:post_id>
    <wp:post_date>%(year)d-%(month)2.2d-%(day)2.2d 00:00:00</wp:post_date>
    <wp:post_date_gmt>%(year)d-%(month)2.2d-%(day)2.2d 00:00:00</wp:post_date_gmt>
    <wp:comment_status>closed</wp:comment_status>
    <wp:ping_status>closed</wp:ping_status>
    <wp:post_name>%(title)s</wp:post_name>
    <wp:status>publish</wp:status>
    <wp:post_parent>0</wp:post_parent>
    <wp:menu_order>0</wp:menu_order>
    <wp:post_type>post</wp:post_type>
    <wp:post_password/>
    <wp:is_sticky>0</wp:is_sticky>
    <category nicename="uncategorized" domain="category">
    <![CDATA[Uncategorized]]>
    </category>
</item>
"""% vars()

if __name__ == '__main__':

    # Channel header taken from the site's own export file. The XML declaration
    # has to be the very first thing in the output, so the string starts with it.
    print("""<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="WordPress/3.4.2" created="2012-10-04 17:36" -->
<rss xmlns:wp="http://wordpress.org/export/1.2/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" version="2.0">
<channel>
<title>3Cats.us</title>
<link>http://www.3cats.us/blog</link>
<description>4 Kids, 3 Cats, 2 Parents, 1 God</description>
<pubDate>Thu, 04 Oct 2012 17:36:15 +0000</pubDate>
<language>en-US</language>
<wp:wxr_version>1.2</wp:wxr_version>
<wp:base_site_url>http://www.3cats.us/blog</wp:base_site_url>
<wp:base_blog_url>http://www.3cats.us/blog</wp:base_blog_url>
<wp:author><wp:author_id>1</wp:author_id><wp:author_login>admin</wp:author_login><wp:author_email>mikem@3cats.us</wp:author_email><wp:author_display_name><![CDATA[admin]]></wp:author_display_name><wp:author_first_name><![CDATA[]]></wp:author_first_name><wp:author_last_name><![CDATA[]]></wp:author_last_name></wp:author>
<wp:author><wp:author_id>2</wp:author_id><wp:author_login>Mike</wp:author_login><wp:author_email>mike@3cats.us</wp:author_email><wp:author_display_name><![CDATA[Mike]]></wp:author_display_name><wp:author_first_name><![CDATA[Mike]]></wp:author_first_name><wp:author_last_name><![CDATA[Miller]]></wp:author_last_name></wp:author>
<generator>http://wordpress.org/?v=3.4.2</generator>
""")

    # Sort the paths so post IDs are assigned in a repeatable order.
    files = sorted(glob.glob('./Log/*'))
    for post_id, path in enumerate(files, start=1):
        print(format_file(post_id, path))
    print("""
</channel> </rss>
""")