{"id":135,"date":"2012-10-08T15:12:02","date_gmt":"2012-10-08T22:12:02","guid":{"rendered":"http:\/\/www.3cats.us\/blog\/?p=135"},"modified":"2012-10-08T15:13:33","modified_gmt":"2012-10-08T22:13:33","slug":"importing-txt-to-wordpress","status":"publish","type":"post","link":"https:\/\/www.3cats.us\/blog\/2012\/10\/importing-txt-to-wordpress\/","title":{"rendered":"Importing txt to WordPress"},"content":{"rendered":"<p>After getting a basic WordPress install set up, I figured that I should populate it with whatever material I had around from previous plan\/blog\/etc. efforts. I had a pretty big set of note-like material in text with either a light wiki or reStructuredText markup, so I figured that importing it would be a good way to give the site some real content. There&#8217;s some good stuff, and some lame stuff too.<\/p>\n<p>After some quick poking around, the simplest solution appeared to be to export the site using its export-to-XML feature and use that export as a template for reformatting my text files for reimport. Turns out it was pretty easy to get working. 
The code is pretty ugly and relies on the docutils package: <a href=\"http:\/\/docutils.sourceforge.net\/\" target=\"_blank\">http:\/\/docutils.sourceforge.net\/<\/a><\/p>\n<p>I just needed the basic reStructuredText-to-HTML conversion, which this person was nice enough to document: <a href=\"http:\/\/stackoverflow.com\/questions\/6654519\/python-parsing-restructuredtext-into-html\" target=\"_blank\">http:\/\/stackoverflow.com\/questions\/6654519\/python-parsing-restructuredtext-into-html<\/a> There were a few additional issues, mostly stripping the CSS class attributes and div wrappers from the generated HTML to make it play nice, but some quick regexps clean those up.<\/p>\n<p>It appears to have basically worked: spot checks of various entries show that the conversion produced mostly readable text, so I&#8217;m happy.<\/p>\n<p>The code is below. I did not write it in Test Driven Development style since I intend to use it only once, so it&#8217;s a little ugly&#8230;<\/p>\n<pre>#!python\r\nfrom docutils.core import publish_parts\r\nimport re, glob\r\n\r\ndef format_file(post_id, filename):\r\n    # Build one WXR &lt;item&gt; element from a single text file.\r\n    fh = open(filename)\r\n\r\n    # Defaults, used when the filename carries no date or title.\r\n    title = \"notes-\"+str(post_id)\r\n    date = \"\"\r\n    year = 0\r\n    month = 0\r\n    day = 0\r\n    month_string = \"\"\r\n    day_of_week = \"Sat\"\r\n    text = \"\"\r\n    html = \"\"\r\n    site_url = \"http:\/\/www.3cats.us\/blog\"\r\n\r\n    # Pull year, month name and day out of the filename (e.g. 2012 October 08).\r\n    match = re.search(r'(?P&lt;year&gt;\\d+)\\s+(?P&lt;month_string&gt;\\w+)\\s+(?P&lt;day&gt;\\d+)', filename)\r\n    if match:\r\n        year = int(match.groupdict()['year'])\r\n        month_string = match.groupdict()['month_string']\r\n        month = 
{'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}[month_string]\r\n        day = int(match.groupdict()['day'])\r\n\r\n    # The post title is whatever follows the ' - ' separator, minus the .txt extension.\r\n    match = re.search(r'\\s\\-\\s+(.*)\\.[tT][xX][tT]', filename)\r\n    if match:\r\n        title = match.group(1)\r\n\r\n    title_no_whitespace = re.sub(r'\\s+', '-', title)\r\n\r\n    text = fh.read()\r\n    fh.close()\r\n    html = publish_parts(text, writer_name='html')['html_body']\r\n    # Strip the class attributes and div wrappers from the docutils output.\r\n    stripped_html = re.sub('class=\".*?\"', '', str(html))\r\n    stripped_html = re.sub(r'&lt;\/?div.*?&gt;', '', stripped_html)\r\n\r\n    # pubDate leaves out the optional weekday and abbreviates the month name.\r\n    return \"\"\"&lt;item&gt;\r\n    &lt;title&gt;%(title)s&lt;\/title&gt;\r\n    &lt;link&gt;%(site_url)s\/%(year)d\/%(month)2.2d\/%(title_no_whitespace)s\/&lt;\/link&gt;\r\n    &lt;pubDate&gt;%(day)2.2d %(month_string).3s %(year)d 00:00:00 +0000&lt;\/pubDate&gt;\r\n    &lt;dc:creator&gt;Mike&lt;\/dc:creator&gt;\r\n    &lt;guid isPermaLink=\"false\"&gt;%(site_url)s\/?p=%(post_id)d&lt;\/guid&gt; &lt;description\/&gt;\r\n    &lt;content:encoded&gt;&lt;![CDATA[%(stripped_html)s]]&gt;&lt;\/content:encoded&gt;\r\n    &lt;excerpt:encoded&gt;\r\n    &lt;![CDATA[]]&gt;\r\n    &lt;\/excerpt:encoded&gt;\r\n    &lt;wp:post_id&gt;%(post_id)d&lt;\/wp:post_id&gt;\r\n    &lt;wp:post_date&gt;%(year)d-%(month)2.2d-%(day)2.2d 00:00:00&lt;\/wp:post_date&gt;\r\n    &lt;wp:post_date_gmt&gt;%(year)d-%(month)2.2d-%(day)2.2d 00:00:00&lt;\/wp:post_date_gmt&gt;\r\n    
&lt;wp:comment_status&gt;closed&lt;\/wp:comment_status&gt;\r\n    &lt;wp:ping_status&gt;closed&lt;\/wp:ping_status&gt;\r\n    &lt;wp:post_name&gt;%(title_no_whitespace)s&lt;\/wp:post_name&gt;\r\n    &lt;wp:status&gt;publish&lt;\/wp:status&gt;\r\n    &lt;wp:post_parent&gt;0&lt;\/wp:post_parent&gt;\r\n    &lt;wp:menu_order&gt;0&lt;\/wp:menu_order&gt;\r\n    &lt;wp:post_type&gt;post&lt;\/wp:post_type&gt;\r\n    &lt;wp:post_password\/&gt;\r\n    &lt;wp:is_sticky&gt;0&lt;\/wp:is_sticky&gt;\r\n    &lt;category nicename=\"uncategorized\" domain=\"category\"&gt;\r\n    &lt;![CDATA[Uncategorized]]&gt;\r\n    &lt;\/category&gt;\r\n&lt;\/item&gt;\r\n\"\"\" % vars()\r\n\r\nif __name__ == '__main__':\r\n\r\n    # The XML declaration has to be the very first thing in the output.\r\n    print(\"\"\"&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;!-- generator=\"WordPress\/3.4.2\" created=\"2012-10-04 17:36\" --&gt;\r\n&lt;rss xmlns:wp=\"http:\/\/wordpress.org\/export\/1.2\/\" xmlns:dc=\"http:\/\/purl.org\/dc\/elements\/1.1\/\" xmlns:wfw=\"http:\/\/wellformedweb.org\/CommentAPI\/\" xmlns:content=\"http:\/\/purl.org\/rss\/1.0\/modules\/content\/\" xmlns:excerpt=\"http:\/\/wordpress.org\/export\/1.2\/excerpt\/\" version=\"2.0\"&gt; &lt;channel&gt; &lt;title&gt;3Cats.us&lt;\/title&gt; &lt;link&gt;http:\/\/www.3cats.us\/blog&lt;\/link&gt; &lt;description&gt;4 Kids, 3 Cats, 2 Parents, 1 God&lt;\/description&gt; &lt;pubDate&gt;Thu, 04 Oct 2012 17:36:15 +0000&lt;\/pubDate&gt; &lt;language&gt;en-US&lt;\/language&gt; &lt;wp:wxr_version&gt;1.2&lt;\/wp:wxr_version&gt; &lt;wp:base_site_url&gt;http:\/\/www.3cats.us\/blog&lt;\/wp:base_site_url&gt; &lt;wp:base_blog_url&gt;http:\/\/www.3cats.us\/blog&lt;\/wp:base_blog_url&gt; 
&lt;wp:author&gt;&lt;wp:author_id&gt;1&lt;\/wp:author_id&gt;&lt;wp:author_login&gt;admin&lt;\/wp:author_login&gt;&lt;wp:author_email&gt;mikem@3cats.us&lt;\/wp:author_email&gt;&lt;wp:author_display_name&gt;\r\n&lt;![CDATA[admin]]&gt;\r\n&lt;\/wp:author_display_name&gt;&lt;wp:author_first_name&gt;\r\n&lt;![CDATA[]]&gt;\r\n&lt;\/wp:author_first_name&gt;&lt;wp:author_last_name&gt;\r\n&lt;![CDATA[]]&gt;\r\n&lt;\/wp:author_last_name&gt;&lt;\/wp:author&gt; &lt;wp:author&gt;&lt;wp:author_id&gt;2&lt;\/wp:author_id&gt;&lt;wp:author_login&gt;Mike&lt;\/wp:author_login&gt;&lt;wp:author_email&gt;mike@3cats.us&lt;\/wp:author_email&gt;&lt;wp:author_display_name&gt;\r\n&lt;![CDATA[Mike]]&gt;\r\n&lt;\/wp:author_display_name&gt;&lt;wp:author_first_name&gt;\r\n&lt;![CDATA[Mike]]&gt;\r\n&lt;\/wp:author_first_name&gt;&lt;wp:author_last_name&gt;\r\n&lt;![CDATA[Miller]]&gt;\r\n&lt;\/wp:author_last_name&gt;&lt;\/wp:author&gt; &lt;generator&gt;http:\/\/wordpress.org\/?v=3.4.2&lt;\/generator&gt;\r\n\"\"\")\r\n\r\n    # Number the posts in a stable order, starting at 1.\r\n    files = sorted(glob.glob('.\/Log\/*'))\r\n    for post_id, filename in enumerate(files, start=1):\r\n        print(format_file(post_id, filename))\r\n    print(\"\"\"\r\n&lt;\/channel&gt; &lt;\/rss&gt;\r\n\"\"\")<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>After getting a basic WordPress install set up, I figured that I should populate it with whatever material I had around from previous plan\/blog\/etc. efforts. 
I had a pretty big set of note-like material in text with either a light wiki or &hellip; <a href=\"https:\/\/www.3cats.us\/blog\/2012\/10\/importing-txt-to-wordpress\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-135","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/posts\/135","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/comments?post=135"}],"version-history":[{"count":5,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/posts\/135\/revisions"}],"predecessor-version":[{"id":138,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/posts\/135\/revisions\/138"}],"wp:attachment":[{"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/media?parent=135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/categories?post=135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.3cats.us\/blog\/wp-json\/wp\/v2\/tags?post=135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}