Comments on document scanner

A question came up from somebody interested in document scanners, so I thought it might be good to write it up real quick and post it as well…

I’ve got a Fujitsu ScanSnap S510M http://www.fujitsu.com/us/services/computing/peripherals/scanners/scansnap/s510m.html that I bought over 3 years ago, and it’s been really great. I scan to PDF with OCR so that the documents are searchable. Each document (which may be multiple sheets) is a single file in one big folder, and I use Spotlight (file search on OS X) to find what I need. I very rarely need to find things, but when I do, it has worked fine. I do keep my tax documents unscanned in a “tax” folder until I do my taxes and then scan all of them as a group, but that’s just my tax workflow.
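
The same Spotlight search can also be run from a script through mdfind, the command-line front end to Spotlight on OS X. The following is only a sketch: the folder path, the search phrase, and the helper names are made up for illustration, and the search itself only runs on a Mac.

```python
import subprocess


def spotlight_command(folder, query):
    """Build an mdfind command that searches only inside `folder`."""
    return ["mdfind", "-onlyin", folder, query]


def spotlight_search(folder, query):
    """Run the search (OS X only) and return the matching file paths."""
    result = subprocess.run(
        spotlight_command(folder, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Hypothetical folder and search phrase, for illustration only.
print(spotlight_command("/Users/me/Scans", "property tax"))
```

Since the OCR text is part of each PDF, a plain phrase is usually all the query needs.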

Specific questions:

– You can just throw 50 sheets in (I’ve done 100+), but whatever goes in together is stored as one document, not as individual documents. Sheets of irregular size can be an issue and may need a little intervention to keep them steady going in.

– I’ve got 3+ years on mine and it’s still working fine. I’ve certainly pushed thousands of pages through each year with a minimal failure rate.

– In my case, it’s just a folder of PDFs, so it backs up like anything else. Because I’m on a Mac, I happen to use Time Machine, with both an onsite and an offsite disk, but any other backup would be fine.

While a document scanner is fairly expensive (I paid $400), it’s been worth it in terms of the time and effort savings. The real key feature is scanning both sides at once. Other solutions are just too slow to provide good ROI. I just scan stuff and shred it. I got rid of my 4-drawer filing cabinet over 2 years ago, so I’ve got more space and I don’t stress about where to file things. They all go in a big folder, and search has let me find anything I’ve needed.

ifttt and WordPress

http://ifttt.com is a very clever site that basically allows high-level, event-based scripting between web sites. I’ve known about it for a long time, but didn’t really have a lot of use for it until now, when I’ve got a WordPress site up and running. It’s nifty in that I can now cross-post automatically between WordPress and Facebook. I’ve got two recipes: one to post to FB when I do a post on WordPress, and one to copy my photo posts from FB to WordPress.
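
To make the “if this then that” idea concrete, here is a purely conceptual sketch of the two recipes as trigger/action pairs. It is not ifttt’s API; the Recipe class, the event dictionaries, and their field names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recipe:
    """One 'if this then that' rule: a trigger test plus an action."""
    trigger: Callable[[dict], bool]
    action: Callable[[dict], str]


def run_recipes(recipes, event):
    """Return the result of every recipe whose trigger matches the event."""
    return [r.action(event) for r in recipes if r.trigger(event)]


# Hypothetical stand-ins for the two cross-posting recipes.
recipes = [
    Recipe(lambda e: e["source"] == "wordpress" and e["kind"] == "post",
           lambda e: "share on facebook: " + e["title"]),
    Recipe(lambda e: e["source"] == "facebook" and e["kind"] == "photo",
           lambda e: "copy to wordpress: " + e["title"]),
]

event = {"source": "wordpress", "kind": "post", "title": "Hello world!"}
print(run_recipes(recipes, event))
```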

Importing txt to WordPress

After getting basic WordPress set up, I figured that I should populate it with whatever stuff I had around from previous plan/blog/etc stuff. I had a pretty big set of note-like material in text with either a light wiki or reStructuredText markup, so I figured that importing them would be good to give the site some real content. There’s some good stuff, and some lame stuff too.

After some quick poking around, the simplest solution appeared to be to export the site using its export-to-XML feature and use that file as a template, reformatting my text files to match it for reimport. Turns out it was pretty easy to get working. The code is pretty ugly and relies on the docutils package: http://docutils.sourceforge.net/

I just need the basic reStructuredText-to-HTML conversion, which this person was nice enough to document: http://stackoverflow.com/questions/6654519/python-parsing-restructuredtext-into-html There are a few additional issues, mostly removing CSS class attributes and div wrappers to make it play nice, but some quick regexps clean those up.
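
As a rough illustration of that cleanup step, here is a small standalone sketch that runs the same kind of regexps over a made-up HTML fragment. The sample string and the function name are assumptions, and the first pattern is a slight variation on the script’s own: it also swallows the whitespace before the class attribute so no stray space is left inside the tag.

```python
import re


def strip_docutils_markup(html):
    """Remove class attributes and <div> wrappers from an HTML fragment."""
    html = re.sub(r'\s*class=".*?"', '', html)   # drop class="..." attributes
    html = re.sub(r'</?div.*?>', '', html)       # drop opening and closing divs
    return html


# Hypothetical fragment resembling docutils output.
sample = '<div class="document"><p class="first">Hello <em>world</em></p></div>'
print(strip_docutils_markup(sample))
```

Regexps like these are fine for a one-off import of known input, but they are not a general HTML sanitizer.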

It appears to have basically worked and spot checks of various entries show that the conversion appears to have resulted in mostly-readable text, so I’m happy.
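
Beyond eyeballing entries, one cheap sanity check is to confirm that the generated file parses as XML and contains the expected number of items. This is only a sketch using the standard library; the sample document and the function name are invented for illustration, and a real run would read the generated file instead.

```python
import xml.etree.ElementTree as ET


def count_items(wxr_text):
    """Parse an export document and return how many <item> elements it has."""
    # Encoding to bytes lets the parser honor the declared encoding.
    root = ET.fromstring(wxr_text.encode("utf-8"))
    return len(root.findall("./channel/item"))


# Tiny hypothetical export with two posts.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
<title>Example</title>
<item><title>one</title><wp:post_id>1</wp:post_id></item>
<item><title>two</title><wp:post_id>2</wp:post_id></item>
</channel>
</rss>"""

print(count_items(sample))
```

If the file is not well-formed, ET.fromstring raises ParseError with a line and column, which is a quick way to find the offending entry.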

The code is below. I did not write it in a Test Driven Development style since I intend to only use it once, so it’s a little ugly…

#!python
from docutils.core import publish_parts
import re, glob

def format_file(post_id, filename):
    """Return one WXR <item> element for the text file at filename."""

    # Defaults, used when the file name does not carry a date or title
    title = "notes-"+str(post_id)
    year = 0
    month = 0
    day = 0
    month_string = ""
    day_of_week = "Sat"  # placeholder; the real weekday is not computed
    site_url = "http://www.3cats.us/blog"

    # File names are expected to contain "<year> <month name> <day>"
    match = re.search(r'(?P<year>\d+)\s+(?P<month_string>\w+)\s+(?P<day>\d+)', filename)
    if match:
        year = int(match.group('year'))
        month_string = match.group('month_string')
        month = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}[month_string]
        day = int(match.group('day'))

    # The title is whatever sits between " - " and the .txt extension
    match = re.search(r'\s-\s+(.*)\.[tT][xX][tT]', filename)
    if match:
        title = match.group(1)

    title_no_whitespace = re.sub(r'\s+', '-', title)

    # Convert the reStructuredText source to HTML, then strip the
    # class attributes and div wrappers that docutils adds
    with open(filename) as fh:
        text = fh.read()
    html = publish_parts(text, writer_name='html')['html_body']
    striped_html = re.sub(r'class=".*?"', '', str(html))
    striped_html = re.sub(r'</?div.*?>', '', striped_html)

    return """<item>
    <title>%(title)s</title>
    <link>%(site_url)s/%(year)d/%(month)d/%(title_no_whitespace)s/</link>
    <pubDate>%(day_of_week)s, %(day)2.2d %(month_string)s %(year)d 00:00:00 +0000</pubDate>
    <dc:creator>Mike</dc:creator>
    <guid isPermaLink="false">%(site_url)s/?p=%(post_id)d</guid> <description/>
    <content:encoded>
%(striped_html)s
    </content:encoded>
    <excerpt:encoded>
    <![CDATA[]]>
    </excerpt:encoded>
    <wp:post_id>%(post_id)d</wp:post_id>
    <wp:post_date>%(year)d-%(month)2.2d-%(day)2.2d 00:00:00</wp:post_date>
    <wp:post_date_gmt>%(year)d-%(month)2.2d-%(day)2.2d 00:00:00</wp:post_date_gmt>
    <wp:comment_status>closed</wp:comment_status>
    <wp:ping_status>closed</wp:ping_status>
    <wp:post_name>%(title)s</wp:post_name>
    <wp:status>publish</wp:status>
    <wp:post_parent>0</wp:post_parent>
    <wp:menu_order>0</wp:menu_order>
    <wp:post_type>post</wp:post_type>
    <wp:post_password/>
    <wp:is_sticky>0</wp:is_sticky>
    <category nicename="uncategorized" domain="category">
    <![CDATA[Uncategorized]]>
    </category>
</item>
"""% vars()

if __name__ == '__main__':

    # The XML declaration has to be the very first thing in the output,
    # so the string starts on the same line as the opening quotes.
    print("""<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="WordPress/3.4.2" created="2012-10-04 17:36" -->
<rss xmlns:wp="http://wordpress.org/export/1.2/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" version="2.0">
<channel>
<title>3Cats.us</title>
<link>http://www.3cats.us/blog</link>
<description>4 Kids, 3 Cats, 2 Parents, 1 God</description>
<pubDate>Thu, 04 Oct 2012 17:36:15 +0000</pubDate>
<language>en-US</language>
<wp:wxr_version>1.2</wp:wxr_version>
<wp:base_site_url>http://www.3cats.us/blog</wp:base_site_url>
<wp:base_blog_url>http://www.3cats.us/blog</wp:base_blog_url>
<wp:author><wp:author_id>1</wp:author_id><wp:author_login>admin</wp:author_login><wp:author_email>mikem@3cats.us</wp:author_email><wp:author_display_name>
<![CDATA[admin]]>
</wp:author_display_name><wp:author_first_name>
<![CDATA[]]>
</wp:author_first_name><wp:author_last_name>
<![CDATA[]]>
</wp:author_last_name></wp:author> <wp:author><wp:author_id>2</wp:author_id><wp:author_login>Mike</wp:author_login><wp:author_email>mike@3cats.us</wp:author_email><wp:author_display_name>
<![CDATA[Mike]]>
</wp:author_display_name><wp:author_first_name>
<![CDATA[Mike]]>
</wp:author_first_name><wp:author_last_name>
<![CDATA[Miller]]>
</wp:author_last_name></wp:author> <generator>http://wordpress.org/?v=3.4.2</generator>
""")

    # Sort so that post ids are assigned in a repeatable order
    files = sorted(glob.glob('./Log/*'))
    for post_id, filename in enumerate(files, start=1):
        print(format_file(post_id, filename))
    print("""
</channel> </rss>
""")
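
The script above hard-codes day_of_week as "Sat" and writes the full month name into pubDate. If the dates matter, a possible refinement is to build the pubDate string from the parsed date with the standard library, which gives the real weekday and the three-letter month abbreviation used by RFC 822 dates. This is a sketch, not part of the original script; the function name and the sample file name are assumptions.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']


def pub_date_from_filename(filename):
    """Return an RFC 822 style date string, or None if no date is found."""
    match = re.search(r'(?P<year>\d+)\s+(?P<month>\w+)\s+(?P<day>\d+)', filename)
    if not match:
        return None
    # MONTHS.index raises ValueError for anything that is not a full month name
    month = MONTHS.index(match.group('month')) + 1
    when = datetime(int(match.group('year')), month, int(match.group('day')),
                    tzinfo=timezone.utc)
    return format_datetime(when)


# Hypothetical file name in the "<year> <month name> <day> - <title>.txt" shape.
print(pub_date_from_filename('./Log/2012 October 04 - First note.txt'))
# prints: Thu, 04 Oct 2012 00:00:00 +0000
```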

Cultivating Focus

A couple of things have been going on that are causing me to think about discipline and focus. There’s this really great article on cultivating focus: http://www.aholyexperience.com/2012/09/how-to-cultivate-the-habit-of-focus-in-an-age-of-distraction/ Also, I’ve been working on some unusual, “outside of the box” issues at work that have been particularly challenging. On top of that, I recently pulled down a little app called “Tiny Tower” and switched to Google Reader for a lot of my web browsing.

Basically, I realized that I’ve been allowing little distractions to creep in and steal my time and attention instead of dealing with the hard issues head on. I’ve listened to way too much of Merlin Mann’s stuff http://www.merlinmann.com/ to not be aware that I’m doing this stuff. Unfortunately, getting out of this mode is really hard.

Sigh…

Things to go do right now:

  • Delete Tiny Tower. This has got to be one of the worst examples of a game that is designed to consume your attention and your time. The game is set up so that you can give them money for the in-game resources that you can otherwise only get by giving it time and attention. It is in the category of “virtual pet” games which all have the same basic tradeoff.
  • Get inboxes to zero. The real difficulty I’m having with this is that I don’t trust my todo lists right now. So I have to go back to classic David Allen http://www.davidco.com/ and get trust back in my system so that I can get my inboxes to zero. This is tricky.
  • Close Google Reader. This is not a fix-all. The real underlying issue is that I’m polling “inboxes” to find the latest interesting nugget, which isn’t a good use of my time and attention. I need to stay focused on larger, difficult tasks.

Longer term:

  • Need to construct some ceremony in the day/week to reinforce important, but not urgent, tasks like a weekly review, spending time in the Bible, other relational activities, reading and writing longer form materials and thinking about them.
  • Particularly at home, I need to not be as driven by urgent issues. Part of this is getting past the push of the urgent things to get at their source, and getting things clear enough to respond easily to new needs. Check out Proverbs 15:19 “The way of a sluggard is like a hedge of thorns, but the path of the upright is a level highway.” By doing the work to keep things clean, organized, and uncluttered, it is much easier to respond to needs. Around the house, we’ve got a lot of this “technical debt” http://en.wikipedia.org/wiki/Technical_debt that needs to be paid down.

Hello world!

First post…

I’ve been looking around at lots of different options for website creation and hosting. I’m very curmudgeonly about putting all my content on places that I don’t pay for under the “if you aren’t paying for it, you are the product” concept, as well as being paranoid about not being able to extract the content later for future transitions.

Anyway, after creating family websites with hand-coded HTML, Gallery 1 and 2, and iWeb, I’m giving WordPress a shot. The real key is being able to very quickly get content up on the web with a minimal amount of effort and long-term overhead.