Wally and the spammers

Last year, when people were talking about putting together a 50th anniversary Cape Town to Cairo Airstream caravan for 2009, somebody also came up with the idea of reprinting Wally Byam’s book “Trailer Travel Here and Abroad.” Published in 1960, the book has long been out of print and copies are very difficult to find. As with almost everything written during the glory days of Airstream in the 1950s and 1960s, that book is considered highly desirable by Airstream aficionados.

One of the organizers approached me to see if I could help.   He would donate a sacrificial copy of the book if I could work out how to scan it, reprint it in limited quantities, and distribute it to all of the caravan members.   I say “sacrificial” because in the process of scanning it, the book would likely be severely damaged or even cut to individual pages.

The problem with this idea was that there really is no good way to reproduce old books.   You can copy the pages and reprint them exactly as they appeared (smudges, tears, and all), but this generally results in something fairly crummy looking.   It also forces you to use exactly the same page proportions as the original.

Another method is simply to have a person re-type every word of the book.   That process is so expensive that it usually doesn’t make economic sense for purposes of reprinting an old book in small quantities.

Or, you can use Optical Character Recognition (OCR) to turn the printed words into a word-processing document, which can then be edited and reformatted. But this doesn’t work well either, since OCR technology is far from perfect. Error rates are often high, which means a human being must go over every word to fix the errors, and that can be just as bad as re-typing the whole book.

Interestingly, the wizards at Carnegie Mellon University have found a great solution. They’re getting you to help with the OCR process. And they’ve gotten me to do it. And millions of other people have been recruited as well. In fact, so many of us are helping that up to 150,000 hours of work can be contributed to the project every day.


You know those little text puzzles you have to solve before you can post comments on a blog, like the one above? They are called “captchas.” The Carnegie Mellon kids are using them to digitize old books and newspapers. One word in the puzzle is known to the computer; the other is a word from an old book that the OCR software couldn’t recognize. When you type the correct answer to the “known” word, the computer assumes your answer to the other word is probably correct too. Then it checks your answer against other people’s answers. When enough people agree on the word, it gets added to the digitized version of the book.
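For the curious, here is a rough sketch in Python of how that kind of voting could work. It is only an illustration of the idea described above, not the actual Carnegie Mellon code: the names (UnknownWord, submit, normalize) and the number of matching answers required are my own assumptions.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Ignore case and stray spaces when comparing typed answers."""
    return answer.strip().lower()


class UnknownWord:
    """Collects human guesses for one word the OCR software could not read."""

    def __init__(self, votes_needed: int = 3):
        # Assumed threshold: how many matching answers count as "enough people".
        self.votes_needed = votes_needed
        self.votes = Counter()

    def accepted(self):
        """Return the agreed-upon word, or None if there is no consensus yet."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.votes_needed else None


def submit(known_word, known_answer, unknown, unknown_answer):
    """Handle one solved puzzle; return True if the solver passes."""
    if normalize(known_answer) != normalize(known_word):
        # The solver missed the word the computer already knows,
        # so their guess for the other word is thrown away.
        return False
    unknown.votes[normalize(unknown_answer)] += 1
    return True
```

Each time someone passes the known word, their guess for the mystery word counts as one vote. Once a single spelling collects enough votes, accepted() returns it and it can go into the digitized text.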

Distributed computing projects like SETI@home, part of the Search for Extra-Terrestrial Intelligence (SETI), have been commonplace for years. Those projects rely on thousands of people allowing their personal computers to be used to solve tiny bits of very large mathematical problems. But this is a new sort of distribution: instead of computers being recruited to do the work, the work is being distributed across millions of humans, most of whom have no idea that they’re working on a greater project.

What incredible irony.   We’ve developed a massive computing network that spans the globe, linking billions of people and enabling incredible capabilities, and we’re using it to facilitate a job that only humans can do.   When you solve the captcha, you’re becoming the ultimate worker bee, working toward the greater good but ignorant of the exact nature of the final project. You have to wonder, are the computers working for us, or are we working for the computers?

I suppose another way to look at it is that we are all contributing a tiny bit to eliminate the need for good typists. Thirty years ago, the job of re-setting the book would have been handed to a bunch of typesetters (a job title that no longer exists in the modern age). But instead of hiring them, we are getting the job done for free by using the Internet to make use of what would otherwise be wasted effort.

Another irony is that this wouldn’t be possible if the spammers hadn’t forced the need for captchas in the first place.   By relentlessly harassing websites, spammers have enabled this book digitization project.   Next time you encounter a spammer online, remember that someday Wally’s books will be available in print again and it might be thanks to them.


  1. Leo says

    Ye of little faith. Send me a chapter or two! OCR works great these days. The trick is in carefully preparing each page. Scan at 300 DPI into Photoshop. Take shades-of-gray lettering and darken to 0. Take off-white page, and brighten to white. Import into Acrobat and OCR. Better than 99.5% capture rate, and yes, I want my PDF copy of that book! ;^) Write me, we’ll get it done!

  2. Forrest says

    If Wally’s book were scanned and digitized would that be a copyright infringement? Project Gutenberg, http://www.gutenberg.org/wiki/Main_Page, has been amassing digitized books for years, but there are strict rules about what they accept because of copyright. I think Airstream Inc. holds the copyright on Wally’s book, Trailer Travel Here and Abroad. Would they allow the distribution of a digitized version?

  3. says

    Leo, I wish I had a copy of my own, but I don’t. Sounds like a lot of work the way you describe, but if I had a copy I’d happily donate it if you were going to tackle the OCR. I could certainly get it published later.

    Forrest, I believe the copyrights on most of Wally’s books were allowed to expire. However, I’m pretty confident that I could get permission to reprint if necessary. Frankly, without a volunteer effort to scan and reproduce/reprint, these books will never see publication again.