You’re embarrassed to question it. People throw out the term “OCR” as though it’s common knowledge. You don’t want anyone to know you missed the memo. Well, a lot of people missed the memo. Genealogy draws heavily on this technology, so let’s take up the question: “What’s OCR?”
I’ve traveled a good bit in the past year, talking to genealogical groups around Alabama. I’m visiting in hopes that most of them will allow the Alabama Bicentennial Commission to digitize their serial publications for inclusion in a statewide database.
In explaining digitization, I’ve learned not to assume people know what the term OCR means. Oh, they won’t raise their hands and ask, “What’s OCR?” It’s that look of relief, that “aha!” expression on faces, that first clued me in that genealogy went OCR without explaining itself.
As I talk about the database, I need them to understand that we are doing more than uploading images of their publication pages. We are making them searchable, to the extent that OCR technology can do that. I feel the need to let them know why volunteers might be needed to do what OCR cannot.
What is OCR?
OCR is an abbreviation for Optical Character Recognition. When you run OCR software on a scanned document, it attempts to recognize patterns in the image that might be letters. It turns the pictures of letters into usable text.
I know, I know. What’s the difference between a word in text and a picture of a word? In the image below, the top version of the Anderson name is in text, in a Word document. You see a cursor between the “n” and the “d,” indicating that you can edit the text.
The Anderson name on the bottom is an image, pulled from an old document. You see the aged color behind it. And you cannot retype it.
To this, you say “So what? I can read them both equally well.”
Yes, you can read both, but the computer cannot. In the bottom example, it does not see the letters “Anderson, J. D.” It just sees shades of gray and black. You could not search on the term “Anderson” and find this name.
What does the computer see?
Let’s start zooming in to make the point. You see the letters getting fuzzy?
Yeah, they’re fuzzy, but I see letters, you say. I easily make out the “An” and a part of the “d.” Aren’t computers supposed to be smart? Smarter than I? If I can read it, why can’t it?
OK, we go a step closer. “I still see an A,” you say, baffled when I tell you the computer still does not.
Let’s zoom one more time. We’re finally getting closer to the computer’s language of images.
An image is made up of “pixels,” tiny squares of different colors. If you’re not seeing the squares, click on the image and blow it up. It’s just line after line of squares. THAT’s what your computer sees.
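If you are curious what that grid of squares looks like to the machine, here is a minimal Python sketch. The tiny 7-by-5 drawing of an "A" and every name in it are made up for illustration. A real scan has thousands of pixels and many shades in between, but the idea is the same: the computer stores numbers, not letters.

```python
# A hypothetical 7x5 picture of a capital "A", drawn by hand for this example.
# "#" marks a dark square (ink), "." marks a light square (paper).
LETTER_ROWS = [
    "..#..",
    ".#.#.",
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
]

# A common 8-bit grayscale convention: 0 is black, 255 is white.
INK, PAPER = 0, 255


def to_pixels(rows):
    """Turn the drawing into the grid of numbers a computer actually stores."""
    return [[INK if ch == "#" else PAPER for ch in row] for row in rows]


pixels = to_pixels(LETTER_ROWS)

# Print the grid the way the computer holds it: row after row of numbers.
for row in pixels:
    print(" ".join(f"{value:3d}" for value in row))
```

Nothing in that grid says "A." It is only a pattern of dark and light values, which is why a plain image cannot be searched.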
Computers are as smart as we make them, and some brilliant human minds came up with the software that does “optical character recognition,” or OCR. The software has the computer look at the image to see if the clumps of darker pixels are lying in a pattern that looks like a letter.
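To make that idea concrete, here is a toy sketch in Python. It compares the dark squares in a small image against a few stored letter shapes and picks the closest one. This is a deliberately simplified technique (often called template matching), not how any particular OCR product works, and the three templates, the threshold of 128, and the function names are all assumptions made up for this example.

```python
# Hypothetical 7x5 letter shapes: "#" is ink, "." is paper.
TEMPLATES = {
    "A": ["..#..", ".#.#.", "#...#", "#...#", "#####", "#...#", "#...#"],
    "L": ["#....", "#....", "#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#..", "..#..", "..#.."],
}


def to_bits(rows):
    """Convert a template drawing into 1s (ink) and 0s (paper)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def binarize(pixels, threshold=128):
    """Call any pixel darker than the threshold ink (1), the rest paper (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def score(bits, template_bits):
    """Fraction of squares where the image and the template agree."""
    matches = sum(
        b == t
        for image_row, template_row in zip(bits, template_bits)
        for b, t in zip(image_row, template_row)
    )
    return matches / (len(bits) * len(bits[0]))


def recognize(pixels):
    """Return the best-matching letter and how well it matched (0.0 to 1.0)."""
    bits = binarize(pixels)
    scores = {letter: score(bits, to_bits(rows)) for letter, rows in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A made-up "scan" of an A: dark gray ink (40) on off-white paper (220),
# with one smudge in the top-left corner that is not part of the letter.
scan = [[40 if ch == "#" else 220 for ch in row] for row in TEMPLATES["A"]]
scan[0][0] = 90

letter, confidence = recognize(scan)
print(letter, round(confidence, 2))
```

Even with the smudge, the scan still agrees with the "A" shape on 34 of its 35 squares, so "A" wins. Real documents are much messier, which is where the trouble starts.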
If so, the software types the letter in another layer: a text layer that usually lies behind the picture layer, invisible to us, but useful for searching. When an online database appears to highlight a term on a scanned newspaper image, like the one below, it is placing the highlighted box based on the text layer beneath it. It was the text layer the computer searched, not the image of the newspaper.
If you’ve seen an odd situation in which the highlights don’t seem to be on top of the right word, you are seeing an image in which the text behind it is not lined up exactly. Or, you might be seeing an image for which the text layer was typed by humans, and cannot be matched word for word against the image.
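Here is a rough Python sketch of how a text layer can drive those highlights. It assumes the text layer stores each recognized word along with the position of a box around it on the page. The Word class, the coordinates, and the function name are hypothetical, invented for this example rather than taken from any real database.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word plus the box it occupies on the page image."""
    text: str
    left: int    # pixels from the left edge of the image
    top: int     # pixels from the top edge of the image
    width: int
    height: int


# A made-up text layer for the "Anderson, J. D." example.
TEXT_LAYER = [
    Word("Anderson,", left=40, top=100, width=92, height=18),
    Word("J.", left=138, top=100, width=16, height=18),
    Word("D.", left=160, top=100, width=18, height=18),
]


def find_highlights(text_layer, term):
    """Search the text layer, never the image, and return boxes to highlight."""
    needle = term.lower()
    return [
        (word.left, word.top, word.width, word.height)
        for word in text_layer
        if needle in word.text.lower()
    ]


boxes = find_highlights(TEXT_LAYER, "Anderson")
print(boxes)
```

The search only ever reads the words in the text layer. The box coordinates are what tell the viewer where to draw the highlight, so if those numbers are off, the highlight lands on the wrong spot even though the search itself worked.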
Limits of OCR
We would be able to easily and almost perfectly OCR everything ever written, but for this: every letter has limitless variations as to how it might look. We keep making OCR software more robust, but as fast as we teach it to recognize new fonts, newer fonts are being created.
For us genealogists, it’s not just all the fonts that are; it’s all the fonts that ever have been. And we have to take into account the quality of the print that is left to us.
And, while developers are beginning to teach OCR software to read handwriting, the variations from one person’s hand to the next add to the complexity.
Due to this difficulty, the digitization of handwritten documents has required human intervention, rather than OCR, to turn the images of handwritten text into searchable text. We complain about the inaccuracy of the census indexing, but heaven help us if a computer had tried to index those handwritten pages.
Computers can even have trouble with some fonts that have been typed in the last half-century. Remember when we all got excited — was it maybe the early 80s — because we were able to switch out font balls on our electric typewriters? Suddenly, we had the ability to type in italic font and make our documents elegant.
Unfortunately, genealogists saw this as an opportunity to indicate that a segment of text was a transcription of the handwritten original. In some genealogical society publications, page after page is likely to be in this font. When I ran OCR on the document from which I grabbed the image above, the resulting text looked like this:
So, even documents typed with a popular font in the past half-century will not always OCR well. If you have searched a database for a reference to your ancestor (you’ve seen the reference before, you KNOW it’s there) and the database fails to find the document, it will often be because of inadequate OCR or faulty human indexing.
On another day, we’ll talk about how to OCR your own documents. It’s an invaluable tool for genealogists.
And now, you have the memo. You’re in the know.