Several years back, I was working on an imaging project in Java which was going to require some Optical Character Recognition (OCR) functionality. After an exhaustive search, I could find nothing to fit the bill. My requirements were:
- Must be written in Java
- Must be freely redistributable, with or without source code
- Must not be proprietary
- Must be able to recognize the fonts of various printers, even if that means that it has to be trained for each new font
- Must be reasonably fast
I never found anything that met my requirements, so I set about developing something to fit the bill. What I ended up developing is a generic, trainable OCR package that does a fairly decent job of decoding printed text, as long as it has been trained for the font(s) it is expected to recognize.
How it Works
This OCR engine is implemented as a Java library, along with a demo application which shows the library in action. The core concept, at the character level, is image matching with automatic position and aspect ratio correction, using a least-square-error matching algorithm. It is a very simple yet reasonably effective implementation.
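To make the least-square-error idea concrete, here is a minimal sketch, not the library's actual code. It assumes each glyph has already been cropped to its bounding box and is held as a grayscale array (0 = black, 255 = white). Scaling both glyphs to a common size is my stand-in for the position and aspect ratio correction. The class and method names are hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch of least-square-error glyph matching; not the JavaOCR API. */
final class LeastSquaresMatchSketch {

    /** Mean squared difference between two equally sized grayscale glyphs (values 0..255). */
    static double meanSquareError(int[][] a, int[][] b) {
        int h = a.length, w = a[0].length;
        if (b.length != h || b[0].length != w) {
            throw new IllegalArgumentException("glyph sizes differ");
        }
        double sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double d = a[y][x] - b[y][x];
                sum += d * d;
            }
        }
        return sum / (w * h);
    }

    /** Nearest-neighbor resample to w x h, so glyphs of different size or aspect ratio can be compared. */
    static int[][] scaleTo(int[][] src, int w, int h) {
        int[][] dst = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst[y][x] = src[y * src.length / h][x * src[0].length / w];
            }
        }
        return dst;
    }

    /** Returns the training character with the smallest error, or '?' if there is no training data. */
    static char bestMatch(int[][] glyph, Map<Character, int[][]> training, int w, int h) {
        int[][] scaled = scaleTo(glyph, w, h);
        char best = '?';
        double bestErr = Double.MAX_VALUE;
        for (Map.Entry<Character, int[][]> e : training.entrySet()) {
            double err = meanSquareError(scaled, scaleTo(e.getValue(), w, h));
            if (err < bestErr) {
                bestErr = err;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Every candidate character gets a score, and the lowest score wins, so the engine always produces some answer even for a poor match. That is one reason why training on the right font matters so much.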
The Training Phase
Training consists of the following steps:
- Printing out the characters which it is expected to recognize
- Scanning those characters into an image
- Cropping the image down so that it includes only the training characters
- Telling the OCR engine to use the resulting training image, and specifying which characters the image contains
Character Recognition
The general steps used by this OCR engine for converting a scanned document to text are:
- Load training images
- Load the scanned image of the document to be converted to text
- Convert the scanned image to grayscale
- Filter the scanned image using a low-pass Finite Impulse Response (FIR) filter to remove dust
- Break the document into lines of text, based on whitespace between the text lines
- Break each line into characters, based on whitespace between the characters; using the average character width, determine where spaces occur within the line
- For each character, determine the most closely matching character from the training images and append that to the output text; for each space, append a space character to the output text
- Output the accumulated text
- If there are any more scanned images to be converted to text, return to the second step (loading the next scanned image)
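The line-breaking, character-breaking, and space-detection steps can be sketched as follows. This is an illustration under stated assumptions rather than the engine's real code: the page is assumed to be already reduced to a boolean "ink" grid, and the `spaceFraction` parameter (how wide a gap must be, relative to the average character width, to count as a space) is a hypothetical knob of my own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of whitespace-based segmentation; not the JavaOCR API. */
final class SegmentationSketch {

    /** Returns [start, endExclusive) index pairs for each run of consecutive true flags. */
    static List<int[]> runs(boolean[] flags) {
        List<int[]> out = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= flags.length; i++) {
            boolean on = i < flags.length && flags[i];
            if (on && start < 0) {
                start = i;
            } else if (!on && start >= 0) {
                out.add(new int[] {start, i});
                start = -1;
            }
        }
        return out;
    }

    /** Text lines are runs of pixel rows that contain at least one ink pixel. */
    static List<int[]> findLines(boolean[][] ink) {
        boolean[] rowHasInk = new boolean[ink.length];
        for (int y = 0; y < ink.length; y++) {
            for (boolean p : ink[y]) {
                rowHasInk[y] |= p;
            }
        }
        return runs(rowHasInk);
    }

    /** Characters are runs of pixel columns that contain ink within rows top (inclusive) to bottom (exclusive). */
    static List<int[]> findChars(boolean[][] ink, int top, int bottom) {
        boolean[] colHasInk = new boolean[ink[0].length];
        for (int y = top; y < bottom; y++) {
            for (int x = 0; x < colHasInk.length; x++) {
                colHasInk[x] |= ink[y][x];
            }
        }
        return runs(colHasInk);
    }

    /** Renders one line's layout: '#' per character, ' ' wherever the gap is at least avgWidth * spaceFraction. */
    static String layout(List<int[]> chars, double spaceFraction) {
        double avgWidth = chars.stream().mapToInt(c -> c[1] - c[0]).average().orElse(0);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.size(); i++) {
            if (i > 0 && chars.get(i)[0] - chars.get(i - 1)[1] >= avgWidth * spaceFraction) {
                sb.append(' ');
            }
            sb.append('#');
        }
        return sb.toString();
    }
}
```

In a full pipeline, each `'#'` would be replaced by the best-matching training character for that column range.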
Applications
This is a generic, trainable OCR engine. By default, it knows nothing except how to (attempt to) filter/clean up dust, convert to grayscale, break the document into lines, break the lines into characters, compare each character against known characters in user-supplied training images, and output the closest matches as text.
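For readers who want to see what the grayscale and dust-filtering steps might look like, here is a small sketch using only standard `java.awt.image` classes. The 3x3 averaging kernel is my assumption of a simple low-pass FIR filter; the engine's actual kernel and class names may differ.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.util.Arrays;

/** Hypothetical preprocessing sketch; not the JavaOCR API. */
final class PreprocessSketch {

    /** Redraws the source image into an 8-bit grayscale image of the same size. */
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return gray;
    }

    /** Applies a 3x3 averaging (box) kernel, which softens isolated specks such as dust. */
    static BufferedImage lowPass(BufferedImage gray) {
        float[] weights = new float[9];
        Arrays.fill(weights, 1f / 9f);
        // EDGE_NO_OP copies border pixels unchanged instead of zeroing them.
        ConvolveOp op = new ConvolveOp(new Kernel(3, 3, weights), ConvolveOp.EDGE_NO_OP, null);
        return op.filter(gray, null);
    }
}
```

A single dark speck surrounded by white is averaged with its eight white neighbors and ends up light gray, so it is less likely to be mistaken for ink later on.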
The engine was originally written to digitize documents (or specific sections of documents) which were printed with a handful of known fonts for which it could be trained, in order to minimize the error. Digitization was not intended to be 100 percent accurate, since the digitized text was to be used mainly for searching the documents by keywords. It was intended to be used in a document imaging system.
Accuracy and Performance
On the simple documents it was tested with, this OCR engine compared favorably with the open-source OCR package GOCR. It translated images to text with accuracy at least comparable to GOCR's, and its speed was in the same ballpark, if not somewhat faster. Extensive comparisons were not performed.
Getting Started
The following instructions assume you’re running on a Linux box, with a reasonably recent version of Sun’s JDK installed. You can get the JDK at java.sun.com. Be sure to remove any “fake” java packages that come with your Linux distribution. If you install OpenOffice, chances are you’ll get a counterfeit GNU Java implementation which does not conform to Sun’s Java specification, and is actually quite outdated as well. Unfortunately, OpenOffice has dependencies on this package. To get rid of it, you’ll need to do something like this before installing Sun’s JDK:
rpm -e --nodeps java-1.4.2-gcj-compat
NOTE: This may BREAK your OpenOffice installation, at least until you install the Sun JDK to replace the missing Java functionality. But hey, the OpenOffice guys should know better than to force someone to install an illegitimate Java knock-off, especially since OpenOffice is operated by Sun, who created the real Java in the first place. There’s just no excuse.
As a potential “alternative”, if you’re more skilled than I am with the Linux alternatives package, you could use it to fix up the symlinks under /etc/alternatives to point to the real JDK without uninstalling the GNU Java knock-off. However, you’d have to be careful about software updates to the GNU Java knock-off “accidentally” resetting these symlinks, thereby breaking the real JDK. What a mess. Sun should really go after these guys for creating executables with the same names as Sun’s, and purposely interfering with the distribution of Sun’s legitimate Java implementation. After all, isn’t that what Microsoft did with their fake Java implementation? Bad actions are bad, no matter who’s doing them. But I digress.
So, back to the OCR engine. When you download and unpack the tarball, you’ll have an “ocr” directory. Under it you’ll find these scripts:
- compile – compiles the Java files into class files in the classes directory
- createJars – creates ocr.jar from the compiled classes
- ocrscannerdemo – demonstrates OCR functionality using any of several test images and corresponding training images
Compiling
The source code *should* already be compiled, and there should be an ocr.jar file in the top-level “ocr” directory. If so, you can proceed. If not, or if you need to rebuild after making a change to the source code, just do the following:
./compile && ./createJars
Assuming there are no errors, you’ll get freshly compiled classes and a new ocr.jar with your changes.
Running the Demos
If you look under the ocrTests directory, there are several png and jpg files. Each of these is an image which contains text, and can be used to demonstrate the functionality of the OCR engine. To test the OCR engine on an image, do something like this:
./ocrscannerdemo ocrTests/asciiSentence.png
Notice that there is also a directory named ocrTests/trainingImages. This contains the font samples that are used to train the OCR engine in the demo application, so that it can recognize the fonts that were used to create the test images in the ocrTests directory. If you look at the src/com/roncemer/ocr/OCRScannerDemo.java source file, in the loadTrainingImages() function, you’ll see that the demo app is loading up each of these training images and telling the OCR engine which character ranges are contained in each image. The OCR engine then uses these images to match against each character in the source image, in order to convert the source image into text.
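The idea of telling the engine which character range a training image contains can be illustrated with a short sketch. It does not use the real JavaOCR classes; it assumes the glyphs have already been cut out of the training image in left-to-right order, and the `label` helper is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of labeling training glyphs with a character range; not the JavaOCR API. */
final class TrainingRangeSketch {

    /** Pairs the i-th glyph with the character (first + i); the glyph count must match the range size. */
    static <G> Map<Character, G> label(List<G> glyphs, char first, char last) {
        int expected = last - first + 1;
        if (glyphs.size() != expected) {
            throw new IllegalArgumentException(
                    "expected " + expected + " glyphs but found " + glyphs.size());
        }
        Map<Character, G> labeled = new LinkedHashMap<>();
        for (int i = 0; i < expected; i++) {
            labeled.put((char) (first + i), glyphs.get(i));
        }
        return labeled;
    }
}
```

The count check is worth keeping in any real implementation: if two characters touch in the training image, segmentation finds too few glyphs, and every later label would be shifted by one.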
Using the Code in your Program
To use the code in your own program, put ocr.jar into your classpath and follow the usage pattern which is used in the src/com/roncemer/ocr/OCRScannerDemo.java source file.
Feel free to look at the other source files, if you’re interested in the inner workings of the OCR engine. The concepts are fairly simple, yet reasonably effective.
License
I originally released this engine under the GPL, version 2. However, I felt it would be more commercially friendly if it were re-released under the BSD license. As of May 6, 2010, I’ve created a project page on SourceForge, changed the license to BSD, and uploaded the whole thing to the SourceForge Subversion repository.
SourceForge Project Page
The new JavaOCR SourceForge project is located here: http://javaocr.sourceforge.net
Feedback
As always, I’m interested in your feedback, suggestions for improvement, use cases, success stories, or whatever.
Enjoy!
Nice work. I am using Windows. Also, can I edit the code and add it as part of my project? My project is commercial.
The original license was BSD. I believe we moved it to MIT. Both licenses allow commercial use, and also allow private modifications without releasing the source code to your modifications. That’s why we picked these licenses — because they are very friendly to commercial software developers, of which I am one.
Hi Ron,
I can see the SourceForge project has seen further development and has been customized for Android. Can you point me to Java code in the latest Android projects that does the training and matching in a single program, like your OCRScannerDemo.java?
Regards,
Sajid.
I have not been actively involved in the project since I wrote this article and released the source code on SourceForge. Other volunteer developers have done all of the improvements, the adaptation to Android, etc. Konstantin Pribluda would be the one to contact with questions about that. Just post your question on the forum at SourceForge.
Where can I download this?
The project has since moved to SourceForge. The project page is http://sourceforge.net/projects/javaocr
I recommend looking over the JavaDocs after downloading the latest code. It’s completely different from when I released the package and wrote this article.
Excuse me sir, I can’t find the javadoc. Where can I find it?? I’m so sorry for the question.
No problem. The volunteers who have taken over the project have adapted it for building using Maven. When you build the project using Maven, there should be a build target which builds the JavaDocs. If the source tarball from SourceForge is missing the JavaDocs, I recommend posting that fact on the SourceForge project page. They’re pretty good at fixing these things.
Very nice work. But this is not working for multi-page TIFF images. Need guidance!
Here’s an article which shows how to convert a multi-page TIFF image to a single image using Java Advanced Imaging (JAI). That should fix you up.
http://www.rgagnon.com/javadetails/java-0535.html
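As an alternative to JAI, Java 9 and later include a TIFF plugin in the standard `javax.imageio` package. The sketch below is my own illustration, not code from the linked article: it reads every page of a multi-page TIFF and stacks the pages into one tall image. The class and method names are hypothetical.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical sketch: multi-page TIFF to a single image using javax.imageio (Java 9+). */
final class TiffPagesSketch {

    /** Reads all pages of the file using whichever ImageIO reader recognizes its format. */
    static List<BufferedImage> readPages(File tiff) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff)) {
            if (in == null) {
                throw new IOException("cannot open " + tiff);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("no ImageIO reader for " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                int pageCount = reader.getNumImages(true);
                List<BufferedImage> pages = new ArrayList<>();
                for (int i = 0; i < pageCount; i++) {
                    pages.add(reader.read(i));
                }
                return pages;
            } finally {
                reader.dispose();
            }
        }
    }

    /** Draws the pages top to bottom on a white canvas as wide as the widest page. */
    static BufferedImage stackVertically(List<BufferedImage> pages) {
        if (pages.isEmpty()) {
            throw new IllegalArgumentException("no pages to stack");
        }
        int width = 0, height = 0;
        for (BufferedImage p : pages) {
            width = Math.max(width, p.getWidth());
            height += p.getHeight();
        }
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        int y = 0;
        for (BufferedImage p : pages) {
            g.drawImage(p, 0, y, null);
            y += p.getHeight();
        }
        g.dispose();
        return out;
    }
}
```

Depending on your use case, it may be simpler to skip the stacking step and feed each page to the OCR engine separately.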
I’m looking to do something similar to this and have no real experience working with Java. If you don’t mind me asking, where would I start looking for information on building my own OCR program? It really doesn’t matter what language it’s in; I’m just looking for the theory on how this kind of programming works.
Try reading the Wikipedia article on optical character recognition. There are so many different approaches to OCR, each with different benefits and drawbacks.
Hi Ron!
I am using this project as a basis for my graduate research project. I’m having an issue with accuracy running on a Windows 7 box. Do you have any tips that will help in improving the accuracy of the output?
Regardless of operating system, there are a few things which can degrade accuracy.
First of all, you need at least one high-quality sample of the font you want to recognize. This will be your training image. The more of these you have (of the same font), the more accurate it’s likely to be (and slower as well). In your training images, make sure the characters are separated by enough whitespace that the training image loader can easily find the whitespace between any two adjacent characters. In other words, don’t let any characters touch each other in the training image(s).
Secondly, check your white threshold setting. I don’t recall exactly where it is — somewhere in the DocumentScanner class, if I recall correctly. The white threshold setting has an effect on the accuracy of character detection.
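To show why the white threshold matters, here is a tiny sketch. It assumes the threshold is a grayscale cutoff below which a pixel counts as ink; the real setting in the library may be defined differently, and the names below are hypothetical.

```java
/** Hypothetical sketch of a white threshold; the real JavaOCR setting may be defined differently. */
final class WhiteThresholdSketch {

    /** Counts pixels darker than the threshold, i.e. pixels that would be treated as ink. */
    static int countInk(int[] grayPixels, int whiteThreshold) {
        int ink = 0;
        for (int p : grayPixels) {
            if (p < whiteThreshold) {
                ink++;
            }
        }
        return ink;
    }
}
```

For the gray values 0, 100, 200, and 255, a threshold of 128 treats two pixels as ink, while a threshold of 220 treats three as ink. Set it too high and scanner noise turns into ink, which can merge adjacent characters; set it too low and faint strokes disappear.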
Keep in mind, JavaOCR is an experimental library which achieves high levels of accuracy under controlled circumstances. However, it is not a general-purpose OCR engine.
Greetings, Nice Work !! Thanks,
I’m looking for something like your project, but for Android.
If you know of one, I’d appreciate your recommendation.
Bye
J.Q.
The JavaOCR project works on Android. If I’m not mistaken, there may even be some Android demos that the guys have written, right there in the source code. Be sure to get the latest from the git repository, rather than downloading a tarball. Enjoy!
Sorry in advance for my question, but how can I get the result text using everything from Demo?
It gives back chars, lines (so cool), but as images… not text
The text is output to the console. When you run the demo from the command line, you should see it outputting the text, I think. I did not write the latest demos though, so you’ll want to direct any further questions to the SourceForge project forum.
Hey Ron!
Can I use this to support more than one font?
Thanks,
Sangamesh
Yes. Just use training images from each of the fonts you want to recognize.
Fantastic set of libraries
I’m using them for my University Final Year Project, where I’m implementing it on an Android platform, and it’s been a treat to work with
I saw you mention that it is under MIT licence now? Will this mean I need to include the MIT Licence Terms spiel into my report or is it actually under the BSD Licence still?
Also would you be against me adding you as a thanks into my acknowledgements? Just as a thankyou really
Cheers!
The new maintainers changed the license to MIT some years ago with my approval.
No problem on the acknowledgement. Whatever you feel comfortable with.
Glad you enjoyed the library!
I’m only commenting to say fantastic work. I am also in the process of writing an open-source collision response system. After all the hair pulling and 10 hour coding sessions I’ve been through I’d just like to say thanks for releasing this code under the BSD license. It is very hard to get yourself to commit to this after the months of work which goes into this sort of project.