Several years back, I was working on an imaging project in Java which was going to require some Optical Character Recognition (OCR) functionality. After an exhaustive search, I could find nothing to fit the bill. My requirements were:
- Must be written in Java
- Must be freely redistributable, with or without source code
- Must not be proprietary
- Must be able to recognize the fonts of various printers, even if that means that it has to be trained for each new font
- Must be reasonably fast
I never found anything that met my requirements, so I set about developing something to fit the bill. What I ended up developing is a generic, trainable OCR package that does a fairly decent job of decoding printed text, as long as it has been trained for the font(s) it is expected to recognize.
How it Works
This OCR engine is implemented as a Java library, along with a demo application which shows the library in action. The core concept, at the character level, is image matching with automatic position and aspect ratio correction, using a least-square-error matching algorithm. It is a very simple yet reasonably effective implementation.
The Training Phase
Training consists of the following steps:
- Printing out the characters which it is expected to recognize
- Scanning those characters into an image
- Cropping the image down so that it includes only the training characters
- Telling the OCR engine to use the resulting training image, and specifying which characters the image contains
The general steps used by this OCR engine for converting a scanned document to text are:
- Load training images
- Load the scanned image of the document to be converted to text
- Convert the scanned image to grayscale
- Filter the scanned image using a low-pass Finite Impulse Response (FIR) filter to remove dust
- Break the document into lines of text, based on whitespace between the text lines
- Break each line into characters, based on whitespace between the characters; using the average character width, determine where spaces occur within the line
- For each character, determine the most closely matching character from the training images and append that to the output text; for each space, append a space character to the output text
- Output the accumulated text
- If there are any more scanned images to be converted to text, return to step 2
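To make the two "break into lines" and "break into characters" steps concrete, here is a minimal sketch of whitespace-based segmentation. It is not the library's actual code: the class name WhitespaceSegmenter, the method names, and the whiteThreshold parameter are all made up for illustration, and the image is assumed to already be a 2-D array of gray levels (0 = black, 255 = white).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JavaOCR: splits a grayscale page into
// text lines and character cells wherever fully "white" rows/columns occur.
final class WhitespaceSegmenter {

    /** Returns [start, end) index pairs for each run of consecutive non-blank entries. */
    static List<int[]> findRuns(boolean[] blank) {
        List<int[]> runs = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < blank.length; i++) {
            if (!blank[i] && start < 0) {
                start = i;                       // a run of ink begins
            } else if (blank[i] && start >= 0) {
                runs.add(new int[] {start, i});  // whitespace ends the run
                start = -1;
            }
        }
        if (start >= 0) runs.add(new int[] {start, blank.length});
        return runs;
    }

    /** A row is blank when every pixel in it is at least whiteThreshold (assumed parameter). */
    static boolean[] blankRows(int[][] gray, int whiteThreshold) {
        boolean[] blank = new boolean[gray.length];
        for (int y = 0; y < gray.length; y++) {
            blank[y] = true;
            for (int v : gray[y]) {
                if (v < whiteThreshold) { blank[y] = false; break; }
            }
        }
        return blank;
    }

    /** Same test per column, restricted to the rows [top, bottom) of one text line. */
    static boolean[] blankColumns(int[][] gray, int top, int bottom, int whiteThreshold) {
        int width = gray[0].length;
        boolean[] blank = new boolean[width];
        for (int x = 0; x < width; x++) {
            blank[x] = true;
            for (int y = top; y < bottom; y++) {
                if (gray[y][x] < whiteThreshold) { blank[x] = false; break; }
            }
        }
        return blank;
    }

    public static void main(String[] args) {
        int[][] page = {
            {255, 255, 255, 255, 255},
            {0, 255, 0, 0, 255},
            {0, 255, 0, 0, 255},
            {255, 255, 255, 255, 255},
            {255, 0, 255, 255, 255},
        };
        for (int[] line : findRuns(blankRows(page, 128))) {
            List<int[]> cells = findRuns(blankColumns(page, line[0], line[1], 128));
            System.out.println("rows " + line[0] + ".." + (line[1] - 1)
                    + ": " + cells.size() + " character cell(s)");
        }
    }
}
```

Space detection would build on the same runs: a gap between two character cells that is wide compared to the average cell width can be treated as a space. How wide is "wide" is a tuning choice that this sketch leaves open.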
This is a generic, trainable OCR engine. By default, it knows nothing except how to (attempt to) filter/clean up dust, convert to grayscale, break the document into lines, break the lines into characters, compare each character against known characters in user-supplied training images, and output the closest matches as text.
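For readers who want to see what the grayscale and dust-filtering steps amount to, here is a small sketch using only the standard BufferedImage class. The class name Preprocess and its methods are hypothetical, and the 3x3 averaging kernel is just one simple choice of low-pass FIR filter; the kernel the library actually uses may differ.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helpers, not JavaOCR's own classes.
final class Preprocess {

    /** Converts an RGB image to gray levels (0 = black, 255 = white). */
    static int[][] toGray(BufferedImage img) {
        int[][] gray = new int[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                gray[y][x] = (r + g + b) / 3; // plain average; weighted luma is another common choice
            }
        }
        return gray;
    }

    /** 3x3 box filter: each output pixel is the mean of its neighborhood, with edges clamped. */
    static int[][] boxFilter3x3(int[][] gray) {
        int h = gray.length, w = gray[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = Math.min(h - 1, Math.max(0, y + dy));
                        int xx = Math.min(w - 1, Math.max(0, x + dx));
                        sum += gray[yy][xx];
                    }
                }
                out[y][x] = sum / 9;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 3; y++) {
            for (int x = 0; x < 3; x++) img.setRGB(x, y, 0xFFFFFF);
        }
        img.setRGB(1, 1, 0x000000); // a single-pixel "dust" speck
        int[][] gray = toGray(img);
        int[][] smooth = boxFilter3x3(gray);
        System.out.println("center before: " + gray[1][1] + ", after: " + smooth[1][1]);
    }
}
```

After filtering, an isolated dark speck is pulled most of the way back toward white, so a later white-threshold test is much less likely to mistake it for ink.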
The engine was originally written to digitize documents (or specific sections of documents) which were printed with a handful of known fonts for which it could be trained, in order to minimize the error. Digitization was not intended to be 100 percent accurate, since the digitized text was to be used mainly for searching the documents by keywords. It was intended to be used in a document imaging system.
Accuracy and Performance
With the simple documents with which it was tested, this OCR engine has compared favorably against the open-source OCR package GOCR. It translated images to text with at least comparable accuracy to GOCR, and was in the same ballpark as far as speed, if not somewhat faster than GOCR. Extensive comparisons were not performed.
The following instructions assume you’re running on a Linux box, with a reasonably recent version of Sun’s JDK installed. You can get the JDK at http://www.oracle.com/technetwork/java/index.html. Be sure to remove any “fake” java packages that come with your Linux distribution. If you install OpenOffice, chances are you’ll get a counterfeit GNU Java implementation which does not conform to Sun’s Java specification, and is actually quite outdated as well. Unfortunately, OpenOffice has dependencies on this package. To get rid of it, you’ll need to do something like this before installing Sun’s JDK:
rpm -e --nodeps java-1.4.2-gcj-compat
NOTE: This may BREAK your OpenOffice installation, at least until you install the Sun JDK to replace the missing Java functionality. But hey, the OpenOffice guys should know better than to force someone to install an illegitimate Java knock-off, especially since OpenOffice is operated by Sun, who created the real Java in the first place. There’s just no excuse.
As a potential “alternative”, if you’re more skilled than I am with the Linux alternatives package, you could use it to fix up the symlinks under /etc/alternatives to point to the real JDK without uninstalling the GNU Java knock-off. However, you’d have to be careful about software updates to the GNU Java knock-off “accidentally” resetting these symlinks, thereby breaking the real JDK. What a mess. Sun should really go after these guys for creating executables with the same names as Sun’s, and purposely interfering with the distribution of Sun’s legitimate Java implementation. After all, isn’t that what Microsoft did with their fake Java implementation? Bad actions are bad, no matter who’s doing them. But I digress.
So, back to the OCR engine. When you download and unpack the tarball, you’ll have an “ocr” directory. Under it you’ll find these scripts:
- compile – compiles the Java files into class files in the classes directory
- createJars – creates ocr.jar from the compiled classes
- ocrscannerdemo – demonstrates OCR functionality using any of several test images and corresponding training images
The source code *should* already be compiled, and there should be an ocr.jar file in the top-level “ocr” directory. If so, you can proceed. If not, or if you need to rebuild after making a change to the source code, just do the following:
./compile && ./createJars
Assuming there are no errors, you’ll get freshly compiled classes and a new ocr.jar with your changes.
Running the Demos
If you look under the ocrTests directory, there are several png and jpg files. Each of these is an image which contains text, and can be used to demonstrate the functionality of the OCR engine. To test the OCR engine on an image, do something like this:
Notice that there is also a directory named ocrTests/trainingImages. This contains the font samples that are used to train the OCR engine in the demo application, so that it can recognize the fonts that were used to create the test images in the ocrTests directory. If you look at the src/com/roncemer/ocr/OCRScannerDemo.java source file, in the loadTrainingImages() function, you’ll see that the demo app is loading up each of these training images and telling the OCR engine which character ranges are contained in each image. The OCR engine then uses these images to match against each character in the source image, in order to convert the source image into text.
Using the Code in your Program
To use the code in your own program, put ocr.jar into your classpath and follow the usage pattern which is used in the src/com/roncemer/ocr/OCRScannerDemo.java source file.
Feel free to look at the other source files, if you’re interested in the inner workings of the OCR engine. The concepts are fairly simple, yet reasonably effective.
I originally released this engine under the GPL license, version 2. However, I felt it would be more commercially friendly if it were re-released under the BSD license. As of May 6, 2010, I’ve created a project page on SourceForge, changed the license to BSD, and uploaded the whole thing to the SourceForge Subversion repository.
SourceForge Project Page
The new JavaOCR SourceForge project is located here: http://javaocr.sourceforge.net
As always, I’m interested in your feedback, suggestions for improvement, use cases, success stories, or whatever.
Before posting comments, PLEASE read the existing comments. No, seriously. I’m not kidding. The overwhelming majority of questions I receive have already been answered in the existing comments. As a result, I’ve adopted a policy of silently deleting duplicate questions. If you asked a question and it wasn’t answered, it’s probably because it was already answered in the comments section.
127 responses to “Java OCR”
Nice work. I am using Windows. Also, can I edit the code and add it as part of my project? My project is commercial.
The original license was BSD. I believe we moved it to MIT. Both licenses allow commercial use, and also allow private modifications without releasing the source code to your modifications. That’s why we picked these licenses — because they are very friendly to commercial software developers, of which I am one.
I can see the sourceforge project has further development and customized for android. Can you point me to a java code with the latest android projects where it does the training and matching in a single program like your OCRScannerDemo.java?
I have not been actively involved in the project very much since I wrote this article and released the source code on SourceForge. There are other volunteer developers who have done all of the improvements, the adaptation to Android, etc. Konstantin Pribluda would be the one to contact with questions about that. Just post your question on the forum at SourceForge.
where can I download this
When you download and unpack the tarball, you’ll have an “ocr” directory. Under it you’ll find these scripts:
compile – compiles the Java files into class files in the classes directory
createJars – creates ocr.jar from the compiled classes
ocrscannerdemo – demonstrates OCR functionality using any of several test images and corresponding training images
The project has since moved to SourceForge. The project page is http://sourceforge.net/projects/javaocr
I recommend looking over the JavaDocs after downloading the latest code. It’s completely different from when I released the package and wrote this article.
Excuse me sir, I can’t find the javadoc. Where can I find it?? I’m so sorry for the question.
No problem. The volunteers who have taken over the project have adapted it for building using Maven. When you build the project using Maven, there should be a build target which builds the JavaDocs. If the source tarball from SourceForge is missing the JavaDocs, I recommend posting that fact on the SourceForge project page. They’re pretty good at fixing these things.
Very nice work. But this is not working for multi-page TIFF images. Need guidance!
Here’s an article which shows how to convert a multi-page TIFF image to a single image using Java Advanced Imaging (JAI). That should fix you up.
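If you would rather not pull in JAI, something similar can be done with the standard javax.imageio API, which has shipped a TIFF reader since Java 9. The following is only a sketch under that assumption: the class TiffPages and its methods are made-up names, and stacking the pages top to bottom is just one way of producing a single image for scanning.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper: reads every page of a TIFF and stacks them vertically.
final class TiffPages {

    /** Reads all pages of a multi-page image file (assumes a TIFF reader is installed, Java 9+). */
    static List<BufferedImage> readPages(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) throw new IOException("Cannot open " + file);
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) throw new IOException("No image reader for " + file);
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                List<BufferedImage> pages = new ArrayList<>();
                int count = reader.getNumImages(true); // true = allow scanning the file to count pages
                for (int i = 0; i < count; i++) pages.add(reader.read(i));
                return pages;
            } finally {
                reader.dispose();
            }
        }
    }

    /** Draws the pages one below the other on a white canvas; expects at least one page. */
    static BufferedImage stackVertically(List<BufferedImage> pages) {
        int width = 0, height = 0;
        for (BufferedImage p : pages) {
            width = Math.max(width, p.getWidth());
            height += p.getHeight();
        }
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        int y = 0;
        for (BufferedImage p : pages) {
            g.drawImage(p, 0, y, null);
            y += p.getHeight();
        }
        g.dispose();
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TiffPages input.tif output.png
        BufferedImage stacked = stackVertically(readPages(new File(args[0])));
        ImageIO.write(stacked, "png", new File(args[1]));
    }
}
```

Scanning each page separately and concatenating the text output would work just as well and avoids building one very tall image.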
I’m looking to do something similar to this and have no real experience working with Java. If you don’t mind me asking, where would I start looking for information on building my own OCR program? It really doesn’t matter what language it’s in; I’m just looking for the theory on how this bit of programming works.
Try reading the Wikipedia article on optical character recognition. There are so many different approaches to OCR, each with different benefits and drawbacks.
I am using this project as a basis for my graduate research project. I’m having an issue with accuracy running on a Windows 7 box. Do you have any tips that will help in improving the accuracy of the output?
Regardless of operating system, there are a few things which can degrade accuracy.
First of all, you need at least one high-quality sample of the font you want to recognize. This will be your training image. The more of these you have (of the same font), the more accurate it’s likely to be (and slower as well). In your training images, make sure the characters are separated by enough whitespace that the training image loader can easily find the whitespace between any two adjacent characters. In other words, don’t let any characters touch each other in the training image(s).
Secondly, check your white threshold setting. I don’t recall exactly where it is — somewhere in the DocumentScanner class, if I recall correctly. The white threshold setting has an effect on the accuracy of character detection.
Keep in mind, JavaOCR is an experimental library which achieves high levels of accuracy under controlled circumstances. However, it is not a general-purpose OCR engine.
Greetings, Nice Work !! Thanks,
I’m looking for something like your project but for Android 🙂 If you know of one, I’d appreciate your recommendation.
The JavaOCR project works on Android. If I’m not mistaken, there may even be some Android demos that the guys have written, right there in the source code. Be sure to get the latest from the git repository, rather than downloading a tarball. Enjoy!
Sorry in advance for my question, but how can I get the result text using everything from Demo?
It gives back chars, lines (so cool), but as images… not text
The text is output to the console. When you run the demo from command line, you should see it outputting the text, I think. I did not write the latest demos though, so you’ll want to direct any further questions to the sourceforge project forum.
Can I use this to support more than one font?
Yes. Just use training images from each of the fonts you want to recognize.
Fantastic set of libraries 🙂
I’m using them for my University Final Year Project where I do implement it onto an Android platform and it’s been a treat to work with 🙂
I saw you mention that it is under MIT licence now? Will this mean I need to include the MIT Licence Terms spiel into my report or is it actually under the BSD Licence still?
Also would you be against me adding you as a thanks into my acknowledgements? Just as a thankyou really 🙂
The new maintainers changed the license to MIT some years ago with my approval.
No problem on the acknowledgement. Whatever you feel comfortable with.
Glad you enjoyed the library!
I’m only commenting to say fantastic work. I am also in the process of writing an open-source collision response system. After all the hair pulling and 10 hour coding sessions I’ve been through I’d just like to say thanks for releasing this code under the BSD license. It is very hard to get yourself to commit to this after the months of work which goes into this sort of project.
Hi, very interesting!
I have one question, I’m working on an OCR for the Amharic language (my native language, I’m from Ethiopia), any ideas on how to do that? Thanks in advance. 🙂
As long as your language can be separated into graphical glyphs representing characters, and those characters can be put together to form words, and the words are separated by spaces, it should be able to work for you. JavaOCR does a left-to-right, top-to-bottom scanning of the document, so it would be a problem for vertically typeset Chinese text, which reads top-to-bottom in columns ordered right-to-left. If your language does not read left-to-right, top-to-bottom, then you may have to modify the DocumentScanner by creating a special subclass of DocumentScanner to scan your documents in the correct direction. Other than that, you should be fine.
hii.. where i can download sample code to run this lib?
It should be included in the source distribution. Look for OCRScannerDemo.java.
I am trying to OCR a simple image of a number, but I can’t get the result.
I can find on the code how to get the text result.
Can anyone help me?
Did you mean to say that you cannot find the code?
Have a look for OCRScannerDemo.java. It should be in the source code distribution. It provides an example of how to train the OCR engine and scan an image to extract the text, all in a single, simple command-line application.
How can I use this library in an Android application?
I’m developing an Android app and I want to scan a visiting card and fetch details from the card, like phone number, address, email ID, etc.
Check out the “OCR Caller” and “OCR Fill-Up” apps by developer Konstantin Pribluda on Google Play Store. Konstantin is one of the contributors and maintainers of JavaOCR. I am told that both of these apps use JavaOCR to do their text recognition and extraction. If you send him a message on SourceForge (user ko5tik), he may offer you some pointers as to where to start writing an Android app using JavaOCR.
scanner.clearTrainingImages(); I am getting error with this function
Exception in thread “main” java.lang.NullPointerException
I disabled above statement then in
I am getting below error.
Exception in thread “main” java.lang.NullPointerException
I am using the default code at
I am unable to get images scanned and getting java.lang.NullPointerException upon using scanner.scan() function.
Do I need to teach/ Train it all the fonts and sizes?
Say I train it a font with font-size 10 will it automatically learn font-size 14 and higher?
The JavaOCR project is not hosted on Google Code. Please go to http://sourceforge.net/projects/javaocr/ for the original project page.
Hi, thanks for this great library, but I’m unable to get the JavaDoc documentation, should I get it using maven or what?
Thanks again, and sorry for my silly question 😛
I believe that you can build the JavaDoc documentation with Maven. Maven support was added after I released the library as open source. If it’s not obvious how to build the JavaDoc documentation, I would recommend posting a question on the SourceForge project page, which is located here: http://sourceforge.net/projects/javaocr/
Hi, I am a student, and I want to use JavaOCR in a faculty-related project. Because my project involves making an Android application that uses this library, I was wondering if you could tell me where I can find a forum, a discussion group, or somewhere else to ask a few questions about the current version of the library. When I download a snapshot from the git page hosting the project, I get a ton of errors, especially with the Maven build of the project…
Thanks in advance, and keep up the good work 🙂
Probably the best place to post questions would be the SourceForge project page, which is located here: http://sourceforge.net/projects/javaocr/
I was using your tool yesterday. It does a pretty good job. I’d like to make some comments as a way of providing constructive feedback from a user perspective that you might find useful.
1. The documentation is very poor. As a software maintainer I know it is hard and time consuming to keep it updated and complete but it is a critical step
2. I know there is an OCRScannerDemo.java absolutely hidden deep in many folders in the source code I downloaded. It did help to decrease the impact of the previous item. However, it is unnecessarily hard to read and it even uses deprecated classes such as the “old” PixelImage (I mean, there is a “new” PixelImage class in a different folder). Updating that file and moving it to a decent location seems to be critical too
3. The flags you can control in the DocumentScanner should be better documented. There are only a few of them, but as they control the output, the user should know every detail about what they do.
4. Although the CharacterRange is pretty useful when you have a complete set of characters, whenever you don’t have all the characters from ‘!’ to ‘~’ you have to do a huge work splitting images into consecutive characters and then load each of those subsets. I think some kind of wrapper of that class that would allow the user to load just a String of characters from a file would be pretty useful. That would save a lot of editing time.
5. Consider whether there should be a grayscale and filter call in the OCRScanner.scan() method. I noticed it is being called twice in the demo, since it is called explicitly in the demo file and again inside the scan() method. So I guess it should be removed from one of those places.
Having said that, I insist that the results I got were good.
Now, I do have a question, probably related to item 3. I have some GIF images that contain text directly exported from a computer (I mean, nothing has been scanned in the real world). The files contain extremely clear black text on a white background (see the files here). The font used was monospaced (e.g. Consolas), so I thought that it would be an easy task for any OCR. I found out, however, that this was not the case for this library.
The first issue I stumbled on was the fact that I couldn’t load the font in the TrainerImageLoader. After a lot of research (I ended up outputting “1” and “0” to the console to ASCII-draw the image that JavaOCR was generating after the grayscale and filter phase performed during the loading) I found out that the text was smaller than the library was expecting, and hence it merged adjacent letters (e.g. rn could have been read as an m). I fixed this by increasing the size of the images to load by 4 times (originally, they were 15 pixels high).
After fixing that I thought that it was going to be enough to also increase the size of the images to compare by 4 times and that would be it. However, that was not the case. The library was able to translate every single character but the period. It is just being ignored in the results as if it weren’t there. Not even a space is being returned. The period is big enough now, so I guess the library is somehow comparing the size of the period with the size of the other characters and discarding it (remember the font is monospaced, though).
I also tried modifying the code of the OCRScanner.scan() method so that it doesn’t perform the grayscale and filtering processes but that didn’t do the trick. I was actually expecting that to work as I have the 100% exact same period loaded in the trainer so if I don’t modify the input image it should 100% match the one in the trainer.
I bet there must be some kind of value to set in the DocumentScanner so that it doesn’t interpret it as if it was garbage in the image. I played with some pseudo-random-and-guessed values to the different variables but that didn’t do the trick. Can you tell me how could I configure that object to improve the results or if you can think of a different workaround?
Thanks in advance.
PS: Sorry for the long and boring post.
Thanks for your feedback!
I’d like to refer you to the JavaOCR project page here: http://sourceforge.net/projects/javaocr/
The reason is, I haven’t been actively involved in the project for a few years now. I know the volunteers who took it over have done some really nice things with it. Please make sure you’re evaluating the latest version. And if any of these issues still exist, please be sure to file bug reports for anything you see that’s still a problem.
I developed this library many years ago to scan document identifiers for a document imaging system. The task was really simple, because the font was known (hence the font training). But the people who are working on it now are much more knowledgeable in OCR, so they’ve taken it much further than I could have envisioned.
Thanks again for playing around with JavaOCR, and for giving your feedback!
Thanks for your great project.
I’m developing an OCR application on Android.
I want to use a least-square-error matching algorithm, but I don’t understand this algorithm clearly.
So, would you send me some documents about this algorithm?
Thanks very much,
The algorithm is very simple. As a result, there’s no need to document it extensively. If you read the code, it’s pretty straightforward.
But I’ll describe it in general here.
The first step is loading and pre-processing the training images. The training image loader chops up a series of characters in the image into sub-images, based on how much whitespace there is between subsequent images. You tell it the range of characters in the image, make sure there’s enough whitespace between the characters in the training image(s), and it loads them up, chops them into individual character sub-images, calculates the aspect ratio of each character sub-image, and stores the character sub-images and their aspect ratios away under their respective character codes.
The second and final step is the actual scanning of the document and converting it to text. The document scanner chops the document into rows of text (lines) and characters within those lines, based on a whitespace detection algorithm. It’s VERY important that the document be properly aligned. Any tilt left or right, and scanning will fail miserably. When processing the document, each character’s aspect ratio is calculated. Then the training images are qualified by aspect ratio, so that training images which are way off from the aspect ratio of the candidate are ignored. Candidate image refers to the character subimage extracted from the document being scanned, and corresponds to one character of final text output. After the aspect ratio qualification, each qualified training image (which passed the aspect ratio test) is then scaled to the exact same size (and aspect ratio) as the candidate image and compared, pixel-by-pixel, to the candidate image. The differences in pixel values between the training image and the candidate image are squared and summed. The training image which has the smallest sum of the squared pixel difference is the winner. We then output the corresponding character. Once all of the character cells in all of the lines of text are processed, the document is done.
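Since people keep asking, here is roughly what the matching step looks like in code. Treat it as a sketch of the idea described above and not as the library's source: the class LeastSquaresMatcher, the nearest-neighbor scaling, and the maxAspectRatioDiff tolerance are all simplifying assumptions of mine for this example, and images are plain 2-D arrays of gray levels.

```java
// Hypothetical illustration of least-squared-error character matching.
final class LeastSquaresMatcher {

    /** Nearest-neighbor rescale of a gray image to width w and height h. */
    static int[][] scale(int[][] src, int w, int h) {
        int sh = src.length, sw = src[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y][x] = src[y * sh / h][x * sw / w];
            }
        }
        return out;
    }

    /** Sum of squared pixel differences between two images of equal size. */
    static long squaredError(int[][] a, int[][] b) {
        long sum = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                long d = a[y][x] - b[y][x];
                sum += d * d;
            }
        }
        return sum;
    }

    /**
     * Returns the index of the training image with the smallest squared error,
     * or -1 if no training image passes the aspect-ratio qualification.
     */
    static int bestMatch(int[][] candidate, int[][][] training, double maxAspectRatioDiff) {
        int h = candidate.length, w = candidate[0].length;
        double candidateAspect = (double) w / h;
        int best = -1;
        long bestError = Long.MAX_VALUE;
        for (int i = 0; i < training.length; i++) {
            int[][] t = training[i];
            double aspect = (double) t[0].length / t.length;
            if (Math.abs(aspect - candidateAspect) > maxAspectRatioDiff) {
                continue; // too far off in shape: skip without comparing pixels
            }
            long error = squaredError(candidate, scale(t, w, h));
            if (error < bestError) {
                bestError = error;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] candidate = {{0, 255}, {0, 255}};
        int[][][] training = {
            {{255, 0}, {255, 0}},
            {{0, 0, 255, 255}, {0, 0, 255, 255}, {0, 0, 255, 255}, {0, 0, 255, 255}},
        };
        System.out.println("best training image: " + bestMatch(candidate, training, 0.5));
    }
}
```

In a real scanner each training image would also carry the character code it stands for, so the winning index maps directly to the character that gets appended to the output text.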
There can be multiple training images, each containing different (or the same) ranges of characters. However, if they are of widely varying fonts, accuracy may suffer. Also, scanning speed decreases as the number of training images increases.
The algorithm is called a Least-Mean-Squared-Error algorithm because it skips the square root calculations (as required by Pythagorean Theorem). But in every case, it selects the same training image that a Least-RMS-Error algorithm would select. The candidate and training images are basically treated like N-dimensional vectors, where N is width times height. That’s why Pythagorean Theorem works, because it’s a vector distance calculation formula. We’re looking for the training image with the minimum RMS error when compared to the candidate image. However, we don’t need to do the square root portion of the RMS calculation (which is simply Pythagorean Theorem repurposed) because the sums of the squared differences are sufficient when compared to each other. Skipping the square root calculation results in a much faster algorithm.
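To spell out why dropping the square root is safe, write the candidate image as a vector c and training image k as a vector t_k, each with N = width times height pixels (the symbols are mine, introduced only for this note):

```latex
\begin{align*}
E_k &= \sum_{i=1}^{N} \left(c_i - t_{k,i}\right)^2,
&
\mathrm{RMS}_k &= \sqrt{\frac{E_k}{N}} .
\end{align*}
Because $x \mapsto \sqrt{x/N}$ is strictly increasing for $x \ge 0$,
\begin{align*}
E_a < E_b \;\iff\; \mathrm{RMS}_a < \mathrm{RMS}_b
\qquad\Longrightarrow\qquad
\arg\min_k \mathrm{RMS}_k = \arg\min_k E_k .
\end{align*}
```

For example, with N = 4, squared-error sums of 100 and 144 correspond to RMS errors of 5 and 6. The ordering is the same either way, so the same training image wins.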
My requirement is to convert a scanned document containing text into a Word document.
Can this OCR application do this? If so, what changes do I need to make?
Please help me… it’s very urgent.
JavaOCR is an API library which was designed to enable Java developers to write Java applications with OCR functionality. Unless you are a Java developer with interest in learning how OCR works, and helping to improve the OCR accuracy of JavaOCR, this project probably won’t be of much interest to you.
There is a command-line OCR application, OCRScannerDemo, which, if trained with samples of your fonts, can deliver a decent conversion accuracy. If you understand Java development, I recommend reading the JavaDocs as a starting point, and then looking at OCRScannerDemo.java.
I have just downloaded your distribution from sourceforge. The zip file contains several files with no documentation, so I don’t know how to start.
Could you explain the role of each file and give me a sample showing how to use your software?
Thanks in advance.
It should all be in the JavaDoc documentation when you build the JavaOCR library. Check on the SourceForge discussion page for this project. If there’s no quick start, please ask the maintainers to write it. You may have to file a bug report. They’re pretty good about quickly following up on issues. There should be at least build instructions, along with instructions on how to run the OCRScannerDemo command-line application.
Keep in mind, this is targeted toward developers rather than end users. In other words, it’s an API library rather than an application.
If you’re interested in contributing some time to help write a README or a quickstart guide, I can guarantee your help would be greatly appreciated.
Thanks for your interest!
I get only JARs. Where are the source code files on SourceForge?
I am not able to find source code.
You can check out the source code using git. The instructions should be here: http://sourceforge.net/p/javaocr/source/ci/master/tree/
Hello Mr. Ron I want to start an android project that makes use of optical character recognition (OCR) I found your library upon searching but I wanted to know from you where could I learn optical character recognition from the basics up to building algorithms, and how could I understand all the classes in your library you have build for this purpose and would it be able to get integrated with android ? I would really appreciate your help.
The easiest thing would be to download the sources from the SourceForge project page, build the project (including the JavaDocs) and then read through the JavaDocs. If you have any questions on how to do this, just post them under the SourceForge project page, and they’ll be happy to help you.
Sounds great, but how do I use it? Is there any documentation? If so, where? I’d love to try it out. 🙂
Please read the other comments for details.
Hello Ron, your Java OCR libraries are really good, but I have a question. I downloaded the file from http://sourceforge.net/projects/javaocr, but it is a package of JAR files. How do I import these JAR files in Maven and create the JavaDocs? I want to use OCR in my web application to recognize handwriting in scanned documents. Will this library support handwriting recognition or not?
I think you’ll probably want to start with the full sources. That way, you can build it with full JavaDocs. You can get the sources using git via the git URL provided on the SourceForge.net project page.
Ron, thanks for that, but i have one doubt, the JAVA OCR have support for read Scanned Handwritten Document ?
I’ve never done any work on handwriting recognition, but some of the other maintainers have. I recommend reading the comments in the source code. It really is a small project, so looking over the source code is a very trivial task. I recommend doing that first.
your project which I have downloaded is working fine. just for the documentation part do you have any documents related to software development (SE) cycle related to this project. And any base paper related to the project will be really helpful.
The software development cycle is: develop, compile, test/debug, release. Not sure what you’re asking for there. We don’t have (or tolerate) much bureaucracy or red tape in the open source world.
As far as a base paper goes, the only thing we have is the comments in this thread, and the discussion board on the sourceforge project. It’s a very open development process. If you look through the comments, I’ve described the algorithm in detail.
As far as papers go, several have used this project as an information source and testbed for their degree theses. You’re welcome to do the same. Just don’t claim that you developed the software or came up with the ideas that make it work. Give credit where credit is due, and all of that.
I want to extract text from text image,so can you please help me for that……
Please send me source code for that ,if possible on my mail firstname.lastname@example.org….
It’s all in the demo code. I recommend reading the source code, since it is very small. This is really a very simple project. Also look for main() functions inside the source files. Each of those are runnable. There should be an OCRScannerDemo.java (or OcrScannerDemo.java) file somewhere in there which does everything you want to do. Check the discussion board on the sourceforge project page. The maintainers have dealt with this question there, I believe.
Hi Ron, I am trying to use your OCR in on of my projects. However after I download the released version from SourceForge I can’t find the OCRScannerDemo.java file. Could you give me a pointer to that? Thanks.
Check the message board on the sourceforge project page. I believe the maintainers have dealt with this question there.
I must say you have done a great job by making JavaOCR, and you are doing a great job by answering almost all posts.
Hats off to you!
I have one question. I want to develop an Android app which can start the mobile camera so that the user can take a picture of either a laptop screen or paper, and later I have to traverse the image to do some sort of code validation.
What portion of my project can be handled by your Java OCR, and do you have any other guidelines for me?
Thanks in advance
In its current state, I’d think JavaOCR might be difficult to use for that type of task, as it doesn’t have what is called “document registration” functionality. That is, it has nothing to tilt or shear the document to make it perfectly square and horizontally aligned before beginning to scan the document for lines of text and character cells within those lines of text. So you’d most likely need to provide that kind of functionality yourself, or get a straight-on, perfectly level, high-resolution camera shot.
Dear Ron Cemer,
We are keen to re-use JavaOCR and have downloaded the 2012 version. There doesn’t seem to have been any activity for two years (http://sourceforge.net/p/javaocr/mailman/message/29898556/ ). We’ve got it working (though we’ve had to strip out the Android stuff and de-modularise the Maven as it no longer builds in some JavaVMs).
It doesn’t seem that the later developers are active (that’s not a moral judgment! – I have left projects in limbo myself). If you have their emails I would be happy to contact them and get the latest position.
If we don’t hear positively that they are working on it we’d like to fork parts of the project – obviously with positive attribution. We’re particularly keen on the Mahalanobis and Hu Moments. We’d like to add some of our own features (the topology of character skeletons) to help resolve difficult characters (e.g. “c” and “o” or “8” and “B”). We’d probably use BoofCV for the filtering, as it is active and documented, and modularise JavaOCR to the actual character recognition and not preprocessing – assuming that is reasonable.
Please communicate with the others if you have their addresses.
You should now have my email (see also http://en.wikipedia.org/wiki/Peter_Murray-Rust) and I would love to hear from you.
All my code is FLOSS – most is Apache2 and I am particularly keen on liberating data both from technological constraints (e.g. pixels) and legal ones (publishers and corporations).
Thanks for your interest in JavaOCR. You are probably correct about it being unmaintained at this point. I certainly haven’t done much work on it in the past several years. I will contact you separately by email, but I would actually like to add you and your team as maintainers/admins to the JavaOCR project, if you’re interested. That way, you can do all of the improvements right there in the JavaOCR project. I’m not a big fan of forking when it comes to adding features and functionality. I’d love to see JavaOCR become something really useful, and the way I see that happening is through teamwork and iterative improvements. Someone does a bit here, someone else does a bit there, and so on.
First, thanks for making this component! Seems many people have benefited from what you have done. Now, I haven’t used this myself but I would like to know if I could use this to read Seven Segment Led displays. I came across this software (http://www.unix-ag.uni-kl.de/~auerswal/ssocr/) and I was wondering if your software can accomplish the same thing? I needed a java ocr so I prefer to use yours.
Thanks a lot!
If you train it with that font, it should be able to recognize it. The only thing you’d need to do is ensure that the training image and the image to be scanned are both black characters on a white background. So the LED segments need to be black when on, white when off, on a white background. Create at least one sample image of every character you want to recognize, and use those for your training images.
This is a great article. Thanks for that. I’m a student and I’m new to OCR, and I’m trying to build one for my own language, which is “Sinhalese”. Can I use your development for that? Is it the same approach for a different language? Do you have any suggestions on developing an OCR for a language other than English?
Thanks a lot 🙂
It all comes down to the training images. As long as your language is read left-to-right, top-to-bottom, and there is sufficient whitespace between characters within a word, it should work fine.
Hi ron, I am glad to find the one who is saying the word ‘Open source’ in OCR field.
I am trying to develop a business card reader engine. I am facing two main challenges. First, finding a free OCR. Second, even once the OCR is found, I am unable to settle on a way to scan the different contents of business cards, such as the logo and the text part. I would like to ignore the logo part and scan the text part of the image using OCR. Is it possible?
JavaOCR deals only with scanning text, once a region of text has been discovered and isolated. If there’s anything other than text in the (sub)image you pass to it, scanning will probably fail. You may be able to combine it with something like OpenCV in order to get what you want.
Thanks! It looks awesome!
Any chance that you have a short example for getting all available text from an image?
Look for OCRScannerDemo.java in the source code. It does exactly that.
Thank you for sharing your work, and for taking the time to write this post for the benefit of the community. I was wondering if this OCR only works with a white background? I’ll have the same “template” of image every time, maybe containing some things I’ll need to crop before passing the part I’m interested in to be processed. But I may have, let’s say, black, yellow, or whatever as the background, with text with good contrast (really easy for the human eye) in any other color.
Do you think your software (with some pre-processing for the crop) may be a fit?
It does expect the characters to be darker than the background. If you do appropriate pre-processing, you should be able to get the image into a dark-text-on-lighter-background scenario, in which case it should scan it just fine. Cropping is definitely needed. You don’t want it trying to scan anything that is not text.
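As a rough illustration of that kind of pre-processing, here is a minimal sketch that flips a grayscale image to dark-on-light when it looks like light-on-dark. The class and method names are hypothetical and not part of JavaOCR, and the heuristic (the background is whatever most pixels are) is an assumption that only holds after cropping to a text region.

```java
// Hypothetical pre-processing helper; not part of the JavaOCR API.
final class PolarityFix {
    /**
     * Returns a copy of the grayscale pixels (0..255) in which the text is
     * darker than the background. Assumes the background covers more than
     * half of the pixels, which is typical for a cropped block of text.
     */
    static int[] ensureDarkOnLight(int[] gray) {
        int darkCount = 0;
        for (int v : gray) {
            if (v < 128) {
                darkCount++;
            }
        }
        // If most pixels are dark, the background is dark, so invert.
        boolean lightOnDark = darkCount * 2 > gray.length;
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) {
            out[i] = lightOnDark ? 255 - gray[i] : gray[i];
        }
        return out;
    }
}
```

Colored backgrounds would first need converting to grayscale; the inversion step itself stays the same.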
This is absolutely fantastic. Got it working, no issues. I’m glad people are still contributing useful stuff open source!
Can I change the training images in order to match some symbols I draw?
It’s just an image matching algorithm, so you can at least give it a try. Your mileage may vary, depending on the quality of the training images and the quality of the drawn images.
I have been looking at your project for a few days. I need an OCR engine to use in my Java projects, but I couldn’t figure out how to use your project. There are 3 different copies on SourceForge and none of them has a compile script or ocr.jar, so I couldn’t follow the instructions at the top of this page. I’m not sure if there is an up-to-date guide for compiling and using this project, but if there is one, I’m sorry for taking your time.
Thanks a lot, both for your help and this open source project
I believe the JavaOCR maintainers have adapted it to use Maven for building and testing.
Thanks for this useful tool. Can this tool recognize(identify) the captcha like this link?
Thank you so much.
Not likely. Sorry about that. It really likes consistent, high-contrast, black (or dark) characters on a white (or light) background. All of that “snow” is probably going to prevent it from being able to find the character blocks. It’s not really a machine learning algorithm, but an image matching algorithm.
Now, the guys who have taken up the mantle and really improved the JavaOCR project may have put some better algorithms in there. I don’t know for sure. But the original JavaOCR I wrote was a very simple image matching algorithm.
Hi Ron, firstly I want to thank you for all this useful work.
I want to use this, but I don’t understand how to create a set of training images. Can you explain it to me?
Please read my recent reply to Vishwas on this same page. It covers how the training images work. Also, look for OCRScannerDemo.java in the source tree. It shows exactly how the JavaOCR library should be used.
I downloaded the jars, such as javaocr-core-1.0.jar, javaocr-plugin-awt-1.0.jar, javaocr-plugin-cluster-1.0.jar, javaocr-plugin-fir-1.0.jar, etc., but I have no idea how to use them. I was trying to pass the path of an image to the loadTrainingImages method of class OCRScannerDemo and then call the process method of the same class. Can you give an example of how to use it?
This question has been asked a lot. Look for a file named OCRScannerDemo.java in the source tree. It shows exactly how to use the JavaOCR library.
The project works fine. I was in doubt about what the expected start char and end char fields need to be filled in with. Please help with it.
Each training image should cover a consecutive range of character codes (ASCII or Unicode). The start and end characters are just the first and last character in the range of characters contained in the training image. For example, if the training image contains the letters ABCDEFGHIJKLMNOPQRSTUVWXYZ, then the first character would be A and the last character would be Z. Note that they must be consecutive character codes. Each character in the training image must be the very next character code (+1) after the character code of the character before it.
The first table on this page http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm is the 7-bit ASCII code. We see that A is character code 65 (decimal) and Z is 90. All of the character codes in the range 65…90 are covered by the letters A-Z. So when you’re building your training images, you want to use only consecutive characters in the ASCII or Unicode character set.
I made training images with A-Z that work correctly with the algorithm. They supported the start char and end char values A-Z correctly. My question was specific to the training image that is available in the package (ASCII). I tried entering ~ as start char and ! as end, and the result generated an error. How can I rectify it? I tried searching the MSEocr function, but it doesn’t clearly set requirements for the start char and end char values.
The tilde (~) character is hex character code 7E, and the ! character is hex character code 21. 7E is far greater than 21. So your characters are out of order: the start character must have the lower code. They need to be consecutive in their character codes. Please refer to the chart located here: http://www.ascii-code.com/
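To make the range rule concrete, here is a small sketch that checks a start/end pair and reports how many characters the training image must contain. The class and method names are hypothetical; this is not the check JavaOCR itself performs, only an illustration of the rule described above.

```java
// Hypothetical helper illustrating the consecutive-range rule; not JavaOCR code.
final class TrainingRange {
    /** Number of characters a training image must contain for first..last. */
    static int expectedCount(char first, char last) {
        if (last < first) {
            throw new IllegalArgumentException(
                "End character 0x" + Integer.toHexString(last)
                + " comes before start character 0x" + Integer.toHexString(first));
        }
        return last - first + 1;
    }

    /** The characters in the order they must appear in the training image. */
    static String charactersInOrder(char first, char last) {
        StringBuilder sb = new StringBuilder(expectedCount(first, last));
        // Use an int counter so the loop cannot wrap around at '\uffff'.
        for (int c = first; c <= last; c++) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```

With this, 'A' to 'Z' gives 26 characters and '!' to '~' gives 94, while passing '~' as the start and '!' as the end is rejected.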
Hi, I was able to clone the git repository of javaocr and get it up and running with maven. I’m trying to run the OCRScannerDemo.java from the awt/plugins folder. It specifies that I need to provide a “-DTRAINING_IMAGE_DIR=”. I can’t seem to figure out what I’m supposed to give it there, or how to create this training image directory.
My overall goal is to be able to run your JavaOCR library on a java application (server side) where I feed it images and it outputs detected text. I’m having a bit of difficulty trying to do this by looking at the demo source code because there isn’t a “right out of the box” demo I can just run and see it work.
Is there any help you can give me to get a demo working ?
I just took a look, and found that sample training images exist in legacy/ocrTests/trainingImages under the source tree. Also, in legacy/ocrTests, there are some images which the engine can properly decode, as long as the proper training image is used. For example, if you use legacy/ocrTests/trainingImages/hpljPica.jpg as the training image, then you can decode any text printed in that font, such as the text in legacy/ocrTests/hpljPicaSample.jpg.
OK great! I found it and was able to run the demo. Thanks for that it really helped. Though I must say that I struggled a bit with getting the whole maven thing to run first.
Here’s my other question, about the training images. Do they have to be in the same format as the ones in legacy/ocrTests/trainingImages, or can they be in any random order? It should work with hand-written letters too, right?
Thanks for replying quickly 🙂
The characters in a training image need to be consecutive within the ASCII or Unicode character set.
I’m not sure how the handwriting recognition functionality works, or whether it works at all. That was added by some others after I open-sourced the code.
So I’m trying to make it train with the font Arial and loading the training image works fine. But when I run it against a simple sentence like
“HELLO THIS IS A TEST” it returns “HELLOTHISISATEST”
Can’t figure out why it’s omitting the spaces. Though in the defined character range between ! and ~ the space character is not included. But with the legacy training images the spaces are detected. Am I missing something here?
No, it sounds as if you’re doing it all correctly. There are some settings in the DocumentScanner class, I believe it is, which you can tweak to control the minimum space width, as a fraction of line height. I’d first look at turning that down a bit. It really depends on your font.
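To show what such a setting does, here is a minimal sketch of space detection based on a minimum gap width expressed as a fraction of line height. All names and the parameter are hypothetical; the real DocumentScanner setting may be named and applied differently.

```java
// Hypothetical sketch of gap-based space detection; not the DocumentScanner code.
final class SpaceDetector {
    /**
     * @param chars            decoded characters of one line, left to right
     * @param gaps             gaps[i] is the whitespace in pixels between chars[i] and chars[i + 1]
     * @param lineHeight       height of the text line in pixels
     * @param minSpaceFraction minimum gap, as a fraction of line height, that counts as a space
     */
    static String joinWithSpaces(char[] chars, int[] gaps, int lineHeight, double minSpaceFraction) {
        double minSpaceWidth = lineHeight * minSpaceFraction;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            sb.append(chars[i]);
            // A gap at least as wide as the threshold is treated as a space.
            if (i < gaps.length && gaps[i] >= minSpaceWidth) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }
}
```

If the fraction is too high for a narrow font such as Arial, word gaps fall below the threshold and the words run together, which matches the symptom described; lowering the fraction brings the spaces back.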
Hi Ron, I am a student, and I want to use your JavaOCR to implement a system which can identify the number on an ID card. I have never used Maven, so how can I use JavaOCR in Eclipse? Could you give me the details? Thanks a lot!
I’m not a Maven or Eclipse expert. I’d recommend posing that question on the forums at the project page, which is http://sourceforge.net/projects/javaocr
1. why didn’t you choose GitHub as a source control (rather than SourceForge)? 🙂
2. do you upload releases to maven-repos (e.g. sonatype)
When I released the JavaOCR source code under an open source license, SourceForge was the most popular open-source source code project hosting website. I actually prefer GitHub, and I use it for all new projects.
I don’t play with Maven, so I don’t know what maven-repos is.
You’re welcome to pose these questions on the SourceForge project page. Perhaps the current JavaOCR maintainers would like to move to GitHub and do the maven-repos thing for you.
Hi Ron, I just set the argument as “ocrTests/digits.jpg” and ran the project from the file OCRScannerDemo.java in Eclipse, and it shows the error “Please specify -DTRAINING_IMAGE_DIR= on the java command line.” How do I set the value of TRAINING_IMAGE_DIR?
I always develop from command line. You’ll want to set the TRAINING_IMAGE_DIR Java constant in Eclipse before running the project. I’m not an Eclipse expert, so I can’t tell you exactly how to do that. But I believe there is some option to set the Java command-line parameters and pre-defined constants. That’s where you’d want to set this.
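For context, the -D option on the java command line defines a Java system property, which the program reads with System.getProperty. The sketch below shows that mechanism, assuming the demo reads a property named TRAINING_IMAGE_DIR as its error message suggests; the class and method names are hypothetical.

```java
// Hypothetical sketch of reading the -DTRAINING_IMAGE_DIR setting; not the demo's actual code.
final class TrainingDirConfig {
    static final String PROPERTY = "TRAINING_IMAGE_DIR";

    /** Returns the directory passed with -DTRAINING_IMAGE_DIR=..., or fails with a hint. */
    static String requireTrainingImageDir() {
        String dir = System.getProperty(PROPERTY);
        if (dir == null || dir.trim().isEmpty()) {
            throw new IllegalStateException(
                "Please specify -D" + PROPERTY + "=<directory> on the java command line.");
        }
        return dir;
    }
}
```

On the command line that would look like java -DTRAINING_IMAGE_DIR=legacy/ocrTests/trainingImages followed by the usual classpath and main class. In Eclipse, the same -D text normally goes in the VM arguments box of the run configuration rather than in the program arguments.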
Hello, you did a great job, but I’m trying to use your project demo and I get an error in the load method. It looks like it requires a component. What can I do? Thanks.
Have a look at OCRScannerDemo.java, and the JavaDocs. JavaOCR is a library, for use by Java developers. It’s really designed for you to write your own applications on top of it, and for experimenting with various OCR algorithms.
Hi Ron, first, thank you very much for all your work.
I have a question about the line extraction method, which can extract line images.
First, how can I retrieve only the text, without the line images?
Second, if it can extract lines, is it possible to extract columns from images?
Look at the DocumentScanner class, and corresponding subclasses. If you’re talking about the outlines which are shown on the screen when characters are being decoded, that’s just for informational purposes, to help debug the DocumentScanner to ensure that it’s separating the lines of text and the character cells within the lines properly.
I have to choose between 4 Java APIs (including yours) to deal with OCR.
And I don’t know how to easily integrate your project into mine.
I just want to get the text from a picture, nothing more.
Look for OCRScannerDemo.java, and look at the JavaDocs which should be available after building the JavaOCR library. That should get you started.
Hi Ron, it’s a great project.
I am very interested in this work, but I cannot understand the least-square-error matching algorithm. Could you guide me on how to understand the algorithm, or give me some reference documents?
Thank you so much!
Think of it as taking two images and calculating the difference in pixel values between them. Each pixel delta (difference) is then treated as one component of a vector in an N-dimensional space, where N is the number of pixels in each of the images. In this case, an “image” is actually a single character cell, both in the source document and in the training images. Using the Pythagorean theorem, we know that the distance between two points in space is the square root of the sum of the squares of the distances along each axis. This is how the hypotenuse of a right triangle is calculated from the lengths of the other two sides. In the case of JavaOCR, the square root is never calculated; only the sum of the squares. We only care about which training image results in the least sum of the squares when compared with the character we’re trying to decode, and skipping the square root does not change that ordering. So, each training image whose aspect ratio is within a percentage of the actual aspect ratio of the character to be decoded is compared to the image of the character to be decoded, and the sum of the squares of the differences in the pixels is calculated. Whichever training image results in the smallest sum of the squared pixel differences is assumed to be the correct character, and that character is then sent to the output.
If you check the article and the other comments, this should be covered pretty thoroughly. Mean Squared Error and RMS Error are industry standards for measuring everything from harmonic distortion in amplifier circuits, to measuring the quality of decompressed audio or video from lossy CODECs.
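The matching step described above can be sketched in a few lines. This is a simplified illustration, not the library's code: the names are hypothetical, and it assumes the character cell and every training cell have already been scaled to the same pixel dimensions and filtered by aspect ratio.

```java
// Hypothetical sketch of least-square-error matching; not the JavaOCR implementation.
final class LeastSquaresMatcher {
    /** Sum of squared pixel differences between two equally sized grayscale cells. */
    static long sumSquaredError(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("cells must have the same number of pixels");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            long delta = a[i] - b[i];
            sum += delta * delta; // no square root needed to find the minimum
        }
        return sum;
    }

    /** Index of the training cell with the smallest error, or -1 if there are none. */
    static int bestMatch(int[] cell, int[][] trainingCells) {
        int best = -1;
        long bestError = Long.MAX_VALUE;
        for (int t = 0; t < trainingCells.length; t++) {
            long error = sumSquaredError(cell, trainingCells[t]);
            if (error < bestError) {
                bestError = error;
                best = t;
            }
        }
        return best;
    }
}
```

The returned index would then be mapped back to the character that the winning training cell represents.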
Thanks for the nice work.
Where can I get java doc? I am trying to implement hand writing detection using javaocr.
It should be available when you do a complete build. They use maven for building, and I’m not a maven expert. Also, look for OCRScannerDemo.java. Be sure to read the other comments in this article, as this subject has been touched on quite a few times in the comments.
I would like to use your Library for my Engineering project. I’ve been able to download and run the OCR engine with good results. I’m currently trying to understand the code and the different algorithms and computer applications that have been used. I would greatly appreciate any form of theory or data that you may be able to provide for further understanding the code.
I can’t seem to understand how the binarisation is taking place and what method has been employed to trace the lines and characters.
Please mail any relevant information to me.
I would be greatly indebted to you.
The DocumentScanner class handles scanning the document for lines of text and character cells. The binarization of the document is handled by one of several binarization plug-ins. Originally, it just selected a brightness halfway between the brightest and darkest pixels in the document, and thresholded the entire document based on that brightness level. Some of the later contributors/maintainers added localized binarization algorithms which work much better, especially with unevenly lighted documents. I recommend poking through the source code and the JavaDocs for more information. Since this is a library, the majority of the documentation is going to be in the JavaDocs.
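The original global thresholding described here is simple enough to sketch. The class name is hypothetical and the code assumes 8-bit grayscale values (0..255); it illustrates the idea only and is not taken from any of the binarization plug-ins.

```java
// Hypothetical sketch of the original midpoint thresholding; not the plug-in code.
final class GlobalThreshold {
    /** Marks each pixel as dark (true) or light (false) using one document-wide threshold. */
    static boolean[] binarize(int[] gray) {
        int min = 255;
        int max = 0;
        for (int v : gray) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Threshold halfway between the darkest and brightest pixels.
        int threshold = (min + max) / 2;
        boolean[] dark = new boolean[gray.length];
        for (int i = 0; i < gray.length; i++) {
            dark[i] = gray[i] < threshold;
        }
        return dark;
    }
}
```

Because a single threshold is used for the whole page, a shadow across one corner can push that region entirely to one side of it, which is the weakness the localized algorithms address.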
Hi, where can I find the getting started tutorial or documentation for this?
When the JavaOCR library is built, it should create HTML JavaDoc for you. Also, look for a file named OCRScannerDemo.java. That is a sample application which scans images for text. Be sure to read the other comments in this article. This subject has been covered multiple times.
Hi, can this OCR recognize mathematical formulas?
Should be able to, if you train it with the correct characters.
I already tested the OCRDemo and ran it successfully. Thanks for that! I just have one question: how can we set/get the new CharacterRange based on the new training image? Can you give an example or point it out in one of your classes?
I already found the solution above. I just have another question: can you guide me on how to create our own training images? I actually created one in MS Word, selecting a font family like Arial, with the text from ! to ~ without spaces, but I encountered an error at lowercase ‘p’:
trainingImageLoader.processChar: ‘p’ 691,61-706,67
java.io.IOException: Expected to decode 94 characters but actually decoded 80 characters in training: C:\Users\jim.filbert.a.vano\Downloads\JavaApps\tools\C-3PO//src//com//jim//ocr//trainingimages//\arial.jpg
Can you give an example or procedure for creating our own training image?
You need enough whitespace between characters for the image decoder to be able to clearly distinguish the character cells. There are also some tuneable parameters in the DocumentScanner class which cover this. You’ll need to dig into the source code for those.
Same as with the existing ones. See OCRScannerDemo.java.
Downloaded javaOcr from
But in this I didn’t find
compile – compiles the Java files into class files in the classes directory
createJars – creates ocr.jar from the compiled classes
ocrscannerdemo – demonstrates OCR functionality using any of several test images and corresponding training images
They’ve re-worked it all to use Maven. There should be an OCRScannerDemo.java or OcrScannerDemo.java (I forget the case) somewhere in the code. I believe they build a jar file which allows you to test everything. If it’s not documented already in a README.md, go to the project page and ask them to write one up. It should include everything you need to do to go from downloaded JavaOCR source code, all the way through setting up Maven and any other tools you need, to running the command-line and GUI demos. You can create an issue for this if needed.
I know Konstantin has been very busy with work lately, so you may have to write up the README.md yourself. He will respond to an email though, so you can ask him what the steps are, and he will email you back, and you can write up the document and submit it to either him or me, and we’ll make sure it gets added.
I have been playing with JavaOCR and the issue I keep running into is that it keeps failing on certain fonts during the training phase. The fonts I have had trouble with so far are “Times New Roman” and “Book Antiqua”. I get an exception like: Expected to decode 94 characters but actually decoded 104 characters. It appears to have trouble with certain characters that are wider than others, like “M” and “W”. I have tried to vary character spacing but I cannot get past these issues. I have had luck with “Arial” and “Verdana”. Has anyone else run into issues like this?
Regards – Luke
This is a common issue, or at least it was with the original version of JavaOCR. One easy fix is to increase the whitespace to the left and right of the problem characters in the training image(s). You can also play around with the white threshold. It used to be a setting in the DocumentScanner class, but they’ve since re-factored the code and added a lot of different binarization algorithms. Look for “binarization”, “binarize”, and the British spelling (replacing “z” with “s”). The same binarizer which is used for documents should also be used for training images. If you get the white threshold right, it should work with nearly 100% reliability. The Sauvola algorithm is one of the good ones, IIRC, as it adapts the white threshold automatically to the local mean brightness, which can vary in different parts of the training image or document.
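For reference, the published Sauvola formula sets the threshold for each pixel from the mean m and standard deviation s of its neighborhood: T = m * (1 + k * (s / R - 1)). The sketch below applies that formula to one window of pixels. The class name is hypothetical, k and R are tuning parameters (values such as k = 0.5 and R = 128 are commonly quoted for 8-bit images), and this is not necessarily how the JavaOCR binarizer is written.

```java
// Hypothetical sketch of the Sauvola threshold formula; not the JavaOCR binarizer.
final class SauvolaSketch {
    /**
     * Threshold for the pixel at the center of the given window.
     *
     * @param window grayscale values (0..255) of the pixel's neighborhood
     * @param k      sensitivity parameter
     * @param r      dynamic range of the standard deviation
     */
    static double threshold(int[] window, double k, double r) {
        double sum = 0;
        double sumSq = 0;
        for (int v : window) {
            sum += v;
            sumSq += (double) v * v;
        }
        double mean = sum / window.length;
        // Clamp to zero to guard against tiny negative values from rounding.
        double variance = Math.max(0.0, sumSq / window.length - mean * mean);
        double stdDev = Math.sqrt(variance);
        return mean * (1.0 + k * (stdDev / r - 1.0));
    }

    /** True if the pixel should be treated as text (dark) within its neighborhood. */
    static boolean isDark(int pixel, int[] window, double k, double r) {
        return pixel < threshold(window, k, r);
    }
}
```

In a flat region the standard deviation is near zero, so the threshold drops well below the local mean and faint noise stays white; near character edges the deviation rises and the threshold moves up toward the mean.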
Good job, it would be great if the demos and documentation were easy to find. Your work has been very useful to me, thanks!