Tesseract OCR and Cross Compiling on iOS

Tutorial By 6 years ago

Previously I wrote about using CGPDF to display PDF document pages the way you would display images. Now we are going to get into something that goes beyond what most PDF apps on iOS do: OCR. There are a few options when it comes to OCR libraries, but my research indicates that Tesseract is the weapon of choice here. It was developed between 1985 and 1995 by HP Labs, and was one of the top 3 engines in the 1995 UNLV Accuracy Test. Google has since taken it over, and it now lives as open source on Google Code. It is written in C++, which is good for iOS developers because it means that with just a bit of tweaking we can integrate it seamlessly into Objective-C code.
However, before we can get coding with the Tesseract API, we must first compile it for the ARM architecture to make it compatible with iOS. Because these open source libraries change constantly, tutorials on how to cross compile them can quickly become out of date (and this tutorial is no exception). I followed a Tesseract iOS Cross Compiling Tutorial for quite some time in frustration, and kept getting errors from its build script. I eventually solved the problem with a slightly edited build script that you can run from the shell:

#!/bin/bash
# build_fat.sh
#
# Created by Robert Carlsen on 15.07.2009. Updated 24.9.2010
# build an arm / i386 lib of standard linux project
#
# initially configured for tesseract-ocr v2.0.4
# updated for tesseract prerelease v3
# Edited by will@b2cloud.com.au on 5.07.2011

outdir=outdir
mkdir -p $outdir/arm $outdir/i386

libdirs=( api ccutil ccmain ccstruct classify cutil dict image textord training viewer wordrec )
libs=( api ccutil main ccstruct classify cutil dict image textord training viewer wordrec )
count=${#libdirs[@]}

make distclean
unset CPPFLAGS CFLAGS LDFLAGS CPP CXX CC CXXFLAGS DEVROOT SDKROOT LD

export DEVROOT=/Developer/Platforms/iPhoneOS.platform/Developer
export SDKROOT=$DEVROOT/SDKs/iPhoneOS4.3.sdk
export CFLAGS="-arch armv7 -pipe -no-cpp-precomp -isysroot $SDKROOT -miphoneos-version-min=4.0 -I$SDKROOT/usr/include/"
export CPPFLAGS="$CFLAGS"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-L$SDKROOT/usr/lib/"
export LD="$DEVROOT/usr/bin/ld"
export CPP="$DEVROOT/usr/bin/cpp-4.2"
export CXX="$DEVROOT/usr/bin/g++-4.2"
export CC="$DEVROOT/usr/bin/gcc-4.2"
./configure --host=arm-apple-darwin
make -j3

index=0
while [ "$index" -lt "$count" ]
do
    cp ${libdirs[index]}/.libs/libtesseract_${libs[index]}.a $outdir/arm/libtesseract_${libs[index]}_armv7.a
    ((index++))
done

make distclean
unset CPPFLAGS CFLAGS LDFLAGS CPP CXX CC CXXFLAGS DEVROOT SDKROOT LD

This gave me the libraries I needed to run Tesseract on iOS devices (I stress: devices only, not the simulator, because the script builds for armv7 only). The Tesseract SVN repository also comes with pre-trained language files, which makes the experience much less painful than it would be without them. The files have the extensions .DangAmbigs, .freq-dawg, .inttemp, .normproto, .pffmtable, .traineddata, .unicharset, .user-words, .word-dawg. I chose to import only English (eng), but you can import as many languages as you want.
Once these files are in the resources directory, the next step is to import the header files. I simply copied all the header files from the tesseract directory into my project’s directory; using baseapi.h as a starting point, it compiled fine. It’s worth noting that Xcode compiles a source file as mixed Objective-C/C++ only if its extension is .mm rather than .m, so simply changing the extension may solve problems you run into down the track.
Given the .m extension’s inability to cope with C++ code, we have to write our header with some care. I used to simply give all C++ objects a void* type and typecast them when I used them. However, I found a project called Pocket OCR that does this a lot more elegantly, by detecting whether the compiler expects C++ and changing the meaning of TessBaseAPI to suit. The header file now becomes:

#import <UIKit/UIKit.h>

// We will present different types for TessBaseAPI depending on whether the compiler expects C++ or Objective-C
#ifdef __cplusplus
#include "baseapi.h"
using namespace tesseract;
#else
@class TessBaseAPI;
#endif

@interface OCR : NSObject
{
	TessBaseAPI* tess;
}

-(NSString*) textFromImage:(UIImage*)image;

@end
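
What makes this header work is that a pointer to a type can be declared before the type is fully defined. The following is a minimal sketch of the same idea in plain C++, not the Pocket OCR code: Engine and OcrWrapper are hypothetical names, with Engine standing in for a heavy class such as TessBaseAPI.

```cpp
#include <string>

// Hypothetical stand-in for a heavy C++ engine class.
// Only this forward declaration would need to appear in the header.
class Engine;

// Wrapper that clients can use while seeing only the forward declaration.
class OcrWrapper {
public:
    OcrWrapper();
    ~OcrWrapper();
    OcrWrapper(const OcrWrapper&) = delete;
    OcrWrapper& operator=(const OcrWrapper&) = delete;

    std::string describe() const;

private:
    Engine* engine;  // typed pointer to an incomplete type, no void* cast needed
};

// Everything below would normally live in the implementation file,
// which is the only place that needs the full definition of Engine.
class Engine {
public:
    std::string name() const { return "engine"; }
};

OcrWrapper::OcrWrapper() : engine(new Engine()) {}
OcrWrapper::~OcrWrapper() { delete engine; }
std::string OcrWrapper::describe() const { return "wrapping " + engine->name(); }
```

Compared with the void* approach, the pointer keeps its real type, so the implementation file can call methods on it without casting.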

By checking the __cplusplus definition we can determine whether the compiler is expecting C++ or Objective-C, and then either take the type of TessBaseAPI from baseapi.h or declare it as a new Objective-C @class. Now that we have the header file, we have to write the code to run Tesseract on a given UIImage. First we have to copy our language files from the resources directory into the Documents directory. To do this I used the +initialize method, which the runtime calls once, before the class receives its first message:

// This method gets called when our class is first called
+(void) initialize
{
	// Copy the training data into the documents directory
	NSArray* trainingDataSuffix = [NSArray arrayWithObjects:@"DangAmbigs",@"freq-dawg",@"inttemp",@"normproto",@"pffmtable",@"traineddata",@"unicharset",@"user-words",@"word-dawg",nil];

	// Get the path to the resource files
	NSString* bundlePath = [[NSBundle mainBundle] bundlePath];
	// Hold a potential error
	NSError* error = nil;
	// Get the contents of the resource directory
	NSArray* dirListing = [[NSFileManager defaultManager] contentsOfDirectoryAtPath:bundlePath error:&error];
	// Boolean to determine whether we have already created a directory or not
	BOOL createdDirectory = NO;
	// The path to the documents directory when appended with the tessdata folder
	NSString* documentsDirectory = [[App getHiddenDocumentPath:@""] stringByAppendingPathComponent:@"tessdata"];
	// Loop the resource files
	for(NSString* file in dirListing)
	{
		// Loop the possible extensions we are looking for
		for(NSString* extension in trainingDataSuffix)
		{
			// Check if the extension is one of these extensions we have been looking for
			if([[file pathExtension] isEqualToString:extension])
			{
				// Check if we have created the directory
				if(!createdDirectory)
				{
					// Create the directory
					[[NSFileManager defaultManager] createDirectoryAtPath:documentsDirectory withIntermediateDirectories:YES attributes:nil error:&error];
					// If we have an error tell us what it is
					if(error != nil)
					{
						NSLog(@"Error: %@",error);
						error = nil;
					}
					// If not, tell the loop we have created the directory so we don't have to do it again
					else createdDirectory = YES;
				}
				// Get the path of the file in the tessdata directory
				NSString* fileInDocumentsDir = [documentsDirectory stringByAppendingPathComponent:[file lastPathComponent]];
				// Check if the file already exists
				if(![[NSFileManager defaultManager] fileExistsAtPath:fileInDocumentsDir])
				{
					// If not, copy the file to the tessdata directory
					[[NSFileManager defaultManager] copyItemAtPath:[bundlePath stringByAppendingPathComponent:file] toPath:fileInDocumentsDir error:&error];
					// If we have an error tell us what it is
					if(error != nil)
					{
						NSLog(@"Error: %@",error);
						error = nil;
					}
				}
				// We have found a valid extension, it's unlikely we'll find another so break the loop
				break;
			}
		}
	}

	// set the environment variable TESSDATA_PREFIX to the path before the tessdata folder, in this case it's the documents directory
	setenv("TESSDATA_PREFIX",[[App getHiddenDocumentPath:@""] UTF8String],1);
}

This basically searches for files that have the telltale suffixes and copies them across to the Documents directory. You’ll notice that App is not defined; this is a class I use in most of my apps to do mundane things like return the hidden document path, which would usually take about 5 lines of code.
Now that we have the files in a place where they are neatly laid out for Tesseract, we can use the init method to set up the actual Tesseract object:

// Override NSObject's init method
-(id) init
{
	// Call init on the NSObject parent object and keep the result
	self = [super init];
	if(self != nil)
	{
		// Create a new Tesseract object
		tess = new TessBaseAPI();
		// Initialise the Tesseract object with English as its language
		tess->Init([[[App getHiddenDocumentPath:@""] stringByAppendingPathComponent:@"Tesseract"] UTF8String],"eng");
	}

	// Return ourselves to any entity who called init
	return self;
}

This allocates the memory for TessBaseAPI (Tesseract) and tells it to initialise itself with the English language. This is where you can load more languages if you desire; however, keep in mind that the more languages you load, the slower both the initialisation and the run time of the OCR engine will get. Being good memory citizens, we should also write a dealloc method to clean up any allocations we have made:

// Override NSObject's dealloc method
-(void) dealloc
{
	// If we have a Tesseract object in memory
	if(tess != NULL)
	{
		// End it (this releases the data Tesseract holds internally)
		tess->End();
		// Delete the object itself, since we allocated it with new
		delete tess;
		// Make it NULL so we don't call it again
		tess = NULL;
	}
	// Call dealloc on the NSObject parent object
	[super dealloc];
}

Finally we have to write the method we came for, textFromImage. This method uses Core Graphics to create a grayscale image from the one we provide, giving us an array of bytes that correspond to each pixel’s gray level.

// Method to obtain the text (in NSString format) from an image (in UIImage format)
-(NSString*) textFromImage:(UIImage*)image
{
	// Check if the image is in memory
	if(image != nil)
	{
		// Get the UIImage's CGImage (allowing us to do more to the image)
		CGImageRef imageRef = image.CGImage;

		// Define the number of bits per pixel (8 bits to 1 byte)
		size_t bitsPerPixel = 8;
		// Define the number of bits per component (1 component, 1 byte so 8 bits)
		size_t bitsPerComponent = 8;
		// Define the number of bytes per pixel (it's done using divisions but it's really just 1)
		size_t bytesPerPixel = bitsPerPixel / bitsPerComponent;
		// Save the width of the CGImage (in pixels)
		size_t width = CGImageGetWidth(imageRef);
		// Save the height of the CGImage (in pixels)
		size_t height = CGImageGetHeight(imageRef);
		// Save the bytes per row (which is just the width multiplied by the bytes per pixel)
		size_t bytesPerRow = width * bytesPerPixel;
		// The total buffer length is the bytes per row multiplied by the height of the image
		// It's helpful to think of the pixel buffer as a grid
		size_t bufferLength = bytesPerRow * height;

		// Allocate the memory for the pixels (one byte per pixel)
		unsigned char* pixelData = (unsigned char*) malloc(bufferLength);
		if(pixelData == NULL)
			return nil;

		// Create a gray colour space (gray is all we need for OCR)
		CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceGray();
		// Create a bitmap context that renders into our memory in grayscale
		CGContextRef context = CGBitmapContextCreate(pixelData,width,height,bitsPerComponent,bytesPerRow,colorSpace,kCGImageAlphaNone);
		// Release the gray colour space because we don't need it anymore
		CGColorSpaceRelease(colorSpace);
		if(context == NULL)
		{
			free(pixelData);
			return nil;
		}

		// Define a CGRect to draw our image into
		CGRect rect = CGRectMake(0.0,0.0,width,height);
		// Draw the gray image into the context using the rect as a space constraint
		CGContextDrawImage(context,rect,imageRef);
		// The pixels are now in pixelData, so the context can go
		CGContextRelease(context);

		// Run tesseract on the pixel data, using the pixel dimensions of the buffer
		// (image.size is measured in points and may not match the buffer)
		char* text = tess->TesseractRect(pixelData,(int)bytesPerPixel,(int)bytesPerRow,0,0,(int)width,(int)height);
		// Convert the UTF8 text to an NSString (nil if Tesseract returned nothing)
		NSString* string = (text != NULL) ? [NSString stringWithUTF8String:text] : nil;

		// Deallocate the UTF8 text (Tesseract allocates it with new[])
		delete [] text;
		// Deallocate the memory holding the pixel data (we allocated it with malloc)
		free(pixelData);
		// Return our string
		return string;
	}
	// Return nothing (nothing comes from nothing)
	return nil;
}
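
To make the buffer layout concrete, here is a small plain C++ sketch of what an 8-bit grayscale buffer looks like relative to a colour one. In the method above Core Graphics does the conversion for us; the toGray function below is hypothetical, and its luma weights are a common approximation that I am assuming, not necessarily what Core Graphics uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts a tightly packed RGBA buffer (4 bytes per pixel) into an 8-bit
// grayscale buffer (1 byte per pixel, so bytes per row equals the width).
// Returns an empty vector if the input is too small for the given size.
std::vector<std::uint8_t> toGray(const std::vector<std::uint8_t>& rgba,
                                 std::size_t width, std::size_t height) {
    const std::size_t pixelCount = width * height;
    if (rgba.size() < pixelCount * 4) return {};

    std::vector<std::uint8_t> gray(pixelCount);
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const unsigned r = rgba[i * 4 + 0];
        const unsigned g = rgba[i * 4 + 1];
        const unsigned b = rgba[i * 4 + 2];
        // Integer approximation of 0.299 R + 0.587 G + 0.114 B;
        // the weights 77, 150 and 29 sum to 256, so >> 8 rescales to 0..255.
        gray[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
    }
    return gray;
}
```

The resulting vector has the same shape as pixelData in textFromImage: one byte per pixel, rows stored one after another, with no padding between rows.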

Now you will go to run it, and probably be as disappointed as I was. Getting any kind of decent result (on an iPad) takes about 1 minute of processing. The accuracy can be stunning, but only if you feed it a big enough image. In my previous tutorial I talked about the size of the images generated from CGPDF; here it is vitally important that they be large, otherwise all you will get is garbage.
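
As a rough guide to what "large" means, the sketch below works out the pixel size of a rendered PDF page for a chosen resolution, and how many bytes the grayscale buffer for it would need. PDF pages are measured in points at 72 per inch by default. The 300 dpi figure in the usage note is a commonly cited target for OCR input and is my assumption, not a number I measured; renderSize and grayBufferBytes are hypothetical helper names.

```cpp
#include <cmath>
#include <cstddef>

struct PixelSize {
    std::size_t width;
    std::size_t height;
};

// Pixel dimensions needed to render a page measured in PDF points at `dpi`.
PixelSize renderSize(double widthPoints, double heightPoints, double dpi) {
    const double scale = dpi / 72.0;  // default PDF user space: 72 points per inch
    return {static_cast<std::size_t>(std::lround(widthPoints * scale)),
            static_cast<std::size_t>(std::lround(heightPoints * scale))};
}

// Bytes needed for an 8-bit grayscale buffer of that size (1 byte per pixel).
std::size_t grayBufferBytes(const PixelSize& size) {
    return size.width * size.height;
}
```

For a US Letter page (612 by 792 points) at 300 dpi, renderSize gives 2550 by 3300 pixels, which is roughly 8.4 MB of grayscale data, so the processing time is less surprising.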
Having said that, although the project I was working on couldn’t make much use of this, it is a good skill to have under your belt as an iOS developer, as it allows you to offer fringe features that only a few other developers can offer. Something to take into account here is licensing. Tesseract is released under the Apache License, so it is fairly lax about commercial developers using it provided they give it proper credit; however, you should always get legal advice before including any kind of open source library in a commercial product.

  • Vlado Bosnjakovic

    I was trying all day long, to build Tesseract 3.01 for iOS.
    I was using different versions of build_fat.sh.

    I am still stuck with following errors:

    make: *** No rule to make target `distclean'. Stop.
    configure: WARNING: If you wanted to set the --build type, don't use --host.
    If a cross compiler is detected then cross compile mode will be used.
    checking build system type... i686-apple-darwin11.2.0
    checking host system type... arm-apple-darwin
    checking --enable-graphics argument... yes
    checking whether the C++ compiler works... no
    configure: error: in `/Users/vlado/Documents/iOS/Cyril/tesseract-3.00':
    configure: error: C++ compiler cannot create executables
    See `config.log' for more details.
    make: *** No targets specified and no makefile found. Stop.
    cp: api/.libs/libtesseract_api.a: No such file or directory
    cp: ccutil/.libs/libtesseract_ccutil.a: No such file or directory
    cp: ccmain/.libs/libtesseract_main.a: No such file or directory
    cp: ccstruct/.libs/libtesseract_ccstruct.a: No such file or directory
    cp: classify/.libs/libtesseract_classify.a: No such file or directory
    cp: cutil/.libs/libtesseract_cutil.a: No such file or directory
    cp: dict/.libs/libtesseract_dict.a: No such file or directory
    cp: image/.libs/libtesseract_image.a: No such file or directory
    cp: textord/.libs/libtesseract_textord.a: No such file or directory
    cp: training/.libs/libtesseract_training.a: No such file or directory
    cp: viewer/.libs/libtesseract_viewer.a: No such file or directory
    cp: wordrec/.libs/libtesseract_wordrec.a: No such file or directory
    make: *** No rule to make target `distclean'. Stop.

    What am I doing wrong ?

  • Ever found a solution? I’m bumping into that also