
Overview

Organizational processes often depend on several types of documents, such as invoices, purchase orders, bills, and forms. Automating these processes therefore requires automating the processing of these documents as well. IntelliBuddies provides OCR activities to support document processing automation and can work with multiple OCR Engines. The activities in this category are listed below.

Activity | Description
Create Google Cloud OCR Engine | Creates a handle to the Google Cloud OCR Engine so that you can use this Engine to extract text from images
Create MODI OCR Engine | Creates a handle to the MODI (Microsoft Office Document Imaging) OCR Engine so that you can use this Engine to extract text from images
Create Tesseract OCR Engine | Creates a handle to the Tesseract OCR Engine so that you can use this Engine to extract text from images
Extract Text with OCR | Extracts the text from the specified image using the specified OCR Engine handle and returns the extracted text
Find OCR Text Position | Searches for the specified text in the specified image using the specified OCR Engine handle. Returns true along with the position of the text if it was found inside the image, otherwise false
Find OCR Closest Text Position | Similar to Find OCR Text Position, except that it uses fuzzy text matching to find the closest match, with a match tolerance that controls how closely the text in the image must match the search string
Extract PDF Text With OCR | Extracts the text from the specified PDF file using the specified OCR Engine handle. You can specify the page range for text extraction. By default, it extracts the text from the entire document
Identify Document Using OCR | Identifies the type of the specified document based on the specified document processing model
Extract PDF Data With OCR | Extracts the data from the specified PDF document based on the specified document processing model
Validate OCR Data | Opens the Validator UI to show the data extracted by Extract PDF Data With OCR. You can validate the extracted data and correct it before proceeding to the next step in the process
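
To make the difference between exact and closest-match search more concrete, here is a minimal Python sketch of fuzzy matching with a tolerance. It is not how IntelliBuddies implements Find OCR Closest Text Position; the function name `find_closest_text`, the `tolerance` parameter, and the similarity measure (Python's `difflib.SequenceMatcher`) are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher

def find_closest_text(words, search, tolerance=0.8):
    """Return (index, word, score) for the best match at or above tolerance, else None.

    `words` stands in for the tokens an OCR engine recognised in an image.
    This helper is hypothetical and only illustrates the idea of a match tolerance.
    """
    best = None
    for index, word in enumerate(words):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, word.lower(), search.lower()).ratio()
        if score >= tolerance and (best is None or score > best[2]):
            best = (index, word, score)
    return best

# OCR output often confuses similar glyphs, for example "l" for "I".
ocr_words = ["lnvoice", "Number", "10234"]

match = find_closest_text(ocr_words, "Invoice", tolerance=0.8)
strict = find_closest_text(ocr_words, "Invoice", tolerance=1.0)

print(match)   # the misread word is still found at index 0
print(strict)  # None, because no word matches exactly
```

A lower tolerance accepts more recognition errors but also increases the chance of matching unrelated text, so the value usually needs tuning per document type.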

OCR Engine

IntelliBuddies allows you to work with different OCR Engines. How well text can be extracted from an image depends on the capabilities of the Engine you choose. Currently, IntelliBuddies supports the following OCR Engines:

  • Tesseract OCR Engine
  • Google Cloud OCR Engine
  • MODI (Microsoft Office Document Imaging) OCR Engine

Because the other activities in this category can work with any supported OCR Engine, you first create the appropriate OCR Engine, obtain a handle to it, and then pass that handle to the other activities. IntelliBuddies exposes the handle to each of these OCR Engines through the IOCREngine type.
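
The handle pattern described above can be sketched in Python as follows. This is only an analogy: IntelliBuddies activities are configured in the designer rather than written as code, and every name here (`OCREngine`, `StubTesseractEngine`, `StubCloudEngine`, `extract_text`) is hypothetical. The stub engines return fixed strings instead of running real OCR.

```python
from typing import Protocol

class OCREngine(Protocol):
    """Hypothetical stand-in for the role the IOCREngine type plays."""
    def recognize(self, image: bytes) -> str: ...

class StubTesseractEngine:
    # Stub only: a real engine would run OCR on the image bytes.
    def recognize(self, image: bytes) -> str:
        return "stub tesseract text"

class StubCloudEngine:
    # Stub only: a real engine would call a cloud OCR service.
    def recognize(self, image: bytes) -> str:
        return "stub cloud text"

def extract_text(engine: OCREngine, image: bytes) -> str:
    """An 'activity' that depends only on the handle, not on a concrete engine."""
    return engine.recognize(image)

# Step 1: create an engine and keep the handle.
engine = StubTesseractEngine()
# Step 2: pass the handle to any activity that needs OCR.
print(extract_text(engine, b""))             # stub tesseract text
print(extract_text(StubCloudEngine(), b""))  # stub cloud text
```

The point of the design is that swapping the Engine requires changing only the creation step; the activities that consume the handle stay the same.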

OCR Parameters

You can control how an OCR Engine extracts text by using OCR Parameters. Every OCR activity that performs text extraction can be configured with OCR Parameters to tune the OCR Engine. You set these parameters through the OCR Parameters dialog, which opens when you click the [...] button in the Properties panel.

The table below describes each parameter.

Parameter | Description
Value Type | The type of value being extracted. IntelliBuddies supports certain pre-defined value types, listed in the drop-down.
Value Format | Regular expression pattern used to validate the format of the extracted value. An error is reported if the value does not match the expected pattern.
Whitelist Chars | The characters that may appear in the extracted value. This is set based on the selected Value Type. You can control this character set for the Custom value type.
Blacklist Chars | The characters that should not appear in the extracted value. This is set based on the selected Value Type. You can control this character set for the Custom value type.
Page Segmentation Mode (psm) | Defines how the OCR Engine should treat the text in the image. For example, if your image contains a single character or a block of text, specify the corresponding psm to improve accuracy. By default, the psm is set to Sparse Text.
Preserve Inter Word Spaces | When this flag is checked, the text is extracted with its spaces preserved as they appear in the original image. By default, this is unchecked, and the spaces in the extracted text are trimmed.
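
As a rough illustration of how these parameters interact, the Python sketch below applies a whitelist, a blacklist, space handling, and a Value Format check to an already-recognised string. This is an assumption-laden approximation: a real OCR Engine typically applies character sets during recognition rather than afterwards, and the `OcrParameters` class and `apply_parameters` function are hypothetical names, not part of IntelliBuddies.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class OcrParameters:
    """Hypothetical container mirroring some rows of the table above."""
    value_format: Optional[str] = None     # regular expression the value must match
    whitelist_chars: Optional[str] = None  # None means "allow everything"
    blacklist_chars: str = ""
    preserve_interword_spaces: bool = False

def apply_parameters(raw_text: str, params: OcrParameters) -> str:
    text = raw_text
    if params.whitelist_chars is not None:
        # Keep only whitelisted characters (spaces are kept so words stay separated).
        allowed = set(params.whitelist_chars) | {" "}
        text = "".join(ch for ch in text if ch in allowed)
    # Drop blacklisted characters.
    text = "".join(ch for ch in text if ch not in params.blacklist_chars)
    if not params.preserve_interword_spaces:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    if params.value_format and not re.fullmatch(params.value_format, text):
        raise ValueError(f"value {text!r} does not match format {params.value_format!r}")
    return text

date_params = OcrParameters(
    value_format=r"\d{2}/\d{2}/\d{4}",
    whitelist_chars="0123456789/",
)
print(apply_parameters("  12/03/2021 ", date_params))  # 12/03/2021

spaced = OcrParameters(preserve_interword_spaces=True)
print(repr(apply_parameters("Total:   42", spaced)))           # 'Total:   42'
print(repr(apply_parameters("Total:   42", OcrParameters())))  # 'Total: 42'
```

Passing `"12-03-2021"` with `date_params` raises `ValueError`, because the hyphens are removed by the whitelist and the remaining digits no longer match the expected format.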

Clipping Region

By default, activities such as Extract Text With OCR extract text from the entire image. To extract text from only part of the image, set the optional Clipping Region property of the corresponding activity. The Clipping Region is of type BoundingRect, which denotes a rectangular region inside the image in pixel coordinates.
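
The following Python sketch shows what clipping by a pixel rectangle amounts to. The field names of `BoundingRect` (`left`, `top`, `width`, `height`) are an assumption for illustration; the actual IntelliBuddies type may expose different properties. The image is modelled as a plain list of pixel rows so the example has no dependencies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingRect:
    # Assumed fields: top-left corner plus size, all in pixels.
    left: int
    top: int
    width: int
    height: int

def clip(image, rect):
    """Return the sub-image covered by rect. `image` is a list of rows of pixel values."""
    rows = image[rect.top:rect.top + rect.height]
    return [row[rect.left:rect.left + rect.width] for row in rows]

# A 6x4 test image whose pixel value encodes its position: value = x + 10 * y.
image = [[x + 10 * y for x in range(6)] for y in range(4)]

region = clip(image, BoundingRect(left=2, top=1, width=3, height=2))
print(region)  # [[12, 13, 14], [22, 23, 24]]
```

Restricting OCR to a small region generally reduces both processing time and the amount of unrelated text that has to be filtered out afterwards.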

TextResult

All OCR extraction activities return their result as a TextResult. A TextResult encapsulates the extracted text along with other information related to the extraction. You can read the extracted text from the Text property of the TextResult object.

Document Processing

IntelliBuddies supports document processing automation through its Trainer Buddy component. You can use Trainer Buddy to create document models, which Buddies later use to identify documents and extract data from them.

DocumentQueries

You can train Trainer Buddy with multiple document templates. Trainer Buddy exports the trained document model in the form of DocumentQueries. DocumentQueries is a list of DocumentQuery objects, where each DocumentQuery holds the training information for one of the document templates used in the training.

When DocumentQueries is passed to the Identify Document With OCR activity, the activity checks the specified input document against all the trained document templates embedded in the specified DocumentQueries. If the document matches any of the trained templates, the activity returns the corresponding DocumentQuery model.
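
The identification step can be pictured with the Python sketch below. It is a deliberately simplified model: the source does not describe what a trained DocumentQuery contains or how matching works, so the `name` and `keywords` fields and the "all keywords present" rule are assumptions made only to show the list-of-templates lookup.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocumentQuery:
    # Hypothetical fields; a real trained template holds richer information.
    name: str
    keywords: List[str]
    page_queries: list = field(default_factory=list)

def identify_document(text: str, document_queries: List[DocumentQuery]) -> Optional[DocumentQuery]:
    """Return the first template whose keywords all occur in the text, else None."""
    lowered = text.lower()
    for query in document_queries:
        if all(keyword.lower() in lowered for keyword in query.keywords):
            return query
    return None

document_queries = [
    DocumentQuery(name="Invoice", keywords=["invoice", "amount due"]),
    DocumentQuery(name="Purchase Order", keywords=["purchase order"]),
]

found = identify_document("PURCHASE ORDER No. 77\nShip to: ...", document_queries)
print(found.name)  # Purchase Order

unknown = identify_document("Meeting notes", document_queries)
print(unknown)     # None
```

A caller would normally check for the "no match" case before moving on to data extraction, since an unidentified document has no page-level model to apply.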

PageQueries

A document might have one or more pages, so it is important to identify each page of interest and extract the corresponding data from it. When training documents with Trainer Buddy, you can train at the page level to specify how to identify individual pages and what data to extract from each of them. The DocumentQuery stores this page-level training in its PageQueries property.

You pass this PageQueries as the input criteria to the Extract PDF Data With OCR activity. The activity then checks the individual pages of the specified input document against the PageQueries to identify the pages that need to be processed.

PageInfo

Extract PDF Data With OCR returns a list of PageInfo. Each PageInfo contains the data extracted from one page along with the metadata for that page. The list contains one PageInfo for each page the activity selected for processing based on the PageQueries model.
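
To show how page-level queries lead to a list of per-page results, here is a minimal Python sketch. The shapes of `PageQuery` and `PageInfo` are assumptions (the source does not list their properties), pages are modelled as already-recognised text, and the matching rule (an identifying phrase plus one regular expression per field) is purely illustrative.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PageQuery:
    # Hypothetical: a phrase that identifies the page, plus field name -> regex
    # with one capture group describing what to extract.
    identifier: str
    fields: Dict[str, str]

@dataclass
class PageInfo:
    # Hypothetical: extracted data plus minimal metadata (the page number).
    page_number: int
    data: Dict[str, str]

def extract_pdf_data(pages: List[str], page_queries: List[PageQuery]) -> List[PageInfo]:
    """Return one PageInfo per page that matches a page query; other pages are skipped."""
    results = []
    for number, text in enumerate(pages, start=1):
        for query in page_queries:
            if query.identifier in text:
                data = {}
                for name, pattern in query.fields.items():
                    found = re.search(pattern, text)
                    if found:
                        data[name] = found.group(1)
                results.append(PageInfo(page_number=number, data=data))
                break  # a page is processed by the first query that matches it
    return results

pages = [
    "Cover letter",
    "Invoice Summary\nTotal: 120.50",
    "Terms and conditions",
]
page_queries = [PageQuery("Invoice Summary", {"total": r"Total:\s*([\d.]+)"})]

infos = extract_pdf_data(pages, page_queries)
print(infos)  # [PageInfo(page_number=2, data={'total': '120.50'})]
```

Only the second page matches, so the result contains a single entry even though the document has three pages.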