Extract PDF Data With OCR

Description

This activity reads the contents of a PDF file, including headers, and extracts the data. It identifies the fields, codes them, and groups them with the data in each field.

Properties

Input

  • Criteria – OCR queries to extract the data from PDF pages.
  • From Page Number – Set the page extraction mode to "Range" and specify the page number at which to start the extraction.
  • Image Resize Percentage – Enter the percentage value to rescale an image.
  • OCR Engine – OCR engine instance returned by the activity Create Tesseract OCR Engine. The Tesseract OCR engine uses language-specific training data to recognize words. It favors words and phrases that often appear together in the specified language, much as a human reader does, so the accuracy of its results depends on that training data.
  • Page Extraction Mode – Set the page extraction mode to "All," "Single," or "Range" to choose which pages are extracted.
  • Password – The password of the PDF file, if necessary.
  • PDF File Path – Specify the path of the PDF file whose pages are exported as images.
  • Retain Temp Images – Specifies whether to keep the exported images in the staging folder or delete them after the text extraction.
  • Single Page Number – Set the page extraction mode to "Single" and specify the page number from which to extract text.
  • Staging Folder – Specifies the path of the folder that holds the exported images.
  • To Page Number – Set the page extraction mode to "Range" and specify the last page from which to extract text.
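
The interaction between Page Extraction Mode and the three page-number properties can be sketched as follows. This is only an illustration in Python, not the activity's implementation: the function name `resolve_pages` and its parameters are hypothetical, and it assumes page numbers are 1-based and that the "Range" mode includes both the From and To pages.

```python
from typing import List, Optional


def resolve_pages(
    mode: str,
    total_pages: int,
    single_page: Optional[int] = None,
    from_page: Optional[int] = None,
    to_page: Optional[int] = None,
) -> List[int]:
    """Return the page numbers selected by the extraction mode.

    Hypothetical helper: assumes 1-based page numbers and an inclusive range.
    """
    normalized = mode.strip().lower()
    if normalized == "all":
        return list(range(1, total_pages + 1))
    if normalized == "single":
        if single_page is None or not 1 <= single_page <= total_pages:
            raise ValueError("Single Page Number must lie within the document")
        return [single_page]
    if normalized == "range":
        if from_page is None or to_page is None:
            raise ValueError("Range mode needs both From and To Page Number")
        if not 1 <= from_page <= to_page <= total_pages:
            raise ValueError("From/To Page Number must form a valid range")
        return list(range(from_page, to_page + 1))
    raise ValueError(f"Unknown page extraction mode: {mode!r}")


print(resolve_pages("Range", 10, from_page=2, to_page=4))  # [2, 3, 4]
```

In this sketch each page-number property is only consulted in the mode that names it, which mirrors how the property descriptions above pair "Single" with Single Page Number and "Range" with From and To Page Number.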

Misc

  • DisplayName – Add a display name to your activity.
  • Private – By default, the activity logs the values of its properties inside your workflow. If Private is selected, this logging stops.

Optional

  • Continue On Error – Specifies if the automation should continue even when the activity throws an error. This field only supports Boolean values (True, False). The default value is False.

    Note: If this activity is placed inside a Try Catch and the value of this property is True, no error is caught when the project is executed.
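
The behavior in the note can be illustrated with a small Python analogy. It is a sketch of the general idea only, not the product's code: `run_activity` and `failing_action` are hypothetical names, and the assumption is that Continue On Error suppresses the error inside the activity so that nothing reaches the surrounding handler.

```python
def run_activity(action, continue_on_error=False):
    """Run an action the way an activity with Continue On Error might (assumed)."""
    try:
        return action()
    except Exception:
        if continue_on_error:
            return None  # error suppressed here; the workflow carries on
        raise  # default (False): the error propagates to the caller


def failing_action():
    raise RuntimeError("OCR failed")


caught = []
try:
    # Analogous to placing the activity inside a Try Catch.
    run_activity(failing_action, continue_on_error=True)
except RuntimeError as exc:
    caught.append(exc)

print(len(caught))  # 0, because the error never reached the outer handler
```

With `continue_on_error=False` the same call would raise `RuntimeError` and the outer handler would record it.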

Output

  • Result – The list of pages and their corresponding metadata, extracted and returned by this activity.

Example
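
As a rough illustration of how the properties fit together, the Python sketch below exports each selected page to the staging folder, runs OCR on the image, and then keeps or deletes the image according to Retain Temp Images. Everything here is an assumption made for illustration: `PageResult`, `extract_pages`, `fake_export`, and `fake_ocr` are hypothetical names, the stand-in functions replace real PDF rendering and the Tesseract engine, and the actual shape of the Result output is not described in this page.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass
class PageResult:
    """Assumed result shape: one entry per extracted page."""
    page_number: int
    text: str


def extract_pages(
    page_numbers: Iterable[int],
    export_page: Callable[[int, Path], None],
    run_ocr: Callable[[Path], str],
    staging_folder: str,
    retain_temp_images: bool = False,
) -> List[PageResult]:
    """Export each page as an image, OCR it, and clean up if requested."""
    staging = Path(staging_folder)
    staging.mkdir(parents=True, exist_ok=True)
    results = []
    for number in page_numbers:
        image_path = staging / f"page_{number}.png"
        export_page(number, image_path)  # write the page image to staging
        results.append(PageResult(number, run_ocr(image_path)))
        if not retain_temp_images:
            image_path.unlink()  # delete the temporary image after OCR
    return results


# Stand-ins for PDF rendering and the OCR engine (illustration only).
def fake_export(page_number: int, image_path: Path) -> None:
    image_path.write_bytes(b"placeholder image bytes")


def fake_ocr(image_path: Path) -> str:
    return f"text recognized from {image_path.name}"


with tempfile.TemporaryDirectory() as staging:
    results = extract_pages([2, 3], fake_export, fake_ocr, staging)
    leftover = sorted(p.name for p in Path(staging).iterdir())

for page in results:
    print(page.page_number, page.text)
```

Passing `retain_temp_images=True` in this sketch would leave `page_2.png` and `page_3.png` in the staging folder for inspection.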