Note: You are looking at a static snapshot of documentation related to Robot Framework automations. The most recent documentation is at https://robocorp.com/docs

How to read PDF files with RPA Framework

First, a word of caution ✋

Extracting text from PDF files is not a simple operation. PDF was never meant to be a format to read data from: its purpose is to reproduce documents accurately and make them portable to any system. To complicate matters further, PDF files can be encrypted, their text can actually be "printed" into an image, tables have no standard format, and the order in which paragraphs appear on the page might differ from their order in the underlying code of the PDF file.

If your automation can get the same data from a source in a format other than PDF, we recommend using that instead.

If you have no alternatives, not all hope is lost, however! In this article, we detail possible ways to get to the text contained in a PDF file.

Extracting text from PDF files

The RPA.PDF library includes tools to create and read the contents of PDF files.

Here is an example script that will extract the text in a PDF file and store it in a corresponding text file:

*** Settings ***
Documentation     Read the text contained in PDF files and save it to a
...               corresponding text file.
Library           RPA.PDF
Library           RPA.FileSystem

*** Tasks ***
Extract text from PDF files
    Extract text from PDF file into a text file    simple-text-example.pdf
    Extract text from PDF file into a text file    example-invoice.pdf

*** Keywords ***
Extract text from PDF file into a text file
    [Arguments]    ${pdf_file_name}
    ${text}=    Get Text From Pdf    ${pdf_file_name}
    Create File    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
    FOR    ${page}    IN    @{text.keys()}
        Append To File
        ...    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
        ...    ${text[${page}]}
    END

Robot script explained

*** Settings ***
Documentation     Read the text contained in PDF files and save it to a
...               corresponding text file.
Library           RPA.PDF
Library           RPA.FileSystem

In the *** Settings *** section, we add a description of the robot, and we add the libraries we need:

  • The RPA.PDF library allows us to work with the PDF files.
  • The RPA.FileSystem library will be used to create the text files.

*** Keywords ***
Extract text from PDF file into a text file
    [Arguments]    ${pdf_file_name}
    ${text}=    Get Text From Pdf    ${pdf_file_name}
    Create File    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
    FOR    ${page}    IN    @{text.keys()}
        Append To File
        ...    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
        ...    ${text[${page}]}
    END

Here we define our own keyword Extract text from PDF file into a text file:

  1. [Arguments]    ${pdf_file_name}
     The keyword takes the file name of a PDF file as an argument.
  2. ${text}=    Get Text From Pdf    ${pdf_file_name}
     We extract the text from the PDF file using the Get Text From Pdf keyword provided by the RPA.PDF library. Because the keyword returns a dictionary with one item for each page of the PDF file, we need to go over the results with a FOR loop.
  3. Create File    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
     We create a new, empty text file that we will fill with the text later.
  4. We loop over the pages of extracted text. For each page, we call the Append To File keyword to add its text to our file.
     FOR    ${page}    IN    @{text.keys()}
         Append To File
         ...    ${OUTPUT_DIR}${/}${pdf_file_name}.txt
         ...    ${text[${page}]}
     END

Finally, in the *** Tasks *** section, we create a task and call the keyword for our PDF files:

*** Tasks ***
Extract text from PDF files
    Extract text from PDF file into a text file    simple-text-example.pdf
    Extract text from PDF file into a text file    example-invoice.pdf
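Because RPA.PDF is a Python library, the same page-by-page logic can also be sketched in Python. The block below is a minimal sketch, not part of the original robot: the function names and the default "output" directory are hypothetical, and the wrapper assumes the rpaframework package is installed. The helper that writes the pages uses only the standard library, so it can be tried with any dictionary that maps page numbers to text.

```python
from pathlib import Path


def write_pages_to_text_file(pages, output_path):
    """Write the text of every page into a single text file.

    `pages` maps page numbers to page text, the shape described above
    for the result of Get Text From Pdf.
    """
    output_path = Path(output_path)
    # Make sure the target directory exists before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Start from an empty file, then append one page at a time.
    output_path.write_text("", encoding="utf-8")
    with output_path.open("a", encoding="utf-8") as text_file:
        for page_number in pages:  # same order as looping over text.keys()
            text_file.write(pages[page_number])
    return output_path


def extract_text_from_pdf(pdf_file_name, output_dir="output"):
    """Hypothetical wrapper; assumes the rpaframework package is installed."""
    # Imported lazily so the helper above has no third-party dependencies.
    from RPA.PDF import PDF

    pages = PDF().get_text_from_pdf(pdf_file_name)
    return write_pages_to_text_file(pages, Path(output_dir) / f"{pdf_file_name}.txt")
```

Calling extract_text_from_pdf("example-invoice.pdf") would then play the role of one call to the robot's keyword, with the output directory chosen by the caller instead of ${OUTPUT_DIR}.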

Alternatively, you can use the RPA.PDF.Find Text keyword to search for specific portions of text in the document, using locators that support substring and regular expression search strategies. Once such a piece of text has been found, it is set as an anchor, and the search continues for the neighbouring pieces of text around it. The neighbours are then returned alongside their anchor in a Match object.

This keyword can be used to extract rows and columns from tables, paragraphs, specific variable details near constant values (like the unit price in an invoice) and anything else that is represented as text inside the content of the PDF file.
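As a rough illustration of the anchor-and-neighbours idea, here is a hedged Python sketch. The function names, the sample label, and the sample value are made up for illustration. The find_right_of wrapper assumes rpaframework is installed and that find_text accepts a locator string plus pagenum and direction arguments and returns objects with anchor and neighbours attributes; check the RPA.PDF reference for the version you use before relying on it.

```python
from types import SimpleNamespace


def anchors_with_neighbours(matches):
    """Pair each match's anchor text with the neighbouring texts found for it."""
    return [(match.anchor, list(match.neighbours)) for match in matches]


def find_right_of(pdf_path, locator):
    """Hypothetical wrapper: collect the text found to the right of `locator`.

    Assumes the rpaframework package is installed.
    """
    # Imported lazily so the helper above works without rpaframework.
    from RPA.PDF import PDF

    pdf = PDF()
    pdf.open_pdf(pdf_path)
    try:
        # Assumed call shape: a locator string, the page to search, and a direction.
        matches = pdf.find_text(locator, pagenum=1, direction="right")
        return anchors_with_neighbours(matches)
    finally:
        pdf.close_all_pdfs()


# Stand-in object with the same two attributes, using made-up values,
# so the helper can be tried without a PDF file.
sample_matches = [SimpleNamespace(anchor="Invoice Number", neighbours=["INV-0001"])]
print(anchors_with_neighbours(sample_matches))
```

Keeping the pairing logic in its own function means it can be exercised with stand-in objects, while the part that touches a real PDF stays small.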

Results

The text extraction quality will vary greatly, depending on the document: how it was created, how it was formatted, etc.

Here are some examples:

Source PDF                 Resulting text file
simple-text-example.pdf    simple-text-example.txt
example-invoice.pdf        example-invoice.txt

Alternative approaches

If you need higher precision, and the capability to extract accurate tables from invoices and reports, we recommend connecting your robot to an external extraction service.

You can see an example using the Amazon Textract service in our Cloud machine learning (ML) APIs, where we extract a table from a test invoice.

Another example robot demonstrates how to process PDF invoices with Amazon Textract.

Here's an example of how to Extract data from PDF files displaying invoice-like information, where different approaches are shown:

  • Using our own RPA.PDF with Find Text.
  • Using the Camelot library to detect and extract tables.
  • Using specialized AI services with RPA.DocumentAI to return structured information from both text-based and image-based PDF files.

Another example with RPA.DocumentAI: Intelligent Document Processing with various engines.

Last edit: May 5, 2022