RPA.PDF library | Robocorp documentation

RPA.PDF

PDF is a library for managing PDF documents.

It can be used to extract text from PDFs, add watermarks to pages, and decrypt/encrypt documents.

Merging and splitting PDFs is supported by Add Files To PDF keyword. Read the keyword documentation for examples.

There is also limited support for updating form field values. (check Set Field Value and Save Field Values for more info)

The input PDF file can be passed as an argument to the keywords, or it can be omitted if you first call Open PDF. A reference to the current active PDF will be stored in the library instance and can be changed by using the Switch To PDF keyword with another PDF file path, therefore you can asynchronously work with multiple PDFs.

Attention!

Keep in mind that this library works with text-based PDFs, and it can't extract information from an image-based (scan) PDF file. For accurate results, you have to use specialized external services wrapped by the RPA.DocumentAI library.

Portal example with video recording demo for parsing PDF invoices: https://github.com/robocorp/example-parse-pdf-invoice

Examples

Robot Framework

*** Settings *** Library RPA.PDF Library String *** Tasks *** Extract Data From First Page ${text} = Get Text From PDF report.pdf ${lines} = Get Lines Matching Regexp ${text}[${1}] .+pain.+ Log ${lines} Get Invoice Number Open Pdf invoice.pdf ${matches} = Find Text Invoice Number Log List ${matches} Fill Form Fields Switch To Pdf form.pdf ${fields} = Get Input Fields encoding=utf-16 Log Dictionary ${fields} Set Field Value Given Name Text Box Mark Save Field Values output_path=${OUTPUT_DIR}${/}completed-form.pdf ... use_appearances_writer=${True}

from RPA.PDF import PDF from robot.libraries.String import String pdf = PDF() string = String() def extract_data_from_first_page(): text = pdf.get_text_from_pdf("report.pdf") lines = string.get_lines_matching_regexp(text[1], ".+pain.+") print(lines) def get_invoice_number(): pdf.open_pdf("invoice.pdf") matches = pdf.find_text("Invoice Number") for match in matches: print(match) def fill_form_fields(): pdf.switch_to_pdf("form.pdf") fields = pdf.get_input_fields(encoding="utf-16") for key, value in fields.items(): print(f"{key}: {value}") pdf.set_field_value("Given Name Text Box", "Mark") pdf.save_field_values( output_path="completed-form.pdf", use_appearances_writer=True )