This example demonstrates the standard open-edit-save pipeline with PDF documents, using different options at every step.
Introduction
PDF (Portable Document Format), developed by Adobe, is widely used across the Internet and in document management systems. It differs from formats such as DOCX, TXT, or HTML/CSS in one crucial respect: it is a fixed-layout format. The main purpose of PDF is to be platform-independent and to store the exact representation of a document, so that wherever and whenever the document is opened, it is rendered with per-character and even per-pixel fidelity. This means that a document, once created, is “baked” in terms of its representation and editability. While you can freely edit a DOCX document by adding, removing, or moving any part of its content, a PDF document stays “frozen”. Internally, a PDF document consists of pages, and every page contains a set of glyphs (visual characters), each with coordinates that fix its location on the page.
In conclusion:
Editing PDF documents in the same way as ordinary DOCX, TXT, or HTML documents is an extremely difficult and complex task.
The quality of PDF editing can come very close to that of ordinary text documents, but it will never reach 100%, especially when the input PDF has complex formatting and content.
Because of the complexity of the PDF format and of the process of making it editable, this operation requires a lot of processing time and memory.
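The glyph-based page model described above can be illustrated with a small sketch in plain Python. It is not part of GroupDocs.Editor: the Glyph class, the rebuild_lines function, and the gap threshold are hypothetical, and the y coordinate is assumed to grow downward for simplicity. The point is that words and lines are not stored as such and have to be guessed from positions.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Glyph:
    """One visual character at fixed page coordinates (hypothetical model)."""
    char: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top of the page (assumption for simplicity)


def rebuild_lines(glyphs: list[Glyph], gap: float = 6.0) -> list[str]:
    """Guess text lines from glyph positions.

    Glyphs that share the same y value form one line; a horizontal jump
    wider than `gap` is treated as a space. Both rules are heuristics.
    """
    ordered = sorted(glyphs, key=lambda g: (g.y, g.x))
    lines = []
    for _, row in groupby(ordered, key=lambda g: g.y):
        text = ""
        prev_x = None
        for glyph in row:
            if prev_x is not None and glyph.x - prev_x > gap:
                text += " "
            text += glyph.char
            prev_x = glyph.x
        lines.append(text)
    return lines


page = [
    Glyph("H", 10, 20), Glyph("i", 15, 20),
    Glyph("P", 30, 20), Glyph("D", 35, 20), Glyph("F", 40, 20),
    Glyph("!", 10, 40),
]
print(rebuild_lines(page))  # ['Hi PDF', '!']
```

Even in this toy case the result depends on a guessed threshold, which hints at why real documents with columns, tables, and mixed fonts cannot be reconstructed perfectly.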
In short
Editing PDF documents is the same as editing any other document:
Load a PDF document into the Editor class with PdfLoadOptions, specifying a password if needed.
Loading
The PdfLoadOptions class is responsible for loading PDF files into the Editor. It has only one property: password, of string type. By default it is None, meaning that no password is specified. This property is essential when the input document is protected with a password. If the document is not protected, the property value is ignored whether or not it was specified.
When the input PDF is not password-protected, PdfLoadOptions is not necessary at all — GroupDocs.Editor will automatically detect the PDF format and apply the default PdfLoadOptions by itself. However, specifying even a default PdfLoadOptions will speed up the document processing, because in this case GroupDocs.Editor will not spend processing time on the automatic format detection routine.
    from groupdocs.editor import Editor
    from groupdocs.editor.options import PdfLoadOptions

    # Create the default PDF loading options
    load_options = PdfLoadOptions()
    # Set a password
    load_options.password = "some_password"

    # Load a PDF without PDF load options
    editor1 = Editor("protected.pdf")
    # Load a PDF with PDF load options
    editor2 = Editor("protected.pdf", load_options)
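To give a rough idea of what automatic format detection involves, the sketch below checks a stream for the `%PDF-` signature that PDF files carry at their start. This is only an illustration in plain Python: looks_like_pdf and probe_size are hypothetical names, and nothing here describes how GroupDocs.Editor actually detects formats.

```python
import io

PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(stream, probe_size: int = 1024) -> bool:
    """Return True if the PDF signature appears near the start of the stream."""
    position = stream.tell()
    try:
        head = stream.read(probe_size)
    finally:
        stream.seek(position)  # leave the stream where it was found
    return PDF_SIGNATURE in head


print(looks_like_pdf(io.BytesIO(b"%PDF-1.7\n...")))              # True
print(looks_like_pdf(io.BytesIO(b"PK\x03\x04 zip-based file")))  # False
```

A real detector has to tell apart many formats, not just one, which is why passing explicit load options can save that work.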
Editing
As with other format families in GroupDocs.Editor, there is a special PdfEditOptions class for editing PDF documents. The most useful properties are:
The skip_images boolean flag. By default it has a False value — images are not skipped and are preserved. However, if you need only textual information from the document, you can set this flag to True.
The enable_pagination boolean flag. This flag sets the document conversion mode: float (False, the default) or paginal (True). In float mode, the document content is converted to a pageless (float) HTML document. In paginal mode, the pages of the document are preserved in the generated HTML document, as in a PDF viewer.
The pages property, which allows setting a page range that should be processed. By default all pages of the input document are processed.
If the default PdfEditOptions are acceptable, you can skip creating an instance altogether: call editor.edit() without arguments, and GroupDocs.Editor will internally generate and apply the default PdfEditOptions for the input PDF document.
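To make the difference between the float and paginal modes concrete, here is a minimal sketch in plain Python. It does not call GroupDocs.Editor: the to_html function and the markup it emits are hypothetical, and the HTML actually produced by the library will look different.

```python
from html import escape


def to_html(pages: list[str], enable_pagination: bool = False) -> str:
    """Render page texts as float (one flow) or paginal (one block per page) HTML."""
    if enable_pagination:
        # Paginal mode: every source page keeps its own container.
        blocks = [
            f'<div class="page" data-page="{n}"><p>{escape(text)}</p></div>'
            for n, text in enumerate(pages, start=1)
        ]
    else:
        # Float mode: page boundaries are dropped, content flows continuously.
        blocks = [f"<p>{escape(text)}</p>" for text in pages]
    return "<body>" + "".join(blocks) + "</body>"


pages = ["First page", "Second & last"]
print(to_html(pages))
print(to_html(pages, enable_pagination=True))
```

In the float output nothing records where one page ended and the next began, while the paginal output keeps that boundary at the cost of heavier markup.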
The runnable example below loads a PDF document, edits it with adjusted PdfEditOptions, and obtains the HTML content from the resultant EditableDocument.
    import os
    from groupdocs.editor import Editor, License
    from groupdocs.editor.options import PdfLoadOptions, PdfEditOptions

    def edit_pdf():
        # Optionally set a license
        license_path = os.path.abspath("./GroupDocs.Editor.lic")
        if os.path.exists(license_path):
            License().set_license(license_path)

        # Prepare PDF load options (optional, speeds up format detection)
        load_options = PdfLoadOptions()

        # Load an input PDF document into the Editor
        with Editor("./sample-document.pdf", load_options) as editor:
            # Create and adjust the PDF edit options
            edit_options = PdfEditOptions()
            edit_options.enable_pagination = True
            edit_options.skip_images = False

            # Edit the PDF and obtain an EditableDocument
            editable = editor.edit(edit_options)

            # Obtain the HTML content (in practice it is sent to the WYSIWYG-editor)
            content = editable.get_content()
            print("Generated HTML content length:", len(content))

            editable.dispose()

    if __name__ == "__main__":
        edit_pdf()
sample-document.pdf is the sample file used in this example.
Saving
As with other document formats, a dedicated class, PdfSaveOptions, is responsible for saving PDF documents. It has the following properties:
password — allows you to protect the output PDF document with a specified password. By default it is None — password protection is not applied.
compliance — allows setting the PDF standards compliance level for the output PDF.
optimize_memory_usage — a boolean flag that modifies the generation of the output PDF document so that the process takes less memory at the cost of longer processing time. By default it has a False value.
font_embedding — responsible for embedding font resources into the resultant PDF document.
Unlike PdfLoadOptions and PdfEditOptions, which are optional, PdfSaveOptions is mandatory even if all its values are default. After editing, an EditableDocument is created from the modified content and is then passed, together with the save options, to the editor.save() method:
    from groupdocs.editor import Editor, EditableDocument
    from groupdocs.editor.options import PdfEditOptions, PdfSaveOptions

    with Editor("./sample-document.pdf") as editor:
        original = editor.edit(PdfEditOptions())

        # Send the content to the WYSIWYG-editor and obtain the edited content (omitted here)
        edited = EditableDocument.from_markup(original.get_embedded_html())

        save_options = PdfSaveOptions()
        save_options.password = "some_password"
        save_options.optimize_memory_usage = True

        editor.save(edited, "./edited-document.pdf", save_options)

        original.dispose()
        edited.dispose()
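In real code a hard-coded password such as "some_password" is usually undesirable. One possible approach, sketched below in plain Python, reads the password from an environment variable and falls back to None, which matches the documented default of no password protection. The resolve_password helper and the PDF_OUTPUT_PASSWORD variable name are hypothetical.

```python
import os


def resolve_password(env_var: str = "PDF_OUTPUT_PASSWORD"):
    """Return the password from the environment, or None when unset or empty."""
    value = os.environ.get(env_var, "")
    return value or None


os.environ["PDF_OUTPUT_PASSWORD"] = "s3cret"
print(resolve_password())  # s3cret

del os.environ["PDF_OUTPUT_PASSWORD"]
print(resolve_password())  # None
```

The returned value could then be assigned to the password property of the save options, assuming the property accepts None as its documented default suggests.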
Different output formats
Keep in mind that an edited PDF does not have to be saved back to PDF: you are free to choose any compatible format, such as any of the WordProcessing formats, plain text, or eBook formats.
    from groupdocs.editor import Editor, EditableDocument
    from groupdocs.editor.formats import WordProcessingFormats
    from groupdocs.editor.options import PdfSaveOptions, WordProcessingSaveOptions, TextSaveOptions

    with Editor("./sample-document.pdf") as editor:
        edited = editor.edit()

        # Save to PDF
        editor.save(edited, "./edited-document.pdf", PdfSaveOptions())

        # Save to DOCX
        editor.save(edited, "./edited-document.docx", WordProcessingSaveOptions(WordProcessingFormats.DOCX))

        # Save to TXT
        editor.save(edited, "./edited-document.txt", TextSaveOptions())

        edited.dispose()
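When the output format is chosen at run time, the right save-options class has to be picked from the target file name. The sketch below shows one way to do that lookup in plain Python. It only returns the class names used in the example above as strings; the save_options_name helper and the mapping table are hypothetical and cover just these three extensions.

```python
from pathlib import Path

# Save-options class names from the example above, keyed by file extension.
SAVE_OPTIONS_BY_EXTENSION = {
    ".pdf": "PdfSaveOptions",
    ".docx": "WordProcessingSaveOptions",
    ".txt": "TextSaveOptions",
}


def save_options_name(output_path: str) -> str:
    """Return the name of the save-options class that fits the output path."""
    extension = Path(output_path).suffix.lower()
    try:
        return SAVE_OPTIONS_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"No save options mapped for {extension!r}") from None


print(save_options_name("./edited-document.pdf"))   # PdfSaveOptions
print(save_options_name("./edited-document.DOCX"))  # WordProcessingSaveOptions
```

Failing loudly on an unknown extension is a deliberate choice here: silently falling back to PDF could hide a typo in the output path.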
Obtaining PDF document info
The article Extracting document metainfo describes the get_document_info() method, which allows you to detect the document format and extract its metadata without editing it. This mechanism also works with PDF documents.
When get_document_info() is called for an Editor instance that is loaded with a PDF document, the method returns a metadata view corresponding to the FixedLayoutDocumentInfo type, a common type for all fixed-layout documents, PDF and XPS in particular. It exposes the format, page_count, size, and is_encrypted properties. If the input PDF is password-protected, the correct password must be passed to the get_document_info() method.
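As a rough illustration of how these four properties might be consumed, the sketch below builds a one-line summary from any object that exposes them. It uses a stand-in object instead of a real GroupDocs.Editor result. The summarize_info helper is hypothetical, size is assumed to be a byte count, and the real format value may not print as a plain string.

```python
from types import SimpleNamespace


def summarize_info(info) -> str:
    """Build a one-line summary from an object exposing the four properties."""
    lock = "encrypted" if info.is_encrypted else "not encrypted"
    return f"{info.format}: {info.page_count} page(s), {info.size} bytes, {lock}"


# Stand-in for illustration; real code would pass the result of get_document_info().
fake_info = SimpleNamespace(format="PDF", page_count=3, size=48213, is_encrypted=False)
print(summarize_info(fake_info))  # PDF: 3 page(s), 48213 bytes, not encrypted
```

Checking is_encrypted first is a cheap way to decide whether a password needs to be requested before any editing is attempted.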