Features Overview

Parse Data from Documents

GroupDocs.Parser lets you parse documents using user-defined templates.

Create a template from data field definitions and table definitions, then pass the Template object to the parse_by_template method to extract data such as prices, invoices, and tables from your typical documents.
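The idea behind template-based parsing can be sketched without the library. The example below is not the GroupDocs.Parser API: it assumes a template is just a mapping from hypothetical field names to regular expressions, the helper name parse_by_fields is invented for illustration, and the invoice text is made up.

```python
import re

# Hypothetical template: field name -> regular expression with one capture group.
# This only illustrates the concept; it is not the GroupDocs.Parser Template class.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:\s*(\S+)",
    "total": r"Total\s*:\s*\$?([\d.,]+)",
}


def parse_by_fields(text, template):
    """Return a dict with the first match of every template field (None if absent)."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


# Made-up document text standing in for text extracted from an invoice.
sample = "Invoice No: INV-0042\nDate: 2024-01-15\nTotal: $1,250.00\n"
fields = parse_by_fields(sample, TEMPLATE)
print(fields)
```

A field that is not found is reported as None rather than raising an error, so one template can be applied to documents that only partly match it.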

Extract Text

GroupDocs.Parser provides several text extraction methods that cover various text retrieval scenarios.

Extract plain text from any of the supported documents
Extract HTML or Markdown formatted text for a fast preview
Extract structured text
Extract text areas with coordinates, text style and other info
Search text by a keyword or regular expression; get text around the found word

The different text extraction aspects are described below:

Accurate Text Extraction Mode

Accurate text extraction is one of the most requested features. GroupDocs.Parser makes it easy to implement with the get_text method.

Raw Text Extraction Mode

For some formats, GroupDocs.Parser offers a Raw text extraction mode that increases extraction performance. The extracted text is less accurate, but extraction is faster.

This mode is useful when performance is critical and lower text quality is acceptable.

Extract Formatted Text

In addition to the standard text extraction modes, the GroupDocs.Parser API provides the get_formatted_text method for cases where plain text is not enough and you need to keep formatting such as text style and table layout.

The following formats are currently supported:

Plain Text
Markdown
HTML

Plain Text

In Plain Text mode, GroupDocs.Parser formats the extracted text so that it looks closer to the original document. This is achieved through text positioning, box-drawing characters, and similar techniques.
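To show what box-drawing characters contribute to plain-text output, here is a minimal standalone sketch. It does not reproduce the library's actual layout algorithm; the render_table helper and the sample rows are assumptions made for illustration.

```python
def render_table(rows):
    """Render rows of strings as a plain-text table with box-drawing borders."""
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

    def border(left, mid, right):
        return left + mid.join("─" * (w + 2) for w in widths) + right

    def line(row):
        cells = (f" {cell.ljust(w)} " for cell, w in zip(row, widths))
        return "│" + "│".join(cells) + "│"

    out = [border("┌", "┬", "┐")]
    for index, row in enumerate(rows):
        if index:
            # Separator between consecutive rows.
            out.append(border("├", "┼", "┤"))
        out.append(line(row))
    out.append(border("└", "┴", "┘"))
    return "\n".join(out)


table = render_table([["Item", "Price"], ["Pen", "1.50"]])
print(table)
```

Because every cell is padded to its column width, the table stays aligned in any monospaced view, which is what makes this kind of output resemble the original layout.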

Markdown

This mode is useful when you need to export the extracted text to any system that supports Markdown-formatted text.

The following formatting elements are currently supported:

Bold text
Italic text
Hyperlinks
Headings
Numbered and bulleted lists
Tables

HTML

GroupDocs.Parser also supports HTML formatting.

The following HTML tags are currently supported in this formatting mode:

Tag	Description
`<p>`	Each paragraph is wrapped in a `<p>` tag
`<a>`	Hyperlinks
`<b>`	Bold text is wrapped in a `<b>` tag
`<i>`	Italic text is wrapped in an `<i>` tag
`<h1>` – `<h6>`	A heading with the ‘Heading X’ style is wrapped in an `<hX>` tag
`<ol>`/`<ul>`	Numbered and bulleted lists
`<table>`	Tables
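Once text has been extracted as HTML, it can be processed with any HTML tooling. The sketch below uses Python's standard html.parser module to collect headings and hyperlink targets. The fragment is made up and limited to tags from the table above; the OutlineParser class is a hypothetical helper, not part of GroupDocs.Parser.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect heading texts and hyperlink targets from an HTML fragment."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        # Accumulate text only while inside a heading element.
        if self._in_heading:
            self.headings[-1] += data


# Made-up fragment standing in for HTML-formatted extraction output.
fragment = (
    '<h1>Report</h1><p>See <a href="https://example.com">the site</a> '
    "for <b>details</b>.</p>"
)
outline_parser = OutlineParser()
outline_parser.feed(fragment)
print(outline_parser.headings, outline_parser.links)
```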

Extract Structured Text

Most documents contain more than plain text. The text is usually organized into sections with headers and paragraphs, and it can include hyperlinks, lists, and tables. For this scenario, GroupDocs.Parser provides structured text extraction, which returns the text together with its structure. Call the get_structure method to get XML that represents the structured text.
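Because the structure is delivered as XML, it can be walked with a standard XML parser. The sketch below uses xml.etree.ElementTree on a made-up sample; the element names (document, section, paragraph, list, item) are assumptions for illustration and may not match the schema that get_structure actually produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: these element names are assumptions for illustration,
# not the documented output schema of get_structure.
SAMPLE_XML = """
<document>
  <section name="Introduction">
    <paragraph>GroupDocs.Parser extracts text.</paragraph>
    <list>
      <item>Plain text</item>
      <item>Formatted text</item>
    </list>
  </section>
</document>
"""


def outline(xml_text):
    """Return (depth, tag, text) tuples for every element, in document order."""
    root = ET.fromstring(xml_text.strip())
    rows = []

    def walk(element, depth):
        text = (element.text or "").strip()
        rows.append((depth, element.tag, text))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


rows = outline(SAMPLE_XML)
for depth, tag, text in rows:
    # Indent each element by its nesting depth.
    print("  " * depth + tag + (": " + text if text else ""))
```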

Extract Text Areas

GroupDocs.Parser provides an API for extracting text areas together with their coordinates and text style.

This feature supports advanced text analytics scenarios in your applications. Call the get_text_areas method to get all text area objects.
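One common use of coordinates is selecting the text that falls inside a region of a page, such as a footer. The sketch below models a text area with a hypothetical dataclass; the attribute names and the areas_in_region helper are assumptions, and the objects returned by get_text_areas may be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class TextArea:
    # Hypothetical shape of a text area; the real objects returned by
    # get_text_areas may expose different attribute names.
    text: str
    page: int
    left: float
    top: float
    width: float
    height: float


def areas_in_region(areas, page, left, top, right, bottom):
    """Return areas on `page` fully inside the rectangle, in reading order."""
    inside = [
        a for a in areas
        if a.page == page
        and a.left >= left and a.top >= top
        and a.left + a.width <= right
        and a.top + a.height <= bottom
    ]
    # Reading order: top-to-bottom, then left-to-right.
    return sorted(inside, key=lambda a: (a.top, a.left))


# Made-up areas standing in for extraction results.
areas = [
    TextArea("$1,250.00", 0, 100, 700, 60, 12),
    TextArea("Invoice", 0, 50, 40, 60, 14),
    TextArea("Total:", 0, 50, 700, 40, 12),
    TextArea("Page 2 text", 1, 50, 700, 80, 12),
]
footer = areas_in_region(areas, page=0, left=0, top=650, right=600, bottom=800)
print([a.text for a in footer])
```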

Search Text in Documents

GroupDocs.Parser lets you search a loaded document using keywords or regular expressions. Call the search method, then loop through the collection of search results.
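The keyword and regular-expression search described here, including the text around each hit, can be illustrated with Python's re module. This is a standalone sketch over an in-memory string, not the library's search method; the search_with_context helper and its parameters are hypothetical.

```python
import re


def search_with_context(text, pattern, context=20, is_regex=False):
    """Yield (position, match, left_context, right_context) for each occurrence."""
    # A plain keyword is escaped so that characters like "." match literally.
    regex = re.compile(pattern if is_regex else re.escape(pattern), re.IGNORECASE)
    for match in regex.finditer(text):
        start, end = match.span()
        yield (
            start,
            match.group(0),
            text[max(0, start - context):start],
            text[end:end + context],
        )


sample = "Invoice 42 was paid. Invoice 43 is overdue."
results = list(search_with_context(sample, r"invoice \d+", context=10, is_regex=True))
for position, found, left, right in results:
    print(position, repr(left), found, repr(right))
```

Returning the surrounding text along with the position makes it possible to show each hit in context without a second pass over the document text.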

Extract Metadata

GroupDocs.Parser provides an API for extracting metadata from supported document formats with a single get_metadata method call.

Extract Images

GroupDocs.Parser supports image extraction from documents. Call the get_images method to get information about the images in a document and save them.

Extract Tables

GroupDocs.Parser lets you extract tables from documents while preserving their structure. Use the get_tables method to extract tables from the entire document or from specific pages.

Extract Data from Attachments and ZIP Archives

GroupDocs.Parser lets you extract data (text, images, and anything else covered by the supported extraction methods) from formats that contain other documents, such as ZIP archives, PDF portfolios, emails, and OST containers.

Call the get_container method, then work with the extracted attachments or archived documents as you would with ordinary document files.
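The container idea, where each embedded item is treated as a document in its own right, can be sketched for the ZIP case with Python's standard zipfile module. This does not use get_container; the iter_container_items helper is hypothetical, and the archive is built in memory so the example needs no files on disk.

```python
import io
import zipfile


def iter_container_items(archive_bytes):
    """Yield (name, size, data) for every file stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Skip directory entries; only files carry content.
            yield info.filename, info.file_size, archive.read(info)


# Build a small archive in memory with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes/readme.txt", "Hello from the archive")
    archive.writestr("data.csv", "item,price\npen,1.50\n")

items = list(iter_container_items(buffer.getvalue()))
for name, size, data in items:
    print(name, size, data.decode("utf-8").splitlines()[0])
```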

Extract Document Information

GroupDocs.Parser provides the ability to get basic document information such as file type, page count, and size. This can be done using the get_document_info method.

Extract Table of Contents

GroupDocs.Parser lets you extract the table of contents from some document formats by calling the get_toc method.
