Features Overview

Text Extractors

GroupDocs.Parser for Java extracts text from a wide range of document formats. The API provides dedicated extractor classes for each supported format. For example, the EmailTextExtractor and EmailFormattedTextExtractor classes extract text from email messages.

Metadata Extractors

The API simplifies metadata extraction from documents. Just as dedicated classes handle text extraction, dedicated classes handle metadata extraction for the various document formats.

Extract Text from Containers

A container is a file that contains other files, for example a ZIP archive. GroupDocs.Parser can work with containers and extract text from the files inside them. The API also lets you extract messages from mail containers, for example an OST container.
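To make the container idea concrete, here is a minimal sketch that walks a ZIP archive and reads each file inside it as text. It uses only the JDK's java.util.zip package, not the GroupDocs.Parser API; the class name ZipContainerSketch and its methods are hypothetical, and decoding every entry as UTF-8 is an assumption made for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, JDK only: illustrates "a file that contains other files".
public final class ZipContainerSketch {

    /** Returns entry name -> entry content, decoding each file as UTF-8 (an assumption). */
    public static Map<String, String> readEntries(byte[] zipBytes) {
        Map<String, String> texts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                // read() returns -1 at the end of the current entry, not of the archive
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                texts.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    /** Builds a small in-memory ZIP so the sketch can be tried without files on disk. */
    public static byte[] zipOf(Map<String, String> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Calling ZipContainerSketch.readEntries(ZipContainerSketch.zipOf(files)) returns the same names and contents that were put in. A real container extractor would hand each entry to a format-specific text extractor instead of decoding it directly.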

Extract Formatted Text

GroupDocs.Parser can extract formatted text from the supported document formats. The following output formats are currently supported:

  • Plain text
  • Markdown
  • HTML

Markdown

The following formatting is currently supported:

  • Bold text
  • Italic text
  • Hyperlinks
  • Headings
  • Numbered and bulleted lists
  • Tables

HTML

The following HTML tags are currently supported:

Tag             Description
<p>             Each paragraph is wrapped in a <p> tag
<a>             Hyperlinks
<b>             Bold text is wrapped in a <b> tag
<i>             Italic text is wrapped in an <i> tag
<h1> to <h6>    A heading with the 'Heading X' style is wrapped in an <hX> tag
<ol>/<ul>       Numbered and bulleted lists
<table>         Tables
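The mapping from formatting to tags can be sketched in a few lines of Java. This is an illustration only, not the library's implementation: the class HtmlTagSketch and its methods are hypothetical, and falling back to a <p> tag when the style name is not 'Heading 1' to 'Heading 6' is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of how formatting could map to the HTML tags listed above.
public final class HtmlTagSketch {

    private static final Pattern HEADING_STYLE = Pattern.compile("Heading ([1-6])");

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static String paragraph(String text) {
        return "<p>" + escape(text) + "</p>";
    }

    public static String bold(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static String italic(String text) {
        return "<i>" + escape(text) + "</i>";
    }

    public static String link(String url, String text) {
        String href = escape(url).replace("\"", "&quot;");
        return "<a href=\"" + href + "\">" + escape(text) + "</a>";
    }

    /** Maps a 'Heading X' style name to an <hX> tag; any other style becomes a paragraph. */
    public static String heading(String styleName, String text) {
        Matcher m = HEADING_STYLE.matcher(styleName);
        if (!m.matches()) {
            return paragraph(text); // assumption: unknown styles are plain paragraphs
        }
        String level = m.group(1);
        return "<h" + level + ">" + escape(text) + "</h" + level + ">";
    }
}
```

For example, HtmlTagSketch.heading("Heading 2", "Intro") produces <h2>Intro</h2>, while a paragraph with the "Normal" style produces a plain <p> element.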

Extract Structured Text

A typical document contains more than plain text. The text is usually organized into paragraphs and divided into sections with headers, and it can also contain hyperlinks, lists, and tables. For such documents, GroupDocs.Parser provides structured text extraction, which extracts the text together with its structure.

Extractors that can extract text together with its structure implement the IStructuredExtractor interface. The following classes currently implement it:

  • CellsTextExtractor
  • WordsTextExtractor
  • SlidesTextExtractor
  • EmailTextExtractor
  • EpubTextExtractor
  • FictionBookTextExtractor

Extract Highlighted Text

GroupDocs.Parser for Java can extract highlights from documents.

Extract Images

GroupDocs.Parser can also extract images from popular document formats.

Support of Password Protected Documents

GroupDocs.Parser can also open password-protected documents in the following formats:

  • Spreadsheets
  • Presentations
  • Text documents
  • PDFs
  • OneNote sections

Convenient Tools

Encoding Detectors

The API can detect the following encodings:

  • UTF32 LE
  • UTF32 BE
  • UTF16 LE
  • UTF16 BE
  • UTF8
  • UTF7
  • ANSI

The encoding is detected from the byte order mark (BOM) or, if no BOM is present, from the content of the file. The detector's constructor accepts the default encoding to use for ANSI text.
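BOM-based detection can be sketched as follows. This is a minimal illustration, not the library's detector: the class BomEncodingSketch is hypothetical, it returns encoding names as plain strings, and it covers only the BOM step. When no BOM is found it simply returns the default encoding passed to the constructor instead of analyzing the content.

```java
// Hypothetical sketch of BOM-based encoding detection (no content analysis).
public final class BomEncodingSketch {

    private final String defaultEncoding;

    /** The default encoding is returned when no BOM is present. */
    public BomEncodingSketch(String defaultEncoding) {
        this.defaultEncoding = defaultEncoding;
    }

    public String detect(byte[] data) {
        // UTF-32 LE must be tested before UTF-16 LE: its BOM starts with FF FE too.
        // A UTF-16 LE file whose first character is U+0000 is indistinguishable here.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (isUtf7Bom(data)) return "UTF-7";
        return defaultEncoding;
    }

    /** The UTF-7 BOM is "+/v" followed by one of '8', '9', '+' or '/'. */
    private static boolean isUtf7Bom(byte[] data) {
        if (data.length < 4 || !startsWith(data, 0x2B, 0x2F, 0x76)) {
            return false;
        }
        int fourth = data[3] & 0xFF;
        return fourth == 0x38 || fourth == 0x39 || fourth == 0x2B || fourth == 0x2F;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned bytes
                return false;
            }
        }
        return true;
    }
}
```

With new BomEncodingSketch("windows-1252"), a buffer starting with EF BB BF is reported as UTF-8 and a buffer without any BOM falls back to windows-1252. The order of the checks matters because the longer UTF-32 marks share a prefix with the UTF-16 ones.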

Media Type Detectors

Each media type detector class detects the media type of the corresponding document format. Every detector class inherits from the MediaTypeDetector abstract class.
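The "one detector class per format, all sharing an abstract base" design might look like the sketch below. The names MediaTypeSketch, SketchDetector, PdfDetector and ZipDetector are hypothetical and do not reflect the actual MediaTypeDetector API; the sketch assumes detection by leading signature bytes, which is only one possible approach.

```java
import java.util.Optional;

// Hypothetical illustration of per-format detectors sharing an abstract base class.
public final class MediaTypeSketch {

    /** Stand-in base class for this sketch; NOT the library's MediaTypeDetector. */
    public abstract static class SketchDetector {

        /** Returns the media type if the header matches this detector's format. */
        public abstract Optional<String> detect(byte[] header);

        /** Compares the first bytes of the data with an ASCII signature. */
        protected static boolean startsWith(byte[] data, String asciiSignature) {
            if (data.length < asciiSignature.length()) {
                return false;
            }
            for (int i = 0; i < asciiSignature.length(); i++) {
                if (data[i] != (byte) asciiSignature.charAt(i)) {
                    return false;
                }
            }
            return true;
        }
    }

    /** PDF files begin with the signature "%PDF-". */
    public static final class PdfDetector extends SketchDetector {
        @Override
        public Optional<String> detect(byte[] header) {
            return startsWith(header, "%PDF-")
                    ? Optional.of("application/pdf")
                    : Optional.empty();
        }
    }

    /** ZIP local file headers begin with "PK" followed by bytes 03 04. */
    public static final class ZipDetector extends SketchDetector {
        @Override
        public Optional<String> detect(byte[] header) {
            return startsWith(header, "PK\u0003\u0004")
                    ? Optional.of("application/zip")
                    : Optional.empty();
        }
    }
}
```

Each subclass knows only its own format, so supporting another format means adding another subclass rather than changing existing ones.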

Search Text in Documents

The API can search for text in the supported document formats. All highlight extraction modes can be used together with the search functionality.

Metered Licensing

GroupDocs.Parser also supports metered licensing.
