GroupDocs.Parser for Java lets users extract text from a wide range of document formats. The API provides dedicated extractor classes for each format. For instance, the EmailTextExtractor and EmailFormattedTextExtractor classes extract text from email messages.
The API also simplifies metadata extraction. Just as there are dedicated classes for text extraction, there are dedicated classes for extracting metadata from the various document formats.
Extract Text from Containers
A container is a file that holds other files, for example a ZIP archive. GroupDocs.Parser can work with containers, and the API also lets users extract messages from mail containers, e.g. an OST container.
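To make the container idea concrete, the sketch below walks a ZIP archive and reads each entry as text. It deliberately uses only the JDK's java.util.zip classes rather than GroupDocs.Parser, so the class name ZipContainerSketch, its methods, and the sample entry names are all hypothetical, and it assumes every entry holds UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of GroupDocs.Parser: treats a ZIP archive as a container of text files.
final class ZipContainerSketch {
    private ZipContainerSketch() {}

    /** Reads every file entry of a ZIP stream as UTF-8 text, keyed by entry name. */
    static Map<String, String> readTextEntries(InputStream in) {
        Map<String, String> texts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in, StandardCharsets.UTF_8)) {
            byte[] chunk = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    content.write(chunk, 0, read);
                }
                texts.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    /** Builds a small in-memory archive so the sketch runs without files on disk. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry("notes/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("report.txt"));
            zip.write("quarterly figures".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> texts = readTextEntries(new ByteArrayInputStream(sampleArchive()));
        texts.forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

A real container extractor would hand each entry to a format-specific extractor instead of decoding it as UTF-8; the loop over entries is the part this sketch is meant to show.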
Extract Formatted Text
GroupDocs.Parser provides the functionality to extract formatted text from the supported document formats. At this moment the following formats are supported:
- Plain text
At this moment the following formatting is supported:
- Bold text
- Italic text
- Numbered and bulleted lists
At this moment the following HTML tags are supported:
|Tag|Description|
|<p>|A paragraph is surrounded by the <p> tag|
|<b>|Text in a bold font is surrounded by the <b> tag|
|<i>|Text in an italic font is surrounded by the <i> tag|
|<h1> – <h6>|A heading with the 'Heading X' style is surrounded by the <hX> tag|
|<ol>/<ul>|Numbered and bulleted lists|
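The tag mapping above can be illustrated with a small sketch. It is not the library's formatter: HtmlTagSketch and its methods are hypothetical names, and the use of the standard <li> tag for list items is an assumption, since the table lists only <ol> and <ul>.

```java
import java.util.List;

// Hypothetical illustration of the tag table; not part of GroupDocs.Parser.
final class HtmlTagSketch {
    private HtmlTagSketch() {}

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps already-built inline HTML in a paragraph. */
    static String paragraph(String innerHtml) {
        return "<p>" + innerHtml + "</p>";
    }

    static String bold(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    static String italic(String text) {
        return "<i>" + escape(text) + "</i>";
    }

    /** Maps a 'Heading X' style to the matching <hX> tag, for X from 1 to 6. */
    static String heading(int level, String text) {
        if (level < 1 || level > 6) {
            throw new IllegalArgumentException("heading level must be between 1 and 6");
        }
        return "<h" + level + ">" + escape(text) + "</h" + level + ">";
    }

    /** Renders a numbered (<ol>) or bulleted (<ul>) list; <li> items are an assumption. */
    static String list(boolean numbered, List<String> items) {
        String tag = numbered ? "ol" : "ul";
        StringBuilder html = new StringBuilder("<" + tag + ">");
        for (String item : items) {
            html.append("<li>").append(escape(item)).append("</li>");
        }
        return html.append("</").append(tag).append(">").toString();
    }
}
```

For example, `HtmlTagSketch.paragraph("Read the " + HtmlTagSketch.bold("manual"))` yields `<p>Read the <b>manual</b></p>`.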
Extract Structured Text
A typical document contains more than plain text. The text is usually organized into paragraphs and divided into parts with headers, and it can also contain hyperlinks, lists, and tables. For this scenario, GroupDocs.Parser provides structured text extraction, which extracts the text together with its structure.
Extractors that can extract text together with its structure implement the IStructuredExtractor interface. At this time, the IStructuredExtractor interface is implemented by:
Extract Highlighted Text
GroupDocs.Parser for Java allows its users to extract highlights from documents.
GroupDocs.Parser can also extract images from popular document formats.
Support of Password Protected Documents
GroupDocs.Parser can also open password-protected documents in the following formats:
- Text documents
- OneNote sections
The API can detect the following encodings:
- UTF32 LE
- UTF32 BE
- UTF16 LE
- UTF16 BE
The encoding can be detected from the BOM or, if no BOM is present, from the content of the file. The constructor accepts a default encoding for ANSI.
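BOM-based detection for the four encodings listed above can be sketched in a few lines of plain Java. This is an illustration of the general technique, not the library's detector: the class name BomSniffer and its detect method are hypothetical, and the content-based fallback for files without a BOM is not covered.

```java
import java.util.Optional;

// Hypothetical helper, not part of GroupDocs.Parser: maps a leading byte order mark to a charset name.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset name implied by a leading BOM, or empty if no known BOM is present. */
    static Optional<String> detect(byte[] head) {
        // UTF-32 must be tested before UTF-16: the UTF-32 LE mark FF FE 00 00
        // begins with the UTF-16 LE mark FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        return Optional.empty();
    }

    /** Compares the first bytes of data with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass the first four bytes of the file and, when the result is empty, fall back to content analysis or to the default encoding. One known ambiguity of this approach: a UTF-16 LE file whose first character is U+0000 is indistinguishable from UTF-32 LE by its BOM alone.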
Media Type Detectors
Each media type detector class detects the media type of the corresponding document format, and each class inherits from the abstract MediaTypeDetector class.
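The one-detector-per-format design can be sketched as an abstract base class with small subclasses. The names below (MediaTypeSketch, Detector, PdfDetector, ZipDetector, detectFirst) are hypothetical and do not reproduce the library's MediaTypeDetector API; the sketch assumes detection by leading signature bytes, using the well-known "%PDF-" and "PK\x03\x04" signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of per-format detectors sharing an abstract base; not the library's API.
final class MediaTypeSketch {
    private MediaTypeSketch() {}

    /** Base class: each subclass recognizes exactly one format. */
    abstract static class Detector {
        /** Returns the media type if the leading bytes match this format, otherwise null. */
        abstract String detect(byte[] head);

        static boolean startsWith(byte[] data, byte[] prefix) {
            if (data.length < prefix.length) {
                return false;
            }
            for (int i = 0; i < prefix.length; i++) {
                if (data[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    static final class PdfDetector extends Detector {
        private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

        @Override
        String detect(byte[] head) {
            return startsWith(head, MAGIC) ? "application/pdf" : null;
        }
    }

    static final class ZipDetector extends Detector {
        private static final byte[] MAGIC = {'P', 'K', 3, 4};

        @Override
        String detect(byte[] head) {
            return startsWith(head, MAGIC) ? "application/zip" : null;
        }
    }

    /** Tries each detector in turn and returns the first media type that matches. */
    static Optional<String> detectFirst(byte[] head) {
        List<Detector> detectors = Arrays.asList(new PdfDetector(), new ZipDetector());
        for (Detector detector : detectors) {
            String mediaType = detector.detect(head);
            if (mediaType != null) {
                return Optional.of(mediaType);
            }
        }
        return Optional.empty();
    }
}
```

Keeping each format in its own subclass means a new format can be supported by adding one class and registering it in the list, without touching the existing detectors.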
Search Text in Documents
The API can search for text in supported document formats. All highlight extraction modes can be used together with the search functionality.
GroupDocs.Parser also supports metered licensing.