Each supported mode is marked by its own circle:
RAW text extractor
Formatted text extractor
Structured text extractor
Supported as container
|Word XML (.xml)|
|Microsoft Exchange Server|
|Electronic Publication Format|
|Portable Document Format|
|Compression and packaging formats|
The opening of password-protected documents is supported for the following document formats:
The following text and presentation templates are also supported for text extraction:
The following formats are supported for metadata extraction. The table lists the metadata properties that GroupDocs.Parser can extract from each format.
|Metadata Property Name||.docx||.doc||.dot||.odt||.xlsx||.xls||.ods||.pptx||.ppt||.odp||.msg||.eml||.emlx||.epub||.fb2|
The following text and presentation templates are also supported for metadata extraction:
1. UTF-16 is detected by searching for the space character (U+0020).
2. ANSI is detected by checking that the data contains no 0x00 bytes.
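The two heuristics above can be sketched in a few lines. This is only an illustration of the idea, not the library's implementation: the function name `guess_encoding`, the even-offset scan, the little-endian/big-endian distinction, and the `"unknown"` fallback are all assumptions made for the example.

```python
def guess_encoding(data: bytes) -> str:
    """Guess an encoding label using the two heuristics described above.

    Hypothetical helper: returns "utf-16-le", "utf-16-be", "ansi" or "unknown".
    """
    # Heuristic 1: look for U+0020 stored as a 16-bit code unit.
    # Assumption: code units start on even byte offsets.
    for i in range(0, len(data) - 1, 2):
        pair = data[i:i + 2]
        if pair == b"\x20\x00":
            return "utf-16-le"
        if pair == b"\x00\x20":
            return "utf-16-be"

    # Heuristic 2: single-byte (ANSI) text normally contains no 0x00 bytes.
    if b"\x00" not in data:
        return "ansi"

    # Neither heuristic matched (assumed fallback for this sketch).
    return "unknown"
```

For instance, `guess_encoding("hello world".encode("utf-16-le"))` takes the first branch because the space is stored as the bytes `20 00`, while the same text encoded as `cp1252` contains no zero bytes and falls through to the ANSI check.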
Currently, the following formatting is supported:
Currently, the following HTML tags are supported:
A paragraph is surrounded by the <p> tag
Text in a bold font is surrounded by the <b> tag
Text in an italic font is surrounded by the <i> tag
<h1> – <h6>
If a heading has the 'Heading X' style, it is surrounded by the <hx> tag
Numbered and bulleted lists
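As a rough illustration of the tag mapping listed above, the sketch below wraps one paragraph of text in the corresponding tags. It does not use the GroupDocs.Parser API: the function `wrap_paragraph`, its parameters, and the choice to nest <i> inside <b> are assumptions made for this example.

```python
import re
from html import escape


def wrap_paragraph(text: str, style: str = "Normal",
                   bold: bool = False, italic: bool = False) -> str:
    """Wrap text in <p>, <b>, <i> or <h1>..<h6> tags (hypothetical helper)."""
    content = escape(text)  # escape &, <, > so the text stays valid markup
    if italic:
        content = f"<i>{content}</i>"
    if bold:
        content = f"<b>{content}</b>"

    # A 'Heading X' style maps to <hX>; any other style is a plain paragraph.
    match = re.fullmatch(r"Heading ([1-6])", style)
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{content}</{tag}>"
```

With these assumptions, a paragraph styled 'Heading 2' is emitted inside an <h2> element, and bold body text is emitted as a <b> element nested in a <p> element.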
A typical document is more than plain text. The text is usually organized into paragraphs and divided into sections with headers, and it can also contain hyperlinks, lists, and tables. Structured text extraction makes it possible to extract the text together with its structure.
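To make the idea of "text with its structure" concrete, here is a minimal sketch of how extracted content might be modelled once the structure is kept. The `Node` class, its fields, and the `outline` helper are hypothetical and exist only for this example; they are not part of the GroupDocs.Parser API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One structural element of a document (hypothetical model)."""
    kind: str       # e.g. "heading", "paragraph", "list_item", "hyperlink"
    text: str
    level: int = 0  # heading level or list nesting depth; 0 if not applicable


def outline(nodes):
    """Return the heading texts, indented two spaces per heading level."""
    return ["  " * (n.level - 1) + n.text
            for n in nodes if n.kind == "heading"]


document = [
    Node("heading", "Intro", 1),
    Node("paragraph", "Some body text."),
    Node("heading", "Details", 2),
    Node("list_item", "First point", 1),
]
```

Calling `outline(document)` keeps only the headings, which is the kind of query that plain text extraction cannot answer because the header information is lost.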
Any extractor that implements the IStructuredExtractor interface can extract text together with its structure. At this time, the IStructuredExtractor interface is implemented by:
This feature allows extracting images from documents. Currently, the following file formats are supported by this feature:
At the moment, the text analysis API is supported by: