Supported File Formats

Supported Formats for Text Extraction

Each supported mode is marked by its own circle:

RAW text extractor

Formatted text extractor

Structured text extractor

Metadata extractor

Supported as container

Format    GroupDocs.Parser
Text Documents 
.doc
.dot   
.docx   
.docm   
Word XML (.xml)   
.rtf  
.odt   
.txt 
.md  
Spreadsheets 
.xls   
.xlsx   
.csv  
.xlsm   
.xlsb   
.ods   
.tsv  
SpreadsheetML (.xml)   
Presentations 
.ppt   
.pptx   
.pptm   
.pps   
.ppsx   
.ppsm   
.odp   
OneNote notebooks 
.one
Email 
.msg   
.eml   
.emlx   
TNEF (winmail.dat) 
.pst
.ost
Microsoft Exchange Server
POP
IMAP
Electronic Publication Format 
.epub   
.fb2 (FictionBook)   
Portable Document Format 
.pdf
PDF portfolio
Encrypted PDF
DOM-based documents 
.xml
.html
.xhtml
.mhtml
Compression and packaging formats 
.zip
.chm
Databases 
ADO.NET


The opening of password-protected documents is supported for the following document formats:

The following text and presentation templates are also supported for text extraction:

Metadata Extraction

The following table lists the formats supported for metadata extraction, along with the metadata properties that GroupDocs.Parser can extract from each.

Metadata Property Name  .docx  .doc  .dot  .odt  .xlsx  .xls  .ods  .pptx  .ppt  .odp  .pdf  .msg  .eml  .emlx  .epub  .fb2
Application      
ApplicationVersion        
Template            
Title   
Subject
Comments      
Keywords    
ContentStatus           
Category         
Manager         
Author   
LastAuthor      
Company         
HyperlinkBase          
CreatedTime     
LastSavedTime     
LastPrintedTime        
RevisionNumber         
TotalEditingTime            
EmailFrom             
EmailTo             
EmailCC             
Description              
Language              
Copyrights               
Publisher              
PublishedDate              

The following text and presentation templates are also supported for metadata extraction:

Encoding Detection

Supported

Not supported

Encoding    BOM    Content
UTF32 LE
UTF32 BE
UTF16 LE           (1)
UTF16 BE           (1)
UTF8
UTF7
ANSI               (2)

 1. UTF-16 content is detected by searching for the space character (U+0020).
 2. ANSI content is detected by the absence of 0x00 bytes.
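
The two detection stages above (byte order mark first, then content) can be illustrated with a minimal Python sketch. This is not GroupDocs.Parser code: the function names are hypothetical, the UTF-7 BOM is omitted for brevity, and the content heuristics are only one plausible reading of notes 1 and 2 (counting 16-bit units that encode U+0020, and checking that no 0x00 byte is present).

```python
import codecs

# BOM signatures, checked longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_by_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_by_content(data: bytes):
    """Guess the encoding from the bytes themselves (hypothetical heuristics)."""
    # Split into aligned 16-bit units and count those that encode U+0020.
    units = [data[i:i + 2] for i in range(0, len(data) - 1, 2)]
    spaces_le = units.count(b"\x20\x00")  # space in UTF-16 LE
    spaces_be = units.count(b"\x00\x20")  # space in UTF-16 BE
    if spaces_le > spaces_be:
        return "utf-16-le"
    if spaces_be > spaces_le:
        return "utf-16-be"
    # Note 2: a single-byte (ANSI) text normally contains no 0x00 bytes.
    if data and b"\x00" not in data:
        return "ansi"
    return None


def detect_encoding(data: bytes):
    """Prefer the BOM; fall back to the content heuristics."""
    return detect_by_bom(data) or detect_by_content(data)
```

For example, `detect_encoding("a b".encode("utf-16-le"))` falls through to the content stage because the data carries no BOM, and the single aligned `20 00` unit tips the guess toward UTF-16 LE.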

Text Formatters

Element         Plain    Markdown    Html
Text
List
Table
Hyperlink
Nested table

Markdown

Currently, the following formatting is supported:

HTML

Currently, the following HTML tags are supported:

<p>          Each paragraph is wrapped in a <p> tag
<a>          Hyperlinks
<b>          Bold text is wrapped in a <b> tag
<i>          Italic text is wrapped in an <i> tag
<h1> – <h6>  A heading with the 'Heading X' style is wrapped in the matching <hX> tag
<ol>/<ul>    Numbered and bulleted lists
<table>      Tables
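
To make the tag mapping concrete, here is a minimal Python sketch of a formatter that emits the tags listed above. It is an illustration only, not the GroupDocs.Parser formatter: every function name is hypothetical, and the `<li>`, `<tr>`, and `<td>` tags are an assumption needed to produce well-formed lists and tables.

```python
from html import escape


def text(value: str) -> str:
    """Escape plain text so it can be placed safely inside a tag."""
    return escape(value)


def bold(inner: str) -> str:
    return f"<b>{inner}</b>"


def italic(inner: str) -> str:
    return f"<i>{inner}</i>"


def hyperlink(url: str, inner: str) -> str:
    return f'<a href="{escape(url, quote=True)}">{inner}</a>'


def paragraph(inner: str) -> str:
    return f"<p>{inner}</p>"


def heading(level: int, inner: str) -> str:
    """Map a 'Heading X' style to the <hX> tag; only levels 1-6 exist."""
    if not 1 <= level <= 6:
        raise ValueError("heading level must be between 1 and 6")
    return f"<h{level}>{inner}</h{level}>"


def html_list(items, ordered: bool = False) -> str:
    """Numbered lists use <ol>, bulleted lists use <ul>."""
    tag = "ol" if ordered else "ul"
    body = "".join(f"<li>{item}</li>" for item in items)
    return f"<{tag}>{body}</{tag}>"


def table(rows) -> str:
    """Render a list of rows, each a list of already formatted cells."""
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

The inline helpers take already escaped fragments, so they compose: `paragraph(bold(text("Total")) + text(" < 10"))` escapes the `<` once and nests the bold run inside the paragraph.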

Structured Text Extraction

A typical document is more than plain text. The text is usually organized into paragraphs grouped under headers, and it can also contain hyperlinks, lists, and tables. Structured text extraction returns the text together with this structure.
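
As a rough illustration of what "text with its structure" can mean, the following Python sketch models a document as a sequence of typed blocks. The class and function names are hypothetical and do not reflect the GroupDocs.Parser API; they only show the kind of information (header levels, hyperlinks, lists, tables) that a plain string loses.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Paragraph:
    text: str
    heading_level: Optional[int] = None  # 1-6 for headers, None for body text
    hyperlink: Optional[str] = None      # target URL if the paragraph is a link


@dataclass
class ListBlock:
    items: List[str]
    ordered: bool = False


@dataclass
class Table:
    rows: List[List[str]]


Block = Union[Paragraph, ListBlock, Table]


def outline(blocks: List[Block]) -> List[str]:
    """Return the headers only, indented by two spaces per level."""
    return [
        "  " * (block.heading_level - 1) + block.text
        for block in blocks
        if isinstance(block, Paragraph) and block.heading_level
    ]


document = [
    Paragraph("Introduction", heading_level=1),
    Paragraph("Some body text."),
    ListBlock(["first", "second"], ordered=True),
    Paragraph("Details", heading_level=2),
    Table([["name", "value"], ["a", "1"]]),
]
```

With this model, a consumer can build a table of contents from `outline(document)` or skip straight to the tables, neither of which is possible once the content has been flattened to raw text.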

Any extractor that implements the IStructuredExtractor interface can extract text together with its structure. Currently, the IStructuredExtractor interface is implemented by:

Image Extraction 

This feature allows extracting images from documents. Currently, the following file formats are supported by this feature:

Text Analysis API

This feature extracts text areas and images from document pages. Currently, the text analysis API is supported by:
