
Supported File Formats

Supported Formats for Text Extraction

Each supported mode is marked by its own circle:

RAW text extractor

Formatted text extractor

Structured text extractor

Metadata extractor

Supported as container

Format    GroupDocs.Parser
Text Documents 
.doc
.dot   
.docx   
.docm   
Word XML (.xml)   
.rtf  
.odt   
.txt 
.md  
Spreadsheets 
.xls   
.xlsx   
.csv  
.xlsm   
.xlsb   
.ods   
Tab Separated Values  
SpreadsheetML (.xml)   
Presentations 
.ppt   
.pptx   
.pptm   
.pps   
.ppsx   
.ppsm   
.odp   
OneNote notebooks 
.one
Email 
.msg   
.eml   
.emlx   
TNEF (winmail.dat) 
.pst
.ost
Microsoft Exchange Server
POP
IMAP
Electronic Publication Format 
.epub   
.fb2 (FictionBook)   
Portable Document Format 
.pdf
PDF portfolio
Encrypted PDF
DOM-based documents 
.xml
.html
.xhtml
.mhtml
Compression and packaging formats 
.zip
.chm
Databases 
ADO.NET


Password-protected documents can be opened in the following formats:

  • Spreadsheets
  • Presentations
  • Text documents
  • PDFs
  • OneNote sections
  • ZIP archives

The following text and presentation templates are also supported for text extraction:

  • dotx (Template)
  • dotm (Macro-enabled template)
  • ott (OpenDocument Text Template)
  • potx (Template)
  • potm (Macro-enabled template)
  • ppsm (Macro-enabled slideshow)
  • pptm (Macro-enabled presentation)

Metadata Extraction

The following table lists the formats supported for metadata extraction, along with the metadata properties that GroupDocs.Parser can extract from each.

Metadata Property Name    .docx    .doc    .dot    .odt    .xlsx    .xls    .ods    .pptx    .ppt    .odp    .pdf    .msg    .eml    .emlx    .epub    .fb2
Application      
ApplicationVersion        
Template            
Title   
Subject
Comments      
Keywords    
ContentStatus           
Category         
Manager         
Author   
LastAuthor      
Company         
HyperlinkBase          
CreatedTime     
LastSavedTime     
LastPrintedTime        
RevisionNumber         
TotalEditingTime            
EmailFrom             
EmailTo             
EmailCC             
Description              
Language              
Copyrights               
Publisher              
PublishedDate              

The following text and presentation templates are also supported for metadata extraction:

  • dotx (Template)
  • dotm (Macro-enabled template)
  • ott (OpenDocument Text Template)
  • potx (Template)
  • potm (Macro-enabled template)
  • ppsm (Macro-enabled slideshow)
  • pptm (Macro-enabled presentation)

Encoding Detection

Supported

Not supported

Encoding    BOM    Content
UTF32 LE
UTF32 BE
UTF16 LE (see note 1)
UTF16 BE (see note 1)
UTF8
UTF7
ANSI (see note 2)

 1. UTF-16 is detected from content by searching for the space character (U+0020).
 2. ANSI is detected from content by checking that the data contains no 0x00 bytes.
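
The two detection paths in the table (BOM and content) can be illustrated with a minimal sketch. This is not GroupDocs.Parser code: the function names are hypothetical, UTF-7 is left out, `"ansi"` is only a label rather than a real codec name, and checking code units at even offsets is an assumption about how the space-character search from note 1 might be done.

```python
import codecs

# BOM signatures, UTF-32 first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the longer signature must be tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_by_bom(data: bytes):
    """Return an encoding name if the data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_by_content(data: bytes):
    """Content heuristics modelled on notes 1 and 2 (an illustrative assumption)."""
    # Split into 2-byte code units at even offsets.
    units = [data[i:i + 2] for i in range(0, len(data) - 1, 2)]
    if b"\x20\x00" in units:   # U+0020 encoded as UTF-16 LE
        return "utf-16-le"
    if b"\x00\x20" in units:   # U+0020 encoded as UTF-16 BE
        return "utf-16-be"
    if b"\x00" not in data:    # no zero bytes at all: treat as ANSI
        return "ansi"
    return None


def detect_encoding(data: bytes):
    """Try the BOM first, then fall back to the content heuristics."""
    return detect_by_bom(data) or detect_by_content(data)
```

For example, `detect_encoding("a b".encode("utf-16-le"))` has no BOM to rely on and is classified by the space-character search, while plain ASCII bytes fall through to the "no 0x00 bytes" rule.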

Text Formatters

Element    Plain    Markdown    Html
Text
List
Table
Hyperlink
Nested table

Markdown

Currently, the following formatting is supported:

  • Bold text
  • Italic text
  • Hyperlinks
  • Headings
  • Numbered and bulleted lists
  • Tables
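
As an illustration of what these elements look like in Markdown, here is a minimal sketch. The helper names are hypothetical and the exact syntax (asterisks for emphasis, ATX headings, pipe tables in the GitHub Flavored Markdown style) is an assumption; it is not a description of the text that GroupDocs.Parser emits.

```python
def bold(text):
    return f"**{text}**"


def italic(text):
    return f"*{text}*"


def hyperlink(text, url):
    return f"[{text}]({url})"


def heading(text, level):
    # ATX-style heading: one '#' per heading level.
    return "#" * level + " " + text


def bullet_list(items):
    return "\n".join(f"- {item}" for item in items)


def numbered_list(items):
    return "\n".join(f"{n}. {item}" for n, item in enumerate(items, start=1))


def table(header, rows):
    # Pipe table: a header row, a delimiter row, then one line per data row.
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)
```

Pipe tables are an extension rather than part of core Markdown, so a renderer that follows only the base syntax will show them as plain text.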

HTML

Currently, the following HTML tags are supported:

  • <p>: each paragraph is wrapped in a <p> tag
  • <a>: hyperlinks
  • <b>: bold text is wrapped in a <b> tag
  • <i>: italic text is wrapped in an <i> tag
  • <h1> – <h6>: a heading with the 'Heading X' style is wrapped in an <hX> tag
  • <ol>/<ul>: numbered and bulleted lists
  • <table>: tables
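
The tag mapping above can be sketched as follows. This is an illustrative assumption rather than the GroupDocs.Parser formatter: the function names and the `style` parameter are hypothetical, and the `<li>` items are added only because valid HTML lists need them.

```python
import html
import re


def render_run(text, bold=False, italic=False, href=None):
    """Wrap an escaped run of text in <i>, <b> and <a> as requested."""
    out = html.escape(text)
    if italic:
        out = f"<i>{out}</i>"
    if bold:
        out = f"<b>{out}</b>"
    if href is not None:
        out = f'<a href="{html.escape(href, quote=True)}">{out}</a>'
    return out


def render_block(content, style="Normal"):
    """Use <hX> for a 'Heading X' style (X from 1 to 6), otherwise <p>."""
    match = re.fullmatch(r"Heading ([1-6])", style)
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{content}</{tag}>"


def render_list(items, ordered=False):
    """Numbered lists become <ol>, bulleted lists become <ul>."""
    tag = "ol" if ordered else "ul"
    body = "".join(f"<li>{item}</li>" for item in items)
    return f"<{tag}>{body}</{tag}>"
```

For instance, `render_block("Title", style="Heading 2")` yields `<h2>Title</h2>`, while any other style falls back to a `<p>` paragraph.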

Structured Text Extraction

A typical document is more than plain text. The text is usually organized into paragraphs grouped under headings, and it can also contain hyperlinks, lists, and tables. Structured text extraction returns the text together with this structure.

Any extractor that implements the IStructuredExtractor interface can extract text together with its structure. Currently, the IStructuredExtractor interface is implemented by:

  • CellsTextExtractor
  • WordsTextExtractor
  • SlidesTextExtractor
  • EmailTextExtractor
  • EpubTextExtractor
  • FictionBookTextExtractor
  • MarkdownTextExtractor
  • ChmTextExtractor
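
The idea of checking for an interface before asking for structure can be sketched in a few lines. Everything here is a hypothetical Python stand-in: `StructuredExtractor`, `PlainExtractor`, `ToyMarkdownExtractor`, the `extract_structured` method and the handler callback are illustrative names, not the GroupDocs.Parser API.

```python
from abc import ABC, abstractmethod


class StructuredExtractor(ABC):
    """Hypothetical stand-in for an interface such as IStructuredExtractor."""

    @abstractmethod
    def extract_structured(self, handler):
        """Call handler(kind, level, text) once per structural element."""


class PlainExtractor:
    """Hypothetical extractor that only returns raw text."""

    def __init__(self, text):
        self.text = text

    def extract_text(self):
        return self.text


class ToyMarkdownExtractor(PlainExtractor, StructuredExtractor):
    """Toy extractor that recognises headings, list items and paragraphs."""

    def extract_structured(self, handler):
        for line in self.text.splitlines():
            if line.startswith("#"):
                level = len(line) - len(line.lstrip("#"))
                handler("heading", level, line.lstrip("#").strip())
            elif line.startswith("- "):
                handler("list_item", 0, line[2:])
            elif line.strip():
                handler("paragraph", 0, line.strip())


def extract_outline(extractor):
    """Use structured extraction when available, otherwise fall back to raw text."""
    if not isinstance(extractor, StructuredExtractor):
        return [("paragraph", 0, extractor.extract_text())]
    events = []
    extractor.extract_structured(
        lambda kind, level, text: events.append((kind, level, text))
    )
    return events
```

A caller written this way works with every extractor and only gains headings and list items for those that implement the structured interface.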

Text Analysis API

This feature extracts text areas and images from document pages. Currently, the text analysis API is supported by:

  • PdfTextExtractor
  • CellsTextExtractor
  • SlidesTextExtractor
  • WordsTextExtractor