Supported Document Formats

Supported File Formats

The following tables list the file formats from which GroupDocs.Parser for Java can extract data. Each (tick) marks a feature that is supported for that format.
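
Before passing a file to the parser, it can be convenient to check its extension against the categories listed on this page. The sketch below is plain Java and makes no GroupDocs.Parser calls; the class `FormatCategories` and the method `categoryOf` are hypothetical names, and the lookup table deliberately covers only a subset of the extensions from the tables.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not part of GroupDocs.Parser): maps a file extension to a category from this page. */
final class FormatCategories {

    // Partial table for illustration; extend it with the remaining extensions as needed.
    private static final Map<String, String> CATEGORY_BY_EXTENSION = Map.ofEntries(
            Map.entry("doc", "Word Processing"),
            Map.entry("docx", "Word Processing"),
            Map.entry("odt", "Word Processing"),
            Map.entry("rtf", "Word Processing"),
            Map.entry("pdf", "PDF"),
            Map.entry("md", "Markup"),
            Map.entry("epub", "Ebook"),
            Map.entry("xlsx", "Spreadsheet"),
            Map.entry("csv", "Spreadsheet"),
            Map.entry("pptx", "Presentation"),
            Map.entry("msg", "Email"),
            Map.entry("one", "Note"),
            Map.entry("zip", "Archive"),
            Map.entry("png", "Image"));

    private FormatCategories() {}

    /** Returns the category for the file's extension, or empty if the extension is missing or not in the table. */
    static Optional<String> categoryOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        // Extensions are matched case-insensitively.
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(CATEGORY_BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(categoryOf("report.DOCX").orElse("unsupported")); // prints "Word Processing"
        System.out.println(categoryOf("video.mp4").orElse("unsupported"));   // prints "unsupported"
    }
}
```

An extension check like this is only a first filter: it says which table a file belongs to, not which features are available for it.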

Word Processing

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
DOC
Microsoft Word Document
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
DOT
Microsoft Word Document Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
DOCX
Office Open XML Document
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
DOCM
Office Open XML Macro-Enabled Document
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
DOTX
Office Open XML Document Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
DOTM
Office Open XML Document Macro-Enabled Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
TXT
Plain text
(tick)
ODT
Open Document Text
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
OTT
Open Document Text Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
RTF
Rich Text Format
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)

PDF

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PDF
Portable Document Format File
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)

Markup

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
XHTML
Extensible Hypertext Markup Language File
(tick)(tick)
MHTML
MIME HTML File
(tick)(tick)
MD
Markdown
(tick)(tick)
(Formatted text is not supported)
XML
XML File
(tick)

Ebook

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
CHM
Compiled HTML Help File
(tick)(tick)(tick)(tick)
EPUB
Digital E-Book File Format
(tick)(tick)(tick)(tick)
FB2
FictionBook 2.0 File
(tick)(tick)(tick)
MOBI
Mobipocket
(tick)
AZW3
Kindle Format 8
(tick)

Spreadsheet

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
XLS
Microsoft Excel Spreadsheet
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLT
Microsoft Excel Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLSX
Office Open XML Spreadsheet
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLSM
Office Open XML Macro-Enabled Spreadsheet
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLSB
Office Open XML Binary Spreadsheet
(tick)(tick)(tick)(tick)(tick)
XLTX
Office Open XML Spreadsheet Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLTM
Office Open XML Macro-Enabled Spreadsheet Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
ODS
Open Document Spreadsheet
(tick)(tick)(tick)(tick)(tick)
OTS
Open Document Spreadsheet Template
(tick)(tick)(tick)(tick)(tick)
CSV
Comma Separated Values
(tick)
XLA
Excel Add-In File
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
XLAM
Excel Open XML Macro-Enabled Add-In
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
NUMBERS
Apple iWork Numbers
(tick)(tick)(tick)(tick)

Presentation

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PPT
PowerPoint Presentation
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
PPS
PowerPoint Slideshow
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
POT
PowerPoint Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
PPTX
Office Open XML Presentation
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
PPTM
Office Open XML Macro-Enabled Presentation
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
POTX
Office Open XML Presentation Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
POTM
Office Open XML Macro-Enabled Presentation Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
PPSX
Office Open XML Presentation Slideshow
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
PPSM
Office Open XML Macro-Enabled Presentation Slideshow
(tick)(tick)(tick)(tick)(tick)(tick)(tick)(tick)
ODP
Open Document Presentation
(tick)(tick)(tick)(tick)(tick)(tick)(tick)
OTP
Open Document Presentation Template
(tick)(tick)(tick)(tick)(tick)(tick)(tick)

Email

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PST
Outlook Personal Information Store File
(tick)
OST
Outlook Offline Data File
(tick)
EML
E-Mail Message
(tick)(tick)(tick)(tick)(tick)
EMLX
Apple Mail Message
(tick)(tick)(tick)(tick)(tick)
MSG
Outlook Mail Message
(tick)(tick)(tick)(tick)(tick)

Note

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
ONE
OneNote Document
(tick)

Archive

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
ZIP
Zipped File
(tick)(tick)
RAR
RAR File
(tick)(tick)
TAR
Tar File
(tick)(tick)
GZ
GZip file
(tick)(tick)
BZ2
BZip2 File
(tick)(tick)

Image

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
BMP
Bitmap Image file
(tick)
GIF
Graphics Interchange Format
(tick)
JP2
JPEG 2000
(tick)
JPG, JPEG
JPEG Image file
(tick)
PNG
Portable Network Graphics
(tick)
TIF, TIFF
Tagged Image File Format
(tick)
DICOM
DICOM (Digital Imaging and Communications in Medicine)
(tick)
DJVU
DjVu File Format
(tick)
EMF
Enhanced metafile
(tick)
J2K
JPEG 2000
(tick)
PS
PostScript File Format
(tick)
PSD
Photoshop Document
(tick)
SVG
Scalable Vector Graphics file
(tick)
SVGZ
Scalable Vector Graphics file (with gzip compression)
(tick)
WEBP
WebP Image File Format
(tick)
WMF
Microsoft Windows Metafile
(tick)

Database

Databases are supported via JDBC. To work with a particular database, install the corresponding database provider (its JDBC driver).

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
JDBC(tick)       (tick)
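
Because database support depends on a provider being installed, it may help to confirm that one is available before attempting to parse. The sketch below uses only the standard `java.sql.DriverManager` API and makes no GroupDocs.Parser calls. It assumes that "database provider" here means a JDBC driver on the classpath; the class name `JdbcDriverCheck` and the sample SQLite URL are illustrative assumptions.

```java
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical helper (not part of GroupDocs.Parser): checks whether a JDBC driver is registered for a URL. */
final class JdbcDriverCheck {

    private JdbcDriverCheck() {}

    /** Returns true if some registered JDBC driver accepts the given connection URL. */
    static boolean isDriverAvailable(String jdbcUrl) {
        try {
            // getDriver throws SQLException when no registered driver understands the URL.
            DriverManager.getDriver(jdbcUrl);
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The default URL is only an example; pass your own connection URL as the first argument.
        String url = args.length > 0 ? args[0] : "jdbc:sqlite:sample.db";
        System.out.println(url + " -> driver available: " + isDriverAvailable(url));
    }
}
```

If the check returns false, the usual remedy is to add the database vendor's driver JAR to the application's classpath or build dependencies.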