In some cases, you must specify the document format manually to guarantee that GroupDocs.Parser produces correct output. These cases are listed under "When to use format specification" below.
Here are the steps to specify the document format:
Instantiate a LoadOptions object and pass the document format to its constructor.
Create a Parser object with the file path (or stream) and the LoadOptions object.
Call the extraction methods as usual.
Load Markdown document
The following example shows how to specify the document format for a Markdown document:
```python
from groupdocs.parser import Parser
from groupdocs.parser.options import LoadOptions, FileFormat

# Create LoadOptions for Markdown format
load_options = LoadOptions(FileFormat.MARKUP)

# Open the file stream
with open("sample.md", "rb") as stream:
    # Create an instance of Parser class with LoadOptions
    with Parser(stream, load_options) as parser:
        # Extract text from the Markdown document
        text_reader = parser.get_text()
        if text_reader is not None:
            # Print the extracted text
            # Markdown is detected; text without special symbols is printed
            print(text_reader)
        else:
            print("Text extraction isn't supported")
```
The following example shows how to specify the document format for an OTP (OpenDocument Presentation Template) document:
```python
from groupdocs.parser import Parser
from groupdocs.parser.options import LoadOptions, FileFormat

# Create LoadOptions for OTP (OpenDocument Presentation Template) format
load_options = LoadOptions(FileFormat.OTP)

# Load OTP document
with Parser("template.otp", load_options) as parser:
    # Get document info
    doc_info = parser.get_document_info()
    print(f"File type: {doc_info.file_type.file_format}")
    print(f"Pages: {doc_info.page_count}")

    # Extract text
    text_reader = parser.get_text()
    if text_reader:
        print(text_reader)
```
Detect format and load appropriately
You can create a helper function that picks the format from the file extension and loads the document accordingly:
```python
from groupdocs.parser import Parser
from groupdocs.parser.options import LoadOptions, FileFormat
import os


def get_file_format(file_path):
    """Detect file format based on extension"""
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()

    format_map = {
        '.md': FileFormat.MARKUP,
        '.markdown': FileFormat.MARKUP,
        '.mhtml': FileFormat.MHTML,
        '.mht': FileFormat.MHTML,
        '.otp': FileFormat.OTP,
    }

    return format_map.get(ext)


def load_document_with_format(file_path):
    """Load document with appropriate format specification"""
    file_format = get_file_format(file_path)

    if file_format:
        print(f"Loading with explicit format: {file_format}")
        load_options = LoadOptions(file_format)
        parser = Parser(file_path, load_options)
    else:
        print("Loading with auto-detection")
        parser = Parser(file_path)

    return parser


# Usage examples
with load_document_with_format("webpage.mhtml") as parser:
    text_reader = parser.get_text()
    if text_reader:
        print(text_reader)
```
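The extension lookup in the helper can be exercised without GroupDocs.Parser installed. The sketch below is a stand-alone variant that returns the format name as a plain string instead of a FileFormat member. The function name `explicit_format_name` and the `EXPLICIT_FORMATS` table are illustrative assumptions, not part of the library.

```python
from pathlib import PurePath
from typing import Optional

# Extensions that this page says need an explicit format.
# The values are plain strings chosen to mirror the FileFormat member names
# used in the examples; this table is illustrative, not a library API.
EXPLICIT_FORMATS = {
    ".md": "MARKUP",
    ".markdown": "MARKUP",
    ".mhtml": "MHTML",
    ".mht": "MHTML",
    ".otp": "OTP",
}


def explicit_format_name(file_path: str) -> Optional[str]:
    """Return the format name to pass explicitly, or None for auto-detection."""
    # PurePath.suffix yields only the last extension, e.g. ".md" for "notes.v2.md"
    ext = PurePath(file_path).suffix.lower()
    return EXPLICIT_FORMATS.get(ext)


print(explicit_format_name("README.MD"))    # MARKUP
print(explicit_format_name("archive.mht"))  # MHTML
print(explicit_format_name("report.pdf"))   # None
```

Lower-casing the suffix keeps the lookup case-insensitive, so `README.MD` and `readme.md` resolve to the same entry. Mapping the returned name back to a FileFormat value is left out here because it depends on the installed library.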
Available file formats
The FileFormat enumeration includes the following formats:
```python
from groupdocs.parser.options import FileFormat

# Common document formats (auto-detected)
# - PDF
# - DOC, DOCX
# - XLS, XLSX
# - PPT, PPTX
# - TXT
# - etc.

# Formats that may require explicit specification:
formats_needing_specification = {
    "Markdown": FileFormat.MARKUP,
    "MHTML": FileFormat.MHTML,
    "OTP": FileFormat.OTP,
}

for name, format_value in formats_needing_specification.items():
    print(f"{name}: {format_value}")
```
Error handling when loading specific formats
```python
from groupdocs.parser import Parser
from groupdocs.parser.options import LoadOptions, FileFormat


def safe_load_with_format(file_path, file_format):
    """Safely load a document with specific format and error handling"""
    try:
        # Create LoadOptions
        load_options = LoadOptions(file_format)

        # Create Parser instance
        with Parser(file_path, load_options) as parser:
            # Verify loading was successful
            doc_info = parser.get_document_info()
            print(f"Document loaded: {doc_info.file_type.file_format}")

            # Extract text
            text_reader = parser.get_text()
            if text_reader:
                return text_reader
            else:
                print("Text extraction not supported")
                return None

    except Exception as e:
        print(f"Error loading document with format {file_format}: {e}")

        # Try loading without format specification
        try:
            print("Attempting to load with auto-detection...")
            with Parser(file_path) as parser:
                text_reader = parser.get_text()
                if text_reader:
                    return text_reader
        except Exception as e2:
            print(f"Auto-detection also failed: {e2}")

    return None


# Example usage
text = safe_load_with_format("document.md", FileFormat.MARKUP)
```
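The retry order used above (explicit format first, auto-detection second) can also be separated from the library calls, which makes it easy to test in isolation. The following is a minimal sketch under that assumption: `first_successful` and the two stand-in loaders are hypothetical names, and in real use each loader would wrap a Parser call like those shown on this page. Unlike the function above, this sketch also moves on to the next loader when one returns no text.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Loader = Callable[[str], Optional[str]]


def first_successful(
    file_path: str, loaders: Sequence[Tuple[str, Loader]]
) -> Tuple[Optional[str], List[str]]:
    """Try each (label, loader) in order; return the first non-empty text
    together with messages describing the attempts that failed."""
    failures: List[str] = []
    for label, loader in loaders:
        try:
            text = loader(file_path)
        except Exception as exc:  # broad on purpose: any load error triggers the fallback
            failures.append(f"{label}: {exc}")
            continue
        if text:
            return text, failures
        failures.append(f"{label}: no text extracted")
    return None, failures


# Stand-in loaders for demonstration only; they do not call GroupDocs.Parser.
def explicit_loader(path: str) -> Optional[str]:
    raise ValueError("format mismatch")


def auto_loader(path: str) -> Optional[str]:
    return f"text of {path}"


text, failures = first_successful(
    "document.md",
    [("explicit format", explicit_loader), ("auto-detection", auto_loader)],
)
print(text)      # text of document.md
print(failures)  # ['explicit format: format mismatch']
```

Returning the failure messages instead of printing them lets the caller decide whether a fallback should be logged, surfaced to the user, or ignored.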
When to use format specification
Use explicit format specification when:
Working with Markdown files - Always specify FileFormat.MARKUP for .md files
Processing MHTML web archives - Specify FileFormat.MHTML for .mhtml or .mht files
Loading OTP templates - Specify FileFormat.OTP for OpenDocument presentation templates
Connecting to databases - Specify the appropriate database format
Fetching emails from remote servers - Specify the email format when loading over the network
In most other cases, GroupDocs.Parser can automatically detect the document format based on file extension and content.
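To illustrate why content-based detection cannot identify every format, the sketch below checks a file's leading bytes against a few well-known signatures. It is not a description of how GroupDocs.Parser detects formats internally, which this page does not specify; `sniff_container` and `SIGNATURES` are hypothetical names used only for illustration. Plain-text formats such as Markdown carry no signature at all, which is consistent with the advice to specify them explicitly.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers").
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip container (e.g. DOCX, XLSX, PPTX, OpenDocument)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE compound file (e.g. DOC, XLS, PPT)"),
)


def sniff_container(header: bytes) -> Optional[str]:
    """Return a coarse container type for the given leading bytes, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(sniff_container(b"%PDF-1.7\n..."))         # pdf
print(sniff_container(b"# Title\n\nSome text"))  # None
```

An OTP template is itself a ZIP-based OpenDocument package, so a signature check of this kind would only report a generic ZIP container for it.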
More resources
GitHub examples
You can find more code examples in our GitHub repository.
Along with the full-featured library, we provide a free online document parser app. You are welcome to extract data from PDF, DOCX, XLSX, and more with our Free Online Document Parser App.