With GroupDocs.Parser you can retrieve the following information about a document:
file_type represents the document file type (PDF, Word document, Excel spreadsheet, PowerPoint presentation, image, etc.)
page_count represents the number of pages in a document
size represents the document file size in bytes
The following code samples show how to get document information:
Get document info from a local file
    from groupdocs.parser import Parser

    # Create an instance of Parser class
    with Parser("./sample.docx") as parser:
        # Get the document info
        doc_info = parser.get_document_info()

        # Print document information
        print(f"File type: {doc_info.file_type.file_format}")
        print(f"Page count: {doc_info.page_count}")
        print(f"File size: {doc_info.size} bytes")

        # Print file extension
        print(f"File extension: {doc_info.file_type.extension}")
The following sample file is used in this example: sample.docx
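When several documents are inspected in a row, it can help to gather the properties listed above into a plain dictionary for logging or comparison. The sketch below is one way to do that. The helper name `summarize_info` and the stand-in object with made-up values are assumptions for illustration, not part of the library; with the real library you would pass the result of `parser.get_document_info()` instead.

```python
from types import SimpleNamespace


def summarize_info(doc_info):
    """Collect the documented properties of a document-info object into a dict."""
    return {
        "format": doc_info.file_type.file_format,
        "extension": doc_info.file_type.extension,
        "pages": doc_info.page_count,
        "size_bytes": doc_info.size,
    }


# Stand-in object with made-up values, used only so this sketch runs
# without the library installed (hypothetical data, not real output).
fake_info = SimpleNamespace(
    file_type=SimpleNamespace(file_format="Word document", extension=".docx"),
    page_count=3,
    size=20480,
)

summary = summarize_info(fake_info)
print(summary)
```

The helper relies only on the attribute names used in the example above, so it should work with any object that exposes them.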
Get document info from a stream
    from groupdocs.parser import Parser

    # Open the file stream
    with open("sample.pdf", "rb") as stream:
        # Create an instance of Parser class with the stream
        with Parser(stream) as parser:
            # Get the document info
            doc_info = parser.get_document_info()

            # Print document information
            print(f"File type: {doc_info.file_type.file_format}")
            print(f"Page count: {doc_info.page_count}")
            print(f"File size: {doc_info.size} bytes")
The following sample file is used in this example: sample.pdf
Check document properties before extraction
It’s useful to check document properties before performing extraction operations:
    from groupdocs.parser import Parser


    def process_document(file_path):
        with Parser(file_path) as parser:
            # Get document information
            doc_info = parser.get_document_info()

            print(f"Processing: {file_path}")
            print(f"Type: {doc_info.file_type.file_format}")
            print(f"Pages: {doc_info.page_count}")
            print(f"Size: {doc_info.size / 1024:.2f} KB")

            # Process based on page count
            if doc_info.page_count > 0:
                print("Document has pages, proceeding with text extraction...")
                text_reader = parser.get_text()
                if text_reader:
                    print(text_reader)
            else:
                print("Document has no pages or page count is not available")


    # Process different document types
    process_document("sample.pdf")
The following sample file is used in this example: sample.pdf
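The pre-extraction check can also be separated from the parsing code so that the decision is easy to test on its own. The following is a minimal sketch under stated assumptions: `format_size`, `extraction_check`, and the 50 MB default limit are hypothetical and chosen for illustration; only the page count and the size in bytes come from the document information described above.

```python
def format_size(size_bytes):
    """Return a human-readable size using 1024-based units."""
    size = float(size_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1024 or unit == "GB":
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.2f} {unit}"
        size /= 1024


def extraction_check(page_count, size_bytes, max_bytes=50 * 1024 * 1024):
    """Return (ok, reason) describing whether extraction should proceed.

    max_bytes is an arbitrary illustrative limit, not a library setting.
    """
    if page_count <= 0:
        return False, "no pages or page count not available"
    if size_bytes > max_bytes:
        return False, f"file too large ({format_size(size_bytes)})"
    return True, f"{page_count} page(s), {format_size(size_bytes)}"


print(extraction_check(3, 2048))
print(extraction_check(0, 2048))
```

In the example above, the values would come from `doc_info.page_count` and `doc_info.size`.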
Get page-specific information
For multi-page documents, you can also get information about individual pages:
    from groupdocs.parser import Parser

    with Parser("./sample.docx") as parser:
        # Get document info
        doc_info = parser.get_document_info()

        print(f"Total pages: {doc_info.page_count}")

        # Iterate through pages
        for page_index in range(doc_info.page_count):
            # Extract text from each page
            print(f"---Page{page_index + 1}---")
            text_reader = parser.get_text(page_index)
            if text_reader:
                page_text = text_reader
                print(f"Characters: {len(page_text)}")
The following sample file is used in this example: sample.docx
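The per-page loop above can be wrapped in a small helper that returns the counts instead of printing them. This sketch assumes only that some callable maps a zero-based page index to that page's text (or to an empty or missing value); `page_character_counts` and the stand-in page list are hypothetical names introduced for illustration.

```python
def page_character_counts(page_count, get_page_text):
    """Return a list of (page_number, character_count) pairs.

    get_page_text is any callable that maps a zero-based page index to the
    page text, or to None / an empty string when no text is available.
    """
    counts = []
    for page_index in range(page_count):
        text = get_page_text(page_index)
        # Page numbers are reported 1-based; missing text counts as 0.
        counts.append((page_index + 1, len(text) if text else 0))
    return counts


# Stand-in pages so the sketch runs without the library installed.
sample_pages = ["First page", "", "Third page text"]
counts = page_character_counts(len(sample_pages), lambda i: sample_pages[i])
print(counts)
```

With the library, the callable could be something like `lambda i: parser.get_text(i)`, assuming the returned value supports `len()` as in the example above.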
Working with unsupported formats
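If a document format does not support a feature such as page counting, the API returns a default value (for example, a page count of 0). The sample below also catches any exception raised while opening or reading the file: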
    from groupdocs.parser import Parser

    try:
        with Parser("./unknown.format") as parser:
            doc_info = parser.get_document_info()

            if doc_info:
                print(f"File type: {doc_info.file_type.file_format}")

                # Some formats may not have page count
                if doc_info.page_count == 0:
                    print("Page count is not available for this format")
            else:
                print("Could not retrieve document information")
    except Exception as e:
        print(f"Error: {e}")
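When many files are processed in a batch, the same defensive pattern can be factored into reusable helpers so that one unreadable file does not stop the run. The sketch below is illustrative only: `try_get_info` and `describe_page_count` are hypothetical helpers, and the loader passed to `try_get_info` stands in for whatever code opens the parser and calls `get_document_info()`.

```python
def describe_page_count(page_count):
    """Return a display string, treating 0 as 'not available' as in the sample above."""
    return "not available" if page_count == 0 else str(page_count)


def try_get_info(load_info):
    """Call load_info() and return (info, error_message).

    Exactly one of the two values is None: info on failure, error on success.
    """
    try:
        return load_info(), None
    except Exception as exc:  # broad on purpose, mirroring the sample above
        return None, str(exc)


def failing_loader():
    # Stand-in for a loader that hits an unsupported file (hypothetical error).
    raise ValueError("unsupported format")


info, error = try_get_info(failing_loader)
print(info, error)
print(describe_page_count(0))
```

Catching `Exception` broadly follows the sample above; a narrower exception type would be preferable if the library documents one.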
More resources
Advanced usage topics
To learn more about document data extraction features and how to extract text, images, metadata, and more, please refer to the advanced usage section.
GitHub examples
You may find more code examples in our GitHub repository.
Along with the full-featured library, we provide a free online document parser app. You are welcome to extract data from PDF, DOCX, XLSX, and more with our Free Online Document Parser App.