Get document info

With GroupDocs.Parser you can retrieve the following information about a document:

  • file_type represents the document file type (PDF, Word document, Excel spreadsheet, PowerPoint presentation, image, etc.)
  • page_count represents the number of pages in a document
  • size represents the document file size in bytes

The following code samples show how to get document information:

Get document info from a local file

from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.docx") as parser:
    # Get the document info
    doc_info = parser.get_document_info()
    
    # Print document information
    print(f"File type: {doc_info.file_type.file_format}")
    print(f"Page count: {doc_info.page_count}")
    print(f"File size: {doc_info.size} bytes")
    
    # Print file extension
    print(f"File extension: {doc_info.file_type.extension}")

The following sample file is used in this example: sample.docx
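Because size is reported in bytes, a small formatting helper can make the printed value easier to read. The following is a minimal sketch, not part of GroupDocs.Parser: format_size is a hypothetical helper name, and it assumes binary units (1 KB = 1024 bytes).

```python
def format_size(size_bytes: int) -> str:
    """Format a byte count using binary units (assumption: 1 KB = 1024 bytes)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    value = float(size_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit
        if value < 1024 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(value)} bytes"
            return f"{value:.2f} {unit}"
        value /= 1024


# Example values only; in practice pass doc_info.size
print(format_size(512))
print(format_size(1536))
```

With a real document, the call would look like print(f"File size: {format_size(doc_info.size)}").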

Get document info from a stream

from groupdocs.parser import Parser

# Open the file stream
with open("sample.pdf", "rb") as stream:
    # Create an instance of Parser class with the stream
    with Parser(stream) as parser:
        # Get the document info
        doc_info = parser.get_document_info()
        
        # Print document information
        print(f"File type: {doc_info.file_type.file_format}")
        print(f"Page count: {doc_info.page_count}")
        print(f"File size: {doc_info.size} bytes")

The following sample file is used in this example: sample.pdf

Check document properties before extraction

It’s useful to check document properties before performing extraction operations:

from groupdocs.parser import Parser

def process_document(file_path):
    with Parser(file_path) as parser:
        # Get document information
        doc_info = parser.get_document_info()
        
        print(f"Processing: {file_path}")
        print(f"Type: {doc_info.file_type.file_format}")
        print(f"Pages: {doc_info.page_count}")
        print(f"Size: {doc_info.size / 1024:.2f} KB")
        
        # Process based on page count
        if doc_info.page_count > 0:
            print("Document has pages, proceeding with text extraction...")
            text_reader = parser.get_text()
            if text_reader:
                print(text_reader)
        else:
            print("Document has no pages or page count is not available")

# Process a sample document
process_document("sample.pdf")

The following sample file is used in this example: sample.pdf
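The same pre-check can be factored into a reusable gate that decides whether a document is worth extracting. This is a sketch under stated assumptions: should_extract and its thresholds (max_pages, max_size_bytes) are hypothetical and not part of the GroupDocs.Parser API. It only reads the page_count and size attributes described above, so a stand-in object is used here in place of a real document info object.

```python
from types import SimpleNamespace


def should_extract(doc_info, max_pages=500, max_size_bytes=50 * 1024 * 1024):
    """Return (ok, reason) based on page_count and size.

    The limits are illustrative defaults, not library recommendations.
    """
    if doc_info.page_count <= 0:
        return False, "no pages or page count not available"
    if doc_info.page_count > max_pages:
        return False, f"too many pages ({doc_info.page_count} > {max_pages})"
    if doc_info.size > max_size_bytes:
        return False, f"file too large ({doc_info.size} bytes)"
    return True, "ok"


# Stand-in for the object returned by parser.get_document_info()
sample_info = SimpleNamespace(page_count=12, size=204_800)
ok, reason = should_extract(sample_info)
print(ok, reason)
```

Returning a reason alongside the boolean keeps the caller's logging simple when a document is skipped.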

Get page-specific information

For multi-page documents, you can also get information about individual pages:

from groupdocs.parser import Parser

with Parser("./sample.docx") as parser:
    # Get document info
    doc_info = parser.get_document_info()
    
    print(f"Total pages: {doc_info.page_count}")
    
    # Iterate through pages
    for page_index in range(doc_info.page_count):
        # Extract text from each page
        print(f"\n--- Page {page_index + 1} ---")
        text_reader = parser.get_text(page_index)
        if text_reader:
            page_text = text_reader
            print(f"Characters: {len(page_text)}")

The following sample file is used in this example: sample.docx
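To collect the per-page character counts instead of printing them, the loop can be wrapped in a small function. This is a minimal sketch: page_char_counts is a hypothetical helper, and get_page_text is any callable that returns the text of a page by zero-based index. It assumes the returned value supports len(), as in the sample above, and treats an empty or missing result as zero characters.

```python
def page_char_counts(page_count, get_page_text):
    """Return a list with the character count of each page."""
    counts = []
    for page_index in range(page_count):
        text = get_page_text(page_index)
        # Pages with no extractable text count as zero characters
        counts.append(len(text) if text else 0)
    return counts


# Stand-in page texts for demonstration
fake_pages = ["First page", "", "Third"]
print(page_char_counts(len(fake_pages), lambda i: fake_pages[i]))
```

With a real parser, the call might look like page_char_counts(doc_info.page_count, lambda i: parser.get_text(i)).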

Working with unsupported formats

If a document format doesn’t support certain features, the API returns a fallback value instead; for example, page_count may be 0 for formats that have no page count:

from groupdocs.parser import Parser

try:
    with Parser("./unknown.format") as parser:
        doc_info = parser.get_document_info()
        
        if doc_info:
            print(f"File type: {doc_info.file_type.file_format}")
            
            # Some formats may not have page count
            if doc_info.page_count == 0:
                print("Page count is not available for this format")
        else:
            print("Could not retrieve document information")
            
except Exception as e:
    print(f"Error: {e}")
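The fallback checks above can be gathered into one function that always produces a printable summary. This is a sketch under assumptions: summarize_info is a hypothetical helper, and it reads the file_type.file_format and page_count attributes defensively with getattr, treating a missing value or a page count of 0 as "not available". Stand-in objects are used in place of real document info.

```python
from types import SimpleNamespace


def summarize_info(doc_info):
    """Build a one-line summary that tolerates missing values."""
    if not doc_info:
        return "Could not retrieve document information"
    file_type = getattr(doc_info, "file_type", None)
    file_format = getattr(file_type, "file_format", None) or "unknown"
    page_count = getattr(doc_info, "page_count", 0) or 0
    pages = str(page_count) if page_count > 0 else "not available"
    return f"{file_format}: pages {pages}"


# Stand-in for a format without a page count
info = SimpleNamespace(file_type=SimpleNamespace(file_format="Plain Text"), page_count=0)
print(summarize_info(info))
print(summarize_info(None))
```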

More resources

Advanced usage topics

To learn more about document data extraction features and how to extract text, images, metadata, and more, please refer to the advanced usage section.

GitHub examples

You can find more code examples in our GitHub repository.

Free online document parser

Along with the full-featured library, we provide a free online document parser app. You are welcome to extract data from PDF, DOCX, XLSX, and more with our Free Online Document Parser App.