Quick Start Guide

Prerequisites

Before you begin, ensure you have:

Python 3.5 or higher installed (the examples on this page use f-strings, which require Python 3.6 or later)
GroupDocs.Parser for Python via .NET installed (see Installation)
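
Both prerequisites can be checked from a script before running the examples. The following is a minimal sketch using only the standard library. The helper names and the `MIN_PYTHON` constant are assumptions made for illustration; the module name `groupdocs.parser` is taken from the import statements used on this page.

```python
import importlib.util
import sys

# Assumption: 3.6 is used as the floor only because the examples on this
# page use f-strings; the page itself states "3.5 or higher".
MIN_PYTHON = (3, 6)


def python_is_recent_enough(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def package_is_installed(module_name="groupdocs.parser"):
    """Return True if the module can be located on the import path.

    Note: locating a submodule imports its parent package as a side effect.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (here, "groupdocs") is missing.
        return False


if __name__ == "__main__":
    print("Python version OK:", python_is_recent_enough())
    print("groupdocs.parser found:", package_is_installed())
```

If either check prints `False`, revisit the Installation page before continuing.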

Extract Text from a Document

The most common task is extracting text from documents. Here’s how to do it:

Python

from groupdocs.parser import Parser

def extract_text_from_document():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract text from the document
        text_reader = parser.get_text()
        
        if text_reader is not None:
            # Print the extracted text
            print(text_reader)
        else:
            print("Text extraction isn't supported for this format")

if __name__ == "__main__":
    extract_text_from_document()

The following sample file is used in this example: sample.pdf

Get Document Information

You can retrieve basic information about a document:

Python

from groupdocs.parser import Parser

def get_document_information():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Get document info
        info = parser.get_document_info()
        
        print(f"File type: {info.file_type.file_format}")
        print(f"Page count: {info.page_count}")
        print(f"Size: {info.size} bytes")

if __name__ == "__main__":
    get_document_information()

The following sample file is used in this example: sample.pdf
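
The example prints the size as a raw byte count. For larger files a human-readable figure may be easier to scan. The sketch below assumes `info.size` is a number of bytes, as the example's output label suggests; `format_size` is a hypothetical helper, not a library function.

```python
def format_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "GiB":
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.1f} {unit}"
        size /= 1024


print(format_size(512))   # 512 bytes
print(format_size(1536))  # 1.5 KiB
```

In the example above, this could be used as `print(f"Size: {format_size(info.size)}")`.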

Extract Metadata

Extract metadata properties from documents:

Python

from groupdocs.parser import Parser

def extract_metadata():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract metadata
        metadata = parser.get_metadata()
        
        if metadata is not None:
            for item in metadata:
                print(f"{item.name}: {item.value}")

if __name__ == "__main__":
    extract_metadata()

The following sample file is used in this example: sample.pdf
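
When metadata is needed for lookups rather than printing, it can be collected into a dictionary. This is a sketch under the assumption that each item exposes `name` and `value` attributes, as in the example above. `metadata_to_dict` is a hypothetical helper, and the sample items are stand-ins with made-up values so the snippet runs without the library.

```python
from types import SimpleNamespace


def metadata_to_dict(items):
    """Collect objects exposing .name and .value into a plain dict.

    If a name occurs more than once, the last value wins.
    """
    result = {}
    if items is None:
        return result
    for item in items:
        result[str(item.name)] = item.value
    return result


# Stand-in objects for demonstration only; real items would come from
# parser.get_metadata().
sample_items = [
    SimpleNamespace(name="title", value="Quarterly report"),
    SimpleNamespace(name="author", value="J. Doe"),
]
print(metadata_to_dict(sample_items))
```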

Extract Images

Extract images from documents:

Python

from groupdocs.parser import Parser

def extract_images():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract images
        images = parser.get_images()
        
        if images is not None:
            for i, image in enumerate(images):
                # Save image to file
                with open(f"image_{i}.{image.file_type.extension}", "wb") as file:
                    file.write(image.get_image_stream().read())

if __name__ == "__main__":
    extract_images()

The following sample file is used in this example: sample.pdf
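
The example builds each output name as `image_{i}.{extension}`. This page does not state whether `image.file_type.extension` includes a leading dot, so the sketch below tolerates both forms. `image_filename` is a hypothetical helper introduced for illustration.

```python
def image_filename(index, extension, prefix="image"):
    """Build a file name such as 'image_0.png' from an index and extension.

    Accepts the extension with or without a leading dot.
    """
    clean = str(extension).strip().lstrip(".")
    if not clean:
        # No usable extension: fall back to a bare name.
        return f"{prefix}_{index}"
    return f"{prefix}_{index}.{clean}"


print(image_filename(0, "png"))   # image_0.png
print(image_filename(1, ".jpg"))  # image_1.jpg
```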

Extract Text from Specific Page

Extract text from a particular page:

Python

from groupdocs.parser import Parser

def extract_text_from_specific_page():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Get document info to check page count
        info = parser.get_document_info()
        
        if info.page_count > 0:
            # Extract text from the first page (page index is 0-based)
            text_reader = parser.get_text(0)
            
            if text_reader is not None:
                print(text_reader)

if __name__ == "__main__":
    extract_text_from_specific_page()

The following sample file is used in this example: sample.pdf

Check Format Support

Before processing a document, you can check if the format is supported:

Python

from groupdocs.parser import Parser

def check_format_support():
    # Check if file format is supported
    if Parser.get_file_info("./sample.pdf").file_type.file_format != "Unknown":
        print("Format is supported")
        
        # Process the document
        with Parser("./sample.pdf") as parser:
            text_reader = parser.get_text()
            if text_reader is not None:
                print(text_reader)
    else:
        print("Format is not supported")

if __name__ == "__main__":
    check_format_support()

The following sample file is used in this example: sample.pdf
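
When several files need to be processed, the same check can be used to split them into supported and unsupported groups first. This is a minimal sketch: `partition_by_support` and its `get_format` parameter are hypothetical, and the comparison against `"Unknown"` simply mirrors the example above. The lookup is passed in as a callable so the snippet runs without the library.

```python
def partition_by_support(paths, get_format):
    """Split paths into (supported, unsupported) lists.

    get_format is a callable that returns the detected format for a path;
    a result whose string form is "Unknown" is treated as unsupported.
    """
    supported, unsupported = [], []
    for path in paths:
        if str(get_format(path)) != "Unknown":
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


# Stand-in lookup table for demonstration only. With the library, the
# callable could wrap the call from the example above:
#     lambda p: Parser.get_file_info(p).file_type.file_format
fake_formats = {"a.pdf": "Pdf", "b.xyz": "Unknown"}
print(partition_by_support(["a.pdf", "b.xyz"], fake_formats.get))
```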

Next Steps

Now that you’ve learned the basics, explore more advanced features:

Extract text from documents - Learn different text extraction techniques
Working with images - Advanced image extraction
Working with tables - Extract tables from documents
Template-based parsing - Parse structured data using templates
