Extract text from documents

To extract text from a document, call the get_text method of the Parser object:

Python

from groupdocs.parser import Parser

with Parser("./sample.docx") as parser:
    text_reader = parser.get_text()

The method returns a text reader object with the extracted text, or None if text extraction isn't supported for the document format.

Here are the steps to extract text from a document:

1. Instantiate the Parser object with the source document path
2. Call the get_text method and obtain the text reader object
3. Read the text from the reader

The following code snippet shows how to extract text from a document:

Extract text from local files

Python

from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.docx") as parser:
    # Extract text into the reader
    text_reader = parser.get_text()
    
    if text_reader is not None:
        # Print the extracted text
        extracted_text = text_reader
        print(extracted_text)
    else:
        print("Text extraction isn't supported for this format")

The following sample file is used in this example: sample.docx

Extract text from a stream

Python

from groupdocs.parser import Parser

# Open the file stream
with open("sample.pdf", "rb") as stream:
    # Create an instance of Parser class with the stream
    with Parser(stream) as parser:
        # Extract text into the reader
        text_reader = parser.get_text()
        
        if text_reader is not None:
            # Print the extracted text
            extracted_text = text_reader
            print(extracted_text)
        else:
            print("Text extraction isn't supported for this format")

The following sample file is used in this example: sample.pdf

Extract text from specific pages

You can also extract text from specific pages in a document:

Python

from groupdocs.parser import Parser

with Parser("./sample.pdf") as parser:
    # Get document info to check page count
    doc_info = parser.get_document_info()
    
    # Check if the document has pages
    if doc_info.page_count > 0:
        # Extract text from the first page (page index is 0-based)
        text_reader = parser.get_text(0)
        
        # None means text extraction isn't supported for this format
        if text_reader is not None:
            print("Text from page 1:")
            print(text_reader)

The following sample file is used in this example: sample.pdf
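
To process every page rather than only the first, the same calls can be wrapped in a loop. The following is a minimal sketch, not part of the library: the helper names extract_text_by_page and extract_pages_from_file are hypothetical, and converting the reader with str() is an assumption that mirrors how the examples on this page pass the reader directly to print.

```python
from typing import List, Optional


def extract_text_by_page(parser) -> List[Optional[str]]:
    """Return one entry per page: the page text, or None if unsupported.

    Hypothetical helper; `parser` is an already opened Parser object.
    """
    page_count = parser.get_document_info().page_count
    pages: List[Optional[str]] = []
    for page_index in range(page_count):
        # Page indexes are 0-based, as in the single-page example
        text_reader = parser.get_text(page_index)
        # Assumption: str() of the reader yields the page text,
        # mirroring print(text_reader) in the examples on this page
        pages.append(None if text_reader is None else str(text_reader))
    return pages


def extract_pages_from_file(path: str) -> List[Optional[str]]:
    """Open the document at `path` and collect text page by page."""
    # Imported here so extract_text_by_page can be exercised without the library
    from groupdocs.parser import Parser

    with Parser(path) as parser:
        return extract_text_by_page(parser)
```

Calling extract_pages_from_file("./sample.pdf") would then return a list with one item per page. A document that reports a page count of 0 yields an empty list.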

More resources

Advanced usage topics

To learn more about text extraction features, including formatted text, text structure, and search functionality, please refer to the advanced usage section.

GitHub examples

You may find more code examples in our GitHub repository:

GroupDocs.Parser for Python via .NET examples

Free online text extractor

Along with the full-featured library, we provide a free online text extractor app. You are welcome to extract text from your documents with our Free Online Document Parser App.
