Extract text from documents

Text extraction is one of the most fundamental features of GroupDocs.Parser. The API allows you to extract text from:

  • PDF documents
  • Microsoft Office documents (Word, Excel, PowerPoint)
  • Email messages
  • Images (with OCR)
  • eBooks (EPUB, FB2, CHM)
  • And 50+ other formats

The text extractor needs only a few lines of code for the common cases shown on this page. For advanced text extraction scenarios, refer to the advanced usage section.

Extract text from documents

To extract text from documents, simply call the get_text method:

from groupdocs.parser import Parser

with Parser("./sample.docx") as parser:
    text_reader = parser.get_text()

The method returns a text reader object with the extracted text, or None if text extraction isn't supported for the document format.

Here are the steps to extract text from a document:

  1. Instantiate the Parser object with the source document path
  2. Call the get_text method and obtain the text reader object
  3. Read the text from the reader

The following code snippet shows how to extract text from a document:

Extract text from local files

from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.docx") as parser:
    # Extract text into the reader
    text_reader = parser.get_text()
    
    if text_reader is not None:
        # Print the extracted text
        print(text_reader)
    else:
        print("Text extraction isn't supported for this format")

The following sample file is used in this example: sample.docx
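
The three steps listed earlier can be wrapped in a small reusable helper. The sketch below is an illustration, not part of the GroupDocs.Parser API: extract_text and its parser_factory parameter are hypothetical names, and converting the reader with str() is an assumption based on the example above, which prints the reader directly. The groupdocs.parser import is deferred so the helper can also be exercised with a stand-in parser.

```python
def extract_text(source, parser_factory=None):
    """Return the text of `source` as a string, or None if unsupported.

    `source` is a file path or a binary stream, as in the examples on this page.
    `parser_factory` is a hypothetical hook for substituting a test double;
    when omitted, groupdocs.parser.Parser is used.
    """
    if parser_factory is None:
        # Imported lazily so this module loads even without the library installed
        from groupdocs.parser import Parser
        parser_factory = Parser

    # Step 1: instantiate the parser for the source document
    with parser_factory(source) as parser:
        # Step 2: obtain the text reader
        text_reader = parser.get_text()
        if text_reader is None:
            # Text extraction isn't supported for this format
            return None
        # Step 3: read the text.
        # Assumption: the reader converts to its text via str(),
        # mirroring the print call in the example above.
        return str(text_reader)
```

Calling extract_text("./sample.docx") would then return the document text, or None for a format without text extraction support.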

Extract text from a stream

from groupdocs.parser import Parser

# Open the file stream
with open("sample.pdf", "rb") as stream:
    # Create an instance of Parser class with the stream
    with Parser(stream) as parser:
        # Extract text into the reader
        text_reader = parser.get_text()
        
        if text_reader is not None:
            # Print the extracted text
            print(text_reader)
        else:
            print("Text extraction isn't supported for this format")

The following sample file is used in this example: sample.pdf
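
A stream is also useful when the document never touches the disk, for example when it arrives as bytes from a network call. The sketch below wraps such bytes in io.BytesIO before handing them to the parser. It rests on an untested assumption: that Parser accepts any binary file-like object, not only one returned by open(). The names extract_text_from_bytes and parser_factory are hypothetical and are not part of the library.

```python
import io


def extract_text_from_bytes(data, parser_factory=None):
    """Return the text of an in-memory document, or None if unsupported.

    `parser_factory` is a hypothetical hook for substituting a test double;
    when omitted, groupdocs.parser.Parser is used.
    """
    if parser_factory is None:
        # Imported lazily so this module loads even without the library installed
        from groupdocs.parser import Parser
        parser_factory = Parser

    # Wrap the raw bytes in a seekable binary stream.
    # Assumption: Parser treats it like a file opened with mode "rb".
    with io.BytesIO(data) as stream:
        with parser_factory(stream) as parser:
            text_reader = parser.get_text()
            if text_reader is None:
                return None
            # Assumption: str() yields the extracted text
            return str(text_reader)
```

If the assumption does not hold in your environment, writing the bytes to a temporary file and passing its path to Parser is the safer fallback.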

Extract text from specific pages

You can also extract text from specific pages in a document:

from groupdocs.parser import Parser

with Parser("./sample.pdf") as parser:
    # Get document info to check page count
    doc_info = parser.get_document_info()
    
    # Check if the document has pages
    if doc_info.page_count > 0:
        # Extract text from the first page (page index is 0-based)
        text_reader = parser.get_text(0)
        
        if text_reader:
            print("Text from page 1:")
            print(text_reader)

The following sample file is used in this example: sample.pdf
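
The same two calls, get_document_info and get_text with a page index, can be combined to walk every page of a document. The sketch below uses only the calls shown in the example above; the generator name iter_page_texts is hypothetical. It takes an already opened parser, so it makes no assumption about how the parser was created.

```python
def iter_page_texts(parser):
    """Yield (page_number, text_reader) for each page of an open parser.

    Page numbers are 1-based for display, while get_text takes a
    0-based page index. Pages for which get_text returns None are skipped.
    """
    doc_info = parser.get_document_info()
    for page_index in range(doc_info.page_count):
        text_reader = parser.get_text(page_index)
        if text_reader is not None:
            yield page_index + 1, text_reader
```

Inside a with Parser("./sample.pdf") as parser: block, looping over iter_page_texts(parser) and printing each pair would output the document page by page.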

More resources

Advanced usage topics

To learn more about text extraction features, including formatted text, text structure, and search functionality, please refer to the advanced usage section.

GitHub examples

You may find more code examples in our GitHub repository.

Free online text extractor

Along with the full-featured library, we provide a free online text extractor app. You are welcome to extract text from your documents with our Free Online Document Parser App.