Text extraction is one of the most fundamental features of GroupDocs.Parser. The API allows you to extract text from:
PDF documents
Microsoft Office documents (Word, Excel, PowerPoint)
Email messages
Images (with OCR)
eBooks (EPUB, FB2, CHM)
And 50+ other formats
Basic text extraction takes a single method call. For advanced scenarios such as formatted text, text structure, and search, refer to the advanced usage section.
Extract text from documents
To extract text from a document, call the get_text method of the Parser object.
The method returns a text reader object with the extracted text, or None if text extraction isn't supported for the document format.
Here are the steps to extract text from a document:
1. Instantiate the Parser object with the source document path
2. Call the get_text method and obtain the text reader object
3. Read the text from the reader
The following code snippet shows how to extract text from a document:
Extract text from local files
```python
from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.docx") as parser:
    # Extract text into the reader
    text_reader = parser.get_text()
    if text_reader is not None:
        # Print the extracted text
        extracted_text = text_reader
        print(extracted_text)
    else:
        print("Text extraction isn't supported for this format")
```
The following sample file is used in this example: sample.docx
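If you need the extracted text as a value rather than printed output, the same steps can be wrapped in a small helper. The following is a minimal sketch, not part of the GroupDocs.Parser API: the function name extract_text and its parser_factory parameter are hypothetical, and it assumes, as the example above does, that converting the reader to a string yields the extracted text.

```python
from typing import Callable, Optional


def extract_text(path: str, parser_factory: Optional[Callable] = None) -> Optional[str]:
    """Return the document text, or None if extraction isn't supported.

    parser_factory is a hypothetical hook for substituting a stand-in parser
    in tests; by default the real Parser class is used.
    """
    if parser_factory is None:
        # Imported lazily so the helper can be loaded without the library installed
        from groupdocs.parser import Parser
        parser_factory = Parser

    with parser_factory(path) as parser:
        text_reader = parser.get_text()
        if text_reader is None:
            return None
        # Assumption: str() gives the same text that print(text_reader) shows
        return str(text_reader)
```

A caller can then branch on the result, for example `text = extract_text("./sample.docx")` followed by a check for None before further processing.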
Extract text from a stream
```python
from groupdocs.parser import Parser

# Open the file stream
with open("sample.pdf", "rb") as stream:
    # Create an instance of Parser class with the stream
    with Parser(stream) as parser:
        # Extract text into the reader
        text_reader = parser.get_text()
        if text_reader is not None:
            # Print the extracted text
            extracted_text = text_reader
            print(extracted_text)
        else:
            print("Text extraction isn't supported for this format")
```
The following sample file is used in this example: sample.pdf
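Documents do not always come from disk; they may arrive as bytes from a download or a database. The sketch below wraps such bytes in io.BytesIO and passes the stream to Parser. It rests on an assumption the example above does not confirm: that Parser accepts any binary file-like object, not only a stream returned by open(). The names extract_text_from_bytes and parser_factory are hypothetical.

```python
import io
from typing import Callable, Optional


def extract_text_from_bytes(data: bytes, parser_factory: Optional[Callable] = None) -> Optional[str]:
    """Return the text of an in-memory document, or None if unsupported."""
    if parser_factory is None:
        # Imported lazily so the helper can be loaded without the library installed
        from groupdocs.parser import Parser
        parser_factory = Parser

    # Assumption: Parser accepts an in-memory binary stream like a file stream
    with io.BytesIO(data) as stream:
        with parser_factory(stream) as parser:
            text_reader = parser.get_text()
            if text_reader is None:
                return None
            return str(text_reader)
```

If the in-memory stream turns out not to be accepted, writing the bytes to a temporary file and passing its path is a fallback.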
Extract text from specific pages
To extract text from a specific page, pass the 0-based page index to get_text:
```python
from groupdocs.parser import Parser

with Parser("./sample.pdf") as parser:
    # Get document info to check page count
    doc_info = parser.get_document_info()
    # Check if the document has pages
    if doc_info.page_count > 0:
        # Extract text from the first page (page index is 0-based)
        text_reader = parser.get_text(0)
        if text_reader:
            print("Text from page 1:")
            print(text_reader)
```
The following sample file is used in this example: sample.pdf
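The same two calls, get_document_info and get_text with a page index, can be combined to walk every page. The following is a minimal sketch under the assumptions already used on this page (0-based page indexes, and a reader whose string form is the page text). The names extract_text_by_page and parser_factory are hypothetical.

```python
from typing import Callable, List, Optional


def extract_text_by_page(path: str, parser_factory: Optional[Callable] = None) -> List[str]:
    """Return one string per page; pages without a reader yield an empty string."""
    if parser_factory is None:
        # Imported lazily so the helper can be loaded without the library installed
        from groupdocs.parser import Parser
        parser_factory = Parser

    pages: List[str] = []
    with parser_factory(path) as parser:
        page_count = parser.get_document_info().page_count
        for page_index in range(page_count):  # page indexes are 0-based
            text_reader = parser.get_text(page_index)
            pages.append(str(text_reader) if text_reader is not None else "")
    return pages
```

Keeping an empty string for pages without text preserves the alignment between list positions and page indexes.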
More resources
Advanced usage topics
To learn more about text extraction features, including formatted text, text structure, and search functionality, please refer to the advanced usage section.
GitHub examples
You may find more code examples in our GitHub repository:
Along with the full-featured library, we provide a free online text extractor app. You are welcome to extract text from your documents with our Free Online Document Parser App.