GroupDocs.Parser extracts text from documents in Accurate mode by default.
Accurate mode provides the best text quality of the available text extraction modes.
You can extract text from the entire document or from individual pages.
Prerequisites
Before you begin, ensure you have:
GroupDocs.Parser for Python via .NET installed
A valid license or trial
Sample documents for testing
Extract text from document
To extract text from the entire document in Accurate mode, use the get_text() method:
    from groupdocs.parser import Parser

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract text from the document
        text_reader = parser.get_text()

        # Check if text extraction is supported
        if text_reader is None:
            print("Text extraction isn't supported")
        else:
            # Print the extracted text
            print(text_reader)
The following sample file is used in this example: sample.pdf
Expected behavior: The method returns a TextReader object containing the entire document text, or None if text extraction is not supported for the document format.
Extract text from document page
To extract text from a specific page:
    from groupdocs.parser import Parser

    def extract_text_by_page(file_path):
        # Create an instance of Parser class
        with Parser(file_path) as parser:
            # Check if text extraction is supported
            if not parser.features.text:
                print("Document doesn't support text extraction")
                return

            # Get document info
            info = parser.get_document_info()

            # Check if document has pages
            if info.page_count == 0:
                print("Document has no pages")
                return

            # Iterate over pages
            for page_index in range(info.page_count):
                # Print page number
                print(f"Page {page_index + 1}/{info.page_count}")

                # Extract text from the page
                text_reader = parser.get_text(page_index)

                # Print the page text
                if text_reader is not None:
                    print(text_reader)

    # The early returns above require a function body
    extract_text_by_page("./sample.pdf")
The following sample file is used in this example: sample.pdf
Expected behavior: The method extracts text from each page individually, allowing you to process documents page by page.
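When the page text needs further processing rather than printing, the same loop can be wrapped in a generator. The following is a minimal sketch, not part of the library: `iter_page_text` is a hypothetical helper name, and it assumes only the members already used on this page (`features.text`, `get_document_info().page_count`, and `get_text(page_index)`). It also assumes that the string form of the returned reader is the page text, as the `print(text_reader)` calls suggest.

```python
from typing import Iterator, Tuple


def iter_page_text(parser) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) pairs, one page at a time.

    `parser` is any object exposing features.text,
    get_document_info().page_count and get_text(page_index).
    iter_page_text is a hypothetical helper, not a library function.
    """
    # Yield nothing if the format does not support text extraction
    if not parser.features.text:
        return

    page_count = parser.get_document_info().page_count
    for page_index in range(page_count):
        text_reader = parser.get_text(page_index)
        # Skip pages for which no text is returned
        if text_reader is None:
            continue
        # Assumption: str() of the reader is the page text
        yield page_index + 1, str(text_reader)
```

Inside a `with Parser("./sample.pdf") as parser:` block it could be consumed with `for number, text in iter_page_text(parser):`, which keeps the support check and the None handling in one place.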
Extract text with error handling
The following example checks feature support first and catches any exception raised during extraction:
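    from groupdocs.parser import Parser

    def extract_text_safely(file_path):
        try:
            with Parser(file_path) as parser:
                # Check feature support
                if not parser.features.text:
                    print(f"Text extraction not supported for {file_path}")
                    return None

                # Extract text
                text_reader = parser.get_text()
                if text_reader is None:
                    return None

                # Convert to str while the parser is still open, so len() works
                # on the result (assumes str() of the reader is the text, as
                # print(text_reader) in the earlier examples implies)
                return str(text_reader)
        except Exception as e:
            print(f"Error extracting text: {e}")
            return None

    # Usage
    text = extract_text_safely("sample.docx")
    if text:
        print(f"Extracted {len(text)} characters")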
The following sample file is used in this example: sample.docx
Notes
Accurate mode is the default and provides the best text quality
The get_text() method returns None if text extraction is not supported
Use parser.features.text to check if text extraction is available before calling get_text()
For better performance with large documents, consider extracting text page by page
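As an illustration of the page-by-page note, the sketch below writes each page to an output file as soon as it is extracted, so only one page of text is held in memory at a time. `write_pages_to_file` is a hypothetical helper name, and the page separator format is an arbitrary choice. It relies only on the members shown on this page and assumes that `str()` of the returned reader is the page text. Whether this is faster for a given document is not guaranteed.

```python
def write_pages_to_file(parser, out_path: str) -> int:
    """Write each page's text to out_path as it is extracted.

    Returns the number of pages written. `parser` is any object exposing
    features.text, get_document_info().page_count and get_text(page_index).
    write_pages_to_file is a hypothetical helper, not a library function.
    """
    # Nothing to write if the format does not support text extraction
    if not parser.features.text:
        return 0

    page_count = parser.get_document_info().page_count
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for page_index in range(page_count):
            text_reader = parser.get_text(page_index)
            # Skip pages for which no text is returned
            if text_reader is None:
                continue
            # Separator line so pages can be told apart in the output
            out.write(f"--- Page {page_index + 1}/{page_count} ---\n")
            # Assumption: str() of the reader is the page text
            out.write(str(text_reader))
            out.write("\n")
            written += 1
    return written
```

It would be called inside a `with Parser(...) as parser:` block, for example `write_pages_to_file(parser, "sample.txt")`.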