This guide demonstrates the essential steps to get started with GroupDocs.Parser for Python via .NET and perform basic document parsing operations.
Prerequisites
Before you begin, ensure you have:
Python 3.5 or higher installed (the examples below that use f-strings require Python 3.6 or higher)
GroupDocs.Parser for Python via .NET installed (see Installation)
Extract Text from a Document
The most common task is extracting text from documents. Here’s how to do it:
```python
from groupdocs.parser import Parser

def extract_text_from_document():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract text from the document
        text_reader = parser.get_text()
        if text_reader is not None:
            # Print the extracted text
            print(text_reader)
        else:
            print("Text extraction isn't supported for this format")

if __name__ == "__main__":
    extract_text_from_document()
```
The following sample file is used in this example: sample.pdf
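Extracted text is often written to a file rather than printed. The following is a minimal sketch, assuming the value returned by `get_text()` converts to a string with `str()` (as the `print` call in the example suggests). `save_extracted_text` is a hypothetical helper and is not part of GroupDocs.Parser.

```python
from pathlib import Path


def save_extracted_text(text, output_path):
    """Write extracted text to a UTF-8 file.

    `text` is whatever parser.get_text() returned. It is assumed (not
    confirmed by this guide) that str() on it yields the extracted text.
    Returns False when there is nothing to write, True otherwise.
    """
    if text is None:
        # The guide's example treats None as "text extraction not supported"
        return False
    path = Path(output_path)
    # Create the target folder if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(text), encoding="utf-8")
    return True
```

Inside the `with Parser(...)` block, this could be called as `save_extracted_text(parser.get_text(), "out/sample.txt")`.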
Get Document Information
You can retrieve basic information about a document:
```python
from groupdocs.parser import Parser

def get_document_information():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Get document info
        info = parser.get_document_info()
        print(f"File type: {info.file_type.file_format}")
        print(f"Page count: {info.page_count}")
        print(f"Size: {info.size} bytes")

if __name__ == "__main__":
    get_document_information()
```
The following sample file is used in this example: sample.pdf
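Raw byte counts are hard to read for large files. As a small illustration, the sketch below turns a size in bytes, such as the `info.size` value printed above, into a shorter string. `format_size` is a hypothetical helper written in plain Python; it does not call GroupDocs.Parser.

```python
def format_size(num_bytes):
    """Return a byte count as a short human-readable string (binary units)."""
    size = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "GiB":
            break
        size /= 1024
    if unit == "bytes":
        return "{} bytes".format(int(size))
    return "{:.1f} {}".format(size, unit)
```

For example, `format_size(1536)` returns `"1.5 KiB"`.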
Extract Metadata
Extract metadata properties from documents:
```python
from groupdocs.parser import Parser

def extract_metadata():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract metadata
        metadata = parser.get_metadata()
        if metadata is not None:
            for item in metadata:
                print(f"{item.name}: {item.value}")

if __name__ == "__main__":
    extract_metadata()
```
The following sample file is used in this example: sample.pdf
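When metadata needs to be looked up by name rather than printed, it can be collected into a dictionary. The sketch below relies only on the `name` and `value` attributes used in the example above; `metadata_to_dict` is a hypothetical helper, and it assumes property names are unique (a repeated name keeps the last value).

```python
def metadata_to_dict(metadata):
    """Collect metadata items into a dict keyed by property name.

    `metadata` is the iterable returned by parser.get_metadata(), or None
    when metadata extraction is not available for the document.
    """
    if metadata is None:
        return {}
    # Each item is expected to expose .name and .value, as in the guide
    return {item.name: item.value for item in metadata}
```

The result supports lookups such as `metadata_to_dict(parser.get_metadata()).get("author")`, where `"author"` is only an illustrative key.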
Extract Images
Extract images from documents:
```python
from groupdocs.parser import Parser

def extract_images():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract images
        images = parser.get_images()
        if images is not None:
            for i, image in enumerate(images):
                # Save image to file
                with open(f"image_{i}.{image.file_type.extension}", "wb") as file:
                    file.write(image.get_image_stream().read())

if __name__ == "__main__":
    extract_images()
```
The following sample file is used in this example: sample.pdf
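The example above writes images into the current directory. A possible refinement is to build the output path separately, so that images land in a dedicated folder with predictable names. This guide does not state whether `extension` includes a leading dot, so the hypothetical helper `image_output_path` below accepts both forms.

```python
from pathlib import Path


def image_output_path(output_dir, index, extension):
    """Build an output path such as <output_dir>/image_003.png.

    `extension` may be given with or without a leading dot.
    """
    directory = Path(output_dir)
    # Create the output folder on first use
    directory.mkdir(parents=True, exist_ok=True)
    suffix = extension if extension.startswith(".") else "." + extension
    return directory / "image_{:03d}{}".format(index, suffix)
```

In the loop above, the `open(...)` call could then take `str(image_output_path("images", i, image.file_type.extension))` as its first argument.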
Extract Text from Specific Page
Extract text from a particular page:
```python
from groupdocs.parser import Parser

def extract_text_from_specific_page():
    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Get document info to check page count
        info = parser.get_document_info()
        if info.page_count > 0:
            # Extract text from the first page (page index is 0-based)
            text_reader = parser.get_text(0)
            if text_reader is not None:
                print(text_reader)

if __name__ == "__main__":
    extract_text_from_specific_page()
```
The following sample file is used in this example: sample.pdf
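The same two calls can be combined to walk every page in order. The sketch below uses only `get_document_info()` and `get_text(page_index)` as shown above, and it assumes the returned value converts to text with `str()`. `extract_text_by_page` is a hypothetical helper that accepts any object exposing those two methods.

```python
def extract_text_by_page(parser):
    """Return a list with one entry per page, in page order.

    An entry is None when get_text() returned None for that page.
    """
    info = parser.get_document_info()
    pages = []
    # Page indexes are 0-based, so range(page_count) covers every page
    for page_index in range(info.page_count):
        text = parser.get_text(page_index)
        pages.append(None if text is None else str(text))
    return pages
```

Called inside a `with Parser("./sample.pdf") as parser:` block, `extract_text_by_page(parser)[0]` would hold the first page's text.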
Check Format Support
Before processing a document, you can check if the format is supported:
```python
from groupdocs.parser import Parser

def check_format_support():
    # Check if file format is supported
    if Parser.get_file_info("./sample.pdf").file_type.file_format != "Unknown":
        print("Format is supported")
        # Process the document
        with Parser("./sample.pdf") as parser:
            text_reader = parser.get_text()
            if text_reader is not None:
                print(text_reader)
    else:
        print("Format is not supported")

if __name__ == "__main__":
    check_format_support()
```
The following sample file is used in this example: sample.pdf
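When a whole folder has to be processed, the same check can be applied to each file first. The sketch below is library-agnostic: `split_by_support` is a hypothetical helper, and `is_supported` is any function that takes a file path and returns `True` or `False`, for instance one that wraps the check from the example above.

```python
from pathlib import Path


def split_by_support(folder, is_supported):
    """Split the files in `folder` into (supported, unsupported) name lists.

    `is_supported` is a caller-supplied function: path string -> bool.
    Subfolders are skipped; names are returned in sorted order.
    """
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if is_supported(str(path)):
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Keeping the check as a parameter means the folder walk can be tested without any documents or the parser installed.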
Next Steps
Now that you’ve learned the basics, explore more advanced features: