GroupDocs.Parser provides functionality to extract images from various document formats including PDF, Word, Excel, PowerPoint, and more.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents containing images
Write access to save extracted images (optional)
Extract images from document
To extract all images from a document:
from groupdocs.parser import Parser

# Create an instance of the Parser class
with Parser("./sample.pdf") as parser:
    # Extract images
    images = parser.get_images()
    # Check if image extraction is supported
    if images is None:
        print("Image extraction isn't supported")
    else:
        # Iterate over images
        for idx, image in enumerate(images):
            # Print image information
            print(f"Image {idx + 1}:")
            print(f" Page: {image.page.index + 1}")
            print(f" Type: {image.file_type}")
            print(f" Size: {image.rectangle.width}x{image.rectangle.height}")
            print(f" Position: ({image.rectangle.left}, {image.rectangle.top})")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns a collection of PageImageArea objects representing all images found in the document, or None if image extraction is not supported.
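Because the result is either a collection or None, downstream code usually has to handle both cases. The following is a minimal sketch of a per-page summary, not part of the GroupDocs.Parser API: count_images_per_page is a hypothetical helper, it relies only on the page.index attribute used in the example above, and the SimpleNamespace objects are stand-ins so the snippet runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_images_per_page(images):
    """Return {1-based page number: image count}; None is treated as no images."""
    if images is None:
        return {}
    counts = Counter(image.page.index + 1 for image in images)
    return dict(sorted(counts.items()))


# Hypothetical stand-ins for the objects returned by get_images().
fake_images = [
    SimpleNamespace(page=SimpleNamespace(index=0)),
    SimpleNamespace(page=SimpleNamespace(index=0)),
    SimpleNamespace(page=SimpleNamespace(index=2)),
]
print(count_images_per_page(fake_images))  # {1: 2, 3: 1}
```

With a real document, the same function could be called as count_images_per_page(parser.get_images()) inside the with block.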
Save extracted images to files
To save extracted images to disk:
from groupdocs.parser import Parser
import os

# Create output directory
output_dir = "extracted_images"
os.makedirs(output_dir, exist_ok=True)

# Create an instance of the Parser class
with Parser("./sample.docx") as parser:
    # Extract images
    images = parser.get_images()
    if images is None:
        print("Image extraction isn't supported")
    else:
        # Iterate over images and save them
        for idx, image in enumerate(images):
            # Get file extension based on image type
            extension = image.file_type.extension
            # Generate filename
            filename = f"image_{idx + 1}{extension}"
            filepath = os.path.join(output_dir, filename)
            # Save image to file
            image.save(filepath)
            print(f"Saved: {filepath}")
The following sample file is used in this example: sample.docx
Expected behavior: Saves each extracted image to a separate file with the appropriate file extension (.png, .jpg, .gif, etc.).
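The example above concatenates the extension directly, which assumes it already starts with a dot. As a hedged sketch, the hypothetical helper below (build_image_path is not a library function) accepts the extension with or without the dot and zero-pads the index so that files sort in extraction order.

```python
import os


def build_image_path(output_dir, index, extension, width=3):
    """Build a zero-padded output path such as <output_dir>/image_001.png."""
    ext = str(extension or "").strip().lower()
    # Accept both ".png" and "png"; an empty extension yields no suffix.
    if ext and not ext.startswith("."):
        ext = "." + ext
    return os.path.join(output_dir, f"image_{index:0{width}d}{ext}")


print(build_image_path("extracted_images", 1, ".PNG"))
print(build_image_path("extracted_images", 12, "jpg"))
```

In the save loop, filepath could then be computed as build_image_path(output_dir, idx + 1, image.file_type.extension).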
Extract images with metadata
To extract images along with detailed metadata:
from groupdocs.parser import Parser

# Create an instance of the Parser class
with Parser("./sample.pptx") as parser:
    # Check if image extraction is supported
    if not parser.features.images:
        print("Document doesn't support image extraction")
    else:
        # Extract images once and store them in a list,
        # so they can be both counted and iterated
        images = list(parser.get_images() or [])
        print(f"Found {len(images)} images")
        for idx, image in enumerate(images):
            print(f"Image {idx + 1}:")
            print(f" Page: {image.page.index + 1}")
            print(f" Format: {image.file_type}")
            print(f" Rotation: {image.rotation}°")
            print(f" Rectangle: {image.rectangle}")
            print(f" Width: {image.rectangle.width}")
            print(f" Height: {image.rectangle.height}")
            print()
The following sample file is used in this example: sample.pptx
Expected behavior: Displays comprehensive information about each image including position, size, format, and rotation angle.
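If the metadata needs to be stored rather than printed, it can be flattened into plain dictionaries first. This is a sketch under stated assumptions: image_to_record is a hypothetical helper, it reads only the attributes shown in the example above (page.index, file_type, rotation, rectangle), and the SimpleNamespace object stands in for a real image so the snippet runs on its own.

```python
import json
from types import SimpleNamespace


def image_to_record(index, image):
    """Flatten one image's metadata into a JSON-serializable dict."""
    rect = image.rectangle
    return {
        "image": index + 1,
        "page": image.page.index + 1,
        "format": str(image.file_type),
        "rotation": image.rotation,
        "left": rect.left,
        "top": rect.top,
        "width": rect.width,
        "height": rect.height,
    }


# Hypothetical stand-in for an object returned by get_images().
fake_image = SimpleNamespace(
    page=SimpleNamespace(index=1),
    file_type="PNG",
    rotation=0,
    rectangle=SimpleNamespace(left=10, top=20, width=300, height=200),
)
print(json.dumps(image_to_record(0, fake_image), indent=2))
```

Converting file_type with str() keeps the record serializable regardless of how the library represents the type.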
Get image stream
To work with image data as a stream:
from groupdocs.parser import Parser
from groupdocs.parser.options import ImageOptions, ImageFormat

# Create an instance of the Parser class
with Parser("./sample.pdf") as parser:
    # Extract images
    images = parser.get_images()
    if images:
        for idx, image in enumerate(images):
            # Get image stream
            image_stream = image.get_image_stream()
            # Read image data
            image_data = image_stream.read()
            print(f"Image {idx + 1}: {len(image_data)} bytes, format: {image.file_type}")
            # Optionally convert to PNG
            png_options = ImageOptions(ImageFormat.PNG)
            png_stream = image.get_image_stream(png_options)
            png_data = png_stream.read()
            print(f" Converted to PNG: {len(png_data)} bytes")
The following sample file is used in this example: sample.pdf
Expected behavior: Provides access to raw image data as a stream, with optional format conversion.
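Once the bytes are in memory, they can be inspected without any imaging library. The sketch below is independent of GroupDocs.Parser: sniff_image_format is a hypothetical helper that guesses the format from well-known file signatures (PNG, JPEG, GIF, BMP), and the io.BytesIO object stands in for the stream returned in the example above.

```python
import io

# Leading byte signatures of a few common image formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
)


def sniff_image_format(data):
    """Return a format name guessed from the leading bytes, or None if unknown."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None


# Stand-in stream holding a PNG signature followed by padding bytes.
stream = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
print(sniff_image_format(stream.read()))  # png
```

A check like this can serve as a sanity test that a converted stream really contains the requested format.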
Convert images during extraction
To convert images to a specific format during extraction:
from groupdocs.parser import Parser
from groupdocs.parser.options import ImageOptions, ImageFormat
import os

# Create output directory
output_dir = "converted_images"
os.makedirs(output_dir, exist_ok=True)

# Create an instance of the Parser class
with Parser("./sample.pdf") as parser:
    # Extract images
    images = parser.get_images()
    if images:
        # Create image options for PNG format
        png_options = ImageOptions(ImageFormat.PNG)
        for idx, image in enumerate(images):
            # Save image as PNG (regardless of original format)
            filename = f"image_{idx + 1}.png"
            filepath = os.path.join(output_dir, filename)
            # Save with conversion
            image.save(filepath, png_options)
            print(f"Saved as PNG: {filepath}")
The following sample file is used in this example: sample.pdf
Expected behavior: All extracted images are converted to PNG format before saving, regardless of their original format.
Batch image extraction
To extract images from every supported document in a directory, including its subdirectories:
from groupdocs.parser import Parser
import os
from pathlib import Path


def extract_images_from_directory(input_dir, output_dir):
    """
    Extract images from all documents in a directory.
    """
    os.makedirs(output_dir, exist_ok=True)
    # Supported document extensions
    extensions = ['.pdf', '.docx', '.doc', '.xlsx', '.pptx', '.ppt']
    for file_path in Path(input_dir).rglob('*'):
        if file_path.suffix.lower() in extensions:
            print(f"Processing: {file_path.name}")
            try:
                with Parser(str(file_path)) as parser:
                    images = parser.get_images()
                    if images is None:
                        print(" Image extraction not supported")
                        continue
                    # Create subdirectory for this document
                    doc_output_dir = os.path.join(output_dir, file_path.stem)
                    os.makedirs(doc_output_dir, exist_ok=True)
                    # Save images
                    image_count = 0
                    for idx, image in enumerate(images):
                        filename = f"image_{idx + 1}{image.file_type.extension}"
                        filepath = os.path.join(doc_output_dir, filename)
                        image.save(filepath)
                        image_count += 1
                    print(f" Extracted {image_count} images")
            except Exception as e:
                print(f" Error: {e}")


# Usage
extract_images_from_directory("input_documents", "extracted_images")
Notes
The get_images() method returns None if image extraction is not supported for the document format
Always check parser.features.images before attempting to extract images
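The two notes can be combined into a single guard so that calling code always receives a list. This is a minimal sketch, not a library function: get_images_or_empty is a hypothetical helper that relies only on the features.images flag and the get_images() method used in the examples on this page, and the SimpleNamespace objects are stand-ins for real Parser instances.

```python
from types import SimpleNamespace


def get_images_or_empty(parser):
    """Return extracted images as a list; empty when extraction is unsupported."""
    features = getattr(parser, "features", None)
    # Skip extraction when the feature flag is present and says "unsupported".
    if features is not None and not features.images:
        return []
    images = parser.get_images()
    return [] if images is None else list(images)


# Hypothetical stand-ins for Parser objects, used only to exercise the helper.
unsupported = SimpleNamespace(
    features=SimpleNamespace(images=False), get_images=lambda: None
)
supported = SimpleNamespace(
    features=SimpleNamespace(images=True), get_images=lambda: iter(["a", "b"])
)
print(len(get_images_or_empty(unsupported)), len(get_images_or_empty(supported)))  # 0 2
```

Returning a list also avoids the need to call get_images() a second time when the images must be counted before they are processed.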