GroupDocs.Parser provides functionality to extract data from ZIP archives and other container formats, allowing you to iterate through archive contents and parse individual files.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample ZIP archives for testing
Understanding of container/archive concepts
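If no sample archive is at hand, a small one can be built with Python's standard zipfile module. This is a minimal sketch: the helper name build_sample_zip and the member names in the usage note are illustrative and not part of GroupDocs.Parser.

```python
import zipfile


def build_sample_zip(zip_path, files):
    """Write a ZIP archive from a mapping of member names to text content."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for member_name, text in files.items():
            # writestr() stores the string as a file inside the archive
            archive.writestr(member_name, text)
    return zip_path
```

For example, build_sample_zip("archive.zip", {"readme.txt": "hello", "docs/notes.txt": "world"}) produces an archive with two text members, one of them in a subfolder.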
Detect and iterate through archive items
To iterate through files in a ZIP archive:
from groupdocs.parser import Parser

# Create an instance of the Parser class
with Parser("./archive.zip") as parser:
    # Get container items (files in the archive)
    attachments = parser.get_container()

    # Check if container extraction is supported
    if attachments is None:
        print("Container extraction isn't supported")
    else:
        # Iterate over attachments
        for idx, attachment in enumerate(attachments):
            print(f"File {idx + 1}:")
            print(f" Name: {attachment.name}")
            print(f" Size: {attachment.size} bytes")
            print(f" File Path: {attachment.file_path}")
The following sample file is used in this example: archive.zip
Expected behavior: Returns a collection of container items representing files in the ZIP archive, or None if container extraction is not supported.
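Once the items are available, a short summary is often more useful than a raw listing. The sketch below assumes only that each item exposes name and size (in bytes) as in the example above. The helpers format_size and summarize_items are hypothetical, and the SimpleNamespace objects stand in for real container items.

```python
from types import SimpleNamespace


def format_size(num_bytes):
    """Return a human-readable size string using binary units."""
    size = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "GiB":
            return f"{size:.0f} {unit}" if unit == "bytes" else f"{size:.1f} {unit}"
        size /= 1024


def summarize_items(items):
    """Return (item_count, total_size_in_bytes) for objects with a .size attribute."""
    items = list(items)
    return len(items), sum(item.size for item in items)


# Stand-in objects; with the library, pass the result of parser.get_container()
demo_items = [
    SimpleNamespace(name="a.pdf", size=2048),
    SimpleNamespace(name="b.txt", size=512),
]
count, total = summarize_items(demo_items)
print(f"{count} items, {format_size(total)} in total")
```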
Extract text from files in ZIP archive
To extract text from specific files within an archive:
from groupdocs.parser import Parser

# Create an instance of the Parser class for the archive
with Parser("./documents.zip") as parser:
    # Get container items
    attachments = parser.get_container()

    if attachments is None:
        print("Container extraction not supported")
    else:
        # Iterate over attachments
        for attachment in attachments:
            print(f"Processing: {attachment.name}")

            try:
                # Get a parser for the attachment
                with attachment.open_parser() as file_parser:
                    # Extract text from the file
                    text_reader = file_parser.get_text()

                    if text_reader:
                        text = text_reader
                        print(f" Extracted {len(text)} characters")
                        print(f" Preview: {text[:100]}...")
                    else:
                        print(" Text extraction not supported for this file")
            except Exception as e:
                print(f" Error: {e}")
The following sample file is used in this example: documents.zip
Expected behavior: Opens each file in the archive and extracts text content using a nested parser.
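The preview in the example slices the first 100 characters and always appends an ellipsis, even for short texts. One possible refinement, assuming the extracted text is a plain Python string, uses textwrap.shorten from the standard library, which collapses whitespace and truncates at a word boundary. The helper name make_preview is illustrative.

```python
import textwrap


def make_preview(text, width=100):
    """Collapse whitespace and truncate to at most `width` characters."""
    # The placeholder is appended only when the text is actually truncated
    return textwrap.shorten(text, width=width, placeholder="...")
```

In the loop above, print(f" Preview: {make_preview(text)}") could replace the slicing expression.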
Detect file type of container items
To identify file types within an archive:
from groupdocs.parser import Parser

# Create an instance of the Parser class
with Parser("mixed_files.zip") as parser:
    # Get container items
    attachments = parser.get_container()

    if attachments:
        print("Archive Contents:\n")

        # Group files by type
        files_by_type = {}

        for attachment in attachments:
            # Get file info; items that cannot be opened are filed under "unknown"
            try:
                with attachment.open_parser() as file_parser:
                    info = file_parser.get_document_info()
                    file_type = info.file_type.extension if info and info.file_type else "unknown"
            except Exception:
                file_type = "unknown"

            files_by_type.setdefault(file_type, []).append(attachment.name)

        # Print summary
        for file_type, files in sorted(files_by_type.items()):
            print(f"{file_type.upper()} files ({len(files)}):")
            for filename in files:
                print(f" - {filename}")
The following sample file is used in this example: mixed_files.zip
Expected behavior: Categorizes archive contents by file type.
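When an item cannot be opened, the example files it under "unknown". A cheaper, name-based fallback is to group items by file name extension. The sketch below uses only the standard library; unlike get_document_info() it looks at names, not content, so a misnamed file will be misclassified. The function name group_by_extension is hypothetical.

```python
import os
from collections import defaultdict


def group_by_extension(names):
    """Group file names by lowercase extension; names without one go to 'unknown'."""
    groups = defaultdict(list)
    for name in names:
        ext = os.path.splitext(name)[1].lower() or "unknown"
        groups[ext].append(name)
    return dict(groups)
```

With the library, the input could be [attachment.name for attachment in attachments].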
Extract specific files from archive
To extract only specific files based on criteria:
from groupdocs.parser import Parser
import os

def extract_specific_files(archive_path, output_dir, extensions=None):
    """
    Extract specific files from ZIP archive based on extension.

    Args:
        archive_path: Path to ZIP archive
        output_dir: Directory to save extracted files
        extensions: List of extensions to extract (e.g., ['.pdf', '.docx'])
    """
    os.makedirs(output_dir, exist_ok=True)

    with Parser(archive_path) as parser:
        attachments = parser.get_container()

        if not attachments:
            print("No files found in archive")
            return

        extracted_count = 0

        for attachment in attachments:
            # Check extension filter
            if extensions:
                file_ext = os.path.splitext(attachment.name)[1].lower()
                if file_ext not in extensions:
                    continue

            print(f"Extracting: {attachment.name}")

            try:
                # Open parser for the file
                with attachment.open_parser() as file_parser:
                    # Extract text
                    text_reader = file_parser.get_text()

                    if text_reader:
                        # Save to output directory
                        output_file = os.path.join(output_dir, f"{attachment.name}.txt")
                        with open(output_file, 'w', encoding='utf-8') as f:
                            f.write(text_reader)

                        extracted_count += 1
                        print(f" Saved to: {output_file}")
            except Exception as e:
                print(f" Error: {e}")

        print(f"Extracted {extracted_count} files")

# Usage - extract only PDF and DOCX files
extract_specific_files("documents.zip", "extracted", ['.pdf', '.docx'])
The following sample file is used in this example: documents.zip
Expected behavior: Extracts text only from files whose extensions match the filter and saves each result as a .txt file in the output directory.
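The example builds the output path directly from attachment.name. If an item name contains folder separators (for instance docs/report.pdf), the target folder may not exist, and a name such as ../x would point outside the output directory. A minimal sketch of one way to guard against this keeps only the final path component. The helper safe_output_path is hypothetical, and whether item names actually include folders depends on the archive.

```python
import os


def safe_output_path(output_dir, item_name, suffix=".txt"):
    """Build an output path inside output_dir from the last component of item_name."""
    # Normalize both separator styles, then keep only the final path component
    base_name = item_name.replace("\\", "/").rsplit("/", 1)[-1]
    if base_name in ("", ".", ".."):
        base_name = "unnamed"
    return os.path.join(output_dir, base_name + suffix)
```

Flattening names this way can make two items with the same file name in different folders collide, so a real implementation may need to add a counter or recreate the folder structure instead.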
Process nested archives
To handle ZIP files within ZIP files:
from groupdocs.parser import Parser

def process_archive_recursively(parser, level=0):
    """
    Recursively process archives and nested archives.
    """
    indent = " " * level

    # Get container items
    attachments = parser.get_container()

    if not attachments:
        return

    for attachment in attachments:
        print(f"{indent}{attachment.name} ({attachment.size} bytes)")

        try:
            # Open parser for the attachment
            with attachment.open_parser() as file_parser:
                # Check if it's also a container (nested archive)
                nested_attachments = file_parser.get_container()

                if nested_attachments:
                    print(f"{indent} └─ [Archive - processing nested items]")
                    process_archive_recursively(file_parser, level + 2)
                else:
                    # Try to extract text
                    text_reader = file_parser.get_text()
                    if text_reader:
                        text = text_reader
                        print(f"{indent} └─ Text: {len(text)} characters")
        except Exception as e:
            print(f"{indent} └─ Error: {e}")

# Usage
print("Archive Structure:\n")
with Parser("nested_archive.zip") as parser:
    process_archive_recursively(parser)
Expected behavior: Recursively processes nested ZIP archives, showing the complete hierarchy.
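Unbounded recursion can be costly on deeply nested or maliciously crafted archives. The following sketch shows the same traversal with a depth limit, written as a generator so that callers decide how to present the result. It relies only on the get_container() and open_parser() calls used above. The function name walk_container and the max_depth parameter are illustrative.

```python
def walk_container(parser, max_depth=5, depth=0):
    """Yield (depth, name) for each item, descending at most max_depth levels."""
    items = parser.get_container()
    if not items:
        return

    for item in items:
        yield depth, item.name

        # Do not open items that would exceed the depth limit
        if depth + 1 >= max_depth:
            continue

        try:
            with item.open_parser() as nested_parser:
                yield from walk_container(nested_parser, max_depth, depth + 1)
        except Exception:
            # Items that cannot be opened are listed but not descended into
            continue
```

Passing the Parser instance from the example above, each yielded pair can be printed as " " * depth followed by the name to reproduce the indented listing.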
Extract metadata from archive files
To extract metadata from files within an archive:
from groupdocs.parser import Parser

def extract_archive_metadata(archive_path):
    """
    Extract metadata from all files in an archive.
    """
    results = []

    with Parser(archive_path) as parser:
        attachments = parser.get_container()

        if not attachments:
            print("No files in archive")
            return results

        for attachment in attachments:
            file_info = {
                'name': attachment.name,
                'size': attachment.size,
                'path': attachment.file_path
            }

            try:
                with attachment.open_parser() as file_parser:
                    # Get metadata
                    metadata = file_parser.get_metadata()
                    if metadata:
                        file_info['metadata'] = {}
                        for item in metadata:
                            file_info['metadata'][item.name] = str(item.value)

                    # Get document info
                    info = file_parser.get_document_info()
                    if info:
                        file_info['file_type'] = info.file_type.extension
                        file_info['page_count'] = info.page_count
            except Exception as e:
                file_info['error'] = str(e)

            results.append(file_info)

    return results

# Usage
metadata_list = extract_archive_metadata("documents.zip")

print("Archive Metadata Report:\n")
for item in metadata_list:
    print(f"File: {item['name']}")
    print(f" Size: {item['size']} bytes")
    if 'file_type' in item:
        print(f" Type: {item['file_type']}")
    if 'page_count' in item:
        print(f" Pages: {item['page_count']}")
    if 'metadata' in item:
        print(f" Metadata: {len(item['metadata'])} properties")
    print()
The following sample file is used in this example: documents.zip
Expected behavior: Generates a comprehensive metadata report for all files in the archive.
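The list of dictionaries returned by extract_archive_metadata can also be written to a CSV file for use in a spreadsheet. This is a sketch using the standard csv module. It assumes the dictionary keys shown in the example above (name, size, file_type, page_count, metadata, error). The function name write_metadata_csv and the metadata_count column are illustrative.

```python
import csv

FIELDS = ["name", "size", "file_type", "page_count", "metadata_count", "error"]


def write_metadata_csv(results, csv_path):
    """Write one CSV row per file; keys missing from an entry become empty cells."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for item in results:
            writer.writerow({
                "name": item.get("name", ""),
                "size": item.get("size", ""),
                "file_type": item.get("file_type", ""),
                "page_count": item.get("page_count", ""),
                # Store only the number of metadata properties, not their values
                "metadata_count": len(item.get("metadata", {})),
                "error": item.get("error", ""),
            })
    return csv_path
```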
Create archive inventory
To create a detailed inventory of archive contents:
from groupdocs.parser import Parser
import json

def create_archive_inventory(archive_path, output_json):
    """
    Create detailed inventory of ZIP archive contents.
    """
    inventory = {
        'archive': archive_path,
        'total_files': 0,
        'total_size': 0,
        'files': []
    }

    with Parser(archive_path) as parser:
        attachments = parser.get_container()

        if attachments:
            for attachment in attachments:
                file_entry = {
                    'name': attachment.name,
                    'size': attachment.size,
                    'path': attachment.file_path,
                    'extractable': False
                }

                try:
                    with attachment.open_parser() as file_parser:
                        # Check if text extraction is supported
                        if file_parser.features.text:
                            text_reader = file_parser.get_text()
                            if text_reader:
                                text = text_reader
                                file_entry['extractable'] = True
                                file_entry['text_length'] = len(text)
                                file_entry['word_count'] = len(text.split())
                except Exception:
                    # Leave 'extractable' as False for items that cannot be parsed
                    pass

                inventory['files'].append(file_entry)
                inventory['total_size'] += attachment.size
                inventory['total_files'] += 1

    # Save to JSON
    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(inventory, f, indent=2, ensure_ascii=False)

    print(f"Inventory created: {output_json}")
    print(f" Total files: {inventory['total_files']}")
    print(f" Total size: {inventory['total_size']:,} bytes")
    print(f" Extractable: {sum(1 for f in inventory['files'] if f['extractable'])}")

# Usage
create_archive_inventory("archive.zip", "inventory.json")
Expected behavior: Creates a JSON inventory file with detailed information about archive contents.
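The saved inventory can be loaded again later without reopening the archive. The sketch below reads the JSON file and derives a few aggregate figures. It assumes the structure written by create_archive_inventory above (a files list whose entries carry name, size, extractable and, where available, word_count). The function name summarize_inventory is hypothetical.

```python
import json


def summarize_inventory(inventory_path):
    """Load an inventory JSON file and return aggregate figures."""
    with open(inventory_path, encoding="utf-8") as f:
        inventory = json.load(f)

    files = inventory.get("files", [])
    extractable = [entry for entry in files if entry.get("extractable")]

    return {
        "total_files": len(files),
        "extractable_files": len(extractable),
        # Entries without a word_count contribute zero
        "total_words": sum(entry.get("word_count", 0) for entry in extractable),
        "largest_file": max(files, key=lambda entry: entry["size"])["name"] if files else None,
    }
```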
Notes
The get_container() method returns None if container extraction is not supported
Use open_parser() on attachments to create a parser for individual files
Nested archives can be processed recursively
Each attachment has properties: name, size, and file_path
A nested container (ZIP within ZIP) appears as an ordinary item; calling get_container() on the parser opened for that item returns its contents
Always use context managers (with statements) to ensure proper resource cleanup
Some files within archives may not support text extraction