GroupDocs.Parser provides functionality to detect the file type of items within containers (ZIP archives, OST/PST files, etc.) before processing them.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample container files
Understanding of file type detection
Detect file type of container items
To detect the file type of each item in a container:
    from groupdocs.parser import Parser

    # Create an instance of Parser class
    with Parser("./archive.zip") as parser:
        # Get container items
        attachments = parser.get_container()
        if attachments is None:
            print("Container extraction not supported")
        else:
            # Iterate over attachments
            for attachment in attachments:
                print(f"File: {attachment.name}")
                try:
                    # Open parser for the attachment
                    with attachment.open_parser() as file_parser:
                        # Get document info to detect file type
                        info = file_parser.get_document_info()
                        if info and info.file_type:
                            print(f"  Type: {info.file_type.file_format}")
                            print(f"  Extension: {info.file_type.extension}")
                        else:
                            print("  Type: Unknown")
                except Exception as e:
                    print(f"  Error: {e}")
The following sample file is used in this example: archive.zip
Expected behavior: Detects and displays the file type of each item in the container.
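Content-based detection is most useful when an item's name is misleading. The sketch below compares the extension in an item's name with the detected one. `extension_matches` is a hypothetical helper, not part of GroupDocs.Parser, and it assumes `info.file_type.extension` is a string such as `.pdf`; check the values your version returns before relying on it.

```python
import os


def extension_matches(item_name, detected_extension):
    """Return True if the name's extension equals the detected one.

    Hypothetical helper: assumes the detected extension is a string
    such as ".pdf" (with or without the leading dot).
    """
    declared = os.path.splitext(item_name)[1].lower().lstrip(".")
    detected = (detected_extension or "").lower().lstrip(".")
    # A name without an extension or an undetected type never matches
    return bool(declared) and declared == detected


print(extension_matches("report.PDF", ".pdf"))    # True
print(extension_matches("invoice.docx", ".pdf"))  # False
```

Inside the detection loop, this could be called with `attachment.name` and `info.file_type.extension` to flag items whose name and content disagree.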
Categorize items by file type
To organize container items by their detected file type:
    from collections import defaultdict

    from groupdocs.parser import Parser


    def categorize_by_file_type(file_path):
        """
        Categorize container items by detected file type.
        """
        with Parser(file_path) as parser:
            attachments = parser.get_container()
            if attachments is None:
                print("Container extraction not supported")
                return {}

            categories = defaultdict(list)
            for attachment in attachments:
                try:
                    with attachment.open_parser() as file_parser:
                        info = file_parser.get_document_info()
                        if info and info.file_type:
                            file_type = info.file_type.file_format
                        else:
                            file_type = "Unknown"
                except Exception:
                    # Items that cannot be opened are grouped under "Error"
                    file_type = "Error"
                categories[file_type].append(attachment.name)

            return dict(categories)


    # Usage
    categories = categorize_by_file_type("mixed_files.zip")
    print("Files categorized by type:\n")
    for file_type, files in sorted(categories.items()):
        print(f"{file_type}: {len(files)} files")
        for filename in files[:3]:  # Show first 3
            print(f"  - {filename}")
        if len(files) > 3:
            print(f"  ... and {len(files) - 3} more")
        print()
The following sample file is used in this example: mixed_files.zip
Expected behavior: Groups files by their detected format (PDF, DOCX, XLSX, etc.).
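Keeping the summary formatting separate from the parsing makes it possible to test the output without a container file. The following is a minimal sketch of that split. `summarize_categories` is a hypothetical name, and the format names in the sample data are illustrative; it also assumes the category keys are strings so that they can be sorted.

```python
def summarize_categories(categories, limit=3):
    """Build summary lines for a {file_type: [names]} mapping."""
    lines = []
    for file_type, files in sorted(categories.items()):
        lines.append(f"{file_type}: {len(files)} files")
        # List at most `limit` names per type
        lines.extend(f"  - {name}" for name in files[:limit])
        if len(files) > limit:
            lines.append(f"  ... and {len(files) - limit} more")
    return lines


# Illustrative data in the shape returned by categorize_by_file_type
sample = {"Pdf": ["a.pdf", "b.pdf", "c.pdf", "d.pdf"], "Docx": ["x.docx"]}
print("\n".join(summarize_categories(sample)))
```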
Filter items by supported formats
To identify which items can be processed:
    from groupdocs.parser import Parser


    def get_supported_items(file_path):
        """
        Split container items into those that support text extraction
        and those that do not.
        """
        supported = []
        unsupported = []

        with Parser(file_path) as parser:
            attachments = parser.get_container()
            if attachments is None:
                print("Container extraction not supported")
                # Return the same shape as the normal result
                return {'supported': supported, 'unsupported': unsupported}

            for attachment in attachments:
                try:
                    with attachment.open_parser() as file_parser:
                        # Check if text extraction is supported
                        if file_parser.features.text:
                            info = file_parser.get_document_info()
                            file_type = (info.file_type.file_format
                                         if info and info.file_type else "Unknown")
                            supported.append({
                                'name': attachment.name,
                                'type': file_type,
                                'size': attachment.size
                            })
                        else:
                            unsupported.append(attachment.name)
                except Exception:
                    unsupported.append(attachment.name)

        return {'supported': supported, 'unsupported': unsupported}


    # Usage
    result = get_supported_items("documents.zip")
    print(f"Supported files ({len(result['supported'])}):")
    for item in result['supported']:
        print(f"  {item['name']} [{item['type']}]")
    print(f"Unsupported files ({len(result['unsupported'])}):")
    for name in result['unsupported']:
        print(f"  {name}")
Expected behavior: Separates items into supported and unsupported categories based on feature availability.
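Once the supported items are collected, their sizes can be aggregated per type to estimate how much data each format contributes. This is a sketch only: `bytes_by_type` is a hypothetical helper, and it assumes each record's `size` is a byte count (or `None`), which the example above does not confirm.

```python
from collections import Counter


def bytes_by_type(supported_items):
    """Sum the 'size' of supported items per detected 'type'."""
    totals = Counter()
    for item in supported_items:
        # Treat a missing or None size as zero bytes
        totals[item["type"]] += item.get("size") or 0
    return dict(totals)


# Illustrative records in the shape built by get_supported_items
records = [
    {"name": "a.pdf", "type": "Pdf", "size": 10},
    {"name": "b.pdf", "type": "Pdf", "size": 5},
    {"name": "c.docx", "type": "Docx", "size": None},
]
print(bytes_by_type(records))
```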
Create file type report
To generate a detailed report of file types in the container:
    import json

    from groupdocs.parser import Parser


    def create_file_type_report(file_path, output_json):
        """
        Create detailed file type report for container contents.
        """
        with Parser(file_path) as parser:
            attachments = parser.get_container()
            if attachments is None:
                print("Container extraction not supported")
                return False

            report = {'container': file_path, 'items': [], 'summary': {}}
            type_counts = {}

            for attachment in attachments:
                item_info = {
                    'name': attachment.name,
                    'size': attachment.size,
                    'path': attachment.file_path or ''
                }
                try:
                    with attachment.open_parser() as file_parser:
                        info = file_parser.get_document_info()
                        if info and info.file_type:
                            file_type = info.file_type.file_format
                            extension = info.file_type.extension
                            item_info['file_type'] = file_type
                            item_info['extension'] = extension
                            item_info['page_count'] = (
                                info.page_count if hasattr(info, 'page_count') else None
                            )
                            # Update counts
                            type_counts[file_type] = type_counts.get(file_type, 0) + 1
                        else:
                            item_info['file_type'] = 'Unknown'
                            type_counts['Unknown'] = type_counts.get('Unknown', 0) + 1
                except Exception as e:
                    item_info['file_type'] = 'Error'
                    item_info['error'] = str(e)
                    type_counts['Error'] = type_counts.get('Error', 0) + 1

                report['items'].append(item_info)

            report['summary'] = {
                'total_items': len(report['items']),
                'type_distribution': type_counts
            }

        # Save report
        with open(output_json, 'w', encoding='utf-8') as f:
            json.dump(report, f, indent=2, ensure_ascii=False)

        print(f"File type report saved to {output_json}")
        print("Summary:")
        print(f"  Total items: {report['summary']['total_items']}")
        for file_type, count in sorted(type_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"  {file_type}: {count}")

        return True


    # Usage
    create_file_type_report("archive.zip", "file_type_report.json")
Expected behavior: Creates a comprehensive JSON report with file type information and statistics.
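A saved report can be read back with the standard `json` module to find items that need manual review. The sketch below assumes the report layout written above (an `items` list whose entries carry `name` and `file_type`). `items_needing_review` is a hypothetical helper, and the sample report is hand-written for illustration.

```python
import json
import os
import tempfile


def items_needing_review(report_path):
    """Return names of report items marked 'Error' or 'Unknown'."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return [
        item["name"]
        for item in report.get("items", [])
        if item.get("file_type") in ("Error", "Unknown")
    ]


# Demo with a hand-written report in the same shape as above
sample = {
    "container": "archive.zip",
    "items": [
        {"name": "a.pdf", "file_type": "Pdf"},
        {"name": "b.bin", "file_type": "Unknown"},
    ],
    "summary": {},
}
path = os.path.join(tempfile.mkdtemp(), "file_type_report.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)
print(items_needing_review(path))
```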
Detect and validate file types
To detect file types and validate against expected types:
    from groupdocs.parser import Parser


    def validate_container_contents(file_path, expected_types):
        """
        Validate that container only contains expected file types.

        Args:
            file_path: Path to container
            expected_types: List of expected file formats (e.g., ['Pdf', 'Docx'])
        """
        with Parser(file_path) as parser:
            attachments = parser.get_container()
            if attachments is None:
                print("Container extraction not supported")
                return False

            valid_items = []
            invalid_items = []

            for attachment in attachments:
                try:
                    with attachment.open_parser() as file_parser:
                        info = file_parser.get_document_info()
                        if info and info.file_type:
                            file_type = info.file_type.file_format
                            if file_type in expected_types:
                                valid_items.append({'name': attachment.name, 'type': file_type})
                            else:
                                invalid_items.append({
                                    'name': attachment.name,
                                    'type': file_type,
                                    'reason': 'Unexpected type'
                                })
                        else:
                            invalid_items.append({
                                'name': attachment.name,
                                'type': 'Unknown',
                                'reason': 'Type not detected'
                            })
                except Exception as e:
                    invalid_items.append({
                        'name': attachment.name,
                        'type': 'Error',
                        'reason': str(e)
                    })

        # Print results
        print("Validation Results:")
        print(f"  Valid items: {len(valid_items)}")
        print(f"  Invalid items: {len(invalid_items)}")
        if invalid_items:
            print("Invalid items:")
            for item in invalid_items:
                print(f"  {item['name']}: {item['type']} - {item['reason']}")

        return len(invalid_items) == 0


    # Usage - validate that archive contains only PDFs and Word documents
    is_valid = validate_container_contents("documents.zip", ['Pdf', 'Docx', 'Doc'])
    print(f"Container is valid: {is_valid}")
The following sample file is used in this example: documents.zip
Expected behavior: Validates container contents against expected file types and reports any violations.
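The membership test in the example is case-sensitive, so `'pdf'` in `expected_types` would not match a detected `'Pdf'`. A minimal sketch of a more forgiving comparison follows. Both helpers are hypothetical, and the exact strings returned by `file_format` are an assumption to verify against your own output.

```python
def normalize_format(name):
    """Lower-case a format name and drop surrounding spaces and a leading dot."""
    return str(name).strip().lstrip(".").lower()


def is_expected(detected_format, expected_types):
    """Case-insensitive membership test for a detected format."""
    allowed = {normalize_format(t) for t in expected_types}
    return normalize_format(detected_format) in allowed


print(is_expected("PDF", ["pdf", "Docx"]))  # True
print(is_expected("Xlsx", ["Pdf"]))         # False
```

Replacing `file_type in expected_types` with `is_expected(file_type, expected_types)` would apply this check inside the validation loop.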
Extract metadata based on file type
To extract different metadata based on detected file type:
    from groupdocs.parser import Parser


    def extract_type_specific_metadata(file_path):
        """
        Extract metadata specific to each file type.
        """
        with Parser(file_path) as parser:
            attachments = parser.get_container()
            if attachments is None:
                print("Container extraction not supported")
                return

            for attachment in attachments:
                print(f"{'=' * 60}")
                print(f"File: {attachment.name}")
                try:
                    with attachment.open_parser() as file_parser:
                        info = file_parser.get_document_info()
                        if info and info.file_type:
                            print(f"Type: {info.file_type.file_format}")

                            # Get metadata
                            metadata = file_parser.get_metadata()
                            if metadata:
                                print("\nMetadata:")
                                for item in metadata:
                                    print(f"  {item.name}: {item.value}")

                            # Get page count if available
                            if hasattr(info, 'page_count'):
                                print(f"\nPages: {info.page_count}")
                except Exception as e:
                    print(f"Error: {e}")


    # Usage
    extract_type_specific_metadata("mixed_documents.zip")