Extract data from attachments and ZIP archives
Extracting data, text, and images from documents inside ZIP archives is straightforward, and every GroupDocs.Parser feature is available for them. The same feature also retrieves attachments from PDF documents and emails and extracts data from them.
To extract documents from ZIP files and get attachments from containers, call the getContainer method:
Iterable<ContainerItem> getContainer();
This method returns a collection of ContainerItem objects:
Member | Description |
---|---|
getName | The name of the item. |
getDirectory | The directory of the item. |
getFilePath | The full path of the item. |
getSize | The size of the item in bytes. |
getMetadata | The collection of item metadata. |
detectFileType(FileTypeDetectionMode) | Detects a file type of the container item. |
openStream | Opens the stream of the item content. |
openParser | Creates the Parser object for the item content. |
openParser(LoadOptions) | Creates the Parser object for the item content with LoadOptions. |
openParser(LoadOptions, ParserSettings) | Creates the Parser object for the item content with LoadOptions and ParserSettings. |
A container represents both container-only files (such as ZIP archives and Outlook storage files) and documents with attachments (such as emails and PDF Portfolios).
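To make the members in the table more concrete, here is a minimal sketch that uses only the JDK's java.util.zip package, not GroupDocs.Parser. It lists the entries of an in-memory ZIP archive and records, for each one, a full path, a directory, a name, a size, and the content read from the entry stream. The `ZipEntryListing` and `Entry` names, the `/` separator, and the way the path is split are assumptions made for illustration; the library's own ContainerItem may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustration only: a JDK-based analogue of a container listing, not the GroupDocs.Parser API.
final class ZipEntryListing {

    // Hypothetical value type loosely mirroring getFilePath, getDirectory, getName and getSize.
    static final class Entry {
        final String filePath;
        final String directory;
        final String name;
        final long size;
        final String text;

        Entry(String filePath, byte[] content) {
            // Assumption: '/' separates the directory from the name.
            int slash = filePath.lastIndexOf('/');
            this.filePath = filePath;
            this.directory = slash < 0 ? "" : filePath.substring(0, slash);
            this.name = filePath.substring(slash + 1);
            this.size = content.length;
            this.text = new String(content, StandardCharsets.UTF_8);
        }
    }

    // Builds a two-entry ZIP archive in memory so the example needs no sample file.
    static byte[] buildSampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/notes.txt"));
            zip.write("some notes".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads every file entry of the archive; directory entries are skipped.
    static List<Entry> list(byte[] zipBytes) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read the current entry to its end; read() returns -1 at the entry boundary.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                entries.add(new Entry(entry.getName(), content.toByteArray()));
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        for (Entry e : list(buildSampleZip())) {
            System.out.println(e.filePath + " | " + e.directory + " | " + e.name + " | " + e.size);
        }
    }
}
```

The size is taken from the bytes actually read rather than from the entry header, because a streamed ZIP entry may not report its size until its content has been consumed.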
Here are the steps to extract text from ZIP entries:
- Instantiate a Parser object for the initial document;
- Call the getContainer method to obtain a collection of container item objects;
- Check that the collection isn't null (a null value means container extraction isn't supported for the document);
- Iterate through the collection and open a Parser object for each item to extract its text.
The following example shows how to extract text from ZIP entries:
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleZip)) {
    // Extract attachments from the container
    Iterable<ContainerItem> attachments = parser.getContainer();
    // Check if container extraction is supported
    if (attachments == null) {
        System.out.println("Container extraction isn't supported");
        // Stop here: iterating over a null collection would throw a NullPointerException
        return;
    }
    // Iterate over ZIP entries
    for (ContainerItem item : attachments) {
        // Print the file path
        System.out.println(item.getFilePath());
        try {
            // Create a Parser object for the ZIP entry content
            try (Parser attachmentParser = item.openParser()) {
                // Extract the ZIP entry text
                try (TextReader reader = attachmentParser.getText()) {
                    System.out.println(reader == null ? "No text" : reader.readToEnd());
                }
            }
        } catch (UnsupportedDocumentFormatException ex) {
            System.out.println("Isn't supported.");
        }
    }
}
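The example above reads text through a nested Parser. When the raw bytes of an item are needed instead, the members table lists openStream for that purpose. The following minimal sketch shows one way to save such a stream under a target folder using only the JDK. It assumes that openStream returns a java.io.InputStream; the `AttachmentSaver` class, the `save` method, and the path check are hypothetical helpers, not part of GroupDocs.Parser. The check rejects item paths that would resolve outside the target folder, since paths stored in an archive are untrusted input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes the content of one container item below a target folder.
final class AttachmentSaver {

    static Path save(InputStream content, Path targetDir, String itemPath) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Path target = root.resolve(itemPath).normalize();
        // Reject empty paths, absolute paths, and paths such as "../x" that leave the folder.
        if (!target.startsWith(root) || target.equals(root)) {
            throw new IOException("Item path escapes the target folder: " + itemPath);
        }
        Files.createDirectories(target.getParent());
        Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("attachments");
        // A byte array stands in for the stream of a real container item.
        try (InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8))) {
            System.out.println(save(in, dir, "docs/notes.txt"));
        }
    }
}
```

Inside the loop over container items, a call might look like `try (InputStream in = item.openStream()) { AttachmentSaver.save(in, outputDir, item.getFilePath()); }`, where `outputDir` is a folder of your choice. Whether getFilePath always yields a relative path with a separator the local file system accepts is an assumption to verify for your container types.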
More resources
Advanced usage topics
To learn more about document data extraction features and how to extract text, images, forms, and more, please refer to the advanced usage section.
GitHub examples
You can run the code above and see the feature in action in our GitHub examples.
Free online document parser App
Along with the full-featured Java library, we provide simple but powerful free apps.
You are welcome to extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails, and more with our Free Online Document Parser App.