Working with Metadata Extractors
GroupDocs.Parser also provides a simplified way to extract metadata from documents in the supported file formats. The following classes are used to extract metadata:
| Class | Description |
|---|---|
| CellsMetadataExtractor | Provides the functionality to extract metadata from spreadsheets. |
| SlidesMetadataExtractor | Provides the functionality to extract metadata from presentations. |
| WordsMetadataExtractor | Provides the functionality to extract metadata from text documents. |
| PdfMetadataExtractor | Provides the functionality to extract metadata from PDF documents. |
| EmailMetadataExtractor | Provides the functionality to extract metadata from email messages. |
| EpubMetadataExtractor | Provides the functionality to extract metadata from EPUB documents. |
| FictionBookMetadataExtractor | Provides the functionality to extract metadata from FictionBook (fb2) documents. |
All of these classes inherit from the abstract MetadataExtractor class, which defines the interface for extracting metadata from documents.
The following methods are used to extract metadata from documents:
| Method | Description |
|---|---|
| extractMetadata(Stream stream) | Extracts metadata from the stream |
| extractMetadata(string fileName) | Extracts metadata from the file |
Both methods return an instance of the MetadataCollection class, which provides a dictionary-style collection of metadata. The MetadataNames class contains all supported metadata keys. It is recommended to use the MetadataNames constants rather than string literals when retrieving values from a MetadataCollection, because a mistyped constant is caught at compile time while a mistyped literal is not.
The following code sample shows how to extract metadata from a text document. The same technique applies to the other document formats when the relevant metadata extractor class is used.
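The pattern described above (an abstract extractor with a stream overload and a file-name overload, a dictionary-style result, and named key constants) can be illustrated with a small, self-contained sketch that uses only the JDK. Everything in it is a hypothetical stand-in rather than GroupDocs.Parser code: `SketchMetadataExtractor`, `HeaderLineExtractor`, and `SketchNames` are invented names, a plain `Map` plays the role of MetadataCollection, the `InputStream` and `String` parameter types are assumptions, and the toy "key: value" header format exists only to make the example runnable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;

public class MetadataSketch {

    /** Hypothetical key constants, playing the role the text assigns to MetadataNames. */
    static final class SketchNames {
        static final String TITLE = "title";
        static final String AUTHOR = "author";
        private SketchNames() {}
    }

    /** Hypothetical abstract base with a stream overload and a file-name overload. */
    abstract static class SketchMetadataExtractor {
        abstract Map<String, String> extractMetadata(InputStream stream) throws IOException;

        // The file-name overload opens the file and delegates to the stream overload.
        Map<String, String> extractMetadata(String fileName) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
                return extractMetadata(in);
            }
        }
    }

    /** Toy concrete extractor: reads "key: value" lines until the first blank line. */
    static final class HeaderLineExtractor extends SketchMetadataExtractor {
        @Override
        Map<String, String> extractMetadata(InputStream stream) {
            Map<String, String> metadata = new LinkedHashMap<>();
            Scanner scanner = new Scanner(stream, StandardCharsets.UTF_8.name());
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.trim().isEmpty()) {
                    break; // a blank line ends the header section
                }
                int colon = line.indexOf(':');
                if (colon > 0) {
                    String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    metadata.put(key, line.substring(colon + 1).trim());
                }
            }
            return metadata;
        }
    }

    /** Runs the toy extractor on an in-memory document and returns its metadata. */
    static Map<String, String> demo() throws IOException {
        String document = "Title: Quarterly report\nAuthor: J. Doe\n\nBody text...";
        SketchMetadataExtractor extractor = new HeaderLineExtractor();
        return extractor.extractMetadata(
                new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> metadata = demo();
        // Look values up through constants instead of string literals.
        System.out.println(metadata.get(SketchNames.TITLE));
        System.out.println(metadata.get(SketchNames.AUTHOR));
    }
}
```

Because callers depend only on the abstract base type, switching formats in this sketch means changing the single line that constructs the extractor. That is the property the text attributes to the concrete metadata extractor classes.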
Extracting Metadata using Extractor Class
The Extractor class extracts metadata from any supported document format without requiring a concrete metadata extractor class such as WordsMetadataExtractor. Since version 18.12, GroupDocs.Parser allows you to use the Extractor class to extract metadata from the following text and presentation template formats:
- dotx (Template)
- dotm (Macro-enabled template)
- ott (OpenDocument Text Template)
- potx (Template)
- potm (Macro-enabled template)
- ppsm (Macro-enabled slideshow)
- pptm (Macro-enabled presentation)
The following code sample shows how to extract metadata from documents using the Extractor class.
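The source does not describe how the Extractor class selects a handler for a given file. One plausible design, shown here purely as an assumption, is a facade that dispatches on the file extension. The sketch below is self-contained and uses only the JDK. `ExtractorFacadeSketch`, `register`, and `withTemplateFormats` are hypothetical names, and the handlers return placeholder values instead of reading real files.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class ExtractorFacadeSketch {

    // Hypothetical registry: lower-case file extension -> handler for that format family.
    private final Map<String, Function<String, Map<String, String>>> handlers =
            new LinkedHashMap<>();

    /** Associates one handler with several file extensions. */
    void register(Function<String, Map<String, String>> handler, String... extensions) {
        for (String extension : extensions) {
            handlers.put(extension.toLowerCase(Locale.ROOT), handler);
        }
    }

    /** Picks a handler from the file extension, so callers never name a concrete class. */
    Map<String, String> extractMetadata(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<String, Map<String, String>> handler = handlers.get(extension);
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported format: " + extension);
        }
        return handler.apply(fileName);
    }

    /** Builds a facade covering the template formats listed in the text. */
    static ExtractorFacadeSketch withTemplateFormats() {
        ExtractorFacadeSketch facade = new ExtractorFacadeSketch();
        facade.register(name -> placeholder("text document", name), "dotx", "dotm", "ott");
        facade.register(name -> placeholder("presentation", name),
                "potx", "potm", "ppsm", "pptm");
        return facade;
    }

    // Placeholder result: records which family handled the file, not real metadata.
    private static Map<String, String> placeholder(String family, String fileName) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("family", family);
        metadata.put("source", fileName);
        return metadata;
    }

    public static void main(String[] args) {
        ExtractorFacadeSketch facade = withTemplateFormats();
        System.out.println(facade.extractMetadata("letter.dotx").get("family"));
        System.out.println(facade.extractMetadata("deck.pptm").get("family"));
    }
}
```

The trade-off in this design is convenience against control. The caller passes only a file name, while a concrete extractor class remains the better choice when the format is known in advance or cannot be inferred from the name.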
An EPUB document can contain one or more packages, and each package has its own metadata collection. The ComplexMetadataExtractor class is used to work with such documents. Its extractComplexMetadata methods return an enumerator over all of the document's metadata collections. EpubMetadataExtractor inherits from ComplexMetadataExtractor.
The following code snippet shows how to extract complex metadata.
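The idea of "one metadata collection per package, returned through an enumerator" can be sketched without the library. The example below is a minimal, JDK-only illustration. `ComplexMetadataSketch`, its methods, and the sample package data are hypothetical stand-ins, a `Map` stands in for MetadataCollection, and a `java.util.Iterator` stands in for the enumerator. It does not parse EPUB files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComplexMetadataSketch {

    /** Hypothetical sample: a document with two packages, each with its own metadata. */
    static List<Map<String, String>> samplePackages() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("title", "Volume One");
        first.put("language", "en");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("title", "Volume Two");
        second.put("language", "fr");

        List<Map<String, String>> packages = new ArrayList<>();
        packages.add(first);
        packages.add(second);
        return packages;
    }

    /** Returns an iterator over read-only copies of every package's metadata. */
    static Iterator<Map<String, String>> extractComplexMetadata(
            List<Map<String, String>> packages) {
        List<Map<String, String>> snapshot = new ArrayList<>();
        for (Map<String, String> metadata : packages) {
            snapshot.add(Collections.unmodifiableMap(new LinkedHashMap<>(metadata)));
        }
        return snapshot.iterator();
    }

    /** Walks all collections and gathers one value from each of them. */
    static List<String> titles(Iterator<Map<String, String>> collections) {
        List<String> result = new ArrayList<>();
        while (collections.hasNext()) {
            Map<String, String> metadata = collections.next();
            result.add(metadata.getOrDefault("title", "(untitled)"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String title : titles(extractComplexMetadata(samplePackages()))) {
            System.out.println(title);
        }
    }
}
```

Iterating over all collections, rather than reading only the first, matters for multi-package documents, where values such as the title or language can differ from one package to the next.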