GroupDocs.Parser for .NET 17.07 Release Notes

Major Features

This release includes the following features:

  • Implement the ability to extract text from PDF portfolios
  • Implement IContainer interface support for email text extractors
  • Implement the support for DOT files
  • Implement IPageTextExtractor interface

All Changes

Key          Summary                                                           Issue Type
TEXTNET-628  Implement the ability to extract text from PDF portfolios         New feature
TEXTNET-648  Implement IContainer interface support for email text extractors  New feature
TEXTNET-650  Implement the support for DOT files                               New feature
TEXTNET-666  Implement IPageTextExtractor interface                            New feature

Public API and Backward Incompatible Changes

Implement the ability to extract text from PDF portfolios

This feature makes it possible to extract text from the files embedded in PDF Portfolio documents.

Public API changes
Added Entities property to PdfTextExtractor class.
Added OpenEntityStream method to PdfTextExtractor class.

Usage:

C#

// Create an extractor factory
var factory = new ExtractorFactory();
// Create an instance of PdfTextExtractor class
var extractor = new PdfTextExtractor(fileName);
// Iterate over all files in the portfolio
for (var i = 0; i < extractor.Entities.Count; i++)
{
    // Print the name of a file
    Console.WriteLine(extractor.Entities[i].Name);
    // Open the stream of a file
    using (var stream = extractor.Entities[i].OpenStream())
    {
        // Create the text extractor for a file
        var entityExtractor = factory.CreateTextExtractor(stream);
        // If a media type is supported
        if (entityExtractor != null)
        {
            try
            {
                // Print the content of a file
                Console.WriteLine(entityExtractor.ExtractAll());
            }
            finally
            {
                // Release the extractor created for this file
                entityExtractor.Dispose();
            }
        }
    }
}

Extract text from email attachments (using the IContainer interface)

This feature makes it possible to work with an email text extractor as a container, so that each attachment can be enumerated and opened as a separate entity.

Public API changes
Added IContainer interface.
Added OpenEntityStream method to Container class.
Added Entities property to EmailTextExtractorBase class.
Added OpenEntityStream method to EmailTextExtractorBase class.

Usage:

C#

// Create an extractor factory
var factory = new ExtractorFactory();
// Create an instance of EmailTextExtractor class
var extractor = new EmailTextExtractor(fileName);
// Iterate over all attachments in the message
for (var i = 0; i < extractor.Entities.Count; i++)
{
    // Print the name of an attachment
    Console.WriteLine(extractor.Entities[i].Name);
    // Open the stream of an attachment
    using (var stream = extractor.Entities[i].OpenStream())
    {
        // Create the text extractor for an attachment
        var attachmentExtractor = factory.CreateTextExtractor(stream);
        // If a media type is supported
        if (attachmentExtractor != null)
        {
            try
            {
                // Print the content of an attachment
                Console.WriteLine(attachmentExtractor.ExtractAll());
            }
            finally
            {
                // Release the extractor created for this attachment
                attachmentExtractor.Dispose();
            }
        }
    }
}
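
Both container samples above follow one pattern: enumerate the entities, open a stream for each one, and read it. The sketch below isolates that pattern so it can be run without the library. IContainerSketch, InMemoryContainer and ContainerWalker are hypothetical stand-ins invented for this illustration. They are not GroupDocs.Parser types, and the real IContainer member signatures may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for a container interface. These notes name
// IContainer, Entities and OpenEntityStream but give no signatures,
// so the members below are assumptions made for the sketch.
public interface IContainerSketch
{
    IList<string> EntityNames { get; }
    Stream OpenEntityStream(int index);
}

// In-memory container that exists only to make the sketch runnable.
public sealed class InMemoryContainer : IContainerSketch
{
    private readonly List<KeyValuePair<string, string>> _items =
        new List<KeyValuePair<string, string>>();

    public void Add(string name, string text)
    {
        _items.Add(new KeyValuePair<string, string>(name, text));
    }

    public IList<string> EntityNames
    {
        get { return _items.ConvertAll(item => item.Key); }
    }

    public Stream OpenEntityStream(int index)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(_items[index].Value));
    }
}

public static class ContainerWalker
{
    // Reads every entity as UTF-8 text. A real caller would instead pass
    // each stream to ExtractorFactory.CreateTextExtractor, as shown above.
    public static IList<string> ReadAll(IContainerSketch container)
    {
        var result = new List<string>();
        for (var i = 0; i < container.EntityNames.Count; i++)
        {
            using (var stream = container.OpenEntityStream(i))
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                result.Add(container.EntityNames[i] + ": " + reader.ReadToEnd());
            }
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        var container = new InMemoryContainer();
        container.Add("a.txt", "first attachment");
        container.Add("b.txt", "second attachment");
        foreach (var line in ContainerWalker.ReadAll(container))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because the walker depends only on the interface, the same loop would serve a PDF portfolio or an email message, which is the motivation for treating both extractors as containers.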

Implement the support for DOT files

This feature makes it possible to extract both formatted and raw text from .DOT files.

Public API changes
None

Usage:

C#

// Create an instance of WordsTextExtractor class
using (var extractor = new WordsTextExtractor("sample.dot"))
{
    // Extract the text
    Console.WriteLine(extractor.ExtractAll());
}

Implement IPageTextExtractor interface

This feature makes it possible to work with document pages in the same way for all supported document types.

Public API changes
Added IPageTextExtractor interface.
Added PageCount property to CellsTextExtractor, CellsFormattedTextExtractor, SlidesTextExtractor and SlidesFormattedTextExtractor classes.
Added ExtractPage method to CellsTextExtractor, CellsFormattedTextExtractor, SlidesTextExtractor and SlidesFormattedTextExtractor classes.

Usage:

C#

// Create an extractor factory
var factory = new ExtractorFactory();
// Create an instance of text extractor class
using (var extractor = factory.CreateTextExtractor(fileName))
{
    // Check if IPageTextExtractor is supported
    var pageTextExtractor = extractor as IPageTextExtractor;
    if (pageTextExtractor != null)
    {
        // Iterate over all pages
        for (var i = 0; i < pageTextExtractor.PageCount; i++)
        {
            // Print the page index and the total page count
            Console.WriteLine(string.Format("{0}/{1}", i, pageTextExtractor.PageCount));
            // Extract the text from the page
            Console.WriteLine(pageTextExtractor.ExtractPage(i));
        }
    }
}