GroupDocs.Parser for .NET 17.05 Release Notes

Major Features

There are the following features in this release:

  • Ability to extract a text with its structure from EPUB documents.
  • Ability to extract a formatted text from EPUB documents.
  • Ability to extract a text from fb2 (FictionBook) documents.
  • Ability to extract highlights from fb2 (FictionBook) documents.
  • Ability to search a text in fb2 (FictionBook) documents.
  • Ability to extract metadata from fb2 (FictionBook) documents.
  • Ability to detect fb2 (FictionBook) documents.
  • Ability to extract a text with its structure from fb2 (FictionBook) documents.
  • Ability to use Metered keys.

All Changes

KeySummaryIssue Type
TEXTNET-511Implement the ability to extract a structured text from EPUB filesNew feature
TEXTNET-536Implement the ability to extract a formatted text from EPUB filesNew feature
TEXTNET-537Implement the ability to extract a text from fb2 filesNew feature
TEXTNET-538Implement the ability to extract metadata from fb2 filesNew feature
TEXTNET-539Implement the ability to extract highlights from fb2 filesNew feature
TEXTNET-540Implement the ability to search a text in fb2 filesNew feature
TEXTNET-604Implement the ability to extract a structured text from fb2 filesNew feature
TEXTNET-615Implement the support for Dynabic.MeteredNew feature
TEXTNET-616Implement the media type detector for fb2 documentsNew feature

Public API and Backward Incompatible Changes

Implement the ability to extract a structured text from EPUB files

This feature allows to extract a text with its structure from EPUB documents.

Public API Changes
Added SectionProperties class.
Added Section item to StructuredElementType enum.
Added Section event to StructuredHandler class.
Added OnStartSection protected method to StructuredHandler class.
Added ExtractStructured method to EpubTextExtractor class.

Extracting section titles from EPUB document:

C#

// Create a text extractor
using (EpubTextExtractor extractor = new EpubTextExtractor(stream))
{
    StringBuilder sb = null;
    bool isSectionHasTitle = false;

    // Create a handler
    StructuredHandler handler = new StructuredHandler();

    // Handle ElementText event to process a section
    handler.Section += (sender, e) =>
    {
        isSectionHasTitle = false; // a new section doesn't have a title
    };

    // Handle Paragraph event to process a paragraph
    handler.Paragraph += (sender, e) =>
    {
        // is paragraph a heading?
        bool isHeading = ParagraphStyle.Heading1 <= e.Properties.Style && e.Properties.Style <= ParagraphStyle.Heading6;

        if (isHeading && !isSectionHasTitle)
        {
            sb = new StringBuilder();
        }
    };

    // Handle ElementClosed event to process a closing of a paragraph
    handler.ElementClosed += (sender, e) =>
    {
        // Check if closing tag is paragraph
        if (sb != null && e.Properties is ParagraphProperties)
        {
            // Print a title to the console
            Console.WriteLine(sb.ToString());
            // Section has a title
            isSectionHasTitle = true;
            sb = null;
        }
    };

    // Handle ElementText event to process a text
    handler.ElementText += (sender, e) =>
    {
        if (sb != null)
        {
            // Add a text to the title
            sb.Append(e.Text);
        }
    };

    // Extract a text with its structure
    extractor.ExtractStructured(handler);
}

Implement the ability to extract a formatted text from EPUB files

This feature allows to extract a formatted text from EPUB documents.

Public API changes
Added EpubTextExtractorBase class.
Added EpubFormattedTextExtractor class.

Extracting a formatted text:

C#

// Create a formatted text extractor for EPUB documents
using (var extractor = new EpubFormattedTextExtractor(stream)) {
  // Set a document formatter to Markdown
  extractor.DocumentFormatter = new MarkdownDocumentFormatter();
  // Extact a text and print it to the console
  Console.Write(extractor.ExtractAll());
}

Implement the ability to extract a text from fb2 files

This feature allows to extract a text form fb2 (FictionBook) documents.

Public API changes
Added FictionBookTextExtractor class.

Extracting the whole text at once:

C#

using(var extractor = new FictionBookTextExtractor(stream))
{
   Console.Write(extractor.ExtractAll());
}

Extracting a text by lines:

C#

using(var extractor = new FictionBookTextExtractor(stream))
{
   string line = extractor.ExtractLine();
   while(line != null)
   {
      Console.Write(line);
      line = extractor.ExtractLine();
   }
}

Implement the ability to extract metadata from fb2 files

This feature allows to extract metadata from fb2 (FictionBook) documents.

Public API changes
Added FictionBookMetadataExtractor class.

C#

// Create a metadata extractor 
FictionBookMetadataExtractor extractor = new FictionBookMetadataExtractor();
// Extract metadata
 MetadataCollection metadata = extractor.ExtractMetadata(stream);
// Iterate metadata values 
foreach (string key in metadata.Keys)
{
    // Print a metadata key/value pair to the console    
   Console.WriteLine(string.Format("{0} = {1}", key, metadata[key]));
}

Implement the ability to extract highlights from fb2 files

This feature allows to extract a highlight from fb2 (FictionBook) documents.

Public API changes
Added ExtractHighlights method to FictionBookTextExtractor class.

C#

// Create a text extractor 
using (FictionBookTextExtractor extractor = new FictionBookTextExtractor(stream))
{
    // Extract two highlights with the fixed position and length     
	var highlights = extractor.ExtractHighlights(
        HighlightOptions.CreateFixedLengthOptions(HighlightDirection.Left, 1962, 22),
        HighlightOptions.CreateFixedLengthOptions(HighlightDirection.Right, 1982, 10));

    for(int i = 0; i < highlights.Count; i++)
    {
        // Print highlights to the console        
	    Console.WriteLine(highlights[i]);
    }
}

Implement the ability to search a text from fb2 files

This feature allows to search with a regular expression or search a text in fb2 (FictionBook) documents.

Public API changes
Added Search methods to FictionBookTextExtractor class.
Added SearchWithRegex method to FictionBookTextExtractor class.

Searching with a regular expression:

C#

// Create a text extractor 
using (var extractor = new FictionBookTextExtractor(stream)) {
  // Create search options   
	var searchOptions = new RegexSearchOptions(); 
  // Create a search handler. ListSearchHandler collects search results to the list  
	 var handler = new ListSearchHandler(); 
  // Search with a regular expression   
	extractor.SearchWithRegex("On[a-z]", handler, searchOptions); 

  // If list doesn't contain any results   
  if (handler.List.Count == 0)
  {
    Console.WriteLine("Not found"); // Print "Not Found" to the console  
  }
  else
  {
    // Iterate search results     
	for (int i = 0; i < handler.List.Count; i++)
    {
      // Print a search result to the console       
	  Console.Write(handler.List[i].LeftText); // a text on the left side from the found text      
	  Console.Write("_");
      Console.Write(handler.List[i].FoundText); // the found text       
      Console.Write("_");
      Console.Write(handler.List[i].RightText); // a text on the right side from the found text       
      Console.WriteLine("---");
    }
  }
}

Searching a text:

C#

// Create a text extractor 
using (var extractor = new FictionBookTextExtractor(stream)) {
  // Create search options   
	var options = new SearchOptions(SearchHighlightOptions.CreateFixedLengthOptions(0));
  // Create a search handler. ListSearchHandler collects search results to the list  
	var handler = new ListSearchHandler();
  // Create keywords to search   
	var keywords = new string[] { "One" };
  // Search keywords   
	extractor.Search(options, handler, keywords);

  // If list doesn't contain any results   
  if (handler.List.Count == 0)
  {
    Console.WriteLine("Not found"); 
	// Print "Not Found" to the console   
  }
  else
  {
    // Iterate search results     
	for (int i = 0; i < handler.List.Count; i++)
    {
      // Print a search result to the console       
	  Console.Write(handler.List[i].LeftText); 
	  // a text on the left side from the found text       
	  Console.Write("_");
      Console.Write(handler.List[i].FoundText); 
      // the found text       
      Console.Write("_");
      Console.Write(handler.List[i].RightText); 
      // a text on the right side from the found text       
      Console.WriteLine("---");
    }
  }
}

Implement the ability to extract a structured text from fb2 files

This feature allows to extract a text with its structure from fb2 (FictionBook) documents.

Public API changes
Added LineBreakProperties class.
Added LineBreak item to StructuredElementType enum.
Added LineBreak event to StructuredHandler class.
Added OnLineBreak protected method to StructuredHandler class.
Added GroupProperties class.
Added Group item to StructuredElementType enum.
Added Group event to StructuredHandler class.
Added OnStartGroup protected method to StructuredHandler class.
Added ExtractStructured method to FictionBookTextExtractor class.

Extracting section titles from fb2 document:

C#

// Create a text extractor 
using (FictionBookTextExtractor extractor = new FictionBookTextExtractor(File.OpenRead(@"C:\Sources\GroupDocs.Parser\GroupDocs.Parser.Net\testdata\unit\fb2\complex.fb2")))
{
    StringBuilder sb = null;

    // Create a handler     
	StructuredHandler handler = new StructuredHandler();

    // Handle Group event to process a group     
	handler.Group += (sender, e) =>
    {
        StructuredHandler h = sender as StructuredHandler;

        // Is the group a section title?         
		bool isSectionTitleGroup = h != null && h.Depth > 1 && h[0] is SectionProperties;

        // If a group is the section title        
	    if(isSectionTitleGroup)
        {
            sb = new StringBuilder();
        }
    };

    // Handle Paragraph event to process a paragraph     
	handler.Paragraph += (sender, e) =>
    {
        if (sb != null && sb.Length > 0)
        {
            sb.AppendLine();
        }
    };

    // Handle ElementClosed event to process a closing of a paragraph     
	handler.ElementClosed += (sender, e) =>
    {
        if (sb == null || sb.Length == 0)
        {
            return;
        }

        // Check if closing tag is paragraph         
		if (e.Properties is ParagraphProperties)
        {
            sb.AppendLine();
        }

        // Check if closing tag is group of section title         
		if (e.Properties is GroupProperties && (e.Properties as GroupProperties).Style == "title")
        {
            // Print a title to the console             
			Console.WriteLine(sb.ToString());
            sb = null;
        }
    };

    // Handle ElementText event to process a text     
	handler.ElementText += (sender, e) =>
    {
        if (sb != null)
        {
            // Add a text to the title             
		sb.Append(e.Text);
        }
    };

    // Extract a text with its structure     
	extractor.ExtractStructured(handler);
}

Implement the support for Dynabic.Metered

This feature allows to use metered keys.

Public API changes
Added Metered class.

In this example, an attempt will be made to set metered public and private key:

C#

 
Metered matered = new Metered();
matered.SetMeteredKey("PublicKey", "PrivateKey");

Implement the media type detector for fb2 documents

This feature allows to detect fb2 (FictionBook) documents.

Public API changes
Added FictionBookMediaTypeDetector class.

C#

// Create a media type detector 
var detector = new FictionBookMediaTypeDetector();
// Detect a media type by the file name 
Console.WriteLine(detector.Detect("file.fb2"));
// Detect a media type by the content 
Console.WriteLine(detector.Detect(stream));