GroupDocs.Parser for .NET 17.04 Release Notes

Major Features

This release adds the following feature:

  • Ability to extract text together with its structure.

All Changes

Key          Summary                                                              Issue Type
TEXTNET-341  Implement the architecture to read a structured text                 New feature
TEXTNET-342  Implement the ability to read a structured text from spreadsheets    New feature
TEXTNET-343  Implement the ability to read a structured text from text documents  New feature
TEXTNET-344  Implement the ability to read a structured text from presentations   New feature
TEXTNET-345  Implement the ability to read a structured text from emails          New feature

Public API and Backward Incompatible Changes

Implement the architecture to read a structured text

This feature allows extracting text together with its structure.

Public API Changes
Added IStructuredExtractor interface.
Added StructuredHandler class.
Added XmlStructuredHandler class.
Added ModelStructuredHandler class.
Added ModelNode class.
Added StructuredEventArgs class.
Added StructuredEventHandler delegate.
Added StructuredTextEventArgs class.
Added StructuredTextEventHandler delegate.
Added StructuredElementProperties class.
Added DocumentProperties class.
Added PageProperties class.
Added SlideProperties class.
Added ParagraphProperties class.
Added HyperlinkProperties class.
Added TextProperties class.
Added TableProperties class.
Added TableRowProperties class.
Added TableCellProperties class.
Added ListProperties class.
Added ListItemProperties class.
Added ListType enumeration.
Added ParagraphStyle enumeration.
Added StructuredElementType enumeration.

Extracting headings from a document:

C#

StringBuilder sb = new StringBuilder();
IStructuredExtractor extractor = new WordsTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

// Handle List event to prevent processing of lists
handler.List += (sender, e) => e.Properties.SkipElement = true; // ignore lists
// Handle Table event to prevent processing of tables
handler.Table += (sender, e) => e.Properties.SkipElement = true; // ignore tables
// Handle Paragraph event to process a paragraph
handler.Paragraph += (sender, e) =>
{
    int h1 = (int)ParagraphStyle.Heading1;
    int h6 = (int)ParagraphStyle.Heading6;

    int style = (int)e.Properties.Style;
    if (h1 <= style && style <= h6)
    {
        if (sb.Length > 0)
        {
            sb.AppendLine();
        }

        sb.Append(' ', style - h1); // indent the heading by its level (Heading1 gets no indentation)
    }
    else
    {
        e.Properties.SkipElement = e.Properties.Style != ParagraphStyle.Title; // skip the paragraph unless it is a heading or a title
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) => sb.Append(e.Text);

// Extract a text with its structure
extractor.ExtractStructured(handler);

Console.WriteLine(sb.ToString());

Extracting hyperlinks from a document:

C#

List<string> hyperlinks = new List<string>();
StringBuilder sb = null;
string currentLink = null;
IStructuredExtractor extractor = new WordsTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

// Handle Hyperlink event to process the start of a hyperlink
handler.Hyperlink += (sender, e) =>
{
    sb = new StringBuilder();
    currentLink = e.Properties.Link;
};

// Handle ElementClosed event to process the closing of a hyperlink
handler.ElementClosed += (sender, e) =>
{
    StructuredHandler h = sender as StructuredHandler;
    if (h != null && h[0] is HyperlinkProperties) // closing of hyperlink
    {
        if (sb != null)
        {
            hyperlinks.Add(string.Format("{0} ({1})", sb.ToString(), currentLink));
        }
        sb = null;
        currentLink = null;
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) =>
{
    if (sb != null) // if hyperlink is open
    {
        sb.Append(e.Text);
    }
};

// Extract a text with its structure
extractor.ExtractStructured(handler);

foreach (string hl in hyperlinks)
{
    Console.WriteLine(hl);
}

Implement the ability to read a structured text from spreadsheets

This feature allows extracting text together with its structure from spreadsheets.

Public API Changes
Added ExtractStructured method to CellsTextExtractor class.

Extracting the sheet named Sheet2:

C#

StringBuilder sb = new StringBuilder();
IStructuredExtractor extractor = new CellsTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

// Handle Table event to process a table
handler.Table += (sender, e) =>
{
    e.Properties.SkipElement = e.Properties.Name != "Sheet2"; // process only the sheet named Sheet2
    if (!e.Properties.SkipElement)
    {
        if (sb.Length > 0)
        {
            sb.AppendLine();
        }

        sb.Append(e.Properties.Name);
    }
};

// Handle TableRow event to process a table row
handler.TableRow += (sender, e) =>
{
    sb.AppendLine();
};

// Handle TableCell event to process a table cell
handler.TableCell += (sender, e) =>
{
    if (e.Properties.Column > 0)
    {
        sb.Append(" ");
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) =>
{
    sb.Append(e.Text);
};

// Extract a text with its structure
extractor.ExtractStructured(handler);
Console.WriteLine(sb.ToString());

Implement the ability to read a structured text from text documents

This feature allows extracting text together with its structure from text documents.

Public API Changes
Added ExtractStructured method to WordsTextExtractor class.

Extracting headings from a document:

C#

StringBuilder sb = new StringBuilder();
IStructuredExtractor extractor = new WordsTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

// Handle List event to prevent processing of lists
handler.List += (sender, e) => e.Properties.SkipElement = true; // ignore lists
// Handle Table event to prevent processing of tables
handler.Table += (sender, e) => e.Properties.SkipElement = true; // ignore tables
// Handle Paragraph event to process a paragraph
handler.Paragraph += (sender, e) =>
{
    int h1 = (int)ParagraphStyle.Heading1;
    int h6 = (int)ParagraphStyle.Heading6;

    int style = (int)e.Properties.Style;
    if (h1 <= style && style <= h6)
    {
        if (sb.Length > 0)
        {
            sb.AppendLine();
        }

        sb.Append(' ', style - h1); // indent the heading by its level (Heading1 gets no indentation)
    }
    else
    {
        e.Properties.SkipElement = e.Properties.Style != ParagraphStyle.Title; // skip the paragraph unless it is a heading or a title
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) => sb.Append(e.Text);

// Extract a text with its structure
extractor.ExtractStructured(handler);

Console.WriteLine(sb.ToString());

Implement the ability to read a structured text from presentations

This feature allows extracting text together with its structure from presentations.

Public API Changes
Added ExtractStructured method to SlidesTextExtractor class.

Extracting top-level lists from a presentation:

C#

StringBuilder sb = new StringBuilder();
IStructuredExtractor extractor = new SlidesTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

bool isList = false;

// Handle List event to process the start of a list
handler.List += (sender, e) =>
{
    e.Properties.SkipElement = e.Properties.Depth > 0; // process only top-level lists
    if (!e.Properties.SkipElement)
    {
        isList = true;
    }
};

// Handle ElementClosed event to process the closing of a list
handler.ElementClosed += (sender, e) =>
{
    StructuredHandler h = sender as StructuredHandler;
    if (h != null && h[0] is ListProperties)
    {
        isList = false;
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) =>
{
    if (!isList)
    {
        return;
    }

    if (sb.Length > 0)
    {
        sb.AppendLine();
    }

    sb.Append(e.Text);
};

// Extract a text with its structure
extractor.ExtractStructured(handler);

Console.WriteLine(sb.ToString());

Implement the ability to read a structured text from emails

This feature allows extracting text together with its structure from emails.

Public API Changes
Added ExtractStructured method to EmailTextExtractor class.

Extracting hyperlinks from an email:

C#

List<string> hyperlinks = new List<string>();
StringBuilder sb = null;
string currentLink = null;
IStructuredExtractor extractor = new EmailTextExtractor(stream);
StructuredHandler handler = new StructuredHandler();

// Handle Hyperlink event to process the start of a hyperlink
handler.Hyperlink += (sender, e) =>
{
    sb = new StringBuilder();
    currentLink = e.Properties.Link;
};

// Handle ElementClosed event to process the closing of a hyperlink
handler.ElementClosed += (sender, e) =>
{
    StructuredHandler h = sender as StructuredHandler;
    if (h != null && h[0] is HyperlinkProperties) // closing of hyperlink
    {
        if (sb != null)
        {
            hyperlinks.Add(string.Format("{0} ({1})", sb.ToString(), currentLink));
        }
        sb = null;
        currentLink = null;
    }
};

// Handle ElementText event to process a text
handler.ElementText += (sender, e) =>
{
    if (sb != null) // if hyperlink is open
    {
        sb.Append(e.Text);
    }
};

// Extract a text with its structure
extractor.ExtractStructured(handler);

foreach (string hl in hyperlinks)
{
    Console.WriteLine(hl);
}