GroupDocs.Parser for .NET 18.10 Release Notes

Major Features

This release adds the following feature:

  • Implemented API to extract images from documents

Full List of Issues Covering all Changes in this Release

Key | Summary | Issue Type
PARSERNET-65 | Implement API to extract images from documents | New feature
PARSERNET-69 | Implement the ability to extract images from PDF | New feature
PARSERNET-71 | Implement the ability to extract images from spreadsheets | New feature
PARSERNET-72 | Implement the ability to extract images from text documents | New feature
PARSERNET-74 | Implement the ability to extract images from presentations | New feature

Public API and Backward Incompatible Changes

API to extract images from documents

Description

This feature allows extracting images from documents.

Public API changes

  • Added Emf constant to MediaTypeNames.Application class
  • Added Images nested class to MediaTypeNames class
  • Added Windows nested class to MediaTypeNames class
  • Added ImageArea class
  • Added ImageAreaSearchOptions class
  • Added GetImageAreas methods to DocumentContent class
  • Added ImageAreas property to DocumentPage class

To extract images from a page, use the GetImageAreas methods:

C#

class DocumentContent {
  public IList<ImageArea> GetImageAreas(int pageIndex);
  public IList<ImageArea> GetImageAreas(int pageIndex, ImageAreaSearchOptions searchOptions);
}

The one-parameter overload returns all images on the page at the zero-based pageIndex. The overload that also takes an ImageAreaSearchOptions parameter returns only the images that meet the conditions in searchOptions. Both overloads return a collection of ImageArea objects with the following members:

Member | Description
Page | Link to the page which contains this image
Rectangle | Rectangle of the image area
Rotation | Angle of the image rotation (0 if image isn’t rotated)
MediaType | MIME type of the image
GetBitmapStream | Returns a stream with bitmap representation of the image
GetRawStream | Returns a stream with the image

The ImageAreaSearchOptions class has a single property, Rectangle. If it is set, the method returns only the images that intersect the given rectangle.
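
The intersection rule for the Rectangle search option can be illustrated with a small sketch. It uses System.Drawing.Rectangle and its IntersectsWith method as a stand-in; these release notes do not state which Rectangle type the library uses internally or whether images that only touch the edge of the search rectangle are returned, so treat both as assumptions. The class and method names (IntersectionSketch, Overlaps) are hypothetical.

```csharp
using System;
using System.Drawing;

static class IntersectionSketch
{
    // Returns true when the image bounds overlap the search rectangle.
    // Assumption: System.Drawing semantics, where rectangles that only
    // share an edge are NOT considered intersecting.
    public static bool Overlaps(Rectangle search, Rectangle imageBounds)
    {
        return search.IntersectsWith(imageBounds);
    }

    public static void Demo()
    {
        // Same search rectangle as in the usage example: position (0; 0), size (300; 300)
        Rectangle search = new Rectangle(0, 0, 300, 300);

        // Partly inside the search rectangle
        Console.WriteLine(Overlaps(search, new Rectangle(250, 250, 100, 100))); // True

        // Entirely outside the search rectangle
        Console.WriteLine(Overlaps(search, new Rectangle(400, 400, 50, 50)));   // False

        // Only touches the right edge of the search rectangle
        Console.WriteLine(Overlaps(search, new Rectangle(300, 0, 50, 50)));     // False
    }
}
```

An image does not have to lie fully inside the search rectangle to match under this rule; a partial overlap is enough.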

Usage

C#

private static void ExtractImages()
{
    // Create a text extractor
    WordsTextExtractor extractor = new WordsTextExtractor("cv.docx");
 
    // Create search options
    ImageAreaSearchOptions searchOptions = new ImageAreaSearchOptions();
    // Limit the search with the rectangle: position (0; 0), size (300; 300)
    searchOptions.Rectangle = new Rectangle(0, 0, 300, 300);
 
    // Get images from the first page
    IList<ImageArea> imageAreas = extractor.DocumentContent.GetImageAreas(0, searchOptions);
 
    // Iterate over the images
    for (int i = 0; i < imageAreas.Count; i++)
    {
        using (Stream fs = File.Create(String.Format("{0}.jpg", i)))
        {
            // Save the image to the file
            CopyStream(imageAreas[i].GetRawStream(), fs);
        }
    }
}
 
private static void CopyStream(Stream source, Stream dest)
{
    byte[] buffer = new byte[4096];
    source.Position = 0;
 
    int r = 0;
    do
    {
        r = source.Read(buffer, 0, buffer.Length);
        if (r > 0)
        {
            dest.Write(buffer, 0, r);
        }
    }
    while (r > 0);
}
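
The usage example above writes every image with a .jpg extension, although GetRawStream may well return data in another format. Since ImageArea exposes a MediaType member, the file name could be derived from it instead. The helper below is a minimal sketch: it assumes MediaType is available as a string and uses standard MIME names such as "image/png"; the exact values the library returns are not listed in these release notes. ImageFileNames, ExtensionFor and FileNameFor are hypothetical names, not part of the GroupDocs.Parser API.

```csharp
using System;
using System.Collections.Generic;

static class ImageFileNames
{
    // Assumed MIME type strings; extend the table with the values
    // that the library actually reports for your documents.
    private static readonly Dictionary<string, string> Extensions =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "image/jpeg", ".jpg" },
            { "image/png", ".png" },
            { "image/gif", ".gif" },
            { "image/bmp", ".bmp" },
            { "image/tiff", ".tiff" }
        };

    // Maps a MIME type to a file extension; unknown or null types get ".bin".
    public static string ExtensionFor(string mediaType)
    {
        string extension;
        if (mediaType != null && Extensions.TryGetValue(mediaType, out extension))
        {
            return extension;
        }
        return ".bin";
    }

    // Builds a file name such as "0.png" from the image index and its MIME type.
    public static string FileNameFor(int index, string mediaType)
    {
        return String.Format("{0}{1}", index, ExtensionFor(mediaType));
    }
}
```

In the loop of the usage example, File.Create(String.Format("{0}.jpg", i)) could then become File.Create(ImageFileNames.FileNameFor(i, imageAreas[i].MediaType)), provided MediaType is indeed a string.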

Extracting images from PDF documents

Description

This feature allows extracting images from PDF documents.

Public API changes

No public API changes

Usage

To extract images from a page, use the GetImageAreas methods:

C#

private static void ExtractImages()
{
    // Create a text extractor
    PdfTextExtractor extractor = new PdfTextExtractor("cv.pdf");
 
    // Create search options
    ImageAreaSearchOptions searchOptions = new ImageAreaSearchOptions();
    // Limit the search with the rectangle: position (0; 0), size (300; 300)
    searchOptions.Rectangle = new Rectangle(0, 0, 300, 300);
 
    // Get images from the first page
    IList<ImageArea> imageAreas = extractor.DocumentContent.GetImageAreas(0, searchOptions);
 
    // Iterate over the images
    for (int i = 0; i < imageAreas.Count; i++)
    {
        using (Stream fs = File.Create(String.Format("{0}.jpg", i)))
        {
            // Save the image to the file
            CopyStream(imageAreas[i].GetRawStream(), fs);
        }
    }
}
 
private static void CopyStream(Stream source, Stream dest)
{
    byte[] buffer = new byte[4096];
    source.Position = 0;
 
    int r = 0;
    do
    {
        r = source.Read(buffer, 0, buffer.Length);
        if (r > 0)
        {
            dest.Write(buffer, 0, r);
        }
    }
    while (r > 0);
}

Extracting images from spreadsheets

Description

This feature allows extracting images from spreadsheets.

Public API changes

No public API changes

Usage

To extract images from a sheet, use the GetImageAreas methods:

C#

private static void ExtractImages()
{
    // Create a text extractor
    CellsTextExtractor extractor = new CellsTextExtractor("catalog.xlsx");
 
    // Create search options
    ImageAreaSearchOptions searchOptions = new ImageAreaSearchOptions();
    // Limit the search with the rectangle: position (0; 0), size (300; 300)
    searchOptions.Rectangle = new Rectangle(0, 0, 300, 300);
 
    // Get images from the first sheet
    IList<ImageArea> imageAreas = extractor.DocumentContent.GetImageAreas(0, searchOptions);
 
    // Iterate over the images
    for (int i = 0; i < imageAreas.Count; i++)
    {
        using (Stream fs = File.Create(String.Format("{0}.jpg", i)))
        {
            // Save the image to the file
            CopyStream(imageAreas[i].GetRawStream(), fs);
        }
    }
}
 
private static void CopyStream(Stream source, Stream dest)
{
    byte[] buffer = new byte[4096];
    source.Position = 0;
 
    int r = 0;
    do
    {
        r = source.Read(buffer, 0, buffer.Length);
        if (r > 0)
        {
            dest.Write(buffer, 0, r);
        }
    }
    while (r > 0);
}

Extracting images from text documents

Description

This feature allows extracting images from text documents.

Public API changes

No public API changes

Usage

To extract images from a page, use the GetImageAreas methods:

C#

private static void ExtractImages()
{
    // Create a text extractor
    WordsTextExtractor extractor = new WordsTextExtractor("cv.docx");
 
    // Create search options
    ImageAreaSearchOptions searchOptions = new ImageAreaSearchOptions();
    // Limit the search with the rectangle: position (0; 0), size (300; 300)
    searchOptions.Rectangle = new Rectangle(0, 0, 300, 300);
 
    // Get images from the first page
    IList<ImageArea> imageAreas = extractor.DocumentContent.GetImageAreas(0, searchOptions);
 
    // Iterate over the images
    for (int i = 0; i < imageAreas.Count; i++)
    {
        using (Stream fs = File.Create(String.Format("{0}.jpg", i)))
        {
            // Save the image to the file
            CopyStream(imageAreas[i].GetRawStream(), fs);
        }
    }
}
 
private static void CopyStream(Stream source, Stream dest)
{
    byte[] buffer = new byte[4096];
    source.Position = 0;
 
    int r = 0;
    do
    {
        r = source.Read(buffer, 0, buffer.Length);
        if (r > 0)
        {
            dest.Write(buffer, 0, r);
        }
    }
    while (r > 0);
}

Extracting images from presentations

Description

This feature allows extracting images from presentations.

Public API changes

No public API changes

Usage

To extract images from a slide, use the GetImageAreas methods:

C#

private static void ExtractImages()
{
    // Create a text extractor
    SlidesTextExtractor extractor = new SlidesTextExtractor("presentation.pptx");
 
    // Create search options
    ImageAreaSearchOptions searchOptions = new ImageAreaSearchOptions();
    // Limit the search with the rectangle: position (0; 0), size (300; 300)
    searchOptions.Rectangle = new Rectangle(0, 0, 300, 300);
 
    // Get images from the first slide
    IList<ImageArea> imageAreas = extractor.DocumentContent.GetImageAreas(0, searchOptions);
 
    // Iterate over the images
    for (int i = 0; i < imageAreas.Count; i++)
    {
        using (Stream fs = File.Create(String.Format("{0}.jpg", i)))
        {
            // Save the image to the file
            CopyStream(imageAreas[i].GetRawStream(), fs);
        }
    }
}
 
private static void CopyStream(Stream source, Stream dest)
{
    byte[] buffer = new byte[4096];
    source.Position = 0;
 
    int r = 0;
    do
    {
        r = source.Read(buffer, 0, buffer.Length);
        if (r > 0)
        {
            dest.Write(buffer, 0, r);
        }
    }
    while (r > 0);
}