Get the document content information

GroupDocs.Annotation v21.3 and later lets you get not only the width and height of each page but also its page number and text content. To do this, call the GetDocumentInfo() method of the Document class and read the PageInfo structures from the PagesInfo collection of the result.

The PageInfo structure represents a single page and contains a list of TextLineInfo items. Each TextLineInfo stores the text itself together with its top and left indents, width, and height. This way, each page is represented as a sequence of text lines, and every text line is described by its parameters (width, height, and indents).

The following code snippet shows how to get data from the described structures:

using System;
using GroupDocs.Annotation;
using GroupDocs.Annotation.Models;

// Open the document; the using statement disposes the Annotator when done.
using (Annotator annotator = new Annotator("input.docx"))
{
    IDocumentInfo documentInfo = annotator.Document.GetDocumentInfo();

    foreach (PageInfo page in documentInfo.PagesInfo)
    {
        // Here you can access PageInfo fields
        Console.WriteLine("Page number {0}, width: {1} and height: {2}", page.PageNumber, page.Width, page.Height);

        foreach (TextLineInfo textLine in page.TextLines)
        {
            // Here you can access TextLineInfo fields
            Console.WriteLine("\tText line. '{0}'", textLine.Text);
            Console.WriteLine("\t\tText width {0} and height {1}. Top indent: {2}, left indent: {3}",
                textLine.Width, textLine.Height, textLine.TopIndent, textLine.LeftIndent);
        }
    }
}

Supported formats

GroupDocs.Annotation allows you to retrieve text information from files of the following formats: Word, PDF, Excel files, Visio diagrams, PowerPoint presentations, HTML, and email. Text is retrieved directly from files of all these formats except HTML (.htm, .html, etc.) and email (.eml, .msg, etc.). To retrieve text from HTML and email files, GroupDocs.Annotation first converts them to .docx.

Description of text parameters

The example below shows how these parameters are calculated:

Example of how text parameters are calculated

The possible text rectangles are marked in black. The arrows indicate corresponding text parameters.
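
As a rough illustration of how the four parameters relate to a text rectangle, the sketch below derives the rectangle edges from the indents, width, and height. It assumes the indents are measured from the top-left corner of the page and that all values share the same unit; neither is confirmed here. The TextLine record and the TextLineGeometry helper are hypothetical stand-ins introduced for this example so that it runs without GroupDocs.Annotation; they are not part of the library.

```csharp
using System;

// Hypothetical stand-in that mirrors the TextLineInfo properties used in this article.
// It is NOT a GroupDocs.Annotation type.
public record TextLine(string Text, double LeftIndent, double TopIndent, double Width, double Height);

public static class TextLineGeometry
{
    // Returns the edges of the rectangle that encloses the line.
    // Assumption: indents are offsets from the top-left corner of the page.
    public static (double Left, double Top, double Right, double Bottom) Bounds(TextLine line) =>
        (line.LeftIndent,
         line.TopIndent,
         line.LeftIndent + line.Width,
         line.TopIndent + line.Height);

    // Returns true when the point (x, y) lies inside the line's rectangle (edges included).
    public static bool Contains(TextLine line, double x, double y)
    {
        var (left, top, right, bottom) = Bounds(line);
        return x >= left && x <= right && y >= top && y <= bottom;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample values chosen for illustration only.
        var line = new TextLine("Hello", LeftIndent: 72, TopIndent: 100, Width: 200, Height: 14);

        var bounds = TextLineGeometry.Bounds(line);
        Console.WriteLine("Left {0}, top {1}, right {2}, bottom {3}",
            bounds.Left, bounds.Top, bounds.Right, bounds.Bottom);

        Console.WriteLine("Point (100, 105) inside: {0}", TextLineGeometry.Contains(line, 100, 105));
    }
}
```

With real data, the same arithmetic could be applied to the LeftIndent, TopIndent, Width, and Height values of each TextLineInfo, for example to find which text line lies under a given point on the page.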