Getting information about document content
Since version 21.3 the structure PageInfo has been changed. Now you can get an information not only about the width and height of each page, but also about it text content. In addition to that, page numbering was also added to the PageInfo structure. You can now take an advantage of the new functionality by calling the GetDocumentInfo() method of the Document class.
Each page, represented by PageInfo structure, now contains list of TextLineinfo. Every TextLineinfo contains information about text top and left indents, width, height and text itself. In other words, we can say that each page is represented as a sequence of text lines. Every text line is described by its parameters (width, heigh and indents).
All numerical specifications are presented in pixels relative to the top left-hand corner of the page.
Code example below shows how you can get data from described structures:
using (Annotator annotator = new Annotator("input.docx"))
{
IDocumentInfo documentInfo = annotator.Document.GetDocumentInfo();
foreach (PageInfo page in documentInfo.PagesInfo)
{
//Here you can access PageInfo fields
Console.WriteLine("Page number {0}, width: {1} and height: {2}", page.PageNumber, page.Width, page.Height);
foreach (TextLineInfo textLine in page.TextLines)
{
//Here you can access TextLineInfo fields
Console.WriteLine("\tText line. '{0}'", textLine.Text);
Console.WriteLine("\t\tText width {0} and height {1}. Top indent: {2}, left indent: {3}",
textLine.Width, textLine.Height, textLine.TopIndent, textLine.LeftIndent);
}
}
}
Supported formats
The ability to retrieve text information is implemented for most supported formats: word, pdf, excel, visio diagrams, power point presentations, html and email. Text retrieval works for all formats as it is, except for html (.htm, .html etc) and email (.eml, .msg etc). With those formats it works by converting them into the word document (.docx). Therefore, text parameters for these formats corresponds to their word counterparts.
How it works
Example below shows how these parameters are calculated:
Example above is schematic, although it’s quite accurate.
Imagine you have a text part like the one in the picture. Each text line may be represented filled into rectangle. Width and height of this rectangle is the width and height of the text, and the indentation from the top left corner of the page to the border of this rectangle is top and left indents.
In the illustration above, the possible text rectangles are marked in black. The arrows indicate corresponding text parameters.
More resources
GitHub Examples
You may easily run the code above and see the feature in action in our GitHub examples:
- GroupDocs.Annotation for .NET examples, plugins, and showcase
- GroupDocs.Annotation for Java examples, plugins, and showcase
- Document Annotation for .NET MVC UI Example
- Document Annotation for .NET App WebForms UI Modern Example
- Document Annotation for Java App Dropwizard UI Modern Example
- Document Annotation for Java Spring UI Example
Free Online App
Along with full-featured .NET library we provide simple but powerful free Apps. You are welcome to annotate your PDF, DOC or DOCX, XLS or XLSX, PPT or PPTX, PNG and other documents with free to use online GroupDocs Annotation App.