GroupDocs.Parser can extract highlights: text fragments with surrounding context, taken from a given position in a document. Highlights are useful for building search result previews, text snippets, and contextual excerpts.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents for testing
Understanding of text positioning concepts
What are highlights?
Highlights are text fragments extracted from a document at a specific position, including:
Text: The main text content
Position: Character position in the document
Context: Surrounding text before and after
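To make the idea concrete, here is a minimal sketch that models a highlight on a plain Python string. The `Fragment` class and `fragment_at` function are hypothetical names used only for illustration. They are not part of GroupDocs.Parser and say nothing about how the library computes highlights internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fragment:
    """Hypothetical stand-in for a highlight: a start position and its text."""
    position: int
    text: str


def fragment_at(text: str, position: int, max_length: int) -> Optional[Fragment]:
    """Return up to max_length characters starting at position, or None if out of range."""
    if position < 0 or position >= len(text) or max_length <= 0:
        return None
    return Fragment(position, text[position:position + max_length])


sample = "GroupDocs.Parser extracts text fragments from documents."
print(fragment_at(sample, 17, 8))    # fragment inside the text
print(fragment_at(sample, 500, 8))   # position past the end of the text
```

The sketch returns `None` for an out-of-range position, which mirrors the `None` check used with `get_highlight()` throughout this page.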
Extract highlights
To extract a highlight from a specific position:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Create highlight options (extract 20 characters)
        options = HighlightOptions(20)

        # Extract highlight at position 100
        highlight = parser.get_highlight(100, True, options)

        if highlight:
            print(f"Position: {highlight.position}")
            print(f"Text: {highlight.text}")
        else:
            print("Highlight extraction not supported or position out of range")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns a HighlightItem object containing the text fragment starting at the specified position, or None if extraction is not supported.
Extract highlights with fixed length
To extract highlights with a specific character count:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    # Create an instance of Parser class
    with Parser("sample.docx") as parser:
        # Define positions to extract highlights from
        positions = [0, 100, 500, 1000]

        # Create highlight options (50 characters)
        options = HighlightOptions(50)

        for pos in positions:
            # Extract highlight
            highlight = parser.get_highlight(pos, True, options)

            if highlight:
                print(f"Position {pos}:")
                print(f"Text: {highlight.text}")
The following sample file is used in this example: sample.docx
Expected behavior: Extracts 50-character text fragments from specified positions.
Extract highlights for search results
To create search result previews with context:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions, HighlightOptions

    def get_search_results_with_context(file_path, keyword, context_length=30):
        """
        Search for keyword and extract surrounding context.
        """
        with Parser(file_path) as parser:
            # Create highlight options for context
            highlight_opts = HighlightOptions(context_length)

            # Create search options with highlights
            search_opts = SearchOptions(
                match_case=False,
                match_whole_word=False,
                use_regular_expression=False,
                left_highlight_options=highlight_opts,
                right_highlight_options=highlight_opts
            )

            # Search for keyword
            results = parser.search(keyword, search_opts)

            if results is None:
                print("Search not supported")
                return []

            # Collect results with context
            search_results = []
            for result in results:
                left_context = result.left_highlight_item.text if result.left_highlight_item else ""
                right_context = result.right_highlight_item.text if result.right_highlight_item else ""

                search_results.append({
                    'keyword': result.text,
                    'position': result.position,
                    'page': result.page_index + 1 if hasattr(result, 'page_index') else None,
                    'left_context': left_context,
                    'right_context': right_context,
                    'full_snippet': f"{left_context}[{result.text}]{right_context}"
                })

            return search_results

    # Usage
    results = get_search_results_with_context("sample.pdf", "artificial intelligence", 40)

    print(f"Found {len(results)} occurrences:")
    for idx, result in enumerate(results, 1):
        print(f"{idx}. {result['full_snippet']}")
        if result['page']:
            print(f" (Page {result['page']}, position {result['position']})")
Expected behavior: Searches for keywords and returns results with surrounding context for better previews.
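The shape of the resulting snippets can be tried out without a document. The sketch below rebuilds the same `left_context[keyword]right_context` layout on a plain string using the standard `re` module. `snippets_with_context` is a hypothetical helper written for illustration, and it only approximates what the library returns for a real file.

```python
import re


def snippets_with_context(text, keyword, context_length=30):
    """Find case-insensitive matches of keyword and attach left/right context."""
    results = []
    if not keyword:
        return results

    # re.escape treats the keyword as literal text, not as a pattern
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start, end = match.span()
        left = text[max(0, start - context_length):start]
        right = text[end:end + context_length]
        results.append({
            "keyword": match.group(0),
            "position": start,
            "left_context": left,
            "right_context": right,
            "full_snippet": f"{left}[{match.group(0)}]{right}",
        })
    return results


demo_text = "Intro to AI. Modern AI systems learn."
for item in snippets_with_context(demo_text, "ai", context_length=6):
    print(item["position"], item["full_snippet"])
```

Clamping the left slice with `max(0, ...)` keeps a match near the start of the text from producing a negative slice index, which would otherwise wrap around to the end of the string.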
Extract highlights from multiple positions
To extract text snippets from various document sections:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    def extract_multiple_highlights(file_path, positions, length=50):
        """
        Extract highlights from multiple positions.
        """
        highlights = []

        with Parser(file_path) as parser:
            options = HighlightOptions(length)

            for position in positions:
                highlight = parser.get_highlight(position, True, options)

                if highlight:
                    highlights.append({
                        'position': highlight.position,
                        'text': highlight.text,
                        'length': len(highlight.text)
                    })

        return highlights

    # Extract highlights at specific positions
    positions = [0, 500, 1000, 1500, 2000]
    highlights = extract_multiple_highlights("sample.pdf", positions, length=80)

    print(f"Extracted {len(highlights)} highlights:")
    for h in highlights:
        print(f"Position {h['position']}:")
        print(f" {h['text'][:60]}...")
        print()
The following sample file is used in this example: sample.pdf
Expected behavior: Extracts text fragments from multiple specified positions in the document.
Create document preview with highlights
To generate a document preview with highlighted sections:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    def create_document_preview(file_path, num_sections=5, section_length=100):
        """
        Create a document preview by extracting highlights from various positions.
        """
        with Parser(file_path) as parser:
            # Get document text to determine positions
            text_reader = parser.get_text()
            if not text_reader:
                print("Text extraction not supported")
                return None

            full_text = text_reader
            text_length = len(full_text)

            # Calculate evenly distributed positions
            if text_length <= section_length:
                positions = [0]
            else:
                step = text_length // num_sections
                positions = [i * step for i in range(num_sections)]

            # Extract highlights
            preview_sections = []
            options = HighlightOptions(section_length)

            for pos in positions:
                highlight = parser.get_highlight(pos, True, options)
                if highlight:
                    preview_sections.append(highlight.text)

            return {
                'document_length': text_length,
                'preview_sections': preview_sections,
                'full_preview': '...'.join(preview_sections)
            }

    # Usage
    preview = create_document_preview("sample.pdf", num_sections=3, section_length=150)

    if preview:
        print(f"Document length: {preview['document_length']} characters")
        print("Preview:")
        print(preview['full_preview'])
Expected behavior: Creates a representative preview of the document by extracting highlights from evenly distributed positions.
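The position arithmetic in this example can be checked on its own. The sketch below isolates it in a hypothetical helper, `preview_positions`, which needs no document. The guard for empty input is an addition that the example above does not include.

```python
def preview_positions(text_length, num_sections, section_length):
    """Return evenly spaced start positions for preview sections."""
    # Guard added for illustration: nothing to preview
    if text_length <= 0 or num_sections <= 0:
        return []

    # A short document fits in a single section starting at 0
    if text_length <= section_length:
        return [0]

    step = text_length // num_sections
    return [i * step for i in range(num_sections)]


print(preview_positions(1000, 4, 100))
print(preview_positions(300, 5, 100))
print(preview_positions(50, 3, 100))
```

When `step` is smaller than `section_length`, as in the second call, neighbouring sections overlap and the joined preview repeats text. Lowering `num_sections` or `section_length` avoids that.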
Extract highlights with word boundaries
To extract highlights respecting word boundaries:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    def extract_highlight_with_words(file_path, position, max_length=100):
        """
        Extract highlight ensuring complete words.
        """
        with Parser(file_path) as parser:
            # Extract more than needed
            options = HighlightOptions(max_length + 50)
            highlight = parser.get_highlight(position, True, options)

            if not highlight:
                return None

            text = highlight.text

            # Find last complete word within max_length
            if len(text) > max_length:
                truncated = text[:max_length]
                last_space = truncated.rfind(' ')
                if last_space > 0:
                    text = truncated[:last_space] + "..."

            return {
                'position': highlight.position,
                'text': text,
                'complete_words': True
            }

    # Usage
    highlight = extract_highlight_with_words("sample.pdf", 500, 80)

    if highlight:
        print(f"Position {highlight['position']}:")
        print(highlight['text'])
The following sample file is used in this example: sample.pdf
Expected behavior: Extracts highlights that end at word boundaries, avoiding cut-off words.
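The trimming step works on any string, so it can be tested separately from the parser. The following sketch uses a hypothetical helper, `trim_to_word_boundary`. It differs from the example above in two ways: it keeps the last word when the cut already falls on whitespace, and it falls back to a hard cut when the text contains no space, so the result never exceeds `max_length` plus the ellipsis.

```python
def trim_to_word_boundary(text, max_length, ellipsis="..."):
    """Shorten text to at most max_length characters, ending on a whole word."""
    if len(text) <= max_length:
        return text

    truncated = text[:max_length]

    # If the next character is whitespace, the cut already ends on a whole word
    if not text[max_length].isspace():
        last_space = truncated.rfind(" ")
        if last_space > 0:
            truncated = truncated[:last_space]
        # With no space at all, keep the hard cut at max_length

    return truncated.rstrip() + ellipsis


print(trim_to_word_boundary("The quick brown fox jumps", 12))
print(trim_to_word_boundary("The quick brown fox jumps", 9))
print(trim_to_word_boundary("short", 10))
```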
Highlight extraction for table of contents
To create a table of contents with text previews:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import HighlightOptions

    def create_toc_with_previews(file_path, preview_length=60):
        """
        Create table of contents with text previews for each section.
        """
        with Parser(file_path) as parser:
            # Get TOC
            toc_items = parser.get_toc()

            if not toc_items:
                print("TOC extraction not supported")
                return []

            toc_with_previews = []
            options = HighlightOptions(preview_length)

            # Get text to find section positions
            text_reader = parser.get_text()
            if not text_reader:
                return []

            full_text = text_reader

            for item in toc_items:
                # Try to extract highlight for this TOC item
                # Use text search to find the section
                section_text = item.text
                position = full_text.find(section_text)

                if position >= 0:
                    highlight = parser.get_highlight(position, True, options)
                    preview = highlight.text if highlight else "No preview available"
                else:
                    preview = "Section not found"

                toc_with_previews.append({
                    'title': item.text,
                    'depth': item.depth,
                    'preview': preview
                })

            return toc_with_previews

    # Usage
    toc = create_toc_with_previews("SampleWithToc.pdf", preview_length=100)

    for item in toc:
        indent = " " * item['depth']
        print(f"{indent}{item['title']}")
        print(f"{indent} Preview: {item['preview'][:50]}...")
Expected behavior: Generates a table of contents with text previews for each section.
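One thing to watch with the text search in this example: `str.find` returns the first occurrence of a title, and in a document that prints its own table of contents that may be the TOC entry instead of the section heading. The sketch below shows one possible workaround on a plain string. `locate_sections` is a hypothetical helper, and it assumes the caller already knows an offset (`start`) past the printed TOC and that sections appear in the same order as the titles.

```python
def locate_sections(full_text, titles, start=0):
    """Find each title in order, searching forward from the previous match.

    Returns a list of (title, index) pairs; index is -1 when a title is not found.
    """
    located = []
    cursor = start
    for title in titles:
        index = full_text.find(title, cursor)
        located.append((title, index))
        if index >= 0:
            # Continue after this match so later titles are found further on
            cursor = index + len(title)
    return located


doc = "Contents\nIntro\nSetup\n\nIntro\nWelcome.\nSetup\nInstall it."
print(locate_sections(doc, ["Intro", "Setup"]))            # matches the TOC listing
print(locate_sections(doc, ["Intro", "Setup"], start=21))  # matches the headings
```

The positions found this way could then be passed to `get_highlight()` in place of the result of a plain `full_text.find(section_text)`.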
Notes
The get_highlight() method returns None if highlight extraction is not supported
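The second parameter (boolean) sets the extraction direction: True extracts the text that follows the position (toward the end of the document), False extracts the text that precedes it. In the underlying .NET API this parameter is named isDirect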
Highlight length is specified in characters
Use highlights with search results to show context
Highlights are useful for creating document previews and snippets
Consider word boundaries when displaying highlights to users