GroupDocs.Parser provides powerful text search functionality with support for keywords, regular expressions, case-sensitive search, and highlighting.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents for testing
Basic understanding of regular expressions (for advanced searches)
Search text by keyword
To search for a specific keyword in a document:
    from groupdocs.parser import Parser

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Search for a keyword
        search_results = parser.search("invoice")

        # Check if search is supported
        if search_results is None:
            print("Search isn't supported")
        else:
            # Iterate over search results
            for result in search_results:
                # Print position and found text
                print(f"At {result.position}: {result.text}")
The following sample file is used in this example: sample.pdf
Expected behavior: The method returns a collection of SearchResult objects, each containing the position and text of every occurrence of the keyword.
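When the matches are needed later rather than printed immediately, they can be copied into plain Python data. The following is a minimal sketch: collect_matches is a hypothetical helper (not part of GroupDocs.Parser), and the SimpleNamespace objects only stand in for SearchResult by exposing the position and text attributes used in the example above.

```python
from types import SimpleNamespace


def collect_matches(search_results):
    """Return (position, text) tuples; an empty list if search is unsupported."""
    # parser.search() is documented to return None for unsupported formats.
    if search_results is None:
        return []
    return [(r.position, r.text) for r in search_results]


# Stand-in objects mimicking the two attributes used above
# (assumption: these are not real SearchResult instances).
fake_results = [
    SimpleNamespace(position=12, text="invoice"),
    SimpleNamespace(position=240, text="Invoice"),
]

matches = collect_matches(fake_results)
print(matches)
```

In real use, the value returned by parser.search("invoice") would be passed in place of fake_results.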
Search with regular expressions
To search using regular expressions:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Create search options for regex search
        # Parameters: match_case, match_whole_word, use_regular_expression
        options = SearchOptions(True, False, True)

        # Search with a regular expression (case-sensitive)
        search_results = parser.search(r"page number: \d+", options)

        # Check if search is supported
        if search_results is None:
            print("Search isn't supported")
        else:
            # Iterate over search results
            for result in search_results:
                print(f"At {result.position}: {result.text}")
The following sample file is used in this example: sample.pdf
Expected behavior: Finds all text matching the regular expression pattern with the specified options.
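It can help to try a pattern on a short string before running it against a whole document. The sketch below does this with Python's standard re module on made-up sample text. Note the assumption: the library may evaluate patterns with a different regular expression engine than re, so this is only a quick sanity check for simple patterns such as the one above, not a guarantee of identical matching.

```python
import re

pattern = r"page number: \d+"
# Made-up sample text for illustration only.
sample_text = "Intro text. page number: 7 ... Page Number: 8 ... page number: 12"

# Case-sensitive, mirroring match_case=True in the example above.
matches = [(m.start(), m.group()) for m in re.finditer(pattern, sample_text)]

# Case-insensitive variant, for comparison.
loose_matches = [m.group() for m in re.finditer(pattern, sample_text, re.IGNORECASE)]

for position, text in matches:
    print(f"At {position}: {text}")
print(f"{len(loose_matches)} matches when case is ignored")
```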
Search with case sensitivity and whole word matching
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions

    # Create an instance of Parser class
    with Parser("./sample.docx") as parser:
        # Search for exact word match (case-insensitive, whole word)
        # Parameters: match_case=False, match_whole_word=True, use_regular_expression=False
        options = SearchOptions(False, True, False)
        search_results = parser.search("invoice", options)

        # Check if search is supported
        if search_results is None:
            print("Search isn't supported")
        else:
            print(f"Found {len(list(search_results))} occurrences of 'invoice' as whole word")
The following sample file is used in this example: sample.docx
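The difference between substring and whole-word matching can be illustrated without the library. The sketch below uses Python's re module and the conventional \b word boundary on made-up text. How GroupDocs.Parser defines a word boundary is not stated here, so treat this as an illustration of the concept rather than of the library's exact behavior.

```python
import re

# Made-up text: "invoices" contains the keyword but is not a whole-word match.
text = "Invoice 42 was sent. See invoices for May; the invoice total is due."

# Substring search, case-insensitive: also matches inside "invoices".
substring_hits = re.findall(r"invoice", text, flags=re.IGNORECASE)

# Whole-word search, case-insensitive: \b requires a word boundary on both sides.
whole_word_hits = re.findall(r"\binvoice\b", text, flags=re.IGNORECASE)

print(f"Substring matches: {len(substring_hits)}")
print(f"Whole-word matches: {len(whole_word_hits)}")
```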
Search text with highlights
To search and extract surrounding text (highlights):
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions, HighlightOptions

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Create highlight options (extract 15 characters around the match)
        highlight_options = HighlightOptions(15)

        # Create search options with highlights
        search_options = SearchOptions(
            match_case=False,
            match_whole_word=False,
            use_regular_expression=False,
            left_highlight_options=highlight_options,
            right_highlight_options=highlight_options
        )

        # Search with highlights
        search_results = parser.search("lorem", search_options)

        if search_results is None:
            print("Search isn't supported")
        else:
            # Iterate over search results and print with highlights
            for result in search_results:
                left_text = result.left_highlight_item.text if result.left_highlight_item else ""
                right_text = result.right_highlight_item.text if result.right_highlight_item else ""
                print(f"{left_text}[{result.text}]{right_text}")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns search results with context from surrounding text on both sides of the match.
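To make the idea of a fixed-width highlight concrete, the sketch below reproduces it on a plain Python string: it takes up to a given number of characters on each side of a match. context_window is a hypothetical helper written for this illustration, and the library may trim or align its highlights differently.

```python
def context_window(text, start, length, width):
    """Return up to `width` characters on each side of text[start:start + length]."""
    left = text[max(0, start - width):start]
    right = text[start + length:start + length + width]
    return left, right


# Made-up text for illustration only.
text = "The quick brown fox jumps over the lazy dog"
keyword = "fox"

start = text.find(keyword)
left, right = context_window(text, start, len(keyword), 6)
formatted = f"{left}[{keyword}]{right}"
print(formatted)
```

Near the start or end of the text, the window is simply shorter than the requested width.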
Search text with page numbers
To search and get page numbers where text appears:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Create search options with page search enabled
        # Parameters: match_case, match_whole_word, use_regular_expression, search_by_pages
        options = SearchOptions(False, False, False, True)

        # Search with page numbers
        search_results = parser.search("lorem", options)

        if search_results is None:
            print("Search isn't supported")
        else:
            # Iterate over search results
            for result in search_results:
                # Print position, page number, and found text
                print(f"At {result.position} (page {result.page_index + 1}): {result.text}")
The following sample file is used in this example: sample.pdf
Expected behavior: Each search result includes the page index where the text was found.
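Once each result carries a page index, the matches can be grouped per page. The following is a minimal sketch: group_by_page is a hypothetical helper, and the SimpleNamespace objects only mimic the page_index and text attributes used in the example above. Page numbers are shown 1-based, following the page_index + 1 convention of that example.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(search_results):
    """Map 1-based page numbers to the list of matched texts on that page."""
    pages = defaultdict(list)
    # parser.search() is documented to return None for unsupported formats.
    if search_results is None:
        return {}
    for result in search_results:
        pages[result.page_index + 1].append(result.text)
    return dict(pages)


# Stand-in results (assumption: not real SearchResult instances).
fake_results = [
    SimpleNamespace(page_index=0, text="lorem"),
    SimpleNamespace(page_index=0, text="Lorem"),
    SimpleNamespace(page_index=2, text="lorem"),
]

by_page = group_by_page(fake_results)
for page, texts in sorted(by_page.items()):
    print(f"Page {page}: {len(texts)} match(es)")
```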
Advanced search example
Combine multiple search techniques:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import SearchOptions, HighlightOptions

    def advanced_search(file_path, pattern, case_sensitive=False, use_regex=False):
        """
        Perform advanced text search with highlights and page numbers.
        """
        try:
            with Parser(file_path) as parser:
                # Configure highlight options
                highlight_opts = HighlightOptions(20)

                # Configure search options
                search_opts = SearchOptions(
                    match_case=case_sensitive,
                    match_whole_word=False,
                    use_regular_expression=use_regex,
                    search_by_pages=True,
                    left_highlight_options=highlight_opts,
                    right_highlight_options=highlight_opts
                )

                # Perform search
                results = parser.search(pattern, search_opts)

                if results is None:
                    print("Search not supported for this document")
                    return []

                # Process results
                found_items = []
                for result in results:
                    found_items.append({
                        'text': result.text,
                        'position': result.position,
                        'page': result.page_index + 1,
                        'left_context': result.left_highlight_item.text if result.left_highlight_item else "",
                        'right_context': result.right_highlight_item.text if result.right_highlight_item else ""
                    })

                return found_items

        except Exception as e:
            print(f"Error during search: {e}")
            return []

    # Usage
    results = advanced_search("sample.pdf", r"\d{4}-\d{2}-\d{2}", use_regex=True)
    for item in results:
        print(f"Page {item['page']}: {item['left_context']}[{item['text']}]{item['right_context']}")
Notes
The search() method returns None if search is not supported for the document format
Use parser.features.search to check if search is available before calling search()
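The two notes above can be combined into one guard. The sketch below defines search_if_supported, a hypothetical helper that checks parser.features.search first and still tolerates a None return value. It is exercised here with stand-in objects instead of a real Parser, so it runs without a document. With the library installed, a Parser instance would be passed in the same way.

```python
from types import SimpleNamespace


def search_if_supported(parser, keyword):
    """Return a list of results, or an empty list if search is unavailable."""
    # Check the feature flag before calling search().
    if not parser.features.search:
        return []
    results = parser.search(keyword)
    # search() may still return None for unsupported formats.
    return [] if results is None else list(results)


# Stand-in parsers (assumption: they mimic only the members used above).
unsupported = SimpleNamespace(
    features=SimpleNamespace(search=False),
    search=lambda keyword: None,
)
supported = SimpleNamespace(
    features=SimpleNamespace(search=True),
    search=lambda keyword: [SimpleNamespace(position=5, text=keyword)],
)

print(len(search_if_supported(unsupported, "invoice")))
print(len(search_if_supported(supported, "invoice")))
```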