Extract hyperlinks from document page area
GroupDocs.Parser can extract hyperlinks from a specific rectangular area of a document page, letting you target links in defined regions such as headers, sidebars, and footers.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents with hyperlinks
Understanding of coordinate systems (see the sketch after this list)
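All areas on this page are described the same way: the origin (0, 0) is the top-left corner of the page, a Point gives the top-left corner of the region, and a Size gives its width and height. The lines below are a minimal sketch of that coordinate model using the Rectangle, Point, and Size types from the examples that follow; the numbers are purely illustrative.

from groupdocs.parser.options import Point, Rectangle, Size

# (0, 0) is the top-left corner of the page; x grows to the right, y grows downward
top_left = Point(0, 0)                  # top-left corner of the region
band_size = Size(600, 200)              # width x height of the region
area = Rectangle(top_left, band_size)   # a 600 x 200 band at the top of the page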
Extract hyperlinks from page area
To extract hyperlinks from a specific rectangular area:
import sys

from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size

# Create an instance of Parser class
with Parser("./webpage.html") as parser:
    # Check if hyperlink extraction is supported
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
        sys.exit()
    # Define the area (upper portion, 600x200 pixels)
    area = Rectangle(Point(0, 0), Size(600, 200))
    # Create options for the area
    options = PageAreaOptions(area)
    # Extract hyperlinks from the specified area
    hyperlinks = parser.get_hyperlinks(options)
    if hyperlinks:
        print("Hyperlinks found in area:")
        for link in hyperlinks:
            print(f"  {link.text} -> {link.url}")
    else:
        print("No hyperlinks found in the specified area")
The following sample file is used in this example: webpage.html
Expected behavior: Returns only hyperlinks located within the specified rectangular area.
Extract hyperlinks from specific page area
To extract hyperlinks from an area on a specific page:
import sys

from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size

# Create an instance of Parser class
with Parser("document.pdf") as parser:
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
        sys.exit()
    # Define area (sidebar region)
    area = Rectangle(Point(400, 0), Size(200, 800))
    options = PageAreaOptions(area)
    # Get document info
    info = parser.get_document_info()
    # Process the first page
    page_index = 0
    if page_index < info.page_count:
        # Extract hyperlinks from the area on the specific page
        hyperlinks = parser.get_hyperlinks(page_index, options)
        if hyperlinks:
            print(f"Hyperlinks in sidebar (page {page_index + 1}):")
            for link in hyperlinks:
                print(f"  {link.text}: {link.url}")
The following sample file is used in this example: document.pdf
Expected behavior: Extracts hyperlinks from a specific area on a specific page.
Extract navigation links from header
To extract navigation links from the header area:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size


def extract_header_links(file_path):
    """
    Extract navigation links from the document header.
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []
        # Define the header area (top 100 pixels, full width)
        header_area = Rectangle(Point(0, 0), Size(1000, 100))
        options = PageAreaOptions(header_area)
        # Extract hyperlinks from the header
        hyperlinks = parser.get_hyperlinks(options)
        if hyperlinks:
            links = []
            for link in hyperlinks:
                links.append({
                    'text': link.text or 'No text',
                    'url': link.url or ''
                })
            return links
        return []


# Usage
nav_links = extract_header_links("website.html")
print(f"Navigation links ({len(nav_links)}):")
for link in nav_links:
    print(f"  {link['text']} → {link['url']}")
The following sample file is used in this example: website.html
Expected behavior: Extracts navigation or menu links typically found in document headers.
Extract links from multiple regions
To extract hyperlinks from several predefined areas:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size


def extract_links_from_regions(file_path, regions):
    """
    Extract hyperlinks from multiple regions.

    Args:
        file_path: Path to the document
        regions: Dict of region names to (x, y, width, height) tuples
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return {}
        results = {}
        for region_name, (x, y, w, h) in regions.items():
            # Create a rectangle for this region
            area = Rectangle(Point(x, y), Size(w, h))
            options = PageAreaOptions(area)
            # Extract hyperlinks from this region
            hyperlinks = parser.get_hyperlinks(options)
            if hyperlinks:
                results[region_name] = [link.url for link in hyperlinks]
            else:
                results[region_name] = []
        return results


# Define regions
regions = {
    'header': (0, 0, 600, 100),
    'sidebar': (0, 100, 150, 700),
    'footer': (0, 800, 600, 100)
}

# Extract links from the regions
links_by_region = extract_links_from_regions("page.html", regions)
for region, urls in links_by_region.items():
    print(f"{region.upper()}: {len(urls)} links")
    for url in urls[:5]:  # Show first 5
        print(f"  - {url}")
The following sample file is used in this example: page.html
Expected behavior: Extracts and categorizes hyperlinks by document region.
Extract footer links
To extract links from the footer area:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size


def extract_footer_links(file_path, page_height=842):
    """
    Extract links from the document footer.

    Args:
        file_path: Path to the document
        page_height: Approximate page height in points (default: A4 = 842)
    """
    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []
        # Define the footer area (bottom 100 points)
        footer_y = page_height - 100
        footer_area = Rectangle(Point(0, footer_y), Size(600, 100))
        options = PageAreaOptions(footer_area)
        # Extract hyperlinks from the footer
        hyperlinks = parser.get_hyperlinks(options)
        if hyperlinks:
            links = []
            for link in hyperlinks:
                links.append({
                    'text': link.text or '',
                    'url': link.url or '',
                    'position': (link.rectangle.left, link.rectangle.top)
                })
            return links
        return []


# Usage
footer_links = extract_footer_links("document.pdf")
print(f"Footer links: {len(footer_links)}")
for link in footer_links:
    print(f"{link['text']}: {link['url']}")
The following sample file is used in this example: document.pdf
Expected behavior: Extracts links commonly found in document footers (e.g., copyright, contact, social media).
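The page_height parameter makes the helper easy to adapt to other page sizes. As a brief usage sketch (the file name is hypothetical), pass the height of a US Letter page, 11 in × 72 pt = 792 points, instead of the A4 default:

# US Letter pages are 792 points tall (11 in x 72 pt/in); the default above assumes A4 (842 pt)
letter_footer_links = extract_footer_links("letter_document.pdf", page_height=792)
print(f"Footer links on a Letter-sized page: {len(letter_footer_links)}")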
Filter links by area and URL pattern
To extract links from an area matching a URL pattern:
import re

from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size


def extract_links_by_area_and_pattern(file_path, area_rect, url_pattern):
    """
    Extract links from an area matching a URL pattern.

    Args:
        file_path: Path to the document
        area_rect: Tuple (x, y, width, height)
        url_pattern: Regex pattern for URLs
    """
    x, y, w, h = area_rect
    area = Rectangle(Point(x, y), Size(w, h))
    options = PageAreaOptions(area)

    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return []
        hyperlinks = parser.get_hyperlinks(options)
        if not hyperlinks:
            return []
        # Filter by URL pattern
        matching_links = []
        pattern = re.compile(url_pattern, re.IGNORECASE)
        for link in hyperlinks:
            if link.url and pattern.search(link.url):
                matching_links.append({
                    'text': link.text,
                    'url': link.url
                })
        return matching_links


# Usage - extract PDF links from the sidebar
sidebar_area = (400, 100, 200, 700)
pdf_links = extract_links_by_area_and_pattern("document.html", sidebar_area, r'\.pdf$')
print(f"PDF download links in sidebar: {len(pdf_links)}")
for link in pdf_links:
    print(f"  {link['text']}: {link['url']}")
The following sample file is used in this example: document.html
Expected behavior: Extracts only links from the specified area that match the URL pattern.
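Because filtering happens after extraction with an ordinary regular expression, the same helper can be reused with any pattern. The short usage sketch below (the area, file name, and pattern are illustrative) collects only absolute HTTPS links from a header band:

# Reuse the helper with a different area and pattern: HTTPS links in the top band
header_area = (0, 0, 1000, 100)
https_links = extract_links_by_area_and_pattern("document.html", header_area, r'^https://')
print(f"HTTPS links in header: {len(https_links)}")
for link in https_links:
    print(f"  {link['text']}: {link['url']}")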
Extract social media links from specific area
To extract social media links from a defined region:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size


def extract_social_media_links(file_path, area_rect):
    """
    Extract social media links from a specific area.
    """
    x, y, w, h = area_rect
    area = Rectangle(Point(x, y), Size(w, h))
    options = PageAreaOptions(area)

    # Social media domains
    social_domains = [
        'facebook.com', 'twitter.com', 'linkedin.com',
        'instagram.com', 'youtube.com', 'tiktok.com'
    ]

    with Parser(file_path) as parser:
        if not parser.features.hyperlinks:
            print("Hyperlink extraction not supported")
            return {}
        hyperlinks = parser.get_hyperlinks(options)
        if not hyperlinks:
            return {}
        # Categorize by platform
        social_links = {}
        for link in hyperlinks:
            if link.url:
                for domain in social_domains:
                    if domain in link.url:
                        platform = domain.split('.')[0].capitalize()
                        if platform not in social_links:
                            social_links[platform] = []
                        social_links[platform].append(link.url)
                        break
        return social_links


# Usage - extract from the footer area
footer_area = (0, 800, 600, 100)
social_links = extract_social_media_links("webpage.html", footer_area)
print("Social Media Links:")
for platform, urls in social_links.items():
    print(f"  {platform}: {', '.join(urls)}")
The following sample file is used in this example: webpage.html
Expected behavior: Extracts and categorizes social media links from a specific area.
Notes
Coordinates are in document-specific units (points or pixels)
The origin (0, 0) is at the top-left corner
Hyperlinks partially overlapping the area boundary are included
Use get_hyperlinks(page_index, options) for page-specific area extraction
Empty collections are returned if no hyperlinks are found (not None)
Combine with page index for precise multi-page area extraction (see the sketch below)
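As a closing illustration (the file name and coordinates are assumptions for the sketch), the snippet below combines these notes: it passes the same PageAreaOptions to get_hyperlinks(page_index, options) for every page and relies on the empty-collection behavior, so pages without links in the area are simply skipped.

from groupdocs.parser import Parser
from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size

with Parser("multi_page_document.pdf") as parser:
    if not parser.features.hyperlinks:
        print("Hyperlink extraction not supported")
    else:
        # The same sidebar-like area on every page (illustrative coordinates)
        options = PageAreaOptions(Rectangle(Point(400, 0), Size(200, 800)))
        info = parser.get_document_info()
        for page_index in range(info.page_count):
            # An empty collection (not None) is returned when a page has no links in the area
            hyperlinks = parser.get_hyperlinks(page_index, options)
            for link in hyperlinks:
                print(f"Page {page_index + 1}: {link.text} -> {link.url}")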