Grab Links from a Web Page
There are several ways to get all the links in a web page. I like this approach because it is simple and straightforward to implement. Most of the alternatives amount to text processing with regular expressions. This approach only requires a little knowledge of HTML and the WebBrowser control.
The .NET Framework has shipped with the WebBrowser control since version 2.0. The control can be used to work with HTML DOM objects. Because it deals with the web, the documents it loads are HTML, so some knowledge of HTML is needed to use it effectively.
The Document object is populated only after the page has loaded into the WebBrowser control, so we must wait until loading has finished before accessing it. The control raises a DocumentCompleted event for this purpose. Unfortunately the event does not behave exactly as its name suggests: it can fire more than once for a single page, for example once for each frame. So we also check the ReadyState property, which should have the value WebBrowserReadyState.Complete.
The next step is to get all the links once the document is loaded. Here we can use Document.Links to fetch all of them. Yes, it really is that simple. The drawback is that every item in the Links collection is an HtmlElement, a general-purpose type that represents any HTML element. So you cannot expect direct properties such as href or target. Instead we use GetAttribute, for example GetAttribute("href"), to read the attributes of each HtmlElement.
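Public Class Form1
    Private Sub Button1_Click(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles Button1.Click
        WebBrowser1.Navigate("http://www.example.com/") ' placeholder URL (assumption)
    End Sub
    Private Sub WebBrowser1_DocumentCompleted( _
            ByVal sender As Object, _
            ByVal e As WebBrowserDocumentCompletedEventArgs) _
            Handles WebBrowser1.DocumentCompleted
        If (WebBrowser1.ReadyState = WebBrowserReadyState.Complete) Then
            For Each ClientControl As HtmlElement In WebBrowser1.Document.Links
                Console.WriteLine(ClientControl.GetAttribute("href")) ' print each link address (output choice is an assumption)
            Next
        End If
    End Sub
End Class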
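
Since href and target are only reachable through GetAttribute, it can help to gather them in one place. The following is a minimal sketch, not part of the original sample: the module name LinkHelper and the function CollectLinks are hypothetical, and the output format is an arbitrary choice. It assumes the document has already finished loading.

```vb
Imports System.Collections.Generic
Imports System.Windows.Forms

' Hypothetical helper module (an assumption, not from the original sample).
Public Module LinkHelper
    ' Returns one entry per link in an already loaded document,
    ' combining the href and target attributes.
    Public Function CollectLinks(ByVal doc As HtmlDocument) As List(Of String)
        Dim results As New List(Of String)
        If doc Is Nothing Then Return results

        For Each link As HtmlElement In doc.Links
            ' GetAttribute yields an empty string when the attribute is absent.
            Dim href As String = link.GetAttribute("href")
            Dim target As String = link.GetAttribute("target")

            ' Skip links that carry no address.
            If String.IsNullOrEmpty(href) Then Continue For

            If String.IsNullOrEmpty(target) Then
                results.Add(href)
            Else
                results.Add(href & " (target: " & target & ")")
            End If
        Next
        Return results
    End Function
End Module
```

Inside the DocumentCompleted handler, once ReadyState is WebBrowserReadyState.Complete, the helper could be called as LinkHelper.CollectLinks(WebBrowser1.Document) and the returned list shown in a ListBox or written to the console.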