Using Internet Explorer from .NET

August 1, 2012 ⁄ General

Earlier in this book we looked at how to read HTML from websites, and how to navigate through websites using GET and POST requests. These techniques offer high performance, but many websites use cryptic POST data, complex cookie data, and JavaScript-rendered text. In those cases it is useful to know that you can call on Internet Explorer’s browsing engine to help you get the data you need.

It must be stated though, that using Internet Explorer to data mine web pages creates a much larger memory footprint, and is not as fast as scanning using HTTP requests alone. But it does come into its own when a data mining process requires a degree of human interaction. A good example of this would be if you wanted to create an automated test of your website, and needed to allow a non-technical user the ability to follow a sequence of steps, and select data to extract and compare, based on the familiar Internet Explorer interface.

This chapter is divided into two main sections. The first deals with how to use the Internet Explorer object to interact with the various types of web page controls. The second deals with how Internet Explorer can detect and respond to a user interacting with web page elements.

5.1 Web page navigation

The procedure for including the Internet Explorer object in your application differs depending on which version of Visual Studio .NET you are using. After starting a new Windows Forms project, users of Visual Studio .NET 2002 should right-click on their toolbox and select “Customize toolbox”, click “COM components”, then select “Microsoft Web Browser”. Users of Visual Studio .NET 2003 should right-click on their toolbox and select “Add/Remove Items”, and then follow the same procedure. In Visual Studio .NET 2005, you do not need to add the web browser to the toolbox; just drag the “WebBrowser” control onto the form.

An important distinction between the Internet Explorer object used in Visual Studio .NET 2002/03 and the 2005 version is that the latter uses a native .NET class to interact with Internet Explorer, whereas the former uses a .NET wrapper around a COM (Component Object Model) object. This creates some syntactic differences between how Internet Explorer is used within .NET 2.0 and .NET 1.x. The first example in this chapter will cover both versions of .NET for completeness. Further examples will show .NET 2.0 code only, unless the equivalent .NET 1.x code would differ substantially.

The first thing you will need to know when using Internet Explorer is how to navigate to a web page. Since Internet Explorer works asynchronously, you will also need to know when it has finished loading a web page. In the following example, we will simply navigate to www.google.com and pop up a message box once the page is loaded.

To begin this example, drop an Internet Explorer object onto a form, as described above, and call it “WebBrowser”. Now add a button to the form and name it “btnNavigate”. Double-click the button and add the following code:

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync("http://www.google.com");

MessageBox.Show("page loaded");

}

VB.NET

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("http://www.google.com")

MessageBox.Show("page loaded")

End Sub

We then create the NavigateToUrlSync method. Note how the C# version differs between versions 1.x and 2.0. This is because the COM object expects four optional ref object parameters, which can define the flags, target frame name, post data and headers sent with the request. They are not used in this case, yet since C# (before version 4.0) does not support optional parameters, they have to be passed in nonetheless.

C# 1.x

public void NavigateToUrlSync(string url)

{

object oMissing = null;

bBusy=true;

WebBrowser.Navigate(url,ref oMissing,ref oMissing,ref oMissing,ref oMissing);

while(bBusy)

{

Application.DoEvents();

}

}

C# 2.0

public void NavigateToUrlSync(string url)

{

bBusy=true;

WebBrowser.Navigate(url);

while(bBusy)

{

Application.DoEvents();

}

}

VB.NET

Public Sub NavigateToUrlSync(ByVal url As String)

bBusy = True

WebBrowser.Navigate(url)

While (bBusy)

Application.DoEvents()

End While

End Sub

The while loop polls until the public bBusy flag is cleared. The DoEvents call ensures that the application remains responsive whilst waiting for a response from the web server.

To clear the bBusy flag, we handle either the DocumentComplete event (.NET 1.x) or the DocumentCompleted event (.NET 2.0) thus:

C# 1.x

private void WebBrowser_DocumentComplete(object sender, AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent e)

{

bBusy = false;

}

C# 2.0

private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)

{

bBusy = false;

}

VB.NET 1.x

Private Sub WebBrowser_DocumentComplete(ByVal sender As Object, _

ByVal e As AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent) _

Handles WebBrowser.DocumentComplete

bBusy = False

End Sub

VB.NET 2.0

Private Sub WebBrowser_DocumentCompleted(ByVal sender As Object, _

ByVal e As WebBrowserDocumentCompletedEventArgs) _

Handles WebBrowser.DocumentCompleted

bBusy = False

End Sub

To finish off the example, don’t forget to declare the public bBusy flag.

public bool bBusy = false;

VB.NET

Public bBusy As Boolean = False

To test the application, compile and run it in Visual Studio, then press the navigate button. You should see something similar to Figure 5.0.

Figure 5.0 – Navigating synchronously to a web page
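One caveat with the polling loop shown above is that it never exits if the completion event does not fire, for example when the network is down. The following is a minimal sketch of a .NET 2.0 variant that gives up after a timeout. The class name BrowserForm, the field names and the timeout parameter are assumptions made for illustration and are not part of the original listing.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

// Hypothetical form used only for this sketch.
public class BrowserForm : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser();
    private bool busy;

    public BrowserForm()
    {
        webBrowser.Dock = DockStyle.Fill;
        webBrowser.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(OnDocumentCompleted);
        Controls.Add(webBrowser);
    }

    private void OnDocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        // Clear the flag so the polling loop below can exit.
        busy = false;
    }

    // Returns true if the page finished loading,
    // or false if the timeout elapsed first.
    public bool NavigateToUrlSync(string url, TimeSpan timeout)
    {
        busy = true;
        Stopwatch timer = Stopwatch.StartNew();
        webBrowser.Navigate(url);

        while (busy && timer.Elapsed < timeout)
        {
            Application.DoEvents(); // keep pumping messages so the event can fire
            Thread.Sleep(10);       // avoid spinning the CPU while waiting
        }
        return !busy;
    }
}
```

A caller could then write something like NavigateToUrlSync("http://www.google.com", TimeSpan.FromSeconds(30)) and show an error message when the method returns false, rather than hanging indefinitely.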

5.2 Manipulating web pages

An advantage of using Internet Explorer over raw HTTP requests is that you get access to the DOM (Document Object Model) of web pages once they are loaded into Internet Explorer. For developers familiar with JavaScript, this should be an added bonus, since you will be able to control the web page in much the same way as if you were using JavaScript within an HTML page.

The main difference, however, between using the DOM in .NET and in JavaScript is that the .NET languages are strongly typed, and therefore you must know the type of the element you are interacting with before you can access its full potential.

If you are using .NET 1.x you will need to reference the HTML type library by clicking Project > Add Reference, then selecting Microsoft.mshtml from the list. For each of the examples in this section you must import the namespace into your code thus:

using mshtml;

VB.NET

Imports mshtml

If you then cast the WebBrowser.Document object to an HTMLDocument class, many of the code examples shown below should work equally well for .NET 1.x as for .NET 2.0.

5.2.1 Frames

Frames may be going out of fashion in modern websites, but you may still need to extract data from a website that uses them, so you need to know how to handle them within Internet Explorer. In this section, the code differs substantially between versions 1.x and 2.0 of .NET, so source code for both is included.

To create a simple frameset, create three files, Frameset.html, left.html and right.html, containing the following HTML code respectively.

Frameset.html

<html>

<frameset cols="50%,50%">

<frame src="left.html">

<frame src="right.html">

</frameset>

</html>

Left.html

<html>

This is the left frame

</html>

Right.html

<html>

This is the right frame

</html>

In the following example, we will use Internet Explorer to read the HTML contents of the left frame. This example uses code from the program listing in section 5.1, and assumes you have saved the HTML files in C:/

VB.NET 1.x

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/frameset.html")

Dim hDoc As HTMLDocument

hDoc = WebBrowser.Document

hDoc = CType(hDoc.frames.item(0), HTMLWindow2).document

MessageBox.Show(hDoc.body.innerHTML)

End Sub

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/frameset.html")

Dim hDoc As HtmlDocument

hDoc = WebBrowser.Document.Window.Frames(0).Document

MessageBox.Show(hDoc.Body.InnerHtml)

End Sub

C# 1.x

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/frameset.html");

HTMLDocument hDoc;

object oFrameIndex = 0;

hDoc = (HTMLDocument)WebBrowser.Document;

hDoc = (HTMLDocument)((HTMLWindow2)hDoc.frames.item(

ref oFrameIndex)).document;

MessageBox.Show(hDoc.body.innerHTML);

}

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/frameset.html");

HtmlDocument hDoc;

hDoc = WebBrowser.Document.Window.Frames[0].Document;

MessageBox.Show(hDoc.Body.InnerHtml);

}

The main difference between the .NET 2.0 and .NET 1.x versions of the above code is that the indexer on the frames collection returns an object, which must be cast to an HTMLWindow2 under the COM wrapper in .NET 1.x. In .NET 2.0 the indexer performs the cast internally, and returns an HtmlWindow object.

To test the application, compile and run it from Visual Studio .NET, press the navigate button, and a message box should pop up saying “This is the left frame”, as shown in Figure 5.1

Figure 5.1 – Reading framesets with Internet Explorer
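The examples above read only the first frame. As a minimal sketch for .NET 2.0, the helper below walks every frame in the loaded document and collects the body HTML of each. The class and method names are hypothetical, and the handling of UnauthorizedAccessException reflects the assumption that frames loaded from another domain may refuse access to their document.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class FrameReader
{
    // Returns the body HTML of every frame in the loaded document.
    // A null entry marks a frame whose content could not be read.
    public static List<string> ReadAllFrames(WebBrowser browser)
    {
        List<string> contents = new List<string>();
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Window == null || doc.Window.Frames == null)
        {
            return contents; // nothing loaded, or the page has no frames
        }

        foreach (HtmlWindow frame in doc.Window.Frames)
        {
            try
            {
                HtmlDocument frameDoc = frame.Document;
                HtmlElement body = (frameDoc == null) ? null : frameDoc.Body;
                contents.Add(body == null ? null : body.InnerHtml);
            }
            catch (UnauthorizedAccessException)
            {
                // Assumption: cross-domain frames can deny access.
                contents.Add(null);
            }
        }
        return contents;
    }
}
```

With the frameset from this section loaded, calling FrameReader.ReadAllFrames(WebBrowser) after NavigateToUrlSync would be expected to return one entry per frame, in document order.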

5.2.2 Input boxes

Input boxes are used in HTML to allow the user to enter text into a web page. Here we will automatically populate an input box with some data.

Given some HTML, which we save as InputBoxes.html, as follows:

<html>

<form>

My Name is :

<input type="text" id="myName">

</form>

</html>

We can get a reference to the input box on the form by calling GetElementById on the HtmlDocument. In .NET 1.x the result of getElementById should then be cast to an IHTMLInputElement.

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/InputBoxes.html");

HtmlElement hElement;

hElement = WebBrowser.Document.GetElementById("myName");

hElement.SetAttribute("value", "Joe Bloggs");

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/InputBoxes.html")

Dim hElement As HtmlElement

hElement = WebBrowser.Document.GetElementById("myName")

hElement.SetAttribute("value", "Joe Bloggs")

End Sub

In order to enter the text into the input box, we call the SetAttribute method of the HtmlElement, passing in the property to change, and the new text. In .NET 1.x we would set the value property of the IHTMLInputElement to the new text.

To test the application, compile and run it from Visual Studio .NET, then press the navigate button. You should see the name “Joe Bloggs” appearing in the input box as in Figure 5.2

Figure 5.2 – Input Boxes in Internet Explorer
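The .NET 1.x approach described above is not shown in code, so here is a minimal sketch of it using the mshtml library. The class and method names are hypothetical. The helper takes the object returned by the COM control's Document property, and it returns false rather than throwing when the document or the element is not of the expected type.

```csharp
using mshtml;

// Hypothetical helper, not part of the original listing.
public class MshtmlInputHelper
{
    // 'browserDocument' is the value of the COM browser control's
    // Document property. Returns true if the value was set.
    public static bool SetInputValue(object browserDocument,
        string id, string newValue)
    {
        HTMLDocument hDoc = browserDocument as HTMLDocument;
        if (hDoc == null)
        {
            return false; // no HTML document loaded
        }

        // Cast the generic element to the input-specific interface.
        IHTMLInputElement input =
            hDoc.getElementById(id) as IHTMLInputElement;
        if (input == null)
        {
            return false; // element missing, or not an input box
        }

        input.value = newValue;
        return true;
    }
}
```

For the page in this section, the call would be MshtmlInputHelper.SetInputValue(WebBrowser.Document, "myName", "Joe Bloggs").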

5.2.3 Drop down lists

In HTML, drop down lists are used in web pages to allow users to choose from a list of pre-defined values. In the following example, we will demonstrate how to set the value of a drop down list, and then read it back.

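We shall start off with an HTML file, which we save as DropDownLists.html

<html>

<form>

My favourite colour is:

<select id="myColour">

<option value="Blue">Blue</option>

<option value="Red">Red</option>

</select>

</form>

</html>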

We can get a reference to the drop down list by calling GetElementById on the HtmlDocument. In .NET 1.x the result of getElementById should then be cast to an IHTMLSelectElement.

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/dropdownlists.html");

HtmlElement hElement;

hElement = WebBrowser.Document.GetElementById("myColour");

hElement.SetAttribute("selectedIndex", "1");

MessageBox.Show("My favourite colour is:" + hElement.GetAttribute("value"));

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/dropdownlists.html")

Dim hElement As HtmlElement

hElement = WebBrowser.Document.GetElementById("myColour")

hElement.SetAttribute("selectedIndex", "1")

MessageBox.Show("My favourite colour is:" + hElement.GetAttribute("value"))

End Sub

Here, we can see that in order to set our selection we pass “selectedIndex” and the selection number to SetAttribute. We then pass “value” to GetAttribute in order to read back the selection. In .NET 1.x, we achieve the same results by setting the selectedIndex property on the IHTMLSelectElement and reading back the selection from the value property.

To test the application, compile and run it from Visual Studio .NET and press the navigate button. You should see a message box appear saying “My favourite colour is:Red”, as shown in Figure 5.3.

Figure 5.3 – Using drop down lists in Internet Explorer
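Setting selectedIndex requires knowing the position of the option in advance. As a minimal sketch for .NET 2.0, the helper below looks the option up by its value instead. The class and method names are hypothetical, and the code assumes that the children of the select element are its option elements in order (that is, the list does not use optgroup elements).

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class SelectHelper
{
    // Selects the option whose value attribute matches 'wanted'.
    // Returns false if the list or the option cannot be found.
    public static bool SelectByValue(HtmlDocument doc,
        string selectId, string wanted)
    {
        HtmlElement select = doc.GetElementById(selectId);
        if (select == null)
        {
            return false;
        }

        int index = 0;
        // Assumption: each child of the select is one option element.
        foreach (HtmlElement option in select.Children)
        {
            if (string.Equals(option.GetAttribute("value"), wanted,
                StringComparison.OrdinalIgnoreCase))
            {
                select.SetAttribute("selectedIndex", index.ToString());
                return true;
            }
            index++;
        }
        return false;
    }
}
```

Calling SelectHelper.SelectByValue(WebBrowser.Document, "myColour", "Red") is intended to have the same effect as setting selectedIndex directly, without hard-coding the position.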

5.2.4 Check boxes and radio buttons

Check boxes and radio buttons are generally used on web pages to allow the user to select between small numbers of options. In the following example, we shall demonstrate how to toggle check boxes and radio buttons.

We shall start off with an HTML file, which we will save as CheckBoxes.html

<html>

<form>

<input type="checkbox" name="myCheckBox">Check this.<br>

<input type="radio" name="myRadio" value="Yes">Yes

<input type="radio" name="myRadio" checked="true" value="No">No

</form>

</html>

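As before, we can get a reference to the checkbox by calling GetElementById. Note that the checkbox has a name but no id, so this lookup appears to rely on Internet Explorer's legacy behaviour of also matching the name attribute, without regard to case. However, since the two radio buttons have the same name, we need to use Document.All.GetElementsByName and then select the required radio button from the HtmlElementCollection returned.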

In .NET 1.x, we would use a call to getElementsByName on the HTMLDocument. This returns an IHTMLElementCollection. We can then get the reference to the IHTMLInputElement with the method item(null,1).

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/checkboxes.html");

HtmlElement hElement;

HtmlElementCollection hElements;

hElement = WebBrowser.Document.GetElementById("mycheckBox");

hElement.SetAttribute("checked", "true");

hElements = WebBrowser.Document.All.GetElementsByName("myRadio");

hElement = hElements[0];

hElement.SetAttribute("checked", "true");

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/checkboxes.html")

Dim hElement As HtmlElement

hElement = WebBrowser.Document.GetElementById("mycheckBox")

hElement.SetAttribute("checked", "true")

hElement = WebBrowser.Document.All.GetElementsByName("myRadio").Item(0)

hElement.SetAttribute("checked", "true")

End Sub

As before, we set the property of the HtmlElement using the SetAttribute method. In .NET 1.x, you need to set the @checked property on the IHTMLInputElement.

To test the application, compile and run it from Visual Studio, then press the navigate button. You should see the check box and radio button toggle simultaneously, as shown in Figure 5.4.

Figure 5.4 – Using Radio buttons and Check boxes in Internet Explorer
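Picking a radio button by position is fragile if the page changes. The following is a minimal sketch for .NET 2.0 that selects a radio button by its value attribute instead, using the same GetElementsByName and SetAttribute calls as the example above. The class and method names are hypothetical.

```csharp
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class RadioHelper
{
    // Checks the radio button in group 'groupName' whose value
    // attribute equals 'wanted'. Returns false if none matches.
    public static bool CheckRadioByValue(HtmlDocument doc,
        string groupName, string wanted)
    {
        HtmlElementCollection radios =
            doc.All.GetElementsByName(groupName);

        foreach (HtmlElement radio in radios)
        {
            if (radio.GetAttribute("value") == wanted)
            {
                radio.SetAttribute("checked", "true");
                return true;
            }
        }
        return false;
    }
}
```

For the page in this section, RadioHelper.CheckRadioByValue(WebBrowser.Document, "myRadio", "Yes") would target the same button as index 0 in the example above.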

5.2.5 Buttons

Submit buttons and standard buttons are generally used to submit forms in HTML. They form a crucial part in navigating any website.

Given a simple piece of HTML, which we save as Buttons.html as follows:

<html>

</form>

</html>

We can get a reference to the button on the form by calling getElementById on the HtmlDocument. In .NET 1.x this should be then cast to an IHTMLElement.

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/buttons.html");

HtmlElement hElement;

hElement = WebBrowser.Document.GetElementById("btnSubmit");

hElement.InvokeMember("click");

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/buttons.html")

Dim hElement As HtmlElement

hElement = WebBrowser.Document.GetElementById("btnSubmit")

hElement.InvokeMember("click")

End Sub

In the above example, we can see that after we get a reference to the button, we call the click method using InvokeMember. Similarly, if we wanted to submit the form without clicking the button, we could get a reference to myForm and pass “submit” to the InvokeMember method.

In .NET 1.x, there is no InvokeMember method on IHTMLElement, so you must call the click method of the IHTMLElement directly. In the case of a form, you should cast the IHTMLElement to an IHTMLFormElement and call its submit method.

To test this application, compile and run it from Visual Studio .NET, and press the navigate button. The form should load and then automatically forward itself to a google.com search result page as in Figure 5.5.

Figure 5.5 – Using Buttons and Forms in Internet Explorer.
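Submitting the form directly, as mentioned above, can be sketched as follows for .NET 2.0. The class and method names are hypothetical. The sketch uses the Forms collection of the HtmlDocument, whose string indexer looks an element up by its name or id.

```csharp
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class FormHelper
{
    // Submits the form identified by name or id without clicking
    // a button. Returns false if no such form exists in the page.
    public static bool SubmitForm(HtmlDocument doc, string formName)
    {
        HtmlElement form = doc.Forms[formName];
        if (form == null)
        {
            return false;
        }

        // Equivalent to calling form.submit() from script.
        form.InvokeMember("submit");
        return true;
    }
}
```

For the page in this section, the call would be FormHelper.SubmitForm(WebBrowser.Document, "myForm"). Be aware that submitting from code in this way is not the same as a user click: any onsubmit handler attached to the form is generally not run.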

5.2.6 JavaScript

Many web pages use JavaScript to perform complex interactions between the user and the page. It is important to know how to execute JavaScript functions from within Internet Explorer. The simplest method is to use Navigate with the prefix javascript: followed by the function name. However, this does not give us a return value, nor will it work correctly in all situations.

We shall start with an HTML page, which contains a JavaScript function to display some text. This will be saved as JavaScript.html

<html>

<span id="hiddenText" style="display:none">This was displayed by javascript</span>

<script type="text/javascript">

function jsFunction()

{

window.document.all["hiddenText"].style.display="block";

return "ok";

}

</script>

</html>

We can then use the Document.InvokeScript method to execute the JavaScript thus:

C# 2.0

private void btnNavigate_Click(object sender, System.EventArgs e)

{

NavigateToUrlSync(@"C:/javascript.html");

string strRetVal = "";

strRetVal = (string)WebBrowser.Document.InvokeScript("jsFunction");

MessageBox.Show(strRetVal);

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _

ByVal e As System.EventArgs) Handles btnNavigate.Click

NavigateToUrlSync("C:/javascript.html")

Dim strRetVal As String

strRetVal = WebBrowser.Document.InvokeScript("jsFunction").ToString()

MessageBox.Show(strRetVal)

End Sub

In .NET 1.x, we would call the parentWindow.execScript method on the HTMLDocument, remembering to add empty parentheses after the JavaScript function name. Unfortunately, execScript returns null instead of the JavaScript return value.

To test the application, compile and run it from Visual Studio .NET, then press the Navigate button. You should see the text “This was displayed by javascript” appear in the page, as shown in Figure 5.6, along with a message box containing the return value “ok”.

Figure 5.6 – Using JavaScript in Internet Explorer
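InvokeScript also has an overload that accepts arguments, and it returns null when the function does not return a value, in which case calling ToString() on the result directly would fail. The following is a minimal sketch for .NET 2.0 that wraps both points. The class and method names are hypothetical.

```csharp
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class ScriptHelper
{
    // Calls a JavaScript function defined in the loaded page and
    // returns its result as a string, or null if there is no result.
    public static string CallScript(HtmlDocument doc,
        string functionName, params object[] args)
    {
        object result = doc.InvokeScript(functionName, args);
        return (result == null) ? null : result.ToString();
    }
}
```

For the page in this section, ScriptHelper.CallScript(WebBrowser.Document, "jsFunction") is the equivalent call. A function that takes parameters could be called by appending the values after the function name.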

5.3 Extracting data from web pages

In order to extract HTML from a web page using Internet Explorer, you need to call Body.Parent.OuterHtml in .NET 2.0 or body.parentElement.outerHTML in .NET 1.x. You should be aware that the HTML returned by this method is different to the actual HTML content of the page.

Internet Explorer will “correct” HTML in the page by adding <BODY>, <TBODY> and <HEAD> tags where missing. It will also capitalize existing HTML Tags, and make other formatting changes that you should be aware of.

Techniques for parsing this textual data are explained later in the book under the section concerning Regular Expressions.
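As a minimal sketch for .NET 2.0, the extraction described in this section can be wrapped in a helper that guards against a page that has not finished loading. The class and method names are hypothetical. The string returned is Internet Explorer's corrected markup, not the bytes sent by the server.

```csharp
using System.Windows.Forms;

// Hypothetical helper, not part of the original listing.
public static class HtmlExtractor
{
    // Returns the HTML of the loaded page as Internet Explorer
    // sees it, or null if no document is available yet.
    public static string GetRenderedHtml(WebBrowser browser)
    {
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Body == null || doc.Body.Parent == null)
        {
            return null;
        }

        // Body.Parent is the <HTML> element, so OuterHtml
        // covers both the head and the body of the page.
        return doc.Body.Parent.OuterHtml;
    }
}
```

A typical use would be to call NavigateToUrlSync first and then pass the WebBrowser control to HtmlExtractor.GetRenderedHtml, checking the result for null before parsing it.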

5.4 Advanced user interaction

When designing an application that uses Internet Explorer as a tool for data mining, it is an added benefit that the user can interact with the control in a natural fashion in order to manipulate its behavior. The following sections describe ways in which a user can interact with Internet Explorer, and how these events can be handled within .NET.

5.4.1 Design mode

If you want to give the user the ability to manipulate web pages on the fly, there is no simpler way to do it than using the built-in “design mode” in Internet Explorer. This particular feature is not supported by the managed .NET 2.0 WebBrowser control. However, it is possible to access the unmanaged interfaces that we were using in .NET 1.x through the Document.DomDocument property. This can then be cast to the HTMLDocument in the mshtml library (not to be confused with the managed HtmlDocument class). Therefore, in this case, you will need to add a reference to the mshtml library and add a “using mshtml” statement to the top of your code.