
The Swing HTML Parser

Posted January 11, 2017
 

Parsing a Netscape Navigator Bookmarks File

By Scott Violet

The high-level Swing component, JEditorPane, is responsible for displaying, among other things, HTML text. However, this article shows how you can use the HTML parser outside of JEditorPane. An example provided shows how to use the standard HTML parser (also shipped with HotJava) to parse the bookmarks file created by Netscape Navigator. Previous Swing Connection articles have featured the custom component, JTreeTable. This article also demonstrates an enhanced editable JTreeTable.

The parser provided by Swing is DTD driven, and therefore is capable of parsing much more than HTML. This article uses the parser with its standard DTD for parsing HTML documents. Future articles will address how to configure the parser with a custom DTD, and will discuss the binary DTD format used by the parser.

HTMLEditorKit.ParserCallback

The main entry point into the HTML parser is the class ParserDelegator. ParserDelegator parses an HTML document passed in as a Reader and notifies the passed-in HTMLEditorKit.ParserCallback object as parsing progresses. ParserCallback defines the following methods:

    public void flush() throws BadLocationException
    public void handleText(char[] data, int pos)
    public void handleComment(char[] data, int pos)
    public void handleStartTag(HTML.Tag t, 
                               MutableAttributeSet a, int pos)
    public void handleEndTag(HTML.Tag t, int pos)
    public void handleSimpleTag(HTML.Tag t, 
                                MutableAttributeSet a, int pos)
    public void handleError(String errorMsg, int pos)
    public void handleEndOfLineString(String eol)
    


Here is a simple example of creating your own ParserCallback subclass to output all the text from an HTML document:

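    // Override only the callbacks of interest; the rest do nothing.
    HTMLEditorKit.ParserCallback callback = 
      new HTMLEditorKit.ParserCallback () {
        public void handleText(char[] data, int pos) {
            System.out.println(data);
        }
    };
    Reader reader = new FileReader("myFile.html");
    // The last argument is ignoreCharSet. With false, a charset
    // declared in a META tag is reported via ChangedCharSetException.
    new ParserDelegator().parse(reader, callback, false);
    reader.close();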
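
For readers who want something they can compile as is, here is a minimal, self-contained sketch of the same idea. The class name TextExtractor and the method extractText are hypothetical names chosen for this illustration; the sketch reads from a String rather than a file and passes true for ignoreCharSet so that a charset directive cannot interrupt the parse.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the article's sources.
class TextExtractor {

    /** Parses the HTML and returns each run of text the parser reports. */
    static List<String> extractText(String html) {
        final List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback =
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    chunks.add(new String(data));
                }
            };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p><p>World</p></body></html>";
        for (String chunk : extractText(html)) {
            System.out.println(chunk);
        }
    }
}
```

Collecting the text in a list instead of printing it directly makes the callback easy to reuse from other code.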
    


When the parser encounters a tag, it invokes either handleStartTag or handleSimpleTag, based on the tag. The method parameters specify the tag, any attributes on the tag, and the position in the reader where the element was encountered.

handleSimpleTag is invoked for empty tags. Empty tags are tags that are defined not to have an end tag, and can thus have no content or child tags. BR and IMG are examples of empty tags, whereas P is not an empty tag. (While the end tag for P is optional, P is not an empty tag.) The set of empty tags currently supported by the Swing HTML DTD is:

  • BASEFONT
  • BR
  • AREA
  • LINK
  • IMG
  • PARAM
  • HR
  • INPUT
  • ISINDEX
  • BASE
  • META
  • FRAME

The handleSimpleTag method is also invoked for tags not defined in the DTD. For example, <foo> is not a valid HTML tag, and thus handleSimpleTag is invoked when the tag foo is encountered. On the other hand, handleStartTag is invoked for the valid non-empty tags -- the normal tags that are defined in the DTD.
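
The split between the two methods can be observed with a small recording callback. The following is a sketch only; TagDispatchDemo and classify are hypothetical names, and the sample markup is made up for this illustration. It records which callback each tag arrives through.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the article's sources.
class TagDispatchDemo {

    /** Returns "start:tag" or "simple:tag" for each tag callback, in order. */
    static List<String> classify(String html) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback =
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t,
                                           MutableAttributeSet a, int pos) {
                    events.add("start:" + t);
                }
                @Override
                public void handleSimpleTag(HTML.Tag t,
                                            MutableAttributeSet a, int pos) {
                    events.add("simple:" + t);
                }
            };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // P is a normal tag, BR is an empty tag, foo is not in the DTD.
        System.out.println(classify(
            "<html><body><p>one<br>two<foo>three</foo></p></body></html>"));
    }
}
```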

 

Both handleStartTag and handleSimpleTag are passed a MutableAttributeSet containing the attributes of the tag. The MutableAttributeSet argument is reused by the caller. If you need to keep a reference to the AttributeSet, you must make a copy, perhaps using AttributeSet.copyAttributes. If an attribute is defined, the MutableAttributeSet key is an instance of HTML.Attribute, otherwise (with a few exceptions) it is a String containing the name of the attribute. For normal attributes, the attribute values in the AttributeSet are Strings. Two special keys and one value worth noting are:

  • ParserCallback.IMPLIED: Indicates the DTD implied a particular tag, but it was not present in the content. For example, <html><body><table><td> is not legal HTML, as TR is missing. The parser generates the TR, noting that the TR was implied by adding ParserCallback.IMPLIED as a key in the AttributeSet passed into handleStartTag.
  • HTML.Attribute.ENDTAG: Indicates the end of an element not defined in the DTD was encountered. Remember that handleSimpleTag is invoked for elements not defined in the DTD. handleSimpleTag is also invoked for the end of elements not defined in the DTD (such as </foo>). The callback method can check for this by checking for the key HTML.Attribute.ENDTAG in the passed-in AttributeSet.
  • HTML.NULL_ATTRIBUTE_VALUE: Indicates an attribute of an element did not have an explicit value, and the DTD did not have a default value. For example, <tr rowspan width=10% foo> illustrates the three possible types of attribute values. The width attribute has an explicit value of 10%. The rowspan attribute has an implicit value of 1 (implicit values are defined in the DTD; the attribute rowspan of a TR element has a default value of 1). The foo attribute has no default value, which is indicated with NULL_ATTRIBUTE_VALUE. The callback method can identify attributes that don't have a defined value by checking for the value HTML.NULL_ATTRIBUTE_VALUE in the AttributeSet as the value for the attribute name.
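
As a rough illustration of these checks, the sketch below copies each attribute set before inspecting it and notes the three special cases. AttributeInspector, describe, and the sample markup are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the article's sources.
class AttributeInspector {

    /** Returns one note per special key or value seen while parsing. */
    static List<String> describe(String html) {
        final List<String> notes = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback =
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t,
                                           MutableAttributeSet a, int pos) {
                    inspect(t, a);
                }
                @Override
                public void handleSimpleTag(HTML.Tag t,
                                            MutableAttributeSet a, int pos) {
                    inspect(t, a);
                }
                private void inspect(HTML.Tag t, MutableAttributeSet a) {
                    // The parser reuses the set, so work on a copy.
                    AttributeSet copy = a.copyAttributes();
                    if (copy.isDefined(HTMLEditorKit.ParserCallback.IMPLIED)) {
                        notes.add(t + " was implied");
                    }
                    if (copy.isDefined(HTML.Attribute.ENDTAG)) {
                        notes.add(t + " is an unknown end tag");
                    }
                    Enumeration<?> names = copy.getAttributeNames();
                    while (names.hasMoreElements()) {
                        Object key = names.nextElement();
                        if (HTML.NULL_ATTRIBUTE_VALUE.equals(
                                copy.getAttribute(key))) {
                            notes.add(t + " attribute " + key
                                      + " has no value");
                        }
                    }
                }
            };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return notes;
    }

    public static void main(String[] args) {
        String html = "<html><p><a href=\"x.html\" foo>link</a><bar>y</bar>";
        for (String note : describe(html)) {
            System.out.println(note);
        }
    }
}
```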

handleEndTag is invoked for closing tags that are known to the DTD, such as </html>.

handleText, as the name implies, is invoked when any content is encountered in the document. The text and location (as an integer into the document) are passed in. Each occurrence of white space (any newlines, tabs, carriage returns, or multiple spaces) is coalesced into a single space character.

Any errors encountered are reported via the handleError method. The default implementation of the callback method ignores any errors, as many pages on the web do not contain valid HTML.

flush is actually not invoked by the parser, but by HTMLEditorKit, to indicate that parsing has successfully finished.

Since white space is stripped when parsing, handleEndOfLineString is invoked after parsing with the best guess for the end of line string. The end of line string will be \n, \r, or \r\n, whichever is encountered the most in the document.
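
A callback that wants to see these notifications only needs to override the two methods. The sketch below is one possible shape; ParseDiagnostics, its fields, and run are hypothetical names, and the exact wording of the error messages is whatever the parser supplies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the article's sources.
class ParseDiagnostics extends HTMLEditorKit.ParserCallback {

    /** Every error reported by the parser, prefixed with its position. */
    final List<String> errors = new ArrayList<>();

    /** The parser's best guess at the line ending; set after parsing. */
    String endOfLine;

    @Override
    public void handleError(String errorMsg, int pos) {
        errors.add(pos + ": " + errorMsg);
    }

    @Override
    public void handleEndOfLineString(String eol) {
        endOfLine = eol;
    }

    /** Parses the HTML and returns the collected diagnostics. */
    static ParseDiagnostics run(String html) {
        ParseDiagnostics diagnostics = new ParseDiagnostics();
        try {
            new ParserDelegator().parse(
                new StringReader(html), diagnostics, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return diagnostics;
    }

    public static void main(String[] args) {
        ParseDiagnostics d =
            run("<html><body>\r\n<foo>x</foo>\r\n</body></html>\r\n");
        for (String error : d.errors) {
            System.out.println(error);
        }
        // Make the control characters visible when printing.
        System.out.println("line ending: "
            + d.endOfLine.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```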

Sample Document

Let's take a look at what happens when a small HTML file is parsed. Consider the following HTML:

    <html><p>A <foo>xx</foo><a href=test>link</a>
    

The following shows the sequence of invocations on the callback as well as some of the values:

    Method           Position  Tag   Attributes    Text
    handleStartTag   0         html
    handleStartTag   6         head  IMPLIED=true
    handleEndTag     6         head
    handleStartTag   6         body  IMPLIED=true
    handleStartTag   6         p
    handleError      16                            tag.unrecognized foo??
    handleText       9                             A
    handleSimpleTag  11        foo
    handleError      25                            end.unrecognized foo??
    handleText       16                            xx
    handleSimpleTag  18        foo   ENDTAG=true
    handleStartTag   24        a     HREF=test
    handleText       37                            link
    handleEndTag     41        a
    handleEndTag     44        p
    handleEndTag     44        body
    handleEndTag     44        html

A couple of things are worth noting. Because the parser looks ahead to determine state, it is often possible to get error notification out of order. Notice that the HTML has A <foo>, but the parser reports that foo is invalid before it reports the text A. Also notice that the parser automatically generates start tags for the HEAD and BODY, even though they are not specified in the document. Callback methods can detect tags implied by the DTD by checking for the ParserCallback.IMPLIED key in the MutableAttributeSet.

Netscape Bookmarks File

In recent versions of Netscape Navigator, bookmarks are saved in a file format that closely resembles HTML. Here is a sample bookmarks.html file (bookmark.htm on Windows):

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an example file! -->
<TITLE>Bookmarks for TreeTableExample 3</TITLE>
<H1>Bookmarks for TreeTableExample 3</H1>

<DD>Toolbar Folder&lt;<BR>
&lt;A
<DL><p>
    <DT><H3 ADD_DATE="871524103">Games</H3>
    <DL><p>
        <DT><A HREF="http://www.activision.com" ADD_DATE="917293502"
           LAST_VISIT="920521850" 
           LAST_MODIFIED="920521850">Activision</A>
    </DL>
</DL>                    


Notice that each bookmark directory is represented with DL, and each bookmark entry is represented with DT.

To better illustrate these concepts, we will create a JTreeTable that displays bookmarks from a Navigator bookmarks file, similar to the Edit Bookmarks... feature of Navigator. As the format of the bookmarks file is almost valid HTML, we can use the Swing HTML parser with the default DTD. A custom implementation of ParserCallback (called Bookmarks) is used to create the internal objects used to represent the bookmarks.

Here is a snapshot of the GUI:

The name of the directory and the name of the bookmark entry are both represented as text in the file, so Bookmarks creates a new object to represent the directory or bookmark when handleText is invoked. An instance variable, state, indicates which object to create: a representation of a bookmark directory (BookmarkDirectory) or a representation of a bookmark entry (BookmarkEntry). state is set in the handleStartTag method based on the passed-in parameters. A DT followed by the anchor tag A identifies the accompanying text as a bookmark entry name. A DT followed by an H3 tag identifies the accompanying text as the name of a bookmark directory.

In addition to setting the state variable, handleStartTag extracts the ADD_DATE attribute for bookmark directories, and the HREF, ADD_DATE, LAST_VISIT, and LAST_MODIFIED attributes for bookmark entries. Here is the relevant code for handleStartTag:

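    // Fields shared with handleText, which creates the actual objects.
    private URL url;
    private Date createDate;
    private Date lastVisited;

    public void handleStartTag(HTML.Tag tag, 
                         MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.A && lastTag == HTML.Tag.DT) {
            try {
                url = new URL(
                  (String)attrSet.getAttribute(HTML.Attribute.HREF));
            } catch (MalformedURLException e) {
                url = null;  // keep the entry even if its URL is bad
            }
            // Attributes unknown to HTML.Attribute are keyed by
            // their lowercased names as Strings.
            createDate = convertNetscapeDateToDate
                          ((String)attrSet.getAttribute("add_date"));
            lastVisited = convertNetscapeDateToDate
                          ((String)attrSet.getAttribute("last_visit"));
            state = BOOKMARK_ENTRY;
        }
        else if (tag == HTML.Tag.H3 && lastTag == HTML.Tag.DT) {
            createDate = convertNetscapeDateToDate
                            ((String)attrSet.getAttribute("add_date"));
            state = BOOKMARK_DIRECTORY;
        }
        lastTag = tag;
    }
    // The file stores dates in seconds; Date expects milliseconds.
    private Date convertNetscapeDateToDate(String nsDate) {
        if (nsDate == null) {
            return null;  // the attribute was absent
        }
        return new Date(1000L * Long.parseLong(nsDate));
    }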
    


state is used to indicate what should happen when the next block of text is encountered and is used in handleText to create the correct representation of the data. Here is the relevant code for handleText:

    public void handleText(char[] data, int pos) {
        // state was set by handleStartTag for the enclosing tag.
        switch (state) {
        case BOOKMARK_ENTRY:
            createBookmark(new String(data), url, createDate, 
                                                     lastVisited);
            break;
        case BOOKMARK_DIRECTORY:
            createBookmarkDirectory(new String(data), createDate);
            break;
        default:
            break;
        }
        // Reset so unrelated text does not create another object.
        state = NO_ENTRY;
    }
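
The interplay between handleStartTag and handleText can be tried out without a bookmarks file. The following sketch is an assumption-laden miniature, not the article's Bookmarks class: BookmarkStateSketch, its created list, and demo are hypothetical, the list stands in for createBookmark and createBookmarkDirectory, and demo invokes the callback methods by hand in the order described above instead of running the parser.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical miniature of the state handling; not the real Bookmarks.
class BookmarkStateSketch extends HTMLEditorKit.ParserCallback {
    private static final int NO_ENTRY = 0;
    private static final int BOOKMARK_ENTRY = 1;
    private static final int BOOKMARK_DIRECTORY = 2;

    /** Stand-in for the objects the real callback would create. */
    final List<String> created = new ArrayList<>();

    private int state = NO_ENTRY;
    private HTML.Tag lastTag;
    private String href;
    private Date createDate;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.A && lastTag == HTML.Tag.DT) {
            href = (String) a.getAttribute(HTML.Attribute.HREF);
            createDate = toDate((String) a.getAttribute("add_date"));
            state = BOOKMARK_ENTRY;
        } else if (tag == HTML.Tag.H3 && lastTag == HTML.Tag.DT) {
            createDate = toDate((String) a.getAttribute("add_date"));
            state = BOOKMARK_DIRECTORY;
        }
        lastTag = tag;
    }

    @Override
    public void handleText(char[] data, int pos) {
        String name = new String(data);
        if (state == BOOKMARK_ENTRY) {
            created.add("entry " + name + " -> " + href);
        } else if (state == BOOKMARK_DIRECTORY) {
            created.add("directory " + name
                        + " (added " + createDate.getTime() + ")");
        }
        state = NO_ENTRY;  // text outside DT/A or DT/H3 is ignored
    }

    /** Converts seconds (as text) to a Date; null if the text is absent. */
    static Date toDate(String seconds) {
        return seconds == null
            ? null : new Date(1000L * Long.parseLong(seconds));
    }

    /** Drives the callback by hand: DT, H3, text, then DT, A, text. */
    static List<String> demo() {
        BookmarkStateSketch cb = new BookmarkStateSketch();
        SimpleAttributeSet dir = new SimpleAttributeSet();
        dir.addAttribute("add_date", "871524103");
        cb.handleStartTag(HTML.Tag.DT, new SimpleAttributeSet(), 0);
        cb.handleStartTag(HTML.Tag.H3, dir, 0);
        cb.handleText("Games".toCharArray(), 0);

        SimpleAttributeSet link = new SimpleAttributeSet();
        link.addAttribute(HTML.Attribute.HREF, "http://www.activision.com");
        link.addAttribute("add_date", "917293502");
        cb.handleStartTag(HTML.Tag.DT, new SimpleAttributeSet(), 0);
        cb.handleStartTag(HTML.Tag.A, link, 0);
        cb.handleText("Activision".toCharArray(), 0);
        cb.handleText("stray text".toCharArray(), 0);  // state is NO_ENTRY
        return cb.created;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```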
    


The last information we need to track is which directory new bookmark entries are added to. When a new bookmark directory is created via createBookmarkDirectory, it becomes the directory new entries are added to. Similarly, when an end DL tag is encountered, the directory new entries are added to should become the current directory's parent directory. handleEndTag is overridden to handle this case, and looks like:

    public void handleEndTag(HTML.Tag t, int pos) {
        // </DL> closes the current directory; move up one level.
        if (t == HTML.Tag.DL && parent != null) {
            parent = (BookmarkDirectory)parent.getParent();
        }
    }
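
The directory bookkeeping can be sketched on its own with DefaultMutableTreeNode standing in for BookmarkDirectory and BookmarkEntry. Everything named here (DirectoryTrackingSketch, current, demo) is hypothetical. Unlike the code above, this sketch never moves above its root node, which is a simplification made for the example.

```java
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical miniature of the parent tracking; not the real Bookmarks.
class DirectoryTrackingSketch extends HTMLEditorKit.ParserCallback {

    final DefaultMutableTreeNode root = new DefaultMutableTreeNode("root");

    /** The directory that new entries are currently added to. */
    private DefaultMutableTreeNode parent = root;

    /** A new directory becomes the target for subsequent entries. */
    void createBookmarkDirectory(String name) {
        DefaultMutableTreeNode dir = new DefaultMutableTreeNode(name);
        parent.add(dir);
        parent = dir;
    }

    /** A new entry is added to whichever directory is current. */
    void createBookmark(String name) {
        parent.add(new DefaultMutableTreeNode(name));
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // An end DL closes the current directory. This sketch stops at
        // the root instead of letting parent become null.
        if (t == HTML.Tag.DL && parent.getParent() != null) {
            parent = (DefaultMutableTreeNode) parent.getParent();
        }
    }

    DefaultMutableTreeNode current() {
        return parent;
    }

    /** Builds root -> [Games -> [Activision], Top-level link] by hand. */
    static DirectoryTrackingSketch demo() {
        DirectoryTrackingSketch s = new DirectoryTrackingSketch();
        s.createBookmarkDirectory("Games");
        s.createBookmark("Activision");
        s.handleEndTag(HTML.Tag.DL, 0);   // leaves the Games directory
        s.createBookmark("Top-level link");
        return s;
    }

    public static void main(String[] args) {
        DirectoryTrackingSketch s = demo();
        System.out.println(s.root.getChildCount() + " children under root");
    }
}
```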
    


Editable JTreeTable

Previous articles on JTreeTable (Creating TreeTables in Swing and Creating TreeTables: Part 2) have not touched on how to make the JTree column editable. We have received numerous requests asking how to do this, so, for this example, the JTree column of the JTreeTable is editable. There are different approaches that can be taken to make the JTree column editable, and we describe the most straightforward approach here.

The obvious approach is to make the JTree itself editable. This does not work. One subtle point of renderers and editors is that the same Component cannot be both the renderer and the editor. Remember that the renderer is used as a rubber stamp, that is, the renderer Component is continually added and removed from the containment hierarchy and asked to paint at each step, similar to rubber stamping a document. On the other hand, the editor Component is much longer lived. The editor Component exists in the JTable or JTree as long as the JTable or JTree is editable. The two problems with using the same Component for both the renderer and editor then become:

  • The renderer could be asked to paint when editing, resulting in the editor's current value getting lost.
  • After rendering a value, the editor may no longer be in the containment hierarchy, making subsequent editing fail.

To give the illusion the JTree is editable, a custom TableCellEditor is used by the JTable for the JTree column. In this way, the JTree is never really editable, rather, the JTable is responsible for editing the JTree column. The only other trickery involved is that JTree editors do not usually take up all the horizontal space allocated to them while JTable editors do. JTree editors instead are horizontally positioned based on the depth of the current node. (This behavior actually depends on the current look and feel, but all the look and feels currently defined for JTree base the horizontal position on the depth.) To solve this problem, the custom JTextField used for the actual editing Component locks its horizontal location based on the current node's depth. To do this, reshape is overridden to look like this:

    public void reshape(int x, int y, int w, int h) {
        int newX = Math.max(x, offset);
        super.reshape(newX, y, w - (newX - x), h);
    }
    

In the above code, offset is set before the editor Component is returned from the editor. That is, offset is computed in the getTableCellEditorComponent method of JTreeTable's custom editor by looking at JTree's getRowBounds and the width of DefaultTreeCellRenderer's icon.
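
To see the effect of the override in isolation, the sketch below wraps it in a small JTextField subclass. OffsetTextField and lockLeftEdge are hypothetical names for this illustration, and the offset of 36 pixels used in main is an arbitrary value rather than one computed from a JTree.

```java
import java.awt.Rectangle;
import javax.swing.JTextField;

// Hypothetical stand-in for the custom editing field; for illustration.
class OffsetTextField extends JTextField {

    /** Left edge, in pixels, that the field must not cross. */
    int offset;

    /** Returns the bounds the field takes for a requested rectangle. */
    static Rectangle lockLeftEdge(int x, int y, int w, int h, int offset) {
        int newX = Math.max(x, offset);
        // Shrink the width by however far the left edge was pushed right.
        return new Rectangle(newX, y, w - (newX - x), h);
    }

    @Override
    @SuppressWarnings("deprecation")  // setBounds is routed through reshape
    public void reshape(int x, int y, int w, int h) {
        Rectangle r = lockLeftEdge(x, y, w, h, offset);
        super.reshape(r.x, r.y, r.width, r.height);
    }

    public static void main(String[] args) {
        OffsetTextField field = new OffsetTextField();
        field.offset = 36;  // arbitrary value for the demonstration
        field.setBounds(0, 0, 200, 20);
        System.out.println(field.getBounds());
    }
}
```

Overriding reshape rather than setBounds follows the article's code: the other bounds-setting methods on Component end up calling reshape, so a single override covers them.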

The Source

The following files are new or have changed since the last JTreeTable article:

  • Bookmarks.java - Responsible for parsing the Netscape bookmarks file.
  • BookmarksModel.java - An implementation of TreeTableModel based on the values from an instance of Bookmarks.
  • DynamicTreeTableModel.java - An implementation of TreeTableModel that uses reflection to look up values.
  • JTreeTable.java - An implementation of JTable with one column containing a JTree. The new feature is editing support for the JTree column.
  • TreeTableExample3.java - Builds the GUI containing all the necessary components.

 

The following files have not changed since the last JTreeTable article:

 

A sample Navigator bookmarks file can be found here: bookmarks.html.

All of the sources can be downloaded at once from the zip file bookmarks.zip.

The main method is in TreeTableExample3.java. By default, TreeTableExample3 looks for the file bookmarks.html in the ~/.netscape directory. If this file cannot be found, the bookmarks.html file in the current directory is used. Or you can specify an alternate file at the command line:

    % java TreeTableExample3 myBookmarksFile.html
    

 

Conclusion

We have only lightly touched on the capabilities of the HTML parsing support in Swing. Future articles will more fully explore this powerful and flexible feature.

 
