The Swing HTML Parser
Parsing a Netscape Navigator Bookmarks File
By Scott Violet
The high-level Swing component, JEditorPane
, is responsible for displaying, among other things, HTML text. However, this article shows how you can use the HTML parser outside of JEditorPane
. An example provided shows how to use the standard HTML parser (also shipped with HotJava) to parse the bookmarks file created by Netscape Navigator. Previous Swing Connection articles have featured the custom component, JTreeTable
. This article also demonstrates an enhanced editable JTreeTable
.
The parser provided by Swing is DTD driven, and therefore is capable of parsing much more than HTML. This article uses the parser with its standard DTD for parsing HTML documents. Future articles will address how to configure the parser with a custom DTD, and will discuss the binary DTD format used by the parser.
HTMLEditorKit.ParserCallback
The main entry point into the HTML parser is the class ParserDelegator
. ParserDelegator
parses an HTML document passed in as a Reader
and notifies the passed-in ParserCallback
object as to the state of the parsing. ParserCallback
implements the following methods:
public void flush() throws BadLocationException
public void handleText(char[] data, int pos)
public void handleComment(char[] data, int pos)
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
public void handleEndTag(HTML.Tag t, int pos)
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
public void handleError(String errorMsg, int pos)
public void handleEndOfLineString(String eol)
Here is a simple example of creating your own ParserCallback
subclass to output all the text from an HTML document:
HTMLEditorKit.ParserCallback callback =
    new HTMLEditorKit.ParserCallback() {
        public void handleText(char[] data, int pos) {
            System.out.println(data);
        }
    };
Reader reader = new FileReader("myFile.html");
new ParserDelegator().parse(reader, callback, false);
When the parser encounters a tag, it invokes either handleStartTag
or handleSimpleTag
, based on the tag. The method parameters specify the tag, any attributes on the tag, and the position in the reader where the element was encountered.
handleSimpleTag
is invoked for empty tags. Empty tags are tags that are defined not to have an end tag, and can thus have no content or child tags. BR
and IMG
are examples of empty tags, whereas P
is not an empty tag. (While the end tag for P
is optional, P
is not an empty tag.) The set of empty tags currently supported by the Swing HTML DTD is:
BASEFONT
BR
AREA
LINK
IMG
PARAM
HR
INPUT
ISINDEX
BASE
META
FRAME
The handleSimpleTag
method is also invoked for tags not defined in the DTD. For example, <foo>
is not a valid HTML tag, and thus handleSimpleTag
is invoked when the tag foo
is encountered. On the other hand, handleStartTag
is invoked for the valid non-empty tags -- the normal tags that are defined in the DTD.
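The distinction between the two callbacks can be observed with a small logging callback. The following is a minimal sketch, not code from the article's sources: the class name TagKindLogger and its events method are hypothetical, and the input is read from a StringReader rather than a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records which callback the parser chose for each tag.
class TagKindLogger {
    /** Parses the given HTML and returns one log entry per tag callback. */
    static List<String> events(String html) {
        final List<String> log = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                log.add("start:" + t);   // non-empty tags defined in the DTD
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                log.add("simple:" + t);  // empty tags and tags unknown to the DTD
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                log.add("end:" + t);
            }
        };
        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        // P is a normal tag, BR is an empty tag, and foo is not in the DTD.
        events("<p>Hello<br>world <foo>x</foo></p>").forEach(System.out::println);
    }
}
```

Running main prints one line per callback, which makes it easy to see which tags arrive through handleSimpleTag and which through handleStartTag.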
Both handleStartTag
and handleSimpleTag
are passed a MutableAttributeSet
containing the attributes of the tag. The MutableAttributeSet
argument is reused by the caller. If you need to keep a reference to the AttributeSet
, you must make a copy, perhaps using AttributeSet.copyAttributes
. If an attribute is defined, the MutableAttributeSet
key is an instance of HTML.Attribute
, otherwise (with a few exceptions) it is a String
containing the name of the attribute. For normal attributes, the attribute values in the AttributeSet
are String
s. Two special keys and one value worth noting are:
ParserCallback.IMPLIED
Indicates the DTD implied a particular tag, but it was not present in the content. For example, <html><body><table><td> is not legal HTML, as TR is missing. The parser generates the TR, noting that the TR was implied by adding ParserCallback.IMPLIED as a key in the AttributeSet passed into handleStartTag.
HTML.Attribute.ENDTAG
Indicates the end of an element not defined in the DTD was encountered. Remember that handleSimpleTag is invoked for elements not defined in the DTD. handleSimpleTag is also invoked for the end of elements not defined in the DTD (such as <foo>). The callback method can check for this by checking for the key HTML.Attribute.ENDTAG in the passed-in AttributeSet.
HTML.NULL_ATTRIBUTE_VALUE
Indicates an attribute of an element did not have an explicit value, and the DTD did not have a default value. For example, <tr rowspan width=10% foo> illustrates the three possible types of attribute values. The width attribute has an explicit value of 10%. The rowspan attribute has an implicit value of 1 (implicit values are defined in the DTD; the attribute rowspan of a TR element has a default value of 1). The foo attribute has no default value, which is indicated with NULL_ATTRIBUTE_VALUE. The callback method can identify attributes that don't have a defined value by checking for the value HTML.NULL_ATTRIBUTE_VALUE in the AttributeSet as the value for the attribute name.
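The points above about copying the attribute set, the two kinds of keys, and the ENDTAG marker can be sketched in one small callback. This is an illustration under stated assumptions rather than code from the article's sources: the class AttributeInspector, its fields, and the add_date attribute in the sample input are chosen here for demonstration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps copies of anchor attributes and notes unknown end tags.
class AttributeInspector {
    final List<AttributeSet> anchors = new ArrayList<>();
    final List<String> unknownEnds = new ArrayList<>();

    static AttributeInspector inspect(String html) {
        final AttributeInspector result = new AttributeInspector();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    // The parser reuses 'a', so store a copy, not the reference.
                    result.anchors.add(a.copyAttributes());
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // ENDTAG marks the closing tag of an element the DTD does not define.
                if (a.isDefined(HTML.Attribute.ENDTAG)) {
                    result.unknownEnds.add(t.toString());
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        AttributeInspector r =
            inspect("<p><a href=test add_date=\"42\">link</a> <foo>x</foo></p>");
        AttributeSet a = r.anchors.get(0);
        // href is defined for A, so its key is an HTML.Attribute constant.
        System.out.println(a.getAttribute(HTML.Attribute.HREF));
        // add_date is not defined for A, so its key is the plain String name.
        System.out.println(a.getAttribute("add_date"));
        System.out.println(r.unknownEnds);
    }
}
```

Storing the copy rather than the original set matters only when the attributes are read after the callback returns, which is the case here.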
handleEndTag
is invoked for closing tags that are known to the DTD, such as </html>
.
handleText
, as the name implies, is invoked when any content is encountered in the document. The text and location (as an integer into the document) are passed in. Each occurrence of white space (any newlines, tabs, carriage returns, or multiple spaces) is coalesced into a single space character.
Any errors encountered are reported via the handleError
method. The default implementation of the callback method ignores any errors, as many pages on the web do not contain valid HTML.
flush
is actually not invoked by the parser, but by HTMLEditorKit
, to indicate that parsing has successfully finished.
Since white space is stripped when parsing, handleEndOfLineString
is invoked after parsing with the best guess for the end of line string. The end of line string will be \n, \r, or \r\n, whichever is encountered the most in the document.
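A callback that only cares about the line terminator can override handleEndOfLineString alone. The sketch below is an illustration with a hypothetical class name (EolProbe); it simply captures whatever string the parser reports and makes no claim about which terminator wins for a given input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: captures the parser's best guess for the line terminator.
class EolProbe {
    static String detect(String html) {
        final String[] eol = new String[1];
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleEndOfLineString(String s) {
                eol[0] = s;  // reported once, after the document has been parsed
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return eol[0];
    }

    public static void main(String[] args) {
        String eol = detect("<p>one\r\ntwo\r\nthree</p>");
        // Show the terminator in escaped form so it is visible on the console.
        System.out.println(eol.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

A writer that later saves the document could use the captured string to restore the original line endings.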
Sample Document
Let's take a look at what happens when a small HTML file is parsed. Consider the following HTML:
<html><p>A <foo>xx</foo><a href=test>link</a>
The following shows the sequence of invocations on the callback as well as some of the values:
Method | Position | Tag | Attributes | Text |
handleStartTag | 0 | html | | |
handleStartTag | 6 | head | IMPLIED=true | |
handleEndTag | 6 | head | | |
handleStartTag | 6 | body | IMPLIED=true | |
handleStartTag | 6 | p | | |
handleError | 16 | | | tag.unrecognized foo?? |
handleText | 9 | | | A |
handleSimpleTag | 11 | foo | | |
handleError | 25 | | | end.unrecognized foo?? |
handleText | 16 | | | xx |
handleSimpleTag | 18 | foo | ENDTAG=true | |
handleStartTag | 24 | a | HREF=test | |
handleText | 37 | | | link |
handleEndTag | 41 | a | | |
handleEndTag | 44 | p | | |
handleEndTag | 44 | body | | |
handleEndTag | 44 | html | | |
A couple of things are worth noting. Because the parser looks ahead to determine state, it is often possible to get error notification out of order. Notice that the HTML has A <foo>
, but the parser reports that foo
is invalid before it reports the text A
. Also notice that the parser automatically generates start tags for the HEAD
and BODY
, even though they are not specified in the document. Callback methods can detect tags implied by the DTD by checking for the ParserCallback.IMPLIED
key in the MutableAttributeSet
.
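Detecting implied tags takes only a single check in handleStartTag. The following sketch, with the hypothetical class name ImpliedTagFinder, collects the names of all start tags that carry the IMPLIED key; its main method feeds it the sample document from this section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists the start tags the parser generated on its own.
class ImpliedTagFinder {
    static List<String> impliedTags(String html) {
        final List<String> implied = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // Tags implied by the DTD carry the IMPLIED key in their attributes.
                if (a.isDefined(HTMLEditorKit.ParserCallback.IMPLIED)) {
                    implied.add(t.toString());
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return implied;
    }

    public static void main(String[] args) {
        // The sample document from this section: HEAD and BODY are not written out.
        System.out.println(impliedTags("<html><p>A <foo>xx</foo><a href=test>link</a>"));
    }
}
```

A callback that wants to mirror only what the author actually typed could use the same check to skip implied tags.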
Netscape Bookmarks File
In recent versions of Netscape Navigator, bookmarks are saved in a file format that closely resembles HTML. Here is a sample bookmarks.html
file (bookmark.htm
on Windows):
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an example file! -->
<TITLE>Bookmarks for TreeTableExample 3</TITLE>
<H1>Bookmarks for TreeTableExample 3</H1>
<DD>Toolbar Folder<<BR> <A
<DL><p>
    <DT><H3 ADD_DATE="871524103">Games</H3>
    <DL><p>
        <DT><A HREF="http://www.activision.com" ADD_DATE="917293502" LAST_VISIT="920521850" LAST_MODIFIED="920521850">Activision</A>
    </DL>
</DL>
Notice that each bookmark directory is represented with DL
, and each bookmark entry is represented with DT
.
To better illustrate these concepts, we will create a JTreeTable
that displays bookmarks from a Navigator bookmarks file, similar to the Edit Bookmarks...
feature of Navigator. As the format of the bookmarks file is almost valid HTML, we can use the Swing HTML parser with the default DTD. A custom implementation of ParserCallback
(called Bookmarks
) is used to create the internal objects used to represent the bookmarks.
Here is a snapshot of the GUI:
The name of the directory and the name of the bookmark entry are both represented as text in the file. As such, when handleText
is invoked on Bookmarks
, a new object is created to represent the directory or bookmark. Bookmarks
sets an instance variable to indicate which object should be created when the handleText
method is invoked: either a representation of a bookmark directory (BookmarkDirectory
) or a representation of a bookmark entry (BookmarkEntry
). This instance variable, state
, is set in the handleStartTag
method based on the passed-in parameters. A DT
followed by the anchor tag A
identifies the accompanying text as a bookmark entry name. A DT
followed by an H3
tag identifies the accompanying text as the name of a bookmark directory. In addition to setting the state
variable, handleStartTag
extracts the ADD_DATE
attribute for bookmark directories, and the HREF
, ADD_DATE
, LAST_VISIT
, and LAST_MODIFIED
attributes for bookmark entries. Here is the relevant code for handleStartTag
:
// url, createDate, and lastVisited are instance fields so that handleText can read them.
public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
    if (tag == HTML.Tag.A && lastTag == HTML.Tag.DT) {
        // Attributes not defined in the DTD, such as add_date, are keyed by String name.
        try {
            url = new URL((String)attrSet.getAttribute(HTML.Attribute.HREF));
        } catch (MalformedURLException e) {
            url = null;
        }
        createDate = convertNetscapeDateToDate((String)attrSet.getAttribute("add_date"));
        lastVisited = convertNetscapeDateToDate((String)attrSet.getAttribute("last_visit"));
        state = BOOKMARK_ENTRY;
    } else if (tag == HTML.Tag.H3 && lastTag == HTML.Tag.DT) {
        createDate = convertNetscapeDateToDate((String)attrSet.getAttribute("add_date"));
        state = BOOKMARK_DIRECTORY;
    }
    lastTag = tag;
}

// The file stores dates in seconds; Date expects milliseconds.
private Date convertNetscapeDateToDate(String nsDate) {
    return new Date(1000L * Long.parseLong(nsDate));
}
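The date conversion shown above throws an exception if an attribute is missing or is not a number, which can happen with hand-edited bookmark files. A more defensive variant might look like the sketch below. The class name NetscapeDates and the choice to return null for unusable input are assumptions made for illustration, not part of the article's sources.

```java
import java.util.Date;

// Hypothetical helper: tolerant conversion of a Netscape date attribute.
class NetscapeDates {
    /**
     * Converts a date stored as seconds (in decimal text) into a Date.
     * Returns null when the attribute is absent or is not a valid number.
     */
    static Date toDate(String nsDate) {
        if (nsDate == null) {
            return null;  // attribute was not present on the tag
        }
        try {
            // Date works in milliseconds, the file stores seconds.
            return new Date(1000L * Long.parseLong(nsDate.trim()));
        } catch (NumberFormatException e) {
            return null;  // malformed value
        }
    }

    public static void main(String[] args) {
        System.out.println(toDate("871524103"));  // ADD_DATE from the sample file
        System.out.println(toDate(null));
    }
}
```

Callers then need to treat a null date as "unknown" when displaying the bookmark.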
state
is used to indicate what should happen when the next block of text is encountered and is used in handleText
to create the correct representation of the data. Here is the relevant code for handleText
:
public void handleText(char[] data, int pos) {
    switch (state) {
    case BOOKMARK_ENTRY:
        createBookmark(new String(data), url, createDate, lastVisited);
        break;
    case BOOKMARK_DIRECTORY:
        createBookmarkDirectory(new String(data), createDate);
        break;
    default:
        break;
    }
    state = NO_ENTRY;
}
The last information we need to track is which directory new bookmark entries are added to. When a new bookmark directory is created via createBookmarkDirectory
, it becomes the directory new entries are added to. Similarly, when an end DL
tag is encountered, the directory new entries are added to should become the current directory's parent directory. handleEndTag
is overridden to handle this case, and looks like:
public void handleEndTag(HTML.Tag t, int pos) {
    if (t == HTML.Tag.DL && parent != null) {
        parent = (BookmarkDirectory)parent.getParent();
    }
}
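The bookkeeping for the current directory can be studied in isolation from the parser. The sketch below uses a hypothetical DirectoryTracker class with a simplified Dir node in place of BookmarkDirectory. The three methods stand in for what createBookmarkDirectory, createBookmark, and the end DL handling do to the current directory. Unlike the code above, this sketch never moves above the root.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the directory tracking done by the Bookmarks callback.
class DirectoryTracker {
    /** Simplified directory node; BookmarkDirectory in the real sources is richer. */
    static class Dir {
        final String name;
        final Dir parent;
        final List<String> entries = new ArrayList<>();
        final List<Dir> children = new ArrayList<>();

        Dir(String name, Dir parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }
    }

    final Dir root = new Dir("root", null);
    private Dir current = root;

    /** A new directory becomes the target for subsequent entries. */
    void directoryCreated(String name) {
        current = new Dir(name, current);
    }

    /** Entries are always added to the current directory. */
    void entryCreated(String name) {
        current.entries.add(name);
    }

    /** An end DL tag closes the current directory and returns to its parent. */
    void endDL() {
        if (current.parent != null) {
            current = current.parent;
        }
    }

    Dir current() {
        return current;
    }

    public static void main(String[] args) {
        DirectoryTracker t = new DirectoryTracker();
        t.directoryCreated("Games");
        t.entryCreated("Activision");
        t.endDL();
        System.out.println(t.current().name);
    }
}
```

Keeping a parent reference on each node avoids a separate stack, at the cost of one extra field per directory.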
Editable JTreeTable
Previous articles on JTreeTable
(Creating TreeTables in Swing and Creating TreeTables: Part 2) have not touched on how to make the JTree
column editable. We have received numerous requests asking how to do this, so, for this example, the JTree
column of the JTreeTable
is editable. There are different approaches that can be taken to make the JTree
column editable, and we describe the most straightforward approach here.
The obvious approach is to make the JTree
itself editable. This does not work. One subtle point of renderers and editors is that the same Component
cannot be both the renderer and the editor. Remember that the renderer is used as a rubber stamp, that is, the renderer Component
is continually added and removed from the containment hierarchy and asked to paint at each step, similar to rubber stamping a document. On the other hand, the editor Component
is much longer lived. The editor Component
exists in the JTable
or JTree
as long as the JTable
or JTree
is editable. The two problems with using the same Component
for both the renderer and editor then become:
- The renderer could be asked to paint when editing, resulting in the editor's current value getting lost.
- After rendering a value, the editor may no longer be in the containment hierarchy, making subsequent editing fail.
To give the illusion the JTree
is editable, a custom TableCellEditor
is used by the JTable
for the JTree
column. In this way, the JTree
is never really editable, rather, the JTable
is responsible for editing the JTree
column. The only other trickery involved is that JTree
editors do not usually take up all the horizontal space allocated to them while JTable
editors do. JTree
editors instead are horizontally positioned based on the depth of the current node. (This behavior actually depends on the current look and feel, but all the look and feels currently defined for JTree
base the horizontal position on the depth.) To solve this problem, the custom JTextField
used for the actual editing Component
locks its horizontal location based on the current node's depth. To do this, reshape
is overridden to look like this:
public void reshape(int x, int y, int w, int h) {
    int newX = Math.max(x, offset);
    super.reshape(newX, y, w - (newX - x), h);
}
In the above code, offset
is set before the editor Component
is returned from the editor. That is, offset
is computed in JTreeTable
's getTableCellEditorComponent
method by looking at JTree
's getRowBounds
and the width of DefaultTreeCellRenderer
's icon.
The Source
The following files are new or have changed since the last JTreeTable
article:
- Bookmarks.java - Responsible for parsing the Netscape bookmarks file.
- BookmarksModel.java - An implementation of TreeTableModel based on the values from an instance of Bookmarks.
- DynamicTreeTableModel.java - An implementation of TreeTableModel that uses reflection to look up values.
- JTreeTable.java - An implementation of JTable with one column containing a JTree. The new feature is editing support for the JTree column.
- TreeTableExample3.java - Builds the GUI containing all the necessary components.
The following files have not changed since the last JTreeTable
article:
- AbstractTreeTableModel.java - Provides convenience methods in implementing a TreeTableModel.
- TreeTableModel.java - Model used by a JTreeTable.
- TreeTableModelAdapter.java - An implementation of the TableModel given a TreeTableModel.
A sample Navigator bookmarks file can be found here: bookmarks.html.
All of the sources can be downloaded at once from the zip file bookmarks.zip.
The main
method is in TreeTableExample3.java
. By default, TreeTableExample3
looks for the file bookmarks.html
in the ~/.netscape
directory. If this file cannot be found, the bookmarks.html
file in the current directory is used. Or you can specify an alternate file at the command line:
% java TreeTableExample3 myBookmarksFile.html
Conclusion
We have only lightly touched on the capabilities of the HTML parsing support in Swing. Future articles will more fully explore this powerful and flexible feature.