David Wright @ NC State

Master's Research

My Master's thesis research was my first in-depth introduction to academic research, and was itself a very challenging voyage. In fact, the first topic my advisor, Dr. Purush Iyer, and I chose to work on turned out to be a much larger problem than either of us anticipated. As a result, I had to backtrack and identify a more specific problem that would leverage, as much as possible, the literature research I had already conducted while being an appropriate problem for a Master's thesis.

The first problem involved inferring the meaning or semantics of parts of a web document from the structure of the document itself. On the surface, this might seem simple: the document title provides a name for the page, h1 tags identify major divisions or sections of the page, h2 tags subdivide top-level sections, etc. To be sure, there were web developers in 2001 - 2002 who constructed their pages this way. However, the popular web browsers of the day, Internet Explorer 3 & 4 and Netscape Navigator 3, 4, & 4.5, had limited and inconsistent implementations of the existing standards for HTML documents, implemented vendor-specific HTML tags, and allowed ill-formed HTML to be parsed and rendered. The problem of inferring semantics from the document structure was compounded by the widespread use of HTML tables for layout control, which often resulted in table elements nested 4 or 5 levels deep and completely obscured the actual structure of the document.

Several solutions had been defined to augment web pages with semantic information. All relied on additional markup to carry the extra information. The markup commonly referenced some globally-accessible repository or ontology of definitions and relationships. The W3C was also in the process of developing the "Semantic Web," a framework for describing entities on the web, what they mean, and how they relate to the real world, independent of any particular application or service. Together, the scale of the problem domain, the coverage of proposed solutions, and the difficulty of consistently extracting meaning from web documents made it clear that another research direction should be pursued.

The second research problem was a derivative of what I had learned about browser compatibility issues, web document structure, and scripting languages: given a particular HTML document, identify the set of currently-available web browsers that will correctly render that document. We were particularly interested in documents with dynamically-generated content, as this was the most significant area of incompatibility between the two most popular browser families, Netscape Navigator (NN) and Microsoft Internet Explorer (IE). In fact, there were serious compatibility issues even between different versions of each browser.

This research problem was more specific than the first problem I studied, but was still difficult. Different browsers used different HTML parsers and recognized different sets of HTML tags. For example, Netscape defined the layer tag to enable dynamic content generation, while Microsoft chose to develop a formal document object model (DOM) that could be directly addressed by internal or external executable code. These DOM implementations differed between versions of IE. Netscape implemented their own DOM in NN 4.0 and later, but this model was not compatible with Microsoft's implementation. Finally, the two browser families supported different scripting languages: NN implemented JavaScript (or ECMAScript), while IE implemented Microsoft's own VBScript in addition to a JavaScript implementation.
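To make these differences concrete, here is a minimal sketch of how two of the markers named above (the layer tag and VBScript) could be used to narrow down the browser families able to handle a document. It is an illustration only, not the thesis code: the family labels "NN" and "IE", the RULES table, and the compatible_families function are hypothetical names, and the regular expressions are a crude stand-in for a real HTML parser.

```python
import re

# Hypothetical labels for the two browser families discussed in the text.
ALL_FAMILIES = frozenset({"NN", "IE"})

# Each rule pairs a pattern with the families assumed to support the
# feature it detects. Only two markers from the text are covered here.
RULES = [
    # The layer tag was a Netscape-specific extension.
    (re.compile(r"<\s*layer\b", re.IGNORECASE), frozenset({"NN"})),
    # VBScript was supported only by Internet Explorer.
    (re.compile(r"<\s*script\b[^>]*\blanguage\s*=\s*[\"']?vbscript",
                re.IGNORECASE), frozenset({"IE"})),
]


def compatible_families(html: str) -> frozenset:
    """Return the families not ruled out by any marker found in `html`."""
    candidates = ALL_FAMILIES
    for pattern, families in RULES:
        if pattern.search(html):
            # A family-specific feature removes every other family.
            candidates = candidates & families
    return candidates


print(sorted(compatible_families("<html><body><p>hi</p></body></html>")))
print(sorted(compatible_families('<LAYER id="menu">item</LAYER>')))
```

A document that uses both markers ends up with an empty candidate set, meaning that under these rules no single family would handle it as written.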

For my thesis I developed a framework of inference rules and implemented them along with stripped-down HTML and ECMAScript parsers. The rules relied on the differences between the HTML and script-language implementations in the web browsers, and were reasonably accurate at identifying browsers that would "correctly" render a web document. One lingering problem that was not solved stemmed from the ability to add properties and methods to existing DOM objects. For example, one way to recognize documents designed for IE was the use of the all property of the document object. The document object represented the entire web page, and the all property was an array of all elements (e.g., tags and tag content) making up the document; it was implemented only in the IE browser family. Netscape implemented the document object, but used a tree-structured DOM, so to access inner elements one had to traverse down through the tree. It was possible, however, to add a property to the Netscape DOM at runtime (i.e., during page loading), and a not-uncommon way of simplifying dynamic content generation was to add an Array-type property named "all" to the NN document object and then populate it with a flattened version of the DOM tree. A document that did this used the all property without actually depending on IE, which defeated the rule.
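The flattening step described above can be sketched as a pre-order walk over a tree. This is a Python illustration of the idea, not the browser script of the time and not the thesis code: the Node class and the flatten function are hypothetical, and the sketch assumes the "all" array lists elements in document order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal stand-in for an element in a tree-structured DOM."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> List[Node]:
    """Return every node below `node` in document (pre-order) order."""
    result = []
    for child in node.children:
        result.append(child)
        result.extend(flatten(child))
    return result


# A tiny document tree; the root plays the role of the document object.
doc = Node("document", [
    Node("html", [
        Node("head", [Node("title")]),
        Node("body", [Node("h1"), Node("p")]),
    ]),
])

# The flat list plays the role of the "all" array added at runtime.
all_elements = flatten(doc)
print([n.tag for n in all_elements])
# ['html', 'head', 'title', 'body', 'h1', 'p']
```

Once such a list is attached to the document object, scripts can index into it directly instead of walking the tree, which is why a rule keyed only on the presence of the all property could misclassify the page.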

While browser-compatibility problems still exist, they are not as significant now as they were just six years ago. The XML-based standardization of HTML, along with the standardization of the Document Object Model and the adoption of these standards (to varying degrees) by Microsoft, Netscape/Mozilla, and other browser developers, has resulted in a more consistent web experience. The development and adoption of Cascading Style Sheets (CSS) has also helped to separate web content from its presentation. Together, these developments, along with better web-development tools, have significantly improved the consistency of web page rendering across browsers and platforms.