"The Design and Delivery of Semi-Structured Data."
Speaker: Rich Morin
Date: Thursday, March 10, 2005
Time: 3:30 - 5:00 PM
Location: Orange Conference Room, Bldg 40
Goodies: Tea and cookies provided
Speaker's Summary:
In structured data, the syntax, structure, and (to some degree) semantics of the data are defined in advance. In "semi-structured" data, the structure can be defined, dynamically, by the data itself.
In RDBMS and XML applications, the syntax is defined by various standards. The specific structure is typically encoded in some form of "schema". The relevant standards and schemas provide enough information to parse and navigate the data structure.
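As a rough illustration of the distinction drawn above (not an example taken from the talk), the Python sketch below reads a few records whose fields differ from one record to the next and recovers their structure from the data alone, with no schema supplied in advance. The records, the field names, and the infer_structure helper are all hypothetical.

```python
import json

# Hypothetical records: each one carries its own field names, and the
# set of fields differs from record to record.
RECORDS = json.loads("""
[
  {"name": "parse_packet", "kind": "function", "file": "packet.c"},
  {"name": "REQ-12", "kind": "requirement", "text": "Log all packets"},
  {"name": "telemetry", "kind": "task", "calls": ["parse_packet"]}
]
""")


def infer_structure(records):
    """Map each field name to the set of value type names seen for it."""
    structure = {}
    for record in records:
        for field, value in record.items():
            structure.setdefault(field, set()).add(type(value).__name__)
    return structure


if __name__ == "__main__":
    # Print the structure that the data itself reveals.
    for field, types in sorted(infer_structure(RECORDS).items()):
        print(field, sorted(types))
```

A schema-driven application would reject or ignore the fields it did not expect; here the set of fields is simply whatever the records turn out to contain.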
HTML web pages are structurally similar to XML documents (though the syntax rules are much looser); both can be described as "trees" with attributed nodes. The interesting thing about web pages, however, is the graph (i.e., the data structure of nodes and connecting edges) that is created by the links.
Although any given link may be outdated, irrelevant, or simply wrong, a great deal of information is stored in the links. We rely on this information as we navigate the web; search engines such as Google analyze this information to build up their indexes.
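The link graph described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how any search engine works: the three pages, their file names, and the helper names are hypothetical, and counting inbound links is only a crude stand-in for real link analysis. It uses the standard library's html.parser module.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages, keyed by file name. "gone.html" is linked to
# but does not exist, standing in for an outdated link.
PAGES = {
    "index.html": '<p><a href="tasks.html">Tasks</a> <a href="code.html">Code</a></p>',
    "tasks.html": '<p><a href="code.html">Code</a></p>',
    "code.html": '<p><a href="index.html">Home</a> <a href="gone.html">Old</a></p>',
}


def build_link_graph(pages):
    """Return a dict mapping each page name to the list of its link targets."""
    graph = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        collector.close()
        graph[name] = collector.links
    return graph


def inbound_counts(graph):
    """Count how many links point at each target."""
    counts = {}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


if __name__ == "__main__":
    graph = build_link_graph(PAGES)
    print(graph)
    print(inbound_counts(graph))
```

Each page on its own is a tree; the graph only appears once the links between pages are collected. The dangling link to gone.html shows how noisy input still ends up in the graph and has to be tolerated by whatever analysis follows.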
Similarly, packages such as Doxygen analyze various relationships that are present in collections of source (and sometimes binary) code. By doing so, they can generate web pages with various "views" of the code, providing contextual diagrams, cross-page links, indexes, etc. Again, the input data may be noisy, but the results can be very useful.
Rich Morin created a suite of Perl scripts that generate documentation for the Flight Software of the Large Area Telescope (LAT), an instrument on the Gamma-ray Large Area Space Telescope (GLAST). Each night, these scripts traverse the development file trees, query databases, etc. The result is a set of cross-linked web pages covering requirements, tasks, code, packets, and more.
Rich will present a tutorial on semi-structured data, using examples from his own and other web pages. He will also discuss graphical data structures and ways of encoding them.
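For readers unfamiliar with the idea of encoding a graph, here is a minimal Python sketch of two common encodings: a labeled edge list converted to an adjacency mapping, and the same edges written out as text in the Graphviz DOT language. The node names, edge labels, and function names are hypothetical and are not drawn from the LAT documentation or from the talk.

```python
# Hypothetical labeled edges: (source, target, relationship).
EDGES = [
    ("REQ-12", "telemetry", "implemented_by"),
    ("telemetry", "parse_packet", "calls"),
    ("parse_packet", "packet.c", "defined_in"),
]


def to_adjacency(edges):
    """Convert an edge list to a dict of node -> [(target, label), ...]."""
    adjacency = {}
    for source, target, label in edges:
        adjacency.setdefault(source, []).append((target, label))
        # Make sure nodes with no outgoing edges still appear.
        adjacency.setdefault(target, [])
    return adjacency


def to_dot(edges, name="docs"):
    """Render the edge list as a directed graph in DOT syntax."""
    lines = [f"digraph {name} {{"]
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_adjacency(EDGES))
    print(to_dot(EDGES))
```

The edge list is easy to harvest and store one row at a time, the adjacency mapping is convenient for navigation, and the DOT text can be handed to a drawing tool; all three carry the same information.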
About the Speaker:
Rich Morin has been programming for 35 years, starting with a remote account on Stanford's Wylbur system. He is interested in the use of semi-structured data as a way to combine human-contributed and mechanically-harvested information into useful, readable documentation. He programs computers, writes, and edits for a living.
Contact: Rich Morin, Telephone: 1-650-873-7841
Canta Forda Computer Laboratory
