Web Site Evolution -
Towards a flexible integration of Data and its Representation
Margaret-Anne Storey and Jens H. Jahnke
Department of Computer Science,
University of Victoria, Canada
Abstract
This position paper discusses the inherent
difficulties caused by data-driven as well as user-interface-driven evolution
of Web sites. New data requirements
arise in order to accommodate new data sources, or to delete and change
existing data. User interfaces are constantly evolving to suit new customer
requirements or to take advantage of new technologies. Maintaining the
integrity and security of the data is of utmost importance, but the aesthetics
and usability of the site are equally important. This is especially true in the
eCommerce arena. Users with non-technical backgrounds are doing business using
the Web as a communication medium. Furthermore, the Web itself has allowed the
integration of data sources not previously possible. These issues are
particularly pertinent for library Web sites.
In our position paper, we evaluate current technologies and identify the
lack of separation between user interface and data structure concerns as a
major cause of evolution and maintenance problems in Web site design.
1. Background
Currently, the World Wide Web is dramatically changing how information is gathered and distributed by institutions, companies, and consumers. During the past four years, the number of Web sites has increased from about 23,500 (June 1995) to 7,078,194 (August 1999) [1]. A major driving force behind this development has been the increasing popularity of Web-based services in the private sector. The Web, with its user-friendly and less intimidating interface, has been broadly accepted by the public. In contrast to other Internet-based information services such as FTP and Gopher, the Web was specifically developed for browsing complex multi-site information networks using graphical visualization and simple point-and-click navigation techniques. This paradigm has enabled even inexperienced users to explore this new medium without having to learn a textual command language.
The success of a Web site is due not only to the usefulness of the data it represents but also to the accessibility of the information by the average user. Web sites evolve to accommodate changes in the data represented by the Web site or to accommodate modifications in how this data is accessed, i.e., changes to the user interface.
The following section describes these two aspects of Web site evolution in more detail. Section 3 discusses several techniques for integrating data and its representation in information-based Web sites. The remainder of the paper focuses on a specific application domain of Web-based information services, electronic library gateways. Section 4 gives an introduction to the central issues in this area, whereas Section 5 describes a specific project with the McPherson library of the University of Victoria.
2. Web Site Evolution - two sides of the coin
This section describes some of the challenges resulting from data-driven and user-interface-driven evolution.
2.1 Data evolution
The necessity to modify the structure of data maintained in information systems causes severe update problems in today’s software industry. Prominent examples of this phenomenon include the Y2K problem [2] and the Euro conversion [3]. We can transfer many lessons learned in the development of traditional Information Systems (IS) to the domain of net-centric information systems. One important lesson from IS research is the need to separate data from its representation and from the business logic that processes the data. The failure to separate the data from the business logic in legacy software systems developed in the 1960s and 1970s resulted in complex reengineering problems. The reason for these problems is that the data, business logic, and external representations are highly dependent on one another, so that a change in the data structures (meta data), business rules, or user interface implies complicated updates. The introduction of standardized file access systems (e.g., SAM, ISAM) and database management systems (DBMS) contributed to a decoupling of this tight integration in IS. The middleware and view mechanisms of modern DBMS allow developers to hide many details about the internal representation of a logical data structure from the rest of the application code. As a benefit, changes caused by data evolution can be performed more locally.
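The decoupling provided by a view mechanism can be illustrated with a minimal sketch, here written in Python with the standard sqlite3 module. The table, view, and column names (book_v1, book_v2, catalog) and the sample record are hypothetical and serve only as an illustration: the application function queries the logical view, so it does not need to change when the underlying table structure evolves.

```python
import sqlite3


def titles(conn):
    # Application code: depends only on the logical view "catalog".
    return [row[0] for row in conn.execute("SELECT title FROM catalog ORDER BY title")]


conn = sqlite3.connect(":memory:")

# Original physical schema (hypothetical).
conn.execute("CREATE TABLE book_v1 (title TEXT, author TEXT)")
conn.execute("INSERT INTO book_v1 VALUES ('Example Title', 'Doe')")
conn.execute("CREATE VIEW catalog AS SELECT title, author FROM book_v1")
before = titles(conn)

# Data evolution: the author name is split into two columns.
conn.execute("CREATE TABLE book_v2 (title TEXT, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO book_v2 SELECT title, '', author FROM book_v1")

# Only the view definition is updated; titles() stays untouched.
conn.execute("DROP VIEW catalog")
conn.execute("DROP TABLE book_v1")
conn.execute(
    "CREATE VIEW catalog AS "
    "SELECT title, trim(first_name || ' ' || last_name) AS author FROM book_v2"
)
after = titles(conn)
```

In this sketch the change is confined to the view definition, which is the kind of locality described above.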
We expect that similar techniques will prove beneficial for maintaining complex, data-intensive Web sites. However, there are additional challenging problems which need to be solved. For example, with net-centric IS, an important goal is to develop technologies that facilitate the integration of several distributed and heterogeneous data sources. Today, an increasing number of successful Web sites maintain only a small amount of original data content but provide categorized and annotated pointers to other data sources.
Given the predicted popularity of eCommerce, Web-based information systems, such as airline-independent best-price flight reservation systems and electronic library gateways, will gain even more importance in the future. Such information systems need to be flexible because they evolve rapidly as additional data sources are integrated and user interfaces change. Some central issues to consider are: how to harmonize different heterogeneous data structures; how to deal with overlapping data; and how to enforce consistency constraints (e.g., transaction management). Similar issues have been encountered for several years in the domain of distributed and federated DBMS [11]. However, since this research is based on the assumption of a relatively small and fixed number of participating data sources, the results have to be reevaluated with respect to the scale and dynamics of Web-based information management.
2.2 User interface evolution
The user interface of a Web site may be viewed as an external representation of the data or information contained in a site. Users need to interact with Web sites either to find specific information or to browse and explore for more knowledge about a particular subject. In addition, users often have to input data (particularly in eCommerce applications) such as their address or product ordering information. In essence, Web site user interface design is as complex a task as application user interface design. Indeed, the user interfaces of many applications are currently migrating to Web-based interfaces.
User interface design is without doubt a challenging problem. A good design is usually the result of several iterations. One of the problems with a Web-based information system is that the pressure to evolve is persistent: users demand updated or new information as soon as, or even before, it becomes available. More sophisticated users also request all the bells and whistles provided by the latest browsers and software tools. In contrast, more and more non-technical users are seeking simple, easy-to-learn, and intuitive interfaces. Novices do not want to download and install plug-ins, and they are often not aware that there is yet another version of the default browser. These conflicting styles of usage make design decisions difficult during site design.
In the next section, we review several existing technologies for building Web-based information systems and discuss the impact of these techniques on data evolution and user interface evolution.
3. Techniques to integrate data and its presentation on the Web
In 1992, Connolly introduced the first specification of the HyperText Markup Language (HTML) as a means to describe Web pages. Although several extensions of HTML have been proposed and implemented (e.g., forms, images, and frames), the basic approach has not changed in principle: data is described as text enclosed by formatting directives (HTML tags). While this solution has proven sufficient for static information such as advertisements and contact information for companies and organizations, it is hardly viable for maintaining complex information that is updated frequently.
The Common Gateway Interface (CGI) was introduced to process and present more dynamic information content. It allows a Web server to invoke external programs or scripts that generate HTML pages on request. These programs can be written in various languages and permit the storage and retrieval of data on a Web server. This technique is widely used in combination with HTML forms to access data in files or databases from the Web. However, with respect to maintenance and evolution issues for complex Web information systems, the CGI approach causes significant update problems. This is because CGI scripts, which serve as mediators between the data and their HTML representation, are “hard-coded” at a low level of abstraction. Each change in the meta-data or in the external HTML interface requires a change to these scripts. This means a significant update and subsequent testing overhead for complex applications such as large product catalogs.
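The coupling described above can be made concrete with a small sketch of a CGI-style script body, written here in Python. The product table, its columns (name, price), and the page layout are hypothetical; the function simply builds the header and HTML text that such a script would write to its standard output. Both the column names and the markup are fixed inside the same function, which is why a change to either one forces an edit to the script.

```python
import sqlite3
from html import escape


def product_page(conn):
    # The column names (name, price) and the HTML layout are both
    # hard-coded here, so a change to either means editing this script.
    rows = conn.execute("SELECT name, price FROM product ORDER BY name").fetchall()
    lines = ["Content-Type: text/html", "", "<html><body><table>"]
    for name, price in rows:
        lines.append(f"<tr><td>{escape(name)}</td><td>{price:.2f}</td></tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)


# Hypothetical sample data standing in for a product catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, price REAL)")
conn.execute("INSERT INTO product VALUES ('Widget', 9.5)")
page = product_page(conn)
```

If the price column were renamed in the database, the query inside product_page would fail until the script itself was edited and retested.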
One attempt to raise the level of abstraction for the coupling between data and its representations is the dedicated HTML-database gateway, offered by several DBMS and third-party vendors [4]. These solutions enable developers to embed database queries directly into HTML documents. Still, the queries and the HTML forms have to be updated when the meta data evolves. Another approach, which provides better support for data evolution, is the HTML generator for databases, e.g., Ardent Software’s O2Web [5], which supports the automatic generation of HTML representations for data objects without any additional programming. Each object has a generic HTML presentation. Programmers can customize these HTML production methods by overloading the system-supplied methods. The most important disadvantage of such solutions is that they are targeted to a specific DBMS platform and do not support the integration of heterogeneous data sources.
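The idea of a generic presentation that can be customized per object type can be sketched as follows. This is an illustration in Python only and does not reproduce the O2Web interface; the class names (WebObject, Book, Journal) and the method name to_html are hypothetical.

```python
from html import escape


class WebObject:
    """Hypothetical base class supplying a generic HTML presentation."""

    def to_html(self):
        # Generic presentation: one definition-list entry per attribute,
        # derived from the object itself without extra programming.
        items = "".join(
            f"<dt>{escape(name)}</dt><dd>{escape(str(value))}</dd>"
            for name, value in vars(self).items()
        )
        return f"<dl>{items}</dl>"


class Book(WebObject):
    # Uses the generic presentation unchanged.
    def __init__(self, title, year):
        self.title = title
        self.year = year


class Journal(WebObject):
    def __init__(self, title, issn):
        self.title = title
        self.issn = issn

    def to_html(self):
        # Customized presentation replacing the system-supplied method.
        return f"<h2>{escape(self.title)}</h2><p>ISSN {escape(self.issn)}</p>"
```

Adding an attribute to Book changes its page without further work, whereas Journal keeps its hand-written layout and must be updated by hand.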
Recently, the Extensible Markup Language (XML) [6] has been proposed as a successor to HTML. A key feature of XML is that, unlike in HTML, the semantics of the tags used are not predefined. An XML document defines the (data) structure of a Web page, while its external representation is defined separately by another specification written in the Extensible Stylesheet Language (XSL) [6]. XML browsers, if broadly adopted as a substitute for HTML-based systems, would make user interface design more independent of the actual data content.
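The separation can be sketched with Python's standard xml.etree.ElementTree module. The element names (catalog, book, title, year) and the sample records are hypothetical, and the two plain Python functions merely stand in for two XSL stylesheets; a real deployment would use an XSL processor instead. The point is that both presentations are derived from the same unchanged data document.

```python
import xml.etree.ElementTree as ET
from html import escape

# Data document with application-specific tags (hypothetical content).
CATALOG_XML = """
<catalog>
  <book><title>First Example</title><year>1998</year></book>
  <book><title>Second Example</title><year>1999</year></book>
</catalog>
"""


def render_list(root):
    # Presentation 1: a plain list of titles.
    items = "".join(
        f"<li>{escape(book.findtext('title'))}</li>" for book in root.iter("book")
    )
    return f"<ul>{items}</ul>"


def render_table(root):
    # Presentation 2: a table with title and year columns.
    rows = "".join(
        f"<tr><td>{escape(book.findtext('title'))}</td>"
        f"<td>{escape(book.findtext('year'))}</td></tr>"
        for book in root.iter("book")
    )
    return f"<table>{rows}</table>"


root = ET.fromstring(CATALOG_XML)
as_list = render_list(root)
as_table = render_table(root)
```

Changing the page layout means replacing a rendering function, while the data document stays as it is.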
Another platform-independent approach that enables access to heterogeneous data sources employs Java applets in combination with Java Database Connectivity (JDBC) [7]. Currently, this solution is rarely used for accessing data at Web sites. It is likely to become increasingly popular because it allows the Web server to distribute data-oriented computations among its clients. A remaining problem, however, is that the client software has to be updated whenever the applet changes. If multiple data sources have to be accessed, maintainability can be improved by employing object-oriented integration services like CORBA and COM [8]. Such services facilitate encapsulation of distributed data sources and enhance their robustness against modifications. These techniques are slowly being adopted.
4. Electronic libraries
The vast variety of services offered through the Web makes it hard to find an approach to tackle the problem of Web site evolution in general. Therefore, we are initially restricting our research to a specific application domain, electronic libraries. Today, an increasing number of libraries employ computers to maintain an electronic database of their catalogs. Traditionally, search and order services have been provided by dedicated information terminals provided in the libraries. Recently, many libraries have started to migrate these services to the Web. This migration has several benefits:
· Users can access the library 24 hours a day from their offices or homes. This improves the accessibility of catalog information and reduces the effort required to maintain numerous information terminals.
· Many new library users have previous experience with other Web-based services. This means they can search and browse the library with little training.
· Catalogs of remote libraries and bibliographic services can be integrated within the library gateway.
This last point is probably the most important benefit of Web-based library services, although significant interoperability and evolution problems arise with the integration of different library catalogs. Because of these problems, most current electronic libraries provide only superficial integration: they offer similar user interfaces for different data sources but rely almost entirely on human intelligence to provide coherence for the content. Deep data integration of library data has been identified as a "grand challenge" research problem because of the heterogeneous and evolutionary nature of the different data sources [9]. Still, this deep data integration is of crucial importance if we are to tackle open issues like overlapping and complementary data and the presentation of uniform and consistent query results to the users. Most current research efforts are located in between these two extremes. They aim to establish a standardized protocol for library database integration and information retrieval. Since 1995, ANSI/NISO has developed the Information Retrieval Standard Z39.50, which was accepted by ISO in late 1996 [10]. Lunau and Turner of the National Library of Canada summarize experiences with this new standard [10]. They report that, in principle, the uniform interface works well as a data integration platform. Still, they point out that the key remaining issue is how to create and maintain a valid mapping between the data objects and attributes of the individual databases and the uniform access layer.
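The mapping problem can be illustrated with a minimal sketch in Python. The source names, field names, and records below are hypothetical and are not Z39.50 attribute sets; the sketch only shows the shape of the problem, namely that every participating database needs its own mapping onto the uniform attributes, and that each mapping has to be maintained as that database evolves.

```python
# Per-source mappings: uniform attribute -> source-specific field name.
# All names are hypothetical.
MAPPINGS = {
    "library_a": {"title": "ti", "author": "au", "year": "py"},
    "library_b": {"title": "Title", "author": "Creator", "year": "Date"},
}

# Hypothetical sample records in each source's own format.
SOURCES = {
    "library_a": [{"ti": "Web Site Design", "au": "Doe", "py": "1998"}],
    "library_b": [
        {"Title": "Database Design", "Creator": "Roe", "Date": "1997"},
        {"Title": "Gardening", "Creator": "Poe"},
    ],
}


def to_uniform(source, record):
    # Translate one source record into the uniform attribute set;
    # attributes missing from the source record become None.
    mapping = MAPPINGS[source]
    return {uniform: record.get(local) for uniform, local in mapping.items()}


def search(sources, title_word):
    # Query all sources through the uniform layer.
    hits = []
    for source, records in sources.items():
        for record in records:
            uniform = to_uniform(source, record)
            if title_word.lower() in (uniform["title"] or "").lower():
                hits.append(uniform)
    return hits


results = search(SOURCES, "design")
```

If library_b renamed its Date field, only its entry in MAPPINGS would need to change, but someone would still have to notice the change and keep the mapping valid.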
5. Case study: McPherson Library Gateway
The McPherson library is a significant library used by thousands of students and researchers at the University of Victoria. The library has a well established Web presence (called the McPherson Library Gateway). The Gateway is used for accessing information about the library’s physical collections and provides access to electronic indices and texts. Despite their best efforts, the McPherson library staff are overwhelmed by the amount of electronic information that is now available. Currently, the library includes over 60 different indices for articles. They have realized that the current Web site architecture is neither scalable nor flexible enough to suit these rapid changes and needs to be updated.
To solve this problem, they bought a number of dedicated Z39.50 middleware products to integrate the different data sources. We are collaborating with the library management to offer our expertise in data evolution and user interface evolution during the reengineering of their Web site architecture. Some of the issues we will focus on are described in the following paragraphs.
Scalability: The current architecture of the Gateway Web site cannot scale to handle the increasing number of students using the service. In addition, it is becoming increasingly cumbersome to integrate electronic catalogs, journals and texts that become available almost on a daily basis. The architecture needs to support easy insertion and deletion of data resources as well as provide a consistent set of user interfaces to these new resources. Peak usage times should be considered, for example, just before midterm exams. Furthermore, the architecture should be able to support multiple site maintainers working concurrently.
Flexibility: The design needs to be flexible so that data resources and information needs not currently envisioned can be integrated in the future. For example, the current architecture does not allow users to concurrently search through video and book collections. Possible future changes may allow users to search through musical resources or even digital videos online. The current library card system is very homogeneous with respect to the varied nature and granularity of the resources it represents. Each card has a set number of fields and works fairly well for written material. The future architecture will need to provide access to more heterogeneous data resources and should provide access to information and its components at varying levels of granularity.
Consistency: The Web site will need to provide a consistent set of user interfaces to many different resources. In addition, many of the users will have had experience or will quickly gain experience as they use or link to other library Web sites. Consistency across multiple libraries is especially important.
Usability: The Gateway Web site will become in essence a virtual library. Although it will not be housed in a concrete building, it is still important for the patrons to achieve a sense of presence as they browse the site. Such characteristics will improve the usability of the site. The user interface needs to be suitable for many different user profiles, such as students, researchers, teachers, graduate students and members of the public. The characteristics for each of these user groups will need to be collected and continually updated as the sophistication of each of these user groups continues to evolve.
Information retrieval research shows that there are two main styles of navigation: searching and browsing. Users frequently switch between these two activities. While browsing, many users do not know exactly what they are looking for, or may not know the correct label to use during a search. The site should impose a structure on the information so that the library patrons can create their own personal paths to the information they require. Users may also want to create customized views of the library for their own purposes.
Maintainability: The architecture needs to be designed so that it is relatively easy to add, update, or delete heterogeneous data resources. Maintainers will have the task of deciding which catalogs and electronic journal/text libraries to add to or remove from the site. In addition, they will have to deal with the ambiguity in cataloging items as they become available. However, certain business processes, such as book recalls, holds, and fines for overdue articles, may be simplified. Other concerns are training issues for the site maintainers and the library patrons.
To date we have begun preliminary discussions with the library management. They are enthusiastic about our involvement, as their task is a very challenging one. We hope to be able to suggest solutions that will ensure that the new architecture for their Web site will be suitable for many years. Although we will initially research the web evolution of a library information system, the results should generalize to other applications.
References
[1] M. Gray. Growth and Usage of the Web and the Internet. Massachusetts Institute of Technology, 1999. http://www.mit.edu/people/mkgray/net
[2] R. A. Martin. Dealing with dates: Solutions for the Year 2000. Computer, 30(3):44-51, 1997.
[3] K. Grotenhuis. Crossing the Euro rubicon. IEEE Spectrum, 35(10):20-33, 1998.
[4] G. Ehmayer, G. Kappel, and S. Reich. Connecting Databases to the Web - A Taxonomy of Gateways. Proc. of the 8th Intl. Conference on Database and Expert Systems Applications (DEXA 97), Toulouse, France, 1997.
[5] O2Web User Manual. Ardent Software, Inc., 50 Washington Street, Westboro, MA 01581-1021, USA, 1997.
[6] S. Holzner. XML Complete. McGraw Hill, 1998.
[7] P. Patel and K. Moss. Java Database Programming With JDBC. Coriolis Group Books, 1996.
[8] T. Mowbray and W. A. Ruh. Inside CORBA: Distributed Object Standards and Applications. Addison-Wesley, Reading, MA, USA, 1997.
[9] C. Lynch and H. Garcia-Molina. Interoperability, Scaling, and the Digital Libraries – Research Agenda. Report on the IITA Digital Libraries Workshop, May 1995. http://www-diglib.stanford.edu/diglib/pub/reports/iita-dlw/main.html
[10] C. Lunau and F. Turner. Summary of Issues Related to the Use of Z39.50. National Library of Canada. http://www.nlc-bnc.ca/resource/vcuc/ezarlsum.htm
[11] A. P. Sheth. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. Intl. Conf. on Very Large Data Bases (VLDB '91). Morgan Kaufmann Publishers, 1991.