Selection of Content for the Web Citation Index™

By Angela Martello
Managing Editor, Web Content
Thomson Scientific

Nearly 50 years after Dr. Eugene Garfield began indexing scholarly research bibliographic data, Thomson Scientific continues his emphasis on quality and relevance – carefully selecting and indexing the core literature published in peer-reviewed scholarly journals, books, and proceedings. In 1998, the company decided to complement the extensive bibliographic information it already provides its customers by developing a collection of scholarly Web sites [1] – Current Web Contents™. The Web Content Editors in the Editorial Development Department identify Web sites and evaluate them by determining how well each site adheres to a number of selection criteria (e.g., authority, accuracy, currency, overall design, and quality of writing).

A hallmark of Thomson Scientific has, of course, been its citation indexing. By capturing references and linking them back to source items in our database, Thomson Scientific Web of Science® lets users organize literature around cited references, trace the historic route of scientific discovery, map out relationships among multiple research papers, and even keep track of how well received their publications are by their peers. Thomson Scientific is now applying its strengths in citation tracking and linking to documents resident on the Web in its new offering – the Web Citation Index.


Web Citation Index – What Is It?

The Web Citation Index is a multidisciplinary citation index of Web-accessible, scholarly research papers (including articles, preprints, theses, dissertations, proceedings, technical reports, and other gray literature). The Web Citation Index uses some of the Autonomous Citation Indexing [2] software developed by NEC Laboratories America in Princeton, NJ. This technology extracts source information and references from documents and builds an index with cited and citing relationships. It currently supports the CiteSeer database of computer and information science papers (http://citeseer.ist.psu.edu/). Thomson Scientific has enhanced this technology in many ways. For example, additional software harvests Open Archives Initiative (OAI)-compliant metadata (if available) and combines it with full-text indexes of each document [3, 4].
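
As a rough illustration of the metadata harvesting step, the sketch below builds an OAI-PMH ListRecords request and reads Dublin Core fields from a response, then attaches a simple set of full-text terms to the record. It is not Thomson Scientific's software: the base URL, the sample record, and the function names (list_records_url, parse_records, combine) are invented for this example, and the sample response is parsed from a string rather than fetched from a live repository.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Preprint</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:description>An invented abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier, title, creators, and abstract from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
            "abstract": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


def combine(record, full_text):
    """Attach a sorted list of lowercase full-text terms to a metadata record."""
    terms = sorted(set(re.findall(r"[a-z0-9]+", full_text.lower())))
    return {**record, "terms": terms}


records = parse_records(SAMPLE_RESPONSE)
merged = combine(records[0], "Citation indexing of Web documents")
```

Keeping the harvested metadata and the full-text terms in one record is only one possible design; it is shown here because it makes the "combine" step easy to see.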

Through the Web Citation Index interface, customers can perform cited reference searches and navigate through the Web-based literature by using the cited and citing relationships that exist among the indexed documents. Customers can also link back to the Web of Science if the Web-based document cites or is cited by a Web of Science article, or if the document is also indexed in the Web of Science.
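
The cited and citing relationships described above can be pictured as two lookup tables built from each document's reference list. The following is a minimal sketch under a strong simplifying assumption: references have already been matched to document identifiers, which is the hard part that citation indexing software actually has to solve. The identifiers and the function name build_citation_index are hypothetical.

```python
from collections import defaultdict


def build_citation_index(references_by_doc):
    """Return (cites, cited_by) lookup tables from per-document reference lists."""
    cites = {doc: set(refs) for doc, refs in references_by_doc.items()}
    cited_by = defaultdict(set)
    for doc, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(doc)  # invert the relationship
    return cites, dict(cited_by)


# Invented example: three repository documents and one journal article.
references_by_doc = {
    "preprint-A": ["article-X", "thesis-B"],
    "thesis-B": ["article-X"],
    "report-C": ["preprint-A"],
}

cites, cited_by = build_citation_index(references_by_doc)

# A cited reference search: which indexed documents cite article-X?
citing_article_x = sorted(cited_by.get("article-X", ()))
```

Here citing_article_x lists the documents that cite article-X, while cites["report-C"] gives the references of report-C, so a user can move in either direction along a citation link.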


Institutional and Subject-Specific Repositories

The Web Citation Index differs from the current CiteSeer database in several ways, but the most fundamental difference is the selection of content. The CiteSeer system crawls the Web and harvests content that follows certain rules. The Web Citation Index, on the other hand, contains carefully selected content. The Web Content Editors of Editorial Development serve as the content curators for the Web Citation Index, choosing content that meets defined selection criteria.

For the initial release of the Web Citation Index, Thomson Scientific has concentrated on the scholarly material archived in institutional and subject-specific repositories available via the Internet. Many factors have contributed to the rise of such repositories, most notably the Open Archives Initiative and the development of several open source archiving software packages. A position paper by the Scholarly Publishing & Academic Resources Coalition supported institutional repositories, referring to them as a “compelling response to two strategic issues facing academic institutions,” namely how to reform and complement the current scholarly journal publishing system and how to present to the public an indicator of the quality and relevance of an institution’s research efforts [5].

Indeed, seven major institutions participated in the development of the pilot version of the Web Citation Index. These institutions – Australian National University, California Institute of Technology, Cornell University, the Max Planck Society, Monash University, the University of Rochester, and NASA Langley – allowed us to test our software on their repositories. They also offered valuable feedback on the overall design and functionality of the product itself.

Just how many institutional repositories and subject-specific archives there are is hard to pinpoint. A recent article based on a survey conducted jointly by the Coalition for Networked Information, the UK Joint Information Systems Committee, and the SURF Foundation in the Netherlands attempted to summarize the status of the deployment of institutional repositories in 13 countries (Australia, Canada, the United States, Belgium, France, the United Kingdom, Denmark, Norway, Sweden, Finland, Germany, Italy, and the Netherlands) [6]. The responses to the survey were incomplete (especially with respect to the United States), but to summarize:

  • The respondents reported a total of 305 institutional repositories.
  • Percentages of universities with an institutional repository ranged from 5% (Finland) to 100% (Germany, Norway, and the Netherlands).

A rather extensive list of repositories is the Institutional Archives Registry (http://archives.eprints.org/index.php) maintained by Tim Brody of the University of Southampton on the EPrints.org Web site. As of February 2006, the Registry contains 610 archives.

A number of software packages [7] are also represented in the Registry. The most widely used package is GNU EPrints, developed at the University of Southampton. The second most common self-archiving system is DSpace, developed jointly by MIT and Hewlett-Packard.


Selection Criteria for the Web Citation Index

While selecting institutional and subject-specific repositories for the Web Citation Index, the Web Content Editors keep in mind a number of criteria. These criteria include the following:

  • Authority
  • Overall design, maintenance, and ease of use of the archive
  • Frequency of updates
  • Review policy/procedure (if any)

Authority

With respect to authority, the Editors are primarily concerned with identifying the body behind the archive. The Editors look for repositories and subject-specific archives sponsored by universities or colleges with faculty and staff active in research (e.g., MIT, Dartmouth, Australian National University); government research laboratories or agencies that produce a great volume of documents (e.g., NASA, EPA); major non-governmental or intergovernmental bodies (e.g., the United Nations); or prominent not-for-profit or non-profit research institutions, societies, or organizations (e.g., Max Planck Society). The Editors also consider the document collections of large, research-oriented commercial or corporate entities (e.g., Sun Microsystems, Hewlett-Packard), provided they offer free access to full-text documents.

Overall Design and Maintenance

Overall design and maintenance of the repository or archive Web site are important considerations as well. Sites with many broken links or excessive server downtime are problematic and generally rejected, as are sites that have not been updated in quite some time. Other design/maintenance features of a repository the Editors look for are document type metadata tags (e.g., article, thesis, or memo), some type of subject classification, and the ability to access the full-text document.

Frequency of Updates

The Editors also consider the frequency with which new materials are posted to the repository or archive. A regular stream of new articles signals that the collection is current and well-maintained. For institutional repositories, a frequently updated collection also signals that the faculty and staff of the institution have embraced the concept of archiving their works.

Frequent updates, however, do not imply that a repository must hold a minimum number of documents. In fact, the total numbers of documents in the repositories reviewed and accepted by the Editors range from fewer than 100 to 350,000+. Fairly new or single-topic repositories, such as those run by the University of Washington’s Structural Informatics Group (http://sigpubs.biostr.washington.edu/) and the Advanced Knowledge Technologies collaboration in the UK (http://eprints.aktors.org/), have considerably fewer papers (e.g., 100-250) than do the more established or multidisciplinary repositories managed by large universities, such as DSpace at MIT (http://dspace.mit.edu/) and the Virginia Tech Electronic Theses and Dissertations Collection (http://scholar.lib.vt.edu/theses/), which each have several thousand documents. The arXiv at Cornell University (http://www.arxiv.org/) and the various technical reports servers maintained by NASA (e.g., http://library-dspace.larc.nasa.gov/) are examples of well-established yet more narrowly focused collections that have very large numbers of documents (350,000+ for the arXiv and 5,000+ for the NASA Langley Technical Library Digital Repository). The Editors, therefore, must consider the frequency of update, the age of the repository, and the range of subject matter – not just the total number of documents – when evaluating a repository for overall robustness.

Review Policy/Procedure

If the repository follows a self-archiving model, the Editors check to see to what extent – if any – the submissions are moderated. The Editors are not necessarily looking for a peer-review process; rather, they are checking to see if submissions are verified (i.e., reviewed with respect to appropriateness of content and adherence to archive requirements) before being added to the repository.

There are also some technical issues that could affect the ease with which the archive can be crawled and indexed. These include access to the full-text document (via a browse or search), document format (PDF, PostScript, zipped PostScript), and compliance with OAI metadata standards, particularly with respect to the pertinent source item elements (e.g., title, author, abstract).
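
To make these technical considerations concrete, the sketch below flags records that would be hard to crawl and index. It is an illustration only, not the Editors' actual procedure: the record layout, the field names, the file suffixes used to recognize PDF, PostScript, and zipped PostScript, and the function name technical_issues are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Assumed suffixes for PDF, PostScript, and zipped PostScript files.
ACCEPTED_SUFFIXES = (".pdf", ".ps", ".ps.gz", ".ps.zip")

# Source item elements named in the text above.
REQUIRED_ELEMENTS = ("title", "author", "abstract")


def technical_issues(record):
    """Return a list of problems that would hinder crawling and indexing."""
    issues = []
    url = record.get("full_text_url", "")
    if not url:
        issues.append("no full-text link")
    elif not urlparse(url).path.lower().endswith(ACCEPTED_SUFFIXES):
        issues.append("unsupported document format")
    for element in REQUIRED_ELEMENTS:
        if not record.get(element):
            issues.append("missing metadata element: " + element)
    return issues


# Invented records for illustration.
good = {
    "full_text_url": "http://example.org/papers/1.pdf",
    "title": "A Hypothetical Preprint",
    "author": "Doe, J.",
    "abstract": "An invented abstract.",
}
poor = {
    "full_text_url": "http://example.org/papers/2.doc",
    "title": "Another Hypothetical Paper",
    "author": "Roe, R.",
}

good_issues = technical_issues(good)
poor_issues = technical_issues(poor)
```

A check of this kind could only supplement the Editors' review; authority, design, and review policy still require human judgment.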


Examples of Repositories

Some examples of well-designed, well-maintained institutional and subject-specific archives are arXiv (http://www.arxiv.org/), the physics, mathematics, computer science, and quantitative biology open access preprint archive maintained by Cornell University; the Caltech Collection of Open Digital Archives (CODA) (http://library.caltech.edu/digital/default.htm/); the Australian National University Eprints Repository (http://eprints.anu.edu.au/), which includes all the scholarly output of the ANU community; and the NASA Langley Technical Library Digital Repository (http://library-dspace.larc.nasa.gov/). These examples fit the editorial guidelines with respect to authority, overall design, maintenance, frequency of updates, and ease of use.


Conclusion

With the development of the Web Citation Index, Thomson Scientific brings its strengths in citation indexing to documents resident on the Web in institutional and subject-specific repositories. This powerful new product initiative allows ISI Web of Knowledge℠ users to search Thomson Scientific’s core database of journal literature as well as the contents of selected repositories, and track the cited and citing relationships among these documents.


References

1. Current Web Contents: Web Site Selection Criteria (essay) (http://thomsonreuters.com/business_units/scientific/free/essays/cwc-criteria/)
2. Lawrence S, Giles CL, Bollacker K: Digital libraries and autonomous citation indexing. IEEE Computer 32(6):67-71, 1999
3. Open Archives Initiative Protocol for Metadata Harvesting, v. 2.0 (http://www.openarchives.org/OAI/openarchivesprotocol.html)
4. OAI for Beginners: The Open Archives Forum Online Tutorial (http://www.oaforum.org/tutorial/)
5. Crow R: The case for institutional repositories: A SPARC position paper. Scholarly Publishing & Academic Resources Coalition, 2002 (http://www.arl.org/sparc/IR/ir.html)
6. van Westrienen G, Lynch CA: Academic institutional repositories: Deployment status in 13 nations as of mid 2005. D-Lib Magazine 11(9), September 2005 (http://www.dlib.org/dlib/september05/westrienen/09westrienen.html)
7. Open Society Institute: A guide to institutional repository software, 3rd edition, 2004