Researchers at the University of Southern California have created a new tool for organizing and visualizing collections of electronic mail. It is designed to help legal researchers, historians, archivists, and others who face challenges in dealing with large email archives.
For example, consider the following cases:
* A large corporation has just received a subpoena for all email messages on a specific question. Traditional keyword searches return an enormous volume of mail that must be scanned by lawyers and paralegals for applicability. In the same way, the recipients of the subpoenaed data must analyze it. Can this process be sped up and made more efficient?
* A historian is analyzing the history of a government decision, using an email archive. Reading all the text gives a great deal of information about the decision, but only careful notes can keep track of such events as shifts over time in the distribution of information, and even then subtle changes are hard to catch. Can software help?
* A library has just received a donation of a famous scientist’s email correspondence. Besides just a simple listing of titles, addresses, and dates, is there a way that the information in the archive can be made more immediately useful and comprehensible to users?
Anton Leuski of the USC School of Engineering’s Information Sciences Institute will demonstrate a system designed to address such problems on July 30 at the Association for Computing Machinery Special Interest Group conference on Information Retrieval, in Toronto, Ontario.
Called “eArchivarius,” Leuski’s system uses sophisticated search software developed for Internet search engines like Google to detect important relationships between messages and people by taking advantage of inherent clues that exist in email collections.
It then automatically creates a vivid and intuitive visual interface, using spheres grouped in space to represent the relationships it discovers.
The display, a system called “Lighthouse” created earlier by Leuski and co-workers, can shuffle the connections to bring different elements to the fore.
In one display configuration, each sphere represents an author in the system. The spheres are visualized in a two- or three-dimensional space in which the distance between them indicates the number of messages exchanged over a given period.
For one collection used as an experimental exercise, exchanges of email among Reagan administration national security officials, this visualization immediately shows some recipients packed into a tight cluster near the center with their most frequent correspondents, while others can be seen to be literally out of the loop, far out on the periphery.
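As a rough illustration of how message counts might be turned into positions, the sketch below counts messages per pair of people, maps each count to a target distance, and nudges points in two dimensions until the on-screen distances approach those targets. This is not a description of how eArchivarius or Lighthouse is implemented: the mapping `1 / (1 + count)`, the gradient-descent layout, the function names, and the sample people are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(messages):
    """Count messages exchanged between each unordered pair of people."""
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            counts[frozenset((sender, recipient))] += 1
    return counts


def target_distance(count):
    """Assumed mapping: the more messages exchanged, the smaller the distance."""
    return 1.0 / (1.0 + count)


def layout(people, counts, steps=500, rate=0.05):
    """Place people in 2D so pairwise distances approach the target distances.

    Starts from points spread evenly on a unit circle, then runs plain
    gradient descent on the sum of squared distance errors.
    """
    n = len(people)
    pos = {
        p: [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i, p in enumerate(people)
    }
    for _ in range(steps):
        grads = {p: [0.0, 0.0] for p in people}
        for a, b in combinations(people, 2):
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            err = dist - target_distance(counts[frozenset((a, b))])
            gx, gy = 2 * err * dx / dist, 2 * err * dy / dist
            grads[a][0] += gx
            grads[a][1] += gy
            grads[b][0] -= gx
            grads[b][1] -= gy
        for p in people:
            pos[p][0] -= rate * grads[p][0]
            pos[p][1] -= rate * grads[p][1]
    return pos


def distance(pos, a, b):
    """Euclidean distance between two placed people."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Hypothetical sample data: (sender, recipient) pairs.
messages = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"), ("bob", "alice"),
    ("alice", "carol"), ("carol", "bob"),
]
people = ["alice", "bob", "carol"]
counts = pair_counts(messages)
positions = layout(people, counts)
print(distance(positions, "alice", "bob") < distance(positions, "alice", "carol"))
```

In this toy data, the two frequent correspondents are given a smaller target distance than either has to the third person, which mirrors the tight-cluster and periphery pattern described above.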
The spheres representing people can also be arranged under the influence of other factors: the content of the authored messages, for example. The resulting configuration shows existing communities of people who converse on the same topic and the relationships among those communities.
Selecting any email recipient can open up another window, which provides a list of all the people with whom the selected person exchanged mail, and a time-graphed record that shows when the exchanges took place.
“For a historian trying to understand the process by which a decision was made over a course of months, this kind of access will be extremely valuable,” said Leuski, a research associate at ISI.
And the same interface can instantly return and display individual pieces of mail in the form of hypertext pages with links to the people who sent and received the email and with links to similar email messages.
“Similar messages” can be defined in terms of recipients, text keywords, or both. In the display produced using this capability, the spheres are the messages themselves, each placed closer to messages that are similar in (for example) audience. The spheres can also be colored to show other relationships, such as topic similarity: the likelihood that a message is about a particular topic can be shown by more or less intense color. Different colors indicate different topics, creating a map of how the information is distributed among the messages.
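One simple way to picture a similarity that can be based on recipients, keywords, or both is sketched below. The article does not say which measure eArchivarius uses; Jaccard overlap, the equal weighting in the combined mode, the whitespace tokenization, and the message fields shown here are assumptions chosen only to make the idea concrete.

```python
def jaccard(a, b):
    """Overlap between two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def message_similarity(m1, m2, mode="both"):
    """Similarity of two messages by recipients, keywords, or both.

    Each message is a dict with hypothetical fields "recipients" (a list of
    addresses) and "text" (the body). Keywords are lowercased whitespace
    tokens, which is a deliberate simplification.
    """
    by_recipients = jaccard(m1["recipients"], m2["recipients"])
    by_keywords = jaccard(m1["text"].lower().split(), m2["text"].lower().split())
    if mode == "recipients":
        return by_recipients
    if mode == "keywords":
        return by_keywords
    # Assumed combination: a plain average of the two scores.
    return (by_recipients + by_keywords) / 2


# Hypothetical sample messages.
first = {"recipients": ["a@example.org", "b@example.org"],
         "text": "Budget meeting Monday"}
second = {"recipients": ["a@example.org", "b@example.org"],
          "text": "budget meeting Friday"}

print(message_similarity(first, second, mode="recipients"))
print(message_similarity(first, second, mode="keywords"))
print(message_similarity(first, second))
```

A score of this kind could serve as the basis for the distances between message spheres, with higher similarity mapped to a shorter distance.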
“What we have in effect is a four-dimensional display, with color added to the three spatial dimensions,” said Douglas Oard, an associate professor of computer science from the University of Maryland’s College of Information Studies and its Institute for Advanced Computer Studies, who is working at ISI during a sabbatical year.
Leuski and Oard have demonstrated the ability to find interesting patterns in collections as small as a few hundred emails, and the techniques they have developed are now being applied to thousands of emails sent and received by a single individual over 18 years. Scaling up to process millions of emails involving thousands of people will be the next challenge.
The elements of eArchivarius’s flexible and highly useful interface, Oard says, may someday find their way into email client software.
“Email has become a major element of modern life, and the raw material of history,” said Oard. “We believe that eArchivarius offers a way into the email labyrinth for researchers of all kinds.”
Published on May 22nd, 2003
Last updated on August 10th, 2021