The Size of Thoughts
The preparation of a catalogue may seem a light task, to the inexperienced, and to those who are unacquainted with the requirements of the learned world, respecting such works. In truth, however, there is no species of literary labor so arduous and perplexing. The peculiarities of titles are, like the idiosyncrasies of authors, innumerable.
In 1850, the librarian of the American Antiquarian Society was asked to produce a new catalog for the society. “Men have become insane,” the agitated librarian responded,
in their efforts to reduce these labors to a system; and several instances are recorded where life has been sacrificed in consequence of the mental and physical exertion required for the completion of a catalogue in accordance with the author’s view of the proper method of executing such a task.
One Sunday, feeling only semi-sane myself, I called up Jim Ranz, retired dean of the Libraries of the University of Kansas, from whose immortal monograph The Printed Book Catalogue in American Libraries: 1723–1900 (1964) this last quotation is taken, and I asked him to comment on the passing of card catalogs. Mr. Ranz was not terribly concerned about their fate. “Retention of a card catalog would have to be a pretty low priority in most libraries,” he said. What he really wanted to talk about was Charles Ammi Cutter (1837–1903), the author of what is in Mr. Ranz’s opinion the finest library catalog ever made. Cutter’s masterpiece is the five-volume catalog of the Boston Athenaeum, published between 1874 and 1882. “I’m not sure he wasn’t the greatest cataloger that lived,” Mr. Ranz told me. The work is 3,402 pages long, and is elaborately and commonsensically cross-referenced; it cost the Athenaeum almost a hundred thousand dollars to produce. It is still of interest and utility to historians—as is the card catalog for the Athenaeum, which Cutter also developed. So far, the library has held on to its original cards.
Surely, I insisted to Harvard’s Dale Flecker, the Boston Athenaeum’s card catalog, at the very least, ought to be preserved. “Oh, I don’t know,” Mr. Flecker replied. His indifference makes sense, in a way, since he couldn’t very well advocate the preservation of the Athenaeum’s catalog and at the same time defensibly jettison the older and equally rich public catalog at Harvard. The young Charles Cutter had given his energy to Harvard’s cards, too; while working for Ezra Abbot, who was Harvard’s assistant librarian from 1856 to 1872, he had refined his theories about how people actually perform subject searches and what they require from a library’s finding list. In 1861, Ezra Abbot instituted one of the first card catalogs that were “freely and conveniently accessible,” in his words, “to all who use the Library.” By the turn of the century, the traditional bound catalog had become a technical impossibility for large libraries, and card catalogs, predominantly handwritten (despite the existence by then of early typewriters), were everywhere.
In January 1901 the Library of Congress began printing its catalog cards in quantity and selling them in sets to any library that wanted them. These cards—elegant in their own way, accurate, highly readable, and cheap—took off. Even Cutter himself (with good grace, since his advocacy implied the eventual death of his own artful system of subject classification) recommended the purchase of Library of Congress cards, writing in 1904 that “any new library would be very foolish not to make its catalogue mainly of them.” And libraries obeyed. A 1969 study of 1,926 randomly selected cards, all plucked from drawers of the shelf list at Rice University’s Fondren Library, found the following kinds:
15 handwritten
1,275 unmodified Library of Congress
68 modified Library of Congress
472 typewritten
96 miscellaneous, describing maps, musical scores, serials, etc.
(This same pre-computer-age study, published by MIT, determined that the average number of cards in a drawer was 826, that the typical book represented by a card was 276.6 pages long, and that the growth rate of Rice’s library holdings closely tracked that of the United States gross national product.)
The Library of Congress’s handiwork dominated card catalogdom through the early seventies. In 1968, it was distributing about a thousand cards a minute, for around five cents a card. Meanwhile, Fred Kilgour, a chemist turned librarian, sensing that the Library of Congress was failing to exploit the full possibilities of its newly developed machine-readable cataloging techniques, formed OCLC and became, among many other things, the catalog-card printer for the world. (OCLC sprang up in Ohio, according to Kilgour, because “in Ohio, and in the eastern Midwest, people in general are more willing to accept calculated risk with reference to innovation.”) Since 1970, OCLC has printed 1.8 billion catalog cards on its high-volume line printers: they’re the ones with the distinctive, slightly jaunty typewriteresque typeface. Though they cost slightly more than Library of Congress cards, OCLC would automatically sort your duplicates any way you wanted—all together in one alphabet, say, or separately alphabetized for the subject catalog, the author-title catalog, and the shelf list. (A shelf list is a card catalog arranged in call-number order; catalogers use it to help them shelve like books with like.) Since the labor involved in filing cards is an enormous part of the cost of maintaining a card catalog, OCLC’s adaptable presorting was a real advantage, and for years OCLC was esteemed as a card-printing service even by universities that (like Princeton) sniffed at the quality of its growing database.
But it was the massive database itself that became OCLC’s real triumph. For a fee, a library became an OCLC member and got one or more dedicated Beehive terminals (advanced for their time, able to handle the diacritics that catalogers needed, when other computer interfaces generally offered only capital letters), each linked to Ohio. For two dollars per title, a member cataloger could look through OCLC’s records to see whether the book before her had already been cataloged by somebody else—either by the Library of Congress (whose MARC records OCLC bought and loaded into its database) or by another member library. (Each library was identified by a three-letter tag.) If she found a record, and the record looked good, she would request that OCLC print up a set of cards for it and send them to her. In this way, a library could eventually relegate a good deal of the cataloging work that had once been performed by degreed professionals to lower-paid clerks and student assistants.
And the brilliance of Kilgour’s enterprise was that if the cataloger did not find a record, she could undertake to describe the book herself, and contribute her work to the system as a “master record” for that book, for the good of all members. She wrote a sort of poem, following a set of rules more rigorous than a villanelle’s; she sent it off to people in Ohio who published it for her; and then she got paid a few dollars—in the form of a cataloging credit against future OCLC charges. The more fresh “copy” a cataloging department offered OCLC, the cheaper its use of OCLC was, and thus there was plenty of incentive for all libraries, engaged in the creation of a kind of virtual community long before there were such things as Usenet and listservs, to pump up the burgeoning database. What began mainly as a handy, unilateral way of delivering the Library of Congress MARC files to member libraries turned into a highly democratic, omnidirectional collaboration among hundreds of thousands of once-isolated documentalists: currently, there are close to thirty million records in the database, only a quarter of which originally came from the Library of Congress, the majority being the work of nearly seven thousand member libraries.
But amid this public-spirited hubbub there were some signs of trouble. “Distributed computing,” in the recent words of Paul Lindner, one of the architects of Gopherspace on the Internet, “is like driving a wagon pulled by a thousand chickens”—and distributed cataloging, although its principal database is anchored in central Ohio, exhibits a similar noisy, gabbling, drifting quality. Quality was, indeed, a serious problem from the start: predictably, some libraries were much more careful and skillful at describing books than others. Wright State University, out of misguided zeal or a lust for cataloging credits, reportedly pushed thousands of unwholesome records into the OCLC database—at least, Wright State is often dumped on now, perhaps undeservedly, by the folklorists of OCLC history. Libraries began to “blacklist” institutions whose three-letter tags were sure signs of bibliographic corruption. “The scuttlebutt got around fast as to who did sloppy cataloging,” one librarian told me. In truth, though, everyone made mistakes. The interactive, cooperative group authorship of a resource of this complexity was something utterly new, and since OCLC exercised no editorial control over the contributions pouring in from its members, the cumulative perils of Fred Kilgour’s forward-thinking system took perhaps longer than they should have to emerge.
One source of entropy was OCLC’s laissez-faire concept of the “master record.” The very first attempt to catalog a book on the database, no matter how unmasterly, how inadequate it might be to the needs of other libraries, became by default the “master record” for that book. For years—until, in 1984, OCLC granted a small group of libraries enhanced-member status, allowing them to improve upon faulty or skimpy records they encountered on their own—any sort of change to the master record was a laborious manual process. If a cataloger noticed the typo herself a week after she had conclusively pressed the send key at her terminal, she could not (if another library had tagged the record with its initials by then) correct her mistake onscreen; she had to fill out an error report and mail it (not electronically but with a stamp) to OCLC. I have heard librarians and professors of library science mention errors enshrined in the OCLC database that they haven’t bothered to take the time to try to fix—in some cases, serious errors affecting the retrievability of books to which they themselves have contributed.
The other serious weakness of the OCLC database was its lack of “authority control”—librarianship’s grand term for the act of naming entities (people, churches, government departments, periodicals, subject headings, and so on) consistently. Assume, to take a simple example using a university database, that you are assigned the task of cataloging an eminently hummable document by a person named Pjotr Iljics Csajkovszkij. Who is he? Is he perhaps the same individual as P. I. Cajkovskij? And does P. I. Cajkovskij bear some intimate relation to P. Caikovskis? Could it be that Peter Iljitch Tschaikowsky, Peter Iljitch Tchaikowsky, Pjotr Iljc Ciaikovsky, P. I. Cajkovskij, Peter Iljitsj Tsjaikovsky, Piotr Czajkowski, P. I. Chaikovsky, Pjotr Iljics Csajkovszkij, Pjotr Iljietsj Tsjaikovskiej, Pjotr Ilitj Tjajkovskij, P. Caikovskis, Petr Il’ich Chaikovskii, 1840–1893, Peter Illich Tchaikovsky, 1840–1893, Peter Ilych Tchaikovsky, 1840–1893, and Peter Ilyich Tchaikovsky, 1840–1893, are actually all the same man? If so (and this degree of title-page variation is by no means unusual for voluminous authors, many of them less well known than Tchaikovsky), the computer has to be informed of that fact outright; otherwise, symphonies and string serenades will be sprinkled haphazardly over the alphabet and a searcher won’t have any idea what he is missing.
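What it means to inform the computer of that fact outright can be suggested with a small sketch. Everything in it is an assumption made for illustration: the choice of "Peter Ilyich Tchaikovsky, 1840–1893" as the authorized heading, the names `AUTHORITY_FILE`, `authorized_heading`, and `collocate`, and the sample titles. It is not a description of how OCLC or any university catalog actually stores its authority data.

```python
from collections import defaultdict

# Assumed for illustration: one form of the name is chosen as the
# authorized heading, and every other form points to it.
AUTHORIZED = "Peter Ilyich Tchaikovsky, 1840–1893"

# Hypothetical authority file: variant form -> authorized heading.
AUTHORITY_FILE = {
    "Pjotr Iljics Csajkovszkij": AUTHORIZED,
    "P. I. Cajkovskij": AUTHORIZED,
    "P. Caikovskis": AUTHORIZED,
    "Piotr Czajkowski": AUTHORIZED,
    "Peter Iljitch Tschaikowsky": AUTHORIZED,
}


def authorized_heading(name: str) -> str:
    """Return the authorized form of a name, or the name itself if unknown."""
    return AUTHORITY_FILE.get(name, name)


def collocate(records):
    """Group (author, title) pairs under each author's authorized heading."""
    groups = defaultdict(list)
    for author, title in records:
        groups[authorized_heading(author)].append(title)
    return dict(groups)


# Made-up sample records, using name forms quoted in the text.
sample = [
    ("Pjotr Iljics Csajkovszkij", "Symphony no. 6"),
    ("P. Caikovskis", "Serenade for strings"),
    ("Piotr Czajkowski", "Symphony no. 4"),
    ("Deborah Barnekow", "The Kraken"),
]
grouped = collocate(sample)
```

A name missing from the table simply files under itself, which is exactly the failure the paragraph describes: without an entry linking it to the others, the machine has no way to guess.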
Authority control has always bedeviled the makers of catalogs, and the bigger the catalog, the more eras of publishing history it covers, the hairier things become. For Sirine and Sirin and Nabokoff-Sirin, see Nabokov. For House & Garden, see HG. For Alexander Drawcansir, Petrus Gualterus, Conny Keyber, Scriblerus Secundus, John Trottplaid, and Hercules Vinegar, see Fielding, Henry (1707–1754). For Ogdred Weary and St. John Gorey, see Gorey, Edward (1925– ). In the late seventies, the second version of the Anglo-American cataloging rules caused a convulsion of despair in libraries when it demanded that Samuel Clemens be officially called Mark Twain, just because more of his books appeared under his primary pseudonym than under his real name. The whine of power erasers was heard through the land. (In librarianship, “eraser lung” was the seventies equivalent of carpal tunnel syndrome.) It is safe to say, however, that the apostles of St. MARC completely failed to foresee how abysmally poor the computer would be at grasping the concept of human identity. A person—even a fairly inattentive person—paid to file cards in a card catalog all day can tell that “Alexander the Great, 356–323 B.c.” is the same man as “Alexander, the Great, 356–323 b.c.” and “Alexandria the Great, 356–323 B.C.”; we would also expect him to sense the unitary presence behind cards for “Montagu, Lady Mary (Pierrepont) Wortley, 1689–1762” and “Montagu, Mary (Pierrepont) Wortley, Lady” and “Montagu, Mary Pierrepont Wortley, Lady, 1689–1762”—to use examples from one online catalog.
“The card catalog,” as Tom Delsey, of the National Library of Canada, wrote in 1989, “exhibited a relatively high tolerance for deviation from literal and logical norms.… Typographical errors or inconsistencies in headings could be silently corrected in the process of filing the card; added entries that did not match exactly the corresponding main entry on the card to which they were related could nevertheless be placed in their proper sequence in the file.”
The OCLC database, on the other hand, was, until quite recently, intolerant of deviation. Authors get married, they receive honorific titles, they die and have a year put to the right of the hyphen. Or suddenly The New York Times starts spelling Mao Tse-tung “Mao Zedong.” In the face of all this bewildering variability, the object of a catalog, as Charles Cutter himself suggested in his Rules for a Printed Dictionary Catalogue, is to group together, or collocate, all the works by a given writer, and all the editions of a given work by a given writer, and all the works about a given writer’s work, and all the biographies of a given writer, in the proper groups and subgroups, rationally.
For instance, we would prefer (this example is from a search of Harvard’s HOLLIS, which I did in October 1993), when attempting to view the books written by Alfred Tennyson, that they weren’t arbitrarily distributed under three separately alphabetized, unpunctuated headings: TENNYSON ALFRED TENNYSON BARON 1809 1892 and TENNYSON ALFRED TENNYSON 1ST BARON 1809 1892 and TENNYSON ALFRED TENNYSON 1809 1892. Moreover, it would be nice if the first work listed as by TENNYSON ALFRED TENNYSON BARON 1809 1892 (in response to the command “Find Au Tennyson”) were in fact a work by Alfred Tennyson, and not a work by Tuningius, Gerardus (1566–1610), called Apophthegmata graeca, latina, italica, gallica, hispanica (“Imperfect: title-page slightly mutilated”), that happens to be autographed on the front endpaper by Tennyson. And we would prefer that the second work listed as by Alfred Tennyson were not The Kraken: for solo trombone, by Deborah Barnekow, 7 pp. (1978). (Ms. Barnekow is right, though: if Tennyson’s sea monster played an instrument, it probably would be the trombone.) It would be nice, too, if Neuronal Information Transfer, co-edited by Virginia Tennyson, didn’t intrude between several books published by the Tennyson Society and a tempting entry for a work called “Tennysoniana”—an entry that, when I accepted it, plucked me from the Tennyson list and dropped me into a list of twenty-three books by SHEPHERD RICHARD HERNE 1842–1895, none of which was “Tennysoniana.” (Many of these oddities mysteriously disappeared shortly before this article went to press, but there are thousands more. A quick check of HOLLIS on March 21, 1994, revealed that Bolingbroke, Villiers de L’Isle-Adam, Edward Bulwer-Lytton, and Bernard Berenson all have works wrongly segregated under at least three different forms of their names. Charles George Lamb’s Alternating Currents and Charles W. Lamb, Jr.’s The Market for Guayule Rubber come between editions of Charles Lamb’s Essays of Elia. And 462 records for works by Thomas Macaulay are separately alphabetized under eight versions of his name.) I have no doubt that Dale Flecker believed what he was saying when he told me that “the machine catalog is in almost no cases worse and in most cases better than the card catalog was.” But in my experience, five minutes with any online catalog is sufficient time to uncover states of disorder that simply would not have arisen in what library administrators call a “paper environment.”
When I visited OCLC, some of the staff freely admitted to me that card catalogs currently do a better job of collocation than online catalogs do. “We’re only partway there,” Barbara Strauss, then a senior product support specialist at OCLC, told me. (Ms. Strauss “knows cataloging like your tongue knows the inside of your mouth,” one of her colleagues said.) Her boss, Martin Dillon, the director of OCLC’s Library Resources Management Division, recently told an interviewer that browsing the OCLC database using “keyword indexes, author indexes, and subject-term indexes sheds a harsh light on misspellings and errors of all types.” A random sample in one 1989 OCLC study found a hundred and ten separate records for Tobias Smollett’s The Expedition of Humphry Clinker in the database, nearly half of which were potential duplicates, kept separate by minuscule variations and typos. You have to feel sorry for the sophomore accounting major who is hired as a part-time “copy-cataloger” by his university’s library, given a week’s training, and handed an old edition of Clinker left to his university’s library by an alumnus; you have to forgive him when, having drifted for a time through some of the seemingly endless, code-disfigured series of records, looking for a hit, he swears, gives up, and decides that it’s faster just to make up another record on the fly, further cluttering the system with the hundred and eleventh “edition” of The Expedition of Humphry Clinker.
In the past few years, fortunately, OCLC has done a lot of automated cleanup. (The cleanup has to be automated, for, laments Martin Dillon, “when databases get as large as ours the contribution of individual humans is severely limited. The task is so large that no practical number of humans could handle it.”) OCLC’s “DDR” software—“Duplicate Detection and Resolution”—which was first installed in 1991, compares two records at as many as fourteen points and decides whether they stand for the same book, and thus should be fused, or not: if they differ only by an ellipsis (…) at the end of a truncated subtitle, say, or if one calls the publisher “Wiley” and the other calls it “John Wiley & Sons,” the two become one. Common but hard-to-see typos like “Great Britian” and “Untied States” no longer force fictional duplicates. Over six hundred thousand redundant records are gone as a result of this work. And OCLC is now refining authority-control software that becomes more experienced—crisscrossed by more specific links among separate forms of the same person’s name, for example—the more it works through new data. Millions of orphaned records have been united since 1990. (There have been a few embarrassments along the way, naturally: “Madonna” was globally altered by OCLC to “Mary, Blessed Virgin, Saint” as part of an authority-control routine—a change that, before it was corrected, caused problems for libraries interested in cataloging the recent work of Ms. Ciccone.)
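The point-by-point comparison can be pictured with a rough sketch. It is not OCLC's DDR algorithm, whose fourteen points are not enumerated here; it checks only three fields, and the tables `KNOWN_TYPOS` and `PUBLISHER_ALIASES`, the function names, and the sample records are all hypothetical. Each field is normalized first, so that a trailing ellipsis, a familiar typo, or a longer form of a publisher's name no longer counts as a difference.

```python
import re

# Hypothetical lookup tables, built from the examples quoted in the text.
# A real system would need more care: "untied" is also a legitimate word.
KNOWN_TYPOS = {"britian": "britain", "untied states": "united states"}
PUBLISHER_ALIASES = {"john wiley & sons": "wiley"}


def normalize_title(title: str) -> str:
    """Lowercase, drop a trailing ellipsis, repair known typos, tidy spaces."""
    text = title.lower().strip()
    text = re.sub(r"(\.\.\.|…)\s*$", "", text).strip()
    for wrong, right in KNOWN_TYPOS.items():
        text = text.replace(wrong, right)
    return re.sub(r"\s+", " ", text)


def normalize_publisher(publisher: str) -> str:
    """Lowercase and map known long forms to one short form."""
    name = publisher.lower().strip()
    return PUBLISHER_ALIASES.get(name, name)


# Assumed comparison points: field name -> normalizer applied before comparing.
COMPARISON_POINTS = {
    "title": normalize_title,
    "publisher": normalize_publisher,
    "year": str,
}


def same_book(a: dict, b: dict) -> bool:
    """True only if the two records agree at every comparison point."""
    return all(norm(a[field]) == norm(b[field])
               for field, norm in COMPARISON_POINTS.items())


# Made-up records differing only by a typo, an ellipsis, and a publisher form.
first = {"title": "A History of Great Britian...",
         "publisher": "John Wiley & Sons", "year": 1975}
second = {"title": "A History of Great Britain",
          "publisher": "Wiley", "year": "1975"}
reprint = dict(second, year="1982")
```

Here `same_book(first, second)` holds while `same_book(first, reprint)` does not. Requiring agreement at every point is a deliberately cautious choice for the sketch: a wrongly fused pair loses a real edition, whereas a missed duplicate merely leaves the clutter where it was.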