Text encoding at the Kolb-Proust Archive for Research

Susan Kelsch, Dec. 2002

The Kolb-Proust Archive for Research at the University of Illinois is in the process of digitizing its holdings and publishing them on its Web site, gateway.library.uiuc.edu/kolbp/. The following information was shared by the Archive's librarian, Caroline Szylowicz. She generously gave her time to explain the collection and the digitization and publishing process.

The Kolb-Proust Archive for Research

The Kolb-Proust Archive for Research focuses on the work of Dr. Philip Kolb, a professor in the University of Illinois' French department between 1945 and 1992. His life's work was researching and publishing the letters of French author Marcel Proust. The Archive's holdings fall into three groups: 1.) the letters of Proust (some photocopies, some microfilms, some transcripts; any original letters are held in the University of Illinois' Rare Book Room), 2.) the notes of Kolb, which provide a survey of Proust's life and, by extension, life in Paris in Proust's time, and 3.) reference sources used to study Proust's life, including address directories, microfilmed copies of newspapers, and a complete collection of Proust's works. At this time the Archive's electronic publishing efforts are focused on Kolb's notes.

Proust's letters

Proust, who lived from 1871 to 1922, did not keep a journal but did write thousands of letters to family, friends, critics, readers, and others. Unfortunately, Proust did not date his letters, and most of them are in a casual, conversational tone, often without clear indication of the persons, places, or events being discussed. Kolb's work with Proust's letters was three-fold: 1.) finding the letters (both to and from Proust), 2.) piecing together the date of each letter and explaining the persons, places, and details described within, and 3.) transcribing the letters and writing annotations for their publication. The letters, around 5,000 in number, were ultimately published in 21 volumes, the last of which was published in 1993.

Kolb had access to some of the letters through Correspondance générale de Marcel Proust, 6 volumes of letters published by Proust's brother Robert between 1930 and 1936. In addition, some letters were made public by Proust's correspondents soon after his death. However, these early publications were problematic because they lacked explanation and because some text was suppressed in order to protect the privacy of persons still living. Kolb recognized the need to create a complete, carefully researched collection of Proust's letters. He started this project as his Ph.D. dissertation in 1938 and continued until his death in 1992. Kolb was given a large number of Proust's letters by Proust's niece. He acquired other letters, or was able to see them, by contacting Proust's friends or the estates of these friends. Proust died at age 51 in 1922, so many of his friends were still alive in Kolb's early days of research.

Proust's letters discuss a wide range of interests, as befitted a well-read, educated man of a wealthy class. He lived all his life in Paris and most of his letters are set in that city. His letters speak at great length of the people in his life, revealing a keen interest in human relations and a generous, warm spirit. He was an avid newspaper reader and discussed many current events in his letters. He often quoted from literature, giving the reader a glimpse into his influences. All these factors make the letters extremely valuable to Proust scholars. But the letters, and Kolb's notes about them, do something more: they give life to lesser-known persons and paint a unique portrait of Paris at this time.

Researching the letters

Kolb went to remarkable lengths to identify the dates, people, places, and events described in Proust's letters. Every bit of information within a letter was a clue that could be used to determine when the letter was written. And the date of one letter could lead to dates of others.

Consider an example, a letter written by Proust, discussing a recent election and anticipating a dinner party given by his good friend Prince Bibesco on the following Wednesday. By investigating elections in a newspaper of the time, such as Le Figaro, Le Gaulois, L'Humanité, or L'Action Française, Kolb may have been able to make a guess about the letter's date. If he was lucky, he would find an article in Le Figaro around the time of an election that described a lovely party that was hosted by Prince Bibesco, and identified the people who were present. This could be enough to prove the date of the letter. And Kolb might also have another letter, presumably sent a few days later, describing a dinner party at Bibesco's house, and including a newcomer whom Proust was most happy to meet. Using the clues, Kolb could determine that the two letters described the same party. The two letters together fill out details that are missing in the letters individually. In this way, Kolb built a web of information that could help him date and annotate as many letters as possible. The information was then recorded on Kolb's note cards.

In one extreme case, Kolb used a discussion of three days of thick fog in Paris to help date a letter. A graduate assistant investigated newspaper weather reports in order to make use of that information. Kolb also leaned heavily on letters that were written to Proust. Often these correspondents, unlike Proust, dated their letters. So a dated letter from a friend discussing an event in Proust's life was always a welcome discovery.

The Archive also includes address books from Paris in the early 1900s. These books have been used to identify names in more complete form and to identify dates. If Proust mentioned the neighborhood where Bibesco lived in his discussion of the party, Kolb could check the address books to get an idea of the year. Or in some cases Kolb had the original envelopes to help identify the address of the recipient. The address books and the newspapers of this time are extremely valuable tools in the Archive's collection. In fact, the newspapers, which are stored on microfilm, remain part of the University's Newspaper Library collection. But after years of use by Kolb and his assistants they gained permanent residence in Kolb's office (now the Kolb-Proust Archive space).

Often Kolb had very little information within the letters' text from which to glean clues. He made careful notes of the watermarks and other characteristics of the paper the letters were written on. Proust wrote on common, mass-produced paper, and Kolb was often able to group letters from the same ream of paper, and therefore the same time period. In some cases Kolb was able to ask Proust's friends for information directly. However, this could be problematic as well, since the information was subject to the inconsistencies of memory.

Kolb's notes

Kolb's notes are the object of the Archive's electronic publishing efforts at this time. These notes were written on 3" x 5" note cards, nearly 40,000 in total. The text is written almost entirely in French. The print is tiny, sometimes barely legible. (A few of the cards are typed, but most are handwritten.) Kolb used his own code system to indicate different pieces of information. For the cards that referred to letters, Kolb often drew little sketches of the paper's watermark or other markings. His philosophy was to keep like information on one card, so in some cases the cards are well-worn from use and crammed with minuscule text. (The bibliography cards were not index cards at all, but rather slips of paper cut to index-card size.)

Kolb organized the cards into ten categories: letters, persons, places, Proust's works (bibliography), events (chronology), stationery/watermarks, literary or artistic works, periodicals, literary citations, and persons of importance not named in Proust's letters. The bibliography cards have been digitized, that is to say they have been transcribed into SGML files, and are searchable through the Archive's Web site. The chronology cards have also been digitized and are now searchable on the Web for entries through 1912. The person cards are being digitized now.

Note cards: correspondence/letters

Kolb created a card for each letter he acquired, photocopied, or was able to see. (And he noted on the correspondence cards which of the three categories a letter belonged to.) He recorded who was or had been in possession of the letter. He quoted important text from the letter, usually referring to persons, places, or events. He made notes about these quotes, explaining the references. He recorded any outside sources he used to identify items in the letter. These cards were organized by date. Letters that he could not date exactly were sorted according to Kolb's best guess. He did not make any attempt to devise a numbering system for the letters. In many cases, Kolb relied on his own memory to find the cards he needed. He was able to remember minute details about letters and references without any kind of numbering system or the benefit of a full-text search. These cards are not being keyed at this time, although they may be in the future. (See Figure 1, appendix.)

Note cards: chronology

Kolb made around 9,500 note cards for events in Proust's life. In some cases these are famous events such as a trial or an election, but usually they are personal events such as a dinner party or a cruise. Not all the events were specifically mentioned in Proust's letters, but they were added to the file because they were useful in Kolb's research. The chronology essentially begins around the 1840s, with the lives of Proust's parents, although the earliest date is 1633, referring to one of Proust's ancestors being appointed to a government position. Again, Kolb carefully recorded the sources of his information on the chronology cards. These sources were often journals or newspapers of the time, but Kolb also used almanacs, biographies, memoirs, directories, or literary works. The chronology cards have all been keyed, and Caroline is editing them now. Cards for events up to 1912 are available on the Web at this time. (See Figure 3, appendix.)

Note cards: bibliography

Kolb's bibliography file consists of around 1,300 slips of paper, each containing a reference to a work by or about Proust. The notes refer to works by Proust, including letters, articles, interviews, prefaces, and translations, as well as his original literary works, published during his lifetime or posthumously. The notes also refer to texts about Proust, either periodical articles or monographs published before 1922 (the year Proust died), or a series of tributes to Proust published between his death and Jan 1, 1923. The bibliography note "cards" are digitized and available on the Web.

Note cards: persons

Kolb kept note cards of people Proust wrote to or referred to in his letters. These cards are full of information about when they lived, where they lived, whom they married, and titles they held. Kolb noted relevant quotes from the letters that refer to these people. And he carefully recorded the sources of this information. The Archive is in the process of keying these cards. (See Figure 2, appendix.)

Digitization process, early decisions

The digitization project was initiated in the fall of 1993 by the French department and the University Library, with the endorsement of Kolb's family. At that time the keepers of the Proust Archive reviewed their options for digitizing the materials. They quickly rejected an idea to capture digital images of the cards and make them available as simple image snapshots. They felt that server space would be an issue. But, more importantly, they recognized the need for free-text searching and the possibilities of hypertext links from card to card. And, on a basic level, digitization would make the text more usable, since many of the cards contained cryptic codes and were difficult to read. The planners considered using databases or other methods. They eventually settled on marked-up text files because they wanted to keep the note card model: keeping information together as discrete cards, but including some mechanism that would maintain Kolb's original sort order of cards within the 10 file types.

The project planners decided to key the text into text files marked up with SGML tags, using the TEI Lite DTD. Caroline describes the tags that are used as "structural," rather than entirely descriptive. This relates to another reason the planners decided against using a database to store all the cards. Because of the various types of data on the cards, it was not feasible to devise standard database fields for all the information, or to use purely descriptive markup tags on every element in the text files. So the planners compromised by using descriptive tags in some cases, but generic tags in others. For example, in the person cards, the birth and death dates of the person are recorded as items in a list (the TEI Lite "List" structure is used for the person cards). They are not identified as dates per se. However, dates are identified as such when they are encoded (in two different ways, in fact) in the chronology cards. The planners had to make decisions about how detailed the tags would be, considering the resources of the project itself and the future needs of searchers using the data.

The TEI Lite tag set has worked very well for the project's needs. Caroline reports the addition of only one tag outside the standard TEI Lite set: a tag that is generated by scripts run on the files after they are keyed. Caroline notes that the Kolb note cards are regular enough in form that the keying project planners are able to map out a template of relevant tags for data on a file-by-file basis, and these templates have rarely changed as the text is entered. This contrasts with other keying projects that Caroline has worked on, where the project managers must identify new tags for unanticipated text features as the keying project moves along.

The process

The Archive transcribers use Author/Editor, an SGML editor created by SoftQuad. In the three files that have been touched so far, the transcribers have used about 30 different TEI Lite markup tags. The cards are keyed into Author/Editor in the order they appeared in Kolb's files, usually in batches of 10-20 records. After the records are keyed into a batch, the batch file is run through a script that adds a unique identifying number to each record by way of the tag. (The start of a new card is identified by this tag.) Caroline must launch the script, and she must tell the script what number to start with as it numbers the cards. The numbers are preceded with a single letter to identify which file they belong to (c = chronology, p = person, etc). These unique identifiers can then be used to create hyperlinks among records. Caroline, the editor for the keying work, adds the hypertext links to the records by hand in Author/Editor after the cards are keyed (usually by a graduate assistant). Caroline keeps an ever-growing notebook of reminders to herself to go back to cards to add hyperlinks or to add hyperlinks to upcoming cards. At this point, hyperlinks are only being made to and from cards within the same file type.
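
The numbering step described above can be sketched in a few lines of Python. This is only an illustration, not the Archive's actual script: the function name, the record representation (plain strings), and the absence of zero-padding in the identifiers are all assumptions. Only the prefix letters "c" and "p" come from the description above.

```python
from typing import List, Tuple

# Prefix letters named in the text; other card files would need their own
# letters, which are not documented here.
FILE_PREFIXES = {"chronology": "c", "person": "p"}


def number_records(records: List[str], file_type: str, start: int) -> List[Tuple[str, str]]:
    """Pair each keyed record with an identifier such as 'c1201'.

    `start` is supplied by the person launching the script, so numbering
    can continue from where the previous batch ended.
    """
    prefix = FILE_PREFIXES[file_type]
    return [(f"{prefix}{start + offset}", text) for offset, text in enumerate(records)]


# A hypothetical batch of three keyed cards, in Kolb's original order.
batch = ["first keyed card", "second keyed card", "third keyed card"]
numbered = number_records(batch, "chronology", start=1201)
for card_id, text in numbered:
    print(card_id, text)
```

Because the identifiers are assigned in keying order, they preserve Kolb's original card sequence as well as serving as link targets.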

Caroline or the keyers sometimes add notes to indicate items they feel are overt mistakes on Kolb's part, or when they can cite scholarship that has brought new information to light since Kolb's death. These notes are identified with Caroline's initials or the initials of another researcher.

Some special considerations were made for certain fields. For example, date information in the chronology cards was keyed in two ways. First the information was keyed exactly as it was found on the note cards. Then it was keyed into a different field in a normalized 8-digit format: yyyymmdd. Caroline points out that the normalized date is a drawback to the digitization process, as it does not capture the nuances that Kolb identified in his work. He often identified dates as "a few days before Sep 1," or "in the week of Mar 16." This goes into normalized form as Sep 1 or Mar 16, and the nuance is lost. The transcribers have to make these kinds of compromises for the normalized format.
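
The double keying of dates can be illustrated with a small Python sketch. The class and function names are hypothetical, and the year 1904 is chosen only for the example. The sketch shows why both forms are kept: the normalized string supports range searches, while Kolb's wording survives only in the text as written.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CardDate:
    as_written: str  # the date exactly as it appears on the note card
    normalized: str  # the 8-digit yyyymmdd form


def make_card_date(as_written: str, anchor: date) -> CardDate:
    """Keep the original wording and derive the normalized form from an anchor day."""
    normalized = f"{anchor.year:04d}{anchor.month:02d}{anchor.day:02d}"
    return CardDate(as_written, normalized)


def in_range(card_date: CardDate, start: str, end: str) -> bool:
    """Fixed-width yyyymmdd strings sort chronologically, so plain comparison works."""
    return start <= card_date.normalized <= end


# Illustrative example: the year is an assumption made for this sketch.
vague = make_card_date("a few days before Sep 1", date(1904, 9, 1))
print(vague.normalized)                        # 19040901
print(in_range(vague, "19040801", "19040930"))  # True
```

Note that a search for dates before 19040901 would miss this card, even though the event happened before September 1. That is exactly the loss of nuance described above.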

Authority control: person names

Caroline and the transcribers quickly realized a need to identify clearly the persons being discussed in the chronology and bibliography cards. Caroline started using a home-grown authority control system, written in Perl, to capture name information. Thus far the authority control file includes over 6,000 names. The system stores complete names, pseudonyms, titles, and names of spouses. It also includes other information, such as birth, death, and marriage dates. Each person is given a unique identifying number. Person names are searchable on the Web site in a separate person search. After searching for a name (as the author of a work or within a note card's text), the user may click through to the chronology records that discuss the specified person.
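
The kind of record such an authority file holds might look like the following Python sketch. The real system is a Perl program whose internals are not described here, so the field names, the index structure, and the sample persons (deliberately invented placeholders) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PersonRecord:
    person_id: int                       # unique identifying number
    full_name: str
    pseudonyms: List[str] = field(default_factory=list)
    titles: List[str] = field(default_factory=list)
    spouses: List[str] = field(default_factory=list)
    birth: Optional[str] = None          # dates kept as strings, e.g. yyyymmdd
    death: Optional[str] = None

    def name_forms(self) -> List[str]:
        """Every form under which this person may appear on a card."""
        return [self.full_name, *self.pseudonyms, *self.titles]


def build_index(records: List[PersonRecord]) -> Dict[str, List[int]]:
    """Map each name form to the ids that share it; one form may be ambiguous."""
    index: Dict[str, List[int]] = {}
    for record in records:
        for form in record.name_forms():
            index.setdefault(form.casefold(), []).append(record.person_id)
    return index


def lookup(index: Dict[str, List[int]], name: str) -> List[int]:
    return index.get(name.casefold(), [])


# Invented placeholder persons, not real entries from the Archive.
records = [
    PersonRecord(1, "Jean Exemple", pseudonyms=["J. E."], titles=["comte d'Exemple"]),
    PersonRecord(2, "Paul Exemple", titles=["comte d'Exemple"]),
]
index = build_index(records)
print(lookup(index, "J. E."))            # [1]
print(lookup(index, "Comte d'Exemple"))  # [1, 2]
```

Returning a list of ids rather than a single id reflects the fact that a title alone can refer to more than one person, which is where the manual research comes in.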

This authority control work has proven to be one of the most time-consuming aspects of the digitization process. Caroline researches all the names. She uses the sources cited in Kolb's cards or other sources as necessary. She must carefully analyze personal titles to identify variant names. Researching women's names can be particularly difficult, as women of this time were often identified as someone's wife rather than being named specifically. So if a count or lord had more than one wife in his lifetime, Caroline will have a particularly difficult task determining which lady is which. The creators of these title and name conventions clearly were not thinking of late-20th-century digitization and authority control projects. The Paris newspapers and address directories are particularly useful in this name research. Caroline expects the names authority file to continue to grow as the Archive digitizes more cards (particularly the person cards). Caroline also maintains a titles authority file (similar in structure to the person names authority file) to manage bibliographic information.

File management

As mentioned earlier, the cards are typically digitized in batches of 10 to 20 cards. In some cases the groups correspond to a specific aspect of the cards. For example, in the case of the chronology cards the keyers sometimes organized all cards for one year into a single file. But for some years, the number of cards by far exceeds a reasonable number to contain in one batch, so the batches are split arbitrarily. The information contained within each batch file of cards is identified by its file name. These names indicate the type of card, the dates involved (in the case of chronology cards) or the alphabetical range covered (in the case of persons), the number of cards digitized within each file, and the initials of the last person to edit the file.
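
A naming convention of this kind can be built and parsed mechanically. The exact pattern the Archive uses is not given here, so the underscore-separated layout, the ".sgm" extension, and the sample values in this Python sketch are purely hypothetical. Only the four components come from the description above.

```python
from typing import NamedTuple


class BatchName(NamedTuple):
    card_type: str   # e.g. "c" for chronology, "p" for person
    span: str        # a year for chronology cards, an alphabetical range for persons
    card_count: int  # number of cards keyed in the file
    editor: str      # initials of the last person to edit the file


def build_batch_name(info: BatchName) -> str:
    """Assemble a file name from its four components (hypothetical layout)."""
    return f"{info.card_type}_{info.span}_{info.card_count}_{info.editor}.sgm"


def parse_batch_name(name: str) -> BatchName:
    """Recover the components; assumes no underscores inside a component."""
    stem = name.rsplit(".", 1)[0]
    card_type, span, count, editor = stem.split("_")
    return BatchName(card_type, span, int(count), editor)


name = build_batch_name(BatchName("c", "1904", 18, "cs"))
print(name)                    # c_1904_18_cs.sgm
print(parse_batch_name(name))
```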

Chronology and bibliography files on the Archive's Web site

The Kolb-Proust Archive Web site is hosted on a server in the University of Illinois' Grainger Engineering Library. This seemingly incongruous placement is due to Grainger's involvement in the Digital Libraries Initiative project, which was starting at Grainger in the early 1990s around the time the Kolb-Proust digitization process began. The keyed chronology and bibliography files, as well as the person names and work names authority file records, are stored as text files on the Grainger server. After the text is keyed and edited, a copy of every file is archived in the Archive's server space, and another copy (a text file) is sent to the server at Grainger that operates the search engine. Sending the data to Grainger is a complex process in itself. Through a series of scripts, Caroline manipulates the data and builds the search indexes. At this time the field is added to the data. Characters with diacritical marks are translated to ISO codes and Caroline searches for specific characters, such as the ampersand, that are problematic for the Web site's search engine. Updating records is a time-consuming process for Caroline, because of the time spent in rebuilding indexes and re-sending the files.
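
The character handling mentioned above might look something like the following Python sketch. It assumes that "ISO codes" means the ISO entity names commonly used in SGML (such as `&eacute;`), which the text does not state outright, and it uses Python's standard `html.entities` table as a stand-in for those names. The function names and sample strings are made up for illustration.

```python
from html.entities import codepoint2name
from typing import List, Tuple


def find_problem_characters(text: str, problem_chars: str = "&") -> List[Tuple[int, str]]:
    """Report the position of each character known to upset the search engine.

    Run this before the translation step, since entity references
    themselves begin with an ampersand.
    """
    return [(i, ch) for i, ch in enumerate(text) if ch in problem_chars]


def to_entity_references(text: str) -> str:
    """Replace non-ASCII characters with named entity references where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")  # numeric fallback when no name is known
    return "".join(out)


print(find_problem_characters("Proust & Cie"))          # [(7, '&')]
print(to_entity_references("Correspondance générale"))  # Correspondance g&eacute;n&eacute;rale
```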

The Archive's Web site offers end users several search options. Users may search for text strings or date ranges within the chronology or bibliography cards. The search screens are in French only, unlike the rest of the Web site, which is available in either French or English. As the text of the cards themselves is in French, users are expected to search for strings in French. Users may also search for names of persons or names of works. These two searches access the name authority work that Caroline has done so far.

Current and future issues

The Grainger server and search engine continue to support the SGML-encoded Kolb-Proust data, despite a move to XML text encoding in many of its other projects. The Grainger managers have said that a move to XML and the use of a newer SQL-based search engine could result in more robust search options and an easier process for building indexes. A move to XML will be transparent to Caroline as far as the text encoding is concerned, as it is largely XML-compliant now. However, changes to file name extensions and other minor details will be made. The most important change from Caroline's point of view will probably be a change in her software editor. Also, a move to newer tools may reduce the number of scripts outside of Author/Editor that she has to run before the files are ready for publishing on the Web.


The Kolb-Proust Archive digitization project stands out in two important ways.

  1. The Archive is not making any immediate plans to digitize the letters of Proust. The Archive's managers have chosen instead to focus on the scholarship of Kolb and its importance in its own right.
  2. The digitization project will bring access to materials beyond what is possible in the materials' original form. Already the full-text searching option, made possible by digitization, makes the Archive's note cards accessible in a powerful way. And when all the cards are digitized, the hyperlinks between cards will be a useful navigation option.

    But the digitization has added something else to the Archive: the digital versions of the cards are more useful than the print versions. In most archives, individual items are unique treasures, useful in their own right. The challenge is finding them at all. But Kolb's note cards were most useful to Kolb himself. Simply finding the paper card about Prince Bibesco is not enough to use it. The digitization project has added clarity and context to the information written on the cards, making them accessible in new, important ways.

In the future, Caroline hopes that Proust's letters will be re-published, both in print and in electronic form. She hopes that the letters will appear electronically in an interactive online forum where Proust scholars can contribute their knowledge to what has been published and documented. Issues of copyright and resources will need to be addressed before this forum can be built. Caroline predicts that Kolb's note cards will be integrated in this forum in some way. In the meantime, the Archive focuses on publishing Dr. Kolb's work on the Kolb-Proust Archive Web site.


Figure 1. A letter card, referring to a letter to Proust's friend Prince Bibesco, discussing a cruise Proust took in 1904. All cards in this Appendix are enlarged. The original cards are standard 3" x 5" index cards.

Letter card.

Figure 2. A person card, discussing various references to, and sources about, Bibesco.

Person card.

Figure 3. A chronological "card" (slip of paper cut to size), with information about the cruise.

Chronological card.

Figure 4. The same chronological card in digitized form, as seen in Author/Editor.

Chronological card, Author/Editor form.

Figure 5. The SGML form of the same chronological card.

Chronological card, SGML form.

Created on ... June 15, 2003