Summer 2024
Modernist Archives Publishing Project
Project Lead: Alice Staveley
The Modernist Archives Publishing Project is a critical digital archive of early 20th-century publishing history. With rich metadata, the site displays, curates, and describes documents that contribute to the “life cycle” of a book. It uncovers the often invisible industry actors (editors, illustrators, reviewers, printers) who bring works into the public eye. The collection contains thousands of images from archives and special collections relating in the first instance to Virginia and Leonard Woolf’s Hogarth Press, including letters, dust jackets, financial records, paper samples, illustrations, sketches, production sheets, and other “ephemera.” It is actively expanding into other presses, with the long-term goal of building the infrastructure currently lacking in book-historical studies to support a comprehensive comparative landscape of 20th-century book publishing.
Project Members
Project Team
Alice Staveley
Senior Lecturer in English
Sophie Wu
Undergraduate Researcher - Summer, 2024
Delaney Swinton
Undergraduate Researcher - Summer, 2024
Born Analog, Made Digital
Overall, my work on this project focused on experimenting with digital tools for transcribing text from archival images related to Virginia Woolf, chiefly her reading notebooks and the financial records of the Hogarth Press, the publishing company she ran. Creating machine-readable versions of these images would, ideally, allow for new inquiry into Woolf, her reach, and the economics of both the business she ran and its impact on the early 20th-century book industry at large.
My work over the summer was largely divided between these two bodies of images.
For her reading notebooks, I wrote code to scrape the images and prepare them for automatic transcription, otherwise known as optical character recognition (OCR), on the platform Transkribus. This process mainly involved automatic scaling and cropping, binarization (converting the images to pure black and white), and de-noising. Once the images were ready, they were fed into Transkribus, which output its best guess at what Woolf had written. Because these transcriptions were imperfect, I then fine-tuned the Transkribus model in an attempt to improve its performance. To do so, I compared the Transkribus output to a manual, verified “ground-truth” transcription and corrected the output accordingly. I also assembled a dataset pairing the final Transkribus transcriptions with the “ground-truth” transcriptions, which could be used to fine-tune other OCR models in the future.
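The binarization and de-noising steps described above can be illustrated with a minimal sketch. The project's actual code is not shown here, so the specific techniques (Otsu's method for choosing the threshold, a 3x3 median filter for de-noising) and every function name are assumptions made for illustration only. The image is represented as a plain list of rows of 0-255 gray values so the example needs no external libraries.

```python
# Illustrative sketch only: Otsu thresholding and a 3x3 median filter are
# assumed techniques, not the project's documented choices.

def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels, threshold):
    """Map every pixel to 0 (ink) or 255 (paper)."""
    return [[0 if v <= threshold else 255 for v in row] for row in pixels]

def median_filter_3x3(pixels):
    """Replace each pixel with the median of its 3x3 neighborhood.
    Edges are handled by repeating the nearest border pixel."""
    h, w = len(pixels), len(pixels[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    window.append(pixels[yy][xx])
            window.sort()
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out

def preprocess(pixels):
    """Binarize with an automatically chosen threshold, then de-noise."""
    return median_filter_3x3(binarize(pixels, otsu_threshold(pixels)))

# Toy "page": light paper (220), a three-pixel-wide dark stroke (30),
# and one isolated dark speck standing in for scanner noise.
page = [[30 if 2 <= x <= 4 else 220 for x in range(7)] for _ in range(7)]
page[3][6] = 30
cleaned = preprocess(page)
```

On this toy page the isolated speck is removed while the wider stroke survives. A median filter can also erase very thin pen strokes, so on real handwriting the filter size would need to be chosen with care.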



For the financial records from the Hogarth Press, I largely worked with Google’s Gemini, a large language model (LLM) similar to ChatGPT. I researched and experimented with different prompt designs to best extract tabular and heterogeneous data. In parallel, I wrote code and modified computer vision algorithms to improve the accuracy of automatic image segmentation, which in turn makes Gemini’s transcriptions more accurate.
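As a rough illustration of what segmenting a ledger page can involve, the sketch below splits a binarized page into horizontal bands of text using a projection profile (counting ink pixels per row). This is one common approach and not necessarily the one used in the project. The function name `find_text_rows` and its parameters `min_ink` and `min_gap` are hypothetical.

```python
# Illustrative sketch only: projection-profile segmentation is an assumed
# technique; the project's actual segmentation algorithms are not described.

def find_text_rows(binary, min_ink=1, min_gap=1):
    """Return (start, end) row ranges that contain ink.

    binary  -- list of rows of pixels, 0 for ink and 255 for paper
    min_ink -- minimum ink pixels for a row to count as text
    min_gap -- bands separated by fewer blank rows than this are merged
    """
    profile = [sum(1 for v in row if v == 0) for row in binary]

    # Collect maximal runs of consecutive rows that contain enough ink.
    bands = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y
        elif ink < min_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))

    # Merge bands that are too close together to be separate entries.
    merged = []
    for band in bands:
        if merged and band[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], band[1])
        else:
            merged.append(band)
    return merged

# Toy page: two ledger lines separated by two blank rows.
P, I = 255, 0  # paper, ink
ledger = [
    [P, P, P, P],
    [P, I, I, P],
    [I, I, P, P],
    [P, P, P, P],
    [P, P, P, P],
    [P, I, P, I],
    [P, P, P, P],
]
rows = find_text_rows(ledger)
```

Each returned band could then be cropped and sent to the model as its own image, so that a prompt only has to handle one ledger entry at a time.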






All of this is combined in a pipeline that will allow these financial records to be transcribed in bulk and will reduce the human effort needed to transcribe and correct them.
My work continues with fine-tuning the extraction of the other columns of information from each image, using additional image manipulation techniques, modified prompting, and LLMs for data cleaning.


In all, my work focuses on bringing born-analog texts into the modern age in order to further new inquiry into Woolf’s work and its continued influence.
Exploring Woolf’s Reception in South Africa
My work on the Modernist Archives Publishing Project focuses on Virginia Woolf’s reception in South Africa. I filtered the Hogarth Press Order Book dataset to find the total number of Woolf’s books sold in South Africa during the 1920s and 1930s. I found that her works were bought by the Argus Printing & Publishing Company, South Africa’s largest press conglomerate of the 20th century. My first goal was to learn more about the company, but I was initially impeded by a research gap surrounding South African print culture and bookselling. To address this gap, I searched Google Scholar, JSTOR, SearchWorks, and the EBSCO MLA International Bibliography for mentions of the Argus company, cataloging my findings in a Zotero bibliography. My first research breakthrough came when I found the 1906 International Directory of Booksellers and Bibliophile’s Manual. Using the records in the manual, I was able to overcome the research gap. Among these findings, the most notable were records of the Argus Printing & Publishing Company’s bookselling storefronts in Bloemfontein, Johannesburg, and Cape Town.

Upon the completion of my bibliography, Professor Alice Staveley, my project mentor, gave me a new research objective: create a biography of the Argus Printing & Publishing Company for the Modernist Archives Publishing Project website using the records I had found. Professor Staveley and I contacted Karen Fung, the African Studies Librarian at Green Library, and Professor Isabel Hofmeyr, a renowned South African academic. We intended to draw on their expertise in South African literary history to further investigate the role played by the Argus Printing & Publishing Company as a bookselling entity. Professor Hofmeyr recommended reading Today’s News Today: The Story of the Argus Company, which covers the company’s history from its inception up until its 100th anniversary in 1956. It proved particularly helpful in painting a more detailed picture of the company’s book distribution history. By the end of the summer, I had written a biography of the Argus Printing & Publishing Company centered on its relationship to the broader history of South African publishing.
