LDA Modeling identifies clusters of words with a disproportionately high probability of occurring together in text.
The algorithm groups highly correlated words into a number of topics. The size of each node represents the probability that the word occurs in the topic. Select a topic to see its related words and how those words appear in other topics.
Mouse over a word to highlight the topics that it occurs in.
The “Linked Reading” project, currently in its prototype phase, brings together two formerly discrete digital humanities projects, “Global Renaissance” (http://www.renaissanceglobe.com) and “Shakeosphere” (shakeosphere.lib.uiowa.edu). The projects use different but related datasets, and linking them gives us a powerful new perspective on the texts contained in each. Global Renaissance draws on the full-text corpus of Early English Books Online / Text Creation Partnership, currently more than 25,000 English works published from 1470 to 1700. Shakeosphere draws on the bibliographic records (not full text) in the English Short Title Catalogue, which covers nearly half a million works published in England, Scotland, Ireland, and the Americas before 1800.
The two projects have developed very different methodologies. Global Renaissance mines full-text documents for terms and concepts that have a high likelihood of occurring together. This process is known as “topic modeling,” and it uses the Latent Dirichlet Allocation (LDA) algorithm. LDA measures co-occurrence by identifying clusters of words that have a disproportionately high probability of occurring together in the same text or paragraph.
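To make the idea of topic modeling concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the Global Renaissance pipeline: the function names, the hyperparameters (alpha, beta, the iteration count), and the four-document toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized docs (lists of words) by collapsed Gibbs sampling.

    alpha and beta are illustrative smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]           # tokens per (doc, topic)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                         # tokens per topic
    assignments = []                                       # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: topics that are common in this document and that
                # already favour this word receive more weight.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, topic_total


def top_words(topic_word, k, n=4):
    """Return up to n words with the highest counts in topic k."""
    counts = topic_word[k]
    present = [w for w in counts if counts[w] > 0]
    return sorted(present, key=lambda w: (-counts[w], w))[:n]


# Invented toy corpus; these are not EEBO-TCP texts.
docs = [
    "ship sea voyage merchant sea trade".split(),
    "merchant trade ship voyage sea".split(),
    "king crown court throne king".split(),
    "court throne crown king court".split(),
]

doc_topic, topic_word, topic_total = lda_gibbs(docs, num_topics=2)
for k in range(2):
    print(k, top_words(topic_word, k))
```

Words that keep appearing in the same documents tend to be pulled into the same topic as sampling proceeds, which is the co-occurrence behaviour described above. A corpus of real size would normally be handled with an established topic modeling library rather than a hand-written sampler.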
Shakeosphere, by contrast, uses natural language processing to extract names and roles, such as “printer,” “publisher,” or “bookseller,” from the publication field of the English Short Title Catalogue. Shakeosphere makes it possible not only to search for all the works associated with particular printers or booksellers, but also to illustrate the vast social network of printers, publishers, authors, and booksellers active in England, Ireland, Scotland, and the Americas.
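As a rough illustration of what extracting names and roles from a publication field involves, the sketch below uses simple regular expressions as a stand-in for Shakeosphere's natural language processing, which is not described here. The cue phrases, the role mapping, the name pattern, and the sample imprint are all assumptions invented for the example.

```python
import re

# Hypothetical cue phrases and the roles we assume they signal.
# Real imprints vary far more than this short list allows.
ROLE_CUES = [
    ("printer", r"printed by"),
    ("bookseller", r"sold by"),
    ("publisher", r"for"),
]

# A deliberately naive name pattern: two or more capitalized words.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"


def extract_roles(imprint):
    """Return (role, name) pairs in the order they appear in the imprint."""
    found = []
    for role, cue in ROLE_CUES:
        # Only the cue phrase is matched case-insensitively, so that
        # lowercase words such as "for" are never mistaken for names.
        pattern = rf"\b(?i:{cue}) {NAME}"
        for match in re.finditer(pattern, imprint):
            found.append((match.start(), role, match.group(1)))
    return [(role, name) for _, role, name in sorted(found)]


# Invented imprint, not a record from the English Short Title Catalogue.
imprint = (
    "London: Printed by John Smith for Thomas Brown, "
    "and are to be sold by William Jones, 1620."
)
print(extract_roles(imprint))
```

Abbreviated names, Latin imprints, and variant spellings would all defeat patterns this simple, which is one reason a fuller natural language processing approach is needed for the real catalogue.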
We built a new system that links the projects and allows investigators to pivot between the two approaches and datasets. Any given text yields dozens of topics, and any given topic is constituted by a network of hundreds of texts, authors, publishers, and booksellers.
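One way to picture this pivot is as a pair of lookups: from a text to its topics, and from a topic back to the texts and people behind it. The sketch below is only a guess at such a structure, not the Linked Reading implementation. The identifiers, topic weights, names, and the min_weight threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: IDs, weights, and names are invented.
text_topics = {
    "text-001": {"topic-3": 0.6, "topic-7": 0.4},
    "text-002": {"topic-3": 0.9},
}
text_people = {
    "text-001": [("printer", "John Smith"), ("bookseller", "William Jones")],
    "text-002": [("printer", "John Smith"), ("publisher", "Thomas Brown")],
}


def build_topic_index(text_topics):
    """Invert text -> topics into topic -> [(text_id, weight), ...]."""
    index = defaultdict(list)
    for text_id, topics in text_topics.items():
        for topic_id, weight in topics.items():
            index[topic_id].append((text_id, weight))
    return index


def people_for_topic(topic_id, topic_index, text_people, min_weight=0.5):
    """Pivot from a topic to the people attached to its strongest texts."""
    people = set()
    for text_id, weight in topic_index.get(topic_id, []):
        if weight >= min_weight:
            people.update(text_people.get(text_id, []))
    return sorted(people)


topic_index = build_topic_index(text_topics)
print(people_for_topic("topic-3", topic_index, text_people))
```

Starting from a text, text_topics gives its topics; starting from a topic, topic_index and text_people give the print network around it.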
In the prototype phase, we have visualized topic models for all texts produced by Shakespeare's printers and publishers during his career, from 1594 to the publication of the second Folio in 1632, alongside the print networks from this time range. We are expanding the linked interface to include topic models and print networks from every year from 1500 to 1700, allowing users to visualize the discourses shaping English Renaissance culture.
Blaine Greteman (University of Iowa),
James Lee (Grinnell College),
Ezra Edgerton (Grinnell College),
David Eichmann (University of Iowa),
Jason Lee (Independent)
Past project members:
The Andrew W. Mellon Foundation Digital Bridges for Humanistic Inquiry Grant,
The Obermann Center for Advanced Studies, University of Iowa,
The Data Analysis and Social Inquiry Lab, Grinnell College,
The Innovation Fund, Grinnell College
Visualizations adapted from Termite:
Jason Chuang, Ashley Jin, and the Stanford Visualization Group.