Featured Speaker Bios

Jefferson Bailey is Director, Web Archiving Programs at Internet Archive, which includes Archive-It, IA’s web archiving service used by 300+ libraries, archives, and museums, as well as contract crawling, research services, and grant and collaborative special projects. He is PI on the recently awarded IMLS grant “Systems Interoperability and Collaborative Development for Web Archiving” and is on the Steering Committee of the International Internet Preservation Consortium.

Vinay Goel joined the Internet Archive’s Web Group in 2006. At the Archive, he has run focused crawls, deployed web archive access and index infrastructures, and developed automated tools to help improve the quality of web crawls and to extract and analyze large portions of the Web Archive. He also administers the Web Group’s Hadoop cluster and applies big data solutions to gain insights from web-scale datasets.

He graduated from Lehigh University with an M.S. in Computer Science. While at Lehigh, he researched techniques to combat web spam, and mobility management schemes in Disruption Tolerant Networking. Outside the office, he enjoys exploring the outdoors and has a thorough love of food, movies and books.

Tom Smyth is a senior librarian and Manager of the Digital Capacity group at Library and Archives Canada. He has been the business owner of LAC’s Web Archiving Program and multiple digital special collections at LAC since 2009. He holds a Master of Arts degree and a Master of Information Studies degree from the University of Toronto.

He will be speaking on LAC’s recent web archiving accomplishments, and will also introduce a sample dataset that LAC is making available for the hackathon.

Attendees and Biographies

Alejandro Paz is an Assistant Professor of Anthropology at the University of Toronto Scarborough, with graduate affiliations in the departments of Anthropology and Linguistics. His new research on the globalization of Israeli online news includes a digital component to track, archive, and visualize the use of Israeli English-language news across the digital news-scape. He and an interdisciplinary team have been developing “mediaCAT” as a curated search and archive web-app. Its current version is available on GitHub, and a public portal will be available here very soon!

Alexander Nwala is currently a Computer Science Ph.D. student and research assistant at Old Dominion University, Norfolk, Virginia, USA. He is interested in processing human-generated data, so his research interests include Text Mining and Web Sciences. A research question he is interested in exploring in this hackathon concerns the possibility of providing entity access to Web archives (as opposed to the current URI/datetime access). His hobbies include drawing, listening to movie scores, and jazz.

Allison Hegel is a PhD student in the Department of English at UCLA. Her research uses digital humanities methods to investigate contemporary literature. She is also an editor for The Programming Historian, a website that publishes peer-reviewed tutorials to help humanists learn to use digital tools. At the Archives Unleashed Hackathon, she hopes to investigate how people talk about books on the web.

Emily Maemura is a second year doctoral student and a research assistant with the Digital Curation Institute at the University of Toronto’s Faculty of Information (iSchool). Her research focuses on research data management and digital curation for social sciences and humanities disciplines that study the Internet. In particular, she is interested in approaches and methods for viewing and accessing web archive data and research collections through different forms of visual representation and interaction.

Eric Oosenbrug is a PhD candidate in the History and Theory of Psychology at York University. His research explores a social history of postwar behavioural medicine. He is interested in using social networks and geospatial data to explore the institutional development of health psychology and text analysis to contextualize changes in the meanings of health and wellness in the late twentieth century.

Evan Light is an FRQSC Postdoctoral Fellow at Concordia University’s Mobile Media Lab where he does research on telecommunications, surveillance, privacy and finance. Evan is a collaborator with the Snowden Digital Surveillance Archive and creator of the Snowden Archive-in-a-Box. He’d like to bring the following questions: What best practices can be established for archiving and making accessible whistleblower archives and other controversial materials? What sort of archival tools can be used to analyze the Snowden documents and to tie them to other datasets and analytical processes, be they grassroots, political or commercial?

Federico Nanni is a PhD student at the Centre for History of Universities and Science (University of Bologna) and a visiting researcher at the Data and Web Science Group (University of Mannheim). He has a background in contemporary European history and digital humanities. His research interests include considering web materials as primary sources for historical studies and applying natural language processing methods to answer relevant research questions in the humanities and social sciences.

He has experience working with computational methods for the detection of political positions and with supervised learning methods for the classification of political speeches into topics.

During the Hackathon, he would be interested in investigating the possibility of automatically detecting the positions of political parties by using the text content of their websites as “manifestos”. He believes it could also be fascinating to understand whether it is possible to diachronically detect “political changes” by analysing the different “layers” of the web archive.

George Raine helped develop the Snowden Digital Surveillance Archive, a publicly available archival repository of documents leaked by Edward Snowden to media outlets. He is interested in improving societal access to information in order to foster a citizenry informed on the issues of digital surveillance.

Helge Holzmann is a PhD student at the L3S Research Center in Hanover, Germany. Currently, he is working in the EU project Alexandria with a focus on investigating methods to use, process and analyze Web archives. One of his interests in this context is searching Web archives by incorporating external sources, like social networks or external websites, which provide additional information, such as temporal relevance. Furthermore, he has a great interest in tools related to big data analysis as well as Web archives. ArchiveSpark, a Spark framework that he has been developing in collaboration with Vinay Goel from the Internet Archive, facilitates efficient access to Web archives. If you want to analyze archives or extract corpora for your own research, please give it a try:

He sees the Archives Unleashed Hackathon as an opportunity to get in touch with researchers using Web archives as their scholarly source, better understand their requirements and existing challenges, and build suitable search and access tools tailored to their needs.

Jaspreet Singh is a Ph.D. candidate from Leibniz University Hannover interested in Archival Information Retrieval. He is particularly interested in building a search engine that can support researchers like historians to better explore archives. He enjoys prototyping and experimenting with web based applications.

At the moment he is looking at how best to design search engines for complex information needs in archives.

Jeremy Wiebe is a PhD candidate in History at the University of Waterloo, where he studies twentieth-century Canadian Mennonite culture. He also has a background in Computer Science, and has been working for the past two years on projects for the Web Archives for Historical Research Group, including Warcbase. He is happy to assist with your Warcbase projects at the hackathon.

Jillian Harkness is currently a graduate student at the University of Toronto’s Faculty of Information. She began working as an Archival Intern on the Snowden Archive in January 2015, and, alongside others from the Snowden Archive team, will be exploring the potential of web archives at the hackathon from the perspective of the Snowden project.

Jonathan Armoza is currently a PhD student in English Literature at New York University, where he develops and critiques methods for digital humanities text mining. He is interested in understanding the relationships between patterns in large- and small-scale textual data, and has focused on topic modeling in the past few years. He has recently developed “Topic Words in Context” (TWiC), an interactive workspace for concurrently visualizing corpus-, cluster-, and text-level patterns in topic models. For this hackathon he would like to run materials through TWiC and other topic modeling visualizations to see how those data relationships and the semantic coherence of web archive-generated topics differ from past print-genre forms.

Katherine Cook is a PhD student in the Department of Archaeology at the University of York (UK) and a sessional instructor at McMaster University. Her research uses cemeteries and commemoration to explore emotion, heritage and identity in the colonial histories of Canada, the Caribbean and Europe.

During this hackathon, she would like to engage with issues of diversity, multi-vocality, and representation/underrepresentation in archiving and historical narratives, particularly in relation to material culture, experiences of the past and multimedia. How do we make marginalized histories more visible in digital archives, given their traditional invisibility in historical records?

Kelsey Utne is a PhD student in the Department of History at Cornell University, where she studies public history in Modern South Asia, with a focus on commemorations of traumatic historical ruptures and nationalist movements. She is interested in public engagement with ideas of national memory and heritage on digital platforms, including web forum message boards, digital memorials, and Wikipedia entries. Prior to beginning her studies at Cornell, Kelsey earned an MA in South Asia Studies from the University of Washington in Seattle and a dual BA/BS in History and Political Science at Salem State University.

Kim Pham is the Digital Projects and Technologies Librarian at the University of Toronto Scarborough Library’s Digital Scholarship Unit. She is responsible for analysing system requirements, providing technical support, digital instruction, and product management for projects.

Krista Stapelfeldt is coordinator of the Digital Scholarship Unit at the University of Toronto Scarborough, and was previously Repository Manager at the University of Prince Edward Island and Manager of the Islandora Project. She is interested in learning more about how to help historians and create useful archives of digital material, as well as catching up on recent developments after a maternity leave.

Kyle Parry is CLIR Postdoctoral Fellow at the University of Rochester, jointly associated with Visual and Cultural Studies and the Digital Humanities Center. Having been involved in web archiving through both practice and scholarship during doctoral study at Harvard University, Kyle is excited to explore critically engaged methods/tools for visual, media, and environmental studies and other fields. He also sees an opportunity to explore the .gov domain as a vector of response and rhetoric around events of environmental violence like Katrina and the BP Oil Spill.

Mat Kelly is a PhD student of computer science at Old Dominion University. He has authored and publicly deployed multiple web archiving tools including WARCreate, WAIL, and Mink during his academic career. His research focus is on the facets of personal and private web archiving as it fits within the structure and dynamics of institutional web archives.

Neha Gupta is SSHRC Postdoctoral Fellow in Geography at Memorial University. At #hackarchives, she will examine the range of sources on post-colonial Indian archaeology as they reflect a deep interest in the Indian past, and engagement of the Indian middle class with Web tools and technologies. She wants to deepen our knowledge of Indian archaeologists as members of highly structured social groups while facilitating the development of Web-based platforms and tools.

Nick Ruest is the Digital Assets Librarian at York University, and co-Principal Investigator of the SSHRC grant “A Longitudinal Analysis of the Canadian World Wide Web as a Historical Resource, 1996-2014”.

At York University, he oversees the development of data curation, asset management and preservation initiatives, along with creating and implementing systems that support the capture, description, delivery, and preservation of digital objects having significant content of enduring value. He is also active in the Islandora and Fedora communities, serving as Project Director for the Islandora CLAW project and as a member of the Islandora Foundation’s Roadmap Committee and Board of Directors, and contributing code to the project. In the past he has served as the Release Manager for Islandora, the moderator for the OCUL Digital Curation Community, the President of the Ontario Library and Technology Association, and President of McMaster University Academic Librarians’ Association.

Niel Chah is a student in the Master of Information (MI) program at the University of Toronto iSchool, with a concentration in Information Systems and Design. He has lived most of his life in Vancouver, where he studied political science at UBC. Professionally, he has been working online as a freelancer for the past few years. For this hackathon, he is interested in exploring how trends in the frequency of word usage in digital web archives can be related to social, political, and cultural phenomena in the physical world. Are certain trends easier to measure in the archives, and what are the limits of analyzing archival data?

Patrick Egan is a second year PhD researcher working on a digital humanities and ethnomusicology project at University College Cork in Ireland.

It is the Seán Ó Riada Digital Arts and Humanities project: he is developing an immersive interface that will let specialists and general audiences interact with digital material from a special collection and references from around the world wide web.

The initial prototype for this may be accessed at: Currently, he is combining a number of these datasets through a web service built in PHP and MySQL with JSON, and is also testing front-end frameworks and libraries using JavaScript.

His interest in this hackathon comes from some burning questions that he has about user interaction in browser based interfaces. What are the challenges and possibilities for us as developers in our attempts to unleash the archive?

Petra Galuscakova is a PhD student at the Institute of Formal and Applied Linguistics at the Charles University in Prague, Czech Republic. Petra is mainly interested in retrieval of relevant segments of audio and video from large multimedia archives. At the hackathon, she would like to explore intentions of users searching particular topics in the video archives. Specifically, she is interested in detection of intentions of users browsing the video collection using automatically generated links between similar segments of videos.

Richard Rath is associate professor of History and Director of the Digital Arts and Humanities Initiative at the University of Hawaiʿi at Mānoa. He has had a hand in various web development projects since the early 1990s, including the first truly global online news aggregator, the (long-defunct) Omnivore News and Information Service, and the longer-lived Electronic Policy Network. He has long been a proponent of the DIY ethic and open-source, open-access approaches. The project he hopes to develop at Archives Unleashed is a timeline-based interface to the Internet Archive snapshots of starting with the Obama Administration.

Rosa Iris R. Rovira is a bachelor’s student in Social Sciences with a specialization in Communication at the University of Quebec in Outaouais. She studied Art History at the University of Oriente, Cuba (2007-2010). She is currently working as a research assistant at the University of Quebec in Outaouais.

Ruqin Ren is a PhD student at University of Southern California, Annenberg School for Communication. His research interest is in studying the co-evolution process between the collective learning and individual learning on crowdsourcing websites. He uses social network analysis, semantic analysis, and quantitative methods as his major analytical tools.

Ryan Deschamps is a PhD student studying at the Johnson Shoyama Graduate School of Public Policy and a former public librarian interested in the Government of Canada web archive as a web service and source for institutional memory within the public service.

Sawood Alam is a PhD student of Computer Science at Old Dominion University, Norfolk, Virginia. Sawood received his B.Tech. degree in Computer Science from Jamia Millia Islamia, New Delhi, India in 2008 and his M.Sc. in Computer Science from Old Dominion University, Norfolk, Virginia in 2013. His Master’s thesis was titled “HTTP Mailbox – Asynchronous Restful Communication”. Sawood is currently working on his Ph.D. thesis, titled “Archive X-Ray – Web Archive Profiling for Efficient Memento Aggregation”. Apart from his academic research in the fields of Web Science and Web Archiving, he is also interested in solving technical challenges of Urdu and other Right-to-Left complex script languages. Sawood actively follows the latest Web technologies.

Shane Martin is a full-stack software and technology consultant for the PsyBorgs Lab. He recently completed his undergraduate degree in Psychology, and is focused on marrying his 11 years of software development experience with his interests in historical psychology. One of his recent projects, called Psychology’s EloRater, is a community-based project with the goal of rating the most influential people in the history of psychology. For the purposes of this hackathon, he is interested in new and interesting ways of gleaning information from large volumes of historical texts.

Sylvain Rocheleau is a PhD student in Cognitive Informatics at Téluq and UQAM, in Montreal, Canada. His work focuses on media and social media content analysis using digital methods such as web scraping and data mining on large-scale data sets. He is a research assistant at the Centre de recherche interuniversitaire sur la communication, l’information et la société (CRICIS) and co-founder of the Information Flow Observatory.

Teis M. Kristensen is a PhD student at the School of Communication and Information. He received his master’s degree in 2014 from the Brian Lamb School of Communication at Purdue University. His research is focused on the role communication plays in organizing and technological processes. Teis’ work primarily takes a network approach to understanding the role of communicative interactions in creativity and innovation.

Todd Suomela is currently the CLIR-DLF Postdoctoral Fellow for Data Curation in the Sciences and Social Sciences at the University of Alberta. He is working on research data management in the social sciences and the humanities. Web archiving is one of the technologies he is using in the library and digital humanities communities. He is interested in how to assess the data quality of web archives, improving collection management, and improving tools for text extraction from web archives.

Yu Xu is a second-year Ph.D. student at the USC Annenberg School for Communication and Journalism. His current research interests include social networks, organizational communication, political communication and social movements, and computational methods. Yu has presented papers at conferences such as ICA, AOM, ASA, INSNA Sunbelt, NCA, and IAMCR. In this hackathon, he wants to explore how to combine evolutionary theories with archival Web data to examine the factors that drive the creation, maintenance, and dissolution of network ties over time.