Computational Analysis of Text for APSA 2018

Proposal for a Research Cafe for APSA 2018: "Computational Analysis of Text and Online Data Repositories as Force Multipliers of Traditional Data Collection in Political Science"

Recent advances in the computational analysis of text, together with the exponential rise of online data repositories, including social media, have opened up exciting possibilities for rapid growth in the data available to political scientists to answer old and new questions of interest to the discipline. In this round-table, participants will outline some of the ways in which they have brought these new tools and possibilities to their own research, and will seek cross-pollination between the techniques they study and the fields they represent. Human rights, protest, propaganda, and interstate relations are some of the areas being transformed by the creative appropriation of techniques developed by computer scientists and digital humanities experts to classify images, retrieve information, and train artificial intelligence algorithms. We are especially interested in setting up a dialogue with scholars relying on more traditional data-gathering methods. We will ask how the new techniques fit into current empirical approaches in different fields, and whether innovative solutions can be deployed in tandem with more traditional data-gathering techniques to improve the quality and expand the quantity of available data.

Nikolay Marinov, University of Mannheim

Marinov can speak about applying machine-learning techniques to a corpus of labelled data on international conflict and cooperation to generate a new measure of alignment between states. The measure relies on careful case-study work (to label the corpus) and on documented judgments about variable codings (which pieces of text justify a negative or positive coding of relations). It can potentially scale up to produce a new, possibly better measure of relations than existing, less-than-satisfactory measures such as proximity in UN voting, joint membership in international organizations, and the like. A second strand of Marinov’s research concerns the updating and expansion of the NELDA dataset of national elections around the world. Linking the data to data on leaders and to existing websites with election information, setting up named entity recognition for elections as events, and linking those events to relevant discussion in media and government documents (USC Record) promise to dramatically expand the quantity and quality of cross-national data on elections.

Anita Gohdes, University of Zurich

Gohdes can speak about using supervised machine learning for text classification in different types of data projects. An advantage of supervised methods is that researchers can establish a clear codebook driven by theoretical concepts decided on a priori. For example, supervised ML can be applied to qualitative accounts of individual instances of human rights violations to establish more fine-grained measures of violence in contentious environments. Gohdes used supervised ML to classify 60,000 individual records of fatalities in the Syrian conflict to establish whether individuals were killed in a targeted or indiscriminate way. In a different project, she and co-authors used a small hand-labelled training set of social media posts to classify all Twitter and Facebook posts shared by world leaders. While supervised methods have many advantages, their performance depends on a number of important factors that will be a subject of discussion in the research cafe.

Rochelle Terman, Stanford University

Rochelle can discuss her experiences applying computational tools and techniques to issues of culture, norms, and identity. These topic areas have historically relied on qualitative and/or critical methods, but recent advances in computational methods have provided new opportunities for engagement. Rochelle will discuss her use of text-as-data methods, web scraping, and other techniques to examine American media coverage of women's rights and gender norms around the world. She will also discuss her experience as a data science instructor to students across the social sciences and humanities, who apply these techniques to a range of substantive topics using a variety of empirical and epistemological approaches.

Walter Mebane, University of Michigan

Mebane can speak about using Twitter to extract observations of election incidents reported by individuals across large elections. Automated machine classification methods in an active learning framework have so far been used in the 2016 election in the United States (including primaries, caucuses and the general election) to classify Tweets for relevance and by type of election incident. Even though humans use both text and images to decide how to label Tweets, the machine classifiers currently use text only. Mebane will discuss ongoing work to build neural networks that use both text and images. The project also uses a database of Tweet and user information to support analysis of the data. For example, the user database is useful for filtering out both bots and users identified as bad actors created by Russia, as well as for developing attributes of individual users and of networks of users. For the general election, the project derives hundreds of thousands of incident observations from 16.5 million raw Tweets; these observations occur at varying rates in different states, vary over time and by type, and depend on state election and demographic conditions.

Pamela Ban, Harvard University

Pamela will discuss how she uses text-as-data methods on congressional text sources to shed new light on theories of congressional politics and organization. Much of the existing empirical work on Congress relies on roll call voting data or Congressional Record speech data, which largely limits empirical analyses to the floor-voting stage. Pamela will discuss how she uses new text datasets of committee speeches and committee reports to open up the black box of the congressional committee stage. In particular, using these text sources, she constructs measures of disagreement during the committee stage and investigates how this disagreement affects committee decisions and subsequent floor voting. She explains how incentives present in a strong committee system can lead legislators to deviate in their voting and contribute to bipartisanship. More broadly, she will discuss how text-as-data methods can help us understand deliberation processes in Congress.

Jeff Arnold, UW [TBA]