The verbatim records of democratic legislatures are an untapped source of information of singular importance for the study of democratic societies. A typical year of a legislative record includes tens of thousands of speeches and tens of millions of spoken words. We seek to leverage these records to gain new insight into the dynamics of the political agenda and answer questions like:
We describe a method for statistical learning from speech documents and apply it to the Congressional Record (the transcripts of speeches made in the US Senate and House) in order to gain new insight into the dynamics of the political agenda. From the patterns of word choice in each speech and the dynamics of those patterns over time, our method infers (a) the topics of speeches and (b) the probability that attention will be paid to any given topic or set of topics over time.
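To make quantity (b) concrete, the following is a minimal sketch of a far simpler stand-in: it assumes each speech has already been assigned a single topic label and computes the empirical share of speeches devoted to each topic in each time period. The function name `attention_shares`, the `(period, topic)` input format, and the example labels are hypothetical illustrations. The method described above infers topics and attention jointly from word choice, which this sketch does not attempt.

```python
from collections import Counter, defaultdict


def attention_shares(labeled_speeches):
    """Return {period: {topic: share}} from (period, topic) pairs.

    Assumption: every speech already carries one topic label. This is an
    empirical proportion, not the statistical model described in the text.
    """
    per_period = defaultdict(Counter)
    for period, topic in labeled_speeches:
        per_period[period][topic] += 1

    shares = {}
    for period, counts in per_period.items():
        total = sum(counts.values())
        # Normalize counts so the shares within a period sum to 1.
        shares[period] = {topic: n / total for topic, n in counts.items()}
    return shares


# Hypothetical toy input: three speeches in 1997, one in 1998.
example = [
    (1997, "defense"),
    (1997, "defense"),
    (1997, "health"),
    (1998, "health"),
]
shares = attention_shares(example)
```

Raw proportions like these are noisy when a period contains few speeches, which is one reason to model the dynamics across time rather than treat each period independently.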
The input data consist of Senate speeches from the 105th-108th Congresses (1997-2004). The basic units of the electronic version of the Congressional Record are "html documents", which correspond roughly to titled subsections in the printed Record. Each document can contain speeches by zero, one, or several speakers, discussing one or more items or topics. There are 71,181 documents from 1997-2004.
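The document structure described here can be sketched as a small data model. This is an illustrative assumption about how such records might be represented, not the representation actually used: the class names `Speech` and `RecordDocument`, the helper `word_counts`, the tokenization rule, and the sample text are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speech:
    speaker: str
    text: str


@dataclass
class RecordDocument:
    """One "html document": a titled subsection holding zero or more speeches."""
    title: str
    speeches: List[Speech] = field(default_factory=list)


# Assumed tokenization: lowercase alphabetic runs only.
TOKEN = re.compile(r"[a-z]+")


def word_counts(speech: Speech) -> Counter:
    """Count word occurrences in one speech (a bag-of-words representation)."""
    return Counter(TOKEN.findall(speech.text.lower()))


# Hypothetical example: one document with two speakers, one with none.
doc = RecordDocument(
    title="Example Subsection",
    speeches=[
        Speech("Senator A", "The budget, the budget, and taxes."),
        Speech("Senator B", "Health care and taxes."),
    ],
)
empty_doc = RecordDocument(title="Procedural Notice")

counts_per_speech = [word_counts(s) for s in doc.speeches]
```

Keeping the speech, rather than the document, as the unit of analysis matches the description above, since a single document may mix several speakers and topics.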
This work is supported by the National Science Foundation under Grant No. 0527513, "DHB: The Dynamics of Political Representation and Political Rhetoric". Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the National Science Foundation.