Language and Information

SI760-001, LING792-004, EECS597-001

http://tangra.si.umich.edu/~radev/760

Fall 2002

Fridays, 2-5 PM (two 75-minute lectures)

409 West Hall

Instructor:
Dragomir R. Radev (radev@umich.edu)
Credits:
3
Course type:
survey course, lecture format
Prerequisites:
Audience:
Mostly doctoral and master's students but also advanced undergraduates.

COURSE DESCRIPTION

A survey of techniques used in language studies and information processing. Students will learn how to explore and analyze textual data in the context of Web-based information retrieval systems. At the conclusion of the course, students will be able to work as information designers and analysts.

TENTATIVE SYLLABUS

Each class represents a 75-minute lecture.


1. The study of language. Linguistic fundamentals.

2. Mathematical and probabilistic fundamentals. Descriptive
   statistics. Measures of central tendency. The z-score. Hypothesis
   testing.

3. Information theory. Entropy, joint entropy, conditional
   entropy. Relative entropy and mutual information. Chain rules. The
   entropy of English.  

4. Working with corpora. N-grams.

5. Language models. Noisy channel models. Hidden Markov Models.

6. Cluster analysis. Clustering of terms according to semantic
   similarity. Distributional clustering.

7. Collocations. Syntactic criteria for collocability. 

8. Literary detective work. The statistical analysis of writing
   style. Decipherment and translation. 

9. Information retrieval.

10. Text summarization. Single-document summarization. Multi-document
    summarization. Maximal Marginal Relevance. Cross-document
    structure theory. Trainable methods. 

11. Information extraction. Message understanding.

12. Question answering. Semantic representation. Predictive annotation.

13. Word sense disambiguation and lexical acquisition. Supervised
    disambiguation. Unsupervised disambiguation. Attachment
    ambiguity. Computational lexicography. 

14. Other topics. Text alignment. Word alignment. Statistical machine
    translation. Statistical text generation. Discourse
    segmentation. Text categorization.

ASSIGNMENTS AND GRADES

Assignments (45%)
The assignments will involve analysis of real textual data using both manual and automated techniques.
Project (30%)
Data analysis and/or programming project.
Final (25%)
A mixture of short-answer and essay-type questions.

READING LIST

Required books:

Reference readings:

A small number of articles will be assigned to complement the major readings. These articles will come primarily from the ACL, AAAI, and SIGIR proceedings and/or the following journals: Computational Linguistics, Information Retrieval, and Artificial Intelligence.