Lexical networks (Lexnets)

Overview

There has been considerable recent interest in studying language from a network, or graph-based, perspective. Networks are a natural representation for many linguistic structures, and almost all levels of language have been examined using graph-based methods. Network representations have been used for tasks such as document summarization, word sense disambiguation, and information retrieval.

Using graph-based methods, we look at latent semantic structure in lexical networks. Lexical networks are generated from collections of documents: each document is a node, and an edge between two documents corresponds to their similarity. The standard cosine similarity measure is used. A collection of networks can be generated by varying the cosine threshold, and this collection of networks is called a latent network or semantic similarity network.
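The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: it assumes documents are represented as raw term-count vectors over whitespace-separated tokens, and that an edge is kept when the cosine is strictly greater than the threshold. The function names `cosine` and `lexical_network` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_network(docs, threshold):
    """Return {(i, j): cosine} for document pairs whose cosine exceeds threshold.

    Assumption: raw term counts and a strict '>' comparison; the source
    does not specify the weighting scheme or the comparison used.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        c = cosine(vectors[i], vectors[j])
        if c > threshold:
            edges[(i, j)] = c
    return edges


# Toy documents, for illustration only.
docs = [
    "graph methods for language",
    "graph methods for text",
    "cooking pasta at home",
]

# One network per threshold: together they form the collection of networks.
networks = {t: lexical_network(docs, t) for t in (0.0, 0.2, 0.5, 0.8)}
for t, edges in networks.items():
    print(t, edges)
```

Raising the threshold can only remove edges, so the networks in the collection are nested: each one is a subgraph of the networks built at lower thresholds.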

We look at cosine distributions and network structure across different collections of documents. In particular, the network structure of semantically cohesive collections is compared with that of semantically diverse collections. We also examine the predicted cosine distribution of documents of varying lengths and vocabulary sizes based on a Zipfian language model.
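One way to get a feel for how document length affects cosine values under a Zipfian model is a small simulation. The sketch below is an assumption-laden illustration rather than the model used in this work: it draws each word independently with probability proportional to 1/rank (exponent 1), and the names `zipf_document` and `mean_cosine`, the vocabulary size, and the document lengths are all hypothetical choices.

```python
import math
import random
from collections import Counter


def zipf_document(length, vocab_size, rng, exponent=1.0):
    """Sample a bag of words where the word of rank r has weight 1 / r**exponent."""
    ranks = list(range(1, vocab_size + 1))
    weights = [1.0 / r ** exponent for r in ranks]
    return Counter(rng.choices(ranks, weights=weights, k=length))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_cosine(length, vocab_size, pairs=50, seed=0):
    """Average cosine over independently sampled document pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        total += cosine(zipf_document(length, vocab_size, rng),
                        zipf_document(length, vocab_size, rng))
    return total / pairs


# Illustrative settings: same vocabulary, short versus long documents.
short_mean = mean_cosine(length=20, vocab_size=1000)
long_mean = mean_cosine(length=2000, vocab_size=1000)
print(f"length 20: {short_mean:.3f}, length 2000: {long_mean:.3f}")
```

Under these assumptions, longer documents sampled from the same distribution tend to have higher pairwise cosines, because their term counts converge toward the shared underlying frequencies.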

For comparison, we also examine the growth of several non-lexical networks.

Small lexical network with a cosine threshold of 0%
Small lexical network with a cosine threshold of 20%
Small lexical network with a cosine threshold of 30%
Small lexical network with a cosine threshold of 39%

Degree distribution as a function of cosine threshold for a larger cosine network.
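
A degree distribution at each cosine threshold can be tabulated as in the following sketch. The pairwise cosine values are made up for illustration, the function name `degree_distribution` is hypothetical, and keeping edges whose cosine is strictly above the threshold is an assumption.

```python
from collections import Counter


def degree_distribution(similarities, n_nodes, threshold):
    """Map degree -> number of nodes, keeping edges with cosine above threshold."""
    degree = [0] * n_nodes
    for (i, j), c in similarities.items():
        if c > threshold:
            degree[i] += 1
            degree[j] += 1
    return Counter(degree)


# Hypothetical pairwise cosines for four documents (illustrative values only).
similarities = {
    (0, 1): 0.45, (0, 2): 0.25, (0, 3): 0.10,
    (1, 2): 0.35, (1, 3): 0.05, (2, 3): 0.22,
}

for threshold in (0.0, 0.2, 0.3, 0.39):
    dist = degree_distribution(similarities, 4, threshold)
    print(threshold, dict(sorted(dist.items())))
```

As the threshold rises, edges drop out and the distribution shifts toward low degrees, until most nodes are isolated.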

Applications

Network models may provide new insight into the semantic content of large text collections. For example, semantic similarity networks may be used to detect topic shifts and to identify fake or forged articles. In addition, semantic similarity networks can be used for traditional information retrieval tasks such as document clustering.

Relevant papers

Lexnets bibliography

Funding

This work has been partially supported by the National Science Foundation grant "Collaborative Research: BlogoCenter - Infrastructure for Collecting, Mining and Accessing Blogs", jointly awarded to UMich (IIS 0534323) and UCLA (IIS 0534784).