Today the Web has become a huge information repository that covers almost any interesting topics that human users search for. However, despite the recent advancements on web search technologies, there are still many situations in which search engine users are contemplated with non-relevant search results. One of the major reasons for this problem is that web search engines have difficulties in recognizing users' specific search interest given his initial query. On one hand, this is due to the ambiguity that arises naturally in the diversity of language itself, and that no dialog (discourse) context structure is available in search engines. On the other hand, untrained Web search engine users are often not clear of the exact terms that best represent their specific information needs. In the worst case, users are even incapable of formulating queries representing their specific information need. Therefore, it is desirable to study users' search patterns and recognize (at least approximate) their search interests.
Search engine query logs is a good resource that records users' search histories and thus the necessary information for such studies. Query logs capture explicit descriptions of users' information needs. Logs of interactions that follow a user's query (e.g., click-though and navigation patterns) capture derivative traces that further characterize the user and their interests. For example, some user might be interested in purchasing cars and therefore submitted the query "car'' first. Looking at the returned results, he found that he needed more information about a particular brand of sedans, e.g. Ford. So he next inputted the query "ford sedan'' and get a new set of results. In this scenario, though the two queries are lexically different, they are semantically related and represent a common search interest. The transition from ''car'' to ''ford sedan'' indicates the semantic relatedness of the two queries. If such a transition is widely observed in various users' search processes, we may extract it as a common pattern.
We propose to build a directed graph of queries by modeling such query transition patterns existing in users' query sessions, based on real-world Web search engine query log data. In query graphs, the nodes represent distinct queries and the arcs indicate the semantic relatedness revealed by these query transition patterns. We suggest using statistical dependencies to capture the strength of such semantic relatedness. We also propose to find from the query graph the community structures (which is well studied in social network studies) that are potentially capable of representing users' underlying search interests in their search processes. We define such community structures as semantic taxonomies that carry query concepts, which can be utilized for diverse NLP and IR tasks.
Data Source
The dataset that we study is adapted from the query log of AOL search engine 1 (www.aol.com) (Pass et al., 2006). The entire collection consists of around 36M query records. These records contain about 20M distinct queries submitted from about 650k users over three months (from March to May 2006). Each record is in the same format: {AnonID, Query, QueryTime, ItemRank, ClickURL}.
The descriptions for these elements are listed below:
Below is a sample query log segment of an anonymous user in AOL query log data.
Below are some statistics about the AOL query log dataset.
Query Graphs
The construction of a query graph can be delegated into two subtasks. In the first one, we extract all users’ query histories and segment the query histories intoquery sessions. Then in the second one, we capture the semantic relatedness between the queries present in those query sessions. Once we are able to extract the queries and model the semantic relatedness between queries, a query graph G = (V , E) is ready to be built.
The following query graph consists of 989 distinct queries and 4948 edges, which are extracted from a sample of 1 million query records. Nodes are resized according to their degree centrality in the query graph. Click it to enlarge the picture.
The query graph possesses typical small world properties, which can be concluded from the following comparison between the query graph and a comparable random graph with similar sizes. It is observed that the query graph has a much larger clustering coefficient than the random graph; however, its average length of shortest paths is not significantly different from that of the random network. In other words, the query graph has a high clustering coefficient while low average length of shortest paths. Therefore it is a small world graph.
|
Num. Nodes |
Num. Edges |
Clustering Coefficient |
Avg. Length of Shortest Paths |
Diameter |
Query Network |
989 |
4846 |
0.3495 |
2.52 |
7 |
Erdos-Renyi Network |
989 |
4837 |
0.0106 |
2.34 |
6 |
Community structures are studied in many social network researches. In query graphs, we can also observe well-formed community structures. Below are some extracted communities in the query graph. From the queries included in these communities, we can observe a consistency of the topics or concepts they represent. We can utilize these extracted community structures to represent query concepts or search interests of search engine users.
Publications
to be completed.