Two of our papers on weighting citations and terms in the context of user modeling were accepted at iConference 2017. Here are the abstracts and links to the pre-print versions:
Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?
In the domain of academic search engines and research-paper recommender systems, CC-IDF is a common citation-weighting scheme that is used to calculate semantic relatedness between documents. CC-IDF adopts the principles of the popular term-weighting scheme TF-IDF and assumes that if a rare academic citation is shared by two documents, then this occurrence should receive a higher weight than if the citation is shared among a large number of documents. Although CC-IDF is in common use, we found no empirical evaluation and comparison of CC-IDF with plain citation weight (CC-Only). Therefore, we conducted such an evaluation and present the results in this paper. The evaluation was conducted with real users of the recommender system Docear. The effectiveness of CC-IDF and CC-Only was measured using click-through rate (CTR). For 238,681 delivered recommendations, CC-IDF had about the same effectiveness as CC-Only (CTR of 6.15% vs. 6.23%). In other words, CC-IDF was not more effective than CC-Only, which is a surprising result. We provide a number of potential reasons and suggest further research to understand the principles of CC-IDF in more detail.
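To make the weighting idea concrete, here is a minimal sketch of how CC-IDF could be computed. The abstract does not spell out the formula, so this assumes the usual TF-IDF analogy: CC-Only is taken to be the raw count of a citation within a document, and the IDF part is log(N / df) over the corpus. The function name `cc_idf_weights` and the toy data are hypothetical and are not taken from Docear.

```python
import math
from collections import Counter

def cc_idf_weights(doc_citations, corpus_citations):
    """Weight each citation in one document by count * log(N / df).

    doc_citations: list of cited-paper ids in the document (repeats allowed).
    corpus_citations: list of sets, one set of cited-paper ids per corpus document.
    Assumption: this mirrors textbook TF-IDF; the paper may define it differently.
    """
    n_docs = len(corpus_citations)
    # Document frequency: in how many corpus documents each citation appears.
    df = Counter()
    for cites in corpus_citations:
        df.update(set(cites))
    counts = Counter(doc_citations)  # the CC-Only weights
    # Citations unknown to the corpus are skipped to avoid division by zero.
    return {c: n * math.log(n_docs / df[c]) for c, n in counts.items() if df[c] > 0}

# Toy corpus: "A" is cited by every document, "B" by half of them.
corpus = [{"A", "B"}, {"A", "C"}, {"A"}, {"A", "B"}]
cc_weights = cc_idf_weights(["A", "B", "B"], corpus)
```

In this toy example the ubiquitous citation "A" ends up with weight zero, while the rarer "B" keeps a positive weight; CC-Only would instead give them 1 and 2.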
TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections.
TF-IDF is one of the most popular term-weighting schemes and is applied by search engines, recommender systems, and user modeling engines. With regard to user modeling and recommender systems, we see two shortcomings of TF-IDF. First, calculating IDF requires access to the document corpus from which recommendations are made. Such access is not always given in a user-modeling or recommender system. Second, TF-IDF ignores information from a user’s personal document collection, which could, so we hypothesize, enhance the user modeling process. In this paper, we introduce TF-IDuF as a term-weighting scheme that does not require access to the general document corpus and that considers information from the users’ personal document collections. We evaluated the effectiveness of TF-IDuF compared to TF-IDF and TF-Only and found that TF-IDF and TF-IDuF perform similarly (click-through rates (CTR) of 5.09% vs. 5.14%), and both are around 25% more effective than TF-Only (CTR of 4.06%) for recommending research papers. Consequently, we conclude that TF-IDuF could be a promising term-weighting scheme, especially when access to the document corpus for recommendations is not possible, and thus classic IDF cannot be computed. It is also notable that TF-IDuF and TF-IDF are not mutually exclusive, so both metrics may be combined into a potentially more effective term-weighting scheme.
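As a rough illustration of the idea, the sketch below computes term weights using only a user's personal document collection in place of the recommendation corpus. The exact TF-IDuF formula is not given in the abstract, so the form tf * log(N_u / n_u) is an assumption, where N_u is the number of documents in the user's collection and n_u the number of those documents containing the term. The function name `tf_iduf_weights` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tf_iduf_weights(target_terms, user_collection):
    """Weight terms by tf * log(N_u / n_u), using only the user's own documents.

    target_terms: list of tokens from the text used to build the user model.
    user_collection: list of token lists, one per document in the user's collection.
    Assumption: this form is modeled on classic TF-IDF; see the paper for the
    authors' actual definition.
    """
    n_user_docs = len(user_collection)
    # n_u: number of the user's documents that contain each term.
    doc_freq = Counter()
    for doc in user_collection:
        doc_freq.update(set(doc))
    tf = Counter(target_terms)
    # Terms absent from the collection are skipped to avoid division by zero.
    return {
        t: f * math.log(n_user_docs / doc_freq[t])
        for t, f in tf.items()
        if doc_freq[t] > 0
    }

user_docs = [
    ["recommender", "citation", "weight"],
    ["recommender", "user", "model"],
    ["recommender", "term", "weight"],
]
term_weights = tf_iduf_weights(["recommender", "weight", "weight", "model"], user_docs)
```

Here "recommender" occurs in every one of the user's documents and is weighted zero, while "model", which occurs in only one, receives the highest weight. No access to the corpus from which recommendations are drawn is needed.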