ICTIR '20: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval


SESSION: Keynote & Invited Talks

Modeling Search Engine Performance Measurement

The information retrieval (IR) community is rightly proud of its passion for evaluation. This conference has been a welcome refuge when passion becomes obsession. ICTIR's transformation from a largely mathematically based theoretical forum to one that seeks generalizable observations from all areas perfectly suits the needs of IR. However, how much have researchers sought to generalize or model search from evaluation? I will present a set of research papers, by others as well as by my collaborators and me, that since the early 1990s have reported generalizing observations from large-scale tests. It's only relatively recently that I've come to realise that these results have been missed by many in the community, yet the models produced carry a great deal of valuable generalizing information about our retrieval systems.

Personalising and Diversifying the Listening Experience

As of June 2020, we have over 250 million monthly active users across 92 markets worldwide, listening to over 60 million tracks and 1.5 million podcast titles. We help this audio find the right audience via our recommendation products, which include playlist recommendation, playlist sequencing, and podcast show and episode recommendation. A large percentage of audio consumption comes from Home, which makes it a valuable space for surfacing personalised and diverse content. This talk will present some of the research we have completed on how to personalise the listening experience, and on what diversity means in the context of a personalised listening experience.

SESSION: Session 1: Ranking

Unbiased Pairwise Learning from Biased Implicit Feedback

Implicit feedback is prevalent in real-world scenarios and is widely used in the construction of recommender systems. However, implicit feedback data are much more complicated to use than their explicit counterpart because they provide only positive feedback, and we cannot know whether non-interacted feedback is positive or negative. Furthermore, positive feedback for rare items is observed less frequently than for popular items, and the relevance of such rare items is often underestimated. Existing solutions to these challenges are either biased with respect to the ideal loss function of interest or adopt a simple pointwise approach, which is inappropriate for a ranking task. In this study, we first define an ideal pairwise loss function, expressed in terms of the ground-truth relevance parameters, that should be used to optimize ranking metrics. Subsequently, we propose a theoretically grounded unbiased estimator for this ideal pairwise loss and a corresponding algorithm, Unbiased Bayesian Personalized Ranking. A pairwise algorithm addressing the two major difficulties of implicit feedback had yet to be investigated, and the proposed algorithm is the first pairwise method to solve these challenges in a theoretically principled manner. Through theoretical analysis, we provide the critical statistical properties of the proposed unbiased estimator and a practical variance reduction technique. Empirical evaluations using real-world datasets demonstrate the practical strength of our approach.
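The unbiased estimator reweights observed interactions by the inverse of their observation propensities. Below is a minimal numpy sketch of such an inverse-propensity-scored pairwise loss, with illustrative function names and a popularity-based propensity model that are assumptions of this sketch rather than the authors' code:

```python
import numpy as np

def propensity(item_popularity, eta=0.5):
    """Illustrative propensity model: the chance that an item's relevance is
    observed grows with its popularity (a common assumption, not the paper's
    exact choice)."""
    pop = np.asarray(item_popularity, dtype=float)
    return (pop / pop.max()) ** eta

def unbiased_pairwise_loss(score_i, score_j, y_i, y_j, theta_i, theta_j):
    """IPS-weighted pairwise loss for one (item i, item j) pair of a user.
    y_* are observed clicks (0/1) and theta_* are observation propensities; the
    weight (y_i / theta_i) * (1 - y_j / theta_j) debiases the click-based
    pairwise objective toward the relevance-based ideal loss."""
    weight = (y_i / theta_i) * (1.0 - y_j / theta_j)
    pairwise = -np.log(1.0 / (1.0 + np.exp(-(score_i - score_j))))  # -log sigmoid
    return weight * pairwise

theta = propensity([500, 20, 3])  # popular, mid, rare item
print(unbiased_pairwise_loss(score_i=1.2, score_j=0.3, y_i=1, y_j=0,
                             theta_i=theta[0], theta_j=theta[1]))
```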

Training Data Optimization for Pairwise Learning to Rank

This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped out, despite the common belief that a larger training dataset is beneficial. Our main contribution is to extend this intuition to noisy- and semi-supervised LtR scenarios: some human annotations can be noisy or out-of-date, and so can machine-generated pseudo-labels in semi-supervised scenarios. Dropping such unreliable labels would benefit both scenarios. The state of the art proposes the Influence Function (IF) for estimating how each training instance affects learning, and we identify and overcome two challenges specific to LtR: 1) non-convex ranking functions violate the assumptions required for the robustness of IF estimation; 2) the pairwise learning of LtR incurs a quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation, reducing cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy- and semi-supervised scenarios.
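For context on the influence-function machinery: the classic estimate of how up-weighting a training instance changes the validation loss involves a Hessian-inverse-vector product. A tiny hedged numpy sketch follows; the paper's group-wise and reliability-adjusted variants are not reproduced here, and the damping term is one common way to cope with non-convexity:

```python
import numpy as np

def influence_on_validation(grad_train, grad_val, hessian, damping=0.01):
    """Classic influence-function estimate of how up-weighting one training
    instance (or, group-wise, one group of pairs) changes the validation loss:
    I = -grad_val^T H^{-1} grad_train. Damping stabilizes the inverse when the
    ranking function is non-convex."""
    h = hessian + damping * np.eye(hessian.shape[0])
    return float(-grad_val @ np.linalg.solve(h, grad_train))

# Instances with large positive influence increase validation loss and are
# candidates for dropping.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 5)); H = H @ H.T   # toy positive semi-definite Hessian
print(influence_on_validation(rng.normal(size=5), rng.normal(size=5), H))
```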

Learning to Rank Entities for Set Expansion from Unstructured Data

We propose using learning-to-rank for entity set expansion (ESE) from unstructured data, the task of finding "sibling" entities within a corpus that are from the set characterized by a small set of seed entities. We present a two-channel neural re-ranking model, NESE, that jointly learns exact and semantic matching of entity contexts through entity interaction features. Although entity set expansion has drawn increasing attention in the IR and NLP communities for its various applications, the lack of massive annotated entity sets has hindered the development of neural approaches. We describe DBpedia-Sets, a toolkit that automatically extracts entity sets from a plain text collection, thus providing a large amount of distant supervision data for neural model training. Experiments on real datasets of different scales from different domains show that NESE outperforms state-of-the-art approaches in terms of precision and MAP. Furthermore, evaluation through human annotations shows that the knowledge learned from the training data is generalizable.

Using Sentiment Analysis for Pseudo-Relevance Feedback in Social Book Search

Book search is a challenging task due to discrepancies between the content and description of books, on one side, and the ways in which people query for books, on the other. However, online reviewers provide an opinionated description of the book, with alternative features that describe its emotional and experiential aspects. Therefore, locating emotional sentences within reviews could provide a rich alternative source of evidence to help improve book recommendations. Specifically, sentiment analysis (SA) could be employed to identify salient emotional terms, which could then be used for query expansion. This paper explores the use of SA-based query expansion in the book search domain. We introduce a sentiment-oriented method for the selection of sentences from the reviews of top-rated books. From these sentences, we extract the terms to be employed in the query formulation. The sentence selection process is based on a semi-supervised SA method, which makes use of adapted word embeddings and lexicon seed words. Using the CLEF 2016 Social Book Search (SBS) Suggestion Track collection, an exploratory comparison between standard pseudo-relevance feedback and the proposed sentiment-based approach is performed. The experiments show that the proposed approach obtains a 24%-57% improvement over the baselines, whilst the classic technique actually degrades performance by 14%-51%.

SESSION: Session 2: Query Models

Cluster-Based Document Retrieval with Multiple Queries

The merits of using multiple queries representing the same information need to improve retrieval effectiveness have recently been demonstrated in several studies. In this paper we present the first study of utilizing multiple queries in cluster-based document retrieval; that is, using information induced from clusters of similar documents to rank documents. Specifically, we propose a conceptual framework of retrieval templates that can adapt cluster-based document retrieval methods, originally devised for a single query, to leverage multiple queries. The adaptations operate at the query, document list and similarity-estimate levels. Retrieval methods are instantiated from the templates by selecting, for example, the clustering algorithm and the cluster-based retrieval method. Empirical evaluation attests to the merits of the retrieval templates with respect to very strong baselines: state-of-the-art cluster-based retrieval with a single query and highly effective fusion of document lists retrieved for multiple queries. In addition, we present findings about the impact of the effectiveness of queries used to represent an information need on (i) cluster hypothesis test results, (ii) percentage of relevant documents in clusters of similar documents, and (iii) effectiveness of state-of-the-art cluster-based retrieval methods.

Optimizing Hyper-Phrase Queries

A hyper-phrase query (HPQ) consists of a sequence of phrase sets. Such queries naturally arise when attempting to spot knowledge graph (KG) facts or sets of KG facts in large document collections to establish their provenance. Our approach addresses this challenge by proposing query operators to detect text regions in documents that correspond to the HPQ as combinations of n-grams and skip-grams. The optimization lies in identifying the most cost-efficient order of query operators that can be executed to identify the text regions containing the HPQ. We show the efficiency of our optimizations on spotting facts from Wikidata in document collections amounting to more than thirty million documents.

Query Performance Prediction for Multifield Document Retrieval

The goal of the query performance prediction (QPP) task is to estimate retrieval effectiveness in the absence of relevance judgments. We consider a novel task of predicting the performance of multifield document retrieval. In this setting, documents are assumed to consist of several different textual descriptions (fields) on which the query is evaluated. Overall, we study three predictor types. The first type applies a given basic QPP method directly on the retrieval's outcome. Building on the idea of reference lists, the second type utilizes several pseudo-effective (PE) reference lists. Each such list is retrieved by further evaluating the query over a specific (single) document field. The third predictor is built on the assumption that a high agreement among the single-field PE reference lists attests to a more effective retrieval. Using three different multifield document retrieval tasks, we demonstrate the merits of our extended QPP methods. Specifically, we show the important role that the intrinsic agreement among the single-field PE reference lists plays in this extended QPP task.

Search Result Diversification with Guarantee of Topic Proportionality

Search result diversification based on topic proportionality considers a document as a bag of weighted topics and aims to reorder or down-sample a ranked list in a way that maintains topic proportionality. The goal is to reflect the topic distribution of an ambiguous query at all points in the revised list, hoping to satisfy all users in expectation. One effective approach, PM-2, greedily selects the best topic that maintains proportionality at each ranking position and then selects the document that best represents that topic. From a theoretical perspective, this approach provides no guarantee that topic proportionality holds in the resulting ranked list. Moreover, it does not take query-document relevance into account. We propose a Linear Programming (LP) formulation, LP-QL, that maintains topic proportionality and simultaneously maximizes relevance. We show that this approach satisfies topic proportionality constraints in expectation. Empirically, it achieves a significant 5.5% performance gain in terms of alpha-NDCG compared to PM-2 when we use LDA as the topic modeling approach. Furthermore, we propose LP-PM-2, which integrates the solution of LP-QL with PM-2. LP-PM-2 achieves a significant 3.2% performance gain over PM-2 in terms of alpha-NDCG with a term-based topic modeling approach. All of our experiments are based on a popular web document collection, ClueWeb09 Category B, with queries taken from the TREC Web Track's diversity task.
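To illustrate how relevance can be maximized while enforcing topic proportionality, here is a relaxed linear-programming sketch using scipy; it conveys the general idea rather than reproducing the exact LP-QL formulation, and all names and constants are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def proportional_selection(relevance, topic_dist, target_props, k):
    """Relaxed LP: pick (fractionally) k documents maximizing relevance while each
    topic receives at least its target share of the k slots."""
    n, _ = topic_dist.shape
    c = -np.asarray(relevance, dtype=float)           # linprog minimizes
    # topic coverage: sum_d x_d * p(topic|d) >= k * target_prop  ->  -A x <= -b
    A_ub = -topic_dist.T
    b_ub = -k * np.asarray(target_props, dtype=float)
    A_eq = np.ones((1, n)); b_eq = [k]                # exactly k slots in total
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * n, method="highs")
    return res.x

rel = [0.9, 0.8, 0.7, 0.4, 0.3]
topics = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(proportional_selection(rel, topics, target_props=[0.6, 0.4], k=3))
```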

SESSION: Session 3: User Interaction

A Mixed-Method Analysis of Text and Audio Search Interfaces with Varying Task Complexity

Voice-based assistants have become a popular tool for conducting web search, particularly for factoid question answering. However, for more complex web searches, their functionality remains limited, as does our understanding of the ways in which users can best interact with audio-based search results. In this paper, we compare and contrast user behaviour through the representation of search results over two mediums: text and audio. We begin by conducting a crowdsourced study exposing the differences in user selection of search results when those are presented in text and audio formats. We further confirm these differences and investigate the reasons behind them through a mixed-methods laboratory study. Through a qualitative analysis of the collected data, we produce a list of guidelines for an audio-based presentation of search results.

Towards Memorable Information Retrieval

Information overload is a problem many of us can relate to nowadays. The deluge of user generated content on the Internet, and the easy accessibility to a vast amount of data compounds the problem of remembering and retaining information that is consumed. To make information consumed more memorable, strategies such as note-taking have been found to be effective by augmenting human memory under specific conditions. This is based on the rationale that humans tend to recall information better if they have produced the information themselves. Previous works in online education have shown that conversational systems can improve learning effects. Although memorization is an important part of learning, the effect of conversation on human memorability remains unexplored. We aim to address this knowledge gap through an experimental study, by investigating human memorability in a classical information retrieval setup. We explore the impact of note-taking affordances and conversational interfaces on the memorability of information consumed by users. Our results show that traditional web search and note-taking have positive effects on knowledge gain, while the search engine with a conversational interface has the potential to augment long-term memorability. This work highlights the benefits of using note-taking and conversational interfaces to aid human memorability. Our findings have important implications on building information retrieval systems that cater to optimizing memorability of information consumed.

The Effects of Learning Objectives on Searchers' Perceptions and Behaviors

In recent years, the "search as learning" community has argued that search systems should be designed to support learning. We report on a lab study in which we manipulated the learning objectives associated with search tasks assigned to participants. We manipulated learning objectives by leveraging Anderson and Krathwohl's taxonomy of learning (A&K's taxonomy)[2], which situates learning objectives at the intersection of two orthogonal dimensions: the cognitive process and the knowledge type dimension. Participants in our study completed tasks with learning objectives that varied across three cognitive processes (apply, evaluate, and create) and three knowledge types (factual, conceptual, and procedural knowledge). We focus on the effects of the tasks cognitive process and knowledge type on participants' pre-/post-task perceptions and search behaviors. Our results found that the three knowledge types considered in our study had a greater effect than the three cognitive processes. Specifically, conceptual knowledge tasks were perceived to be more difficult and required more search activity. We discuss implications for designing search systems that support learning.

Interactive Evaluation of Conversational Agents: Reflections on the Impact of Search Task Design

Undertaking an interactive evaluation of goal-oriented conversational agents (CAs) is challenging: it requires the search task to be realistic and relatable while accounting for the user's cognitive limitations. In the current paper we discuss findings of two Wizard of Oz studies and provide our reflections regarding the impact of different interactive search task designs on participants' performance, satisfaction and cognitive workload. In the first study, we tasked participants with finding the cheapest flight that met a certain departure time. In the second study we added an additional criterion, travel time, and asked participants to find a flight option that offered a good trade-off between price and travel time. We found that using search tasks where participants need to decide between several competing search criteria (price vs. time) led to higher search involvement and lower variance in usability and cognitive workload ratings between different CAs. We hope that our results will provoke discussion on how to make the evaluation of voice-only goal-oriented CAs more reliable and ecologically valid.

SESSION: Session 4: Recommendation

A Hybrid Conditional Variational Autoencoder Model for Personalised Top-n Recommendation

The interactions of users with a recommendation system are in general sparse, leading to the well-known cold-start problem. Side information, such as age, occupation, genre and category, has been widely used to learn latent representations for users and items in order to address the sparsity of users' interactions. Conditional Variational Autoencoders (CVAEs) have recently been adapted for integrating side information as conditions to constrain the learned latent factors and to thereby generate personalised recommendations. However, the learning of effective latent representations that encapsulate both user (e.g. demographic information) and item side information (e.g. item categories) is still challenging. In this paper, we propose a new recommendation model, called the Hybrid Conditional Variational Autoencoder (HCVAE) model, for personalised top-n recommendation, which effectively integrates both user and item side information to tackle the cold-start problem. Two CVAE-based methods -- using conditions on the learned latent factors, or conditions on the encoders and decoders -- are compared for integrating side information as conditions. Our HCVAE model leverages user and item side information as part of the optimisation objective to help the model construct more expressive latent representations and to better capture attributes of the users and items (such as demographics and category preferences) within the personalised item probability distributions. Thorough and extensive experiments conducted on both the MovieLens and Ta-feng datasets demonstrate that the HCVAE model conditioned on user category preferences with conditions on the learned latent factors can significantly outperform common existing top-n recommendation approaches such as MF-based and VAE/CVAE-based models.

Approximate Nearest Neighbor Search and Lightweight Dense Vector Reranking in Multi-Stage Retrieval Architectures

In the context of a multi-stage retrieval architecture, we explore candidate generation based on approximate nearest neighbor (ANN) search and lightweight reranking based on dense vector representations. These results serve as input to slower but more accurate rerankers such as those based on transformers. Our goal is to characterize the effectiveness-efficiency tradeoff space in this context. We find that, on sentence-length segments of text, ANN techniques coupled with dense vector reranking dominate approaches based on inverted indexes, and thus our proposed design should be preferred. For paragraph-length segments, ANN-based and index-based techniques share the Pareto frontier, which means that the choice of alternatives depends on the desired operating point.
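A minimal sketch of these two stages, using faiss for ANN candidate generation followed by exact inner-product reranking of the candidates; the index configuration, dimensions and candidate counts are illustrative assumptions, not the exact setup evaluated in the paper:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                         # embedding dimension
corpus = np.random.rand(100000, d).astype("float32")
queries = np.random.rand(5, d).astype("float32")

# Stage 1: approximate nearest neighbour candidate generation with an IVF index.
quantizer = faiss.IndexFlatIP(d)
ann = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
ann.train(corpus)
ann.add(corpus)
ann.nprobe = 16
_, cand_ids = ann.search(queries, 1000)         # 1000 candidates per query

# Stage 2: lightweight reranking of the candidates with exact dense dot products.
for q, ids in zip(queries, cand_ids):
    exact = corpus[ids] @ q                     # exact inner products, candidates only
    top10 = ids[np.argsort(-exact)[:10]]
    # top10 would then feed a slower transformer-based reranker downstream
```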

Sentiment Prediction using Attention on User-Specific Rating Distribution

For document-level sentiment prediction, many methods first try to capture opinion words and then infer sentiments based on these words. We observe that different users may use the same words to express different levels of satisfaction; e.g., 'great' may signal high satisfaction to some users, or simply serve as a general description to others. Intuitively, we expect the choice of a sentiment expression to follow a distribution specific to a user and her sentiment toward a product. In this paper, we propose a hierarchical neural network model with user-specific rating distribution attention (H-URA) to learn document representations for sentiment prediction. Our model learns local sentiment distributions from a user's expression, at word level and at sentence level respectively. We also learn a global sentiment distribution using both user and product information. The attention weight is then computed from the local and global sentiment distributions. Experimental results show the superiority of our H-URA model compared to strong baselines on benchmark datasets.

A Multistage Ranking Strategy for Personalized Hotel Recommendation with Human Mobility Data

To increase user satisfaction and their own revenue, more and more hotel booking sites are paying attention to personalized recommendation. However, almost all user preference information comes only from user actions in the hotel reservation scenario. This approach has obvious limitations, in particular in user cold-start situations, i.e., when only little information is available about an individual user. In this paper, we focus on hotel recommendation in mobile map applications, where abundant human mobility data provide extra personalized information for hotel search ranking. For this purpose, we propose a personalized multistage pairwise learning-to-rank model, which captures more personalized information by utilizing users' hotel click data across all scenarios of the map application. At the same time, the multistage model can effectively address the cold-start problem. Both offline and online evaluation results show that the proposed model significantly outperforms multiple strong baseline methods.

Leveraging Personalized Sentiment Lexicons for Sentiment Analysis

We propose a novel personalized approach for the sentiment analysis task. The approach is based on the intuition that the same sentiment words can carry different sentiment weights for different users. For each user, we learn a language model over a sentiment lexicon to capture her writing style. We further correlate this user-specific language model with the user's historical ratings of reviews. Additionally, we discuss how two standard CNN and CNN+LSTM models can be improved by adding these user-based features. Our evaluation on the Yelp dataset shows that the proposed new personalized sentiment analysis features are effective.

SESSION: Session 5: Conversational

Length Adaptive Regularization for Retrieval-based Chatbot Models

Chatbots aim to mimic real conversations between humans and have started playing an increasingly important role in our daily life. Given past conversations, a retrieval-based chatbot model selects the most appropriate response from a pool of candidates. Intuitively, based on the nature of the conversation, some responses are expected to be long and informative while others need to be more concise. Unfortunately, none of the existing retrieval-based chatbot models has considered the effect of response length. Empirical observations suggest that existing models over-favor longer candidate responses, leading to sub-optimal performance.

To overcome this limitation, we propose a length adaptive regularization method for retrieval-based chatbot models. Specifically, we first predict the desired response length based on the conversation context and then apply a regularization method based on the predicted length to adjust matching scores for candidate responses. The proposed length adaptive regularization method is general enough to be applied to all existing retrieval-based chatbot models. Experiments on two public data sets show the proposed method is effective to significantly improve retrieval performance.
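The exact form of the regularizer is not given in the abstract; as a rough, hypothetical illustration, one could discount a candidate's matching score by how far its length deviates from the length predicted from the conversation context:

```python
import numpy as np

def length_adjusted_scores(match_scores, candidate_lens, predicted_len, strength=0.05):
    """Illustrative length-adaptive regularization: penalize candidates whose length
    deviates from the length predicted for this conversational context."""
    scores = np.asarray(match_scores, dtype=float)
    lens = np.asarray(candidate_lens, dtype=float)
    return scores - strength * np.abs(lens - predicted_len)

# A long but only slightly better-matching response no longer dominates a concise one.
print(length_adjusted_scores([0.72, 0.70], candidate_lens=[60, 12], predicted_len=15))
```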

Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here. Importantly, these findings extend beyond the particular flavor of question answering task we studied and raise a series of difficult questions around template-based synthetic data generation that will necessitate additional research.
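One way to obtain such a sanitized (template-disjoint) partitioning is to split on template identifiers rather than on individual questions, for example with scikit-learn's GroupShuffleSplit. This sketch assumes each generated example records the template it came from; the toy templates and questions are illustrative:

```python
from sklearn.model_selection import GroupShuffleSplit

examples = ["who directed X?", "who directed Y?", "where was X born?", "where was Y born?"]
template_ids = ["director_of", "director_of", "birthplace_of", "birthplace_of"]

# All questions generated from the same template land in the same split,
# so no template is shared between training and test data.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(examples, groups=template_ids))
print([examples[i] for i in train_idx])
print([examples[i] for i in test_idx])
```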

Analysing the Effect of Clarifying Questions on Document Ranking in Conversational Search

Recent research on conversational search highlights the importance of mixed-initiative in conversations. To enable mixed-initiative, the system should be able to ask clarifying questions to the user. However, the ability of the underlying ranking models (which support conversational search) to account for these clarifying questions and answers when ranking documents has not been analysed at large. To this end, we analyse the performance of a lexical ranking model on a conversational search dataset with clarifying questions. We investigate, both quantitatively and qualitatively, how different aspects of clarifying questions and user answers affect the quality of ranking. We argue that there needs to be some fine-grained treatment of the entire conversational round of clarification, based on the explicit feedback which is present in such mixed-initiative settings. Informed by our findings, we introduce a simple heuristic-based lexical baseline that significantly outperforms the existing naive baselines. Our work aims to enhance our understanding of the challenges present in this particular task and inform the design of more appropriate conversational ranking models.

Bias in Conversational Search: The Double-Edged Sword of the Personalized Knowledge Graph

Conversational AI systems are being used in personal devices, providing users with highly personalized content. Personalized knowledge graphs (PKGs) are one of the recently proposed methods to store users' information in a structured form and tailor answers to their liking. Personalization, however, is prone to amplifying bias and contributing to the echo-chamber phenomenon. In this paper, we discuss different types of biases in conversational search systems, with the emphasis on the biases that are related to PKGs. We review existing definitions of bias in the literature: people bias, algorithm bias, and a combination of the two, and further propose different strategies for tackling these biases for conversational search systems. Finally, we discuss methods for measuring bias and evaluating user satisfaction.

SESSION: Session 6: Learning to Rank

Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking

Counterfactual evaluation can estimate Click-Through-Rate (CTR) differences between ranking systems based on historical interaction data, while mitigating the effect of position bias and item-selection bias. We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for logging data so that the counterfactual estimate has minimal variance. As minimizing variance leads to faster convergence, LogOpt increases the data-efficiency of counterfactual estimation. LogOpt turns the counterfactual approach - which is indifferent to the logging policy - into an online approach, where the algorithm decides what rankings to display. We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods. Furthermore, we perform large-scale experiments by simulating comparisons between thousands of rankers. Our results show that while interleaving methods make systematic errors, LogOpt is as efficient as interleaving without being biased.
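For context, the counterfactual quantity at stake is an inverse-propensity-scored CTR estimate computed from logged clicks; the sketch below shows such an estimator, not the LogOpt logging-policy optimization itself, and the function names are illustrative:

```python
import numpy as np

def ips_ctr(clicks, logging_propensity, target_policy_prob):
    """Inverse-propensity-scored CTR estimate for a target ranker from logged clicks.
    Each entry corresponds to one logged display: whether it was clicked, the
    probability the logging policy showed it, and the probability the evaluated
    ranker would show it."""
    clicks = np.asarray(clicks, dtype=float)
    w = np.asarray(target_policy_prob, dtype=float) / np.asarray(logging_propensity, dtype=float)
    return float(np.mean(w * clicks))

def ips_ctr_difference(clicks, prop_log, prob_a, prob_b):
    """Estimated CTR difference between rankers A and B from the same interaction log."""
    return ips_ctr(clicks, prop_log, prob_a) - ips_ctr(clicks, prop_log, prob_b)

print(ips_ctr_difference(clicks=[1, 0, 1, 0],
                         prop_log=[0.5, 0.5, 0.25, 0.25],
                         prob_a=[0.6, 0.4, 0.3, 0.2],
                         prob_b=[0.4, 0.6, 0.2, 0.3]))
```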

Permutation Equivariant Document Interaction Network for Neural Learning to Rank

How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval research. The recent developments in deep learning show strength in modeling complex relationships across sequences and sets. It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework. In this paper, we formally define the permutation equivariance requirement for a scoring function that captures cross-document interactions. We then propose a self-attention based document interaction network that extends any univariate scoring function with contextual features capturing cross-document interactions. We show that it satisfies the permutation equivariance requirement, and can generate scores for document sets of varying sizes.

Our proposed methods can automatically learn to capture document interactions without any auxiliary information, and can scale across large document sets. We conduct experiments on four ranking datasets: the public benchmarks WEB30K and Istella, as well as Gmail search and Google Drive Quick Access datasets. Experimental results show that our proposed methods lead to significant quality improvements over state-of-the-art neural ranking models, and are competitive with state-of-the-art gradient boosted decision tree (GBDT) based models on the WEB30K dataset.
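A minimal PyTorch sketch of the idea: per-document (univariate) features are combined with self-attention context computed across the documents of a list, and omitting positional encodings keeps the scoring permutation equivariant. This is an illustrative architecture under assumed dimensions, not the paper's exact network:

```python
import torch
import torch.nn as nn

class DocInteractionScorer(nn.Module):
    def __init__(self, feat_dim, hidden=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)
        # Self-attention over the documents of one list; no positional encoding,
        # so permuting the input documents simply permutes the output scores.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, docs):                      # docs: (batch, n_docs, feat_dim)
        h = torch.relu(self.embed(docs))
        ctx, _ = self.attn(h, h, h)               # cross-document context features
        return self.score(torch.cat([h, ctx], dim=-1)).squeeze(-1)  # (batch, n_docs)

scorer = DocInteractionScorer(feat_dim=16)
lists = torch.randn(2, 10, 16)                    # 2 query lists, 10 docs each
print(scorer(lists).shape)                        # torch.Size([2, 10])
```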

Understanding BERT Rankers Under Distillation

Deep language models, such as BERT pre-trained on large corpora, have given a huge performance boost to state-of-the-art information retrieval ranking systems. Knowledge embedded in such models allows them to pick up complex matching signals between passages and queries. However, the high computation cost during inference limits their deployment in real-world search scenarios. In this paper, we study if and how the knowledge for search within BERT can be transferred to a smaller ranker through distillation. Our experiments demonstrate that it is crucial to use a proper distillation procedure, which produces up to nine times speedup while preserving the state-of-the-art performance.
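As a generic illustration of ranker distillation (not necessarily the exact procedure studied in the paper), a smaller student can be trained to match the teacher's softened score distribution over a query's candidate documents while also fitting the relevance labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, labels, alpha=0.5, temperature=2.0):
    """Per-query ranking distillation: soften the teacher's scores over the candidate
    documents and combine a KL term with a standard cross-entropy on the labels."""
    t = temperature
    soft_teacher = F.softmax(teacher_scores / t, dim=-1)
    log_soft_student = F.log_softmax(student_scores / t, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)
    ce = F.cross_entropy(student_scores, labels)   # labels: index of the relevant doc
    return alpha * kl + (1 - alpha) * ce

student = torch.randn(8, 20, requires_grad=True)   # 8 queries x 20 candidate docs
teacher = torch.randn(8, 20)
labels = torch.randint(0, 20, (8,))
print(distillation_loss(student, teacher, labels))
```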

Utilizing Axiomatic Perturbations to Guide Neural Ranking Models

Axiomatic approaches aim to utilize reasonable retrieval constraints to guide the search for optimal retrieval models. Existing studies have shown the effectiveness of axiomatic approaches in improving the performance through either the derivation of new basic retrieval models or modifications of existing ones. Recently, neural network models have attracted more attention in the research community. Since these models are learned from training data, it would be interesting to study how to utilize the axiomatic approaches to guide the training process so that the learned models can satisfy retrieval constraints and achieve better retrieval performance.

In this paper, we propose to utilize axiomatic perturbations to construct training data sets for neural ranking models. The perturbed data sets are constructed in a way to amplify the desirable properties that any reasonable retrieval models should satisfy. As a result, the models learned from the perturbed data sets are expected to satisfy more retrieval constraints and lead to better retrieval performance. Experiment results show that the models learned from the perturbed data sets indeed perform better than those learned from the original data sets.

Analyzing the Influence of Bigrams on Retrieval Bias and Effectiveness

Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relationship between retrieval effectiveness and retrieval bias. While various factors influencing bias have been examined, there has been no work examining the impact of using bigrams within the index on retrieval bias. Intuitively, how documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the bias of a system changes depending on whether documents are represented using unigrams, bigrams or both. Our analysis of three different retrieval models on three TREC collections shows that a bigram-only representation results in the lowest bias compared to a unigram-only representation, but at the expense of retrieval effectiveness. However, combining both representations reduces the overall bias as well as increasing effectiveness. These findings suggest that, when configuring and indexing the collection, the bag-of-words approach (unigrams) should be augmented with bigrams to create better and fairer retrieval systems.

SESSION: Session 7: Evaluation

Declarative Experimentation in Information Retrieval using PyTerrier

The advent of deep machine learning platforms such as Tensorflow and Pytorch, developed in expressive high-level languages such as Python, has allowed more expressive representations of deep neural network architectures. We argue that such a powerful formalism is missing in information retrieval (IR), and propose a framework called PyTerrier that allows advanced retrieval pipelines to be expressed, and evaluated, in a declarative manner close to their conceptual design. Like the aforementioned frameworks that compile deep learning experiments into primitive GPU operations, our framework targets IR platforms as backends in order to execute and evaluate retrieval pipelines. Further, we can automatically optimise the retrieval pipelines to increase their efficiency to suit a particular IR platform backend. Our experiments, conducted on the TREC Robust and ClueWeb09 test collections, demonstrate the efficiency benefits of these optimisations for retrieval pipelines involving both the Anserini and Terrier IR platforms.
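For illustration, a declarative PyTerrier pipeline composing BM25 retrieval with RM3 query expansion and evaluating it with pt.Experiment might look roughly as follows; API details can vary across PyTerrier versions, and the small vaswani test collection is used here purely as an example:

```python
import pyterrier as pt
if not pt.started():
    pt.init()

dataset = pt.get_dataset("vaswani")                 # small test collection with a pre-built index
index = dataset.get_index()

bm25 = pt.BatchRetrieve(index, wmodel="BM25")
pipeline = bm25 >> pt.rewrite.RM3(index) >> bm25    # retrieve, expand the query, re-retrieve

pt.Experiment(
    [bm25, pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> RM3 >> BM25"],
)
```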

Exploiting Stopping Time to Evaluate Accumulated Relevance

Evaluation measures are more or less explicitly based on user models which abstract how users interact with a ranked result list and how they accumulate utility from it. However, traditional measures typically come with a hard-coded user model which can be, at best, parametrized. Moreover, they take a deterministic approach which leads to assigning a precise score to a system run.

In this paper, we take a different angle and, by relying on Markov chains and random walks, we propose a new family of evaluation measures which are able to accommodate different and flexible user models, allow for simulating the interaction of different users, and turn the score into a random variable which more richly describes the performance of a system. We also show how the proposed framework allows for instantiating and better explaining some state-of-the-art measures, like AP, RBP, DCG, and ERR.
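To illustrate how a score becomes a random variable, here is a minimal Monte-Carlo simulation of a very simple Markov-chain user model over rank positions, where a simulated user accumulates gain and continues to the next position with a fixed probability; this is an RBP-like special case assumed for illustration, not the paper's full framework:

```python
import random

def simulate_users(gains, continue_prob=0.8, n_users=10000, seed=0):
    """Monte-Carlo simulation of a simple Markov-chain user model.
    Each simulated user walks down the ranking, accumulating gain, and stops
    (absorbing state) with probability 1 - continue_prob after every position.
    The returned list is the empirical distribution of accumulated utility."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_users):
        total = 0.0
        for g in gains:
            total += g
            if rng.random() > continue_prob:
                break
        totals.append(total)
    return totals

scores = simulate_users([1.0, 0.0, 1.0, 0.5])
print(sum(scores) / len(scores))   # expected utility; the spread describes user variability
```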

Efficient Test Collection Construction via Active Learning

To create a new IR test collection at low cost, it is valuable to carefully select which documents merit human relevance judgments. Shared task campaigns such as NIST TREC pool document rankings from many participating systems (and often interactive runs as well) in order to identify the most likely relevant documents for human judging. However, if one's primary goal is merely to build a test collection, it would be useful to be able to do so without needing to run an entire shared task. Toward this end, we investigate multiple active learning strategies which, without reliance on system rankings: 1) select which documents human assessors should judge; and 2) automatically classify the relevance of additional unjudged documents. To assess our approach, we report experiments on five TREC collections with varying scarcity of relevant documents. We report labeling accuracy achieved, as well as rank correlation when evaluating participant systems based upon these labels vs. full pool judgments. Results show the effectiveness of our approach, and we further analyze how varying relevance scarcity across collections impacts our findings. To support reproducibility and follow-on work, we have shared our code online: https://github.com/mdmustafizurrahman/ICTIR_AL_TestCollection_2020/.
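A minimal sketch of one such strategy, uncertainty sampling with a simple classifier standing in for the relevance model; this is illustrative only (the paper compares several strategies), and in practice a human judgment would replace the true_labels lookup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def build_labels(docs, true_labels, seed_idx, budget):
    """Uncertainty sampling: repeatedly judge the document the current relevance
    classifier is least certain about, then auto-label the remaining documents.
    seed_idx must contain at least one relevant and one non-relevant document."""
    X = TfidfVectorizer().fit_transform(docs)
    judged = set(seed_idx)
    for _ in range(budget):
        idx = sorted(judged)
        clf = LogisticRegression(max_iter=1000).fit(X[idx], [true_labels[i] for i in idx])
        probs = clf.predict_proba(X)[:, 1]
        unjudged = [i for i in range(len(docs)) if i not in judged]
        if not unjudged:
            break
        judged.add(min(unjudged, key=lambda i: abs(probs[i] - 0.5)))  # most uncertain doc
    idx = sorted(judged)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], [true_labels[i] for i in idx])
    return judged, clf.predict(X)    # human-judged set + automatic labels for the rest
```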

Offline Evaluation without Gain

We propose a simple and flexible framework for offline evaluation based on a weak ordering of results (which we call "partial preferences") that define a set of ideal rankings for a query. These partial preferences can be derived from side-by-side preference judgments, from graded judgments, from a combination of the two, or through other methods. We then measure the performance of a ranker by computing the maximum similarity between the actual ranking it generates for the query and elements of this ideal result set. We call this measure the "compatibility" of the actual ranking with the ideal result set. We demonstrate that compatibility can replace and extend current offline evaluation measures that depend on fixed relevance grades that must be mapped to gain values, such as NDCG. We examine a specific instance of compatibility based on rank biased overlap (RBO). We experimentally validate compatibility over multiple collections with different types of partial preferences, including very fine-grained preferences and partial preferences focused on the top ranks. As well as providing additional insights and flexibility, compatibility avoids shortcomings of both full preference judgments and traditional graded judgments.
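For reference, a minimal sketch of truncated rank biased overlap between an actual ranking and one ideal ranking, with compatibility taken as the maximum similarity over the ideal set; the extrapolated RBO tail is omitted for brevity, and the parameter values are illustrative:

```python
def rbo(actual, ideal, p=0.9, depth=20):
    """Truncated rank biased overlap between two rankings (lists of doc ids)."""
    score, seen_a, seen_i = 0.0, set(), set()
    for d in range(1, min(depth, len(actual), len(ideal)) + 1):
        seen_a.add(actual[d - 1])
        seen_i.add(ideal[d - 1])
        agreement = len(seen_a & seen_i) / d
        score += (1 - p) * (p ** (d - 1)) * agreement
    return score

def compatibility(actual, ideal_rankings, p=0.9, depth=20):
    """Similarity of an actual ranking to the closest member of the ideal set."""
    return max(rbo(actual, ideal, p, depth) for ideal in ideal_rankings)

print(compatibility(["d3", "d1", "d2"], [["d1", "d2", "d3"], ["d3", "d2", "d1"]]))
```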

SESSION: Tutorials

Question Answering over Curated and Open Web Sources

The last few years have seen an explosion of research on the topic of automated question answering (QA), spanning the communities of information retrieval, natural language processing, and artificial intelligence. This tutorial would cover the highlights of this very active period of growth for QA to give the audience a grasp of the families of algorithms that are currently being used. We partition research contributions by the underlying source from which answers are retrieved: curated knowledge graphs, unstructured text, or hybrid corpora. We choose this dimension of partitioning as it is the most discriminative when it comes to algorithm design. Other key dimensions are covered within each sub-topic: the complexity of questions addressed, and the degrees of explainability and interactivity introduced in the systems. We would conclude the tutorial with the most promising emerging trends in the expanse of QA that would help new entrants into this field make the best decisions to take the community forward. This tutorial was recently presented at SIGIR 2020.

ICTIR Tutorial: Modern Query Performance Prediction: Theory and Practice

Query performance prediction (QPP) is a core information retrieval (IR) task whose primary goal is to assess retrieval quality in the absence of relevance judgments. Applications of QPP are numerous, and include, among others, automatic query reformulation, fusion and ranker selection, distributed search and content analysis. The main objective of this tutorial is to introduce recent advances in the sub-research area of QPP in IR, covering both theory and applications. On the theoretical side, we will introduce modern QPP frameworks, which have advanced our understanding of the core QPP task. On the application side, the tutorial will set the connection between QPP theory and its usage in various modern IR applications, discussing the pros and cons, limitations, challenges and open research questions.