
Publications 

Publishing papers in scientific journals and at research-focused conferences and workshops helps ensure that our work remains aligned with the state of the art in our fields.

Multi-Modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

Chou, Shih-Han, Matthew Kowal, Yasmin Niknam, Diana Moyano, Shayaan Mehdi, Richard Pito, Cheng Zhang, et al. “Multi-Modal News Understanding with Professionally Labelled Videos (ReutersViLNews).” In Canadian Conference on Artificial Intelligence, 2024. https://arxiv.org/abs/2401.12419.

“While progress has been made in the domain of video-language understanding, current state-of-the-art algorithms are still limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. Alternatively, humans easily amalgamate information from video and language to infer information beyond what is visually observable in the pixels. An example of this is watching a news story, where the context of the event can play as big of a role in understanding the story as the event itself. Towards a solution for designing this ability in algorithms, we present a large-scale analysis on an in-house dataset collected by the Reuters News Agency, called Reuters Video-Language News (ReutersViLNews) dataset which focuses on high-level video-language understanding with an emphasis on long-form news. The ReutersViLNews Dataset consists of long-form news videos collected and labeled by news industry professionals over several years and contains prominent news reporting from around the world. Each video involves a single story and contains action shots of the actual event, interviews with people associated with the event, footage from nearby areas, and more. ReutersViLNews dataset contains videos from seven subject categories: disaster, finance, entertainment, health, politics, sports, and miscellaneous with annotations from high-level to low-level, title caption, visual video description, high-level story description, keywords, and location. We first present an analysis of the dataset statistics of ReutersViLNews compared to previous datasets. Then we benchmark state-of-the-art approaches for four different video-language tasks. The results suggest that news-oriented videos are a substantial challenge for current video-language understanding algorithms and we conclude by providing future directions in designing approaches to solve the ReutersViLNews dataset.”
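
To make the annotation structure described above concrete, here is a minimal sketch of what a single dataset record could look like; the field names are inferred from the annotation types listed in the abstract (title caption, visual video description, high-level story description, keywords, location, subject category), not taken from the released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: field names are inferred from the abstract, not the actual schema.
@dataclass
class ReutersViLNewsRecord:
    video_id: str
    category: str              # one of: disaster, finance, entertainment, health, politics, sports, miscellaneous
    title_caption: str         # short headline-style caption
    visual_description: str    # low-level description of what is visible in the footage
    story_description: str     # high-level summary of the underlying news story
    keywords: List[str] = field(default_factory=list)
    location: str = ""
```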

Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence

Yang, Hsiu-Wei, Abhinav Agrawal, Pavlos Fragkogiannis, and Shubham Nitin Mulay. “Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence.” In Workshop on Psychology-Informed Information Access Systems (PsyIAS), 2024. https://arxiv.org/abs/2403.18183.

“A well-designed document communicates not only through its words but also through its visual eloquence. Authors utilize aesthetic elements such as colors, fonts, graphics, and layouts to shape the perception of information. Thoughtful document design, informed by psychological insights, enhances both the visual appeal and the comprehension of the content. While state-of-the-art document AI models demonstrate the benefits of incorporating layout and image data, it remains unclear whether the nuances of document aesthetics are effectively captured. To bridge the gap between human cognition and AI interpretation of aesthetic elements, we formulated hypotheses concerning AI behavior in document understanding tasks, specifically anchored in document design principles. With a focus on legibility and layout quality, we tested four aspects of aesthetic effects: noise, font-size contrast, alignment, and complexity, on model confidence using correlational analysis. The results and observations highlight the value of model analysis rooted in document design theories. Our work serves as a trailhead for further studies and we advocate for continued research in this topic to deepen our understanding of how AI interprets document aesthetics.”
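
As a rough illustration of the correlational analysis described in the abstract, the sketch below measures rank correlation between an aesthetic perturbation level and model confidence; the arrays, the perturbation, and the choice of Spearman correlation are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical inputs: one aesthetic perturbation level per document (e.g. amount of
# added noise or degree of misalignment) and the model's prediction confidence on the
# same documents. Both arrays are placeholders, not data from the paper.
perturbation_level = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
model_confidence   = np.array([0.97, 0.95, 0.90, 0.88, 0.81, 0.76])

# Rank correlation between aesthetic degradation and prediction confidence.
rho, p_value = spearmanr(perturbation_level, model_confidence)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```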

Evaluating Interactive Topic Models in Applied Settings

Gao, Sally, Milda Norkute, and Abhinav Agrawal. “Evaluating Interactive Topic Models in Applied Settings.” In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA’24), 2024. https://doi.org/10.1145/3613905.3637133.

“Topic modeling is a text analysis technique for automatically discovering common themes in a collection of documents. “Human-in-the-loop” topic modeling (HLTM) allows domain experts to steer and adjust the creation of topic models. In this case study, we use a custom-built HLTM interface to assess the impact of human refinement on model interpretability and predictive performance in collaboration with an analytics team within our organization. Using a small dataset (≈ 12k documents) of responses drawn from an organizational employee satisfaction survey, we compare the pre- and post-refinement models using both human judgments and automated metrics. We find that human refinement can enhance interpretability and predictive performance in some cases, but may lead to overfitting on the training data, which negatively impacts model quality. Furthermore, we observe that existing evaluation methods don’t sufficiently and clearly capture topic model quality in applied settings, and propose guidance for further HLTM tool development.” 

The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines

Forster, Martina, Claudia Schulz, Prudhvi Nokku, Melicaalsadat Mirsafian, Jaykumar Kasundra, and Stavroula Skylaki. “The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines,” 2024. https://arxiv.org/abs/2401.11852.

“Multi-Label Classification (MLC) is a common task in the legal domain, where more than one label may be assigned to a legal document. A wide range of methods can be applied, ranging from traditional ML approaches to the latest Transformer-based architectures. In this work, we perform an evaluation of different MLC methods using two public legal datasets, POSTURE50K and EURLEX57K. By varying the amount of training data and the number of labels, we explore the comparative advantage offered by different approaches in relation to the dataset properties. Our findings highlight DistilRoBERTa and LegalBERT as performing consistently well in legal MLC with reasonable computational demands. T5 also demonstrates comparable performance while offering advantages as a generative model in the presence of changing label sets. Finally, we show that the CrossEncoder exhibits potential for notable macro-F1 score improvements, albeit with increased computational costs.”
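
For readers unfamiliar with the setup, the sketch below shows a minimal multi-label inference pass with a DistilRoBERTa-style encoder via Hugging Face Transformers; the checkpoint, label count, decision threshold, and example text are placeholders, and in practice the classification head would first be fine-tuned on a dataset such as POSTURE50K or EURLEX57K.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative setup: "distilroberta-base", 50 labels, and the 0.5 threshold are
# placeholder choices, not the exact configuration evaluated in the paper.
num_labels = 50
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base",
    num_labels=num_labels,
    problem_type="multi_label_classification",  # sigmoid + BCE instead of softmax
)
# Note: the classification head is randomly initialized here; fine-tuning is required
# before the predictions below become meaningful.

text = "Order granting defendant's motion to dismiss for lack of jurisdiction."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# In multi-label classification each label is scored independently.
probs = torch.sigmoid(logits)[0]
predicted_labels = (probs > 0.5).nonzero(as_tuple=True)[0].tolist()
print(predicted_labels)
```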

ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale

Frohmann, Markus, Carolin Holtermann, Shahed Masoudian, Anne Lauscher, and Navid Rekabsaz. “ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale.” In Findings of the Association for Computational Linguistics ACL 2024, edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 11743–76. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics, 2024. https://aclanthology.org/2024.findings-acl.699.

Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL methods introduce a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~0.35% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only 8 transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer. Our code is available at https://github.com/CPJKU/ScaLearn.
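
The transfer step can be pictured as a learned weighted combination of frozen source-adapter outputs. The sketch below shows the simplest scalar variant of that idea, assuming the source adapters have already been trained; it is a minimal illustration, and the actual ScaLearn variants differ in detail (see the released code for the real implementation).

```python
import torch
import torch.nn as nn

class ScaledAdapterCombination(nn.Module):
    """Combine frozen source-adapter outputs with one learned scalar per source task.

    A minimal sketch of the scaling idea described in the abstract, not the
    published implementation.
    """
    def __init__(self, num_source_tasks: int):
        super().__init__()
        # The only trainable transfer parameters: one weight per source task.
        self.omega = nn.Parameter(torch.full((num_source_tasks,), 1.0 / num_source_tasks))

    def forward(self, adapter_outputs: torch.Tensor) -> torch.Tensor:
        # adapter_outputs: (num_source_tasks, batch, seq_len, hidden)
        weights = self.omega.view(-1, 1, 1, 1)
        return (weights * adapter_outputs).sum(dim=0)

combiner = ScaledAdapterCombination(num_source_tasks=8)
outputs = torch.randn(8, 2, 16, 768)   # dummy source-adapter activations
combined = combiner(outputs)           # (2, 16, 768) representation passed onward
```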

Measuring the Groundedness of Legal Question-Answering Systems

Trautmann, Dietrich, Natalia Ostapuk, Quentin Grail, Adrian Alan Pol, Guglielmo Bonifazi, Shang Gao, and Martin Gajek. “Measuring the Groundedness of Legal Question-Answering Systems.” In Proceedings of the Natural Legal Language Processing Workshop 2024, 2024.

In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.
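
Among the methods benchmarked are similarity-based metrics. As a minimal, hedged sketch of that family, the function below scores a response by the fraction of its sentences whose closest context chunk exceeds a cosine-similarity threshold; the sentence-transformers checkpoint and the threshold are arbitrary illustrative choices, not the paper's configuration or results.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative choices: the encoder checkpoint and the 0.6 threshold are placeholders.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def grounded_fraction(response_sentences, context_chunks, threshold=0.6):
    """Fraction of response sentences whose best-matching context chunk exceeds a
    cosine-similarity threshold (a simple groundedness proxy)."""
    resp_emb = encoder.encode(response_sentences, convert_to_tensor=True)
    ctx_emb = encoder.encode(context_chunks, convert_to_tensor=True)
    sims = util.cos_sim(resp_emb, ctx_emb)   # (num_sentences, num_chunks)
    best = sims.max(dim=1).values            # best supporting chunk per sentence
    return (best > threshold).float().mean().item()

score = grounded_fraction(
    ["The notice period is 30 days."],
    ["Either party may terminate this agreement with thirty (30) days' written notice."],
)
```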

LLM-Based Robust Product Classification in Commerce and Compliance

Gholamian, Sina, Gianfranco Romani, Bartosz Rudnikowicz, and Stavroula Skylaki. “LLM-Based Robust Product Classification in Commerce and Compliance.” In Proceedings of the EMNLP Workshop on Customizable NLP 2024, 2024.

Product classification is a crucial task in international trade, as compliance regulations are verified and taxes and duties are applied based on product categories. Manual classification of products is time-consuming and error-prone, and the sheer volume of products imported and exported renders the manual process infeasible. Consequently, e-commerce platforms and enterprises involved in international trade have turned to automatic product classification using machine learning. However, current approaches do not consider the real-world challenges associated with product classification, such as very abbreviated and incomplete product descriptions. In addition, recent advancements in generative Large Language Models (LLMs) and their reasoning capabilities are mainly untapped in product classification and e-commerce. In this research, we explore the real-life challenges of industrial classification and propose data perturbations that allow for realistic data simulation. Furthermore, we employ LLM-based product classification to improve the robustness of the prediction in the presence of incomplete data. Our research shows that LLMs with in-context learning outperform the supervised approaches in the clean-data scenario. Additionally, we illustrate that LLMs are significantly more robust than the supervised approaches when data attacks are present.
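
To illustrate the kind of perturbation and prompting involved, the sketch below abbreviates a clean product description and builds a simple in-context classification prompt; the perturbation rules, category names, and prompt format are invented for illustration and are not the paper's.

```python
import random

def abbreviate_description(description: str, keep_ratio: float = 0.5, seed: int = 0) -> str:
    """Toy perturbation: randomly drop words and shorten the survivors, roughly
    simulating the terse, abbreviated descriptions seen in trade documents.
    The specific rules are illustrative, not the paper's perturbations."""
    rng = random.Random(seed)
    words = description.split()
    kept = [w for w in words if rng.random() < keep_ratio]
    return " ".join(w[:4] for w in kept) or words[0][:4]

clean = "stainless steel kitchen knife set with wooden block"
noisy = abbreviate_description(clean)   # e.g. "stai stee knif set"

# Illustrative in-context classification prompt for an LLM.
prompt = (
    "Classify the product into one category.\n"
    "Description: cotton t-shirt short sleeve -> Category: Apparel\n"
    f"Description: {noisy} -> Category:"
)
```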

Towards an Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries

Tan, Shao Min, Quentin Grail, and Lee Quartey. “Towards an Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries.” In Proceedings of the EMNLP Workshop on Natural Legal Language Processing (NLLP) 2024. Miami, FL, USA, 2024.

Long-form abstractive summarization is a task that has particular importance in the legal domain. Automated evaluation metrics are important for the development of text generation models, but existing research on the evaluation of generated summaries has focused mainly on short summaries. We introduce an automated evaluation methodology for generated long-form legal summaries, which involves breaking each summary into individual points, comparing the points in a human-written and machine-generated summary, and calculating a recall and precision score for the latter. The method is designed to be particularly suited for the complexities of legal text and is also fully interpretable. We also created and released a small meta-dataset for the benchmarking of evaluation methods, focusing on long-form legal summarization. Our evaluation metric corresponds better with human evaluation compared to existing metrics which were not developed for legal data.
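
A toy version of the pointwise idea is sketched below: split both summaries into points, match generated points against reference points, and report precision and recall for the generated summary. The word-overlap matcher is a naive placeholder standing in for the more careful point extraction and matching the paper describes.

```python
from typing import Callable, List

def pointwise_scores(
    reference_points: List[str],
    generated_points: List[str],
    is_match: Callable[[str, str], bool],
):
    """Precision: fraction of generated points supported by some reference point.
    Recall: fraction of reference points covered by some generated point."""
    supported = sum(any(is_match(g, r) for r in reference_points) for g in generated_points)
    covered = sum(any(is_match(g, r) for g in generated_points) for r in reference_points)
    precision = supported / len(generated_points) if generated_points else 0.0
    recall = covered / len(reference_points) if reference_points else 0.0
    return precision, recall

# Placeholder matcher; in practice this would be a similarity or entailment model.
naive_match = lambda a, b: len(set(a.lower().split()) & set(b.lower().split())) >= 4

p, r = pointwise_scores(
    ["The court granted the motion to dismiss.", "Costs were awarded to the defendant."],
    ["The motion to dismiss was granted by the court."],
    naive_match,
)
```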

CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-Training

Brandfonbrener, David, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, and Sham Kakade. “CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-Training.” In Proceedings of Neural Information Processing Systems (NeurIPS) 2024. Vancouver, Canada, 2024.

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models.
In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks.
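
The selection criterion itself is simple enough to sketch: score each candidate sequence by how much its loss drops when moving from a "prior" auxiliary model to a "conditional" auxiliary model tuned toward the target distribution, then keep the highest-scoring sequences. The checkpoints and candidate texts below are placeholders, not the models or data used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoints: in the paper these are small auxiliary models, and the
# "conditional" model has additionally been fine-tuned toward the target data.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prior_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
conditional_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # stand-in for the tuned model

@torch.no_grad()
def sequence_loss(model, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()   # mean next-token loss

def color_filter_score(text: str) -> float:
    # Larger score = larger loss reduction under the conditional model,
    # i.e. the sequence looks more like the targeted distribution.
    return sequence_loss(prior_model, text) - sequence_loss(conditional_model, text)

candidates = ["Chapter one of a novel ...", "Boilerplate cookie banner text ..."]
selected = sorted(candidates, key=color_filter_score, reverse=True)[:1]
```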

Composing Knowledge and Compression Interventions for Language Models

Kolbeinsson, Arinbjorn, Kyle O’Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Vaidya, Faisal Mahmood, Marinka Zitnik, and Tianlong Chen. “Composing Knowledge and Compression Interventions for Language Models.” In Proceedings of CLR 2024 Workshop on Reliable and Responsible Foundation Models. Vienna, Austria, 2024.

Test-time interventions for language models aim to enhance factual accuracy, reduce harmful outputs, and improve model efficiency while avoiding excessive training costs. However, existing interventions are developed independently, even though in practice multiple interventions must often be applied to the same model sequentially. We introduce composable interventions, a framework for studying the impact of repeatedly intervening on the same language model. To showcase our framework, we compose two burgeoning classes of interventions: knowledge editing and model compression. We find that compression undoes knowledge edits faster than it decays general model performance. We also find that compressing models makes them harder to edit, and show that composing interventions affects the predicted logits.
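
Schematically, the framework boils down to applying interventions in sequence and measuring what survives. The sketch below is purely illustrative scaffolding: the edit, compression, and evaluation functions are placeholders, not implementations of any particular editing or compression method.

```python
# Schematic only: apply_knowledge_edits, quantize_weights, and answer_fn are
# placeholders standing in for real editing, compression, and evaluation code.
def compose(model, interventions):
    """Apply a sequence of interventions (e.g. [edit_facts, quantize]) in order."""
    for intervention in interventions:
        model = intervention(model)
    return model

def edit_retention(answer_fn, edit_probes):
    """Fraction of edited facts still answered correctly after further interventions."""
    return sum(answer_fn(q) == a for q, a in edit_probes) / len(edit_probes)

# The framework compares orderings, e.g.:
#   m1 = compose(base_model, [apply_knowledge_edits, quantize_weights])
#   m2 = compose(base_model, [quantize_weights, apply_knowledge_edits])
# and checks how much of the edit survives in each case.
```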

Online Adaptation of Language Models with a Memory of Amortized Contexts

Tack, Jihoon, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, and Jonathan Richard Schwarz. “Online Adaptation of Language Models with a Memory of Amortized Contexts.” In Proceedings of Neural Information Processing Systems (NeurIPS) 2024. Vancouver, Canada, 2024.

Due to the rapid generation and dissemination of information, large language models (LLMs) quickly run out of date despite enormous development costs. Due to this crucial need to keep models updated, online learning has emerged as a critical necessity when utilizing LLMs for real-world applications. However, given the ever-expanding corpus of unseen documents and the large parameter space of modern LLMs, efficient adaptation is essential. To address these challenges, we propose Memory of Amortized Contexts (MAC), an efficient and effective online adaptation framework for LLMs with strong knowledge retention. We propose an amortized feature extraction and memory-augmentation approach to compress and extract information from new documents into compact modulations stored in a memory bank. When answering questions, our model attends to and extracts relevant knowledge from this memory bank. To learn informative modulations in an efficient manner, we utilize amortization-based meta-learning, which substitutes the optimization process with a single forward pass of the encoder. Subsequently, we learn to choose from and aggregate selected documents into a single modulation by conditioning on the question, allowing us to adapt a frozen language model during test time without requiring further gradient updates. Our experiment demonstrates the superiority of MAC in multiple aspects, including online adaptation performance, time, and memory efficiency.
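
The retrieval step at question time can be pictured as attention over a bank of compact document modulations. The sketch below uses random tensors and illustrative dimensions; in the actual method the modulations are produced by an amortized encoder and the aggregated result conditions a frozen language model.

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes: a memory bank of compact per-document modulations and a
# question embedding; random tensors stand in for learned representations.
num_docs, dim = 100, 256
memory_bank = torch.randn(num_docs, dim)   # one compact modulation per document
question_embedding = torch.randn(dim)

# Attend over the memory bank with the question as the query, then aggregate the
# selected modulations into a single conditioning vector for the frozen LM.
scores = memory_bank @ question_embedding / math.sqrt(dim)
weights = F.softmax(scores, dim=0)               # (num_docs,)
aggregated_modulation = weights @ memory_bank    # (dim,) -- no gradient updates to the LM
```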

Unleashing the Power of Meta-Tuning for Few-Shot Generalization Through Sparse Interpolated Experts

Chen, Shengzhuang, Jihoon Tack, Yunqiao Yang, Yee Whye Teh, Jonathan Richard Schwarz, and Ying Wei. “Unleashing the Power of Meta-Tuning for Few-Shot Generalization Through Sparse Interpolated Experts.” In Proceedings of Forty-First International Conference on Machine Learning, ICML 2024. Vienna, Austria, 2024. https://arxiv.org/abs/2403.08477.

Conventional wisdom suggests parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-domain (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient finetuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-domain and out-of-domain generalization. Our code is publicly available.
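
As a rough picture of sparse interpolation, the sketch below applies a meta-tuning update to a frozen weight matrix only through a sparse mask; the fixed top-k magnitude mask is an illustrative stand-in for the learned, per-task sparsity patterns described in the abstract.

```python
import torch

# Illustrative only: SMAT learns which parameter subsets to tune; a fixed
# top-k magnitude mask stands in for those learned sparsity patterns.
pretrained = torch.randn(4, 4)              # a frozen pre-trained weight matrix
dense_update = 0.01 * torch.randn(4, 4)     # update proposed by meta-tuning

sparsity = 0.75                             # keep only the largest 25% of update entries
k = int((1 - sparsity) * dense_update.numel())
threshold = dense_update.abs().flatten().topk(k).values.min()
mask = (dense_update.abs() >= threshold).float()

# Sparse interpolation: only the masked subset of parameters is actually changed.
tuned = pretrained + mask * dense_update
```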
