Publications

2025
StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples. Ajay Patel, Jiacheng Zhu, Justin Qiu, Zachary Horvitz, Marianna Apidianaki, Kathleen McKeown, Chris Callison-Burch. NAACL 2025. Abstract StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Data Code BibTex StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples @inproceedings{patel-etal-2025-styledistance, title = "{S}tyle{D}istance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples", author = "Patel, Ajay and Zhu, Jiacheng and Qiu, Justin and Horvitz, Zachary and Apidianaki, Marianna and McKeown, Kathleen and Callison-Burch, Chris", editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu", booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)", month = apr, year = "2025", address = "Albuquerque, New Mexico", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2025.naacl-long.436/", pages = "8662--8685", ISBN = "979-8-89176-189-6", abstract = "Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications." }
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi. CVPR 2025. Abstract Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog Data Code Website BibTex Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models @inproceedings{deitke2025molmo, title={Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models}, author={Deitke, Matt and Clark, Christopher and Lee, Sangho and Tripathi, Rohun and Yang, Yue and Park, Jae Sung and Salehi, Mohammadreza and Muennighoff, Niklas and Lo, Kyle and Soldaini, Luca and Lu, Jiasen and Anderson, Taira and Bransom, Erin and Ehsani, Kiana and Ngo, Huong and Chen, YenSung and Patel, Ajay and Yatskar, Mark and Callison-Burch, Chris and Head, Andrew and Hendrix, Rose and Bastani, Favyen and VanderBilt, Eli and Lambert, Nathan and Chou, Yvonne and Chheda, Arnavi and Sparks, Jenna and Skjonsberg, Sam and Schmitz, Michael and Sarnat, Aaron and Bischoff, Byron and Walsh, Pete and Newell, Chris and Wolters, Piper and Gupta, Tanmay and Zeng, Kuo-Hao and Borchardt, Jon and Groeneveld, Dirk and Nam, Crystal and Lebrecht, Sophie and Wittlif, Caitlin and Schoenick, Carissa and Michel, Oscar and Krishna, Ranjay and Weihs, Luca and Smith, Noah A. and Hajishirzi, Hannaneh and Girshick, Ross and Farhadi, Ali and Kembhavi, Aniruddha}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2025}, address={Nashville, Tennessee}, url={https://arxiv.org/abs/2409.17146} }
Concept Lancet: Image Editing with Compositional Representation Transplant. Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal. CVPR 2025. Abstract Concept Lancet: Image Editing with Compositional Representation Transplant Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation. Data Code Website BibTex Concept Lancet: Image Editing with Compositional Representation Transplant @inproceedings{luo2025concept, title={Concept Lancet: Image Editing with Compositional Representation Transplant}, author={Luo, Jinqi and Ding, Tianjiao and Chan, Kwan Ho Ryan and Min, Hancheng and Callison-Burch, Chris and Vidal, Ren{\'e}}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2025}, address={Nashville, Tennessee}, url={https://arxiv.org/abs/2504.02828} }
ViUniT: Visual Unit Tests for More Robust Visual Programming. Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles. CVPR 2025. Abstract ViUniT: Visual Unit Tests for More Robust Visual Programming Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%. Code Website BibTex ViUniT: Visual Unit Tests for More Robust Visual Programming @inproceedings{panagopoulou2025viunit, title={ViUniT: Visual Unit Tests for More Robust Visual Programming}, author={Panagopoulou, Artemis and Zhou, Honglu and Savarese, Silvio and Xiong, Caiming and Callison-Burch, Chris and Yatskar, Mark and Niebles, Juan Carlos}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2025}, address={Nashville, Tennessee}, url={https://arxiv.org/abs/2412.08859}, doi={10.48550/arXiv.2412.08859} }
Calibrating Large Language Models with Sample Consistency. Veronica Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch. AAAI 2025. Abstract Calibrating Large Language Models with Sample Consistency Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency have the potential to enhance model performance. Finally, we offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs. BibTex Calibrating Large Language Models with Sample Consistency @inproceedings{lyu2025calibrating, title={Calibrating Large Language Models with Sample Consistency}, author={Lyu, Veronica Qing and Shridhar, Kumar and Malaviya, Chaitanya and Zhang, Li and Elazar, Yanai and Tandon, Niket and Apidianaki, Marianna and Sachan, Mrinmaya and Callison-Burch, Chris}, booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)}, year={2025}, url={https://arxiv.org/abs/2402.13904} }
NSF-SciFy: Mining the NSF Awards Database for Scientific Claims. Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch. arXiv 2025. Unpublished preprint. Abstract NSF-SciFy: Mining the NSF Awards Database for Scientific Claims We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate 3 three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research. BibTex NSF-SciFy: Mining the NSF Awards Database for Scientific Claims @misc{rao2025nsfscify, title={{NSF-SciFy}: Mining the NSF Awards Database for Scientific Claims}, author={Rao, Delip and You, Weiqiu and Wong, Eric and Callison-Burch, Chris}, year={2025}, eprint={2503.08600}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2503.08600} }
mStyleDistance: Multilingual Style Embeddings and their Evaluation. Justin Qiu, Jiacheng Zhu, Ajay Patel, Marianna Apidianaki, Chris Callison-Burch. arXiv 2025. Unpublished preprint. Abstract mStyleDistance: Multilingual Style Embeddings and their Evaluation Style embeddings are useful for stylistic analysis and style transfer; however, only English style embeddings have been made available. We introduce Multilingual StyleDistance (mStyleDistance), a multilingual style embedding model trained using synthetic data and contrastive learning. We train the model on data from nine languages and create a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess the embeddings' quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing models on these multilingual style benchmarks and generalize well to unseen features and languages. We make our model publicly available at https://huggingface.co/StyleDistance/mstyledistance. Website BibTex mStyleDistance: Multilingual Style Embeddings and their Evaluation @misc{qiu2025mstyledistance, title={{mStyleDistance}: Multilingual Style Embeddings and their Evaluation}, author={Qiu, Justin and Zhu, Jiacheng and Patel, Ajay and Apidianaki, Marianna and Callison-Burch, Chris}, year={2025}, eprint={2502.15168}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2502.15168} }
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation. Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark. arXiv 2025. Unpublished preprint. Abstract Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments. Website BibTex Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation @misc{yang2025scaling, title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation}, author={Yang, Yue and Patel, Ajay and Deitke, Matt and Gupta, Tanmay and Weihs, Luca and Head, Andrew and Yatskar, Mark and Callison-Burch, Chris and Krishna, Ranjay and Kembhavi, Aniruddha and Clark, Christopher}, year={2025}, eprint={2502.14846}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2502.14846} }
Media Bias Detector: Designing and Implementing a Tool for Real-Time Selection and Framing Bias Analysis in News Coverage. Jenny S Wang, Samar Haider, Amir Tohidi, Anushkaa Gupta, Yuxuan Zhang, Chris Callison-Burch, David Rothschild, Duncan J Watts. CHI 2025. Abstract Media Bias Detector: Designing and Implementing a Tool for Real-Time Selection and Framing Bias Analysis in News Coverage Mainstream media, through their decisions on what to cover and how to frame the stories they cover, can mislead readers without using outright falsehoods. Therefore, it is crucial to have tools that expose these editorial choices underlying media bias. In this paper, we introduce the Media Bias Detector, a tool for researchers, journalists, and news consumers. By integrating large language models, we provide near real-time granular insights into the topics, tone, political lean, and facts of news articles aggregated to the publisher level. We assessed the tool's impact by interviewing 13 experts from journalism, communications, and political science, revealing key insights into usability and functionality, practical applications, and AI's role in powering media bias tools. We explored this in more depth with a follow-up survey of 150 news consumers. This work highlights opportunities for AI-driven tools that empower users to critically engage with media content, particularly in politically charged environments. BibTex Media Bias Detector: Designing and Implementing a Tool for Real-Time Selection and Framing Bias Analysis in News Coverage @inproceedings{wang2025media, title={Media Bias Detector: Designing and Implementing a Tool for Real-Time Selection and Framing Bias Analysis in News Coverage}, author={Wang, Jenny S and Haider, Samar and Tohidi, Amir and Gupta, Anushkaa and Zhang, Yuxuan and Callison-Burch, Chris and Rothschild, David and Watts, Duncan J}, booktitle={Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)}, year={2025}, address={Yokohama, Japan}, url={https://arxiv.org/abs/2502.06009} }
GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge. Liam Dugan, Andrew Zhu, Firoj Alam, Preslav Nakov, Marianna Apidianaki, Chris Callison-Burch. GenAIDetect @ COLING 2025. Abstract GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate -- suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research. Website BibTex GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge @inproceedings{dugan-etal-2025-genai, title = "{G}en{AI} Content Detection Task 3: Cross-Domain Machine Generated Text Detection Challenge", author = "Dugan, Liam and Zhu, Andrew and Alam, Firoj and Nakov, Preslav and Apidianaki, Marianna and Callison-Burch, Chris", editor = "Alam, Firoj and Nakov, Preslav and Habash, Nizar and Gurevych, Iryna and Chowdhury, Shammur and Shelmanov, Artem and Wang, Yuxia and Artemova, Ekaterina and Kutlu, Mucahid and Mikros, George", booktitle = "Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)", month = jan, year = "2025", address = "Abu Dhabi, UAE", publisher = "International Conference on Computational Linguistics", url = "https://aclanthology.org/2025.genaidetect-1.45/", pages = "377--388", abstract = "Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99{\%} on machine-generated text from RAID while maintaining a 5{\%} False Positive Rate{---}suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research." }
WithdrarXiv: A Large-Scale Dataset for Retraction Study. Delip Rao, Jonathan Young, Thomas Dietterich, Chris Callison-Burch. arXiv 2025. Unpublished preprint. Press Press Coverage Nature News - January 6, 2024 - ‘WithdrarXiv’ database of 14,000 retracted preprints launches Abstract WithdrarXiv: A Large-Scale Dataset for Retraction Study Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository's entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy, an enriched version including scripts for parsed full-text PDFs, specifically designed to enable research in scientific feasibility studies, claim verification, and automated theorem proving. These findings provide valuable insights for improving scientific quality control and automated verification systems. Finally, and most importantly, we discuss ethical issues and take a number of steps to implement responsible data release while fostering open science in this area. Data Code BibTex WithdrarXiv: A Large-Scale Dataset for Retraction Study @misc{rao2024withdrarxiv, title={{WithdrarXiv}: A Large-Scale Dataset for Retraction Study}, author={Delip Rao and Jonathan Young and Thomas Dietterich and Chris Callison-Burch}, year={2024}, eprint={2412.03775}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2412.03775}, }
2024
ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems. Andrew Zhu, Liam Dugan, Chris Callison-Burch. EMNLP 2024. Abstract ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems Recently, there has been increasing interest in using Large Language Models (LLMs) to construct complex multi-agent systems to perform tasks such as compiling literature reviews, drafting consumer reports, and planning vacations. Many tools and libraries exist for helping create such systems, however none support recursive multi-agent systems—where the models themselves flexibly decide when to delegate tasks and how to organize their delegation structure. In this work, we introduce ReDel: a toolkit for recursive multi-agent systems that supports custom tool-use, delegation schemes, event-based logging, and interactive replay in an easy-to-use web interface. We show that, using ReDel, we are able to achieve significant performance gains on agentic benchmarks and easily identify potential areas of improvements through the visualization and debugging tools. Our code, documentation, and PyPI package are open-source at https://github.com/zhudotexe/redel, and free to use under the MIT license Data Code BibTex ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems @inproceedings{zhu-etal-2024-redel, title = "{R}e{D}el: A Toolkit for {LLM}-Powered Recursive Multi-Agent Systems", author = "Zhu, Andrew and Dugan, Liam and Callison-Burch, Chris", editor = "Hernandez Farias, Delia Irazu and Hope, Tom and Li, Manling", booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.emnlp-demo.17", pages = "162--171", abstract = "Recently, there has been increasing interest in using Large Language Models (LLMs) to construct complex multi-agent systems to perform tasks such as compiling literature reviews, drafting consumer reports, and planning vacations. Many tools and libraries exist for helping create such systems, however none support recursive multi-agent systems{---}where the models themselves flexibly decide when to delegate tasks and how to organize their delegation structure. In this work, we introduce ReDel: a toolkit for recursive multi-agent systems that supports custom tool-use, delegation schemes, event-based logging, and interactive replay in an easy-to-use web interface. We show that, using ReDel, we are able to achieve significant performance gains on agentic benchmarks and easily identify potential areas of improvements through the visualization and debugging tools. Our code, documentation, and PyPI package are open-source at https://github.com/zhudotexe/redel, and free to use under the MIT license.", }
TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings. Zachary Horvitz, Ajay Patel, Kanishk Singh, Chris Callison-Burch, Kathleen McKeown, Zhou Yu. EMNLP 2024. Abstract TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings The goal of text style transfer is to transform the style of texts while preserving their original meaning, often with only a few examples of the target style. Existing style transfer methods generally rely on the few-shot capabilities of large language models or on complex controllable text generation approaches that are inefficient and underperform on fluency metrics. We introduce TinyStyler, a lightweight but effective approach, which leverages a small language model (800M params) and pre-trained authorship embeddings to perform efficient, few-shot text style transfer. We evaluate on the challenging task of authorship style transfer and find TinyStyler outperforms strong approaches such as GPT-4. We also evaluate TinyStyler’s ability to perform text attribute style transfer (formal ↔ informal) with automatic and human evaluations and find that the approach outperforms recent controllable text generation methods. Website BibTex TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings @inproceedings{horvitz-etal-2024-tinystyler, title = "{T}iny{S}tyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings", author = "Horvitz, Zachary and Patel, Ajay and Singh, Kanishk and Callison-Burch, Chris and McKeown, Kathleen and Yu, Zhou", editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.findings-emnlp.781", pages = "13376--13390", abstract = "The goal of text style transfer is to transform the style of texts while preserving their original meaning, often with only a few examples of the target style. Existing style transfer methods generally rely on the few-shot capabilities of large language models or on complex controllable text generation approaches that are inefficient and underperform on fluency metrics. We introduce TinyStyler, a lightweight but effective approach, which leverages a small language model (800M params) and pre-trained authorship embeddings to perform efficient, few-shot text style transfer. We evaluate on the challenging task of authorship style transfer and find TinyStyler outperforms strong approaches such as GPT-4. We also evaluate TinyStyler{'}s ability to perform text attribute style transfer (formal $\leftrightarrow$ informal) with automatic and human evaluations and find that the approach outperforms recent controllable text generation methods.", }
MiRAGeNews: Multimodal Realistic AI-Generated News Detection. Runsheng (Anson) Huang, Liam Dugan, Yue Yang, Chris Callison-Burch. emnlp 2024. Abstract MiRAGeNews: Multimodal Realistic AI-Generated News Detection The proliferation of inflammatory or misleading “fake” news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two—AI-generated fake news content—is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs (< 24% F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content. Data BibTex MiRAGeNews: Multimodal Realistic AI-Generated News Detection @inproceedings{huang-etal-2024-miragenews, title = "{M}i{RAG}e{N}ews: Multimodal Realistic {AI}-Generated News Detection", author = "Huang, Runsheng and Dugan, Liam and Yang, Yue and Callison-Burch, Chris", editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.findings-emnlp.959", pages = "16436--16448", abstract = "The proliferation of inflammatory or misleading {``}fake{''} news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two{---}AI-generated fake news content{---}is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60{\%} F-1) and state-of-the-art multi-modal LLMs ({\textless} 24{\%} F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1{\%} F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.", }
A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis. Yue Yang, Mona Gandhi, Yufei Wang, Yifan Wu, Michael S. Yao, Chris Callison-Burch, James C. Gee, Mark Yatskar. NeurIPS 2024. Press Press Coverage Penn Engineering - October 15, 2024 - Training Medical AI with Knowledge, Not Shortcuts Abstract A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis While deep networks have achieved broad success in analyzing natural images, when applied to medical scans, they often fail in unexcepted situations. We investigate this challenge and focus on model sensitivity to domain shifts, such as data sampled from different hospitals or data confounded by demographic variables such as sex, race, etc, in the context of chest X-rays and skin lesion images. A key finding we show empirically is that existing visual backbones lack an appropriate prior from the architecture for reliable generalization in these settings. Taking inspiration from medical training, we propose giving deep networks a prior grounded in explicit medical knowledge communicated in natural language. To this end, we introduce Knowledge-enhanced Bottlenecks (KnoBo), a class of concept bottleneck models that incorporates knowledge priors that constrain it to reason with clinically relevant factors found in medical textbooks or PubMed. KnoBo uses retrieval-augmented language models to design an appropriate concept space paired with an automatic training procedure for recognizing the concept. We evaluate different resources of knowledge and recognition architectures on a broad range of domain shifts across 20 datasets. In our comprehensive evaluation with two imaging modalities, KnoBo outperforms fine-tuned models on confounded datasets by 32.4% on average. Finally, evaluations reveal that PubMed is a promising resource for making medical models less sensitive to domain shift, outperforming other resources on both diversity of information and final prediction performance. Code Website BibTex A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis @inproceedings{yang2024textbookremedydomainshifts, title={A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis}, author={Yang, Yue and Gandhi, Mona and Wang, Yufei and Wu, Yifan and Yao, Michael S. and Callison-Burch, Chris and Gee, James C. and Yatskar, Mark}, booktitle={Advances in Neural Information Processing Systems (NeurIPS)}, year={2024}, url={https://neurips.cc/virtual/2024/poster/95098}, }
PaCE: Parsimonious Concept Engineering for Large Language Models. Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal. NeurIPS 2024. Abstract PaCE: Parsimonious Concept Engineering for Large Language Models Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities. BibTex PaCE: Parsimonious Concept Engineering for Large Language Models @inproceedings{luo2024pace, title={{PaCE}: Parsimonious Concept Engineering for Large Language Models}, author={Luo, Jinqi and Ding, Tianjiao and Chan, Kwan Ho Ryan and Thaker, Darshan and Chattopadhyay, Aditya and Callison-Burch, Chris and Vidal, Ren{\'e}}, booktitle={Advances in Neural Information Processing Systems (NeurIPS)}, year={2024}, url={https://peterljq.github.io/project/pace/PaCE.pdf} }
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark. COLM 2024. Abstract CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. While recent work, e.g., Reflexion, has demonstrated how such agents can also self-improve by adding a textual memory of “hints” learned from prior experience, such improvements have been limited both in size and scope. In contrast, our goal is a language agent that can robustly improve performance over time, including when both the task and environment are varied. Our approach is to have the agent learn a textual representation of how the world works (rather than just isolated hints), expressed as a memory of causal abstractions, to guide future decision-making. In experiments, we find CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 points in ScienceWorld and 1.4 points in ALFWorld benchmarks. CLIN can also transfer its learning to new environments and tasks, enhancing performance by 21 points in ScienceWorld and 11 points in ALFWorld. This suggests that language agents with a textual causal memory can play a significant role in interactive environments, including being able to rapidly improve over time. Website BibTex CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization @inproceedings{clin-continual-learning-from-interactions, title = {{CLIN}: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization}, author = {Majumder, Bodhisattwa Prasad and Dalvi Mishra, Bhavana and Jansen, Peter and Tafjord, Oyvind and Tandon, Niket and Zhang, Li and Callison-Burch, Chris and Clark, Peter}, booktitle = {Conference on Language Modeling ({COLM})}, year = {2024}, url = {https://www.cis.upenn.edu/~ccb/publications/faithful-model-explanations-survey.pdf}, address = {Philadelphia, PA}, month = {October} }
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. Ajay Patel, Colin Raffel, Chris Callison-Burch. ACL 2024. Abstract DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at this https URL . Code Website BibTex DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows @inproceedings{patel2024datadreamer, title={{DataDreamer}: A Tool for Synthetic Data Generation and Reproducible LLM Workflows}, author={Ajay Patel and Colin Raffel and Chris Callison-Burch}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}, year={2024}, address={Bangkok, Thailand}, month={August 11--16}, eprint={2402.10379}, archivePrefix={arXiv}, primaryClass={cs.CL} }
FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models. Andrew Zhu, Alyssa Hwang, Liam Dugan, Chris Callison-Burch. ACL 2024. Abstract FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models One type of question that is commonly found in day-to-day scenarios is ``fan-out'' questions, complex multi-hop, multi-document reasoning questions that require finding information about a large number of entities. However, there exist few resources to evaluate this type of question-answering capability among large language models. To evaluate complex reasoning in LLMs more fully, we present FanOutQA, a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions with English Wikipedia as the knowledge base. We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B, finding that contemporary models still have room to improve reasoning over inter-document dependencies in a long context. We provide our dataset and open-source tools to run models to encourage evaluation at fanoutqa.com Data Code Website BibTex FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models @inproceedings{zhu2024fanoutqa, title={{FanOutQA}: Multi-Hop, Multi-Document Question Answering for Large Language Models}, author={Andrew Zhu and Alyssa Hwang and Liam Dugan and Chris Callison-Burch}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}, year={2024}, address={Bangkok, Thailand}, month={August 11--16}, eprint={2402.14116}, archivePrefix={arXiv}, primaryClass={cs.CL} }
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. ACL 2024. Press Press Coverage Axios - September 3, 2024 - Teachers still can't trust AI text checkers Penn Engineering Today - August 12, 2024 - Detecting Machine-Generated Text: An Arms Race With the Advancements of Large Language Models Abstract RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors Many commercial and open-source models claim to detect machine-generated text with very high accuracy (99\% or higher). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging -- lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our dataset and tools to encourage further exploration into detector robustness. Video BibTex RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors @inproceedings{dugan2024raid, title={{RAID}: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors}, author={Liam Dugan and Alyssa Hwang and Filip Trhlik and Josh Magnus Ludan and Andrew Zhu and Hainiu Xu and Daphne Ippolito and Chris Callison-Burch}, booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}, year={2024}, address={Bangkok, Thailand}, month={August 11--16}, eprint={2405.07940}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Holodeck: Language Guided Generation of 3D Embodied AI Environments. Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark. CVPR 2024. Press Press Coverage Tech Briefs - June 11, 2024 - Recreating ‘Star Trek’ Virtual Environments: Holodeck generates a virtually limitless range of indoor environments, using AI to interpret users’ requests. Abstract Holodeck: Language Guided Generation of 3D Embodied AI Environments 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs for styles, and can capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents. Code Website BibTex Holodeck: Language Guided Generation of 3D Embodied AI Environments @inproceedings{yang-etal-2024-holodeck, title = {Holodeck: Language Guided Generation of 3D Embodied AI Environments}, author = {Yue Yang and Fan-Yun Sun and Luca Weihs and Eli VanderBilt and Alvaro Herrasti and Winson Han and Jiajun Wu and Nick Haber and Ranjay Krishna and Lingjie Liu and Chris Callison-Burch and Mark Yatskar and Aniruddha Kembhavi and Christopher Clark }, booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)}, year = {2024}, address = {Seattle, Washington}, publisher = "IEEE/CVF", url = {https://yueyang1996.github.io/papers/holodeck.pdf} }
CoMo: Controllable Motion Generation through Language Guided Pose Code Editing. Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu. ECCV 2024. Abstract CoMo: Controllable Motion Generation through Language Guided Pose Code Editing Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as "left knee slightly bent". Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities. Website BibTex CoMo: Controllable Motion Generation through Language Guided Pose Code Editing @inproceedings{huang2024como, title={{CoMo}: Controllable Motion Generation through Language Guided Pose Code Editing}, author={Huang, Yiming and Wan, Weilin and Yang, Yue and Callison-Burch, Chris and Yatskar, Mark and Liu, Lingjie}, booktitle={Computer Vision--ECCV 2024}, pages={180--196}, year={2024}, publisher={Springer}, doi={10.1007/978-3-031-73397-0_11}, url={https://link.springer.com/chapter/10.1007/978-3-031-73397-0_11} }
BORDIRLINES: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation. Bryan Li, Samar Haider, Fiona Luo, Adwait Agashe, Chris Callison-Burch. First Workshop on Advancing Natural Language Processing for Wikipedia 2024. Abstract BORDIRLINES: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs’ responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM’s response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges. BibTex BORDIRLINES: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation @inproceedings{li-etal-2024-bordirlines, title = "{B}ord{IR}lines: A Dataset for Evaluating Cross-lingual Retrieval Augmented Generation", author = "Li, Bryan and Haider, Samar and Luo, Fiona and Agashe, Adwait and Callison-Burch, Chris", editor = "Lucie-Aim{\'e}e, Lucie and Fan, Angela and Gwadabe, Tajuddeen and Johnson, Isaac and Petroni, Fabio and van Strien, Daniel", booktitle = "Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.wikinlp-1.3", pages = "1--13", abstract = "Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs{'} responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM{'}s response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges.", }
Uncovering Differences in Persuasive Language in Russian versus English Wikipedia. Bryan Li, Aleksey Panasyuk, Chris Callison-Burch. First Workshop on Advancing Natural Language Processing for Wikipedia 2024. Abstract Uncovering Differences in Persuasive Language in Russian versus English Wikipedia We study how differences in persuasive language across Wikipedia articles, written in either English and Russian, can uncover each culture’s distinct perspective on different subjects. We develop a large language model (LLM) powered system to identify instances of persuasive language in multilingual texts. Instead of directly prompting LLMs to detect persuasion, which is subjective and difficult, we propose to reframe the task to instead ask high-level questions (HLQs) which capture different persuasive aspects. Importantly, these HLQs are authored by LLMs themselves. LLMs over-generate a large set of HLQs, which are subsequently filtered to a small set aligned with human labels for the original task. We then apply our approach to a large-scale, bilingual dataset of Wikipedia articles (88K total), using a two-stage identify-then-extract prompting strategy to find instances of persuasion. We quantify the amount of persuasion per article, and explore the differences in persuasion through several experiments on the paired articles. Notably, we generate rankings of articles by persuasion in both languages. These rankings match our intuitions on the culturally-salient subjects; Russian Wikipedia highlights subjects on Ukraine, while English Wikipedia highlights the Middle East. Grouping subjects into larger topics, we find politically-related events contain more persuasion than others. We further demonstrate that HLQs obtain similar performance when posed in either English or Russian. Our methodology enables cross-lingual, cross-cultural understanding at scale, and we release our code, prompts, and data. BibTex Uncovering Differences in Persuasive Language in Russian versus English Wikipedia @inproceedings{li-etal-2024-uncovering, title = "Uncovering Differences in Persuasive Language in {R}ussian versus {E}nglish {W}ikipedia", author = "Li, Bryan and Panasyuk, Aleksey and Callison-Burch, Chris", editor = "Lucie-Aim{\'e}e, Lucie and Fan, Angela and Gwadabe, Tajuddeen and Johnson, Isaac and Petroni, Fabio and van Strien, Daniel", booktitle = "Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia", month = nov, year = "2024", address = "Miami, Florida, USA", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.wikinlp-1.8", pages = "21--35", abstract = "We study how differences in persuasive language across Wikipedia articles, written in either English and Russian, can uncover each culture{'}s distinct perspective on different subjects. We develop a large language model (LLM) powered system to identify instances of persuasive language in multilingual texts. Instead of directly prompting LLMs to detect persuasion, which is subjective and difficult, we propose to reframe the task to instead ask high-level questions (HLQs) which capture different persuasive aspects. Importantly, these HLQs are authored by LLMs themselves. LLMs over-generate a large set of HLQs, which are subsequently filtered to a small set aligned with human labels for the original task. We then apply our approach to a large-scale, bilingual dataset of Wikipedia articles (88K total), using a two-stage identify-then-extract prompting strategy to find instances of persuasion. We quantify the amount of persuasion per article, and explore the differences in persuasion through several experiments on the paired articles. Notably, we generate rankings of articles by persuasion in both languages. These rankings match our intuitions on the culturally-salient subjects; Russian Wikipedia highlights subjects on Ukraine, while English Wikipedia highlights the Middle East. Grouping subjects into larger topics, we find politically-related events contain more persuasion than others. We further demonstrate that HLQs obtain similar performance when posed in either English or Russian. Our methodology enables cross-lingual, cross-cultural understanding at scale, and we release our code, prompts, and data.", }
Evaluating Vision-Language Models on Bistable Images. Best paper award at the Workshop on Cognitive Modeling and Computational Linguistics (CMCL). Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch. Workshop on Cognitive Modeling and Computational Linguistics 2024. Abstract Evaluating Vision-Language Models on Bistable Images Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors significantly affect model interpretations more than image manipulations showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data is open sourced. BibTex Evaluating Vision-Language Models on Bistable Images @inproceedings{panagopoulou-etal-2024-evaluating, title = "Evaluating Vision-Language Models on Bistable Images", author = "Panagopoulou, Artemis and Melkin, Coby and Callison-Burch, Chris", editor = "Kuribayashi, Tatsuki and Rambelli, Giulia and Takmaz, Ece and Wicke, Philipp and Oseki, Yohei", booktitle = "Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics", month = aug, year = "2024", address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.cmcl-1.2", doi = "10.18653/v1/2024.cmcl-1.2", pages = "8--29", abstract = "Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously, by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 121 different manipulations in brightness, resolution, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the models{'} preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors significantly affect model interpretations more than image manipulations showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data is open sourced.", }
PROC2PDDL: Open-Domain Planning Representations from Texts. Tianyi Zhang, Harry Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison-Burch, Niket Tandon. Natural Language Reasoning and Structured Explanations Workshop 2024. Abstract PROC2PDDL: Open-Domain Planning Representations from Texts Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL , the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate state-of-the-art models on defining the preconditions and effects of actions. We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%. Our analysis shows both syntactic and semantic errors, indicating LMs' deficiency in both generating domain-specific prgorams and reasoning about events. We hope this analysis and dataset helps future progress towards integrating the best of LMs and formal planning. Data Video BibTex PROC2PDDL: Open-Domain Planning Representations from Texts @inproceedings{zhang2024proc2pddlopendomainplanningrepresentations, title={{PROC2PDDL}: Open-Domain Planning Representations from Texts}, author={Tianyi Zhang and Li Zhang and Zhaoyi Hou and Ziyu Wang and Yuling Gu and Peter Clark and Chris Callison-Burch and Niket Tandon}, booktitle={Proceedings of the Natural Language Reasoning and Structured Explanations Workshop at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}, year={2024}, month={August}, address={Bangkok, Thailand}, eprint={2403.00092}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2403.00092}, }
PDDLego: Iterative Planning in Textual Environments. Li Zhang, Peter Jansen, Tianyi Zhang, Peter Clark, Chris Callison-Burch, Niket Tandon. SEM 2024. Abstract PDDLego: Iterative Planning in Textual Environments Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%). BibTex PDDLego: Iterative Planning in Textual Environments @inproceedings{zhang-etal-2024-pddlego, title = "{PDDLEGO}: Iterative Planning in Textual Environments", author = "Zhang, Li and Jansen, Peter and Zhang, Tianyi and Clark, Peter and Callison-Burch, Chris and Tandon, Niket", editor = "Bollegala, Danushka and Shwartz, Vered", booktitle = "Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (SEM 2024)", month = jun, year = "2024", address = "Mexico City, Mexico", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.starsem-1.17/", doi = "10.18653/v1/2024.starsem-1.17", pages = "212--221", abstract = "Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43{\%} more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98{\%}) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4{\%})." }
Large Language Models Can Self-Improve At Web Agent Tasks. Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter. arXiv 2024. Unpublished preprint. Abstract Large Language Models Can Self-Improve At Web Agent Tasks Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31\% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement. BibTex Large Language Models Can Self-Improve At Web Agent Tasks @misc{patel2024largelanguagemodelsselfimprove, title={Large Language Models Can Self-Improve At Web Agent Tasks}, author={Ajay Patel and Markus Hofmarcher and Claudiu Leoveanu-Condrei and Marius-Constantin Dinu and Chris Callison-Burch and Sepp Hochreiter}, year={2024}, eprint={2405.20309}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2405.20309}, }
Choice-75: A Dataset on Decision Branching in Script Learning. Zhaoyi Joey Hou, Li Zhang, Chris Callison-Burch. COLING 2024. Abstract Choice-75: A Dataset on Decision Branching in Script Learning Script learning studies how daily events unfold. Previous works tend to consider a script as a linear sequence of events while ignoring the potential branches that arise due to people's circumstantial choices. We hence propose Choice-75, the first benchmark that challenges intelligent systems to predict decisions given descriptive scenarios, containing 75 scripts and more than 600 scenarios. While large language models demonstrate overall decent performances, there is still notable room for improvement in many hard scenarios. BibTex Choice-75: A Dataset on Decision Branching in Script Learning @inproceedings{hou2024choice75, title={Choice-75: A Dataset on Decision Branching in Script Learning}, author={Zhaoyi Joey Hou and Li Zhang and Chris Callison-Burch}, booktitle={Proceedings of The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, year={2024}, address={Torino, Italy}, month={May}, pages={20--25} }
Towards Faithful Model Explanation in NLP: A Survey. Qing Lyu, Marianna Apidianaki, Chris Callison-Burch. Computational Linguistics 2024. Abstract Towards Faithful Model Explanation in NLP: A Survey End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation- based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP. BibTex Towards Faithful Model Explanation in NLP: A Survey @article{Lyu-et-al:CL:2024, author = {Qing Lyu and Marianna Apidianaki and Chris Callison-Burch}, title = {Towards Faithful Model Explanation in {NLP}: A Survey}, journal = {Computational Linguistics}, year = {2024}, volume = {50}, number = {2}, pages = {657–723}, url = {https://doi.org/10.1162/coli_a_00511} }
Outwit, Outplay, Out-Generate: A Framework for Designing Strategic Generative Agents in Competitive Environments. Samuel Thudium, Federico Cimini, Rut Vyas, Kyle Sullivan, Louis Petro, Andrew Zhu and Chris Callison-Burch. Wordplay Workshop 2024. Abstract Outwit, Outplay, Out-Generate: A Framework for Designing Strategic Generative Agents in Competitive Environments We explore the strategic capabilities of generative agents in a series of social competitive games that emulate the television show Survivor. Large Language Models have been shown to act as intelligent agents through the addition of external cognitive architectures, but it is still unknown how these agents perform in competitive, multi-agent environments. We suggest a framework, built on top of a frozen large language model, GPT-4, for designing generative agents in competitive, episodic environments and we evaluate their game performance. We provide new modules which enable strategic agents to set and evaluate goals, develop theories of mind about other agents, and embody descriptive personas that affect their behavior. Across many simulations, though agents with varied cognitive abilities displayed even performance in a "last-agent-standing" scenario, strategic agents exhibit more diverse action selection, including an affinity for exploratory behaviors. We also observe that agents with personas tailored to the social environment are significantly more likely to win these challenges. Finally, goal-driven agents perform well in a search-based game, displaying an emergent capacity to rapidly utilize information perceived from their environment to inform task-related actions. BibTex Outwit, Outplay, Out-Generate: A Framework for Designing Strategic Generative Agents in Competitive Environments @inproceedings{thudium-et-al-2024, title = {Outwit, Outplay, Out-Generate: A Framework for Designing Strategic Generative Agents in Competitive Environments}, author = {Samuel Thudium and Federico Cimini and Rut Vyas and Kyle Sullivan and Louis Petro and Andrew Zhu and Chris Callison-Burch}, booktitle = {Proceedings of the Wordplay Workshop}, year = {2024}, url = {https://www.cis.upenn.edu/~ccb/publications/survivor-sim.pdf}, address = {Bangkok, Thailand}, date = {August 16, 2024} }
You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling. Jaewoo Song, Andrew Zhu and Chris Callison-Burch. Wordplay Workshop 2024. Abstract You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling Developing a consistent and reliable AI game master for text-based games is a challenging task due to the limitations of large language models (LLMs) and the complexity of the game master's role. This paper presents a novel approach to enhance AI game masters by leveraging function calling in the context of the table-top role-playing game "Jim Henson’s Labyrinth: The Adventure Game." Our methodology involves integrating game-specific controls through functions, improving narrative quality and state update consistency. The experimental results, based on human evaluations and unit tests, demonstrate the effectiveness of our approach in enhancing the gameplay experience and maintaining coherence with the game state. BibTex You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling @inproceedings{song-et-al-2024, title = {You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling}, author = {Jaewoo Song and Andrew Zhu and Chris Callison-Burch}, booktitle = {Proceedings of the Wordplay Workshop}, year = {2024}, url = {https://www.cis.upenn.edu/~ccb/publications/labyrinth-ai-gm.pdf}, address = {Bangkok, Thailand}, date = {August 16, 2024}
DAGGER: Data Augmentation for Generative Gaming in Enriched Realms. Chris Callison-Burch, Ajay Patel, James Dennis, and Andrew Zhu. Wordplay Workshop 2024. Abstract DAGGER: Data Augmentation for Generative Gaming in Enriched Realms DAGGER is a synthetically generated dataset for creating text adventure games from fiction and for generating prose from game states. It extends the LIGHT dataset by enhancing the original game locations with new characters and items generated using GPT-4. Additionally, each location is paired with a piece of short fiction, totaling 259,582 words across the entire dataset. DAGGER was created using a novel methodology that leverages the capabilities of GPT-4 and a new tool called DataDreamer, which simplifies the implementation of complex prompting workflows and enables the generation of high-quality synthetic training data. The DAGGER dataset is used to train two baseline models: one that converts game states into fictional narrative descriptions and another that predicts game states from fiction. These models demonstrate the usefulness of the dataset for tasks involving the mapping between stories and playable text adventure games. We make the dataset, the trained models, and a Python game engine publicly available to facilitate further research. BibTex DAGGER: Data Augmentation for Generative Gaming in Enriched Realms @inproceedings{callisonburch-et-al-2024, title = {DAGGER: Data Augmentation for Generative Gaming in Enriched Realms}, author = {Chris Callison-Burch and Ajay Patel and James Dennis and Andrew Zhu}, booktitle = {Proceedings of the Wordplay Workshop}, year = {2024}, url = {https://www.cis.upenn.edu/~ccb/publications/dagger.pdf}, address = {Bangkok, Thailand}, date = {August 16, 2024}
WorldWeaver: Procedural World Generation for Text Adventure Games using Large Language Models. Meiqing Jin, Manvi Kaul, Shriya Ramakrishanan, Hardik Jain, Samarth Chandrawat, Ishita Agarwal, Tianyi Zhang, Andrew Zhu, and Chris Callison-Burch. Wordplay Workshop 2024. Abstract WorldWeaver: Procedural World Generation for Text Adventure Games using Large Language Models WorldWeaver proposes a novel approach for enhancing content generation in text adventure games through procedural world generation using large language models. Our system generates game elements including locations, characters, and items, integrating human-in-the-loop intervention to better align with creator intentions. The methodology improves semantic and logical coherence, allowing for interactive creation of complex virtual environments. By utilizing few-shot prompting and task-specific models, we ensure the consistency of the generated game world, which is validated through qualitative and quantitative evaluations. The project provides tools for dynamic world generation, including a Python class for game execution. Code BibTex WorldWeaver: Procedural World Generation for Text Adventure Games using Large Language Models @inproceedings{jin-et-al-2024, title = {WorldWeaver: Procedural World Generation for Text Adventure Games using Large Language Models}, author = {Meiqing Jin and Manvi Kaul and Shriya Ramakrishanan and Hardik Jain and Samarth Chandrawat and Ishita Agarwal and Tianyi Zhang and Andrew Zhu and Chris Callison-Burch}, booktitle = {Proceedings of the Wordplay Workshop}, year = {2024}, url = {https://www.cis.upenn.edu/~ccb/publications/worldweaver.pdf}, address = {Bangkok, Thailand}, date = {August 16, 2024}
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer. Zachary Horvitz, Ajay Patel, Chris Callison-Burch, Zhou Yu, Kathleen McKeown. AAAI 2024. Abstract ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g, formality) to authorship (e.g, Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer. BibTex ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer @inproceedings{horvitz2023paraguide, title={{ParaGuide}: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer}, author={Zachary Horvitz and Ajay Patel and Chris Callison-Burch and Zhou Yu and Kathleen McKeown}, year={2024}, address={Vancouver, Canada}, date={February 20-27, 2024}, eprint={2308.15459}, archivePrefix={arXiv}, primaryClass={cs.CL} }
2023
Understanding Generative Artificial Intelligence and Its Relationship to Copyright. Chris Callison-Burch. The U.S. House of Representatives Judiciary Committee Subcommittee on Courts, Intellectual Property, and the Internet Hearing on Artificial Intelligence and Intellectual Property: Part I – Interoperability of AI and Copyright Law 2023. Press Press Coverage American Bar Association Journal - August 30, 2023 - Copyright Law and Generative AI: What a mess Fox News - May 20, 2023 - What does Congress need to do amid AI boom? Abstract Understanding Generative Artificial Intelligence and Its Relationship to Copyright Chairman Issa, Ranking Member Johnson, and distinguished Members of the Subcommittee, thank you for the opportunity to testify on the topic of artificial intelligence and intellectual property. Generative AI had its breakthrough moment last November with the release of OpenAI’s ChatGPT, bringing my field of research into the public eye and generating a huge amount of excitement. I had access to OpenAI's large language model about a year and a half before the public. Despite having been working in this field for 20 years, I was shocked at its capabilities. My first encounter with it pitched me into a career existential crisis. This technology had seemingly solved many of the research problems that I was working on. It could translate texts from Russian into English. It could write coherent summaries of long documents, and answer questions about them. I wondered if there was any room left in the field for academic research, because training these large language models required Google-sized data centers, and I simply can’t compete at that scale. So I asked myself, "Should I just drop out of computer science and become a poet?" But then I trained the model to write better poetry than me. I have subsequently calmed down, and I do not think I’m at imminent risk of losing my job to ChatGPT. But I understand that many other people are experiencing that same sense of panic that I had. Artists and writers are worried that their work will be devalued. I worry that a career as a paralegal may go the way of the lamplighter. I think that at its core, what we are talking about today goes far beyond copyright. It is about the value of work. This is a truly transformative technology that will shape many aspects of our lives. I hope that it is for the better. I optimistically believe that AI will enable us all to become more productive workers, and more creative in our artistic pursuits. In my testimony today, I hope to offer the Subcommittee: 1. My expertise in the technical aspects of generative AI, I promise to explain it in a way that is understandable without requiring a PhD in computer science. 2. Answers to any questions you may have about the emerging capabilities of this technology or what impact legislation might have on innovation in this field 3. Advocacy to retain fair use for the purposes of training AI systems. In my written testimony, I have provided you and your staff with an overview of how Generative AI works. I’m happy to review any details about it, either during this hearing or 1-on-1 anytime in the future. To briefly summarize written testimony: Generative AI is trained on huge amounts of data. Large Language Models are now trained on roughly 1 trillion words. Image generators are trained on hundreds of millions of images. Much or most of that data consists of copyrighted works that have been gathered by automatically crawling the web. From this data, AI systems learn. Their learning process is called “pre-training”. Pre-training an AI system is different from how children learn, but the effect is similar. AI systems learn how to use language. They learn facts about the world, ideas and opinions, visual concepts, and they even learn some rudimentary common sense reasoning skills. After this pre-training on the huge amount of copyrighted data, that data is set aside. They are then trained to specialize in different tasks – often using custom built data sets that are much smaller. For instance, a large language model could be specialized or “fine tuned” to become an intelligent tutoring system, or a computer vision system could be “fine tuned” to recognize cancerous growths in mammograms. These systems could not be as easily adapted to these specialized tasks without the general knowledge that they acquired from the copyrighted texts that they were pre-trained on. I believe that pre-training squarely falls under fair use of copyrighted texts, and that Internet-era case law like Google books are good precedents that support this. I do believe that the output of Generative AI systems can infringe copyright, and that it is worth congress considering legislation to better shape copyright to govern things like copyrightable characters, or to extend it to cover Right-of-Publicity. I look forward to discussing this topic with you. Video BibTex Understanding Generative Artificial Intelligence and Its Relationship to Copyright @misc{CallisonBurch_2023, author = {Callison-Burch, Chris}, title = {Understanding Generative Artificial Intelligence and Its Relationship to Copyright}, howpublished = {Testimony before The U.S. House of Representatives Judiciary Committee, Subcommittee on Courts, Intellectual Property, and the Internet}, month = {May}, year = {2023}, note = {Hearing on Artificial Intelligence and Intellectual Property: Part I – Interoperability of AI and Copyright Law}, institution = {University of Pennsylvania, School of Engineering and Applied Sciences, Department of Computer and Information Science} }
AI2's Response to the US Copyright Requence for Comments on Artificial Intelligence and Copyright. Ali Farhadi, David Atkinson, Chris Callison-Burch, Nicole DeCario, Jennifer Dumas, Kyle Lo, Luca Soldiani. US Copyright Office Docket No. 2023-6 2023. Abstract AI2's Response to the US Copyright Requence for Comments on Artificial Intelligence and Copyright We are AI researchers, engineers, policy advisors, and legal counsel from The Allen Institute for Artificial Intelligence (AI2). We offer the following submission in response to the U.S. Copyright Office, Library of Congress Notice of inquiry and request for comments (RFC). Our response centers on the use of copyrighted materials to train AI Models, as defined in the RFC, and the Output derived from the Training Materials via the AI Models or AI Systems (Output).1 In this response, we provide background and details on the technical aspects of training AI Models, as outlined in the Training section of the RFC, and we offer two recommendations for consideration. AI2 is a non-profit research institute founded in 2014 with the mission of conducting high-impact AI research and engineering in service of the common good. AI2 is the creation of the late Paul G. Allen, philanthropist and Microsoft co-founder, and is led by Dr. Ali Farhadi. Headquartered in Seattle, AI2 employs over 200 world-class AI researchers and engineers from across the globe. We share Paul Allen’s vision and belief that AI can transform lives in positive ways. Generative AI has potential applications that will benefit society, including medical diagnosis, treatment, and cure research; assistive technologies for people with disabilities; intelligent tutoring systems for personalized and more equitable education; and climate modeling to predict impacts in specific regions. We also recognize the inherent and potential challenges that exist with this technology. Our focus at AI2 is to work not only on cutting edge AI research, but also at the intersection of AI ethics, AI policy, and AI literacy to create solutions that enable a future where AI is universally designed, developed, and deployed responsibly. Starting in March 2023, researchers at AI2 have been building a state-of-the-art generative language model called OLMo (Open Language Model). AI2 expects to publicly release OLMo in early 2024. Our goal is to produce an AI Model designed for scientific research that provides access and education around all aspects of AI Model creation. This summer we released Dolma, the Training Dataset used to create OLMo. Dolma2, consists of 3 trillion tokens from a diverse mix of web content, academic publications, software code, books, and encyclopedic materials. We offer our feedback here as a nonprofit research institute with first-hand experience training generative AI Models from scratch. We will describe aspects of model training, collection of Training Material for AI Models, and related copyright considerations. BibTex AI2's Response to the US Copyright Requence for Comments on Artificial Intelligence and Copyright @misc{ai2response2023, title = {{AI2's Response to the US Copyright Requence for Comments on Artificial Intelligence and Copyright}}, author = {Farhadi, Ali and Atkinson, David and Callison-Burch, Chris and DeCario, Nicole and Dumas, Jennifer and Lo, Kyle and Soldiani, Luca}, howpublished = {US Copyright Office Docket No. 2023-6}, year = {2023}, url = {https://www.cis.upenn.edu/~ccb/publications/understanding-generative-AI-and-its-relationship-to-copyright.pdf}, note = {Comment} }
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, ... (450 authors including Chris Callison-Burch), Ziyi Wu. TMLR 2023. Abstract Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting. Data Code Website BibTex Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models @article{srivastava2023beyond, title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models}, author={Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and Fisch, Adam and Brown, Adam R and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri{\`a} and others}, journal={Transactions on Machine Learning Research}, year={2023}, url={https://arxiv.org/abs/2206.04615} }
The Gender Wage Gap in an Online Labor Market: The Cost of Interruptions. Abi Adams-Prassl, Kotaro Hara, Kristy Milland, Chris Callison-Burch. The Review of Economics and Statistics 2023. Abstract The Gender Wage Gap in an Online Labor Market: The Cost of Interruptions This paper analyzes gender differences in working patterns and wages on Amazon Mechanical Turk, a popular online labor platform. Using information on 2 million tasks, we find no gender differences in task selection nor experience. Nonetheless, women earn 20% less per hour on average. Gender differences in working patterns are a significant driver of this wage gap. Women are more likely to interrupt their working time on the platform with consequences for their task completion speed. A follow-up survey shows that the gender differences in working patterns and hourly wages are concentrated amongst workers with children. BibTex The Gender Wage Gap in an Online Labor Market: The Cost of Interruptions @article{10.1162/rest_a_01282, author = {Adams-Prassl, Abi and Hara, Kotaro and Milland, Kristy and Callison-Burch, Chris}, title = {The Gender Wage Gap in an Online Labor Market: The Cost of Interruptions}, journal = {The Review of Economics and Statistics}, pages = {1-23}, year = {2023}, month = {02}, abstract = {This paper analyses gender differences in working patterns and wages on Amazon Mechanical Turk, a popular online labour platform. Using information on 2 million tasks, we find no gender differences in task selection nor experience. Nonetheless, women earn 20\\% less per hour on average. Gender differences in working patterns are a significant driver of this wage gap. Women are more likely to interrupt their working time on the platform with consequences for their task completion speed. A follow-up survey shows that the gender differences in working patterns and hourly wages are concentrated amongst workers with children.}, issn = {0034-6535}, doi = {10.1162/rest_a_01282}, url = {https://doi.org/10.1162/rest\_a\_01282}, eprint = {https://direct.mit.edu/rest/article-pdf/doi/10.1162/rest\_a\_01282/2070066/rest\_a\_01282.pdf} }
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark. Agent Learning in Open-Endedness (ALOE) Workshop 2023. Abstract CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general "helpful hints") that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time. Website BibTex CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization @inproceedings{majumder2023clin, title={{CLIN}: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization}, author={Bodhisattwa Prasad Majumder and Bhavana Dalvi Mishra and Peter Jansen and Oyvind Tafjord and Niket Tandon and Li Zhang and Chris Callison-Burch and Peter Clark}, booktitle={Proceedings of the Agent Learning in Open-Endedness (ALOE) Workshop, NeurIPS 2023}, year={2023}, address={New Orleans}, date={December 15, 2023}, }
Grounded Intuition of GPT-Vision's Abilities with Scientific Images. Alyssa Hwang, Andrew Head, Chris Callison-Burch. arXiv 2023. Unpublished preprint. Abstract Grounded Intuition of GPT-Vision's Abilities with Scientific Images GPT-Vision has impressed us on a range of vision-language tasks, but it comes with the familiar new challenge: we have little idea of its capabilities and limitations. In our study, we formalize a process that many have instinctively been trying already to develop "grounded intuition" of this new model. Inspired by the recent movement away from benchmarking in favor of example-driven qualitative evaluation, we draw upon grounded theory and thematic analysis in social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing. We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible. BibTex Grounded Intuition of GPT-Vision's Abilities with Scientific Images @misc{hwang2023grounded, title={Grounded Intuition of GPT-Vision's Abilities with Scientific Images}, author={Alyssa Hwang and Andrew Head and Chris Callison-Burch}, year={2023}, eprint={2311.02069}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck. Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, Chris Callison-Burch. arXiv 2023. Unpublished preprint. Abstract Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck Deep neural networks excel in text classification tasks, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBMs), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBMs predict categorical values for a sparse set of salient concepts and use a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a Large Language Model (LLM), without the need for human curation. On 12 diverse datasets, using GPT-4 for both concept generation and measurement, we show that TBMs can rival the performance of established black-box baselines such as GPT-4 fewshot and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability, with minimal performance tradeoffs, particularly for general-domain text. BibTex Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck @misc{ludan2023interpretablebydesign, title={Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck}, author={Josh Magnus Ludan and Qing Lyu and Yue Yang and Liam Dugan and Mark Yatskar and Chris Callison-Burch}, year={2023}, eprint={2310.19660}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Kani 🦀: A Lightweight and Highly Hackable Framework for Building Language Model Applications. Andrew Zhu, Liam Dugan, Alyssa Hwang, Chris Callison-Burch. 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS) 2023. Abstract Kani 🦀: A Lightweight and Highly Hackable Framework for Building Language Model Applications Language model applications are becoming increasingly popular and complex, often including features like tool usage and retrieval augmentation. However, existing frameworks for such applications are often opinionated, deciding for developers how their prompts ought to be formatted and imposing limitations on customizability and reproducibility. To solve this we present Kani: a lightweight, flexible, and model-agnostic open-source framework for building language model applications. Kani helps developers implement a variety of complex features by supporting the core building blocks of chat interaction: model interfacing, chat management, and robust function calling. All Kani core functions are easily overridable and well documented to empower developers to customize functionality for their own needs. Kani thus serves as a useful tool for researchers, hobbyists, and industry professionals alike to accelerate their development while retaining interoperability and fine-grained control. Code BibTex Kani 🦀: A Lightweight and Highly Hackable Framework for Building Language Model Applications @inproceedings{zhu2023kani, title={Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications}, author={Andrew Zhu and Liam Dugan and Alyssa Hwang and Chris Callison-Burch}, year={2023}, booktitle={3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}, address={Singapore} year = {2023} }
Learning Interpretable Style Embeddings via Prompting LLMs. Ajay Patel, Delip Rao, Ansh Kothary, Kathleen McKeown, Chris Callison-Burch. EMNLP Findings 2023. Abstract Learning Interpretable Style Embeddings via Prompting LLMs Style representation learning builds content independent representations of author style in text. To date, no large dataset of texts with stylometric annotations on a wide range of style dimensions has been compiled, perhaps because the linguistic expertise to perform such annotation would be prohibitively expensive. Therefore, current style representation approaches make use of unsupervised neural methods to disentangle style from content to create style vectors. These approaches, however, result in uninterpretable representations, complicating their usage in downstream applications like authorship attribution where auditing and explainability is critical. In this work, we use prompting to perform stylometry on a large number of texts to generate a synthetic stylometry dataset. We use this synthetic data to then train human interpretable style representations we call LISA embeddings. We release our synthetic dataset (STYLEGENOME) and our interpretable style embedding model (LISA) as resources. Data Code BibTex Learning Interpretable Style Embeddings via Prompting LLMs @inproceedings{patel2023learning, title={Learning Interpretable Style Embeddings via Prompting LLMs}, author={Ajay Patel and Delip Rao and Ansh Kothary and Kathleen McKeown and Chris Callison-Burch}, booktitle={Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023 Findings)}, address={Singapore} year = {2023} }
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale. Bryan Li, Chris Callison-Burch. EMNLP Findings 2023. Abstract PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale Existing question answering (QA) systems owe much of their success to large, high-quality training data. Such annotation efforts are costly, and the difficulty compounds in the cross-lingual setting. Therefore, prior cross-lingual QA work has focused on releasing evaluation datasets, and then applying zero-shot methods as baselines. In this work, we propose a synthetic data generation method for cross-lingual QA which leverages indirect supervision from existing parallel corpora. Our method termed paxqa (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. In the first stage, we apply a question generation (QG) model to the English side. In the second stage, we apply annotation projection to translate both the questions and answers. To better translate questions, we propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We release cross-lingual QA datasets across 4 languages, totaling 662K QA examples. We then show that extractive QA models fine-tuned on these datasets outperform both zero-shot and prior synthetic data generation models, showing the sufficient quality of our generations. We find that the largest performance gains are for cross-lingual directions with non-English questions and English contexts. Ablation studies show that our dataset generation method is relatively robust to noise from automatic word alignments. BibTex PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale @inproceedings{li2023crosslingualqa, title={{PAXQA} Generating Cross-lingual Question Answering Examples at Training Scale}, author={Bryan Li and Chris Callison-Burch}, booktitle={Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023 Findings)}, address={Singapore} year = {2023} }
Faithful Chain-of-Thought Reasoning. Area Chair Award (Interpretability and Analysis of Models for NLP). Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, Chris Callison-Burch. AACL-IJCNLP 2023. Abstract Faithful Chain-of-Thought Reasoning While Chain-of-Thought (CoT) prompting boosts Language Models’ (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy. Data Code BibTex Faithful Chain-of-Thought Reasoning @inproceedings{lyu2023faithful, author = {Lyu, Qing and Havaldar, Shreya and Stein, Adam and Zhang, Li and Rao, Delip and Wong, Eric and Apidianaki, Marianna and Callison-Burch, Chris}, title = {Faithful Chain-of-Thought Reasoning}, booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023)}, year = {2023}, location = {Bali, Indonesia}, date = {November 1--4} }
Rewriting the Script: Adapting Text Instructions for Voice Interaction. Alyssa Hwang, Natasha Oza, Andrew Head, Chris Callison-Burch. DIS 2023. Abstract Rewriting the Script: Adapting Text Instructions for Voice Interaction Voice assistants have sharply risen in popularity in recent years, but their use has been limited mostly to simple applications like music, hands-free search, or control of internet-of-things devices. What would it take for voice assistants to guide people through more complex tasks? In our work, we study the limitations of the dominant approach voice assistants take to complex task guidance: reading aloud written instructions. Using recipes as an example, we observe twelve participants cook at home with a state-of-the-art voice assistant. We learn that the current approach leads to nine challenges, including obscuring the bigger picture, overwhelming users with too much information, and failing to communicate affordances. Instructions delivered by a voice assistant are especially difcult because they cannot be skimmed as easily as written instructions. Alexa in particular did not surface crucial details to the user or answer questions well. We draw on our observations to propose eight ways in which voice assistants can “rewrite the script”—summarizing, signposting, splitting, elaborating, volunteering, reordering, redistributing, and visualizing—to transform written sources into forms that are readily communicated through spoken conversation. We conclude with a vision of how modern advancements in natural language processing can be leveraged for intelligent agents to guide users efectively through complex tasks. BibTex Rewriting the Script: Adapting Text Instructions for Voice Interaction @inproceedings{10.1145/3563657.3596059, author = {Hwang, Alyssa and Oza, Natasha and Callison-Burch, Chris and Head, Andrew}, title = {Rewriting the Script: Adapting Text Instructions for Voice Interaction}, year = {2023}, isbn = {9781450398930}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3563657.3596059}, doi = {10.1145/3563657.3596059}, booktitle = {Proceedings of the 2023 ACM Designing Interactive Systems Conference}, pages = {2233–2248}, numpages = {16}, keywords = {splitting, complex task guidance, remixing, reordering, instructions, voice assistants, voice user interfaces, summarization}, location = {Pittsburgh, PA, USA}, series = {DIS '23} }
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models. Liam Dugan, Anshul Wadhawan, Kyle Spence, Chris Callison-Burch, Morgan McGuire, Victor Zordan. Interspeech 2023. Abstract Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models Recent work in speech-to-speech translation (S2ST) has focused primarily on offline settings, where the full input utterance is available before any output is given. This, however, is not reasonable in many real-world scenarios. In latency-sensitive applications, rather than waiting for the full utterance, translations should be spoken as soon as the information in the input is present. In this work, we introduce a system for simultaneous S2ST targeting real-world use cases. Our system supports translation from 57 languages to English with tunable parameters for dynamically adjusting the latency of the output—including four policies for determining when to speak an output sequence. We show that these policies achieve offline-level accuracy with minimal increases in latency over a Greedy (wait-k) baseline. We open-source our evaluation code and interactive test script to aid future SimulS2ST research and application development. Code BibTex Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models @inproceedings{dugan2023learning, title={Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models}, author={Dugan, Liam and Wadhawan, Anshul and Spence, Kyle and Callison-Burch, Chris and McGuire, Morgan and Zordan, Victor}, booktitle={Proceedings of INTERSPEECH 2023}, year={2023}, address={Dublin, Ireland}, month={August} pages={5265-5266} }
This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models. Bryan Li, Chris Callison-Burch. NAACL 2023. Abstract This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages—a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at https://github.com/manestay/borderlines. BibTex This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models @inproceedings{li-etal-2024-land, title = "This Land is {Your, My} Land: Evaluating Geopolitical Bias in Language Models through Territorial Disputes", author = "Li, Bryan and Haider, Samar and Callison-Burch, Chris", editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven", booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)", month = jun, year = "2024", address = "Mexico City, Mexico", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.naacl-long.213", doi = "10.18653/v1/2024.naacl-long.213", pages = "3855--3871", abstract = "Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages{---}a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at https://github.com/manestay/borderlines.", }
Representation of Lexical Stylistic Features in Language Models’ Embedding Space. Qing Lyu, Marianna Apidianaki, Chris Callison-Burch. StarSEM 2023. Abstract Representation of Lexical Stylistic Features in Language Models’ Embedding Space The representation space of pretrained Language Models (LMs) encodes rich information about words and their relationships (e.g., similarity, hypernymy, polysemy) as well as abstract semantic notions (e.g., intensity). In this paper, we demonstrate that lexical stylistic notions such as complexity, formality, and figurativeness, can also be identified in this space. We show that it is possible to derive a vector representation for each of these stylistic notions from only a small number of seed pairs. Using these vectors, we can characterize new texts in terms of these dimensions by performing simple calculations in the corresponding embedding space. We conduct experiments on five datasets and find that static embeddings encode these features more accurately at the level of words and phrases, whereas contextualized LMs perform better on sentences. The lower performance of contextualized representations at the word level is partially attributable to the anisotropy of their vector space, which can be corrected to some extent using techniques like standardization. Code BibTex Representation of Lexical Stylistic Features in Language Models’ Embedding Space @inproceedings{lyu-etal-2023-representation, title = "Representation of Lexical Stylistic Features in Language Models{'} Embedding Space", author = "Lyu, Qing and Apidianaki, Marianna and Callison-Burch, Chris", booktitle = "Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.starsem-1.32", doi = "10.18653/v1/2023.starsem-1.32", pages = "370--387" }
CALYPSO: LLMs as Dungeon Masters' Assistants. Andrew Zhu, Lara J. Martin, Andrew Head, Chris Callison-Burch. AAID 2023. Press Press Coverage The Register - August 19, 2023 - Hallucinating ChatGPT finds a role playing Dungeons & Dragons Abstract CALYPSO: LLMs as Dungeon Masters' Assistants The role of a Dungeon Master, or DM, in the game Dungeons & Dragons is to perform multiple tasks simultaneously. The DM must digest information about the game setting and monsters, synthesize scenes to present to other players, and respond to the players' interactions with the scene. Doing all of these tasks while maintaining consistency within the narrative and story world is no small feat of human cognition, making the task tiring and unapproachable to new players. Large language models (LLMs) like GPT-3 and ChatGPT have shown remarkable abilities to generate coherent natural language text. In this paper, we conduct a formative evaluation with DMs to establish the use cases of LLMs in D&D and tabletop gaming generally. We introduce calypso{}, a system of LLM-powered interfaces that support DMs with information and inspiration specific to their own scenario. calypso{} distills game context into bite-sized prose and helps brainstorm ideas without distracting the DM from the game. When given access to calypso{}, DMs reported that it generated high-fidelity text suitable for direct presentation to players, and low-fidelity ideas that the DM could develop further while maintaining their creative agency. We see calypso{} as exemplifying a paradigm of AI-augmented tools that provide synchronous creative assistance within established game worlds, and tabletop gaming more broadly. Code BibTex CALYPSO: LLMs as Dungeon Masters' Assistants @inproceedings{zhu2023calypso, title={{CALYPSO}: {LLMs} as Dungeon Masters' Assistants}, author={Zhu, Andrew and Martin, Lara J. and Head, Andrew and Callison-Burch, Chris}, booktitle={The 19th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 2023)}, year={2023} }
I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons. Pei Zhou, Andrew Zhu, Jennifer Hu, Jay Pujara, Xiang Ren, Chris Callison-Burch, Yejin Choi, Prithviraj Ammanabrolu. ACL 2023. Abstract I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons We propose a novel task, G4C (Goal-driven Guidance Generation in Grounded Communication), for studying goal-driven and grounded natural language interactions. Specifically, we choose Dungeons and Dragons (D&D) -- a role-playing game consisting of multiple player characters and a Dungeon Master (DM) who collaborate to achieve a set of goals that are beneficial to the players -- as a testbed for this task. Here, each of the player characters is a student, with their own personas and abilities, and the DM is the teacher, an arbitrator of the rules of the world and responsible for assisting and guiding the students towards a global goal. We propose a theory-of-mind-inspired methodology for training such a DM with reinforcement learning (RL), where a DM: (1) learns to predict how the players will react to its utterances using a dataset of D&D dialogue transcripts; and (2) uses this prediction as a reward function providing feedback on how effective these utterances are at guiding the players towards a goal. Human and automated evaluations show that a DM trained with RL to generate guidance by incorporating a theory-of-mind of the players significantly improves the players' ability to achieve goals grounded in their shared world. BibTex I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons @inproceedings{Pei-et-al-2023-dnd-theory-of-mind, author = {Zhou, Pei and Zhu, Andrew and Hu, Jennifer and Pujara, Jay and Ren, Xiang and Callison-Burch, Chris and Choi, Yejin and Ammanabrolu, Prithviraj}, keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {An AI Dungeon Master's Guide: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons}, booktitle={Proceedings of the The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, address={Toronto, Canada} year = {2023}, copyright = {Creative Commons Attribution 4.0 International} }
FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information. Andrew Zhu, Karmanya Aggarwal, Alexander Feng, Lara J. Martin, and Chris Callison-Burch. ACL 2023. Abstract FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information Dungeons & Dragons (D&D) is a tabletop roleplaying game with complex natural language interactions between players and hidden state information. Recent work has shown that large language models (LLMs) that have access to state information can generate higher quality game turns than LLMs that use dialog history alone. However, previous work used game state information that was heuristically created and was not a true gold standard game state. We present FIREBALL, a large dataset containing nearly 25,000 unique sessions from real D&D gameplay on Discord with true game state info. We recorded game play sessions of players who used the Avrae bot, which was developed to aid people in playing D&D online, capturing language, game commands and underlying game state information. We demonstrate that FIREBALL can improve natural language generation (NLG) by using Avrae state information, improving both automated metrics and human judgments of quality. Additionally, we show that LLMs can generate executable Avrae commands, particularly after finetuning. Data BibTex FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information @inproceedings{zhu-et-al-2023-fireball-dataset, author = {Zhu, Andrew and Aggarwal, Karmanya and Feng, Alexander and Martin, Lara and Callison-Burch, Chris and Choi, Yejin and Ammanabrolu, Prithviraj}, title = {FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information}, booktitle={Proceedings of the The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, address={Toronto, Canada} year = {2023}, copyright = {Creative Commons Attribution 4.0 International} }
Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification. Zoey Sha Li, Ruining Zhao, Manling Li, Heng Ji, Chris Callison-Burch, Jiawei Han. ACL 2023. Abstract Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification Event schemas are a form of world knowledge about the typical progression of events. Recent methods for event schema induction use information extraction systems to construct a large number of event graph instances from documents, and then learn to generalize the schema from such instances. In contrast, we propose to treat event schemas as a form of commonsense knowledge that can be derived from large language models (LLMs). This new paradigm greatly simplifies the schema induction process and allows us to handle both hierarchical relations and temporal relations between events in a straightforward way. Since event schemas have complex graph structures, we design an incremental prompting and verification method INCPROMPT to break down the construction of a complex event graph into three stages: event skeleton construction, event expansion, and event-event relation verification. Compared to directly using LLMs to generate a linearized graph, INCPROMPT can generate large and complex schemas with 7.2% F1 improvement in temporal relations and 31.0% F1 improvement in hierarchical relations. In addition, compared to the previous state-of-the-art closed-domain schema induction model, human assessors were able to cover ∼10% more events when translating the schemas into coherent stories and rated our schemas 1.3 points higher (on a 5-point scale) in terms of readability. BibTex Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification @inproceedings{sha-et-al-2023-creating-event-schema-with-llms, author = {Li, Sha and Zhao, Ruining and Li, Manling and Ji, Heng and Callison-Burch, Chris and Han, Jiawei}, title = {Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification}, booktitle={Proceedings of the The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, address={Toronto, Canada} year = {2023}, copyright = {Creative Commons Attribution 4.0 International} }
Explanation-based Finetuning Makes Models More Robust to Spurious Cues. Josh Magnus Ludan, Yixuan Meng, Tai Nguyen, Saurabh Shah, Qing Lyu, Marianna Apidianaki and Chris Callison-Burch. ACL 2023. Abstract Explanation-based Finetuning Makes Models More Robust to Spurious Cues Large Language Models (LLMs) are so powerful that they sometimes learn correlations between labels and features that are irrelevant to the task, leading to poor generalization on out-of-distribution data. We propose explanation-based finetuning as a novel and general approach to mitigate LLMs’ reliance on spurious correlations. Unlike standard finetuning where the model only predicts the answer given the input, we finetune the model to additionally generate a free-text explanation supporting its answer. To evaluate our method, we finetune the model on artificially constructed training sets containing different types of spurious cues, and test it on a test set without these cues. Compared to standard finetuning, our method makes models remarkably more robust against spurious cues in terms of accuracy drop across four classification tasks: ComVE (+1.2), CREAK (+9.1), e-SNLI (+15.4), and SBIC (+6.5). Moreover, our method works equally well with explanations generated by the model, implying its applicability to more datasets without human-written explanations. BibTex Explanation-based Finetuning Makes Models More Robust to Spurious Cues @inproceedings{ludan-et-al-2023-explanation-based-finetuning, author = {Ludan, Josh Magnus and Meng, Yixuan and Nguyen, Tai and Shah, Saurabh and Lyu, Qing and Apidianaki, Marianna and Callison-Burch, Chris}, title = {Explanation-based Finetuning Makes Models More Robust to Spurious Cues}, booktitle={Proceedings of the The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, address={Toronto, Canada} year = {2023}, copyright = {Creative Commons Attribution 4.0 International} }
Human-in-the-Loop Schema Induction. Tianyi Zhang, Isaac Tham, Zhaoyi Hou, Jiaxuan Ren, Liyang Zhou, Hainiu Xu, Li Zhang, Lara J. Martin, Rotem Dror, Sha Li, Heng Ji, Martha Palmer, Susan Brown, Reece Suchocki, Chris Callison-Burch. ACL 2023. Abstract Human-in-the-Loop Schema Induction Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual edit of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces efforts of human curation thanks to our interactive interface. Website BibTex Human-in-the-Loop Schema Induction @inproceedings{human-in-the-loop-schema-induction, author = {Zhang, Tianyi and Tham, Isaac and Hou, Zhaoyi and Ren, Jiaxuan and Zhou, Liyang and Xu, Hainiu and Zhang, Li and Martin, Lara J. and Dror, Rotem and Li, Sha and Ji, Heng and Palmer, Martha and Brown, Susan and Suchocki, Reece and Callison-Burch, Chris}, keywords = {Human-Computer Interaction (cs.HC), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Human-in-the-Loop Schema Induction}, booktitle={Proceedings of the The 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2023 demos)}, address={Toronto, Canada}, year = {2023} }
CORRPUS: Code-based Structured Prompting for Neurosymbolic Story Understanding. Yijiang River Dong, Lara J. Martin, Chris Callison-Burch. ACL Findings 2023. Abstract CORRPUS: Code-based Structured Prompting for Neurosymbolic Story Understanding Story generation and understanding -- as with all NLG/NLU tasks -- has seen a surge in neurosymbolic work. Researchers have recognized that, while large language models (LLMs) have tremendous utility, they can be augmented with symbolic means to be even better and to make up for any flaws that the neural networks might have. However, symbolic methods are extremely costly in terms of the amount of time and expertise needed to create them. In this work, we capitalize on state-of-the-art Code-LLMs, such as Codex, to bootstrap the use of symbolic methods for tracking the state of stories and aiding in story understanding. We show that our CoRRPUS system and abstracted prompting procedures can beat current state-of-the-art structured LLM techniques on pre-existing story understanding tasks (bAbI task 2 and Re^3) with minimal hand engineering. We hope that this work can help highlight the importance of symbolic representations and specialized prompting for LLMs as these models require some guidance for performing reasoning tasks properly. BibTex CORRPUS: Code-based Structured Prompting for Neurosymbolic Story Understanding @inproceedings{Dong-et-al-2023-detecting-story-inconsistencies, doi = {10.48550/ARXIV.2212.10754}, url = {https://arxiv.org/abs/2212.10754}, author = {Dong, Yijiang River and Martin, Lara J. and Callison-Burch, Chris}, keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {{CORRPUS}: Code-based Structured Prompting for Neurosymbolic Story Understanding}, booktitle={Findings of the The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, address={Toronto, Canada} year = {2023}, copyright = {Creative Commons Attribution 4.0 International} }
Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification. Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, Mark Yatskar. CVPR 2023. Abstract Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification Concept Bottleneck Models (CBM) are inherently interpretable models that factor model decisions into human-readable concepts. They allow people to easily understand why a model is failing, a critical feature for high-stakes applications. CBMs require manually specified concepts and often under-perform their black box counterparts, preventing their broad adoption. We address these shortcomings and are first to show how to construct high-performance CBMs without manual specification of similar accuracy to black box models. Our approach, Language Guided Bottlenecks (LaBo), leverages a language model, GPT-3, to define a large space of possible bottlenecks. Given a problem domain, LaBo uses GPT-3 to produce factual sentences about categories to form candidate concepts. LaBo efficiently searches possible bottlenecks through a novel submodular utility that promotes the selection of discriminative and diverse information. Ultimately, GPT-3's sentential concepts can be aligned to images using CLIP, to form a bottleneck layer. Experiments demonstrate that LaBo is a highly effective prior for concepts important to visual recognition. In the evaluation with 11 diverse datasets, LaBo bottlenecks excel at few-shot classification: they are 11.7% more accurate than black box linear probes at 1 shot and comparable with more data. Overall, LaBo demonstrates that inherently interpretable models can be widely applied at similar, or better, performance than black box approaches. BibTex Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification @inproceedings{yang-etal-2023-language-in-a-bottle, title = {Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification}, author = {Yang, Yue and Panagopoulou, Artemis and Zhou, Shenghao and Jin, Daniel and Callison-Burch, Chris and Yatskar, Mark}, booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)}, year = {2023}, address = {Vancouver, Canada}, publisher = "IEEE/CVF", url = {https://www.cis.upenn.edu/~ccb/publications/language-in-a-bottle.pdf} }
Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text. Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, Chris Callison-Burch. AAAI 2023. Press Press Coverage CNN - July 11, 2023 - Bot or not? How to tell when you’re reading something written by AI Channel 12 News - ABC affiliate in North Carolina - May 19, 2023 - NewsChannel 12 Investigates: Artificial Intelligence Part 3 Penn Today - March 10, 2023 - Real or Fake Text? We can learn to spot the difference Corporate Compliance Insights - March 8, 2023 - A Bot Isn’t Going to Take Your Place, But AI Will Make Your Job Harder Technical.ly - March 6, 2023 - How can humans detect AI writing? These Penn researchers have some tips Abstract Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text As text generated by large language models proliferates, it becomes vital to understand how humans engage with such text, and whether or not they are able to detect when the text they are reading did not originate with a human writer. Prior work on human detection of generated text focuses on the case where an entire passage is either human-written or machine-generated. In this paper, we study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models. We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time. Furthermore, we conduct a detailed comparison study and analyze how a variety of variables (model size, decoding strategy, fine-tuning, prompt genre, etc.) affect human detection performance. Finally, we collect error annotations from our participants and use them to show that certain textual genres influence models to make different types of errors and that certain sentence-level features correlate highly with annotator selection. We release the RoFT dataset: a collection of over 21,000 human annotations paired with error classifications to encourage future work in human detection and evaluation of generated text. Data Code Website Video BibTex Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text @inproceedings{dugan-ippolito-et-al-2023, title={Real or Fake Text? Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text}, author={Liam Dugan and Daphne Ippolito and Arun Kirubarajan and Sherry Shi and Chris Callison-Burch}, booktitle={The 37th AAAI Conference on Artificial Intelligence (AAAI 2023)}, address={Washington DC, USA} year={2023} }
Exploring the Curious Case of Code Prompts. Li Zhang, Liam Dugan, Hainiu Xu, Chris Callison-Burch. Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE) 2023. Abstract Exploring the Curious Case of Code Prompts Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts. BibTex Exploring the Curious Case of Code Prompts @inproceedings{zhang-etal-2023-exploring, title={Exploring the Curious Case of Code Prompts}, author={Li Zhang and Liam Dugan and Hainiu Xu and Chris Callison-Burch}, booktitle={Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)}, address={Toronto, Canada}, year={2023} }
Automatically Generated Summaries of Video Lectures May Enhance Students’ Learning Experience. Hannah Gonzalez, Jiening Li, Helen Jin, Jiaxuan Ren, Hongyu Zhang, Ayotomiwa Akinyele, Adrian Wang, Eleni Miltsakaki, Ryan Baker, Chris Callison-Burch. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) 2023. Abstract Automatically Generated Summaries of Video Lectures May Enhance Students’ Learning Experience We introduce a novel technique for automatically summarizing lecture videos using large language models such as GPT-3 and we present a user study investigating the effects on the studying experience when automatic summaries are added to lecture videos. We test students under different conditions and find that the students who are shown a summary next to a lecture video perform better on quizzes designed to test the course materials than the students who have access only to the video or the summary. Our findings suggest that adding automatic summaries to lecture videos enhances the learning experience. Qualitatively, students preferred summaries when studying under time constraints. BibTex Automatically Generated Summaries of Video Lectures May Enhance Students’ Learning Experience @inproceedings{gonzalez-etal-2023-automatically, title={Automatically Generated Summaries of Video Lectures May Enhance Students’ Learning Experience}, author={Hannah Gonzalez and Jiening Li and Helen Jin and Jiaxuan Ren and Hongyu Zhang and Ayotomiwa Akinyele and Adrian Wang and Eleni Miltsakaki and Ryan Baker and Chris Callison-Burch}, booktitle={Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)}, address={Toronto, Canada}, year={2023} }
Enhancing Human Summaries for Question-Answer Generation in Education. Hannah Gonzalez, Liam Dugan, Eleni Miltsakaki, Zhiqi Cui, Jiaxuan Ren, Bryan Li, Shriyash Upadhyay, Etan Ginsberg, Chris Callison-Burch. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) 2023. Abstract Enhancing Human Summaries for Question-Answer Generation in Education We address the problem of generating high-quality question-answer pairs for educational materials. Previous work on this problem showed that using summaries as input improves the quality of question generation (QG) over original textbook text and that human-written summaries result in higher quality QG than automatic summaries. In this paper, a) we show that advances in Large Language Models (LLMs) are not yet sufficient to generate quality summaries for QG and b) we introduce a new methodology for enhancing bullet point student notes into fully fledged summaries and find that our methodology yields higher quality QG. We conducted a large-scale human annotation study of generated question-answer pairs for the evaluation of our methodology. In order to aid in future research, we release a new dataset of 9.2K human annotations of generated questions. BibTex Enhancing Human Summaries for Question-Answer Generation in Education @inproceedings{gonzalez-etal-2023-enhancing, title={Enhancing Human Summaries for Question-Answer Generation in Education}, author={Hannah Gonzalez and Liam Dugan and Eleni Miltsakaki and Zhiqi Cui and Jiaxuan Ren and Bryan Li and Shriyash Upadhyay and Etan Ginsberg and Chris Callison-Burch}, booktitle={Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)}, address={Toronto, Canada}, year={2023} }
Improving Mathematics Tutoring With A Code Scratchpad. Shriyash Upadhyay, Etan Ginsberg, Chris Callison-Burch. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) 2023. Abstract Improving Mathematics Tutoring With A Code Scratchpad Large language models can solve reasoning tasks (like math problems) more effectively when they are allowed to generate rationales. However, a good tutoring system should not just generate solutions, but should also generate explanations and should be able to correct and guide students. We show that providing a code scratchpad improves performance on each tutoring step with a gradeschool mathematics dataset. On these tutoring tasks, GPT-3 models provided with a code scratchpad significantly outperform those given only a language scratchpad (77.7% vs 48.7% cumulative accuracy). BibTex Improving Mathematics Tutoring With A Code Scratchpad @inproceedings{upadhyay-etal-2023-improving, title={Improving Mathematics Tutoring With A Code Scratchpad}, author={Shriyash Upadhyay and Etan Ginsberg and Chris Callison-Burch}, booktitle={Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)}, address={Toronto, Canada}, year={2023} }
Language Models are Drummers: Drum Composition with Natural Language Pre-Training. Harry Li Zhang, Chris Callison-Burch. AAAI 2023 Workshop on Creative AI Across Modalities 2023. Abstract Language Models are Drummers: Drum Composition with Natural Language Pre-Training Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising. Data Code BibTex Language Models are Drummers: Drum Composition with Natural Language Pre-Training @inproceedings{Zhang-2023-llms-are-drummers, title={Language Models are Drummers: Drum Composition with Natural Language Pre-Training}, author={Li Zhang and Chris Callison-Burch}, booktitle={AAAI 2023 Workshop on Creative AI Across Modalities}, address={Washington DC, USA} year={2023} }
Learn With Martian: A Tool For Creating Assignments That Can Write And Re-Write Themselves. Shriyash Upadhyay, Etan Ginsberg, Chris Callison-Burch. EACL Demos 2023. Abstract Learn With Martian: A Tool For Creating Assignments That Can Write And Re-Write Themselves n this paper, we propose Learn, a unified, easy-to-use tool to apply question generation and selection in classrooms. The tool lets instructors and TAs create assignments that can write and re-write themselves. Given existing course materials, for example a reference textbook, Learn can generate questions, select the highest quality questions, show the questions to students, adapt question difficulty to student knowledge, and generate new questions based on how effectively old questions help students learn. The modular, composable nature of the tools for handling each sub-task allow instructors to use only the parts of the tool necessary to the course, allowing for integration in a large number of courses with varied teaching styles. We also report on the adoption of the tool in classes at the University of Pennsylvania with over 1000 students. Learn is publicly released at https://learn.withmartian.com. Website BibTex Learn With Martian: A Tool For Creating Assignments That Can Write And Re-Write Themselves @inproceedings{upadhyay-etal-2023-learn, title = "Learn With Martian: A Tool For Creating Assignments That Can Write And Re-Write Themselves", author = "Upadhyay, Shriyash and Callison-Burch, Chris and Ginsberg, Etan", booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations", month = may, year = "2023", address = "Dubrovnik, Croatia", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.eacl-demo.30", pages = "267--276", abstract = "In this paper, we propose Learn, a unified, easy-to-use tool to apply question generation and selection in classrooms. The tool lets instructors and TAs create assignments that can write and re-write themselves. Given existing course materials, for example a reference textbook, Learn can generate questions, select the highest quality questions, show the questions to students, adapt question difficulty to student knowledge, and generate new questions based on how effectively old questions help students learn. The modular, composable nature of the tools for handling each sub-task allow instructors to use only the parts of the tool necessary to the course, allowing for integration in a large number of courses with varied teaching styles. We also report on the adoption of the tool in classes at the University of Pennsylvania with over 1000 students. Learn is publicly released at https://learn.withmartian.com.", }
Causal Reasoning About Entities and Events in Procedural Texts. Li Zhang, Hainiu Xu, Yue Yang, Shuyan Zhou, Weiqiu You, Manni Arora, Chris Callison-Burch. Findings of EACL 2023. Abstract Causal Reasoning About Entities and Events in Procedural Texts Entities and events have long been regarded as the crux of machine reasoning. Procedural texts have received increasing attention due to the dynamic nature of involved entities and events. Existing work has focused either on entity state tracking (e.g., the temperature of a pan) or on counterfactual event reasoning (e.g., how likely am I to burn myself by touching the pan), while these two tasks are tightly intertwined. In this work, we propose CREPE, the first benchmark on causal reasoning about event plausibility based on entity states. We experiment with strong large language models and show that most models, including GPT3, perform close to chance at .30 F1, lagging far behind the human performance of .87 F1. Inspired by the finding that structured representations such as programming languages benefit event reasoning as a prompt to code language models such as Codex, we creatively inject the causal relations between entities and events through intermediate variables and boost the performance to .67 to .72 F1. Our proposed event representation not only allows for knowledge injection but also marks the first successful attempt of chain-of-thought reasoning with code language models. Data Code BibTex Causal Reasoning About Entities and Events in Procedural Texts @inproceedings{zhang-etal-2023-causal, title = "Causal Reasoning About Entities and Events in Procedural Texts", author = "Li Zhang and Hainiu Xu and Yue Yang and Shuyan Zhou and Weiqiu You and Manni Arora and Chris Callison-Burch" booktitle = "Findings of the Association for Computational Linguistics: EACL 2023", year = "2023", address = "Dubrovnik, Croatia", publisher = "Association for Computational Linguistics", url = "https://arxiv.org/pdf/2301.10896.pdf" abstract = "Entities and events have long been regarded as the crux of machine reasoning. Procedural texts have received increasing attention due to the dynamic nature of involved entities and events. Existing work has focused either on entity state tracking (e.g., the temperature of a pan) or on counterfactual event reasoning (e.g., how likely am I to burn myself by touching the pan), while these two tasks are tightly intertwined. In this work, we propose CREPE, the first benchmark on causal reasoning about event plausibility based on entity states. We experiment with strong large language models and show that most models, including GPT3, perform close to chance at .30 F1, lagging far behind the human performance of .87 F1. Inspired by the finding that structured representations such as programming languages benefit event reasoning as a prompt to code language models such as Codex, we creatively inject the causal relations between entities and events through intermediate variables and boost the performance to .67 to .72 F1. Our proposed event representation not only allows for knowledge injection but also marks the first successful attempt of chain-of-thought reasoning with code language models." }
Bidirectional Language Models Are Also Few-shot Learners. Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, Chris Callison-Burch. ICLR 2023. Abstract Bidirectional Language Models Are Also Few-shot Learners Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models. Website BibTex Bidirectional Language Models Are Also Few-shot Learners @inproceedings{Patel-ICLR-2023, url = {https://arxiv.org/abs/2209.14500}, author = {Patel, Ajay and Li, Bryan and Rasooli, Mohammad Sadegh and Constant, Noah and Raffel, Colin and Callison-Burch, Chris}, keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Bidirectional Language Models Are Also Few-shot Learners}, booktitle={Eleventh International Conference on Learning Representations (ICLR 2023)}, address={Kigali, Rwanda} year={2023} }
2022
Low-Resource Authorship Style Transfer with In-Context Learning. Ajay Patel, Nicholas Andrews, Chris Callison-Burch. arXiv 2022. Unpublished preprint. Abstract Low-Resource Authorship Style Transfer with In-Context Learning Authorship style transfer involves altering the style of text to match the style of some target author whilst preserving the semantic meaning of the original text. Existing approaches to unsupervised authorship style transfer like STRAP have largely focused on style transfer for target authors with many examples of their writing style through books, speeches, or other published works (Krishna et al., 2020). Due to this high-resource training data requirement (often greater than 100,000 words), these approaches are often only useful for style transfer to the style of published authors, politicians, or other well-known figures and authorship styles. In this paper, we attempt to perform low-resource authorship style transfer, a more challenging class of authorship style transfer where only a limited amount of text in the target author’s style may exist. In our experiments, we specifically choose source and target authors from Reddit to perform style transfer over their Reddit posts, limiting ourselves to just 16 posts (on average ≈ 500 words) of the target author’s style. We then propose a method for automatic evaluation on the lowresource authorship style transfer task utilizing authorship and style representation embeddings (Rivera-Soto et al., 2021; Wegmann et al., 2022). We evaluate our style transferred outputs with the proposed automatic evaluation method and find that our method, STYLL, is able to outperform STRAP and a comprehensive set of baselines. BibTex Low-Resource Authorship Style Transfer with In-Context Learning @misc{https://doi.org/10.48550/arxiv.2212.08986, doi = {10.48550/ARXIV.2212.08986}, url = {https://arxiv.org/abs/2212.08986}, author = {Patel, Ajay and Andrews, Nicholas and Callison-Burch, Chris}, keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Low-Resource Authorship Style Transfer with In-Context Learning}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} }
A Deep Learning Method to Detect Opioid Prescription and Opioid Use Disorder from Electronic Health Records. Aditya Kashyap, Chris Callison-Burch, Mary Regina Boland. International Journal of Medical Informatics 2022. Abstract A Deep Learning Method to Detect Opioid Prescription and Opioid Use Disorder from Electronic Health Records Objective As the opioid epidemic continues across the United States, methods are needed to accurately and quickly identify patients at risk for opioid use disorder (OUD). The purpose of this study is to develop two predictive algorithms: one to predict opioid prescription and one to predict OUD. Materials and Methods We developed an informatics algorithm that trains two deep learning models over patient EHRs using the MIMIC-III database. We utilize both the structured and unstructured parts of the EHR and show that it is possible to predict both challenging outcomes. Results Our deep learning models incorporate elements from the EHRs to predict opioid prescription with an F1-score of 0.88 ± 0.003 and an AUC-ROC of 0.93 ± 0.002. We also constructed a model to predict OUD diagnosis achieving an F1-score of 0.82 ± 0.05 and AUC-ROC of 0.94 ± 0.008. Discussion Our model for OUD prediction outperformed prior algorithms for specificity, F1 score and AUC-ROC while achieving equivalent sensitivity. This demonstrates the importance of a) deep learning approaches in predicting OUD and b) incorporating both structured and unstructured data for this prediction task. No prediction models for opioid prescription as an outcome were found in the literature and therefore our model is the first to predict opioid prescribing behavior. Conclusion Algorithms such as those described in this paper will become increasingly important to understand the drivers underlying this national epidemic. BibTex A Deep Learning Method to Detect Opioid Prescription and Opioid Use Disorder from Electronic Health Records @article{KASHYAP2022104979, title = {A Deep Learning Method to Detect Opioid Prescription and Opioid Use Disorder from Electronic Health Records}, journal = {International Journal of Medical Informatics}, pages = {104979}, year = {2022}, issn = {1386-5056}, doi = {https://doi.org/10.1016/j.ijmedinf.2022.104979}, url = {https://www.sciencedirect.com/science/article/pii/S1386505622002933}, author = {Aditya Kashyap and Chris Callison-Burch and Mary {Regina Boland}}, keywords = {opioid, machine learning, electronic health records, data mining, natural language processing}, abstract = {Objective As the opioid epidemic continues across the United States, methods are needed to accurately and quickly identify patients at risk for opioid use disorder (OUD). The purpose of this study is to develop two predictive algorithms: one to predict opioid prescription and one to predict OUD. Materials and Methods We developed an informatics algorithm that trains two deep learning models over patient EHRs using the MIMIC-III database. We utilize both the structured and unstructured parts of the EHR and show that it is possible to predict both challenging outcomes. Results Our deep learning models incorporate elements from the EHRs to predict opioid prescription with an F1-score of 0.88 ± 0.003 and an AUC-ROC of 0.93 ± 0.002. We also constructed a model to predict OUD diagnosis achieving an F1-score of 0.82 ± 0.05 and AUC-ROC of 0.94 ± 0.008. Discussion Our model for OUD prediction outperformed prior algorithms for specificity, F1 score and AUC-ROC while achieving equivalent sensitivity. This demonstrates the importance of a) deep learning approaches in predicting OUD and b) incorporating both structured and unstructured data for this prediction task. No prediction models for opioid prescription as an outcome were found in the literature and therefore our model is the first to predict opioid prescribing behavior. Conclusion Algorithms such as those described in this paper will become increasingly important to understand the drivers underlying this national epidemic.} }
Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence. Chris Callison-Burch, Gaurav Singh Tomar, Lara Martin, Daphne Ippolito, Suma Bailis and David Reitter. EMNLP 2022. Abstract Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence AI researchers have posited Dungeons and Dragons (D&D) as a challenge problem to test systems on various language-related capabilities. In this paper, we frame D&D specifically as a dialogue system challenge, where the tasks are to both generate the next conversational turn in the game and predict the state of the game given the dialogue history. We create a gameplay dataset consisting of nearly 900 games, with a total of 7,000 players, 800,000 dialogue turns, 500,000 dice rolls, and 58 million words. We automatically annotate the data with partial state information about the game play. We train a large language model (LM) to generate the next game turn, conditioning it on different information. The LM can respond as a particular character or as the player who runs the game—i.e., the Dungeon Master (DM). It is trained to produce dialogue that is either in-character (roleplaying in the fictional world) or out-of-character (discussing rules or strategy). We perform a human evaluation to determine what factors make the generated output plausible and interesting. We further perform an automatic evaluation to determine how well the model can predict the game state given the history and examine how well tracking the game state improves its ability to produce plausible conversational output. BibTex Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence @inproceedings{callison-burch-tomar-et-al-2022, title={Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence}, author={Chris Callison-Burch and Gaurav Singh Tomar and Lara Martin and Daphne Ippolito and Suma Bailis and David Reitter}, booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)}, address={Abu Dhabi, UAE} year={2022} }
Unsupervised Entity Linking with Guided Summarization and Multiple-Choice Selection. Jeffrey Young-Min Cho, Harry Li Zhang, Chris Callison-Burch. EMNLP 2022. Abstract Unsupervised Entity Linking with Guided Summarization and Multiple-Choice Selection Entity linking, the task of linking potentially ambiguous mentions in texts to corresponding knowledge-base entities, is an important component for language understanding. We address two challenge in entity linking: how to leverage wider contexts surrounding a mention, and how to deal with limited training data. We propose a fully unsupervised model called SumMC that first generates a guided summary of the contexts conditioning on the mention, and then casts the task to a multiple-choice problem where the model chooses an entity from a list of candidates. In addition to evaluating our model on existing datasets that focus on named entities, we create a new dataset that links noun phrases from WikiHow to Wikidata. We show that our SumMC model achieves state-of-the-art unsupervised performance on our new dataset and on exiting datasets. BibTex Unsupervised Entity Linking with Guided Summarization and Multiple-Choice Selection @inproceedings{cho-2022, title={Unsupervised Entity Linking with Guided Summarization and Multiple-Choice Selection}, author={Young-Min Cho and Li Zhang and Chris Callison-Burch}, booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)}, address={Abu Dhabi, UAE} year={2022} }
Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction. Yue Yang, Artemis Panagopoulou, Marianna Apidianaki, Mark Yatskar and Chris Callison-Burch. Findings of EMNLP 2022. Abstract Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives’ concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models. BibTex Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction @inproceedings{yang-2022, title={Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction}, author={Yue Yang and Artemis Panagopoulou and Marianna Apidianaki and Mark Yatskar and Chris Callison-Burch}, booktitle={Findings of The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)}, address={Abu Dhabi, UAE} year={2022} }
Empathic Conversations: A Multi-level Dataset of Contextualized Conversations. Damilola Omitaomu, Shabnam Tafreshi, Tingting Liu, Sven Buechel, Chris Callison-Burch, Johannes Eichstaedt, Lyle Ungar, João Sedoc. arXiv 2022. Unpublished preprint. Abstract Empathic Conversations: A Multi-level Dataset of Contextualized Conversations Empathy is a cognitive and emotional reaction to an observed situation of others. Empathy has recently attracted interest because it has numerous applications in psychology and AI, but it is unclear how different forms of empathy (e.g., self-report vs counterpart other-report, concern vs. distress) interact with other affective phenomena or demographics like gender and age. To better understand this, we created the {\it Empathic Conversations} dataset of annotated negative, empathy-eliciting dialogues in which pairs of participants converse about news articles. People differ in their perception of the empathy of others. These differences are associated with certain characteristics such as personality and demographics. Hence, we collected detailed characterization of the participants' traits, their self-reported empathetic response to news articles, their conversational partner other-report, and turn-by-turn third-party assessments of the level of self-disclosure, emotion, and empathy expressed. This dataset is the first to present empathy in multiple forms along with personal distress, emotion, personality characteristics, and person-level demographic information. We present baseline models for predicting some of these features from conversations. BibTex Empathic Conversations: A Multi-level Dataset of Contextualized Conversations @misc{https://doi.org/10.48550/arxiv.2205.12698, doi = {10.48550/ARXIV.2205.12698}, url = {https://arxiv.org/abs/2205.12698}, author = {Omitaomu, Damilola and Tafreshi, Shabnam and Liu, Tingting and Buechel, Sven and Callison-Burch, Chris and Eichstaedt, Johannes and Ungar, Lyle and Sedoc, João}, keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Empathic Conversations: A Multi-level Dataset of Contextualized Conversations}, publisher = {arXiv}, year = {2022}, copyright = {Creative Commons Attribution 4.0 International} }
The Case for a Single Model that can Both Generate Continuations and Fill-in-the-Blank. Daphne Ippolito, Liam Dugan, Emily Reif, Ann Yuan, Andy Coenen, Chris Callison-Burch. NAACL 2022. Abstract The Case for a Single Model that can Both Generate Continuations and Fill-in-the-Blank The task of inserting text into a specified position in a passage, known as fill in the blank (FitB), is useful for a variety of applications where writers interact with a natural language generation (NLG) system to craft text. While previous work has tackled this problem with models trained specifically to do fill in the blank, a more useful model is one that can effectively perform both FitB and continuation tasks. In this work, we evaluate the feasibility of using a single model to do both tasks. We show that models pre-trained with a FitB-style objective are capable of both tasks, while models pre-trained for continuation are not. Finally, we show how these models can be easily finetuned to allow for fine-grained control over the length and word choice of the generation. BibTex The Case for a Single Model that can Both Generate Continuations and Fill-in-the-Blank @inproceedings{Ippolito-2022-fill-in-the-blank, title={The Case for a Single Model that can Both Generate Continuations and Fill in the Blank}, author={Daphne Ippolito and Liam Dugan and Emily Reif and Ann Yuan and Andy Coenen and Chris Callison-Burch}, booktitle={Findings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistic (NAACL 2022)}, address={Seattle, Washington} year={2022} }
Is “My Favorite New Movie” My Favorite Movie? Probing the Understanding of Recursive Noun Phrases. Qing Lyu, Hua Zheng, Daoxin Li, Li Zhang, Marianna Apidianaki, Chris Callison-Burch. NAACL 2022. Abstract Is “My Favorite New Movie” My Favorite Movie? Probing the Understanding of Recursive Noun Phrases Recursive noun phrases (NPs) have interesting semantic properties. For example, my favorite new movie is not necessarily my favorite movie, whereas my new favorite movie is. This is common sense to humans, yet it is unknown whether language models have such knowledge. We introduce the Recursive Noun Phrase Challenge (RNPC), a dataset of three textual inference tasks involving textual entailment and event plausibility comparison, precisely targeting the understanding of recursive NPs. When evaluated on RNPC, state-of-the-art Transformer models only perform around chance. Still, we show that such knowledge is learnable with appropriate data. We further probe the models for relevant linguistic features that can be learned from our tasks, including modifier semantic category and modifier scope. Finally, models trained on RNPC achieve strong zero-shot performance on an extrinsic Harm Detection evaluation task, showing the usefulness of the understanding of recursive NPs in downstream applications. Data Code BibTex Is “My Favorite New Movie” My Favorite Movie? Probing the Understanding of Recursive Noun Phrases @inproceedings{Lyu2022-NPs, title={Is “My Favorite New Movie” My Favorite Movie? Probing the Understanding of Recursive Noun Phrases}, author={Qing Lyu and Hua Zheng and Daoxin Li and Li Zhang and Marianna Apidianaki and Chris Callison-Burch}, booktitle={Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistic (NAACL 2022)}, address={Seattle, Washington} year={2022} }
A Recipe For Arbitrary Text Style Transfer with Large Language Models. Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, Jason Wei. ACL 2022. Abstract A Recipe For Arbitrary Text Style Transfer with Large Language Models In this paper, we leverage large language models (LMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as "make this melodramatic" or "insert a metaphor." Website Video BibTex A Recipe For Arbitrary Text Style Transfer with Large Language Models @inproceedings{Reif2022-style-transfer, title={A Recipe For Arbitrary Text Style Transfer with Large Language Models}, author={Emily Reif and Daphne Ippolito and Ann Yuan and Andy Coenen and Chris Callison-Burch and Jason Wei}, booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, address={Dublin, Ireland} year={2022} }
Deduplicating Training Data Makes Language Models Better. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini. ACL 2022. Abstract Deduplicating Training Data Makes Language Models Better We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets Code BibTex Deduplicating Training Data Makes Language Models Better @inproceedings{lee2022deduplicating, title={Deduplicating Training Data Makes Language Models Better}, author={Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas}, booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, address={Dublin, Ireland} year={2022} }
Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data. Shuyan Zhou, Li Zhang, Yue Yang, Qing Lyu, Pengcheng Yin, Chris Callison-Burch, Graham Neubig. ACL 2022. Abstract Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data Procedures are inherently hierarchical. To host a party, one may need to clean the house, which in turn may require putting away the clothes. While such hierarchical knowledge is critical for reasoning about complex procedures, most existing works treat procedures as shallow structures without modeling the hierarchical dependency between them. In this work, we attempt to construct an open-domain hierarchical knowledge-base (KB) of procedures based on wikiHow, a website containing more than 110k instructional articles, each documenting the steps to accomplish a complex procedure. To this end, we develop a simple and efficient method that links steps (e.g. clean the house) in an article to other articles with similar intents (e.g. how to deep clean your house), which proceeds recursively to form the KB. Our method significantly outperforms several strong baselines according to automatic evaluation, human judgment, and application to downstream tasks such as instructional video retrieval. Data Code Website BibTex Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data @inproceedings{Zhou-Zhang-et-al-2022-wikihow-hierarchies, title={Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data}, author={Shuyan Zhou and Li Zhang and Yue Yang and Qing Lyu and Pengcheng Yin and Chris Callison-Burch and Graham Neubig}, booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, address={Dublin, Ireland} year={2022} }
A Feasibility Study of Answer-Agnostic Question Generation for Education. Liam Dugan, Eleni Miltsakaki, Etan Ginsberg, Shriyash Upadhyay, Hannah Gonzalez, Dahyeon Choi, Chuning Yuan, Chris Callison-Burch. ACL 2022. Abstract A Feasibility Study of Answer-Agnostic Question Generation for Education We conduct a feasibility study into the applicability of answer-agnostic question generation models to textbook passages. We show that a significant portion of errors in such systems arise from asking irrelevant or uninterpretable questions and that such errors can be ameliorated by providing summarized input. We find that giving these models human-written summaries instead of the original text results in a significant increase in acceptability of generated questions (33% to 83%) as determined by expert annotators. We also find that, in the absence of human-written summaries, automatic summarization can serve as a good middle ground. BibTex A Feasibility Study of Answer-Agnostic Question Generation for Education @inproceedings{Dugan-et-al-2022-feasibility-study, title={A Feasibility Study of Answer-Unaware Question Generation for Education}, author={Liam Dugan and Eleni Miltsakaki and Etan Ginsberg and Shriyash Upadhyay and Hannah Gonzalez and Dahyeon Choi and Chuning Yuan and Chris Callison-Burch}, booktitle={Findings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, address={Dublin, Ireland} year={2022} }
QuakerBot: A Household Dialog System Powered by Large Language Models. Artemis Panagopoulou, Manni Arora, Li Zhang, Dimitri Cugini, Weiqiu You, Yue Yang, Liyang Zhou, Yuxuan Wang, Zhaoyi Hou, Alyssa Hwang, Lara Martin, Sherry Shi, Chris Callison-Burch, Mark Yatskar. Amazon Alexa Competition 2022. Unpublished preprint. Abstract QuakerBot: A Household Dialog System Powered by Large Language Models We describe QuakerBot, a dialog system that helps users with household tasks and a participant in the Alexa Prize TaskBot Challenge. QuakerBot can process a variety of user requests, search for instructions from web resources such as wikiHow or Whole Foods Market recipes, answer related questions, and so on. Its components simultaneously consist of large language models with an impressive few-shot performance, and rule-based models with robust service. BibTex QuakerBot: A Household Dialog System Powered by Large Language Models @misc{quakerbot, url = {https://assets.amazon.science/d2/af/3db9f05046e386108b76ce01e06d/quakerbot-a-household-dialog-system-powered-by-large-language-models.pdf}, author = {Artemis Panagopoulou and Manni Arora and Li Zhang and Dimitri Cugini and Weiqiu You and Yue Yang and Liyang Zhou and Yuxuan Wang and Zhaoyi Hou and Alyssa Hwang and Lara Martin and Sherry Shi and Chris Callison-Burch and Mark Yatskar}, title = {QuakerBot&colon A Household Dialog System Powered by Large Language Models}, year = {2022}, }
2021
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets. Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, and Sebastian Gehrmann. NeurIPS 2021. Abstract SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio - a new evaluation set for WikiBio - composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality. Data BibTex SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets @inproceedings{Yuan2021SynthBio, title={{SynthBio}: A Case Study in Human-{AI} Collaborative Curation of Text Datasets}, author={Ann Yuan and Daphne Ippolito and Vitaly Nikolaev and Chris Callison-Burch and Andy Coenen and Sebastian Gehrmann}, booktitle={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks.}, year={2021} }
Identifying and Responding to Health Misinformation on Reddit Dermatology Forums With Artificially Intelligent Bots Using Natural Language Processing: Design and Evaluation Study. Monique A Sager, Aditya M Kashyap, Mila Tamminga, Sadhana Ravoori, Chris Callison-Burch and Jules B Lipoff. JMIR 2021. Abstract Identifying and Responding to Health Misinformation on Reddit Dermatology Forums With Artificially Intelligent Bots Using Natural Language Processing: Design and Evaluation Study Background Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to the provision of inappropriate care. Initial testing has revealed that artificially intelligent bots can detect misinformation regarding tanning and essential oils on Reddit dermatology forums and may be able to produce responses to posts containing misinformation. Objective To analyze the ability of bots to find and respond to tanning and essential oil–related health misinformation on Reddit’s dermatology forums in a controlled test environment. Methods Using natural language processing techniques, we trained bots to target misinformation, using relevant keywords and to post prefabricated responses. By evaluating different model architectures across a held-out test set, we compared performances. Results Our models yielded data test accuracies ranging 95%-100%, with a Bidirectional Encoder Representations from Transformers (BERT) fine-tuned model resulting in the highest level of test accuracy. Bots were then able to post corrective prefabricated responses to misinformation in a test environment. Conclusions Using a limited data set, bots accurately detected examples of health misinformation within Reddit dermatology forums. Given that these bots can then post prefabricated responses, this technique may allow for interception of misinformation. Providing correct information does not mean that users will be receptive or find such interventions persuasive. Further studies should investigate this strategy’s effectiveness to inform future deployment of bots as a technique in combating health misinformation. BibTex Identifying and Responding to Health Misinformation on Reddit Dermatology Forums With Artificially Intelligent Bots Using Natural Language Processing: Design and Evaluation Study @Article{info:doi/10.2196/20975, author="Sager, Monique A and Kashyap, Aditya M and Tamminga, Mila and Ravoori, Sadhana and Callison-Burch, Chris and Lipoff, Jules B", title="Identifying and Responding to Health Misinformation on Reddit Dermatology Forums With Artificially Intelligent Bots Using Natural Language Processing: Design and Evaluation Study", journal="JMIR Dermatol", year="2021", month="Sep", day="30", volume="4", number="2", pages="e20975", keywords="bots; natural language processing; artificial intelligence; Reddit, medical misinformation; health misinformation; detecting misinformation; dermatology; misinformation", abstract="Background: Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to the provision of inappropriate care. Initial testing has revealed that artificially intelligent bots can detect misinformation regarding tanning and essential oils on Reddit dermatology forums and may be able to produce responses to posts containing misinformation. Objective: To analyze the ability of bots to find and respond to tanning and essential oil--related health misinformation on Reddit's dermatology forums in a controlled test environment. Methods: Using natural language processing techniques, we trained bots to target misinformation, using relevant keywords and to post prefabricated responses. By evaluating different model architectures across a held-out test set, we compared performances. Results: Our models yielded data test accuracies ranging 95{\%}-100{\%}, with a Bidirectional Encoder Representations from Transformers (BERT) fine-tuned model resulting in the highest level of test accuracy. Bots were then able to post corrective prefabricated responses to misinformation in a test environment. Conclusions: Using a limited data set, bots accurately detected examples of health misinformation within Reddit dermatology forums. Given that these bots can then post prefabricated responses, this technique may allow for interception of misinformation. Providing correct information does not mean that users will be receptive or find such interventions persuasive. Further studies should investigate this strategy's effectiveness to inform future deployment of bots as a technique in combating health misinformation. ", issn="2562-0959", doi="10.2196/20975", url="https://derma.jmir.org/2021/2/e20975", url="https://doi.org/10.2196/20975" }
BiSECT: Learning to Split and Rephrase Sentences with Bitexts. Rose Undergraduate Research Award. Joongwon Kim, Mounica Maddela, Reno Kriz, Wei Xu, and Chris Callison-Burch. EMNLP 2021. Abstract BiSECT: Learning to Split and Rephrase Sentences with Bitexts An important task in NLP applications such as sentence simplification is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this `split and rephrase' task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher quality training examples than previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations. Data Code BibTex BiSECT: Learning to Split and Rephrase Sentences with Bitexts @inproceedings{kim-maddela-et-al-2021, title={{BiSECT}: Learning to Split and Rephrase Sentences with Bitexts, author={Joongwon Kim and Mounica Maddela and Reno Kriz and Wei Xu and Chris Callison-Burch}, booktitle={Proceedings of The 2021 Conference on Empirical Methods In Natural Language Proceedings (EMNLP)}, year={2021}, url={http://www.cis.upenn.edu/~ccb/publications/bisect.pdf} }
GooAQ 🥑: Open Question Answering with Diverse Answer Types. Daniel Khashabi, Amos Ng, Tushar Khot, Ashish Sabharwal, Hannaneh Hajishirzi, Chris Callison-Burch. Findings of EMNLP 2021. Abstract GooAQ 🥑: Open Question Answering with Diverse Answer Types While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmarkT5 models on GooAQ and observe that (a) in line with recent work, LM's strong performance on GooAQ's short-answer questions heavily benefit from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as 'how' and 'why' questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types Data BibTex GooAQ 🥑: Open Question Answering with Diverse Answer Types @inproceedings{khashabi2021gooaq, title={GooAQ: Open Question Answering with Diverse Answer Types}, author={Khashabi, Daniel and Ng, Amos and Khot, Tushar and Sabharwal, Ashish and Hajishirzi, Hannaneh and Callison-Burch, Chris}, booktitle={Proceedings of The 2021 Conference on Empirical Methods In Natural Language Proceedings (EMNLP)}, year={2021}, url={http://www.cis.upenn.edu/~ccb/publications/GooAQ.pdf} }
Wikily Supervised Neural Translation Tailored to Cross-Lingual Tasks. Mohammad Sadegh Rasooli, Chris Callison-Burch, Derry Wijaya. EMNLP 2021. Abstract Wikily Supervised Neural Translation Tailored to Cross-Lingual Tasks We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for a seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English from which the Arabic training data is a wikily translation of the English captioning data. Our captioning results in Arabic are slightly better than that of its supervised model. In dependency parsing, we translate a large amount of monolingual text, and use it as an artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers. Code BibTex Wikily Supervised Neural Translation Tailored to Cross-Lingual Tasks @inproceedings{rasooli2021wikily, title={``Wikily'' Supervised Neural Translation Tailored to Cross-Lingual Tasks}, author={Rasooli, Mohammad Sadegh and Callison-Burch, Chris and Wijaya, Derry Tanti}, booktitle={Proceedings of The 2021 Conference on Empirical Methods In Natural Language Proceedings (EMNLP)}, year={2021}, url={http://www.cis.upenn.edu/~ccb/publications/wikily-supervised-translation.pdf} }
Visual Goal-Step Inference using wikiHow. Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch. EMNLP 2021. Abstract Visual Goal-Step Inference using wikiHow Procedural events can often be thought of as a high level goal composed of a sequence of steps. Inferring the sub-sequence of steps of a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. Our task is challenging for state-of-the-art muitimodal models. We introduce a novel dataset harvested from wikiHow that consists of 772,294 images representing human actions. We show that the knowledge learned from our data can effectively transfer to other datasets like HowTo100M, increasing the multiple-choice accuracy by 15% to 20%. Our task will facilitate multi-modal reasoning about procedural events. Data Code BibTex Visual Goal-Step Inference using wikiHow @inproceedings{yang2021visual, title={Visual Goal-Step Inference using {wikiHow}, author={Yang, Yue and Panagopoulou, Artemis and Lyu, Qing and Zhang, Li and Yatskar, Mark and Callison-Burch, Chris}, booktitle={Proceedings of The 2021 Conference on Empirical Methods In Natural Language Proceedings (EMNLP)}, year={2021}, url={http://www.cis.upenn.edu/~ccb/publications/visual-goal-step-inference-using-wikihow.pdf} }
Goal-Oriented Script Construction. Qing Lyu and Li Zhang and Chris Callison-Burch. INGL 2021. Abstract Goal-Oriented Script Construction The knowledge of scripts, common chains of events in stereotypical scenarios, is a valuable asset for task-oriented natural language understanding systems. We propose the GoalOriented Script Construction task, where a model produces a sequence of steps to accomplish a given goal. We pilot our task on the first multilingual script learning dataset supporting 18 languages collected from wikiHow, a website containing half a million how-to articles. For baselines, we consider both a generation0based approach using a language model and a retrieval-based approach by first retrieving the relevant steps from a large candidate pool and then ordering them. We show that our task is practical, feasible but challenging for state-of-the-art Transformer models, and that our methods can be readily deployed for various other datasets and domains with decent zero-shot performance. Data BibTex Goal-Oriented Script Construction @inproceedings{Lyu-et-al:2021, author = {Qing Lyu and Li Zhang and Chris Callison-Burch}, title = {Goal-Oriented Script Construction}, booktitle = {Proceedings of the 14th International Conference on Natural Language Generation (INLG 2021)}, year = {2021}, url = {http://www.cis.upenn.edu/~ccb/publications/goal-oriented-script-construction.pdf} }
TopGuNN: Fast NLP Training Data Augmentation using Large Corpora. Rebecca Iglesias-Flores, Megha Mishra, Ajay Patel, Akanksha Malhotra, Reno Kriz, Martha Palmer and Chris Callison-Burch. Workshop on Data Science with Human in the Loop 2021. Abstract TopGuNN: Fast NLP Training Data Augmentation using Large Corpora Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora to easily retrieve new diverse training examples. TopGuNN is demonstrated for a semantic role labeling training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day. Code BibTex TopGuNN: Fast NLP Training Data Augmentation using Large Corpora @inproceedings{TopGuNN-system:2021, author = {Rebecca Iglesias-Flores and Megha Mishra and Ajay Patel and Akanksha Malhotra and Reno Kriz and Martha Palmer and Chris Callison-Burch}, title = {{TopGuNN}: Fast {NLP} Training Data Augmentation using Large Corpora}, booktitle = {Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances}, year = {2021}, url = {http://www.cis.upenn.edu/~ccb/publications/TopGuNN-system.pdf} }
RESIN: A Dockerlized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System. Haoyang Wen, Ying Lin, Tuan Lai, Xiaoman Pan, Sha Li, Xudong Lin, Ben Zhou, Manling Li, Haoyu Wang, Hongming Zhang, Xiaodong Yu, Alexander Dong, Zhenhailong Wang, Yi Fung, Piyush Mishra, Qing Lyu, Dídac Surís, Brian Chen, Susan Windisch Brown, Martha Palmer, Chris Callison-Burch, Carl Vondrick, Jiawei Han, Dan Roth, Shih-Fu Chang, and Heng Ji. NAACL 2021. Abstract RESIN: A Dockerlized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances state-of-the-art from two aspects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using human curated event schema library to match and enhance the extraction output. We have made the dockerized system publicly available for research purpose at GitHub, with a demo video. Code BibTex RESIN: A Dockerlized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System @inproceedings{kairos-resin-system:2021, author = {Haoyang Wen and Ying Lin and Tuan Lai and Xiaoman Pan and Sha Li and Xudong Lin and Ben Zhou and Manling Li and Haoyu Wang and Hongming Zhang and Xiaodong Yu and Alexander Dong and Zhenhailong Wang and Yi Fung and Piyush Mishra and Qing Lyu and Dídac Surís and Brian Chen and Susan Windisch Brown and Martha Palmer and Chris Callison-Burch and Carl Vondrick and Jiawei Han and Dan Roth and Shih-Fu Chang and Heng Ji}, title = {{RESIN}: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System}, booktitle = {Proceedings of The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2021}, url = {http://www.cis.upenn.edu/~ccb/publications/kairos-resin-system.pdf} }
Cultural and Geographical Influences on Image Translatability of Words across Languages. Nikzad Khani, Isidora Chara Tourni, Mohammad Sadegh Rasooli, Chris Callison-Burch and Derry Tanti Wijaya. NAACL 2021. Abstract Cultural and Geographical Influences on Image Translatability of Words across Languages Neural Machine Translation (NMT) models have been observed to produce poor translations when there are few/no parallel sentences to train the models. In the absence of parallel data, several approaches have turned to the use of images to learn translations. Since images of words, e.g., horse may be unchanged across languages, translations can be identified via images associated with words in different languages that have a high degree of visual similarity. However, translating via images has been shown to improve upon text-only models only marginally. To better understand when images are useful for translation, we study image translatability of words, which we define as the translatability of words via images, by measuring intra- and inter-cluster similarities of image representations of words that are translations of each other. We find that images of words are not always invariant across languages, and that similarities of image representations of words that are translations of each other. We find that a language pairs with shared culture, meaning having either a common language family, ethnicity or religion, have improved image translatability (i.e., have more similar images for similar words) compared to its converse regardless of their geographic proximity. In addition, in line with previous works that show images help more in translating concrete words, we found that concrete words have improved image translatability compared to abstract ones. Code BibTex Cultural and Geographical Influences on Image Translatability of Words across Languages @inproceedings{khani-cultural-and-geographical-influences:2021, author = {Nikzad Khani and Isidora Chara Tourni and Mohammad Sadegh Rasooli and Chris Callison-Burch and Derry Tanti Wijaya}, title = {Cultural and Geographical Influences on Image Translatability of Words across Languages}, booktitle = {Proceedings of The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2021}, url = {http://www.cis.upenn.edu/~ccb/publications/cultural-and-geographical-influences-on-image-translatability-of-words-across-languages.pdf} }
2020
Simple-QE: Better Automatic Quality Estimation for Text Simplification. Reno Kriz, Marianna Apidianaki, Chris Callison-Burch. arXiv 2020. Unpublished preprint. Abstract Simple-QE: Better Automatic Quality Estimation for Text Simplification Text simplification systems generate versions of texts that are easier to understand for a broader audience. The quality of simplified texts is generally estimated using metrics that compare to human references, which can be difficult to obtain. We propose Simple-QE, a BERT-based quality estimation (QE) model adapted from prior summarization QE work, and show that it correlates well with human quality judgments. Simple-QE does not require human references, which makes the model useful in a practical setting where users would need to be informed about the quality of generated simplifications. We also show that we can adapt this approach to accurately predict the complexity of human-written texts. BibTex Simple-QE: Better Automatic Quality Estimation for Text Simplification @article{kriz2020simple, title={Simple-QE: Better Automatic Quality Estimation for Text Simplification}, author={Kriz, Reno and Apidianaki, Marianna and Callison-Burch, Chris}, journal={arXiv preprint arXiv:2012.12382}, year={2020} }
Automatic Standardization of Colloquial Persian. Mohammad Sadegh Rasooli, Farzane Bakhtyari, Fatemeh Shafiei, Mahsa Ravanbakhsh, Chris Callison-Burch. arXiv 2020. Unpublished preprint. Abstract Automatic Standardization of Colloquial Persian The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form: this assumption is wrong in many real applications especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation data consisting of 1912 sentences from a diverse set of domains. Our intrinsic evaluation shows a higher BLEU score of 62.8 versus 61.7 compared to an off-the-shelf rule-based standardization model in which the original text has a BLEU score of 46.4. We also show that our model improves English-to-Persian machine translation in scenarios for which the training data is from colloquial Persian with 1.4 absolute BLEU score difference in the development data, and 0.8 in the test data. Data Code BibTex Automatic Standardization of Colloquial Persian @article{rasooli2020automatic, title={Automatic Standardization of Colloquial Persian}, author={Rasooli, Mohammad Sadegh and Bakhtyari, Farzane and Shafiei, Fatemeh and Ravanbakhsh, Mahsa and Callison-Burch, Chris}, journal={arXiv preprint arXiv:2012.05879}, year={2020} }
Artificial Intelligence in mental health and the biases of language based models. Isabel Straw and Chris Callison-Burch. PLOS One 2020. Abstract Artificial Intelligence in mental health and the biases of language based models The rapid integration of Artificial Intelligence (AI) into the healthcare field has occurred with little communication between computer scientists and doctors. The impact of AI on health outcomes and inequalities calls for health professionals and data scientists to make a collaborative effort to ensure historic health disparities are not encoded into the future. We present a study that evaluates bias in existing Natural Language Processing (NLP) models used in psychiatry and discuss how these biases may widen health inequalities. Our approach systematically evaluates each stage of model development to explore how biases arise from a clinical, data science and linguistic perspective. A literature review of the uses of NLP in mental health was carried out across multiple disciplinary databases with defined Mesh terms and keywords. Our primary analysis evaluated biases within ‘GloVe’ and ‘Word2Vec’ word embeddings. Euclidean distances were measured to assess relationships between psychiatric terms and demographic labels, and vector similarity functions were used to solve analogy questions relating to mental health. Our primary analysis of mental health terminology in GloVe and Word2Vec embeddings demonstrated significant biases with respect to religion, race, gender, nationality, sexuality and age. Our literature review returned 52 papers, of which none addressed all the areas of possible bias that we identify in model development. In addition, only one article existed on more than one research database, demonstrating the isolation of research within disciplinary silos and inhibiting cross-disciplinary collaboration or communication. Our findings are relevant to professionals who wish to minimize the health inequalities that may arise as a result of AI and data-driven algorithms. We offer primary research identifying biases within these technologies and provide recommendations for avoiding these harms in the future. BibTex Artificial Intelligence in mental health and the biases of language based models @article{straw2020, title={Artificial Intelligence in mental health and the biases of language based models}, author={Straw, Isabel and Callison-Burch, Chris}, journal={PloS one}, volume={15}, number={12}, year={2020}, publisher={Public Library of Science} }
Reasoning about Goals, Steps, and Temporal Ordering with WikiHow. Qing Lyu, Li Zhang, Chris Callison-Burch. EMNLP 2020. Abstract Reasoning about Goals, Steps, and Temporal Ordering with WikiHow We propose a suite of reasoning tasks on two types of relations between procedural events: GOAL-STEP relations ("learn poses" is a step in the larger goal of "doing yoga") and STEP-STEP TEMPORAL relations ("buy a yoga mat" typically precedes "learn poses"). We introduce a dataset targeting these two relations based on wikiHow, a website of instructional how-to articles. Our human-validated test set serves as a reliable benchmark for commonsense inference, with a gap of about 10% to 20% between the performance of state-of-the-art transformer models and human performance. Our automatically-generated training set allows models to effectively transfer to out-of-domain tasks requiring knowledge of procedural events, with greatly improved performances on SWAG, Snips, and Story Cloze Test in zero- and few-shot settings. Data Code BibTex Reasoning about Goals, Steps, and Temporal Ordering with WikiHow @inproceedings{lyu-zhang-wikihow:2020, author = {Qing Lyu and Li Zhang and Chris Callison-Burch}, title = {Reasoning about Goals, Steps, and Temporal Ordering with WikiHow}, booktitle = {Proceedings of The 2020 Conference on Empirical Methods In Natural Language Proceedings (EMNLP)}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/reasoning-about-goals-with-wikihow.pdf} }
Intent Detection with WikiHow. Li Zhang, Qing Lyu, Chris Callison-Burch. AACL-IJCNLP 2020. Abstract Intent Detection with WikiHow Modern task-oriented dialog systems need to reliably understand users’ intents. Intent detection is most challenging when moving to new domains or new languages, since there is little annotated data. To address this challenge, we present a suite of pretrained intent detection models. Our models are able to predict a broad range of intended goals from many actions because they are trained on wikiHow, a comprehensive instructional website. Our models achieve state-of-the-art results on the Snips dataset, the Schema-Guided Dialogue dataset, and all 3 languages of the Facebook multilingual dialog datasets. Our models also demonstrate strong zero- and few-shot performance, reaching over 75% accuracy using only 100 training examples in all datasets. Data BibTex Intent Detection with WikiHow @inproceedings{zhang-et-al-wikihow-intent-detection:2020, author = {Qing Lyu and Li Zhang and Chris Callison-Burch}, title = {Intent Detection with WikiHow}, booktitle = {Proceedings of The 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/intent-detection-with-wikihow.pdf} }
RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text. Liam Dugan, Daphne Ippolito, Arun Kirubarajan*, Chris Callison-Burch. EMNLP 2020. Demo papers. Abstract RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text In recent years, large neural networks for natural language generation (NLG) have made leaps and bounds in their ability to generate fluent text. However, the tasks of evaluating quality differences between NLG systems and understanding how humans perceive the generated text remain both crucial and difficult. In this system demonstration, we present Real or Fake Text (RoFT), a website that tackles both of these challenges by inviting users to try their hand at detecting machine-generated text in a variety of domains. We introduce a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated. We show preliminary results of using RoFT to evaluate detection of machine-generated news articles. Code Website BibTex RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text @inproceedings{roft-demo:2020, author = {Liam Dugan and Daphne Ippolito and Arun Kirubarajan and Chris Callison-Burch}, title = {RoFT: A Tool for Evaluating \\Human Detection of Machine-Generated Text}, booktitle = {Proceedings of The 2020 Conference on Empirical Methods In Natural Language Proceedings (EMNLP) - Demo Track}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/roft-demo.pdf} }
Turkish Judge: A Peer Evaluation Framework for Crowd Work Appeals. Edward Cohen, Mukund Venkateswaran, Nivedita Sankar, Chris Callison-Burch. HCOMP 2020. Abstract Turkish Judge: A Peer Evaluation Framework for Crowd Work Appeals We present a crowd-driven adjudication system for rejected work on Amazon Mechanical Turk. The Mechanical Turk crowdsourcing platform allows Requesters to approve or reject assignments submitted by Workers. If the work is rejected, then Workers aren’t paid, and their reputation suffers. Currently, there is no built-in mechanism for Workers to appeal rejections, other than contacting Requesters directly. The time it takes Requesters to review potentially incorrectly rejected tasks means that their costs are substantially higher than the payment amount that is in dispute. As a solution to this issue, we present an automated appeals system called Turkish Judge which employs crowd workers as judges to adjudicate whether work was fairly rejected when their peers initiate an appeal. We describe our system, analyze the added cost to Requesters, and discuss the advantages of such a system to the Mechanical Turk marketplace and other similar microtasking platforms. BibTex Turkish Judge: A Peer Evaluation Framework for Crowd Work Appeals @inproceedings{turkish-judge-demo:2020, author = {Edward Cohen and Mukund Venkateswaran and Nivedita Sankar and Chris Callison-Burch}, title = {Turkish Judge: A Peer Evaluation Framework for Crowd Work Appeals}, booktitle = {Proceedings of the Eighth AAAI Conference on Human Computation and Crowdsourcing (HCOMP)}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/turkish-judge-demo.pdf} }
Automatic Detection of Generated Text is Easiest when Humans are Fooled. Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch and Douglas Eck. ACL 2020. Abstract Automatic Detection of Generated Text is Easiest when Humans are Fooled Recent advancements in neural language modeling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies like top-k, nucleus sampling, and untruncated random sampling—and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems. BibTex Automatic Detection of Generated Text is Easiest when Humans are Fooled @inproceedings{Ippolito-Duckworth-et-al:2020, author = {Daphne Ippolito and Daniel Duckworth and Douglas Eck and Chris Callison-Burch}, title = {Automatic Detection of Generated Text is Easiest when Humans are Fooled}, booktitle = {Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/automatic-detection-of-generated-text-is-easiest-when-humans-are-fooled.pdf} }
Toward Better Storylines with Sentence-Level Language Models. Daphne Ippolito, David Grangier, Douglas Eck and Chris Callison-Burch. ACL 2020. Short papers. Abstract Toward Better Storylines with Sentence-Level Language Models We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives. Since it does not need to model fluency, the sentence-level language model can focus on longer range dependencies, which are crucial for multi-sentence coherence. Rather than dealing with individual words, our method treats the story so far as a list of pre-trained sentence embeddings and predicts an embedding for the next sentence, which is more efficient than predicting word embeddings. Notably this allows us to consider a large number of candidates for the next sentence during training. We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task and with promising results on larger-scale next sentence prediction tasks. Code BibTex Toward Better Storylines with Sentence-Level Language Models @inproceedings{Ippolito-et-al:2020, author = {Daphne Ippolito and David Grangier and Douglas Eck and Chris Callison-Burch}, title = {Toward Better Storylines with Sentence-Level Language Models}, booktitle = {Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/toward-better-storylines-with-sentence-level-language-models.pdf} }
Resolving Pronouns in Twitter Streams: Context Can Help. Anietie Andy, Chris Callison-Burch and Derry Wijaya. Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC) 2020. Abstract Resolving Pronouns in Twitter Streams: Context Can Help Many people live-tweet televised events like Presidential debates and popular TV-shows and discuss people or characters in the event. Naturally, many tweets make pronominal reference to these people/characters. We propose an algorithm for resolving personal pronouns that make reference to people involved in an event, in tweet streams collected during the event. BibTex Resolving Pronouns in Twitter Streams: Context Can Help @inproceedings{Andy-et-al:2020, author = {Anietie Andy and Chris Callison-Burch and Derry Wijaya}, title = {Resolving Pronouns in Twitter Streams: Context Can Help}, booktitle = {Proceedings of Third Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2020)}, year = {2020}, url = {http://www.cis.upenn.edu/~ccb/publications/resolving-pronouns-in-twitter-streams.pdf} }
The CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR): A Method for Predicting Acronym Sense from Neonatal Clinical Notes. Aditya Kashyap, Heather Burris, Chris Callison-Burch, Mary Regina Boland. International Journal of Medical Informatics 2020. Abstract The CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR): A Method for Predicting Acronym Sense from Neonatal Clinical Notes Objective To develop an algorithm for identifying acronym ‘sense’ from clinical notes without requiring a clinically annotated training set. Materials and Methods Our algorithm is called CLASSE GATOR: Clinical Acronym SenSE disambiGuATOR. CLASSE GATOR extracts acronyms and definitions from PubMed Central (PMC). A logistic regression model is trained using words associated with specific acronym-definition pairs from PMC. CLASSE GATOR uses this library of acronym-definitions and their corresponding word feature vectors to predict the acronym ‘sense’ from Beth Israel Deaconess (MIMIC-III) neonatal notes. Results We identified 1,257 acronyms and 8,287 definitions including a random definition from 31,764 PMC articles on prenatal exposures and 2,227,674 PMC open access articles. The average number of senses (definitions) per acronym was 6.6 (min = 2, max = 50). The average internal 5-fold cross validation was 87.9 % (on PMC). We found 727 unique acronyms (57.29 %) from PMC were present in 105,044 neonatal notes (MIMIC-III). We evaluated the performance of acronym prediction using 245 manually annotated clinical notes with 9 distinct acronyms. CLASSE GATOR achieved an overall accuracy of 63.04 % and outperformed random for 8/9 acronyms (88.89 %) when applied to clinical notes. We also compared our algorithm with UMN's acronym set, and found that CLASSE GATOR outperformed random for 63.46 % of 52 acronyms when using logistic regression, 75.00 % when using Bert and 76.92 % when using BioBert as the prediction algorithm within CLASSE GATOR. Conclusions CLASSE GATOR is the first automated acronym sense disambiguation method for clinical notes. Importantly, CLASSE GATOR does not require an expensive manually annotated acronym-definition corpus for training. BibTex The CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR): A Method for Predicting Acronym Sense from Neonatal Clinical Notes @article{KASHYAP2020104101, title = "The {CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR)}: A Method for predicting acronym sense from neonatal clinical notes", journal = "International Journal of Medical Informatics", volume = "137", year = "2020", doi = "https://doi.org/10.1016/j.ijmedinf.2020.104101", url = "http://www.sciencedirect.com/science/article/pii/S1386505619312122", author = "Aditya Kashyap and Heather Burris and Chris Callison-Burch and Mary Regina Boland", keywords = "Electronic health records, Natural language processing, Secondary reuse, Transfer learning" }
SNAP judgments into the digital age: Reporting on food stamps varies significantly with time, publication type, and political leaning. Benjamin Chrisinger, Eliza Kinsey, Ellie Pavlick, Chris Callison-Burch. PLOS One 2020. Abstract SNAP judgments into the digital age: Reporting on food stamps varies significantly with time, publication type, and political leaning The Supplemental Nutrition Assistance Program (SNAP) is the second-largest and most contentious public assistance program administered by the United States government. The media forums where SNAP discourse occurs have changed with the advent of social and web-based media. We used machine learning techniques to characterize media coverage of SNAP over time (1990–2017), between outlets with national readership and those with narrower scopes, and, for a subset of web-based media, by the outlet’s political leaning. We applied structural topic models, a machine learning methodology that categorizes and summarizes large bodies of text that have document-level covariates or metadata, to a corpus of print media retrieved via LexisNexis (n = 76,634). For comparison, we complied a separate corpus via web-scrape algorithm of the Google News API (2012–2017), and assigned political alignment metadata to a subset documents according to a recent study of partisanship on social media. A similar procedure was used on a subset of the print media documents that could be matched to the same alignment index. Using linear regression models, we found some, but not all, topics to vary significantly with time, between large and small media outlets, and by political leaning. Our findings offer insights into the polarized and partisan nature of a major social welfare program in the United States, and the possible effects of new media environments on the state of this discourse. BibTex SNAP judgments into the digital age: Reporting on food stamps varies significantly with time, publication type, and political leaning @article{chrisinger2020snap, title={SNAP judgments into the digital age: Reporting on food stamps varies significantly with time, publication type, and political leaning}, author={Chrisinger, Benjamin W and Kinsey, Eliza W and Pavlick, Ellie and Callison-Burch, Chris}, journal={PloS one}, volume={15}, number={2}, year={2020}, publisher={Public Library of Science} }
2019
Paraphrase-Sense-Tagged Sentences. Anne Cocos and Chris Callison-Burch. TACL 2019. Abstract Paraphrase-Sense-Tagged Sentences Many natural language processing tasks require discriminating the particular meaning of a word in context, but building corpora for developing sense-aware models can be a challenge. We present a large resource of example usages for words having a particular meaning, called Paraphrase-Sense Tagged Sentences (PSTS). Built upon the premise that a word's paraphrases instantiate its fine-grained meanings – i.e. bug has different meanings corresponding to its paraphrases fly and microbe – the resource contains up to 10,000 sentences for each of 3 million target-paraphrase pairs where the target word takes on the meaning of the paraphrase. We describe an automatic method based on bilingual pivoting used to enumerate sentences for PSTS, and present two models for ranking PSTS sentences based on their quality. Finally, we demonstrate the utility of PSTS by using it to build a dataset for the task of hypernym prediction in context. Training a model on this automatically-generated dataset produces accuracy that is competitive with a model trained on smaller datasets crafted with some manual effort. Data Website BibTex Paraphrase-Sense-Tagged Sentences @article{Cocos-Callison-Burch:2019:TACL, author = {Anne Cocos and Chris Callison-Burch}, title = {Paraphrase-Sense-Tagged Sentences}, journal = {Transactions of the Association for Computational Linguistics}, volume = {}, year = {2019}, url = {http://www.cis.upenn.edu/~ccb/publications/paraphrase-sense-tagged-sentences.pdf}, pages = {} }
PerspectroScope: A Window to the World of Diverse Perspectives. Sihao Chen, Daniel Khashabi, Chris Callison-Burch and Dan Roth. ACL 2019. Demo papers. Abstract PerspectroScope: A Window to the World of Diverse Perspectives This work presents PERSPECTROSCOPE, a web-based system which lets users query a discussion-worthy natural language claim, and extract and visualize various perspectives in support or against the claim, along with evidence supporting each perspective. The system thus lets users explore various perspectives that could touch upon aspects of the issue at hand. The system is built as a combination of retrieval engines and learned textualentailment-like classifiers built using a few recent developments in natural language understanding. To make the system more adaptive, expand its coverage, and improve its decisions over time, our platform employs various mechanisms to get corrections from the users. PERSPECTROSCOPE is available at github.com/CogComp/perspectroscope. A brief video of the system is available at youtube.com/watch?v=MXBTR1Sp3Bs. Code BibTex PerspectroScope: A Window to the World of Diverse Perspectives @inproceedings{Chen-Khashabi-et-al:2019, author = {Sihao Chen and Daniel Khashabi and Chris Callison-Burch and Dan Roth}, title = {PerspectroScope: A Window to the World of Diverse Perspectives}, booktitle = {Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL) demo session}, year = {2019}, address = {Florence, Italy}, url = {http://www.cis.upenn.edu/~ccb/publications/comparison-of-diverse-decoding-methods-from-conditional-language-models.pdf} }
Comparison of Diverse Decoding Methods from Conditional Language Models. Daphne Ippolito, Reno Kriz, João Sedoc, Maria Kustikova and Chris Callison-Burch. ACL 2019. Abstract Comparison of Diverse Decoding Methods from Conditional Language Models While conditional language models have greatly improved in their ability to output high-quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to, within a given-sized candidate list, cover as much of the space of high-quality outputs as possible, leading to improvements for tasks that re-rank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. In this work, we perform an extensive survey of decoding-time strategies for generating diverse outputs from conditional language models. We also show how diversity can be improved without sacrificing quality by oversampling additional candidates, then filtering to the desired number. Code BibTex Comparison of Diverse Decoding Methods from Conditional Language Models @inproceedings{Ippolito-Kriz-et-al:2019, author = {Daphne Ippolito and Reno Kriz and Joao Sedoc and Maria Kustikova and Chris Callison-Burch}, title = {Comparison of Diverse Decoding Methods from Conditional Language Models}, booktitle = {Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL) }, year = {2019}, address = {Florence, Italy}, url = {http://www.cis.upenn.edu/~ccb/publications/comparison-of-diverse-decoding-methods-from-conditional-language-models.pdf} }
Winter is here: Summarizing Twitter Streams related to Pre-Scheduled Events. Anietie Andy, Derry Wijaya and Chris Callison-Burch. Proceedings of the Second Workshop on Storytelling 2019. Abstract Winter is here: Summarizing Twitter Streams related to Pre-Scheduled Events Pre-scheduled events, such as TV shows and sports games, usually garner considerable attention from the public. Twitter captures large volumes of discussions and messages related to these events, in real-time. Twitter streams related to pre-scheduled events are characterized by the following: (1) spikes in the volume of published tweets reflect the highlights of the event and (2) some of the published tweets make reference to the characters involved in the event, in the context in which they are currently portrayed in a subevent. In this paper, we take advantage of these characteristics to identify the highlights of pre-scheduled events from tweet streams and we demonstrate a method to summarize these highlights. We evaluate our algorithm on tweets collected around 2 episodes of a popular TV show, Game of Thrones, Season 7. BibTex Winter is here: Summarizing Twitter Streams related to Pre-Scheduled Events @inproceedings{Andy-Wijaya-et-al:2019, author = {Anietie Andy and Derry Wijaya and Chris Callison-Burch}, title = {Winter is here: Summarizing Twitter Streams related to Pre-Scheduled Events}, booktitle = {Proceedings of the Second Workshop on Storytelling}, year = {2019}, address = {Florence, Italy}, url = {http://www.cis.upenn.edu/~ccb/publications/winter-is-here.pdf} }
A Comparison of Context-sensitive Models for Lexical Substitution. Aina Garí Soler, Anne Cocos, Marianna Apidianaki, Chris Callison-Burch. 13th International Conference on Computational Semantics (IWCS) 2019. Abstract A Comparison of Context-sensitive Models for Lexical Substitution Word embedding representations provide good estimates of word meaning and give state-of-the art performance in semantic tasks. Embedding approaches differ as to whether and how they account for the context surrounding a word. We present a comparison of different word and context representations on the task of proposing substitutes for a target word in context (lexical substitution). We also experiment with tuning contextualized word embeddings on a dataset of sense-specific instances for each target word. We show that powerful contextualized word representations, which give high performance in several semantics-related tasks, deal less well with the subtle in-context similarity relationships needed for substitution. This is better handled by models trained with this objective in mind, where the inter-dependence between word and context representations is explicitly modeled during training. BibTex A Comparison of Context-sensitive Models for Lexical Substitution @inproceedings{Gar-Soler:2019, author = {Aina Gar\'{i} Soler and Anne Cocos and Marianna Apidianaki and Chris Callison-Burch}, title = {A Comparison of Context-sensitive Models for Lexical Substitution}, booktitle = {Proceedings of the 13th International Conference on Computational Semantics (IWCS)}, year = {2019}, address = {Gothenburg, Sweden}, url = {http://www.cis.upenn.edu/~ccb/publications/comparison-of-context-sensitive-models-for-lexical-substitution.pdf} }
Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification. Reno Kriz, João Sedoc, Marianna Apidianaki, Carolina Zheng, Gaurav Kumar, Eleni Miltsakaki, and Chris Callison-Burch. NAACL 2019. Abstract Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification Sentence simplification is the task of rewriting texts so they are easier to understand. Recent research has applied sequence-to-sequence (Seq2Seq) models to this task, focusing largely on training-time improvements via reinforcement learning and memory augmentation. One of the main problems with applying generic Seq2Seq models for simplification is that these models tend to copy directly from the original sentence, resulting in outputs that are relatively long and complex. We aim to alleviate this issue through the use of two main techniques. First, we incorporate content word complexities, as predicted with a leveled word complexity model, into our loss function during training. Second, we generate a large set of diverse candidate simplifications at test time, and rerank these to promote fluency, adequacy, and simplicity. Here, we measure simplicity through a novel sentence complexity model. These extensions allow our models to perform competitively with state-of-the-art systems while generating simpler sentences. We report standard automatic and human evaluation metrics. BibTex Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification @inproceedings{Kriz-et-al:2019:NAACL, author = {Reno Kriz and Joao Sedoc and Marianna Apidianaki and Carolina Zheng and Gaurav Kumar and Eleni Miltsakaki and Chris Callison-Burch}, title = {Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification}, booktitle = {The 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)}, month = {June}, year = {2019}, address = {Minneapolis, Minnesota}, url = {http://www.cis.upenn.edu/~ccb/publications/complexity-weighted-loss-for-sentence-simplification.pdf} }
Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims. Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch and Dan Roth. NAACL 2019. Abstract Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims One key consequence of the information revolution is a significant increase and a contamination of our information supply. The practice of fact checking won’t suffice to eliminate the biases in text data we observe, as the degree of factuality alone does not determine whether biases exist in the spectrum of opinions visible to us. To better understand controversial issues, one needs to view them from a diverse yet comprehensive set of perspectives. For example, there are many ways to respond to a claim such as “animals should have lawful rights”, and these responses form a spectrum of perspectives, each with a stance relative to this claim and, ideally, with evidence supporting it. Inherently, this is a natural language understanding task, and we propose to address it as such. Specifically, we propose the task of substantiated perspective discovery where, given a claim, a system is expected to discover a diverse set of well-corroborated perspectives that take a stance with respect to the claim. Each perspective should be substantiated by evidence paragraphs which summarize pertinent results and facts. We construct PERSPECTRUM, a dataset of claims, perspectives and evidence, making use of online debate websites to create the initial data collection, and augmenting it using search engines in order to expand and diversify our dataset. We use crowdsourcing to filter out the noise and ensure high-quality data. Our dataset contains 1k claims, accompanied with pools of 10k and 8k perspective sentences and evidence paragraphs, respectively. We provide a thorough analysis of the dataset to highlight key underlying language understanding challenges, and show that human baselines across multiple subtasks far outperform machine baselines built upon state-of-the-art NLP techniques. This poses a challenge and opportunity for the NLP community to address. Data BibTex Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims @inproceedings{Chen-et-al:2019:NAACL, author = {Sihao Chen and Daniel Khashabi and Wenpeng Yin and Chris Callison-Burch and Dan Roth}, title = {Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims}, booktitle = {The 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)}, month = {June}, year = {2019}, address = {Minneapolis, Minnesota}, url = {http://www.cis.upenn.edu/~ccb/publications/discovering-diverse-perspectives.pdf} }
ChatEval: A Tool for the Systematic Evaluation of Chatbots. João Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. NAACL 2019. Demo papers. Abstract ChatEval: A Tool for the Systematic Evaluation of Chatbots Open-domain dialog systems (i.e. chatbots) are difficult to evaluate. The current best practice for analyzing and comparing these dialog systems is the use of human judgments. However, the lack of standardization in evaluation procedures, and the fact that model parameters and code are rarely published hinder systematic human evaluation experiments. We introduce a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems. Researchers can submit their trained models to the ChatEval web interface and obtain comparisons with baselines and prior work. The evaluation code is open-source to ensure standardization and transparency. In addition, we introduce open-source baseline models and evaluation datasets. ChatEval can be found at chateval.org. Website BibTex ChatEval: A Tool for the Systematic Evaluation of Chatbots @inproceedings{Sedoc:2018:ChatEval, author = {Joao Sedoc and Daphne Ippolito and Arun Kirubarajan and Jai Thirani and Lyle Ungar and Chris Callison-Burch}, title = {ChatEval: A Tool for the Systematic Evaluation of Chatbots}, booktitle = {The 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019) Demonstrations}, month = {June}, year = {2019}, address = {Minneapolis, Minnesota}, url = {http://www.cis.upenn.edu/~ccb/publications/chateval-demo.pdf} }
Unsupervised Hierarchical Story Infilling. Daphne Ippolito, David Grangier, Chris Callison-Burch and Douglas Eck. 13th International Conference on Computational Semantics (IWCS) 2019. Abstract Unsupervised Hierarchical Story Infilling Story infilling involves predicting words to go into a missing span from a story. This challenging task has the potential to transform interactive tools for creative writing. However, state-of-the-art conditional language models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without sacrificing diversity. BibTex Unsupervised Hierarchical Story Infilling @inproceedings{Ippolito-et-al:2019, author = {Daphne Ippolito and David Grangier and Chris Callison-Burch and Douglas Eck}, title = {A Comparison of Context-sensitive Models for Lexical Substitution}, booktitle = {Proceedings of the First Workshop on Narrative Understanding}, year = {2019}, address = {Minneapolis, Minnesota}, url = {http://www.cis.upenn.edu/~ccb/publications/story-infilling.pdf} }
Anonymization of Sensitive Information in Medical Health Records. Bhavna Saluja, Gaurav Kumar, João Sedoc, and Chris Callison-Burch. Iberian Languages Evaluation Forum 2019. Abstract Anonymization of Sensitive Information in Medical Health Records Due to privacy constraints, clinical records with protected health information (PHI) cannot be directly shared. De-identification, i.e., the exhaustive removal, or replacement, of all mentioned PHI phrases has to be performed before making the clinical records available outside of hospitals. We have tried to identify PHI on medical records written in Spanish language. We applied two approaches for the anonymization of medical records in this paper. In the first approach, we gathered various token-level features and built a LinearSVC model which gave us F1 score of 0.861 on test data. In the other approach, we built a neural network involving an LSTM-CRF model which gave us a higher F1 score of 0.935 which is an improvement over the first approach. BibTex Anonymization of Sensitive Information in Medical Health Records @inproceedings{Saluja-Kumar-et-al:2019, author = {Bhavna Saluja and Gaurav Kumar and Joao Sedoc and Chris Callison-Burch}, title = {Anonymization of Sensitive Information in Medical Health Records}, booktitle = {Iberian Languages Evaluation Forum}, year = {2019}, address = {Bilbao, Spain}, url = {http://www.cis.upenn.edu/~ccb/publications/meddocan-shared-task-submission.pdf} }
Natural Language Processing of Reddit Data to Evaluate Dermatology Patient Experiences and Therapeutics. Edidiong Okon, Vishnutheja Rachakonda, Hyo Jung Hong, Chris Callison-Burch and Jules Lipoff. Journal of the American Academy of Dermatology 2019. Abstract Natural Language Processing of Reddit Data to Evaluate Dermatology Patient Experiences and Therapeutics Background - There is a lack of research studying patient-generated data on Reddit, one of the world’s most popular forums with active users interested in dermatology. Techniques within natural language processing, a field of artificial intelligence, can analyze large amounts of text information and extract insights. Objective - To apply natural language processing to Reddit comments about dermatology topics to assess for feasibility and potential for insights and engagement. Methods - A software pipeline preprocessed Reddit comments from 2005 to 2017 from seven popular dermatology-related subforums on Reddit, applied Latent Dirichlet allocation (LDA), and used spectral clustering to establish cohesive themes and the frequency of word representation and grouped terms within these topics. Results - We created a corpus of 176K comments and identified trends in patient engagement in spaces such as eczema, acne, etc., with a focus on homeopathic treatments and Accutane. Limitations - LDA is an unsupervised model, meaning there is no ground truth to which the model output can be compared. However, as these forums are anonymous, there seems little incentive for patients to be dishonest. Conclusions - Reddit data has viability and utility for dermatologic research and engagement with the public, especially for common dermatology topics such as tanning, acne, and psoriasis. BibTex Natural Language Processing of Reddit Data to Evaluate Dermatology Patient Experiences and Therapeutics @article{Okon-EtAl:2016:JAAD, author = {Edidiong Okon and Vishnutheja Rachakonda and Hyo Jung Hong and Chris Callison-Burch and Jules Lipoff}, title = {Natural Language Processing of Reddit Data to Evaluate Dermatology Patient Experiences and Therapeutics}, journal = {Journal of the American Academy of Dermatology}, volume = {}, number = {}, year = {2017}, url = {https://www.sciencedirect.com/science/article/pii/S0190962219323710} }
Worker Demographics and Earnings on Amazon Mechanical Turk: An Exploratory Analysis. Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Benjamin V. Hanrahan, Jeffrey P. Bigham and Chris Callison-Burch. CHI Late Breaking Work 2019. Abstract Worker Demographics and Earnings on Amazon Mechanical Turk: An Exploratory Analysis Prior research reported that workers on Amazon Mechanical Turk (AMT) are underpaid, earning about $2/h. But the prior research did not investigate the difference in wage due to worker characteristics (e.g., country of residence). We present the first data-driven analysis on wage gap on AMT. Using work log data and demographic data collected via online survey, we analyse the gap in wage due to different factors. We show that there is indeed wage gap; for example, workers in the U.S. earn $3.01/h while those in India earn $1.41/h. BibTex Worker Demographics and Earnings on Amazon Mechanical Turk: An Exploratory Analysis @inproceedings{Hara-et-al:2019:CHI-LBW, author = {Kotaro Hara and Abigail Adams and Kristy Milland and Saiph Savage and Benjamin V. Hanrahan and Jeffrey P. Bigham and Chris Callison-Burch}, title = {Worker Demographics and Earnings on Amazon Mechanical Turk: An Exploratory Analysis}, booktitle = {CHI'19 Late Breaking Work}, month = {May}, year = {2019}, address = {Glasgow, Scotland, UK}, url = {http://www.cis.upenn.edu/~ccb/publications/crowd-workers-demographics.pdf} }
2018
Learning Scalar Adjective Intensity from Paraphrases. Anne Cocos, Skyler Wharton, Ellie Pavlick, Marianna Apidianaki and Chris Callison-Burch. EMNLP 2018. Abstract Learning Scalar Adjective Intensity from Paraphrases Adjectives like warm, hot, and scalding all describe temperature but differ in intensity. Understanding these differences between adjectives is a necessary part of reasoning about natural language. We propose a new paraphrase-based method to automatically learn the relative intensity relation that holds between a pair of scalar adjectives. Our approach analyzes over 36k adjectival pairs from the Paraphrase Database under the assumption that, for example, paraphrase pair really hot <-> scalding suggests that hot < scalding. We show that combining this paraphrase evidence with existing, complementary pattern- and lexicon-based approaches improves the quality of systems for automatically ordering sets of scalar adjectives and inferring the polarity of indirect answers to yes/no questions. BibTex Learning Scalar Adjective Intensity from Paraphrases @inproceedings{Cocos-et-al:2018:ACL, author = {Anne Cocos and Skyler Wharton and Ellie Pavlick and Marianna Apidianaki and Chris Callison-Burch}, title = {Learning Scalar Adjective Intensity from Paraphrases}, booktitle = {2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)}, month = {November}, year = {2018}, address = {Brussels, Belgium}, url = {http://www.cis.upenn.edu/~ccb/publications/learning-scalar-adjective-intensity-from-paraphrases.pdf} }
Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package. Ajay Patel, Alex Sands, Marianna Apidianaki and Chris Callison-Burch. EMNLP 2018. Demo papers. Abstract Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package Vector space embedding models like word2vec, GloVe, and fastText are extremely popular representations in natural language processing (NLP) applications. We present Magnitude, a fast, lightweight tool for utilizing and processing embeddings. Magnitude is an open source Python package with a compact vector storage file format that allows for efficient manipulation of huge numbers of embeddings. Magnitude performs common operations up to 60 to 6,000 times faster than Gensim. Magnitude introduces several novel features for improved robustness like out-of-vocabulary lookups. BibTex Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package @inproceedings{Cocos-et-al:2018:ACL, author = {Ajay Patel and Alex Sands and Marianna Apidianaki and Chris Callison-Burch}, title = {Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package}, booktitle = {2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) - Demos}, month = {November}, year = {2018}, address = {Brussels, Belgium}, url = {http://www.cis.upenn.edu/~ccb/magnitude-fast-efficient-vector-embeddings-in-python.pdf} }
Learning Translations via Images with a Massively Multilingual Image Dataset. John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Wijaya and Chris Callison-Burch. ACL 2018. Abstract Learning Translations via Images with a Massively Multilingual Image Dataset We conduct the most comprehensive study to date into translating words via images. To facilitate research on the task, we introduce a large-scale multilingual corpus of images, each labeled with the word it represents. Past datasets have been limited to only a few high-resource languages and unrealistically easy translation settings. In contrast, we have collected by far the largest available dataset for this task, with images for approximately 10,000 words in each of 100 languages. We run experiments on a dozen high resource languages and 20 low resources languages, demonstrating the effect of word concreteness and part-of-speech on translation quality. To improve image-based translation, we introduce a novel method of predicting word concreteness from images, which improves upon previously state-of-the-art unsupervised techniques. This allows us to predict when image-based translation may be effective, enabling consistent improvements to a state-of-the-art text-based word translation system. Our code and dataset will be made available. BibTex Learning Translations via Images with a Massively Multilingual Image Dataset @inproceedings{Hewitt-et-al:2018:ACL, author = {John Hewitt and Daphne Ippolito and Brendan Callahan and Reno Kriz and Derry Wijaya and Chris Callison-Burch}, title = {Learning Translations via Images with a Massively Multilingual Image Dataset}, booktitle = {The 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)}, month = {July}, year = {2018}, address = {Melbourne, Australia}, url = {http://www.cis.upenn.edu/~ccb/publications/learning-translations-via-images.pdf} }
Simplification Using Paraphrases and Context-based Lexical Substitution. Reno Kriz, Eleni Miltsakaki, Marianna Apidianaki and Chris Callison-Burch. NAACL 2018. Abstract Simplification Using Paraphrases and Context-based Lexical Substitution Lexical simplification involves, among other steps, identifying complex words or phrases that need simplification and recommending simpler words or phrases that can be more easily understood. In this paper, we improve the task of identifying English complex words by proposing a model that exploits both lexical and contextual features for identifying words in a text that need to be simplified. We improve the lexical substitution task by using a word embedding-based lexical substitution model, and replacing the detected complex words with simpler paraphrases that preserve the meaning of the original segments. We compare our lexical simplification system to several baselines, and evaluate the best performing system using human judgments for 2,351 tokens. BibTex Simplification Using Paraphrases and Context-based Lexical Substitution @inproceedings{Kriz-et-al:2018:NAACL, author = {Reno Kriz and Eleni Miltsakaki and Marianna Apidianaki and Chris Callison-Burch}, title = {Simplification Using Paraphrases and Context-based Lexical Substitution}, booktitle = {The 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018)}, month = {June}, year = {2018}, address = {New Orleans, Louisiana }, url = {http://www.cis.upenn.edu/~ccb/publications/simpliciation-using-paraphrases-and-lexical-substitution.pdf} }
Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation. Marianna Apidianaki, Guillaume Wisniewski, Anne Cocos and Chris Callison-Burch. NAACL 2018. Short papers. Abstract Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation Human translators and MT systems can produce multiple plausible translations for input texts. To reward meaning-equivalent but lexically divergent translations, MT evaluation metrics exploit synonyms and paraphrases, or multiple references. The HyTER metric relies on massive reference networks encoding an exponential number of correct translations for parts of a given sentence, proposed by human annotators. The manually built networks encode the set of all correct translations for a sentence, and HyTER rewards high quality hypotheses by measuring their minimum edit distance to the set of possible translations BibTex Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation @inproceedings{Apidianaki-et-al:2018:NAACL, author = {Marianna Apidianaki and Guillaume Wisniewski and Anne Cocos and Chris Callison-Burch}, title = {Automated Paraphrase Lattice Creation for {HyTER} Machine Translation Evaluation}, booktitle = {The 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018)}, month = {June}, year = {2018}, address = {New Orleans, Louisiana }, url = {http://www.cis.upenn.edu/~ccb/publications/hyter-paraphrase-lattices.pdf} }
Comparing Constraints for Taxonomic Organization. Anne Cocos, Marianna Apidianaki and Chris Callison-Burch. NAACL 2018. Abstract Comparing Constraints for Taxonomic Organization Building a lexical taxonomy from the ground up involves several sub-tasks: selecting terms to include, predicting relations between terms, and selecting the best subset of relations to keep, given constraints on the taxonomy graph. Methods for the final taxonomic organization step vary in terms of the constraints they impose, and whether they enable discovery of synonymous terms. But it is hard to isolate the impact of these factors on the quality of the resulting taxonomy because organization methods are rarely compared directly. In this paper, we present a head-to-head comparison of six taxonomic organization algorithms that vary with respect to their structural and transitivity constraints, and treatment of synonymy. We find that while transitive algorithms out-perform their non-transitive counterparts, the top-performing transitive algorithm is prohibitively slow for taxonomies with as few as 50 entities. We propose a simple modification to a non-transitive maximum spanning tree algorithm to explicitly incorporate synonymy, resulting in a method that is orders of magnitude faster than the top-performer while giving comparable performance. BibTex Comparing Constraints for Taxonomic Organization @inproceedings{Cocos-et-al:2018:NAACL, author = {Anne Cocos and Marianna Apidianaki and Chris Callison-Burch}, title = {Comparing Constraints for Taxonomic Organization}, booktitle = {The 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018)}, month = {June}, year = {2018}, address = {New Orleans, Louisiana }, url = {http://www.cis.upenn.edu/~ccb/publications/comparing-constraints-for-taxonomic-organization.pdf} }
A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk. Honorable Mention Award. Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, Jeffrey P. Bigham. CHI 2018. Press Press Coverage Wired Magazine - April 23, 2020 - Newly Unemployed, and Labeling Photos for Pennies: People who've lost jobs and are stuck indoors are turning to crowd work—filling out online surveys and transcribing audio for less than the minimum wage. Full Frontal with Samantha Bee - March 11, 2020 - America’s Gig-Based Economy Gets Zero New York Times - November 15, 2019 - I Found Work on an Amazon Website. I Made 97 Cents an Hour. Abstract A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk A growing number of people are working as part of on-line crowd work. Crowd work is often thought to be low wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~$2/h, and only 4% earned more than $7.25/h. While the average requester pays more than $11/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work. BibTex A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk @inproceedings{Hara-et-al:2018:CHI, author = {Kotaro Hara and Abi Adams and Kristy Milland and Saiph Savage and Chris Callison-Burch and Jeffrey P. Bigham}, title = {A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk}, booktitle = {CHI 2018}, month = {April}, year = {2018}, address = {Montreal, QC, Canada}, url = {http://www.cis.upenn.edu/~ccb/publications/data-driven-analysis-of-workers-earnings-on-amazon-mechanical-turk.pdf} }
ChatEval: A Tool for the Systematic Evaluation of Chatbots. Joao Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. Workshop on Intelligent Interactive Systems and Language Generation 2018. Abstract ChatEval: A Tool for the Systematic Evaluation of Chatbots Open-domain dialog systems are difficult to evaluate. The current best practice for analyzing and comparing these dialog systems is the use of human judgments. However, the lack of standardization in evaluation procedures, and the fact that model parameters and code are rarely published hinder systematic human evaluation experiments. We introduce a unified framework for human evaluation of chatbots that augments existing chatbot tools, and provides a web-based hub for researchers to share and compare their dialog systems. Researchers can submit their trained models to the ChatEval web interface and obtain comparisons with baselines and prior work. The evaluation code is open-source to ensure evaluation is performed in a standardized and transparent way. In addition, we introduce open-source baseline models and evaluation datasets. ChatEval can be found at https://chateval.org/ BibTex ChatEval: A Tool for the Systematic Evaluation of Chatbots @inproceedings{Sedoc:2018:ChatEval, author = {Joao Sedoc and Daphne Ippolito and Arun Kirubarajan and Jai Thirani and Lyle Ungar and Chris Callison-Burch}, title = {ChatEval: A Tool for the Systematic Evaluation of Chatbots}, booktitle = {Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation}, year = {2018}, address = {Tilburg, The Netherlands}, url = {http://www.cis.upenn.edu/~ccb/publications/chateval.pdf} }
2017
Learning Translations via Matrix Completion. Derry Wijaya, Brendan Callahan, John Hewitt, Jie Gao, Xiao Ling, Marianna Apidianaki and Chris Callison-Burch. EMNLP 2017. Abstract Learning Translations via Matrix Completion Bilingual Lexicon Induction is the task of learning word translations without bilingual parallel corpora. We model this task as a matrix completion problem, and present an effective and extendable framework for completing the matrix. This method harnesses diverse bilingual and monolingual signals, each of which may be incomplete or noisy. Our model achieves state-of-the-art performance for both high and low resource languages. BibTex Learning Translations via Matrix Completion @inproceedings{Wijaya-et-al:2017:EMNLP, author = {Derry Wijaya and Brendan Callahan and John Hewitt and Jie Gao and Xiao Ling and Marianna Apidianaki and Chris Callison-Burch}, title = {Learning Translations via Matrix Completion}, booktitle = {Conference on Empirical Methods in Natural Language Processing}, month = {September}, year = {2017}, address = {Copenhagen, Denmark}, url = {http://www.cis.upenn.edu/~ccb/publications/learning-translations-via-matrix-completion.pdf} }
KnowYourNyms? A Game of Semantic Relationships. Ross Mechanic, Dean Fulgoni, Hannah Cutler, Sneha Rajana, Zheyuan Liu, Bradley Jackson, Anne Cocos, Chris Callison-Burch and Marianna Apidianaki. EMNLP 2017. Demo papers. Abstract KnowYourNyms? A Game of Semantic Relationships Semantic relation knowledge is crucial for natural language understanding. We introduce KnowYourNyms?, a web-based game for learning semantic relations. While providing users with an engaging experience, the application collects large amounts of data that can be used to improve semantic relation classifiers. The data also broadly informs us of how people perceive the relationships between words, providing useful insights for research in psychology and linguistics. Code Website BibTex KnowYourNyms? A Game of Semantic Relationships @inproceedings{Mechanic-et-al:2017:EMNLP, author = {Ross Mechanic and Dean Fulgoni and Hannah Cutler and Sneha Rajana and Zheyuan Liu and Bradley Jackson and Anne Cocos and Chris Callison-Burch and Marianna Apidianaki}, title = {KnowYourNyms? A Game of Semantic Relationships}, booktitle = {Conference on Empirical Methods in Natural Language Processing}, month = {September}, year = {2017}, address = {Copenhagen, Denmark}, url = {http://www.cis.upenn.edu/~ccb/publications/know-your-nyms.pdf} }
Constructing an Alias List for Named Entities During an Event. Anietie Andy, Mark Dredze, Mugizi Rwebangira, and Chris Callison-Burch. Workshop on Noisy User-generated Text 2017. Abstract Constructing an Alias List for Named Entities During an Event In certain fields, real-time knowledge from events can help in making informed decisions. In order to extract pertinent realtime knowledge related to an event, it is important to identify the named entities and their corresponding aliases related to the event. The problem of identifying aliases of named entities that spike has remained unexplored. In this paper, we introduce an algorithm, EntitySpike, that identifies entities that spike in popularity in tweets from a given time period, and constructs an alias list for these spiked entities. EntitySpike uses a temporal heuristic to identify named entities with similar context that occur in the same time period (within minutes) during an event. Each entity is encoded as a vector using this temporal heuristic.We show how these entityvectors can be used to create a named entity alias list. We evaluated our algorithm on a dataset of temporally ordered tweets from a single event, the 2013 Grammy Awards show. We carried out various experiments on tweets that were published in the same time period and show that our algorithm identifies most entity name aliases and outperforms a competitive baseline. BibTex Constructing an Alias List for Named Entities During an Event @inproceedings{andy2017constructing, title={Constructing an Alias List for Named Entities during an Event}, author={Andy, Anietie and Dredze, Mark and Rwebangira, Mugizi and Callison-Burch, Chris}, booktitle={Proceedings of the 3rd Workshop on Noisy User-generated Text}, pages={40--44}, year={2017} }
Systematically Adapting Machine Translation for Grammatical Error Correction. Courtney Napoles and Chris Callison-Burch. 12th Workshop on Innovative Use of NLP for Building Educational Applications (BEA12) 2017. Abstract Systematically Adapting Machine Translation for Grammatical Error Correction In this work we adapt machine translation (MT) to grammatical error correction, identifying how components of the statistical MT pipeline can be modified for this task and analyzing how each modification impacts system performance. We evaluate the contribution of each of these components with standard evaluation metrics and automatically characterize the morphological and lexical transformations made in system output. Our model rivals the current state of the art using a fraction of the training data. BibTex Systematically Adapting Machine Translation for Grammatical Error Correction @InProceedings{napoles-callisonburch:2017:BEA, author = {Napoles, Courtney and Callison-Burch, Chris}, title = {Systematically Adapting Machine Translation for Grammatical Error Correction}, booktitle = {Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications}, month = {September}, year = {2017}, address = {Copenhagen, Denmark}, publisher = {Association for Computational Linguistics}, pages = {345--356}, url = {http://www.aclweb.org/anthology/W17-5039} }
Mapping the Paraphrase Database to WordNet. Anne Cocos, Marianna Apidianaki and Chris Callison-Burch. STARSEM 2017. Abstract Mapping the Paraphrase Database to WordNet WordNet has facilitated important research in natural language processing but its usefulness is somewhat limited by its relatively small coverage. The Paraphrase Database (PPDB) covers 650 times more words, but lacks the semantic structure of WordNet that would make it more directly useful for downstream tasks. We present a method for mapping words from PPDB to WordNet synsets with 89% accuracy. The mapping also lays important groundwork for incorporating WordNet's relations into PPDB so as to increase its utility for semantic reasoning in applications. BibTex Mapping the Paraphrase Database to WordNet @inproceedings{Cocos-et-al:2017:STARSEM, author = {Anne Cocos and Marianna Apidianaki and Chris Callison-Burch}, title = {Mapping the Paraphrase Database to WordNet}, booktitle = {*SEM 2017: The Sixth Joint Conference on Lexical and Computational Semantics}, month = {August}, year = {2017}, address = {Vancouver, Canada}, url = {http://www.cis.upenn.edu/~ccb/publications/mapping-ppdb-to-wordnet.pdf} }
Learning Antonyms with Paraphrases and a Morphology-aware Neural Network. Sneha Rajana, Chris Callison-Burch, Marianna Apidianaki and Vered Shwartz. STARSEM 2017. Abstract Learning Antonyms with Paraphrases and a Morphology-aware Neural Network Recognizing antonymy is a key task for improving the performance of NLP systems. In this paper, we propose a novel method for deriving antonym pairs from paraphrase pairs containing negation markers. We further integrate morphological features indicative of antonymy into a path-based relation detection algorithm. Our novel neural network model, AntNET, outperforms state-of-the-art models in distinguishing antonyms from other semantic relations and is capable of efficiently handling multi-word expressions. BibTex Learning Antonyms with Paraphrases and a Morphology-aware Neural Network @inproceedings{Rajana-et-al:2017:STARSEM, author = {Sneha Rajana and Chris Callison-Burch and Marianna Apidianaki and Vered Shwartz}, title = {Learning Antonyms with Paraphrases and a Morphology-aware Neural Network}, booktitle = {*SEM 2017: The Sixth Joint Conference on Lexical and Computational Semantics}, month = {August}, year = {2017}, address = {Vancouver, Canada}, url = {http://www.cis.upenn.edu/~ccb/publications/learning-antonyms.pdf} }
Word Sense Filtering Improves Embedding-Based Lexical Substitution. Best Paper Award. Anne Cocos, Marianna Apidianaki and Chris Callison-Burch. Workshop on Sense, Concept and Entity Representations and their Applications 2017. Abstract Word Sense Filtering Improves Embedding-Based Lexical Substitution The role of word sense disambiguation in lexical substitution has been questioned due to the high performance of vector space models which propose good substitutes without explicitly accounting for sense. We show that a filtering mechanism based on a sense inventory optimized for substitutability can improve the results of these models. Our sense inventory is constructed using a clustering method which generates paraphrase clusters that are congruent with lexical substitution annotations in a development set. The results show that lexical substitution can still benefit from senses which can improve the output of vector space paraphrase ranking models. BibTex Word Sense Filtering Improves Embedding-Based Lexical Substitution @inproceedings{Cocos-Apidianaki-Callison-Burch:2017:SENSE-WS, author = {Anne Cocos and Marianna Apidianaki and Chris Callison-Burch}, title = {Word Sense Filtering Improves Embedding-Based Lexical Substitution}, booktitle = {Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications}, month = {April}, year = {2017}, address = {Valencia, Spain}, publisher = {Association for Computational Linguistics}, pages = {110--119}, url = {http://www.aclweb.org/anthology/E17-2016} }
The Language of Place: Semantic Value from Geospatial Context. Ann Cocos and Chris Callison-Burch. EACL 2017. Short papers. Abstract The Language of Place: Semantic Value from Geospatial Context There is a relationship between what we say and where we say it. Word embeddings are usually trained assuming that semantically-similar words occur within the same textual contexts. We investigate the extent to which semantically-similar words occur within the same geospatial contexts. We enrich a corpus of geolocated Twitter posts with physical data derived from Google Places and OpenStreetMap, and train word embeddings using the resulting geospatial contexts. Intrinsic evaluation of the resulting vectors shows that geographic context alone does provide useful information about semantic relatedness. BibTex The Language of Place: Semantic Value from Geospatial Context @InProceedings{cocos-callisonburch:2017:EACLshort, author = {Cocos, Anne and Callison-Burch, Chris}, title = {The Language of Place: Semantic Value from Geospatial Context}, booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers}, month = {April}, year = {2017}, address = {Valencia, Spain}, publisher = {Association for Computational Linguistics}, pages = {99--104}, url = {http://www.aclweb.org/anthology/E17-2016} }
Crowd Control: Effectively Utilizing Unscreened Crowd Workers for Biomedical Data Annotation. Anne Cocos, Ting Qiana, Chris Callison-Burch, and Aaron J. Masino. Journal of Biomedical Informatics 2017. Abstract Crowd Control: Effectively Utilizing Unscreened Crowd Workers for Biomedical Data Annotation Annotating unstructured texts in Electronic Health Records data is usually a necessary step for conducting machine learning research on such datasets. Manual annotation by domain experts provides data of the best quality, but has become increasingly impractical given the rapid increase in the volume of EHR data. In this article, we examine the effectiveness of crowdsourcing with unscreened online workers as an alternative for transforming unstructured texts in EHRs into annotated data that are directly usable in supervised learning models. We find the crowdsourced annotation data to be just as effective as expert data in training a sentence classification model to detect the mentioning of abnormal ear anatomy in radiology reports of audiology. Furthermore, we have discovered that enabling workers to self-report a confidence level associated with each annotation can help researchers pinpoint less-accurate annotations requiring expert scrutiny. Our findings suggest that even crowd workers without specific domain knowledge can contribute effectively to the task of annotating unstructured EHR datasets. BibTex Crowd Control: Effectively Utilizing Unscreened Crowd Workers for Biomedical Data Annotation @article{Cocos-EtAl:2017:Biomedical-Informatics, author = {Anne Cocos and Ting Qiana and Chris Callison-Burch and Aaron Masino}, title = {Crowd Control: Effectively Utilizing Unscreened Crowd Workers for Biomedical Data Annotation}, journal = {Journal of Biomedical Informatics}, volume = {}, number = {}, year = {2017}, url = {http://www.sciencedirect.com/science/article/pii/S1532046417300746} }
2016
The Gun Violence Database. Ellie Pavlick and Chris Callison-Burch. Bloomberg Data for Good Exchange 2016. Abstract The Gun Violence Database We describe the Gun Violence Database (GVDB), a large and growing database of gun violence incidents in the United States. The GVDB is built from the detailed information found in local news reports about gun violence, and is constructed via a large-scale crowdsourced annotation effort through our web site, http://gun-violence.org/. We argue that centralized and publicly available data about gun violence can facilitate scientific, fact-based discussion about a topic that is often dominated by politics and emotion. We describe our efforts to automate the construction of the database using state-of-the-art natural language processing (NLP) technologies, eventually enabling a fully-automated, highly-scalable resource for research on this important public health problem.
The Gun Violence Database: A new task and data set for NLP. Ellie Pavlick, Heng Ji, Xiaoman Pan and Chris Callison-Burch. EMNLP 2016. Short papers. Abstract The Gun Violence Database: A new task and data set for NLP We argue that NLP researchers are especially well-positioned to contribute to the national discussion about gun violence. Reasoning about the causes and outcomes of gun violence is typically dominated by politics and emotion, and data-driven research on the topic is stymied by a shortage of data and a lack of federal funding. However, data abounds in the form of unstructured text from news articles across the country. This is an ideal application of NLP technologies, such as relation extraction, coreference resolution, and event detection. We introduce a new and growing dataset, the Gun Violence Database, in order to facilitate the adaptation of current NLP technologies to the domain of gun violence, thus enabling better social science research on this important and under-resourced problem. BibTex The Gun Violence Database: A new task and data set for NLP @inproceedings{Pavlick-EtAl:2016:EMNLP, author = {Ellie Pavlick and Heng Ji and Xiaoman Pan and Chris Callison-Burch}, title = {The Gun Violence Database: A new task and data set for {NLP}}, booktitle = {Proceedings of The 2016 Conference on Empirical Methods on Natural Language Processing (EMNLP)}, month = {November}, year = {2016}, address = {Austin, TX}, url = {http://www.cis.upenn.edu/~ccb/publications/gun-violence-database.pdf} }
Tense Manages to Predict Implicative Behavior in Verbs. Ellie Pavlick and Chris Callison-Burch. EMNLP 2016. Short papers. Abstract Tense Manages to Predict Implicative Behavior in Verbs Implicative verbs (e.g. manage) entail their compliment clauses, while non-implicative verbs (e.g. want) do not. For example, while managing to solve the problem entails solving the problem, no such inference follows from wanting to solve the problem. Differentiating between implicative and non-implicative verbs is therefore an essential component of natural language understanding, relevant to applications such as textual entailment and summarization. We present a simple method for predicting implicativeness which exploits known constraints on the tense of implicative verbs and their compliments. We show that this yields an effective, data-driven way of capturing this nuanced property in verbs. BibTex Tense Manages to Predict Implicative Behavior in Verbs @inproceedings{Pavlick-Callison-Burch:2016:EMNLP, author = {Ellie Pavlick and Chris Callison-Burch}, title = {Tense Manages to Predict Implicative Behavior in Verbs}, booktitle = {Proceedings of The 2016 Conference on Empirical Methods on Natural Language Processing (EMNLP)}, month = {November}, year = {2016}, address = {Austin, TX}, url = {http://www.cis.upenn.edu/~ccb/publications/tense-predicts-implicative-verbs.pdf} }
So-Called Non-Subsective Adjectives. Best Paper Award. Ellie Pavlick and Chris Callison-Burch. STARSEM 2016. Abstract So-Called Non-Subsective Adjectives The interpretation of adjective-noun pairs plays a crucial role in tasks such as recognizing textual entailment. Formal semantics often places adjectives into a taxonomy which should dictate adjectives’ entailment behavior when placed in adjective-noun compounds. However, we show experimentally that the behavior of subsective adjectives (e.g. red) versus non-subsective adjectives (e.g. fake) is not as cut and dry as often assumed. For example, inferences are not always symmetric: while ID is generally considered to be mutually exclusive with fake ID, fake ID is considered to entail ID. We discuss the implications of these findings for automated natural language understanding. BibTex So-Called Non-Subsective Adjectives @inproceedings{Pavlick-Callison-Burch:2016:ACL, author = {Ellie Pavlick and Chris Callison-Burch}, title = {So-Called Non-Subsective Adjectives}, booktitle = {*SEM 2016: The Fifth Joint Conference on Lexical and Computational Semantics}, month = {August}, year = {2016}, address = {Berlin, Germany}, url = {http://www.cis.upenn.edu/~ccb/publications/non-subsective-adjectives.pdf} }
Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns. Ellie Pavlick and Chris Callison-Burch. ACL 2016. Abstract Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns We examine adjective-noun (AN) composition in the task of recognizing textual entailment (RTE). We analyze behavior of ANs in large corpora and show that, despite conventional wisdom, adjectives do not always restrict the denotation of the nouns they modify. We use natural logic to characterize the variety of entailment relations that can result from AN composition. Predicting these relations depends on context and on common-sense knowledge, making AN composition especially challenging for current RTE systems. We demonstrate the inability of current state-of-the-art systems to handle AN composition in a simplified RTE task which involves the insertion of only a single word. Data BibTex Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns @inproceedings{Pavlick-Callison-Burch:2016:ACL, author = {Ellie Pavlick and Chris Callison-Burch}, title = {Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns}, booktitle = {The 54th Annual Meeting of the Association for Computational Linguistics (ACL)}, month = {August}, year = {2016}, address = {Berlin, Germany}, url = {http://www.cis.upenn.edu/~ccb/publications/compositional-entailment-in-adjective-nouns.pdf} }
Simple PPDB: A Paraphrase Database for Simplification. Ellie Pavlick and Chris Callison-Burch. ACL 2016. Short papers. Abstract Simple PPDB: A Paraphrase Database for Simplification We release the Simple Paraphrase Database, a subset of of the Paraphrase Database (PPDB) adapted for the task of text simplification. We train a supervised model to associate simplification scores with each phrase pair, producing rankings competitive with state-of-the-art lexical simplification models. Our new simplification database contains 4.4 million paraphrase rules, making it the largest available resource for lexical simplification. Data BibTex Simple PPDB: A Paraphrase Database for Simplification @inproceedings{Pavlick-Callison-Burch:2016:ACL, author = {Ellie Pavlick and Chris Callison-Burch}, title = {Simple {PPDB}: A Paraphrase Database for Simplification}, booktitle = {The 54th Annual Meeting of the Association for Computational Linguistics (ACL)}, month = {August}, year = {2016}, address = {Berlin, Germany}, url = {http://www.cis.upenn.edu/~ccb/publications/simple-ppdb.pdf} }
Clustering Paraphrases by Word Sense. Anne Cocos and Chris Callison-Burch. NAACL 2016. Abstract Clustering Paraphrases by Word Sense Automatically generated databases of English paraphrases have the drawback that they return a single list of paraphrases for an input word or phrase. This means that all senses of polysemous words are grouped together, unlike WordNet which partitions different senses into separate synsets. We present a new method for clustering paraphrases by word sense, and apply it to the Paraphrase Database (PPDB). We investigate the performance of hierarchical and spectral clustering algorithms, and systematically explore different ways of defining the similarity matrix that they use as input. Our method produces sense clusters that are qualitatively and quantitatively good, and that represent a substantial improvement to the PPDB resource. BibTex Clustering Paraphrases by Word Sense @inproceedings{Cocos-Callison-Burch:2016:NAACL, author = {Anne Cocos and Chris Callison-Burch}, title = {Clustering Paraphrases by Word Sense}, booktitle = {The 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016)}, month = {June}, year = {2016}, address = {San Diego, California}, url = {http://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-by-word-sense.pdf} }
Sentential Paraphrasing as Black-Box Machine Translation. Courtney Napoles, Chris Callison-Burch, and Matt Post. NAACL 2016. Short papers. Abstract Sentential Paraphrasing as Black-Box Machine Translation We present a simple, prepackaged solution to generating paraphrases of English sentences. We use the Paraphrase Database (PPDB) for monolingual sentence rewriting and provide machine translation language packs: prepackaged, tuned models that can be downloaded and used to generate paraphrases on a standard Unix environment. The language packs can be treated as a black box or customized to specific tasks. In this demonstration, we will explain how to use the included interactive web-based tool to generate sentential paraphrases. BibTex Sentential Paraphrasing as Black-Box Machine Translation @inproceedings{Napoles-et-al:2016:NAACL-demos, author = {Courtney Napoles and Chris Callison-Burch and Matt Post}, title = {Sentential Paraphrasing as Black-Box Machine Translation}, booktitle = {The 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016)}, month = {June}, year = {2016}, address = {San Diego, California}, url = {http://www.cis.upenn.edu/~ccb/publications/sentential-paraphrasing-demo-paper.pdf} }
Optimizing Statistical Machine Translation for Text Simplification. Wei Xu, Courtney Napoles, Ellie Pavlick, Jim Chen, and Chris Callison-Burch. TACL 2016. Abstract Optimizing Statistical Machine Translation for Text Simplification Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task. BibTex Optimizing Statistical Machine Translation for Text Simplification @article{Xu-EtAl:2016:TACL, author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch}, title = {Optimizing Statistical Machine Translation for Text Simplification}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, url = {http://www.cis.upenn.edu/~ccb/publications/optimizing-machine-translation-for-text-simplifciation.pdf}, pages = {401--415} }
A Comprehensive Analysis of Bilingual Lexicon Induction. Ann Irvine and Chris Callison-Burch. Computational Linguistics 2016. Abstract A Comprehensive Analysis of Bilingual Lexicon Induction Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. In this paper we present the most comprehensive analysis of bilingual lexicon induction to date. We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages -- Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Spanish, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese and Welsh. We analyze the behavior of bilingual lexicon induction on low frequency words, rather than testing solely on high frequency words, as previous research has done. Low frequency words are more relevant to statistical machine translation, where systems typically lack translations of rare words that fall outside of their training data. We systematically explore a wide range of features and phenomena that affect the quality of the translations discovered by bilingual lexicon induction. We give illustrative examples of the highest ranking translations for orthogonal signals of translation equivalence like contextual similarity and temporal similarity. We analyze the effects of frequency and burstiness, and the sizes of the seed bilingual dictionaries and the monolingual training corpora. Additionally, we introduce a novel discriminative approach to bilingual lexicon induction. Our discriminative model is capable of combining a wide variety of features, which individually provide only weak indications of translation equivalence. When feature weights are discriminatively set, these signals produce dramatically higher translation quality than previous approaches that combined signals in an unsupervised fashion (e.g. using minimum reciprocal rank). We also directly compare our model's performance against a sophisticated generative approach, the matching canonical correlation analysis (MCCA) algorithm used by Haghighi et al (2008). Our algorithm achieves an accuracy of 42% versus MCCA's 15%.
End-to-End Statistical Machine Translation with Zero or Small Parallel Texts. Ann Irvine and Chris Callison-Burch. Journal of Natural Language Engineering 2016. Abstract End-to-End Statistical Machine Translation with Zero or Small Parallel Texts We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually-estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora. BibTex End-to-End Statistical Machine Translation with Zero or Small Parallel Texts @article{Irvine-Callison-Burch:2015:JNLE, author = {Ann Irvine and Chris Callison-Burch}, title = {End-to-End Statistical Machine Translation with Zero or Small Parallel Texts}, journal = {Journal of Natural Language Engineering}, volume = {22}, issue = {4}, year = {2016}, url = {http://www.cis.upenn.edu/~ccb/publications/end-to-end-smt-with-zero-or-small-bitexts.pdf}, pages = {517-548} }
2015
The Shield of Heroic Memories (mp3). Chris Callison-Burch. The Adventure Zone podcast 2015. Abstract The Shield of Heroic Memories (mp3) I designed an item for The Adventure Zone, a comedy podcast about three brothers playing D&D with their dad. The McElroy brothers were incredibly enthusiastic about my submission. I want all of my paper reviews to say what they said, "That's already radical and then my boy Chris Callison-Burch kicked it up a notch. It's brilliant."
Adding Semantics to Data-Driven Paraphrasing. Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. ACL 2015. Abstract Adding Semantics to Data-Driven Paraphrasing We add an interpretable semantics to the paraphrase database (PPDB). To date, the relationship between the phrase pairs in the database has been weakly defined as approximately equivalent. We show that in fact these pairs represent a variety of relations, including directed entailment (little girl/girl) and exclusion (nobody/someone). We automatically assign semantic entailment relations to entries in PPDB using features derived from past work on discovering inference rules from text and semantic taxonomy induction. We demonstrate that our model assigns these entailment relations with high accuracy. In a downstream RTE task, our labels rival relations from WordNet and improve the coverage of a proof-based RTE system by 17%. Data BibTex Adding Semantics to Data-Driven Paraphrasing @inproceedings{Pavlick-EtAl:2015:ACL, author = {Ellie Pavlick and Johan Bos and Malvina Nissim and Charley Beller and Benjamin Van Durme and Chris Callison-Burch}, title = {Adding Semantics to Data-Driven Paraphrasing}, booktitle = {The 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)}, month = {July}, year = {2015}, address = {Beijing, China}, url = {http://www.cis.upenn.edu/~ccb/publications/adding-semantics-to-data-driven-paraphrasing.pdf} }
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevich, Ben Van Durme, Chris Callison-Burch. ACL 2015. Short papers. Abstract PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification We present a new release of the Paraphrase Database. PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0’s heuristic rankings. Each paraphrase pair in the database now also includes fine-grained entailment relations, word embedding similarities, and style annotations. Data Website BibTex PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification @InProceedings{PavlickEtAl-2015:ACL:Semantics, author = {Ellie Pavlick and Pushpendre Rastogi and Juri Ganitkevich and Ben Van Durme, Chris Callison-Burch}, title = {{PPDB} 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification} booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, }
Domain-Specific Paraphrase Extraction. Ellie Pavlick, Juri Ganitkevich, Tsz Ping Chan, Xuchen Yao, Ben Van Durme, Chris Callison-Burch. ACL 2015. Short papers. Abstract Domain-Specific Paraphrase Extraction The validity of applying paraphrase rules depends on the domain of the text that they are being applied to. We develop a novel method for extracting domain-specific paraphrases. We adapt the bilingual pivoting paraphrase method to bias the training data to be more like our target domain of biology. Our best model results in higher precision while retaining complete recall, giving a 10% relative improvement in AUC. BibTex Domain-Specific Paraphrase Extraction @InProceedings{PavlickEtAl-2015:ACL:Domain, author = {Ellie Pavlick and Juri Ganitkevich and Tsz Ping Chan and Xuchen Yao and Ben Van Durme, Chris Callison-Burch}, title = {Domain-Specific Paraphrase Extraction}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, }
FrameNet+: Fast Paraphrastic Tripling of FrameNet. Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Drezde, Ben Van Durme. ACL 2015. Short papers. Abstract FrameNet+: Fast Paraphrastic Tripling of FrameNet We increase the lexical coverage of FrameNet through automatic paraphrasing. We use crowdsourcing to manually filter out bad paraphrases in order to ensure a high-precision resource. Our expanded FrameNet contains an additional 22K lexical units, a 3-fold increase over the current FrameNet, and achieves 40% better coverage when evaluated in a practical setting on New York Times data. Data BibTex FrameNet+: Fast Paraphrastic Tripling of FrameNet @InProceedings{PavlickEtAl-2015:ACL:FNPlus, author = {Ellie Pavlick and Travis Wolfe and Pushpendre Rastogi and Chris Callison-Burch and Mark Drezde and Benjamin Van Durme}, title = {FrameNet+: Fast Paraphrastic Tripling of FrameNet}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, }
Problems in Current Text Simplification Research: New Data Can Help. Wei Xu, Chris Callison-Burch, and Courtney Napoles. TACL 2015. Abstract Problems in Current Text Simplification Research: New Data Can Help Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of simplification data resources. BibTex Problems in Current Text Simplification Research: New Data Can Help @article{Xu-EtAl:2015:TACL, author = {Wei Xu and Chris Callison-Burch and Courtney Napoles}, title = {Problems in Current Text Simplification Research: New Data Can Help}, journal = {Transactions of the Association for Computational Linguistics}, volume = {3}, year = {2015}, url = {http://www.cis.upenn.edu/~ccb/publications/new-data-for-text-simplification.pdf}, pages = {283--297} }
Cost Optimization for Crowdsourcing Translation. Mingkun Gao, Wei Xu, and Chris Callison-Burch. NAACL 2015. Abstract Cost Optimization for Crowdsourcing Translation Crowdsourcing makes it possible to create translations at much lower cost than hiring professional translators. However, it is still expensive to obtain the millions of translations that are needed to train statistical machine translation systems. We propose two mechanisms to reduce the cost of crowdsourcing while maintaining high translation quality. First, we develop a method to reduce redundant translations. We train a linear model to evaluate the translation quality on a sentence-by-sentence basis, and fit a threshold between acceptable and unacceptable translations. Unlike past work, which always paid for a fixed number of translations for each source sentence and then chose the best from them, we can stop earlier and pay less when we receive a translation that is good enough. Second, we introduce a method to reduce the pool of translators by quickly identifying bad translators after they have translated only a few sentences. This also allows us to rank translators, so that we re-hire only good translators to reduce cost. BibTex Cost Optimization for Crowdsourcing Translation @inproceedings{Gao-EtAl:2015:NAACL, author = {Mingkun Gao and Wei Xu and Chris Callison-Burch}, title = {Cost Optimization in Crowdsourcing Translation}, booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2015)}, month = {June}, year = {2015}, address = {Denver, Colorado}, url = {http://www.cis.upenn.edu/~ccb/publications/cost-optimization-for-crowdsourcing-translation.pdf} }
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter. Wei Xu, Chris Callison-Burch, and Bill Dolan. SemEval 2015. Abstract SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter In this shared task, we present evaluations on two related tasks Paraphrase Identification (PI) and Semantic Textual Similarity (SS) systems for the Twitter data. Given a pair of sentences, participants are asked to produce a binary yes/no judgement or a graded score to measure their semantic equivalence. The task features a newly constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs. A total of 19 teams participated, submitting 36 runs to the PI task and 26 runs to the SS task. The evaluation shows encouraging results and open challenges for future research. The best systems scored a F1-measure of 0.674 for the PI task and a Pearson correlation of 0.619 for the SS task respectively, comparing to a strong baseline using logistic regression model of 0.589 F1 and 0.511 Pearson; while the best SS systems can often reach >0.80 Pearson on well-formed text. This shared task also provides insights into the relation between the PI and SS tasks and suggests the importance to bringing these two research areas together. We make all the data, baseline systems and evaluation scripts publicly available. BibTex SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter @inproceedings{Xu-EtAl:2015:semeval, author = {Wei Xu and Chris Callison-Burch and William B. Dolan}, title = {{SemEval-2015 Task} 1: Paraphrase and Semantic Similarity in {Twitter} ({PIT})}, booktitle = {Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval)}, year = {2015}, url = }
Effectively Crowdsourcing Radiology Report Annotations. Anne Cocos, Aaron J. Masino, Ting Qian, Ellie Pavlick, and Chris Callison-Burch. Sixth International Workshop on Health Text Mining and Information Analysis 2015. Abstract Effectively Crowdsourcing Radiology Report Annotations Crowdsourcing platforms are a popular choice for researchers to gather text annotations quickly at scale. We investigate whether crowdsourced annotations are useful when the labeling task requires medical domain knowledge. Comparing a sentence classification model trained with expert-annotated sentences to the same model trained on crowd-labeled sentences, we find the crowdsourced training data to be just as effective as the manually produced dataset. We can improve the accuracy of the crowd-fueled model without collecting further labels by filtering out worker labels applied with low confidence. BibTex Effectively Crowdsourcing Radiology Report Annotations @InProceedings{CocosEtAl-2015:LOUHI:Radiology, author = {Anne Cocos and Aaron J. Masino and Ting Qian and Ellie Pavlick and Chris Callison-Burch}, title = {Effectively Crowdsourcing Radiology Report Annotations} booktitle = {Sixth International Workshop on Health Text Mining and Information Analysis}, month = {November}, year = {2015}, address = {Lisbon, Portugal}, }
Automatically Scoring Freshman Writing: A Preliminary Investigation. Courtney Napoles and Chris Callison-Burch. Workshop on Innovative Use of NLP for Building Educational Applications 2015. Abstract Automatically Scoring Freshman Writing: A Preliminary Investigation In this work, we explore applications of automatic essay scoring (AES) to a corpus of essays written by college freshmen and discuss the challenges we faced. While most AES systems evaluate highly constrained writing, we developed a system that handles open-ended, long-form writing. We present a novel corpus for this task, containing more than 3,000 essays and drafts written for a freshman writing course. We describe statistical analysis of the corpus and identify problems with automatically scoring this type of data. Finally, we demonstrate how to overcome grader bias by using a multi-task setup, and predict scores as well as human graders on a different dataset. Finally, we discuss how AES can help teachers assign more uniform grades. BibTex Automatically Scoring Freshman Writing: A Preliminary Investigation @InProceedings{napoles-callisonburch:2015:bea, author = {Napoles, Courtney and Callison-Burch, Chris}, title = {Automatically Scoring Freshman Writing: A Preliminary Investigation}, booktitle = {Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications}, month = {June}, year = {2015}, address = {Denver, Colorado}, publisher = {Association for Computational Linguistics}, pages = {254--263} }
Extracting Structured Information via Automatic + Human Computation. Ellie Pavlick and Chris Callison-Burch. HCOMP 2015. Abstract Extracting Structured Information via Automatic + Human Computation We present a system for extracting structured information from unstructured text using a combination of information retrieval, natural language processing, machine learning, and crowdsourcing. We test our pipeline by building a structured database of gun violence incidents in the United States. The results of our pilot study demonstrate that the proposed methodology is a viable way of collecting large-scale, up-to-date data for public health, public policy, and social science research. BibTex Extracting Structured Information via Automatic + Human Computation @InProceedings{PavlickAndCallisonBurch-2015:HCOMP:GVDB, author = {Ellie Pavlick and Chris Callison-Burch}, title = {Extracting Structured Information via Automatic + Human Computation} booktitle = {HCOMP}, month = {November}, year = {2015}, address = {San Diego, California}, }
Ideological Perspective Detection Using Semantic Features. Heba Elfardy, Mona Diab and Chris Callison-Burch. STARSEM 2015. Abstract Ideological Perspective Detection Using Semantic Features In this paper, we propose the use of word sense disambiguation and latent semantic features to automatically identify a person’s perspective from his/her written text. We run an Amazon Mechanical Turk experiment where we ask Turkers to answer a set of constrained and open-ended political questions drawn from the American National Election Studies (ANES). We then extract the proposed features from the answers to the open-ended questions and use them to predict the answer to one of the constrained questions, namely, their preferred Presidential Candidate. In addition to this newly created dataset, we also evaluate our proposed approach on a second standard dataset of "Ideological-Debates". This latter dataset contains topics from four domains: Abortion, Creationism, Gun Rights and Gay Rights. Experimental results show that using word sense disambiguation and latent semantics, whether separately or combined, beats the majority and random baselines on the cross-validation and held-out-test sets for both the ANES and the four domains of the "Ideological Debates" datasets. Moreover combining both feature sets outperforms a stronger unigram-only classification system. BibTex Ideological Perspective Detection Using Semantic Features @InProceedings{elfardy-diab-callisonburch:2015:*SEM2015, author = {Elfardy, Heba and Diab, Mona and Callison-Burch, Chris}, title = {Ideological Perspective Detection Using Semantic Features}, booktitle = {Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics}, month = {June}, year = {2015}, address = {Denver, Colorado}, publisher = {Association for Computational Linguistics}, pages = {137--146}, url = {http://www.aclweb.org/anthology/S15-1015} }
2014
Arabic Dialect Identification. Omar Zaidan and Chris Callison-Burch. Computational Linguistics 2014. Abstract Arabic Dialect Identification The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic – the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual dataset rich in dialectal Arabic content, called the Arabic Online Commentary Dataset (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the dataset by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one’s own dialect). Using this new annotated dataset, we consider the task of Arabic dialect identification: given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectical data from a large web crawl consisting of 3.5 million pages mined from online Arabic newspapers. BibTex Arabic Dialect Identification @article{zaidan-callisonburch:CL:2013, author = {Omar F. Zaidan and Chris Callison-Burch}, title = {Arabic Dialect Identification}, journal = {Computational Linguistics}, year = {2014}, volume = {40}, number = {1}, pages = {171-202} }
Extracting Lexically Divergent Paraphrases from Twitter. Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan and Yangfeng Ji. TACL 2014. Abstract Extracting Lexically Divergent Paraphrases from Twitter We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identify paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve the performance competitive with a state-of-the-art method which combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community. BibTex Extracting Lexically Divergent Paraphrases from Twitter @article{Xu-EtAl-2014:TACL, author = {Wei Xu and Alan Ritter and Chris Callison-Burch and William B. Dolan and Yangfeng Ji}, title = {Extracting Lexically Divergent Paraphrases from {Twitter}}, journal = {Transactions of the Association for Computational Linguistics}, volume = {2}, number = {}, year = {2014}, pages = {435--448}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/extracting-paraphrases-from-twitter.pdf} }
Translations of the CALLHOME Egyptian Arabic corpus for conversational speech translation. Gaurav Kumar, Yuan Cao, Ryan Cotterell, Chris Callison-Burch, Daniel Povey, and Sanjeev Khudanpur. IWSLT 2014. Abstract Translations of the CALLHOME Egyptian Arabic corpus for conversational speech translation Translation of the output of automatic speech recognition (ASR) systems, also known as speech translation, has received a lot of research interest recently. This is especially true for programs such as DARPA BOLT which focus on improving spontaneous human-human conversation across languages. However, this research is hindered by the dearth of datasets developed for this explicit purpose. For Egyptian Arabic-English, in particular, no parallel speech-transcription-translation dataset exists in the same domain. In order to support research in speech translation, we introduce the Callhome Egyptian Arabic-English Speech Translation Corpus. This supplements the existing LDC corpus with four reference translations for each utterance in the transcripts. The result is a three-way parallel dataset of Egyptian Arabic Speech, transcriptions and English translations. BibTex Translations of the CALLHOME Egyptian Arabic corpus for conversational speech translation @InProceedings{kumar-EtAl:2014:IWSLT, author = {Matt Post and Gaurav Kumar and Adam Lopez and Damianos Karakos and Chris Callison-Burch and Sanjeev Khudanpur}, title = {Translations of the {CALLHOME} {Egyptian} {Arabic} corpus for conversational speech translation}, booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)} month = {December}, year = {2014}, address = {Lake Tahoe, USA}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/callhome-egyptian-arabic-speech-translations.pdf} }
Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse. Quanze Chen, Chenyang Lei, Wei Xu, Ellie Pavlick and Chris Callison-Burch. HCOMP Poster 2014. Abstract Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse Poetry composition is a very complex task that requires a poet to satisfy multiple constraints concurrently. We believe that the task can be augmented by combining the creative abilities of humans with computational algorithms that efficiently constrain and permute available choices. We present a hybrid method for generating poetry from prose that combines crowdsourcing with natural language processing (NLP) machinery. We test the ability of crowd workers to accomplish the technically challenging and creative task of composing poems. BibTex Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse @InProceedings{Chen-et-al:HCOMP:2014, author = {Quanze Chen and Chenyang Lei and Wei Xu and Ellie Pavlick and Chris Callison-Burch}, title = {Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse}, booktitle = {The Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP-2014)}, month = {November}, year = {2014}, url = {http://cis.upenn.edu/~ccb/publications/poetry-generation-with-crowdsourcing.pdf} }
Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work. Chris Callison-Burch. HCOMP Poster 2014. Abstract Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work The Mechanical Turk crowdsourcing platform currently fails to provide the most basic piece of information to enable workers to make informed decisions about which tasks to undertake: what is the expected hourly pay? Mechanical Turk advertises a reward amount per assignment, but does not give any indication of how long each assignment will take. We have developed a browser plugin that tracks the length of time it takes to complete a task, and a web service that aggregates the information across many workers. Crowd-Workers. com allows workers to discovery higher paying work by sorting tasks by estimated hourly rate. BibTex Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work @InProceedings{Chen-et-al:HCOMP:2014, author = {Chris Callison-Burch}, title = {Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work}, booktitle = {The Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP-2014)}, month = {November}, year = {2014}, url = {http://cis.upenn.edu/~ccb/publications/crowd-workers.pdf} }
The Language Demographics of Amazon Mechanical Turk. Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. TACL 2014. Abstract The Language Demographics of Amazon Mechanical Turk We present a large scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers' self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months, and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems. Data Code BibTex The Language Demographics of Amazon Mechanical Turk @article{Pavlick-EtAl-2014:TACL, author = {Ellie Pavlick and Matt Post and Ann Irvine and Dmitry Kachaev and Chris Callison-Burch}, title = {The Language Demographics of {Amazon Mechanical Turk}}, journal = {Transactions of the Association for Computational Linguistics}, volume = {2}, number = {Feb}, year = {2014}, pages = {79--92}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/language-demographics-of-mechanical-turk.pdf} }
Hallucinating Phrase Translations for Low Resource MT. Ann Irvine and Chris Callison-Burch. CoNLL 2014. Abstract Hallucinating Phrase Translations for Low Resource MT We demonstrate that “hallucinating” phrasal translations can significantly improve the quality of machine translation in low resource conditions. Our hallucinated phrase tables consist of entries composed from multiple unigram translations drawn from the baseline phrase table and from translations that are induced from monolingual corpora. The hallucinated phrase table is very noisy. Its translations are low precision but high recall. We counter this by introducing 30 new feature functions (including a variety of monolingually-estimated features) and by aggressively pruning the phrase table. Our analysis evaluates the intrinsic quality of our hallucinated phrase pairs as well as their impact in end-to-end Spanish-English and Hindi-English MT. BibTex Hallucinating Phrase Translations for Low Resource MT @InProceedings{irvine-callisonburch:2014:W14-16, author = {Irvine, Ann and Callison-Burch, Chris}, title = {Hallucinating Phrase Translations for Low Resource MT}, booktitle = {Proceedings of the Eighteenth Conference on Computational Natural Language Learning}, month = {June}, year = {2014}, pages = {160--170}, url = {http://www.aclweb.org/anthology/W14-1617} }
Using Comparable Corpora to Adapt MT Models to New Domains. Ann Irvine and Chris Callison-Burch. WMT 2014. Abstract Using Comparable Corpora to Adapt MT Models to New Domains In previous work we showed that when using an SMT model trained on old-domain data to translate text in a new-domain, most errors are due to unseen source words, unseen target translations, and inaccurate translation model scores (Irvine et al., 2013a). In this work, we target errors due to inaccurate translation model scores using new-domain comparable corpora, which we mine from Wikipedia. We assume that we have access to a large olddomain parallel training corpus but only enough new-domain parallel data to tune model parameters and do evaluation. We use the new-domain comparable corpora to estimate additional feature scores over the phrase pairs in our baseline models. Augmenting models with the new features improves the quality of machine translations in the medical and science domains by up to 1.3 BLEU points over very strong baselines trained on the 150 million word Canadian Hansard dataset. BibTex Using Comparable Corpora to Adapt MT Models to New Domains @InProceedings{irvine-callisonburch:2014:W14-33, author = {Irvine, Ann and Callison-Burch, Chris}, title = {Using Comparable Corpora to Adapt MT Models to New Domains}, booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation}, month = {June}, year = {2014}, pages = {437--444}, url = {http://www.aclweb.org/anthology/W14-3357} }
Are Two Heads are Better than One? Crowdsourced Translation via a Two-Step Collaboration between Translators and Editors. Rui Yan, Mingkun Gao, Ellie Pavlick, and Chris Callison-Burch. ACL 2014. Abstract Are Two Heads are Better than One? Crowdsourced Translation via a Two-Step Collaboration between Translators and Editors Crowdsourcing is a viable mechanism for creating training data for machine translation. It provides a low cost, fast turn-around way of processing large volumes of data. However, when compared professional translation, naive collection of translations from non-professionals yields low-quality results. Careful quality control is necessary for crowdsourcing to work well. In this paper, we examine the challenges of a two-step collaboration process with translation and post-editing by non-professionals. We develop graph-based ranking models that automatically select the best output from multiple redundant versions of translations and edits, and improves translation quality closer to professionals. BibTex Are Two Heads are Better than One? Crowdsourced Translation via a Two-Step Collaboration between Translators and Editors @InProceedings{Yan-EtAl-2014:ACL, author = {Rui Yan and Mingkun Gao and Ellie Pavlick and Chris Callison-Burch}, title = {Are Two Heads are Better than One? Crowdsourced Translation via a Two-Step Collaboration between Translators and Editors}, booktitle = {The 52nd Annual Meeting of the Association for Computational Linguistics}, month = {June}, year = {2014}, address = {Baltimore, Maryland}, publisher = {Association for Computional Linguistics}, url = {http://www.cis.upenn.edu/~ccb/publications/crowdsourced-translation-via-collaboration-between-translators-and-editors.pdf} }
PARADIGM: Paraphrase Diagnostics through Grammar Matching. Jonathan Weese, Juri Ganitkevitch, and Chris Callison-Burch. EACL 2014. Abstract PARADIGM: Paraphrase Diagnostics through Grammar Matching Paraphrase evaluation is typically done either manually or through indirect, task-based evaluation. We introduce an intrinsic evaluation PARADIGM which measures the goodness of paraphrase collections that are represented using synchronous grammars. We formulate two measures that evaluate these paraphrase grammars using gold standard sentential paraphrases drawn from a monolingual parallel corpus. The first measure calculates how often a paraphrase grammar is able to synchronously parse the sentence pairs in the corpus. The second measure enumerates paraphrase rules from the monolingual parallel corpus and calculates the overlap between this reference paraphrase collection and the paraphrase resource being evaluated. We demonstrate the use of these evaluation metrics on paraphrase collections derived from three different data types: multiple translations of classic French novels, comparable sentence pairs drawn from different newspapers, and bilingual parallel corpora. We show that PARADIGM correlates with human judgments more strongly than BLEU on a task-based evaluation of paraphrase quality. BibTex PARADIGM: Paraphrase Diagnostics through Grammar Matching @InProceedings{Weese-EtAl:2014:EACL, author = {Jonathan Weese and Juri Ganitkevitch and Chris Callison-Burch}, title = {PARADIGM: Paraphrase Diagnostics through Grammar Matching}, booktitle = {14th Conference of the European Chapter of the Association for Computational Linguistics}, month = {April}, year = {2014}, address = {Gothenburg, Sweden}, publisher = {Association for Computional Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/paradigm-paraphrase-evaluation.pdf}} }
Crowdsourcing for Grammatical Error Correction. Ellie Pavlick, Rui Yan, and Chris Callison-Burch. CSCW Poster 2014. Abstract Crowdsourcing for Grammatical Error Correction We discuss the problem of grammatical error correction, which has gained attention for its usefulness both in the development of tools for learners of foreign languages and as a component of statistical machine translation systems. We believe the task of suggesting grammar and style corrections in writing is well suited to a crowdsourcing solution but is currently hindered by the difficulty of automatic quality control. In this proposal, we motivate the problem of grammatical error correction and outline the challenges of ensuring quality in a setting where traditional methods of aggregation (e.g. majority vote) fail to produce the desired results. We then propose a design for quality control and present preliminary results indicating the potential of crowd workers to provide a scalable solution. BibTex Crowdsourcing for Grammatical Error Correction @InProceedings{Pavlick-EtAl:2014:CSCW, author = {Ellie Pavlick and Rui Yan and Chris Callison-Burch}, title = {Crowdsourcing for Grammatical Error Correction}, booktitle = {17th ACM Conference on Computer Supported Cooperative Work and Social Computing, Companion Volume}, month = {February}, year = {2014}, address = {Baltimore, Maryland}, publisher = {Association for Computing Machinery}, pages = {209--213}, url = {http://cis.upenn.edu/~ccb/publications/crowdsourcing-for-grammatical-error-correction.pdf}} }
The Multilingual Paraphrase Database. Juri Ganitkevitch and Chris Callison-Burch. LREC 2014. Abstract The Multilingual Paraphrase Database We release a massive expansion of the paraphrase database (PPDB) that now includes a collection of paraphrases in 23 different languages. The resource is derived from large volumes of bilingual parallel data. Our collection is extracted and ranked using state of the art methods. The multilingual PPDB has over a billion paraphrase pairs in total, covering the following languages: Arabic, Bulgarian, Chinese, Czech, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portugese, Romanian, Russian, Slovak, Slovenian, and Swedish. BibTex The Multilingual Paraphrase Database @InProceedings{Ganitkevitch-Callison-Burch-2014:LREC, author = {Juri Ganitkevitch and Chris Callison-Burch}, title = {The Multilingual Paraphrase Database}, booktitle = {The 9th edition of the Language Resources and Evaluation Conference}, month = {May}, year = {2014}, address = {Reykjavik, Iceland}, pages = {}, publisher = {European Language Resources Association}, url = {http://cis.upenn.edu/~ccb/publications/ppdb-multilingual.pdf} }
The American Local News Corpus. Ann Irvine, Joshua Langfus, and Chris Callison-Burch. LREC 2014. Abstract The American Local News Corpus We present the American Local News Corpus (ALNC), containing over 4 billion words of text from 2, 652 online newspapers in the United States. Each article in the corpus is associated with a timestamp, state, and city. All 50 U.S. states and 1, 924 cities are represented. We detail our method for taking daily snapshots of thousands of local and national newspapers and present two example corpus analyses. The first explores how different sports are talked about over time and geography. The second compares per capita murder rates with news coverage of murders across the 50 states. The ALNC is about the same size as the Gigaword corpus and is growing continuously. Version 1.0 is available for research use. BibTex The American Local News Corpus @InProceedings{Irvine-EtAl-2014:LREC, author = {Ann Irvine and Joshua Langfus and Chris Callison-Burch}, title = {The {American} Local News Corpus}, booktitle = {The 9th edition of the Language Resources and Evaluation Conference}, month = {May}, year = {2014}, address = {Reykjavik, Iceland}, pages = {}, publisher = {European Language Resources Association}, url = {http://cis.upenn.edu/~ccb/publications/american-local-news-corpus.pdf} }
A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic. Ryan Cotterell and Chris Callison-Burch. LREC 2014. Abstract A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic This paper presents a multi-dialect, multi-genre, human annotated corpus of dialectal Arabic with data obtained from both online newspaper commentary and Twitter. Most Arabic corpora are small and focus on Modern Standard Arabic (MSA). There has been recent interest, however, in the construction of dialectal Arabic corpora. This work differs from previously constructed corpora in two ways. First, we include coverage of five dialects of Arabic: Egyptian, Gulf, Levantine, Maghrebi and Iraqi. This is the most complete coverage of any dialectal corpus known to the authors. In addition to data, we provide results for the Arabic dialect identification task that outperform those reported in Zaidan and Callison-Burch (2011). BibTex A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic @InProceedings{Cotterell-Callison-Burch-2014:LREC, author = {Ryan Cotterell and Chris Callison-Burch}, title = {A Multi-Dialect, Multi-Genre Corpus of Informal Written {Arabic}}, booktitle = {The 9th edition of the Language Resources and Evaluation Conference}, month = {May}, year = {2014}, address = {Reykjavik, Iceland}, pages = {}, publisher = {European Language Resources Association}, url = {http://cis.upenn.edu/~ccb/publications/arabic-dialect-corpus-2.pdf} }
An Algerian Arabic-French Code-Switched Corpus. Ryan Cotterell, Adithya Renduchintala, Naomi Saphra, and Chris Callison-Burch. LREC Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools 2014. Abstract An Algerian Arabic-French Code-Switched Corpus Arabic is not just one language, but rather a collection of dialects in addition to Modern Standard Arabic (MSA). While MSA is used in formal situations, dialects are the language of every day life. Until recently, there was very little dialectal Arabic in written form. With the advent of social-media, however, the landscape has changed. We provide the first romanized code-switched Algerian Arabic-French corpus annotated for word-level language id. We review the history and sociological factors that make the linguistic situation in Algerian unique and highlight the value of this corpus to the natural language processing and linguistics communities. To build this corpus, we crawled an Algerian newspaper and extracted the comments from the news story. We discuss the informal nature of the language in the corpus and the challenges it will present. Additionally, we provide a preliminary analysis of the corpus. We then discuss some potential uses of our corpus of interest to the computational linguistics community. BibTex An Algerian Arabic-French Code-Switched Corpus @InProceedings{Cotterell-EtAl-2014:LREC-WS, author = {Ryan Cotterell and Adithya Renduchintala and Naomi Saphra and Chris Callison-Burch}, title = {An {Algerian Arabic-French} Code-Switched Corpus}, booktitle = {Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools}, month = {May}, year = {2014}, address = {Reykjavik, Iceland}, pages = {}, publisher = {European Language Resources Association}, url = {http://cis.upenn.edu/~ccb/publications/arabic-french-codeswitching.pdf} }
2013
Open letter to President Obama. Chris Callison-Burch. Unpublished 2013. Abstract Open letter to President Obama I wrote an open letter to President Obama about my former PhD student, Omar Zaidan, who had his student visa revoked on the eve of his PhD defense, and who has not been allowed to return to the US in 1.5 years. The letter was read by over 35,000 people in the first week after I published it.
Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus. Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch and Sanjeev Khudanpur. IWSLT 2013. Abstract Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish-English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callhome Spanish-English Speech Translation Corpus, supplementing existing LDC audio and transcripts with (a) ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and (b) English translations obtained on Amazon’s Mechanical Turk. The result is a four-way parallel dataset of Spanish audio, transcriptions, ASR lattices, and English translations of approximately 38 hours of speech, with defined training, development, and held-out test sets. We conduct baseline machine translation experiments using models trained on the provided training data, and validate the dataset by corroborating a number of known results in the field, including the utility of in-domain (information, conversational) training data, increased performance translating lattices (instead of recognizer 1-best output), and the relationship between word error rate and BLEU score. BibTex Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus @InProceedings{post-EtAl:2013:IWSLT, author = {Matt Post and Gaurav Kumar and Adam Lopez and Damianos Karakos and Chris Callison-Burch and Sanjeev Khudanpur}, title = {Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus}, booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)} month = {December}, year = {2013}, address = {Heidelberg, Germany}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/improved-speech-to-speech-translation.pdf} }
Semi-Markov Phrase-based Monolingual Alignment. Xuchen Yao, Ben Van Durme, Chris Callison-Burch and Peter Clark. EMNLP 2013. Abstract Semi-Markov Phrase-based Monolingual Alignment We introduce a novel discriminative model for phrase-based monolingual alignment using a semi-Markov CRF. Our model achieves state-of-the-art alignment accuracy on two phrase=based alignment datasets (RTE and paraphrase), while doing significantly better than other strong baselines in both non-identical alignment and phrase-only alignment. Additional experiments highlight the potential benefit of our alignment model to RTE, paraphrase identification and question answering, where even a naive application of our model’s alignment score approaches the state of the art. BibTex Semi-Markov Phrase-based Monolingual Alignment @InProceedings{yao-EtAl:2013:EMNLP, author = {Xuchen Yao and Benjamin {Van Durme} and Chris Callison-Burch and Peter Clark}, title = {Semi-Markov Phrase-based Monolingual Alignment}, booktitle = {Proceedings of EMNLP} month = {October}, year = {2013}, address = {Seattle, Washington}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/semi-markov-phrase-based-monolingual-alignment.pdf} }
Findings of the 2013 Workshop on Statistical Machine Translation. Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. WMT 2013. Abstract Findings of the 2013 Workshop on Statistical Machine Translation We present the results of the WMT13 shared tasks, which included a translation task, a task for run-time estimation of machine translation quality, and an unofficial metrics task. This year, 143 machine translation systems were submitted to the ten translation tasks from 23 institutions. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually, in our largest manual evaluation to date. The quality estimation task had four subtasks, with a total of 14 teams, submitting 55 entries. BibTex Findings of the 2013 Workshop on Statistical Machine Translation @InProceedings{bojar-EtAl:2013:WMT, author = {Bojar, Ond\v{r}ej and Buck, Christian and Callison-Burch, Chris and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Monz, Christof and Post, Matt and Soricut, Radu and Specia, Lucia}, title = {Findings of the 2013 {Workshop on Statistical Machine Translation}}, booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation}, month = {August}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, pages = {1--44}, url = {http://www.aclweb.org/anthology/W13-2201} }
Joshua 5.0: Sparser, better, faster, server. Matt Post, Juri Ganitkevitch, Luke Orland, Jonathan Weese, Yuan Cao, and Chris Callison-Burch. WMT 2013. Abstract Joshua 5.0: Sparser, better, faster, server We describe improvements made over the past year to Joshua, an open-source translation system for parsing-based machine translation. The main contributions this past year are significant improvements in both speed and usability of the grammar extraction and decoding steps. We have also rewritten the decoder to use a sparse feature representation, enabling training of large numbers of features with discriminative training methods. BibTex Joshua 5.0: Sparser, better, faster, server @InProceedings{post-EtAl:2013:WMT, author = {Post, Matt and Ganitkevitch, Juri and Orland, Luke and Weese, Jonathan and Cao, Yuan and Callison-Burch, Chris}, title = {Joshua 5.0: Sparser, Better, Faster, Server}, booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation}, month = {August}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, pages = {206--212}, url = {http://www.aclweb.org/anthology/W13-2226} }
Combining Bilingual and Comparable Corpora for Low Resource Machine Translation. Ann Irvine and Chris Callison-Burch. WMT 2013. Abstract Combining Bilingual and Comparable Corpora for Low Resource Machine Translation Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and corresponding phrase-based SMT model, we improve coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora. Then, we supplement the model’s feature space with translation scores estimated over comparable corpora in order to improve accuracy. We observe improvements between 0.5 and 1.7 BLEU translating Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu into English. BibTex Combining Bilingual and Comparable Corpora for Low Resource Machine Translation @InProceedings{irvine-callisonburch:2013:WMT, author = {Irvine, Ann and Callison-Burch, Chris}, title = {Combining Bilingual and Comparable Corpora for Low Resource Machine Translation}, booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation}, month = {August}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, pages = {262--270}, url = {http://www.aclweb.org/anthology/W13-2233} }
A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Peter Clark, Ben Van Durme and Chris Callison-Burch. ACL 2013. Short papers. Abstract A Lightweight and High Performance Monolingual Word Aligner Fast alignment is essential for many natural language tasks. But in the setting of monolingual alignment, previous work has not been able to align more than one sentence pair per second. We describe a discriminatively trained monolingual word aligner that uses a Conditional Random Field to globally decode the best alignment with features drawn from source and target sentences. Using just part-of-speech tags and WordNet as external resources, our aligner gives state-of-the-art result, while being an order-of-magnitude faster than the previous best performing system. BibTex A Lightweight and High Performance Monolingual Word Aligner @InProceedings{yao-EtAl:2013:ACL, author = {Xuchen Yao and Peter Clark and Benjamin {Van Durme} and Chris Callison-Burch}, title = {A Lightweight and High Performance Monolingual Word Aligner}, booktitle = {Proceedings of the 2013 Conference of the Association for Computational Linguistics (ACL 2013)}, month = {July}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/monolingual-word-aligner.pdf} }
PARMA: A Predicate Argument Aligner. Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu and Xuchen Yao. ACL 2013. Short papers. Abstract PARMA: A Predicate Argument Aligner We introduce PARMA, a system for crossdocument, semantic predicate and argument alignment. Our system combines a number of linguistic resources familiar to researchers in areas such as recognizing textual entailment and question answering, integrating them into a simple discriminative model. PARMA achieves state of the art results on an existing and a new dataset. We suggest that previous efforts have focussed on data that is biased and too easy, and we provide a more difficult dataset based on translation data which has a low baseline which we beat by 17% F1. BibTex PARMA: A Predicate Argument Aligner @InProceedings{wolfe-EtAl:2013:ACL, author = {Travis Wolfe and Benjamin {Van Durme} and Mark Dredze and Nicholas Andrews and Charley Beller and Chris Callison-Burch and Jay DeYoung and Justin Snyder and Jonathan Weese and Tan Xu and Xuchen Yao}, title = {{PARMA}: A Predicate Argument Aligner}, booktitle = {Proceedings of the 2013 Conference of the Association for Computational Linguistics (ACL 2013)}, month = {July}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/parma.pdf} }
Learning to translate with products of novices: a suite of open-ended challenge problems for teaching MT. Adam Lopez, Matt Post, Chris Callison-Burch, Jonathan Weese, Juri Ganitkevitch, Narges Ahmidi, Olivia Buzek, Leah Hanson, Beenish Jamil, Matthias Lee, Ya-Ting Lin, Henry Pao, Fatima Rivera, Leili Shahriyari, Debu Sinha, Adam Teichert, Stephen Wampler, Michael Weinberger, Daguang Xu, Lin Yang, and Shang Zhao. TACL 2013. Abstract Learning to translate with products of novices: a suite of open-ended challenge problems for teaching MT Machine translation (MT) draws from several different disciplines, making it a complex subject to teach. There are excellent pedagogical texts, but problems in MT and current algorithms for solving them are best learned by doing. As a centerpiece of our MT course, we devised a series of open-ended challenges for students in which the goal was to improve performance on carefully constrained instances of four key MT tasks: alignment, decoding, evaluation, and reranking. Students brought a diverse set of techniques to the problems, including some novel solutions which performed remarkably well. A surprising and exciting outcome was that student solutions or their combinations fared competitively on some tasks, demonstrating that even newcomers to the field can help improve the state-of-the-art on hard NLP problems while simultaneously learning a great deal. The problems, baseline code, and results are freely available. BibTex Learning to translate with products of novices: a suite of open-ended challenge problems for teaching MT @article{Lopez-etal:TACL:2013, author = {Adam Lopez and Matt Post and Chris Callison-Burch and Jonathan Weese and Juri Ganitkevitch and Narges Ahmidi and Olivia Buzek and Leah Hanson and Beenish Jamil and Matthias Lee and Ya-Ting Lin and Henry Pao and Fatima Rivera and Leili Shahriyari and Debu Sinha and Adam Teichert and Stephen Wampler and Michael Weinberger and Daguang Xu and Lin Yang and and Shang Zhao}, title = {Learning to translate with products of novices: a suite of open-ended challenge problems for teaching {MT}}, journal = {Transactions of the Association for Computational Linguistics}, year = {2013}, volume = {1}, number = {May}, pages = {166--177} }
Dirt Cheap Web-Scale Parallel Text from the Common Crawl. Jason Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch and Adam Lopez. ACL 2013. Abstract Dirt Cheap Web-Scale Parallel Text from the Common Crawl Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource. BibTex Dirt Cheap Web-Scale Parallel Text from the Common Crawl @InProceedings{smith-EtAl:2013:ACL, author = {Jason Smith and Herve Saint-Amand and Magdalena Plamada and Philipp Koehn and Chris Callison-Burch and Adam Lopez}, title = {Dirt Cheap Web-Scale Parallel Text from the {Common Crawl}}, booktitle = {Proceedings of the 2013 Conference of the Association for Computational Linguistics (ACL 2013)}, month = {July}, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/bitexts-from-common-crawl.pdf} }
PPDB: The Paraphrase Database. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. NAACL 2013. Short papers. Abstract PPDB: The Paraphrase Database We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff. BibTex PPDB: The Paraphrase Database @InProceedings{ganitkevitch-EtAl:2013:NAACL, author = {Juri Ganitkevitch and Benjamin VanDurme and Chris Callison-Burch}, title = {{PPDB}: The Paraphrase Database}, booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013)}, month = {June}, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/ppdb.pdf} }
Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals. Ann Irvine and Chris Callison-Burch. NAACL 2013. Short papers. Abstract Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals Prior research into learning translations from source and target language monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of signals derived from a pair of monolingual corpora into a single discriminative model. Even in a low resource machine translation setting, where induced translations have the potential to improve performance substantially, it is reasonable to assume access to some amount of data to perform this kind of optimization. Our work shows that only a few hundred translation pairs are needed to achieve strong performance on the bilingual lexicon induction task, and our approach yields an average relative gain in accuracy of nearly 50% over an unsupervised baseline. Large gains in accuracy hold for all 22 languages (low and high resource) that we investigate. BibTex Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals @InProceedings{irvine-callisonburch:2013:NAACL, author = {Ann Irvine and Chris Callison-Burch}, title = {Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals}, booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013)}, month = {June}, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/supervised-bilingual-lexicon-induction.pdf} }
Answer Extraction as Sequence Tagging with Tree Edit Distance. Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch and Peter Clark. NAACL 2013. Abstract Answer Extraction as Sequence Tagging with Tree Edit Distance Our goal is to extract answers from pre-retrieved sentences for Question Answering (QA). We construct a linear-chain Conditional Random Field based on pairs of questions and their possible answer sentences, learning the association between questions and answer types. This casts answer extraction as an answer sequence tagging problem for the first time, where knowledge of shared structure between question and source sentence is incorporated through features based on Tree Edit Distance (TED). Our model is free of manually created question and answer templates, fast to run (processing 200 QA pairs per second excluding parsing time), and yields an F1 of 63.3% on a new public dataset based on prior TREC QA evaluations. The developed system is open-source, and includes an implementation of the TED model that is state of the art in the task of ranking QA pairs. BibTex Answer Extraction as Sequence Tagging with Tree Edit Distance @InProceedings{yao-EtAl:2013:NAACL, author = {Xuchen Yao and Benjamin {Van Durme} and Chris Callison-Burch and Peter Clark}, title = {Answer Extraction as Sequence Tagging with Tree Edit Distance}, booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013)}, month = {June}, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/answer-extraction-as-sequence-tagging.pdf} }
2012
Findings of the 2012 Workshop on Statistical Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. WMT 2012. Abstract Findings of the 2012 Workshop on Statistical Machine Translation This paper presents the results of the WMT12 shared tasks, which included a translation task, a task for machine translation evaluation metrics, and a task for run-time estimation of machine translation quality. We conducted a large-scale manual evaluation of 103 machine translation systems submitted by 34 teams. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 12 evaluation metrics. We introduced a new quality estimation task this year, and evaluated submissions from 11 teams. BibTex Findings of the 2012 Workshop on Statistical Machine Translation @InProceedings{callisonburch-EtAl:2012:WMT, author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Post, Matt and Soricut, Radu and Specia, Lucia}, title = {Findings of the 2012 Workshop on Statistical Machine Translation}, booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {10--51}, url = {http://cis.upenn.edu/~ccb/publications/findings-of-the-wmt12-shared-tasks.pdf} }
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing. Matt Post, Chris Callison-Burch, and Miles Osborne. WMT 2012. Abstract Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community. Data BibTex Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing @InProceedings{post-callisonburch-osborne:2012:WMT, author = {Post, Matt and Callison-Burch, Chris and Osborne, Miles}, title = {Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing}, booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {401--409}, url = {http://www.aclweb.org/anthology/W12-3152} }
Using Categorial Grammar to Label Translation Rules. Jonathan Weese, Chris Callison-Burch, and Adam Lopez. WMT 2012. Abstract Using Categorial Grammar to Label Translation Rules Adding syntactic labels to synchronous context-free translation rules can improve performance, but labeling with phrase structure constituents, as in GHKM (Galley et al., 2004), excludes potentially useful translation rules. SAMT (Zollmann and Venugopal, 2006) introduces heuristics to create new non-constituent labels, but these heuristics introduce many complex labels and tend to add rarely-applicable rules to the translation grammar. We introduce a labeling scheme based on categorial grammar, which allows syntactic labeling of many rules with a minimal, well-motivated label set. We show that our labeling scheme performs comparably to SAMT on an Urdu–English translation task, yet the label set is an order of magnitude smaller, and translation is twice as fast. BibTex Using Categorial Grammar to Label Translation Rules @InProceedings{weese-callisonburch-lopez:2012:WMT, author = {Weese, Jonathan and Callison-Burch, Chris and Lopez, Adam}, title = {Using Categorial Grammar to Label Translation Rules}, booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {222--231}, url = {http://cis.upenn.edu/~ccb/publications/using-categorial-grammar-to-label-translation-rules.pdf} }
Joshua 4.0: Packing, PRO, and Paraphrases. Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post, and Chris Callison-Burch. WMT 2012. Abstract Joshua 4.0: Packing, PRO, and Paraphrases We present Joshua 4.0, the newest version of our open-source decoder for parsing-based statistical machine translation. The main contributions in this release are the introduction of a compact grammar representation based on packed tries, and the integration of our implementation of pairwise ranking optimization, J-PRO. We further present the extension of the Thrax SCFG grammar extractor to pivot-based extraction of syntactically informed sentential paraphrases. BibTex Joshua 4.0: Packing, PRO, and Paraphrases @InProceedings{ganitkevitch-EtAl:2012:WMT, author = {Ganitkevitch, Juri and Cao, Yuan and Weese, Jonathan and Post, Matt and Callison-Burch, Chris}, title = {Joshua 4.0: Packing, PRO, and Paraphrases}, booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {283--291}, url = {http://cis.upenn.edu/~ccb/publications/joshua-4.0.pdf} }
Expectations of Word Sense in Parallel Corpora. Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch. NAACL 2012. Short papers. Abstract Expectations of Word Sense in Parallel Corpora Given a parallel corpus, if two distinct words in language A, a and a2, are aligned to the same word b in language B, then this might signal that b is polysemous, or it might signal a and a2 are synonyms. Both assumptions with successful work have been put forward in the literature. We investigate these assumptions, along with other questions of word sense, by looking at sampled parallel sentences containing tokens of the same type in English, asking how often they mean the same thing when they are: 1. aligned to the same foreign type; and 2. aligned to different foreign types. Results for French-English and Chinese-English parallel corpora show similar behavior: Synonymy is only very weakly the more prevalent scenario, where both cases regularly occur. BibTex Expectations of Word Sense in Parallel Corpora @InProceedings{yao-vandurme-callisonburch:2012:NAACL-HLT, author = {Yao, Xuchen, {Van Durme}, Benjamin and Callison-Burch, Chris}, title = {Expectations of Word Sense in Parallel Corpora}, booktitle = {The 2012 Conference of the North American Chapter of the Association for Computational Linguistics}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {621--625}, url = {http://www.aclweb.org/anthology/N12-1078} }
Processing Informal, Romanized Pakistani Text Messages. Ann Irvine, Jonathan Weese, and Chris Callison-Burch. the NAACL Workshop on Language in Social Media 2012. Abstract Processing Informal, Romanized Pakistani Text Messages Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%. BibTex Processing Informal, Romanized Pakistani Text Messages @InProceedings{irvine-weese-callisonburch:2012:LSM, author = {Irvine, Ann and Weese, Jonathan and Callison-Burch, Chris}, title = {Processing Informal, Romanized Pakistani Text Messages}, booktitle = {Proceedings of the Second Workshop on Language in Social Media}, month = {June}, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, pages = {75--78}, url = {http://www.aclweb.org/anthology/W12-2109} }
Monolingual Distributional Similarity for Text-to-Text Generation. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. STARSEM 2012. Abstract Monolingual Distributional Similarity for Text-to-Text Generation Previous work on paraphrase extraction and application has relied on either parallel datasets, or on distributional similarity metrics over large text corpora. Our approach combines these two orthogonal sources of information and directly integrates them into our paraphrasing system’s log-linear model. We compare different distributional similarity feature-sets and show significant improvements in grammaticality and meaning retention on the example text-to-text generation task of sentence compression, achieving state-of-the-art quality. BibTex Monolingual Distributional Similarity for Text-to-Text Generation @InProceedings{Ganitkevitch-etal:2012:StarSEM, author = {Juri Ganitkevitch and Benjamin {Van Durme} and Chris Callison-Burch}, title = {Monolingual Distributional Similarity for Text-to-Text Generation}, booktitle = {*SEM First Joint Conference on Lexical and Computational Semantics}, month = {June}, year = {2012}, address = {Montreal}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/monolingual-distributional-similarity-for-text-to-text-generation.pdf} }
Machine Translation of Arabic Dialects. Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F. Zaidan and Chris Callison-Burch. NAACL 2012. Abstract Machine Translation of Arabic Dialects Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialect sentences are selected from a large corpus of Arabic web text, and translated using Mechanical Turk. We use this data to build Dialect Arabic MT systems. Small amounts of dialect data have a dramatic impact on the quality of translation. When translating Egyptian and Levantine test sets, our Dialect Arabic MT system performs 5.8 and 6.8 BLEU points higher than a Modern Standard Arabic MT system trained on a 150 million word Arabic-English parallel corpus -- over 100 times the amount of data as our dialect corpora. BibTex Machine Translation of Arabic Dialects @InProceedings{Zbib-etal:2012:NAACL, author = {Rabih Zbib and Erika Malchiodi and Jacob Devlin and David Stallard and Spyros Matsoukas and Richard Schwartz and John Makhoul and Omar F. Zaidan and Chris Callison-Burch}, title = {Machine Translation of Arabic Dialects}, booktitle = {The 2012 Conference of the North American Chapter of the Association for Computational Linguistics}, month = {June}, year = {2012}, address = {Montreal}, publisher = {Association for Computational Linguistics}, url = {http://cis.upenn.edu/~ccb/publications/machine-translation-of-arabic-dialects.pdf} }
Toward Statistical Machine Translation without Parallel Corpora. Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. EACL 2012. Abstract Toward Statistical Machine Translation without Parallel Corpora We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate re-ordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed, and show that 82%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features. BibTex Toward Statistical Machine Translation without Parallel Corpora @InProceedings{klementiev-etal:2012:EACL, author = {Alex Klementiev and Ann Irvine and Chris Callison-Burch and David Yarowsky}, title = {Toward Statistical Machine Translation without Parallel Corpora}, booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for computational Linguistics}, month = {April}, year = {2012}, address = {Avignon, France} publisher = {Association for Computational Linguistics}, }
Use of Modality and Negation in Semantically-Informed Syntactic MT. Kathryn Baker, Bonnie Dorr, Michael Bloodgood, Chris Callison-Burch, Wes Filardo, Christine Piatko, Lori Levin, and Scott Miller. Computational Linguistics 2012. Abstract Use of Modality and Negation in Semantically-Informed Syntactic MT This article describes the resource- and system-building efforts of an eight-week JHU Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation) and a holder (an experiencer of modality). We describe how our MN lexicon was produced semi-automatically and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. While the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described in this paper. The resulting system significantly outperformed a linguistically naïve baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality. BibTex Use of Modality and Negation in Semantically-Informed Syntactic MT @article{baker-etal:2012:CL, author = {Kathryn Baker and Bonnie Dorr and Michael Bloodgood and Chris Callison-Burch and Nathaniel Filardo and Christine Piatko and Lori Levin and Scott Miller}, title = {Use of Modality and Negation in Semantically-Informed Syntactic MT}, journal = {Computational Linguistics}, year = {2012}, volume = {38}, number = {2}, pages = {411-438} }
2011
Findings of the 2011 Workshop on Statistical Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. WMT 2011. Abstract Findings of the 2011 Workshop on Statistical Machine Translation This paper presents the results of the WMT11 shared tasks, which included a translation task, a system combination task, and a task for machine translation evaluation metrics. We conducted a large-scale manual evaluation of 148 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 21 evaluation metrics. This year featured a Haitian Creole to English task translating SMS messages sent to an emergency response service in the aftermath of the Haitian earthquake. We also conducted a pilot ‘tunable metrics’ task to test whether optimizing a fixed system to different metrics would result in perceptibly different translation quality. BibTex Findings of the 2011 Workshop on Statistical Machine Translation @InProceedings{callisonburch-EtAl:2011:WMT, author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar}, title = {Findings of the 2011 Workshop on Statistical Machine Translation}, booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation}, month = {July}, year = {2011}, address = {Edinburgh, Scotland}, publisher = {Association for Computational Linguistics}, pages = {22--64}, url = {http://www.aclweb.org/anthology/W11-2103} }
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation. Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. EMNLP 2011. Abstract Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation Previous work has shown that high quality phrasalparaphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems. BibTex Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation @InProceedings{ganitkevitch-EtAl:2011:EMNLP, author = {Ganitkevitch, Juri and Callison-Burch, Chris and Napoles, Courtney and {Van Durme}, Benjamin}, title = {Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation}, booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing}, month = {July}, year = {2011}, address = {Edinburgh, Scotland, UK.}, publisher = {Association for Computational Linguistics}, pages = {1168--1179}, url = {http://www.aclweb.org/anthology/D11-1108} }
Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity. Charley Chan, Chris Callison-Burch, and Benjamin Van Durme. GEMS 2011. Abstract Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity This paper improves an existing bilingual paraphrase extraction technique using monolingual distributional similarity to rerank candidate paraphrases. Raw monolingual data provides a complementary and orthogonal source of information that lessens the commonly observed errors in bilingual pivot-based methods. Our experiments reveal that monolingual scoring of bilingually extracted paraphrases has a significantly stronger correlation with human judgment for grammaticality than the probabilities assigned by the bilingual pivoting method does. The results also show that monolingual distribution similarity can serve as a threshold for high precision paraphrase selection. BibTex Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity @InProceedings{chan-callisonburch-vandurme:2011:GEMS, author = {Chan, Tsz Ping and Callison-Burch, Chris and {Van Durme}, Benjamin}, title = {Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity}, booktitle = {Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics}, month = {July}, year = {2011}, address = {Edinburgh, UK}, publisher = {Association for Computational Linguistics}, pages = {33--42}, url = {http://www.aclweb.org/anthology/W11-2504} }
Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor. Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez. WMT 2011. Abstract Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor We present progress on Joshua, an open source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats. BibTex Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor @InProceedings{weese-EtAl:2011:WMT, author = {Weese, Jonathan and Ganitkevitch, Juri and Callison-Burch, Chris and Post, Matt and Lopez, Adam}, title = {Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor}, booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation}, month = {July}, year = {2011}, address = {Edinburgh, Scotland}, publisher = {Association for Computational Linguistics}, pages = {478--484}, url = {http://www.aclweb.org/anthology/W11-2160} }
WikiTopics: What is Popular on Wikipedia and Why. Byung Gyu Ahn, Ben Van Durme and Chris Callison-Burch. ACL Workshop on Automatic Summarization for Different Genres, Media, and Languages 2011. Abstract WikiTopics: What is Popular on Wikipedia and Why We establish a novel task in the spirit of news summarization and topic detection and tracking (TDT): daily determination of the topics newly popular with Wikipedia readers. Central to this effort is a new public dataset consisting of the hourly page view statistics of all Wikipedia articles over the last three years. We give baseline results for the tasks of: discovering individual pages of interest, clustering these pages into coherent topics, and extracting the most relevant summarizing sentence for the reader. When compared to human judgements, our system shows the viability of this task, and opens the door to a range of exciting future work. BibTex WikiTopics: What is Popular on Wikipedia and Why @InProceedings{ahn-vandurme-callisonburch:2011:SummarizationWorkshop, author = {Ahn, Byung Gyu and {Van Durme}, Benjamin and Callison-Burch, Chris}, title = {WikiTopics: What is Popular on Wikipedia and Why}, booktitle = {Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages}, month = {June}, year = {2011}, address = {Portland, Oregon}, publisher = {Association for Computational Linguistics}, pages = {33--40}, url = {http://www.aclweb.org/anthology/W11-0505} }
Evaluating sentence compression: Pitfalls and suggested remedies. Courtney Napoles, Ben Van Durme. Workshop on Monolingual Text-To-Text Generation 2011. Abstract Evaluating sentence compression: Pitfalls and suggested remedies This work surveys existing evaluation methodologies for the task of sentence compression, identifies their shortcomings, and proposes alternatives. In particular, we examine the problems of evaluating paraphrastic compression and comparing the output of different models. We demonstrate that compression rate is a strong predictor of compression quality and that perceived improvement over other models is often a side effect of producing longer output. BibTex Evaluating sentence compression: Pitfalls and suggested remedies @InProceedings{napoles-vandurme-callisonburch:2011:T2TW-2011, author = {Napoles, Courtney and {Van Durme}, Benjamin and Callison-Burch, Chris}, title = {Evaluating Sentence Compression: Pitfalls and Suggested Remedies}, booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation}, month = {June}, year = {2011}, address = {Portland, Oregon}, publisher = {Association for Computational Linguistics}, pages = {91--97}, url = {http://www.aclweb.org/anthology/W11-1611} }
Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion. Courtney Napoles, Chris Callison-Burch, Juri Ganitevitch, Ben Van Durme. Workshop on Monolingual Text-To-Text Generation 2011. Abstract Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates1 paraphrastic compressions outperform a state-of-the-art deletion model in an oracle experiment. For further compression, deleting from oracle paraphrastic compressions preserves more meaning than deletion alone. In either setting, paraphrastic compression shows promise for surpassing deletion-only methods. BibTex Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion @InProceedings{napoles-EtAl:2011:T2TW-2011, author = {Napoles, Courtney and Callison-Burch, Chris and Ganitkevitch, Juri and {Van Durme}, Benjamin}, title = {Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion}, booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation}, month = {June}, year = {2011}, address = {Portland, Oregon}, publisher = {Association for Computational Linguistics}, pages = {84--90}, url = {http://www.aclweb.org/anthology/W11-1610} }
Paraphrase Fragment Extraction from Monolingual Comparable Corpora. Rui Wang and Chris Callison-Burch. BUCC 2011. Abstract Paraphrase Fragment Extraction from Monolingual Comparable Corpora We present a novel paraphrase fragment pair extraction method that uses a monolingual comparable corpus containing different articles about the same topics or events. The procedure consists of document pair extraction, sentence pair extraction, and fragment pair extraction. At each stage, we evaluate the intermediate results manually, and tune the later stages accordingly. With this minimally supervised approach, we achieve 62% of accuracy on the paraphrase fragment pairs we collected and 67% extracted from the MSR corpus. The results look promising, given the minimal supervision of the approach, which can be further scaled up. BibTex Paraphrase Fragment Extraction from Monolingual Comparable Corpora @InProceedings{wang-callisonburch:2011:BUCC, author = {Wang, Rui and Callison-Burch, Chris}, title = {Paraphrase Fragment Extraction from Monolingual Comparable Corpora}, booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web}, month = {June}, year = {2011}, address = {Portland, Oregon}, publisher = {Association for Computational Linguistics}, pages = {52--60}, url = {http://www.aclweb.org/anthology/W11-1208} }
The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content. Omar Zaidan and Chris Callison-Burch. ACL 2011. Short papers. Abstract The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation. BibTex The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content @InProceedings{zaidan-callisonburch:2011:ACL-HLT2011, author = {Zaidan, Omar F. and Callison-Burch, Chris}, title = {The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {37--41}, url = {http://www.aclweb.org/anthology/P11-2007} }
Crowdsourcing Translation: Professional Quality from Non-Professionals. Omar Zaidan and Chris Callison-Burch. ACL 2011. Abstract Crowdsourcing Translation: Professional Quality from Non-Professionals Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-toEnglish evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation. BibTex Crowdsourcing Translation: Professional Quality from Non-Professionals @InProceedings{zaidan-callisonburch:2011:ACL-HLT2011, author = {Zaidan, Omar F. and Callison-Burch, Chris}, title = {Crowdsourcing Translation: Professional Quality from Non-Professionals}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {1220--1229}, url = {http://www.aclweb.org/anthology/P11-1122} }
Incremental Syntactic Language Models for Phrase-based Translation. Lane Schwartz, Chris Callison-Burch, William Schuler and Stephen Wu. ACL 2011. Abstract Incremental Syntactic Language Models for Phrase-based Translation This paper describes a novel technique for incorporating syntactic knowledge into phrasebased machine translation through incremental syntactic parsing. Bottom-up and topdown parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. We give a formal definition of one such lineartime syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity. BibTex Incremental Syntactic Language Models for Phrase-based Translation @InProceedings{schwartz-EtAl:2011:ACL-HLT20111, author = {Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen}, title = {Incremental Syntactic Language Models for Phrase-based Translation}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {620--631}, url = {http://www.aclweb.org/anthology/P11-1063} }
2010
Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. Omar Zaidan and Chris Callison-Burch. NAACL 2010. Short papers. Abstract Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators In the field of machine translation, automatic metrics have proven quite valuable in system development for tracking progress and measuring the impact of incremental changes. However, human judgment still plays a large role in the context of evaluating MT systems. For example, the GALE project uses human-targeted translation edit rate (HTER), wherein the MT output is scored against a post-edited version of itself (as opposed to being scored against an existing human reference). This poses a problem for MT researchers, since HTER is not an easy metric to calculate, and would require hiring and training human annotators to perform the editing task. In this work, we explore soliciting those edits from untrained human annotators, via the online service Amazon Mechanical Turk. We show that the collected data allows us to predict HTER-ranking of documents at a significantly higher level than the ranking obtained using automatic metrics. BibTex Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators @InProceedings{zaidan-callisonburch:2010:NAACLHLT, author = {Zaidan, Omar F. and Callison-Burch, Chris}, title = {Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators}, booktitle = {Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics}, month = {June}, year = {2010}, address = {Los Angeles, California}, publisher = {Association for Computational Linguistics}, pages = {369--372}, url = {http://www.aclweb.org/anthology/N10-1057} }
Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach. Kathryn Baker, Michael Bloodgood, Chris Callison-Burch, Bonnie Dorr, Scott Miller, Christine Piatko, Nathaniel W. Filardo, and Lori Levin. AMTA 2010. Abstract Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English translation task. This finding supports the hypothesis (posed by many researchers in the MT community, e.g., in DARPA GALE) that both syntactic and semantic information are critical for improving translation quality—and further demonstrates that large gains can be achieved for low-resource languages with different word order than English. BibTex Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach @InProceedings{Baker-EtAl:2010:AMTA, author = {Kathryn Baker and Michael Bloodgood and Chris Callison-Burch and Bonnie J. Dorr and Nathaniel W. Filardo and Lori Levin and Scott Miller and Christine Piatko}, title = {Semantically-Informed Machine Translation: A Tree-Grafting Approach}, booktitle = {Proceedings of The Ninth Biennial Conference of the Association for Machine Translation in the Americas}, address = {Denver, Colorado}, url = {http://www.mt-archive.info/AMTA-2010-Baker.pdf}, year = {2010} }
Transliterating From All Languages. Ann Irvine, Alex Klementiev, and Chris Callison-Burch. AMTA 2010. Abstract Transliterating From All Languages Much of the previous work on transliteration has depended on resources and attributes specific to particular language pairs. In this work, rather than focus on a single language pair, we create robust models for transliterating from all languages in a large, diverse set to English. We create training data for 150 languages by mining name pairs from Wikipedia. We train 13 systems and analyze the effects of the amount of training data on transliteration performance. We also present an analysis of the types of errors that the systems make. Our analyses are particularly valuable for building machine translation systems for low resource languages, where creating and integrating a transliteration module for a language with few NLP resources may provide substantial gains in translation performance. BibTex Transliterating From All Languages @InProceedings{Irvine-EtAl:2010:AMTA, author = {Ann Irvine and Chris Callison-Burch and Alexandre Klementiev} title = {Transliterating From All Languages}, booktitle = {Proceedings of The Ninth Biennial Conference of the Association for Machine Translation in the Americas}, address = {Denver, Colorado}, url = {http://cis.upenn.edu/~ccb/publications/transliterating-from-all-languages.pdf}, year = {2010} }
Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies. Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Ann Irvine, Lane Schwartz, Wren N. G. Thornton, Ziyuan Wang, Jonathan Weese and Omar F. Zaidan. WMT 2010. Abstract Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies We describe the progress we have made in the past year on Joshua (Li et al., 2009a), an open source toolkit for parsing-based machine translation. The new functionality includes: support for translation grammars with a rich set of syntactic nonterminals, the ability for external modules to posit constraints on how spans in the input sentence should be translated, lattice parsing for dealing with input uncertainty, a semiring framework that provides a unified way of doing various dynamic programming calculations, variational decoding for approximating the intractable MAP decoding, hypergraph-based discriminative training for better feature engineering, a parallelized MERT module, document-level and tail-based MERT, visualization of the derivation trees, and a cleaner pipeline for MT experiments. BibTex Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies @InProceedings{li-EtAl:2010:WMT, author = {Li, Zhifei and Callison-Burch, Chris and Dyer, Chris and Ganitkevitch, Juri and Irvine, Ann and Khudanpur, Sanjeev and Schwartz, Lane and Thornton, Wren and Wang, Ziyuan and Weese, Jonathan and Zaidan, Omar}, title = {Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies}, booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR}, month = {July}, year = {2010}, address = {Uppsala, Sweden}, publisher = {Association for Computational Linguistics}, pages = {133--137}, url = {http://www.aclweb.org/anthology/W10-1718} }
Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, Omar Zaidan. WMT 2010. Abstract Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation This paper presents the results of the WMT10 and MetricsMATR10 shared tasks,1 which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 104 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 26 metrics. This year we also investigated increasing the number of human judgments by hiring non-expert annotators through Amazon’s Mechanical Turk. BibTex Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation @InProceedings{callisonburch-EtAl:2010:WMT, author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Peterson, Kay and Przybocki, Mark and Zaidan, Omar}, title = {Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation}, booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR}, month = {July}, year = {2010}, address = {Uppsala, Sweden}, publisher = {Association for Computational Linguistics}, pages = {17--53}, url = {http://www.aclweb.org/anthology/W10-1703} }
Large-Scale, Cost-Focused Active Learning for Statistical Machine Translation. Michael Bloodgood and Chris Callison-Burch. ACL 2010. Abstract Large-Scale, Cost-Focused Active Learning for Statistical Machine Translation We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we get an order of magnitude increase in performance rates of improvement. BibTex Large-Scale, Cost-Focused Active Learning for Statistical Machine Translation @InProceedings{bloodgood-callisonburch:2010:ACL, author = {Bloodgood, Michael and Callison-Burch, Chris}, title = {Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation}, booktitle = {Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics}, month = {July}, year = {2010}, address = {Uppsala, Sweden}, publisher = {Association for Computational Linguistics}, pages = {854--864}, url = {http://www.aclweb.org/anthology/P10-1088} }
Creating Speech and Language Data With Amazon’s Mechanical Turk. Chris Callison-Burch and Mark Dredze. NAACL Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk 2010. Abstract Creating Speech and Language Data With Amazon’s Mechanical Turk In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL2010 Workshop. 24 researchers participated in the workshop's $100 challenge to create data for speech and language applications. BibTex Creating Speech and Language Data With Amazon’s Mechanical Turk @InProceedings{callisonburch-dredze:2010:MTURK, author = {Callison-Burch, Chris and Dredze, Mark}, title = {Creating Speech and Language Data With {Amazon's Mechanical Turk}}, booktitle = {Proceedings of the {NAACL HLT} 2010 Workshop on Creating Speech and Language Data with {Amazon's Mechanical Turk}}, month = {June}, year = {2010}, address = {Los Angeles}, publisher = {Association for Computational Linguistics}, pages = {1--12}, url = {http://www.aclweb.org/anthology/W10-0701} }
Using Mechanical Turk to Build Machine Translation Evaluation Sets. Michael Bloodgood and Chris Callison-Burch. NAACL Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk 2010. Abstract Using Mechanical Turk to Build Machine Translation Evaluation Sets Building machine translation (MT) test sets is a relatively expensive task. As MT becomes increasingly desired for more and more language pairs and more and more domains, it becomes necessary to build test sets for each case. In this paper, we investigate using Amazon's Mechanical Turk (MTurk) to make MT test sets cheaply. We find that MTurk can be used to make test sets much cheaper than professionally-produced test sets. More importantly, in experiments with multiple MT systems, we find that the MTurk-produced test sets yield essentially the same conclusions regarding system performance as the professionally-produced test sets yield. BibTex Using Mechanical Turk to Build Machine Translation Evaluation Sets @InProceedings{bloodgood-callisonburch:2010:MTURK, author = {Bloodgood, Michael and Callison-Burch, Chris}, title = {Using {Mechanical Turk} to Build Machine Translation Evaluation Sets}, booktitle = {Proceedings of the {NAACL HLT} 2010 Workshop on Creating Speech and Language Data with {Amazon's Mechanical Turk}}, month = {June}, year = {2010}, address = {Los Angeles}, publisher = {Association for Computational Linguistics}, pages = {208--211}, url = {http://www.aclweb.org/anthology/W10-0733} }
Crowdsourced Accessibility: Elicitation of Wikipedia Articles. Scott Novotoney and Chris Callison-Burch. NAACL Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk 2010. Abstract Crowdsourced Accessibility: Elicitation of Wikipedia Articles Mechanical Turk is useful for generating complex speech resources like conversational speech transcription. In this work, we explore the next step of eliciting narrations of Wikipedia articles to improve accessibility for low-literacy users. This task proves a useful test-bed to implement qualitative vetting of workers based on difficult to define metrics like narrative quality. Working with the Mechanical Turk API, we collected sample narrations, had other Turkers rate these samples and then granted access to full narration HITs depending on aggregate quality. While narrating full articles proved too onerous a task to be viable, using other Turkers to perform vetting was very successful. Elicitation is possible on Mechanical Turk, but it should conform to suggested best practices of simple tasks that can be completed in a streamlined workflow. BibTex Crowdsourced Accessibility: Elicitation of Wikipedia Articles @InProceedings{novotney-callisonburch:2010:MTURK, author = {Novotney, Scott and Callison-Burch, Chris}, title = {Crowdsourced Accessibility: Elicitation of Wikipedia Articles}, booktitle = {Proceedings of the {NAACL HLT} 2010 Workshop on Creating Speech and Language Data with {Amazon's Mechanical Turk}}, month = {June}, year = {2010}, address = {Los Angeles}, publisher = {Association for Computational Linguistics}, pages = {41--44}, url = {http://www.aclweb.org/anthology/W10-0706} }
Cheap Facts and Counter-Facts. Rui Wang and Chris Callison-Burch. NAACL Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk 2010. Abstract Cheap Facts and Counter-Facts This paper describes our experiments of using Amazon's Mechanical Turk to generate (counter-)facts from texts for certain named entities. We give the human annotators a paragraph of text and a highlighted named entity. They will write down several (counter-)facts about this named entity in that context. The analysis of the results is performed by comparing the acquired data with the recognizing textual entailment (RTE) challenge dataset. BibTex Cheap Facts and Counter-Facts @InProceedings{wang-callisonburch:2010:MTURK, author = {Wang, Rui and Callison-Burch, Chris}, title = {Cheap Facts and Counter-Facts}, booktitle = {Proceedings of the {NAACL HLT} 2010 Workshop on Creating Speech and Language Data with {Amazon's Mechanical Turk}}, month = {June}, year = {2010}, address = {Los Angeles}, publisher = {Association for Computational Linguistics}, pages = {163--167}, url = {http://www.aclweb.org/anthology/W10-0725} }
Stream-based Translation Models for Statistical Machine Translation. Abby Levenberg, Chris Callison-Burch, and Miles Osborne. NAACL 2010. Abstract Stream-based Translation Models for Statistical Machine Translation Typical statistical machine translation systems are trained with static parallel corpora. Here we account for scenarios with a continuous incoming stream of parallel training data. Such scenarios include daily governmental proceedings, sustained output from translation agencies, or crowd-sourced translations. We show incorporating recent sentence pairs from the stream improves performance compared with a static baseline. Since frequent batch retraining is computationally demanding we introduce a fast incremental alternative using an online version of the EM algorithm. To bound our memory requirements we use a novel data-structure and associated training regime. When compared to frequent batch retraining, our online time and space-bounded model achieves the same performance with significantly less computational overhead. BibTex Stream-based Translation Models for Statistical Machine Translation @InProceedings{levenberg-callisonburch-osborne:2010:NAACLHLT, author = {Levenberg, Abby and Callison-Burch, Chris and Osborne, Miles}, title = {Stream-based Translation Models for Statistical Machine Translation}, booktitle = {Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics}, month = {June}, year = {2010}, address = {Los Angeles, California}, publisher = {Association for Computational Linguistics}, pages = {394--402}, url = {http://www.aclweb.org/anthology/N10-1062} }
Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. Scott Novotney and Chris Callison-Burch. NAACL 2010. Abstract Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription Deploying an automatic speech recognition system with reasonable performance requires expensive and time-consuming in-domain transcription. Previous work demonstrated that non-professional annotation through Amazon’s Mechanical Turk can match professional quality. We use Mechanical Turk to transcribe conversational speech for as little as one thirtieth the cost of professional transcription. The higher disagreement of non-professional transcribers does not have a significant effect on system performance. While previous work demonstrated that redundant transcription can improve data quality, we found that resources are better spent collecting more data. Finally, we describe a quality control method without needing professional transcription. BibTex Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription @InProceedings{novotney-callisonburch:2010:NAACLHLT, author = {Novotney, Scott and Callison-Burch, Chris}, title = {Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription}, booktitle = {Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics}, month = {June}, year = {2010}, address = {Los Angeles, California}, publisher = {Association for Computational Linguistics}, pages = {207--215}, url = {http://www.aclweb.org/anthology/N10-1024} }
Integrating Output from Specialized Modules in Machine Translation: Transliteration in Joshua. Ann Irvine, Mike Kayser, Zhifei Li, Wren Thornton, and Chris Callison-Burch. PBML 2010. Abstract Integrating Output from Specialized Modules in Machine Translation: Transliteration in Joshua In many cases in SMT we want to allow specialized modules to propose translation fragments to the decoder and allow them to compete with translations contained in the phrase table. Transliteration is one module that may produce such specialized output. In this paper, as an example, we build a specialized Urdu transliteration module and integrate its output into an Urdu–English MT system. The module marks-up the test text using an XML format, and the decoder allows alternate translations (transliterations) to compete. BibTex Integrating Output from Specialized Modules in Machine Translation: Transliteration in Joshua @article{Irvine-EtAl:2010:PBML, author = {Ann Irvine and Mike Kayser and Zhifei Li and Wren Thornton and Chris Callison-Burch }, title = {Integrating Output from Specialized Modules in Machine Translation: Transliteration in {J}oshua}, journal = {The Prague Bulletin of Mathematical Linguistics}, volume = {93}, pages = {107--116}, year = {2010} }
Visualizing Data Structures in Parsing-Based Machine Translation. Jonathan Weese and Chris Callison-Burch. PBML 2010. Abstract Visualizing Data Structures in Parsing-Based Machine Translation As machine translation (MT) systems grow more complex and incorporate more linguistic knowledge, it becomes more difficult to evaluate independent pieces of the MT pipeline. Being able to inspect many of the intermediate data structures used during MT decoding allows a more fine-grained evaluation of MT performance, helping to determine which parts of the current process are effective and which are not. In this article, we present an overview of the visualization tools that are currently distributed with the Joshua (Li et al., 2009) MT decoder. We explain their use and present an example of how visually inspecting the decoder’s data structures has led to useful improvements in the MT model. BibTex Visualizing Data Structures in Parsing-Based Machine Translation @article{Weese-CallisonBurch:2010:PBML, author = {Jonathan Weese and Chris Callison-Burch}, title = {Visualizing Data Structures in Parsing-based Machine Translation}, journal = {The Prague Bulletin of Mathematical Linguistics}, volume = {93}, pages = {127--136}, year = {2010} }
Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Trees. Lane Schwartz and Chris Callison-Burch. PBML 2010. Abstract Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Trees While example-based machine translation has long used corpus information at run-time, statistical phrase-based approaches typically include a preprocessing stage where an aligned parallel corpus is split into phrases, and parameter values are calculated for each phrase using simple relative frequency estimates. This paper describes an open source implementation of the crucial algorithms presented in (Lopez, 2008) which allow direct run-time calculation of SCFG translation rules in Joshua. BibTex Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Trees @article{Schwartz-CallisonBurch:2010:PBML, author = {Lane Schwartz and Chris Callison-Burch }, title = {Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Tree}, journal = {The Prague Bulletin of Mathematical Linguistics}, volume = {93}, pages = {157--166}, year = {2010} }
2009
Semantically Informed Machine Translation (SIMT). Kathy Baker, Steven Bethard, Michael Bloodgood, Ralf Brown, Chris Callison-Burch, Glen Coppersmith, Bonnie Dorr, Wes Filardo, Kendall Giles, Anni Irvine, Mike Kayser, Lori Levin, Justin Martineau, Jim Mayﬁeld, Scott Miller, Aaron Phillips, Andrew Philpot, Christine Piatko, Lane Schwartz and David Zajic. SCAL Summer Workshop Final Report. HLTCOE 2009. Abstract Semantically Informed Machine Translation (SIMT) This report describes the findings of the machine translation team from the first Summer Camp for Applied Language Exploration (SCALE) hosted at the Human Language Technology Center of Excellence located at Johns Hopkins University. This intensive, eight week workshop brought together 20 students, faculty and researchers to conduct research on the topic of Semantically Informed Machine Translation (SIMT). The type of semantics that were examined at the SIMT workshop were "High Information Value Elements," or HIVEs, which include named entities (such as people or organizations) and modalities (indications that a statement represents something that has taken place or is a belief or an intention). These HIVEs were examined in the context of machine translation between Urdu and English. The goal of the workshop was to identify and translate HIVEs from the foreign language, and to investigate whether incorporating this sort of structured semantic information into machine translation (MT) systems could produce better translations. BibTex Semantically Informed Machine Translation (SIMT) @techreport{Baker-EtAl:2010:HLTCOE, author = {Kathy Baker and Steven Bethard and Michael Bloodgood and Ralf Brown and Chris Callison-Burch and Glen Coppersmith and Bonnie Dorr and Wes Filardo and Kendall Giles and Anni Irvine and Mike Kayser and Lori Levin and Justin Martineau and Jim Mayﬁeld and Scott Miller and Aaron Phillips and Andrew Philpot and Christine Piatko and Lane Schwartz and David Zajic}, title = {Semantically Informed Machine Translation}, address = {Human Language Technology Center of Excellence}, institution = {Johns Hopkins University, Baltimore, MD}, number = {002}, url = {http://web.jhu.edu/bin/u/l/HLTCOE-TechReport-002-SIMT.pdf}, year = {2010} }
Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. Nominated for the ACL 2019 Test of Time Award. Chris Callison-Burch. EMNLP 2009. Abstract Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk Manual evaluation of translation quality is generally thought to be excessively time consuming and expensive. We explore a fast and inexpensive way of doing it using Amazon’s Mechanical Turk to pay small sums to a large number of non-expert annotators. For $10 we redundantly recreate judgments from a WMT08 translation task. We find that when combined non-expert judgments have a high-level of agreement with the existing gold-standard judgments of machine translation quality, and correlate more strongly with expert judgments than Bleu does. We go on to show that Mechanical Turk can be used to calculate human-mediated translation edit rate (HTER), to conduct reading comprehension experiments with machine translation, and to create high quality reference translations. BibTex Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk @InProceedings{callisonburch:2009:EMNLP, author = {Callison-Burch, Chris}, title = {Fast, Cheap, and Creative: Evaluating Translation Quality Using {Amazon's} {Mechanical Turk}}, booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing}, month = {August}, year = {2009}, address = {Singapore}, publisher = {Association for Computational Linguistics}, pages = {286--295}, url = {http://www.aclweb.org/anthology/D/D09/D09-1030} }
Feasibility of Human-in-the-loop Minimum Error Rate Training. Omar Zaidan and Chris Callison-Burch. EMNLP 2009. Abstract Feasibility of Human-in-the-loop Minimum Error Rate Training Minimum error rate training (MERT) involves choosing parameter values for a machine translation (MT) system that maximize performance on a tuning set as measured by an automatic evaluation metric, such as BLEU. The method is best when the system will eventually be evaluated using the same metric, but in reality, most MT evaluations have a human-based component. Although performing MERT with a human-based metric seems like a daunting task, we describe a new metric, RYPT, which takes human judgments into account, but only requires human input to build a database that can be reused over and over again, hence eliminating the need for human input at tuning time. In this investigative study, we analyze the diversity (or lack thereof) of the candidates produced during MERT, we describe how this redundancy can be used to our advantage, and show that RYPT is a better predictor of translation quality than BLEU. BibTex Feasibility of Human-in-the-loop Minimum Error Rate Training @InProceedings{zaidan-callisonburch:2009:EMNLP, author = {Zaidan, Omar F. and Callison-Burch, Chris}, title = {Feasibility of Human-in-the-loop Minimum Error Rate Training}, booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing}, month = {August}, year = {2009}, address = {Singapore}, publisher = {Association for Computational Linguistics}, pages = {52--61}, url = {http://www.aclweb.org/anthology/D/D09/D09-1006} }
Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases. Yuval Marton, Chris Callison-Burch and Philip Resnik. EMNLP 2009. Abstract Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We address this problem by deriving paraphrases monolingually, using distributional semantic similarity measures, thus providing access to larger training resources, such as comparable and unrelated monolingual corpora. We present what is to our knowledge the first successful integration of a collocational approach to untranslated words with an end-to-end, state of the art SMT system demonstrating significant translation improvements in a low-resource setting. BibTex Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases @InProceedings{marton-callisonburch-resnik:2009:EMNLP, author = {Marton, Yuval and Callison-Burch, Chris and Resnik, Philip}, title = {Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases}, booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing}, month = {August}, year = {2009}, address = {Singapore}, publisher = {Association for Computational Linguistics}, pages = {381--390}, url = {http://www.aclweb.org/anthology/D/D09/D09-1040} }
Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences. Nikesh Garera, Chris Callison-Burch and David Yarowsky. CoNLL 2009. Abstract Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences This paper presents novel improvements to the induction of translation lexicons from monolingual corpora using multilingual dependency parses. We introduce a dependency-based context model that incorporates long-range dependencies, variable context sizes, and reordering. It provides a 16% relative improvement over the baseline approach that uses a fixed context window of adjacent words. Its Top 10 accuracy for noun translation is higher than that of a statistical translation model trained on a Spanish-English parallel corpus containing 100,000 sentence pairs. We generalize the evaluation to other word-types, and show that the performance can be increased to 18% relative by preserving part-of-speech equivalencies during translation. BibTex Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences @InProceedings{garera-callisonburch-yarowsky:2009:CoNLL, author = {Garera, Nikesh and Callison-Burch, Chris and Yarowsky, David}, title = {Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences}, booktitle = {Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)}, month = {June}, year = {2009}, address = {Boulder, Colorado}, publisher = {Association for Computational Linguistics}, pages = {129--137}, url = {http://www.aclweb.org/anthology/W09-1117} }
Findings of the 2009 Workshop on Statistical Machine Translation. Chris Callison-Burch, Philipp Koehn, Christof Monz and Josh Schroeder. WMT 2009. Abstract Findings of the 2009 Workshop on Statistical Machine Translation This paper presents the results of the WMT09 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 87 machine translation systems and 22 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality, for more than 20 metrics. We present a new evaluation technique whereby system output is edited and judged for correctness. BibTex Findings of the 2009 Workshop on Statistical Machine Translation @InProceedings{callisonburch-EtAl:2009:WMT, author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Schroeder, Josh}, title = {Findings of the 2009 {W}orkshop on {S}tatistical {M}achine {T}ranslation}, booktitle = {Proceedings of the Fourth Workshop on Statistical Machine Translation}, month = {March}, year = {2009}, address = {Athens, Greece}, publisher = {Association for Computational Linguistics}, pages = {1--28}, url = {http://www.aclweb.org/anthology/W09-0401} }
Joshua: An Open Source Toolkit for Parsing-based Machine Translation. Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese and Omar Zaidan. WMT 2009. Abstract Joshua: An Open Source Toolkit for Parsing-based Machine Translation We describe Joshua, an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for synchronous context free grammars (SCFGs): chart-parsing, ngram language model integration, beamand cube-pruning, and k-best extraction. The toolkit also implements suffix-array grammar extraction and minimum error rate training. It uses parallel and distributed computing techniques for scalability. We demonstrate that the toolkit achieves state of the art translation performance on the WMT09 French-English translation task. BibTex Joshua: An Open Source Toolkit for Parsing-based Machine Translation @InProceedings{li-EtAl:2009:WMT1, author = {Li, Zhifei and Callison-Burch, Chris and Dyer, Chris and Khudanpur, Sanjeev and Schwartz, Lane and Thornton, Wren and Weese, Jonathan and Zaidan, Omar}, title = {{Joshua}: An Open Source Toolkit for Parsing-Based Machine Translation}, booktitle = {Proceedings of the Fourth Workshop on Statistical Machine Translation}, month = {March}, year = {2009}, address = {Athens, Greece}, publisher = {Association for Computational Linguistics}, pages = {135--139}, url = {http://www.aclweb.org/anthology/W09-0424} }
Decoding in Joshua: Open Source, Parsing-Based Machine Translation. Zhifei Li, Chris Callison-Burch, Sanjeev Khudanpur, and Wren Thornton. PBML 2009. Abstract Decoding in Joshua: Open Source, Parsing-Based Machine Translation We describe a scalable decoder for parsing-based machine translation. Thee decoder is written in Java and implements all the essential algorithms described in (Chiang, 2007) and (Li and Khudanpur, 2008b): chart-parsing, n-gram language model integration, beamand cube-pruning, and k-best extraction. Additionally, parallel and distributed computing techniques are exploited to make it scalable. We demonstrate experimentally that our decoder is more than 30 times faster than a baseline decoder written in Python. BibTex Decoding in Joshua: Open Source, Parsing-Based Machine Translation @article{Li-EtAl:2010:PBML, author = {Lane Schwartz and Chris Callison-Burch }, title = {Hierarchical Phrase-Based Grammar Extraction in Joshua: Suffix Arrays and Prefix Tree}, journal = {The Prague Bulletin of Mathematical Linguistics}, volume = {91}, pages = {47--56}, year = {2009} }
2008
Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. Chris Callison-Burch. EMNLP 2008. Abstract Syntactic Constraints on Paraphrases Extracted from Parallel Corpora We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method. BibTex Syntactic Constraints on Paraphrases Extracted from Parallel Corpora @InProceedings{callisonburch:2008:EMNLP, author = {Callison-Burch, Chris}, title = {Syntactic Constraints on Paraphrases Extracted from Parallel Corpora}, booktitle = {Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing}, month = {October}, year = {2008}, address = {Honolulu, Hawaii}, publisher = {Association for Computational Linguistics}, pages = {196--205}, url = {http://www.aclweb.org/anthology/D08-1021} }
ParaMetric: An Automatic Evaluation Metric for Paraphrasing. Chris Callison-Burch, Trevor Cohn, Mirella Lapata. CoLing 2008. Abstract ParaMetric: An Automatic Evaluation Metric for Paraphrasing We present ParaMetric, an automatic evaluation metric for data-driven approaches to paraphrasing. ParaMetric provides an objective measure of quality using a collection of multiple translations whose paraphrases have been manually annotated. ParaMetric calculates precision and recall scores by comparing the paraphrases discovered by automatic paraphrasing techniques against gold standard alignments of words and phrases within equivalent sentences. We report scores for several established paraphrasing techniques. BibTex ParaMetric: An Automatic Evaluation Metric for Paraphrasing @InProceedings{callisonburch-cohn-lapata:2008:Coling, author = {Callison-Burch, Chris and Cohn, Trevor and Lapata, Mirella}, title = {ParaMetric: An Automatic Evaluation Metric for Paraphrasing}, booktitle = {Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)}, month = {August}, year = {2008}, address = {Manchester, UK}, publisher = {Coling 2008 Organizing Committee}, pages = {97--104}, url = {http://www.aclweb.org/anthology/C08-1013} }
Further Meta-Evaluation of Machine Translation. Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. WMT 2008. Abstract Further Meta-Evaluation of Machine Translation This paper analyzes the translation quality of machine translation systems for 10 language pairs translating between Czech, English, French, German, Hungarian, and Spanish. We report the translation quality of over 30 diverse translation systems based on a large-scale manual evaluation involving hundreds of hours of effort. We use the human judgments of the systems to analyze automatic evaluation metrics for translation quality, and we report the strength of the correlation with human judgments at both the system-level and at the sentence-level. We validate our manual evaluation methodology by measuring intra- and inter-annotator agreement, and collecting timing information.Note: This paper was corrected subsequent to publication. BibTex Further Meta-Evaluation of Machine Translation @InProceedings{callisonburch-EtAl:2008:WMT, author = {Callison-Burch, Chris and Fordyce, Cameron and Koehn, Philipp and Monz, Christof and Schroeder, Josh}, title = {Further Meta-Evaluation of Machine Translation}, booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation}, month = {June}, year = {2008}, address = {Columbus, Ohio}, publisher = {Association for Computational Linguistics}, pages = {70--106}, url = {http://www.aclweb.org/anthology/W/W08/W08-0309} }
Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Trevor Cohn, Chris Callison-Burch, Mirella Lapata. Computational Linguistics 2008. Abstract Constructing Corpora for the Development and Evaluation of Paraphrase Systems Automatic paraphrasing is an important component in many natural language processing tasks. In this paper we present a new parallel corpus with paraphrase annotations. We adopt a definition of paraphrase based on word-alignments and show that it yields high inter-annotator agreement. As Kappa is suited to nominal data, we employ an alternative agreement statistic which is appropriate for structured alignment tasks. We discuss how the corpus can be usefully employed in evaluating paraphrase systems automatically (e.g., by measuring precision, recall and F1) and also in developing linguistically rich paraphrase models based on syntactic structure BibTex Constructing Corpora for the Development and Evaluation of Paraphrase Systems @article{cohn-callisonburch-lapata:2008:CL, author = {Trevor Cohn and Chris Callison-Burch and Mirella Lapata}, title = {Constructing Corpora for the Development and Evaluation of Paraphrase Systems}, journal = {Computational Linguistics}, year = {2008}, volume = {34}, number = {4}, pages = {597--614} }
Affinity Measures based on the Graph Laplacian. Delip Rao, David Yarowsky, Chris Callison-Burch. of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing at CoLing 2008. Abstract Affinity Measures based on the Graph Laplacian Several language processing tasks can be inherently represented by a weighted graph where the weights are interpreted as a measure of relatedness between two vertices. Measuring similarity between arbitrary pairs of vertices is essential in solving several language processing problems on these datasets. Random walk based measures perform better than other path based measures like shortest-path. We evaluate several random walk measures and propose a new measure based on commute time. We use the psuedo inverse of the Laplacian to derive estimates for commute times in graphs. Further, we show that this pseudo inverse based measure could be improved by discarding the least significant eigenvectors, corresponding to the noise in the graph construction process, using singular value decomposition. BibTex Affinity Measures based on the Graph Laplacian @InProceedings{rao-yarowsky-callisonburch:2008:TG3, author = {Rao, Delip and Yarowsky, David and Callison-Burch, Chris}, title = {Affinity Measures Based on the Graph {L}aplacian}, booktitle = {Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing}, month = {August}, year = {2008}, address = {Manchester, UK}, publisher = {Coling 2008 Organizing Committee}, pages = {41--48}, url = {http://www.aclweb.org/anthology/W08-2006} }
2007
Paraphrasing and Translation. Chris Callison-Burch. PhD Thesis, University of Edinburgh 2007. Abstract Paraphrasing and Translation Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words in a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.Whereas previous data-driven approaches to paraphrasing were dependent upon either data sources which were uncommon such as multiple translation of the same source text, or language specific resources such as parsers, our approach is able to harness more widely parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases, and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%.Being a language independent and probabilistic approach allows our method to be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches. BibTex Paraphrasing and Translation @PhdThesis{callisonburch:2007:thesis, author = {Chris Callison-Burch}, title = {Paraphrasing and Translation}, school = {University of Edinburgh}, address = {Edinburgh, Scotland}, year = {2007}, url = {http://cis.upenn.edu/~ccb/publications/callison-burch-thesis.pdf} }
(Meta-) Evaluation of Machine Translation. Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. WMT 2007. Abstract (Meta-) Evaluation of Machine Translation This paper evaluates the translation quality of machine translation systems for 8 language pairs: translating French, German, Spanish, and Czech to English and back. We carried out an extensive human evaluation which allowed us not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process. We measured timing and intra- and inter-annotator agreement for three types of subjective evaluation. We measured the correlation of automatic evaluation metrics with human judgments. This meta-evaluation reveals surprising facts about the most commonly used methodologies. BibTex (Meta-) Evaluation of Machine Translation @InProceedings{callisonburch-EtAl:2007:WMT, author = {Callison-Burch, Chris and Fordyce, Cameron and Koehn, Philipp and Monz, Christof and Schroeder, Josh}, title = {(Meta-) Evaluation of Machine Translation}, booktitle = {Proceedings of the Second Workshop on Statistical Machine Translation}, month = {June}, year = {2007}, address = {Prague, Czech Republic}, publisher = {Association for Computational Linguistics}, pages = {136--158}, url = {http://www.aclweb.org/anthology/W/W07/W07-0718} }
Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding. Philipp Koehn, Nicola Bertoldi, Ondrej Bojar, Chris Callison-Burch, Alexandra Constantin, Brooke Cowan, Chris Dyer, Marcello Federico, Evan Herbst, Hieu Hoang, Christine Moran, Wade Shen, and Richard Zens. CLSP Summer Workshop Final Report WS, Johns Hopkins University 2007. Abstract Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding The 2006 Language Engineering Workshop Open Source Toolkit for Statistical Machine Translationhad the objective to advance the current state-of-the-art in statistical machine translation through richer input and richer annotation of the training data. The workshop focused on three topics: factored translation models, confusion network decoding, and the development of an open source toolkit that incorporates this advancements. This report describes the scientific goals, the novel methods, and experimental results of the workshop. It also documents details of the implementation of the open source toolkit. BibTex Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding @techreport{Koehn-EtAl:2007:CLSP, author = { Philipp Koehn and Nicola Bertoldi and Ondrej Bojar and Chris Callison-Burch and Alexandra Constantin and Brooke Cowan and Chris Dyer and Marcello Federico and Evan Herbst and Hieu Hoang and Christine Moran and Wade Shen and Richard Zens}, title = {Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Confusion Network Decoding. }, institution = {Johns Hopkins University}, number = {WS-2006}, type = {CLSP Summer Workshop Final Report}, year = {2007} }
Moses: Open source toolkit for statistical machine translation. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. ACL 2007. Abstract Moses: Open source toolkit for statistical machine translation We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks. BibTex Moses: Open source toolkit for statistical machine translation @InProceedings{koehn-EtAl:2007:PosterDemo, author = {Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and Dyer, Chris and Bojar, Ondrej and Constantin, Alexandra and Herbst, Evan}, title = {Moses: Open Source Toolkit for Statistical Machine Translation}, booktitle = {Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions}, month = {June}, year = {2007}, address = {Prague, Czech Republic}, publisher = {Association for Computational Linguistics}, pages = {177--180}, url = {http://www.aclweb.org/anthology/P07-2045} }
Paraphrase Substitution for Recognizing Textual Entailment. Wauter Bosma and Chris Callison-Burch. Evaluation of Multilingual and Multimodalformation Retrieval, Lecture Notes in Computer Science, C Peters et al editors 2007. Abstract Paraphrase Substitution for Recognizing Textual Entailment We describe a method for recognizing textual entailment that uses the length of the longest common subsequence (LCS) between two texts as its decision criterion. Rather than requiring strict word matching in the common subsequences, we perform a flexible match using automatically generated paraphrases. We find that the use of paraphrases over strict word matches represents an average F-measure improvement from 0.22 to 0.36 on the CLEF 2006 Answer Validation Exercise for 7 languages. BibTex Paraphrase Substitution for Recognizing Textual Entailment @InProceedings{bosma-callisonburch:2006:CLEF, author = {Wauter Bosma and Chris Callison-Burch}, title = {Paraphrase Substitution for Recognizing Textual Entailment}, booktitle = {Proceedings of CLEF}, year = {2006} url = {http://cis.upenn.edu/~ccb/publications/paraphrase-substitution-for-recognizing-textual-entailment.pdf} }
2006
Improved Statistical Machine Translation Using Paraphrases. Chris Callison-Burch, Philipp Koehn and Miles Osborne. NAACL 2006. Abstract Improved Statistical Machine Translation Using Paraphrases Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a stateof-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches. BibTex Improved Statistical Machine Translation Using Paraphrases @InProceedings{callisonburch-koehn-osborne:2006:HLT-NAACL06-Main, author = {Callison-Burch, Chris and Koehn, Philipp and Osborne, Miles}, title = {Improved Statistical Machine Translation Using Paraphrases}, booktitle = {Proceedings of the Human Language Technology Conference of the NAACL, Main Conference}, month = {June}, year = {2006}, address = {New York City, USA}, publisher = {Association for Computational Linguistics}, pages = {17--24}, url = {http://www.aclweb.org/anthology/N/N06/N06-1003} }
Re-evaluating the Role of Bleu in Machine Translation Research. Chris Callison-Burch, Miles Osborne and Philipp Koehn. EACL 2006. Abstract Re-evaluating the Role of Bleu in Machine Translation Research We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu’s correlation with human judgments of quality. This offers new potential for research which was previously deemed unpromising by an inability to improve upon Bleu scores. BibTex Re-evaluating the Role of Bleu in Machine Translation Research @InProceedings{callisonburch-koehn-osborne:2006:HLT-NAACL06-Main, author = {Callison-Burch, Chris and Osborne, Miles and Koehn, Philipp}, title = {Re-evaluating the Role of BLEU in Machine Translation Research}, booktitle = {11th Conference of the European Chapter of the Association for Computational Linguistics}, month = {April}, year = {2006}, address = {Trento, Italy}, publisher = {Association for Computational Linguistics}, pages = {249--256}, url = {http://aclweb.org/anthology-new/E/E06/E06-1032} }
Constraining the Phrase-Based, Joint Probability Statistical Translation Model. Alexandra Birch, Chris Callison-Burch and Miles Osborne. WMT 2006. Abstract Constraining the Phrase-Based, Joint Probability Statistical Translation Model The Joint Probability Model proposed by Marcu and Wong (2002) provides a probabilistic framework for modeling phrase-based statistical machine translation (SMT). The model’s usefulness is, however, limited by the computational complexity of estimating parameters at the phrase level. We present a method of constraining the search space of the Joint Probability Model based on statistically and linguistically motivated word alignments. This method reduces the complexity and size of the Joint Model and allows it to display performance superior to the standard phrase-based models for small amounts of training material. BibTex Constraining the Phrase-Based, Joint Probability Statistical Translation Model @InProceedings{birch-EtAl:2006:WMT, author = {Birch, Alexandra and Callison-Burch, Chris and Osborne, Miles and Koehn, Philipp}, title = {Constraining the Phrase-Based, Joint Probability Statistical Translation Model}, booktitle = {Proceedings on the Workshop on Statistical Machine Translation}, month = {June}, year = {2006}, address = {New York City}, publisher = {Association for Computational Linguistics}, pages = {154--157}, url = {http://www.aclweb.org/anthology/W/W06/W06-3123} }
2005
Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases. Chris Callison-Burch, Colin Bannard and Josh Schroeder. ACL 2005. Abstract Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations. We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure. We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality. BibTex Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases @InProceedings{callisonburch-bannard-schroeder:2005:ACL, author = {Callison-Burch, Chris and Bannard, Colin and Schroeder, Josh}, title = {Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases}, booktitle = {Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)}, month = {June}, year = {2005}, address = {Ann Arbor, Michigan}, publisher = {Association for Computational Linguistics}, pages = {255--262}, url = {http://www.aclweb.org/anthology/P05-1032}, }
Paraphrasing with Bilingual Parallel Corpora. Colin Bannard and Chris Callison-Burch. ACL 2005. Abstract Paraphrasing with Bilingual Parallel Corpora Previous work has used monolingual parallel corpora to extract and generate paraphrases. We show that this task can be done using bilingual parallel corpora, a much more commonly available resource. Using alignment techniques from phrase-based statistical machine translation, we show how paraphrases in one language can be identified using a phrase in another language as a pivot. We define a paraphrase probability that allows paraphrases extracted from a bilingual parallel corpus to be ranked using translation probabilities, and show how it can be refined to take contextual information into account. We evaluate our paraphrase extraction and ranking methods using a set of manual word alignments, and contrast the quality with paraphrases extracted from automatic alignments. BibTex Paraphrasing with Bilingual Parallel Corpora @InProceedings{bannard-callisonburch:2005:ACL, author = {Bannard, Colin and Callison-Burch, Chris}, title = {Paraphrasing with Bilingual Parallel Corpora}, booktitle = {Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)}, month = {June}, year = {2005}, address = {Ann Arbor, Michigan}, publisher = {Association for Computational Linguistics}, pages = {597--604}, url = {http://www.aclweb.org/anthology/P05-1074}, }
A Compact Data Structure for Searchable Translation Memories. Chris Callison-Burch, Colin Bannard and Josh Schroeder. EAMT 2005. Abstract A Compact Data Structure for Searchable Translation Memories In this paper we describe searchable translation memories, which allow translators to search their archives for possible translations of phrases. We describe how statistical machine translation can be used to align sub-sentential units in a translation memory, and rank them by their probability. We detail a data structure that allows for memory-efficient storage of the index. We evaluate the accuracy of translations retrieved from a searchable translation memory built from 50,000 sentence pairs, and find a precision of 86.6% for the top ranked translations. BibTex A Compact Data Structure for Searchable Translation Memories @InProceedings{callison-burch-EtAl:2005:EAMT, author = {Chris Callison-Burch and Colin Bannard and Josh Schroeder}, title = {A Compact Data Structure for Searchable Translation Memories}, booktitle = {European Association for Machine Translation}, year = {2005} }
Linear B System Description for the 2005 NIST MT Evaluation Exercise. Chris Callison-Burch. Machine Translation Evaluation Workshop 2005. Abstract Linear B System Description for the 2005 NIST MT Evaluation Exercise This document describes Linear B’s entry for the 2005 NIST MT Evaluation exercise. Linear B examined the efficacy of human-aided statistical machine translation by looking at the improvements that could be had by involving non-Arabic speakers in the translation process. We examined two conditions: one in which non-Arabic speakers edited the output of a statistical machine translation system, and one in which they were allowed to select phrasal translations from a chart of possible translations for an Arabic sentence, and then edit the text. BibTex Linear B System Description for the 2005 NIST MT Evaluation Exercise @InProceedings{callisonburch:2005:NIST, author = {Chris Callison-Burch }, title = {A Compact Data Structure for Searchable Translation Memories}, booktitle = {Proceedings of Machine Translation Evaluation Workshop}, year = {2005} }
Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. IWSLT 2005. Abstract Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation Our participation in the IWSLT 2005 speech translation task is our first effort to work on limited domain speech data. We adapted our statistical machine translation system that performed successfully in previous DARPA competitions on open domain text translations. We participated in the supplied corpora transcription track. We achieved the highest BLEU score in 2 out of 5 language pairs and had competitive results for the other language pairs. BibTex Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation @InProceedings{Koehn-EtAl:2005:IWSLT, author = {Philipp Koehn and Amittai Axelrod and Alexandra Birch and Chris Callison-Burch and Miles Osborne and David Talbot and Michael White}, title = {Edinburgh System Description for the 2005 {IWSLT} Speech Translation Evaluation}, booktitle = {Proceedings of International Workshop on Spoken Language Translation}, year = {2005}, url = {http://cis.upenn.edu/~ccb/publications/iwslt05-report.pdf} }
2004
Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora. Chris Callison-Burch, David Talbot and Miles Osborne. ACL 2004. Abstract Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. On the Verbmobil data set, we attain a 38% reduction in the alignment error rate and a higher Bleu score with half as many training examples. We discuss how varying the ratio of word-aligned to sentence-aligned data affects the expected performance gain. BibTex Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora @inproceedings{callisonburch-talbot-osborne:2004:ACL, author = {Callison-Burch, Chris and Talbot, David and Osborne, Miles}, title = {Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora}, booktitle = {Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume}, year = {2004}, month = {July}, address = {Barcelona, Spain}, pages = {175--182}, url = {http://www.aclweb.org/anthology/P04-1023}, }
Searchable Translation Memories. Chris Callison-Burch, Colin Bannard and Josh Schroeder. ASLIB Translating and the Computer 2004. Abstract Searchable Translation Memories In this paper we introduce a technique for creating searchable translation memories. Linear B’s searchable translation memories allow a translator to type in a phrase and retrieve a ranked list of possible translations for that phrase, which is ordered based on the likelihood of the translations. The searchable translation memories use translation models similar to those used in statistical machine translation. In this paper we first describe the technical details of how the TMs are indexed and how translations are assigned probabilities, and then evaluate a searchable TM using precision and recall metrics. BibTex Searchable Translation Memories @inproceedings{Callison-Burch:2004:ASLIB, author = {Chris Callison-Burch and Colin Bannard and Josh Schroeder}, title = {Searchable Translation Memories}, booktitle = {Proceedings of ASLIB Translating and the Computer 26}, year = {2004} }
Improved Statistical Translation Through Editing. Chris Callison-Burch, Colin Bannard and Josh Schroeder. EAMT 2004. Abstract Improved Statistical Translation Through Editing In this paper we introduce Linear B’s statistical machine translation system. We describe how Linear B’s phrase-based translation models are learned from a parallel corpus, and show how the quality of the translations produced by our system can be improved over time through editing. There are two levels at which our translations can be edited. The first is through a simple correction of the text that is produced by our system. The second is through a mechanism which allows an advanced user to examine the sentences that a particular translation was learned from. The learning process can be improved by correcting which phrases in the sentence should be considered translations of each other. BibTex Improved Statistical Translation Through Editing @inproceedings{Callison-Burch-EtAl:2004:EAMT, author = {Chris Callison-Burch and Colin Bannard and Josh Schroeder}, title = {Improved Statistical Translation Through Editing}, booktitle = {European Association for Machine Translation}, year = {2004} }
2003
Statistical Natural Language Processing. Chris Callison-Burch and Miles Osborne. A Handbook for Language Engineers, Ali Farghaly, Editor 2003. Abstract Statistical Natural Language Processing Statistical natural language processing (SNLP) is a field lying in the intersection of natural language processing and machine learning. SNLP differs from traditional natural language processing in that instead of having a linguist manually construct some model of a given linguistic phenomenon, that model is instead (semi-) automatically constructed from linguistically annotated resources. Methods for assigning partof-speech tags to words, categories to texts, parse trees to sentences, and so on, are (semi-) automatically acquired using machine learning techniques.The recent trend of applying statistical techniques to natural language processing came largely from industrial speech recognition research groups at AT&T's Bell Laboratories and IBM's T.J. Watson Research Center. Statistical techniques in speech recognition have so vastly outstripped the performance of their non-statistical counterparts that rule-based speech recognition systems are essentially no longer an area of research. The success of machine learning techniques in speech processing led to an interest in applying them to a broader range of NLP applications. In addition to being useful from the perspective of producing high-quality results, as in speech recognition, SNLP systems are useful for a number of practical reasons. They are cheap and fast to produce, and they handle the wide variety of input required by a real-world application. SNLP is therefore especially useful in industry. In particular:SNLP affords rapid prototyping. Whereas fully hand-crafted systems are extremely time consuming to build, statistical systems that are automatically trained using corpora can be produced more quickly. This allows many different approaches to be tried and evaluated in a short time-frame. As an example, Cucerzan and Yarowsky described how one might create a new part-of-speech tagger in a single day (Cucerzan and Yarowsky, 2002). An even more ambitious example is Al-Onaizan et al.'s "machine translation in a day" experiment wherein they used statistical techniques to develop a complete Chinese-English machine translation system in a 24-hour period (AlOnaizan et al., 1999).Statistical systems are "robust" (Junqua and van Noord, 2001). Although this has a wide variety of meanings, in SNLP it generally means that a system will always produce some output no matter how badly formed the input is, and no matter how novel it is. For example, a text classification system may be able to classify a text even if all of the words in that text are previously unseen. Handling all kinds of input is necessary in real-world applications; a system which fails to produce output when it is unable to analyze a sentence will not be useful.Statistical systems are often cheaper to produce than hand-crafted rule-based systems. Because the process of creating a statistical system is more automated than the process of creating a rule-based system, the actual number of participants needed to create a system will often be less. Furthermore, because they are learned from data, statistical systems require less knowledge of the particular language being analyzed. This becomes a budgetary issue on a multi-language project because of the expense of hiring language consultants or staff with specialized skills.A common theme with many early SNLP systems was a pride in minimizing the amount of linguistic knowledge used in the system. For example, Fred Jelinek, the then leader of IBM's speech recognition research group, purportedly said, "Every time I fire a linguist, my performance goes up." The sentiment is rather shocking. Should Jelinek's statement strike fear into the hearts of all linguists reading this chapter? Is there a strong opposition between theoretical linguistics and SNLP? Will SNLP put linguists out of work?We put forth a positive answer in this chapter: there is a useful role for linguistic expertise in statistical systems. Jelinek's infamous quote represents biases of the early days of SNLP. While a decade's worth of research has shown that SNLP can be an extremely powerful tool and is able to produce impressive results, recent trends indicate that using naive approaches that are divorced from linguistics can only go so far. There is therefore a revival of interest in integrating more sophisticated linguistic information into statistical models. For example, language models for speech recognition are moving from being word-based "ngram" models towards incorporating statistical grammars (Chelba and Jelinek, 1998, Charniak, 2001). So there is indeed a role for the linguist. This chapter will provide an entry point for linguists entering into the field of SNLP so that they may apply their expertise to enhance an already powerful approach to natural language processing.Lest we represent SNLP as a completely engineering-oriented discipline, we point the interested reader to Abney (1996) which describes a number of ways in which SNLP might inform academic topics in linguistics. For example, SNLP can be useful for psycholinguistic research since systems typically encode graduated notions of well-formedness. This offers a more psychologically plausible alternative to the traditional binary grammatical/ungrammatical distinction. In a similarly academic vein, Johnson (1998) shows how Optimality Theory can be interpreted in terms of statistical models. This in turn suggests a number of interesting directions that OT might take.The rest of this chapter is as follows: We begin by presenting a simple worked example designed to illustrate some of the aspects of SNLP in Section 1.2. After motivating the usefulness of SNLP, we then move onto the core methods used in SNLP: modeling, learning, data and evaluation (Sections 1.3, 1.4, 1.5, and 1.6 respectively). These core methods are followed by a brief review of some of the many applications of SNLP (Section 1.7). We conclude with a discussion (Section 1.8) where we make some comments about the current state of SNLP and possible future directions it might take. BibTex Statistical Natural Language Processing @incollection{Callison-Burch2003b, author = {Chris Callison-Burch and Miles Osborne}, title = {Statistical Natural Language Processing}, booktitle = {A Handbook for Language Engineers}, editor = {Ali Farghaly}, publisher = {CSLI}, year = {2003} }
Bootstrapping Parallel Corpora. Chris Callison-Burch and Miles Osborne. NAACL workshop Building and Using Parallel Texts 2003. Abstract Bootstrapping Parallel Corpora We present two methods for the automatic creation of parallel corpora. Whereas previous work into the automatic construction of parallel corpora has focused on harvesting them from the web, we examine the use of existing parallel corpora to bootstrap data for new language pairs. First, we extend existing parallel corpora using co-training, wherein machine translations are selectively added to training corpora with multiple source texts. Retraining translation models yields modest improvements. Second, we simulate the creation of training data for a language pair for which a parallel corpus is not available. Starting with no human translations from German to English we produce a German to English translation model with 45% accuracy using parallel corpora in other languages. This suggests the method may be useful in the creation of parallel corpora for languages with scarce resources. BibTex Bootstrapping Parallel Corpora @inproceedings{CallisonBurch-Osborne:2003:PARALLEL, author = {Callison-Burch, Chris and Osborne, Miles}, title = {Bootstrapping Parallel Corpora}, booktitle = {Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond}, editor = {Rada Mihalcea and Ted Pedersen}, url = {http://www.aclweb.org/anthology/W03-0310}, year = 2003, pages = {44--49} }
Co-training for Statistical Machine Translation. Chris Callison-Burch and Miles Osborne. the 6th Annual CLUK Research Colloquium 2003. Abstract Co-training for Statistical Machine Translation We present a novel co-training method for statistical machine translation. Since cotraining requires independent views on the data, with each view being sufficient for the labeling task, we use source strings in multiple languages as views on translation. Co-training for statistical machine translation is therefore a type of multi-source translation. We show that using five language pairs our approach can yield improvements of up to 2.5% in word error rates for translation models. Our experiments suggest that co-training is even more effective for languages with highly impoverished parallel corpora: starting with no human translations from German to English we produce a German to English translation model with 45% accuracy using parallel corpora in other languages. BibTex Co-training for Statistical Machine Translation @inproceedings{CallisonBurch-Osborne:2003:CLUK, author = {Callison-Burch, Chris and Osborne, Miles}, title = {Co-Training For Statistical Machine Translation}, booktitle = {Proceedings of the 6th Annual CLUK Research Colloquium}, year = {2003} }
Evaluating Question Answering Systems Using FAQ Answer Injection. Jochen Leidner and Chris Callison-Burch. the 6th Annual CLUK Research Colloquium 2003. Abstract Evaluating Question Answering Systems Using FAQ Answer Injection Question answering (NLQA) systems which retrieve a textual fragment from a document collection that represents the answer to a question are an active field of research. But evaluations currently involve a large amount of manual effort. We propose a new evaluation scheme that uses the insertion of answers from Frequently Asked Questions collections (FAQs) to measure the ability of a system to retrieve it from the corresponding question. We describe how the usefulness of the approach can be assessed and discuss advantages and problems. BibTex Evaluating Question Answering Systems Using FAQ Answer Injection @inproceedings{Leidner-CallisonBurch:2003:CLUK, author = {Jochen L. Leidner and Chris Callison-Burch}, title = {Evaluating Question Answering Systems Using FAQ Answer Injection}, booktitle = {Proceedings of the 6th Annual CLUK Research Colloquium}, year = {2003} }
2002
Co-Training for Statistical Machine Translation. Chris Callison-Burch. Master's thesis, School of Informatics, University of Edinburgh 2002. Abstract Co-Training for Statistical Machine Translation I propose a novel co-training method for statistical machine translation. As co-training requires multiple learners trained on views of the data which are disjoint and sufficient for the labeling task, I use multiple source documents as views on translation. Co-training for statistical machine translation is therefore a type of multi-source translation. Unlike previous mutli-source methods, it improves the overall quality of translations produced by a model, rather than single translations. This is achieved by augmenting the parallel corpora on which the statistical translation models are trained. Experiments suggest that co-training is especially effective for languages with highly impoverished parallel corpora. BibTex Co-Training for Statistical Machine Translation @MastersThesis{Callison-Burch2002, author = {Chris Callison-Burch}, title = {Co-training for Statistical Machine Translation}, school = {University of Edinburgh}, year = {2002} }
2001
Upping the Ante for "Best of Breed" Machine Translation Providers. Chris Callison-Burch. ASLIB Translating and the Computer 2001. Abstract Upping the Ante for "Best of Breed" Machine Translation Providers The notion of "best of breed" among value-added machine translation technology providers is generally defined as providing access to the single best commercially available machine translation engine for each language pair. This paper describes the efforts of Amikai, Inc. to go beyond that definition of best of breed. Rather than relying on a single engine for each pair, we have written a program that automatically selects the best translation from a set of candidate translations generated by multiple commercial machine translation engines. The program is implemented using a simple statistical language modelling technique, and relies on the simplifying assumption that the most fluent item in the set is the best translation. The program was able to produce the best translation in human ranked data up to 19% more often than the single best performing engine. BibTex Upping the Ante for "Best of Breed" Machine Translation Providers @inproceedings{Callison-Burch:2001:ASLIB, title = {Upping the Ante for "Best of Breed" Machine Translation Providers}, author = {Chris Callison-Burch}, booktitle = {Proceedings of ASLIB Translating and the Computer 23}, year = {2001}, }
A program for automatically selecting the best output from multiple machine translation engines. Chris Callison-Burch and Raymond Flournoy. MT Summit 2001. Abstract A program for automatically selecting the best output from multiple machine translation engines This paper describes a program that automatically selects the best translation from a set of translations produced by multiple commercial machine translation engines. The program is simplified by assuming that the most fluent item in the set is the best translation. Fluency is determined using a trigram language model. Results are provided illustrating how well the program performs for human ranked data as compared to each of its constituent engines. BibTex A program for automatically selecting the best output from multiple machine translation engines @inproceedings{Callison-Burch-Flournoy:2001:MTSummit, title = {A Program for Automatically Selecting the Best Output from Multiple Machine Translation Engines}, author = {Chris Callison-Burch and Raymond S. Flournoy}, booktitle = {Proceedings of the Machine Translation Summit VIII}, year = {2001}, }
Secondary Benefits of Feedback and User Interaction in Machine Translation Tools. Raymond Flournoy and Chris Callison-Burch. MT Summit Workshop 2001. Abstract Secondary Benefits of Feedback and User Interaction in Machine Translation Tools User feedback has often been proposed as a method for improving the accuracy of machine translation systems, but useful feedback can also serve a number of secondary benefits, including increasing user confidence in the MT technology and expanding the potential audience of users. Amikai, Inc. has produced a number of communication tools which embed translation technology and which attempt to improve the user experience by maximizing useful user interaction and feedback. As MT continues to develop, further attention needs to be paid to developing the overall user experience, which can improve the utility of translation tools even when translation quality itself plateaus BibTex Secondary Benefits of Feedback and User Interaction in Machine Translation Tools @inproceedings{Flournoy-Callison-Burch:2001:MTSummit, title = {Secondary Benefits of Feedback and User Interaction in Machine Translation Tools}, author = {Raymond S. Flournoy and Chris Callison-Burch}, booktitle = {Workshop paper for "MT2010: Towards a Roadmap for MT" of the MT Summit VIII}, year = {2001}, }
2000
A Computer Model of a Grammar for English Questions. Chris Callison-Burch. Undergraduate thesis, Symbolic Systems Program, Stanford University 2000. Abstract A Computer Model of a Grammar for English Questions This document describes my senior honors project, which is an implementation of a grammar for English questions. I have created a computer model of Ginzburg and Sag’s theory of English interrogative constructions using the parsing software developed at the Center for Study of Language and Information (CSLI). In this chapter I describe the LKB parsing software, give instructions on downloading the system, and comment on the process of grammar engineering. The next chapter gives a summary of Ginzburg and Sag (2000). Chapter 3 details the discrepancies between the Ginzburg and Sag theory and my implementation. Chapter 4 provides a detailed discussion of a set of key example sentences. The appendices contain tables describing all the grammar constructions, lexical rules, types, and example lexical entries used in my implementation. BibTex A Computer Model of a Grammar for English Questions @MISC{Callison-Burch2000, author = {Chris Callison-Burch}, title = {A Computer Model of a Grammar for English Questions}, school = {Stanford University}, address = {Palo Alto, California}, note = {Undergraduate honors thesis}, year = {2000} }

2025

2024

2023

2022