arXiv:2004.03090

Interview: A Large-Scale Open-Source Corpus of Media Dialog

7 April 2020
Bodhisattwa Prasad Majumder
Shuyang Li
Jianmo Ni
Julian McAuley
Abstract

Existing conversational datasets consist either of written proxies for dialog or small-scale transcriptions of natural speech. We introduce Interview: a large-scale (105K conversations) media dialog dataset collected from news interview transcripts. Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating its usefulness in modeling real-world conversations. Interview contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems. In fact, experiments on two dialog tasks show that leveraging such labels improves performance over strong speaker-agnostic baselines and enables models to generate more specific and inquisitive responses in interview-style conversations.
