Annotation Instruction
Task Definition
- INPUT
  - premise: a 4-second clip from the video
  - hypothesis: a sentence describing a subsequent event that happens moments later than, or right after, the premise
  - question: a question about the event that happens between the premise and the hypothesis (abductive) or after the hypothesis (predictive)
- OUTPUT
  - answer: answer(s) to the question
- Auxiliary INFO
  - thumbnail: the thumbnail/cover of the video
  - clip-bg: title and description of the clip
  - movie-bg: title and background of the movie
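As a rough illustration of the fields listed under INPUT, OUTPUT, and Auxiliary INFO, one annotation item could be held in a record like the sketch below. The class name, the attribute names, and the use of plain strings for the clip and thumbnail are assumptions made for this example only; the instruction does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotationItem:
    """One item to annotate (hypothetical container, not an official schema)."""

    # INPUT
    premise: str          # reference to the 4-second clip, e.g. a file path or ID (assumed)
    hypothesis: str       # sentence describing the subsequent event
    question: str         # question about the in-between or following event
    question_type: str    # "abductive" or "predictive"

    # OUTPUT
    answers: List[str]    # answer(s) to the question

    # Auxiliary INFO (may be missing for some items; assumed optional here)
    thumbnail: Optional[str] = None
    clip_bg: Optional[str] = None
    movie_bg: Optional[str] = None

    def __post_init__(self) -> None:
        # The task definition names exactly two question types.
        if self.question_type not in ("abductive", "predictive"):
            raise ValueError(f"unknown question type: {self.question_type!r}")
```

Keeping the auxiliary fields optional reflects that they are consulted only when the premise and hypothesis alone are not enough.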
Annotation
Feasibility
1-1. Is there a ghost entity or typo in the answer that cannot be detected or interpreted from the premise or hypothesis?
A. yes; B. yes, but can be predicted by inference; C. no
1-2. If you choose A for 1-1, can the ghost entity or typo be predicted or interpreted based on the thumbnail, clip-bg, or movie-bg?
A. yes for thumbnail; B. yes for clip-bg; C. yes for thumbnail + clip-bg; D. yes for clip-bg + movie-bg; E. yes for thumbnail + clip-bg + movie-bg; F. no
1-3. If you choose C for 1-1 or A/B/C/D/E for 1-2, is there detail leakage from the textual information that makes it too easy to get the answer? (e.g. the answer just changes 1 or 2 words of the source sentence)
A. yes; B. no
Multimodality
2-1. If you choose B/C for 1-1 or A/B/C/D/E for 1-2, can the answer be generated from information in only one modality (e.g. visual or textual)?
A. yes, based on visual info; B. yes, based on textual info; C. no, we need both
2-2. If you choose A/C for 2-1, what kind(s) of visual information must be distilled to answer the question? (multi-choice)
A. object-attribute; B. scene/place signals; C. human-emotion; D. motion/action; E. spatio-temporal relation; F. others
2-3. If you choose C for 2-1, how can information from the two modalities be associated? (multi-choice)
A. basic grounding/alignment; B. the two modalities can help specify the events described by each other; C. daily commonsense reasoning; D. others
Commonsense Reasoning
3-1. If you choose B/C for 1-1 or A/B/C/D/E for 1-2, is commonsense knowledge required to get the answer? If so, what kind(s) of commonsense knowledge is involved? (multi-choice)
A. no; B. object-attribute; C. basic actions/motions of people/objects; D. correlation between events; E. change in people's mental states; F. social interactions among people & objects; G. others
3-2. Please write down the rationale for how to get the answer to the question. If you think the provided information is insufficient, explain why the question is unanswerable.
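
The "If you choose ..." conditions on questions 1-2 through 3-1 form a small skip pattern. The sketch below is one possible reading of those conditions, not an official tool: the function name and the dictionary format for answers are assumptions, and it treats 3-2 as always required because that question also covers the unanswerable case.

```python
# Options of 1-2 that mean the ghost entity or typo can be recovered
# from the auxiliary info (every option except "F. no").
RECOVERABLE_1_2 = {"A", "B", "C", "D", "E"}


def applicable_questions(answers):
    """Return the question IDs an annotator should fill in.

    `answers` maps question IDs such as "1-1" to the chosen option letter
    (hypothetical format; only the single-choice questions are consulted).
    """
    q11 = answers.get("1-1")
    # 1-2 is only asked, and only counts, when 1-1 was answered A.
    q12 = answers.get("1-2") if q11 == "A" else None
    recovered = q12 in RECOVERABLE_1_2

    asked = ["1-1"]
    if q11 == "A":
        asked.append("1-2")
    if q11 == "C" or recovered:
        asked.append("1-3")
    if q11 in {"B", "C"} or recovered:
        asked.append("2-1")
        q21 = answers.get("2-1")
        if q21 in {"A", "C"}:
            asked.append("2-2")  # visual information is involved
        if q21 == "C":
            asked.append("2-3")  # both modalities are needed
        asked.append("3-1")
    asked.append("3-2")  # the rationale is always written
    return asked
```

For example, under this reading an item answered A for 1-1 and F for 1-2 skips every question except 1-1, 1-2, and the rationale in 3-2.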