This page gives general information common to the topic-based Y3 test questions: PQ0199, PQ0220, PQ0299, PQ0301, PQ0302 (see CaloY3TestQuestions for the individual questions).
1) Background Info:
We have two general approaches to modelling topics:
MIT/Brown's unsupervised generative model learns "micro-topics" (probability distributions over words) from all available meeting speech transcript data, and represents topically coherent segments as being generated by constant "macro-topics" - weighted mixtures of these "micro-topics". See the ACL 2006 paper on this page for more details.
CSLI's unsupervised vector space model assumes only a single level, with segments related to "topics", also modelled as probability distributions over words. Different segments correspond to different topics; weighted mixtures of topics can also be used to produce new topics.
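To illustrate what "weighted mixtures of topics" means when a topic is a probability distribution over words, here is a minimal sketch. The function name mix_topics, the example topics and their probabilities are all hypothetical, chosen for illustration; this is not the CSLI or MIT implementation.

```python
from collections import defaultdict


def mix_topics(topics, weights):
    """Combine word distributions into one weighted mixture.

    topics  -- list of dicts mapping word -> probability (each sums to 1)
    weights -- list of non-negative mixture weights, one per topic
    The weights are normalised, so the result is again a distribution.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mixed = defaultdict(float)
    for topic, weight in zip(topics, weights):
        for word, prob in topic.items():
            mixed[word] += (weight / total) * prob
    return dict(mixed)


# Hypothetical topics, for illustration only.
budget = {"budget": 0.5, "cost": 0.3, "meeting": 0.2}
hiring = {"hire": 0.6, "candidate": 0.3, "meeting": 0.1}

# A new topic that is three parts "budget" to one part "hiring".
mixture = mix_topics([budget, hiring], [3, 1])
print(mixture)
```

Because the weights are normalised before mixing, the result is itself a probability distribution over words and can be treated as a new topic.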
The MIT model currently seems to require more data than we expect to have in the Y3 Test, so we are using only the CSLI model. (To get round this, we are modifying the MIT model to share topics across meeting, email and document data - but this isn't implemented yet.)
Both models also produce a segmentation (division of the meetings into topically coherent regions, assigning each one a "(macro-)topic", i.e. a particular probability distribution/weighted mixture). But most of the PQ questions don't require this kind of prior segmentation.
Most of the PQs here instead require a posterior segmentation ("given a topic T, find when it was discussed"), which we refer to as topic localization.
As it is difficult to define a "topic" without typing in e.g. a large number of keywords, we propose to use the browser to help the user/question-asker with this definition. The user will enter a short description of the topic (e.g. a few keywords), and the browser will present the most relevant micro-topics. The user can then select one or more of these (possibly with weights) that match what was intended and/or seem to sum up what is required. Alternatively, we provide OAA solvables which can find the most relevant topics directly - but this is dispreferred as it rules out the possibility of creating user/query-specific weighted mixtures.
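The keyword-to-topic step described above could be sketched as follows. This is only an illustration under stated assumptions: the scoring rule (summing each topic's probability for the query keywords), the function name rank_topics, and the topic IDs and distributions are all hypothetical, and say nothing about how the browser or the OAA solvables actually rank topics.

```python
def rank_topics(keywords, topics, top_n=3):
    """Rank named topics by how much probability they give the keywords.

    keywords -- list of query words entered by the user
    topics   -- dict mapping topic ID -> dict of word -> probability
    Returns up to top_n (topic ID, score) pairs, best first.
    """
    scores = {
        name: sum(dist.get(word.lower(), 0.0) for word in keywords)
        for name, dist in topics.items()
    }
    # Sort by descending score; break ties by topic ID for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical named topics, for illustration only.
topics = {
    "T1": {"budget": 0.5, "cost": 0.3, "meeting": 0.2},
    "T2": {"hire": 0.6, "candidate": 0.3, "meeting": 0.1},
    "T3": {"schedule": 0.7, "deadline": 0.2, "meeting": 0.1},
}

candidates = rank_topics(["budget", "cost"], topics)
print(candidates)
```

The user would then pick from the ranked candidates, optionally assigning weights, rather than accepting the top-ranked topic automatically.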
We then use more-or-less standard IR techniques to find the relevant time periods based on the required (mixture of) topics.
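As a rough illustration of topic localization, the sketch below scores each utterance against a topic using cosine similarity between term counts and the topic's word distribution, then merges adjacent matching utterances into time spans. Cosine similarity, the threshold value, the (start, end, text) utterance format and the function names are assumptions made for this example; the text above only says that more-or-less standard IR techniques are used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word -> weight dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def localize(topic, utterances, threshold=0.3):
    """Return merged (start, end) spans where the topic seems to be discussed.

    topic      -- dict of word -> probability (possibly a weighted mixture)
    utterances -- list of (start, end, text) tuples, sorted by start time
    threshold  -- hypothetical similarity cut-off, chosen for illustration
    """
    spans = []
    for start, end, text in utterances:
        term_counts = Counter(text.lower().split())
        if cosine(term_counts, topic) >= threshold:
            if spans and spans[-1][1] == start:
                # Contiguous with the previous match: extend that span.
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


# Hypothetical topic and transcript (times in seconds), for illustration only.
topic = {"budget": 0.5, "cost": 0.3, "meeting": 0.2}
utterances = [
    (0, 10, "welcome to the meeting everyone"),
    (10, 20, "the budget is tight"),
    (20, 30, "cost and budget need review"),
    (30, 40, "now about the new candidate"),
]

spans = localize(topic, utterances)
print(spans)
```

Note that no prior segmentation is needed here: the spans are derived directly from the requested topic, which is the sense in which the segmentation is "posterior".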
The topic models learn in an unsupervised manner, updating as more data becomes available.
(The localization techniques can learn by implicit supervision via feedback from the browser - but this isn't implemented yet.)
2) Resource Level:
See individual questions
3) English Interpretation:
See individual questions
4) Description of Learning:
BCALO (baseline) is taken as the information CALO can provide without extracting any information from natural interaction. In this case, this means BCALO cannot answer any topic PQs, and the baseline is therefore zero.
LCALO uses topic localization over all experienced meeting data.
4b) Alternative:
If we want a non-zero BCALO baseline, we could alternatively use a different baseline set of topic models, but with the same general topic localization method as LCALO:
BCALO (baseline) - we use a baseline model (unaffected by meeting observations)
LCALO (learning) - we use the model learnt over all meeting data
If the model is specified in the question (see below), this is controlled directly by the question setting. The baseline model could be e.g. a flat document text; the learnt model an OPI or MA topic-word distribution.
(There is scope for learning the topic localization method parameters too (from implicit supervision via browser feedback) - but this isn't implemented yet.)
5) Answer Strategy:
See individual questions
6) Sensitivity to Parameter Instantiations:
The meeting must exist in the KB and have been processed by the MA suite.
In general, we assume that both |Topic| and |SubjectCategory| parameters are given as named Topic IDs. These should be chosen by the questioner based either on the output of the csliGetTopics OAA solvable or the named topics displayed in the meeting browser.
For certain PQs it makes no sense to take the named topics produced by the CSLI agents - so far, this only seems to be the case for PQ0301 ("Which of the following set of topics {|io:%SubjectCategory|, |io:%SubjectCategory|, ..., |io:%SubjectCategory|} were discussed at |io:%Meeting|"). In this case, we assume that the |SubjectCategory| parameters are given as plain text (e.g. sets of keywords/phrases). The questioner will then have to relate these to the best-fitting named topics using either the output of the csliGetRelevantTopic OAA solvable or the meeting browser search function.
PQs are then solved in terms of named Topic IDs.
See also the individual questions.
7) Answer Key:
8) Scoring Method:
9) Design of LCALO-to-BCALO Transform:
BCALO
10) Critical Learning Period (CLP) Conditions:
Currently, it is only required that all meetings be processed by the MA suite, as the model learning is unsupervised.
(There may be more requirements in future relating to browser feedback.)
11) User Actions to Drive Learning:
Currently, none.
(There may be more requirements in future relating to browser feedback.)