r26 - 16 Nov 2005 - 23:31:04 - JohnNiekrasz

CALO Y3 - CSLI System Discussion

System Components

Communicative Act Segmenter and Identifier

NL Parser and Interpreter

Discourse Interpretation

Topic Segmenter and Identifier

Action-Item and Decision-Point Identifier

Meeting Browser and User Interface

Inputs to CSLI Components

ASR N-Best Hypotheses and Word Lattices

Word lattices (with timing information included), or some intermediate representation such as a confusion network, will be critical for providing flexibility in NL and discourse analysis. They will also be necessary for creating a feedback loop that tells the recognizer which hypothesis was ultimately chosen.
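
As an illustration only, a confusion network with timing could be represented as a sequence of slots, each holding competing words and their posteriors. The sketch below is not the recognizer's actual output format: the `Slot` class, the `n_best` function, the use of an empty string for a deletion, and the example words and probabilities are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass
class Slot:
    """One confusion-network slot: competing words over a time span (hypothetical format)."""
    start: float  # seconds from start of meeting
    end: float    # seconds from start of meeting
    words: dict   # word -> posterior probability; "" marks a deletion


def n_best(network, n):
    """Return the n most probable (score, words) paths.

    Brute-force enumeration: adequate for short utterances only.
    """
    paths = []
    for choice in product(*(slot.words.items() for slot in network)):
        words = tuple(word for word, _ in choice if word)
        score = prod(prob for _, prob in choice)
        paths.append((score, words))
    paths.sort(key=lambda path: path[0], reverse=True)
    return paths[:n]


# Invented example data.
network = [
    Slot(0.00, 0.31, {"send": 0.7, "sent": 0.3}),
    Slot(0.31, 0.52, {"the": 0.9, "": 0.1}),
    Slot(0.52, 1.10, {"report": 0.6, "reporter": 0.4}),
]
best = n_best(network, 2)
```

Because every path keeps its slot-level choices and timings, the hypothesis finally chosen downstream can be reported back to the recognizer in terms of the same slots.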

Speaker Diary

The speaker diary is simply a record of spoken contributions to the discourse, with the contributor identified for each contribution. This will be trivial with headset mics, but may be a significant challenge to the providing party if the close-talking mics are not available. In this case, it may be necessary to provide probabilities for the speaker of each detected utterance.
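
A minimal sketch of what a probabilistic speaker diary entry might look like follows. The `DiaryEntry` class, its field names, and the participant ids are hypothetical; with headset mics each entry would simply carry a single speaker with probability 1.0.

```python
from dataclasses import dataclass


@dataclass
class DiaryEntry:
    """One spoken contribution with a distribution over possible speakers (hypothetical format)."""
    start: float         # seconds
    end: float           # seconds
    speaker_probs: dict  # participant id -> probability

    def most_likely_speaker(self):
        return max(self.speaker_probs, key=self.speaker_probs.get)


# Invented example data.
diary = [
    DiaryEntry(0.0, 2.4, {"alice": 1.0}),                            # close-talking mic
    DiaryEntry(2.6, 4.1, {"alice": 0.2, "bob": 0.7, "carol": 0.1}),  # far-field estimate
]
speakers = [entry.most_likely_speaker() for entry in diary]
```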

Prosodic Features

Here is a list of prosodic features that would be useful to downstream algorithms such as discourse structuring, dialogue act detection, and floor and addressee extraction (see Shriberg et al., Speech Communication 32(1-2), 2000):

  • segmentation, duration, and identification of:
    • pauses (not segmental-related pauses like stops)
    • phones / syllables
    • words
    • utterances
    • filled pauses
  • speaker-normalized, filtered, and regularized F0
  • speaker- and channel-normalized energy
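
To make "speaker-normalized F0" concrete, one common choice is a per-speaker z-score of log-F0 over voiced frames. This is only a sketch under that assumption: the function name `normalize_f0` and the convention that unvoiced frames are 0 or None are ours, and filtering and regularization are not covered here.

```python
import math
from statistics import mean, pstdev


def normalize_f0(f0_hz):
    """Per-speaker z-score of log-F0. Unvoiced frames (0 or None) are returned as None."""
    voiced = [math.log(f) for f in f0_hz if f]
    if len(voiced) < 2:
        return [None] * len(f0_hz)
    mu, sigma = mean(voiced), pstdev(voiced)
    if sigma == 0:
        return [0.0 if f else None for f in f0_hz]
    return [(math.log(f) - mu) / sigma if f else None for f in f0_hz]


# Invented frame values for a single speaker, in Hz.
z = normalize_f0([100.0, 0, 200.0, None, 400.0])
```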

OOV Words

New words learned by the ASR system, or by any other system that produces words as output, need to be provided to us separately from their delivery as part of a recognized event, so that each word can be integrated into the parser lexicon.

Multimodally-integrated CLib-LFs

We will provide ambiguous semantic LFs in CLib communication model format as output from Gemini (see below in Outputs section) which include potential points for multimodal integration (e.g. spatial deixis; verbal agreements/disagreements which require gestural support such as head nodding). We will then require an integrated version of these LFs to be returned from the multimodal integrator; the integrated versions will add the sub-parts of the Communicate events from other modalities, while keeping to the specified constraints (and possibly adding more). Examples include:

  • Spatial deictic references to be integrated with physical/virtual objects
  • Named object references to be integrated with whiteboard objects
  • Acknowledgements/agreements/disagreements to be integrated with head gestures

Unimodal Gestural and Sketch Events

These are assumed to be produced by the relevant components and consumed by the multimodal integrator, rather than being consumed by any CSLI components directly; we will then consume them via the integrated CLib-LFs as produced by the integrator. CSLI are therefore not directly impacted by the form in which these inputs are supplied to the integrator (although we expect that the integrator will require discrete symbolic input). Events which we expect will be directly useful to us are:

  • Physical Gestures
    • Face/eye gaze direction (including object)
    • Head nods
    • Head shakes
    • Pointing gestures (including object)
  • Whiteboard Sketch
    • Existing objects
    • Object creation/deletion/modification

Head Pose and Gaze Diary

A record of head pose and gaze will be required for addressee and floor detection. The record should be filtered so as to not include spurious rapid changes, and the results should be provided discretely and symbolically rather than with continuous spatial values.
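
The kind of filtering we have in mind can be sketched as follows, assuming the detector already emits symbolic gaze targets at sample times. The function name `filter_gaze`, the (time, target) sample format, and the 0.5 second threshold are illustrative assumptions rather than a specification.

```python
def filter_gaze(samples, min_duration=0.5):
    """Collapse (time, target) samples into (start, end, target) intervals,
    dropping any run of a single target shorter than min_duration seconds."""
    # 1. Group consecutive samples with the same target into runs.
    runs = []
    for time, target in samples:
        if runs and runs[-1][2] == target:
            runs[-1][1] = time
        else:
            if runs:
                runs[-1][1] = time  # the previous run ends where the new one starts
            runs.append([time, time, target])
    # 2. Drop spurious runs, then merge neighbours that now share a target.
    diary = []
    for start, end, target in runs:
        if end - start < min_duration:
            continue
        if diary and diary[-1][2] == target:
            diary[-1][1] = end
        else:
            diary.append([start, end, target])
    return [tuple(entry) for entry in diary]


# Invented example: a brief glance at "bob" is treated as spurious.
samples = [
    (0.0, "whiteboard"), (0.2, "whiteboard"), (0.4, "whiteboard"),
    (0.6, "bob"),
    (0.7, "whiteboard"), (0.9, "whiteboard"),
    (1.3, "alice"), (1.5, "alice"), (2.0, "alice"),
]
gaze_diary = filter_gaze(samples)
```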

Meeting Knowledge (Potential Discourse Referents)

We require information about all potential discourse referents to be provided in CLib form in a knowledge base. This will be used to anchor semantic LFs to the domain where definite or named descriptions occur for objects such as people, documents and text notes. This process of reference resolution is therefore similar to that performed by the multimodal integrator, but deals with expressions that are not spatially deictic (these will have been previously resolved during multimodal integration). This information will come from several sources including:

  • Participant IDs from CALO desktop login
  • Ongoing tasks and action items from CALO knowledge base
  • Salient documents from CALO desktop
  • Salient text notes from CALO desktop

Project Knowledge

This information is very similar to that supplied as "meeting knowledge" but represents more temporally persistent and project-oriented knowledge, specifically that which plays an important role in the user-level inputs and outputs of the system, such as tasks, milestones, ongoing projects, agendas and responsibilities.

Agenda / Prep Pack

The Prep Pack is assumed to include the agenda and any relevant documents (it may also include email/chat data, but this is discussed separately below). The agenda must be available in a CLib format with separate items identified as such, with their own text descriptions, and with relevant documents identified as relating to their corresponding agenda items. These will be used to generate (unsupervised) prior topic-word models for topic ID.
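
As a rough sketch of how agenda text could seed prior topic-word models, the example below builds one smoothed unigram distribution per agenda item. It is not the topic model itself: the function names, the tiny stopword list, the add-alpha smoothing, and the example agenda text are all assumptions made for illustration.

```python
from collections import Counter
import re

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in", "on"}


def tokenize(text):
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def topic_word_priors(agenda, alpha=0.1):
    """Return {item_id: {word: P(word | item)}} using add-alpha smoothing
    over the vocabulary shared by all agenda items."""
    counts = {item: Counter(tokenize(text)) for item, text in agenda.items()}
    vocab = sorted(set().union(*counts.values()))
    priors = {}
    for item, item_counts in counts.items():
        total = sum(item_counts.values()) + alpha * len(vocab)
        priors[item] = {w: (item_counts[w] + alpha) / total for w in vocab}
    return priors


# Invented agenda items.
agenda = {
    "item1": "Review of the demo schedule and demo hardware",
    "item2": "Budget for travel",
}
priors = topic_word_priors(agenda)
```

Text from the documents attached to each agenda item could be appended to that item's description before counting.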

Project Status and Ongoing Tasks

Projects, tasks and action items are already required for reference resolution (see above), but they are also required as direct inputs to the topic identifier and action item identifier. Projects and tasks will be used to generate (unsupervised) prior topic-word models for topic ID; tasks and action items will be used to allow identification of new discussion of old action items, and association of action items with tasks. All must be specified in CLib format. We will require projects and tasks to include associated text descriptions; action items should include at least associated responsible parties, tasks (with associated descriptions) and deadlines.

Archived Email and Chat Text

Email and chat text relevant to the meeting setup (e.g. agenda discussion, minutes approval from previous meetings) is required as input to the topic and action item identifiers. In Y3 this need only take the form of text, and will be used to generate (unsupervised) prior topic-word models for topic ID. In future years, email/chat exchanges may be used to aid semantic disambiguation and discourse structure assignment, so may require more structured

Outputs from CSLI Components

CLib-LFs from Speech

Output from the Gemini parser; supplied to multimodal integrator; also used as LITW feedback to parser lexicon for new word acquisition. Semantic logical forms in an ontology-based (CLib-compatible, events + role-fillers) format, with CLib annotations where possible (where the coverage of CLib allows). Complete event descriptions where possible by combining fragments; isolated fragments otherwise. Multiple hypotheses both from ambiguous ASR output and parse ambiguity. Not to include at this stage: resolved anaphora/reference (including addressee), communicative acts/dialogue move types.

Floor and Addressee Diary

The floor and addressee diary will be an annotation of communicative acts in which labels are given for the addressee of each utterance and for whether or not the utterance was produced while holding the floor (or was a floor holder or grabber).

Possible Uses: May be useful during playback for making clear who is talking to whom, when such information is difficult to perceive directly from the 360-degree camera shot. Is essential for eliminating utterances which do not contribute to the discourse and for resolving personal deixis.

Fully-Resolved CLib-LFs

Output from the discourse understanding component. Most likely set of communicative acts, in CLib Communicate format, including relations to their constituent semantic components (from multiple modalities) and surface speech/gestures. References resolved to CLib entities/events where possible. Classified into dialogue act types and including antecedent relations where possible, thus giving adjacency pair information. Multiple hypotheses given weights/probabilities based on the cumulative process so far (including ASR confidence, syntactic/semantic plausibility, reference/anaphora resolution, addressee resolution, discourse structure).
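
One simple way to turn per-stage scores into hypothesis weights is to multiply them in the log domain and renormalize. The sketch below assumes each stage supplies a probability-like score in (0, 1] and that stages are treated as independent; the function name `combine`, the stage names, and the numbers are invented for the example.

```python
import math


def combine(hypotheses):
    """hypotheses: {name: {stage: score in (0, 1]}}.

    Returns {name: weight}, with weights summing to 1 across hypotheses.
    """
    log_scores = {
        name: sum(math.log(p) for p in stages.values())
        for name, stages in hypotheses.items()
    }
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnormalized = {name: math.exp(s - top) for name, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {name: u / z for name, u in unnormalized.items()}


weights = combine({
    "h1": {"asr": 0.6, "parse": 0.5, "reference": 0.9},
    "h2": {"asr": 0.4, "parse": 0.9, "reference": 0.5},
})
```

Here `h1` starts with the better ASR score and still wins after parsing and reference resolution are taken into account, but with a narrower margin than ASR confidence alone would suggest.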

This can be seen as including the following sub-outputs:

Chosen ASR Hypotheses / Utterance Segments

The speech hypotheses associated with the most likely overall communicative hypotheses and their confidences are available for feedback to the ASR components for LITW. This allows the accumulated syntactic, semantic and pragmatic information to be used to influence the acoustic/language models for better word recognition; the end-pointing models for better utterance segmentation; and the ROVER ASR integrator models for better integration.

Dialogue Acts, Antecedent Structure

Discourse structure can be used to answer test questions directly (e.g. question/answer pairs).

Action Items

Output from the action-item identifier; supplied direct to the CALO knowledge base for end user browsing/querying. A set of action-items in CLib format, with each including: the identity of the person responsible; the deadline for completion; reference to the task involved; reference to the discussion for browsing, including identification of the task proposal/discussion and the acceptance/agreement where available.
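
For illustration, the fields listed above could be held in a record like the one below before being rendered in CLib format. This is not the CLib representation: the `ActionItem` class, its field names, and the utterance ids are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical in-memory record for one detected action item."""
    task: Optional[str] = None           # reference to the task involved
    responsible: Optional[str] = None    # id of the person responsible
    deadline: Optional[date] = None      # deadline for completion
    discussion_ids: list = field(default_factory=list)  # utterances proposing/discussing the task
    acceptance_id: Optional[str] = None  # utterance accepting the task, where available

    def missing_slots(self):
        """Slots the identifier could not fill."""
        return [name for name in ("task", "responsible", "deadline")
                if getattr(self, name) is None]


# Invented example: the deadline was never stated in the discussion.
item = ActionItem(task="circulate draft agenda", responsible="alice",
                  discussion_ids=["u12", "u13"], acceptance_id="u14")
unfilled = item.missing_slots()
```

Keeping unfilled slots explicit makes it possible to assess shallow and deep identification separately.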

Success should be assessed on multiple levels, as we expect performance initially to be good only at the shallower levels, with improvement shown later at the deeper levels - see NewIetQuestionsY3:

  • Identification of related utterances
  • Identification of utterance function: assignment of task, assignment of deadline, acceptance
  • Identification of responsible party
  • Identification of deadline
  • Identification of task
  • Relation to previous action items/tasks from knowledge base (including previous meetings)

Topic Segments and Word Vectors

Output from the topic identifier; supplied direct to the CALO knowledge base for end user browsing/querying. An array of topics in CLib format, with each including: a summary as a vector of keywords in order of importance; a corresponding agenda item where identifiable; corresponding temporal segment(s) of the discourse for browsing.

Success should be assessed on multiple levels, as we expect performance initially to be good only at the shallower levels, with improvement shown later at the deeper levels - see NewIetQuestionsY3:

  • Segmentation of the discourse into segments useful for browsing
  • Production of relevant keyword summaries
  • Relation of topics to agenda items
  • Relation of topics to each other
  • Relation of topics to known tasks/projects from knowledge base (including previous meetings)

Decisions and Tasks

Output from the meeting understanding components, but probably not in Y3. See CaloIetTest.

Inter-component Learning Opportunities

ASR and Utterance Segmentation Feedback

With multiple ASR hypotheses (and potentially confidences in multiple utterance segmentations as well) coming into our system, we can provide chosen hypotheses and utterance boundaries back to the speech subsystem for learning. The input hypotheses can be pruned based on parse production, fragment combining, constraints imposed by reference and anaphora resolution, and discourse structuring.

Floor and Addressee Feedback to Gaze and Pose Detection

With head pose and gaze detection (possibly provided as multiple confidences), floor and addressee detection may provide useful feedback to those algorithms.

User Feedback Using the Meeting Browser

A simple GUI where user-level results like action items and topics are presented can serve as a very effective place to receive supervised feedback from human users. While it is unclear how this GUI might be specifically designed, some ideas are a simple up/down evaluation of annotations produced by the system through a one-click interface (similar to email spam tagging), or even refinement of details (like the filling in of a missing assignee slot in an action item). These can all be fed directly back into the action item and topic detection algorithms.

Producing/Reviewing Official Minutes

One possible use for such a browser is in official meeting review/minute production. Many organisations require minutes from a meeting to be reviewed and officially approved either at the end of the meeting itself, or at the beginning of the next meeting. One useful way to use the system output (including decisions/action items/topics) would be to produce a set of potential minutes, viewable in a dedicated GUI on the CALO desktop and perhaps appearing as a set of Powerpoint-style slides. User interaction would then involve editing these minutes until they are acceptable, either by accepting/rejecting each point, or even by editing points and adding new ones. This could be used as very valuable feedback to all stages of the understanding components (including ASR and whiteboard processing).

Computational Requirements

Off-line vs. On-line

In Y3 it seems likely that topic segmentation/ID will have to remain off-line; the Markov model approach we are using will give better results when allowed to see the whole meeting. It should be adaptable in theory to on-line usage via similar techniques to HMM-based speech recognition, but the end-pointing problem is more difficult for us (no useful pauses/prosody to rely on, although the use of discourse features should get us somewhere). We anticipate looking at this in Y4, while Y3 will be spent improving the basic model, using the knowledge base, agenda, etc. to provide lexical topic model information, and incorporating discourse features.

Parsing and semantic interpretation could be made on-line provided that (a) speech recognition output and multimodal integrator output are provided on-line, and (b) a delay of the order of a few utterances is acceptable. It probably is: the on-line use we envisage is in reviewing the content (topics, action items, decisions) at the end of the meeting or at particular stages of the meeting (e.g. at the end of particular phases), rather than continuously watching the interpretation develop in real time. The reason we need a delay is that we intend to use context on both sides of an utterance (rather than just prior context) to help determine its discourse role (and therefore its semantic/pragmatic interpretation, correct ASR hypothesis, etc.). From what we have seen so far, parser speed is unlikely to be a problem. This (and the performance of the downstream components such as reference resolution and discourse structure building) may change as we incorporate multiple probabilistic hypotheses, though.
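
The delay described above amounts to a small lookahead buffer: an utterance is interpreted only once a few later utterances have arrived. The sketch below shows the idea. The function name `with_context` and the window sizes are illustrative assumptions, not a description of the implemented pipeline.

```python
from collections import deque


def with_context(utterances, before=2, after=2):
    """Yield (previous, current, following) for each utterance, waiting until
    `after` later utterances have arrived (or the stream has ended)."""
    history = deque(maxlen=before)  # most recent already-processed utterances
    pending = deque()               # utterances still waiting for right context
    for utterance in utterances:
        pending.append(utterance)
        if len(pending) > after:
            current = pending.popleft()
            yield list(history), current, list(pending)
            history.append(current)
    # End of stream: flush whatever is left, with shrinking right context.
    while pending:
        current = pending.popleft()
        yield list(history), current, list(pending)
        history.append(current)


windows = list(with_context(["u1", "u2", "u3", "u4"], before=1, after=1))
```

With `after=2`, interpretation of each utterance lags the speech stream by two utterances, which is the order of delay assumed above.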

The major bottleneck for online processing may well be the MOKB and its query speed, and this will affect all components, not just CSLI.

Topic attachments

  • calo-y3-arch-schematic.odg (15.1 K, 26 Aug 2005 - 23:22, JohnNiekrasz)
  • calo-y3-arch-schematic.pdf (62.4 K, 26 Aug 2005 - 23:22, JohnNiekrasz)