r2 - 25 Oct 2006 - 19:36:55 - JohnNiekrasz

NOMOS Corpus Importing

This page gives instructions on loading supported corpora into NOMOS so that they may be viewed, processed or annotated. One of two possible situations will apply when importing a corpus: (1) a pre-existing import script is available, or (2) you will need to write your own importer plugin. The former is documented on this page, while the latter is documented in the Advanced Usage section. The typical procedure for importing a corpus is as follows:

  1. Preparation: Prepare the source files on disk. The import script expects each source corpus to be available on disk in a specific directory structure so that it can find the input files it needs. In general, the directory structure should reflect the original structure of the source archive itself, e.g. the layout of the LDC DVDs. For resources supplied as zip or tar files, make a new directory for each resource and unpack the file into that directory. For CD/DVD, the directory should mimic the directory structure of the disc. For multi-DVD corpora, special instructions apply (see below).
  2. Configuration: Each script requires the setting of some configuration variables in the nomos.config file so that the script can find the appropriate resources on your local system. This typically involves specifying a name for the root directory in which the corpus may be found.
  3. Running: To run a pre-packaged import script, choose the menu option "Run" > "Run Import Script...". Select the import script you wish to run from the pull-down list, and click "OK". Some import scripts take several minutes. The script reads the input files from the corpus and produces a set of NOMOS/OPI models containing the corpus information in a format that NOMOS can annotate.
  4. Using: You will of course want to use the annotations produced by the script for visualization or further annotation. This involves comprehending the ontology with which the script has encoded the information. The ontologies are documented in a separate Annotation Ontology Documentation page.

The following sections describe these details for each of the preexisting NOMOS import scripts.

ICSI Meeting Corpus

Preparation

The ICSI Meeting Speech DVDs must be accumulated into a single directory. Each DVD from the LDC comes with the same set of metadata files; the only difference between the DVDs is in the speech/ directory. Therefore, the directory containing the ICSI speech should mimic a single DVD, but its speech/ directory should contain the combined contents of the speech/ directories on all the DVDs. The directory structure will look like this:

root/
    doc/
    index.html
    speech/
        Bed001/
        Bed002/
        ...
        Buw001/
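
As a rough illustration of the accumulation step, the sketch below copies the shared metadata files from the first DVD only and then merges every DVD's speech/ directory into one root. The function name merge_dvd_speech and the paths in the usage note are assumptions made for illustration; the function is not taken from the NOMOS distribution.

```shell
# Sketch: merge the speech/ directories of several DVD copies into one root.
# merge_dvd_speech is a hypothetical helper, not part of NOMOS.
# Usage: merge_dvd_speech DEST_ROOT DVD_DIR [DVD_DIR ...]
merge_dvd_speech() {
  dest=$1
  shift
  mkdir -p "$dest/speech" || return 1
  first=1
  for dvd in "$@"; do
    if [ "$first" -eq 1 ]; then
      # Metadata files are identical on every DVD, so copy them only once
      for entry in "$dvd"/*; do
        [ -e "$entry" ] || continue
        [ "$(basename "$entry")" = "speech" ] && continue
        cp -R "$entry" "$dest/" || return 1
      done
      first=0
    fi
    # Each DVD holds a different set of meetings under speech/
    if [ -d "$dvd/speech" ]; then
      cp -R "$dvd/speech/." "$dest/speech/" || return 1
    fi
  done
}
```

For example, merge_dvd_speech /shared/corpora/icsi /media/dvd1 /media/dvd2 (hypothetical paths) would produce a root with one copy of the metadata and a combined speech/ directory. The same approach should apply to the other multi-DVD corpora on this page.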

The original sound files are in "Shorten"ed NIST Sphere format, which NOMOS cannot play. Therefore, each sound file must be converted to wav format. We have provided a script for converting the files, which can be found in the utils directory of the NOMOS distribution. For every sound file chanX.sph, the script creates a file chanX.sph.unshort.wav. By default, the NOMOS annotations created by the ICSI import script will create references to the latter set of files. Here is the script for your reference:

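# Run from the corpus root; requires the shorten and sox tools.
for i in speech/*/*.sph; do
  # Decompress the Shorten-encoded Sphere file
  shorten -x "$i" "$i.unshort"
  # Convert the decompressed Sphere file to wav
  sox -t sph "$i.unshort" "$i.unshort.wav"
done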
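
Before importing, it may be worth checking that the conversion covered every file. The sketch below lists each .sph file that has no converted counterpart. The name list_unconverted is hypothetical, and the default suffix simply follows the chanX.sph.unshort.wav naming described above.

```shell
# Sketch: report Sphere files that lack a converted counterpart.
# list_unconverted is a hypothetical helper, not part of NOMOS.
# Usage: list_unconverted SPEECH_DIR [SUFFIX]
list_unconverted() {
  dir=$1
  # Default suffix matches the chanX.sph.unshort.wav naming convention
  suffix=${2:-.unshort.wav}
  find "$dir" -type f -name '*.sph' | while IFS= read -r sph; do
    # Print the source file only when its converted file is missing
    [ -f "$sph$suffix" ] || printf '%s\n' "$sph"
  done
}
```

An empty result means every Sphere file has a matching wav file. Passing .wav as the second argument checks for the file.sph.wav naming instead.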

Configuration

The following variables, which each should point to the location of the resources on disk, need to be set in your nomos.config file:

  • corpora.orig.icsi.speech
  • corpora.orig.icsi.transcripts
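
A quick way to confirm that the variables are present is sketched below. It assumes that nomos.config holds one key=value pair per line, as in the CALO Y2 example later on this page; check_config_keys is a hypothetical helper and not a NOMOS command.

```shell
# Sketch: confirm that a config file defines each required key.
# Assumes one key=value pair per line; check_config_keys is hypothetical.
# Usage: check_config_keys CONFIG_FILE KEY [KEY ...]
check_config_keys() {
  config=$1
  shift
  missing=0
  for key in "$@"; do
    # Compare the text before the first "=" (trimmed) with the wanted key
    if ! awk -F= -v k="$key" '
      { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
      name == k { found = 1 }
      END { exit !found }' "$config"; then
      printf 'missing: %s\n' "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

For the ICSI corpus this could be called as check_config_keys nomos.config corpora.orig.icsi.speech corpora.orig.icsi.transcripts. It only checks that the keys exist, not that their values point at real directories.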

Running

The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportIcsiMeetingCorpus.

Using

The script produces the following models:

icsi/
  icsimr/
    metadata/v1.n3
    people/v1.n3
    bleeps/v1.n3
  Bdb001/
  Bed002/
  ...
  Buw001/
    segments/v1.n3
    transcripts/v1.n3

The metadata model contains meeting objects, their attributes, and the media files. The people model contains all the participants in the corpus and their persistent attributes. The bleeps file contains the censored periods. The segments model simply contains Events for all of the specified segments in the original transcript files. The transcripts model adds the actual transcripts to those segments which have them.

See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.

NIST Meeting Pilot Corpus

Information on this corpus can be found at http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/index.html.

Preparation

The speech files are distributed on 9 DVDs from the LDC. The DVDs should be accumulated into one directory by joining up the contents of each DVD's speech directory into one common speech directory one level below the corpus parent directory, producing the following directory structure (the other files are duplicated on each DVD and are therefore here just once):

root/
    docs/
    generic_license.html
    index.html
    speech/
        NIST_20011115-1050/
        NIST_20020213-1012/
        ...
        NIST_20031204-1125/

The speech files are delivered in NIST Sphere format, which is incompatible with NOMOS. Therefore, the mixed audio for the head-mounted mics should be converted to wav format (you can use the following script):

# Convert the head-mounted mic mixes from Sphere to wav
for i in root/speech/NIST*/*_HM-mix*.sph; do
  sox -t sph "$i" "$i.wav"
done

Note that the directory structure should be preserved, and each new file should have the same name as the old one with .wav appended (e.g. file.sph becomes file.sph.wav).

The corpus transcripts and metadata can be obtained from the LDC and unpacked using this script:

cd root
gunzip LDC2004T13.tgz
tar -xvf LDC2004T13.tar
rm LDC2004T13.tar
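
If you would rather keep the downloaded archive, the same result can be had in one step. The sketch below assumes a tar that supports the -z and -C options (GNU tar and bsdtar both do); unpack_tgz is a hypothetical helper name.

```shell
# Sketch: unpack a gzipped tar archive in one step, keeping the archive.
# unpack_tgz is a hypothetical helper, not part of NOMOS.
# Usage: unpack_tgz ARCHIVE DEST_DIR
unpack_tgz() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  # -x extract, -z decompress gzip on the fly, -f archive, -C target directory
  tar -xzf "$archive" -C "$dest"
}
```

For example, unpack_tgz LDC2004T13.tgz root extracts the archive into root and leaves the .tgz file in place.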

The transcripts available from NIST at http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/transcripts/LDC2004E02.tgz appear to be slightly out of date.

Configuration

The following variables, which each should point to the location of the resources on disk, need to be set in your nomos.config file:

  • corpora.orig.nist.speech
  • corpora.orig.nist.transcripts

Running

The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportNistMeetingPilotCorpus.

Using

The NIST corpus does not contain many annotations beyond transcripts and some basic metadata about the meeting and participants. The script produces the following models:

nist/
  nistmp/
    media/v1.n3
    meetings/v1.n3
    people/v1.n3
  NIST_20011115-1050/transcripts/v1.n3
  NIST_20020213-1012/transcripts/v1.n3
  ...
  NIST_20031204-1125/transcripts/v1.n3

The media model contains one RecordAudio event for each meeting, with start and end times corresponding to the meeting's start and end times. These events point (via the file attribute) to the mix of head-mounted mics. The meetings model contains start and end times as well as participant roles. The people model contains all the participants in the corpus and their persistent attributes. The transcripts model contains Speak events for each of the segments in the transcript files, with the corresponding transcription attribute filled in.

See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.

ISL Meeting Corpus (Part 1)

Information on this corpus can be found at http://penance.is.cs.cmu.edu/meeting_room/.

NOTE: The ISL import script does not load meeting m039 due to its unusual segmentation into two recordings.

Preparation

The speech files are distributed on 2 DVDs from the LDC. The DVDs should be accumulated into one directory by joining up the contents of each DVD's speech directory into one common speech directory one level below the corpus parent directory, producing the following directory structure (the other files are duplicated on each DVD and are therefore here just once):

root/
  docs/
  generic_license.html
  index.html
  speech/
    m035_2.wav
    m036_7.wav
    ...
    m064_mix.wav

The corpus transcripts and metadata can be obtained from the LDC and unpacked using this script:

cd root
gunzip LDC2004T10.tgz
tar -xvf LDC2004T10.tar
rm LDC2004T10.tar

Configuration

The following variables, which each should point to the location of the resources on disk, need to be set in your nomos.config file:

  • corpora.orig.isl.speech
  • corpora.orig.isl.transcripts

Running

The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportIslMeetingCorpus.

Using

The ISL corpus does not contain many annotations beyond transcripts and some basic metadata about the meeting and participants. The script produces the following models:

isl/
  islmp/
    media/v1.n3
    meetings/v1.n3
    people/v1.n3
  m035/transcripts/v1.n3
  m036/transcripts/v1.n3
  ...
  m064/transcripts/v1.n3

The media model contains one RecordAudio event for each meeting, with start and end times corresponding to the meeting's start and end times. These events point (via the file attribute) to the mix of head-mounted mics. The meetings model contains start and end times as well as participant roles. The people model contains all the participants in the corpus and their persistent attributes. The transcripts model contains Speak events for each of the segments in the transcript files, with the corresponding transcription attribute filled in.

See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.

AMI Meeting Corpus

Information on this corpus can be found at http://www.idiap.ch/amicorpus.

Preparation

The corpus files (not including the annotations) are distributed via the web site using download scripts. The scripts place all the files in a pre-determined structure under a folder called amicorpus. The layout of these files on disk, if left unmodified, will be something like this (with some file types omitted for brevity):

{ROOT}/
  amicorpus/
    {MEETINGNAME}/
      audio/
      video/
    ...

The corpus annotations are available separately as a downloadable archive file. When unzipped, it produces a structure like this:

{ROOT}/
  00README.txt
  abstractive
  AMI-metadata.xml
  ...

Configuration

The following variables need to be set in your nomos.config file to allow importing and, later, access to the media files. Their values should be the names of the two ROOT directories listed above (at CSLI, these are called ami-annotations and ami-signals). Both directories should be located where the "original file" and "file lookup" paths (corpora.orig.path and file.lookup.path, respectively) can find them:

  • corpora.orig.ami.annotations
  • corpora.orig.ami.signals

Running

The import script class name to run is csli.dialog.corpora.importers.ami.ImportAmi.

Using

Coming soon...

CALO Y2 Data

Follow the importing instructions below for importing CALO media files and for creating NOMOS versions of the annotations (e.g. transcriptions, ASR output, etc). This will allow you to use Y2 SRI recordings in NOMOS for viewing, querying and processing.

Configuration

The default NOMOS configuration properties (for the version provided in the SRI CVS) may be found in calo/lib/config/calo.config. In the same manner described in the NOMOS Manual, you will need to override some of these configurations using a calo.local.config file.

All NOMOS import scripts require the corpora.orig.path configuration parameter to be set. It lists the locations on disk that contain the original files to import. The file.lookup.path key should be set to the same locations, since it is used to find files for media playback. Both should be specified as absolute paths.

Next, for each specific corpus that NOMOS can import (or in the CALO case, each main part of a corpus), a parameter should be set to hold the name of the directory that contains the associated data. These values are not paths but simply directory names. The list of paths in corpora.orig.path is then searched for the directory names given in the corpus-specific properties. The following is how we set up our config file here at CSLI for the Y2 data:

file.lookup.path=[/shared/corpora]
corpora.orig.path=[/shared/corpora]
corpora.orig.sricaloy2.mokbs=sri-calo-y2-mokbs
corpora.orig.sricaloy2.transcripts=sri-calo-y2-transcripts
corpora.orig.sricaloy2.recordings=sri-calo-y2-recordings
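
To make the lookup concrete, the sketch below mimics it in shell: given a directory name and a list of search roots, it prints the first root that contains a directory of that name. This is only an illustration of the behaviour described above. resolve_corpus_dir is a hypothetical helper, and the first-match-wins rule is an assumption; NOMOS's own lookup may differ.

```shell
# Sketch: find a corpus directory by name under a list of search roots.
# resolve_corpus_dir is a hypothetical helper; first match wins (assumption).
# Usage: resolve_corpus_dir DIR_NAME ROOT [ROOT ...]
resolve_corpus_dir() {
  name=$1
  shift
  for root in "$@"; do
    if [ -d "$root/$name" ]; then
      printf '%s\n' "$root/$name"
      return 0
    fi
  done
  # No search root contains the named directory
  return 1
}
```

With the CSLI settings shown above, resolve_corpus_dir sri-calo-y2-mokbs /shared/corpora would print /shared/corpora/sri-calo-y2-mokbs if that directory exists, and return a non-zero status otherwise.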

Preparation

The import script expects the data on disk in generally the same layout as on bigtivo, except with the various sequences placed into a single directory. Simply download the .tgz files provided on the bigtivo web site and construct a file structure like this:

[corpora.orig.path]/
  [corpora.orig.sricaloy2.recordings]/
    seq-C/
    seq-D/
    ...
    seq-H/
      1117571796000/
      1117572737000/
      ...
      1117575878000/
        MOKB/
        charter.ink
        CAMEO_130.107.94.66/
        CAMEO_130.107.94.164/
        ...
  [corpora.orig.sricaloy2.mokbs]/
    seq-G-exper-mokb.n3
    seq-G-inexper-mokb.n3
    seq-H-exper-mokb.n3
    seq-H-inexper-mokb.n3
  [corpora.orig.sricaloy2.transcripts]/
    Meeting Sequence G/
    ...
      Meeting 1
      ...
        Transcriptions
          2005_05_24_14_38_26_065_jmarlow.trs
          2005_05_24_14_38_26_336_jpark.trs
          2005_05_24_14_38_26_662_john_pedersen.trs
          trans-14.dtd
          trans-13.dtd
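
Because directory names in the transcripts tree contain spaces (for example "Meeting Sequence G"), quick checks of the layout need careful quoting. The sketch below lists the .trs files found under a given directory; list_trs is a hypothetical helper, not a NOMOS tool.

```shell
# Sketch: list the .trs transcript files under a transcripts directory.
# The path is quoted because directory names here contain spaces.
# list_trs is a hypothetical helper, not part of NOMOS.
# Usage: list_trs TRANSCRIPTS_DIR
list_trs() {
  find "$1" -type f -name '*.trs' | sort
}
```

An empty result suggests that the transcripts were unpacked somewhere other than the directory named by corpora.orig.sricaloy2.transcripts.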

Importing

Open NOMOS and choose Run > Run Import Script.... Choose the class named ImportCaloY2SriRecordings. This script will create NOMOS-compatible datasets in the directory specified by the opi.file.archive parameter.

Using

The meetings are now available as NOMOS sessions. Open the one you want by following the normal procedure for opening a session in NOMOS.

 
