NOMOS Corpus Importing
This page explains how to load supported corpora into NOMOS so that they can be viewed, processed, or annotated. One of two situations applies when importing a corpus: (1) a pre-existing import script is available, or (2) you need to write your own importer plugin. This page documents the former; the latter is documented in the Advanced Usage section. The typical procedure for importing a corpus is as follows:
- Preparation: Prepare the source files on disk. The import script expects the source corpus to be available on disk in a specific directory structure so that it can find the input files it needs. In general, the directory structure should reflect the original structure of the source archive itself, e.g. that of the LDC DVDs. For resources supplied as zip or tar files, create a new directory for each resource and unpack the file into that directory. For CD/DVD, the directory should mimic the directory structure of the disc. For multi-DVD corpora, special instructions apply (see below).
- Configuration: Each script requires some configuration variables to be set in the nomos.config file so that the script can find the appropriate resources on your local system. This typically involves specifying a name for the root directory in which the corpus may be found.
- Running: To run a pre-packaged import script, choose the menu option "Run" > "Run Import Script...". Select the import script you wish to run from the pull-down list, and click "OK". Some import scripts take several minutes to finish. The script reads the input files from the corpus and produces a set of NOMOS/OPI models containing the corpus information in a format that NOMOS can annotate.
- Using: To use the annotations produced by the script for visualization or further annotation, you need to understand the ontology with which the script has encoded the information. The ontologies are documented on the separate Annotation Ontology Documentation page.
The following sections describe these details for each of the preexisting NOMOS import scripts.
ICSI Meeting Corpus
Preparation
The ICSI Meeting Speech DVDs must be combined into a single directory. Each DVD from the LDC comes with the same set of metadata files; the only difference between the DVDs is in the speech/ directory. Therefore, the directory containing the ICSI speech should mimic a single DVD, but its speech/ directory should contain the combined contents of the speech/ directories on all the DVDs. The directory structure will look like this:
root/
doc/
index.html
speech/
Bed001/
Bed002/
...
Buw001/
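The merge described above can be sketched in shell. This is only an illustration under stated assumptions: each DVD has already been copied (or mounted) to its own directory, and the function name merge_dvds and all paths are hypothetical rather than part of the NOMOS distribution.

```shell
# Usage: merge_dvds DEST_ROOT DVD_DIR [DVD_DIR ...]
# Builds DEST_ROOT so that it mimics a single DVD whose speech/ directory
# holds the meeting directories of every disc. All paths are hypothetical.
merge_dvds() {
    dest=$1
    shift
    mkdir -p "$dest/speech" || return 1
    first=1
    for dvd in "$@"; do
        if [ "$first" -eq 1 ]; then
            # Metadata is the same on every disc, so take it from the first one only.
            for entry in "$dvd"/*; do
                [ -e "$entry" ] || continue
                if [ "$(basename "$entry")" != "speech" ]; then
                    cp -R "$entry" "$dest/" || return 1
                fi
            done
            first=0
        fi
        # Every disc contributes its own meeting directories under speech/.
        for meeting in "$dvd"/speech/*; do
            [ -e "$meeting" ] || continue
            cp -R "$meeting" "$dest/speech/" || return 1
        done
    done
}
```

For example, `merge_dvds /shared/corpora/icsi-speech /mnt/dvd1 /mnt/dvd2` would populate one root from two disc copies; both mount points are placeholders.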
The original sound files are in "Shorten"-compressed NIST Sphere format, which NOMOS cannot play. Therefore, each sound file must be converted to wav format. A conversion script is provided in the utils directory of the NOMOS distribution. For every sound file chanX.sph, the script creates a file chanX.sph.unshort.wav. By default, the NOMOS annotations created by the ICSI import script refer to these converted files. Here is the script for your reference:
# Decompress each Shorten-compressed Sphere file, then convert it to wav.
for i in speech/*/*.sph; do
    shorten -x "$i" "$i.unshort"
    sox -t sph "$i.unshort" "$i.unshort.wav"
done
Configuration
The following variables, each of which should point to the location of the corresponding resources on disk, need to be set in your nomos.config file:
- corpora.orig.icsi.speech
- corpora.orig.icsi.transcripts
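Before running an import, it can help to confirm that the required keys are actually present. The following is a minimal shell sketch, assuming that nomos.config holds one key=value pair per line with no spaces around the equals sign; the function name missing_config_keys is made up for illustration and is not part of NOMOS.

```shell
# Usage: missing_config_keys CONFIG_FILE KEY [KEY ...]
# Prints every KEY that has no "KEY=..." line in CONFIG_FILE.
# Returns 0 when all keys are present and 1 otherwise.
missing_config_keys() {
    config=$1
    shift
    missing=0
    for key in "$@"; do
        found=0
        # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
        while IFS= read -r line || [ -n "$line" ]; do
            case $line in
                "$key="*) found=1 ;;
            esac
        done < "$config"
        if [ "$found" -eq 0 ]; then
            echo "$key"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `missing_config_keys nomos.config corpora.orig.icsi.speech corpora.orig.icsi.transcripts` prints any key that still needs an entry. The same check works for the keys of the other corpora on this page.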
Running
The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportIcsiMeetingCorpus.
Using
The script produces the following models:
icsi/
icsimr/
metadata/v1.n3
people/v1.n3
bleeps/v1.n3
Bdb001/
Bed002/
...
Buw001/
segments/v1.n3
transcripts/v1.n3
The metadata model contains meeting objects, their attributes, and the media files. The people model contains all the participants in the corpus and their persistent attributes. The bleeps model contains the censored periods. The segments model simply contains Events for all of the specified segments in the original transcript files. The transcripts model adds the actual transcripts to those segments which have them.
See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.
NIST Meeting Pilot Corpus
Information on this corpus can be found at http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/index.html.
Preparation
The speech files are distributed on 9 DVDs from the LDC. Combine the DVDs into one directory by merging the contents of each DVD's speech directory into one common speech directory one level below the corpus parent directory. This produces the following directory structure (the other files are duplicated on each DVD and therefore appear here just once):
root/
docs/
generic_license.html
index.html
speech/
NIST_20011115-1050/
NIST_20020213-1012/
...
NIST_20031204-1125/
The speech files are delivered in NIST Sphere format, which is incompatible with NOMOS. Therefore, the mixed audio for the head-mounted mics should be converted to wav format (you can use the following script):
# Convert the head-mounted mic mixes from Sphere to wav.
for i in root/speech/NIST*/*_HM-mix*.sph; do
    sox -t sph "$i" "$i.wav"
done
Note that the directory structure should be preserved and each new file should have the same name as the old one, but with .wav appended (e.g. file.sph becomes file.sph.wav).
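Since the import relies on the converted files being in place, a quick check for missing conversions may be worthwhile. The sketch below is an illustration only; the function name unconverted_files and its arguments are hypothetical, and it checks nothing beyond the naming convention described above.

```shell
# Usage: unconverted_files SPEECH_DIR [NAME_PATTERN [SUFFIX]]
# Prints every file under SPEECH_DIR matching NAME_PATTERN (default '*.sph')
# that has no sibling file with SUFFIX (default '.wav') appended to its name.
unconverted_files() {
    dir=$1
    pattern=${2:-*.sph}
    suffix=${3:-.wav}
    find "$dir" -type f -name "$pattern" | while IFS= read -r sph; do
        [ -f "$sph$suffix" ] || echo "$sph"
    done
}
```

For this corpus, `unconverted_files root/speech '*_HM-mix*.sph' .wav` lists head-mounted mixes that still lack a wav copy. For the ICSI naming scheme, the suffix would be `.unshort.wav` instead.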
The corpus transcripts and metadata can be obtained from the LDC and unpacked using this script:
cd root
gunzip LDC2004T13.tgz
tar -xvf LDC2004T13.tar
rm LDC2004T13.tar
The transcripts available from NIST at http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/transcripts/LDC2004E02.tgz appear to be a slightly older version.
Configuration
The following variables, each of which should point to the location of the corresponding resources on disk, need to be set in your nomos.config file:
- corpora.orig.nist.speech
- corpora.orig.nist.transcripts
Running
The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportNistMeetingPilotCorpus.
Using
The NIST corpus does not contain many annotations beyond transcripts and some basic metadata about the meeting and participants. The script produces the following models:
nist/
nistmp/
media/v1.n3
meetings/v1.n3
people/v1.n3
NIST_20011115-1050/transcripts/v1.n3
NIST_20020213-1012/transcripts/v1.n3
...
NIST_20031204-1125/transcripts/v1.n3
The media model contains one RecordAudio event for each meeting, with start and end times corresponding to the meeting's start and end times. These events point (via the file attribute) to the mix of head-mounted mics. The meetings model contains start and end times as well as participant roles. The people model contains all the participants in the corpus and their persistent attributes. The transcripts model contains Speak events for each of the segments in the transcript files, with the corresponding transcription attribute filled in.
See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.
ISL Meeting Corpus (Part 1)
Information on this corpus can be found at http://penance.is.cs.cmu.edu/meeting_room/.
NOTE: The ISL import script does not load meeting m039 due to its unusual segmentation into two recordings.
Preparation
The speech files are distributed on 2 DVDs from the LDC. Combine the DVDs into one directory by merging the contents of each DVD's speech directory into one common speech directory one level below the corpus parent directory. This produces the following directory structure (the other files are duplicated on each DVD and therefore appear here just once):
root/
docs/
generic_license.html
index.html
speech/
m035_2.wav
m036_7.wav
...
m064_mix.wav
The corpus transcripts and metadata can be obtained from the LDC and unpacked using this script:
cd root
gunzip LDC2004T10.tgz
tar -xvf LDC2004T10.tar
rm LDC2004T10.tar
Configuration
The following variables, each of which should point to the location of the corresponding resources on disk, need to be set in your nomos.config file:
- corpora.orig.isl.speech
- corpora.orig.isl.transcripts
Running
The import script class name to run is csli.dialog.corpora.scripts.loaders.ImportIslMeetingCorpus.
Using
The ISL corpus does not contain many annotations beyond transcripts and some basic metadata about the meeting and participants. The script produces the following models:
isl/
islmp/
media/v1.n3
meetings/v1.n3
people/v1.n3
m035/transcripts/v1.n3
m036/transcripts/v1.n3
...
m064/transcripts/v1.n3
The media model contains one RecordAudio event for each meeting, with start and end times corresponding to the meeting's start and end times. These events point (via the file attribute) to the mix of head-mounted mics. The meetings model contains start and end times as well as participant roles. The people model contains all the participants in the corpus and their persistent attributes. The transcripts model contains Speak events for each of the segments in the transcript files, with the corresponding transcription attribute filled in.
See the Annotation Ontology Documentation page for details on how the formatted annotations are structured.
AMI Meeting Corpus
Information on this corpus can be found at http://www.idiap.ch/amicorpus.
Preparation
The corpus files (not including the annotations) are distributed via the web site using download scripts. The scripts place all the files in a pre-determined structure under a folder called amicorpus. The layout of these files on disk, if left unmodified, will be something like this (with some file types omitted for brevity):
{ROOT}/
amicorpus/
{MEETINGNAME}/
...
audio/
video/
The corpus annotations are available separately as a downloadable archive file. When unzipped, it produces a structure like this:
{ROOT}/
00README.txt
abstractive
AMI-metadata.xml
...
Configuration
The following variables need to be set in your nomos.config file to allow importing and later access to media files. Their values should be the names of the two ROOT directories listed above (at CSLI, these are called ami-annotations and ami-signals). Both directories should be located where the "original file" and "file lookup" paths (corpora.orig.path, file.lookup.path) can find them:
- corpora.orig.ami.annotations
- corpora.orig.ami.signals
Running
The import script class name to run is csli.dialog.corpora.importers.ami.ImportAmi.
Using
Coming soon...
CALO Y2 Data
Follow the instructions below to import CALO media files and to create NOMOS versions of the annotations (e.g. transcriptions, ASR output, etc.). This will allow you to use Y2 SRI recordings in NOMOS for viewing, querying and processing.
Configuration
The default NOMOS configuration properties (for the version provided in the SRI CVS) may be found in calo/lib/config/calo.config. As described in the NOMOS Manual, you will need to override some of these settings using a calo.local.config file.
All NOMOS import scripts require that the corpora.orig.path configuration parameter be set. It lists the locations on disk that contain the original files to import. The file.lookup.path key should be set to the same locations, since it is used to find files for media playback. Both should be given as absolute paths.
Next, for each specific corpus that NOMOS can import (or in the CALO case, each main part of a corpus), set a parameter to the name of the directory that holds the associated data. These values are not paths, just directory names. The list of paths in corpora.orig.path is then searched for the directory names given in the corpus-specific properties. The following is how we set up our config file here at CSLI for the Y2 data:
file.lookup.path=[/shared/corpora]
corpora.orig.path=[/shared/corpora]
corpora.orig.sricaloy2.mokbs=sri-calo-y2-mokbs
corpora.orig.sricaloy2.transcripts=sri-calo-y2-transcripts
corpora.orig.sricaloy2.recordings=sri-calo-y2-recordings
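To make the lookup concrete, the search of the path list for a corpus directory name can be mimicked in shell. This is a sketch of the behavior as described on this page, not NOMOS's actual implementation; the function name resolve_corpus_dir is hypothetical, and the sketch assumes the first matching root is the one used.

```shell
# Usage: resolve_corpus_dir DIR_NAME ROOT [ROOT ...]
# Prints the first ROOT/DIR_NAME that exists as a directory.
# Returns 1 when no root contains DIR_NAME.
resolve_corpus_dir() {
    name=$1
    shift
    for root in "$@"; do
        if [ -d "$root/$name" ]; then
            echo "$root/$name"
            return 0
        fi
    done
    return 1
}
```

With the configuration shown above, `resolve_corpus_dir sri-calo-y2-recordings /shared/corpora` would print /shared/corpora/sri-calo-y2-recordings if that directory exists, which is a quick way to check a setup before running the import.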
Preparation
The import script expects the data to be laid out on disk in generally the same way as on bigtivo, except with the various sequences placed into a single directory. Download the .tgz files provided on the bigtivo web site and construct a file structure like this:
[corpora.orig.path]/
[corpora.orig.sricaloy2.recordings]/
seq-C/
seq-D/
...
seq-H/
1117571796000/
1117572737000/
...
1117575878000/
MOKB/
charter.ink
CAMEO_130.107.94.66/
CAMEO_130.107.94.164/
...
[corpora.orig.sricaloy2.mokbs]/
seq-G-exper-mokb.n3
seq-G-inexper-mokb.n3
seq-H-exper-mokb.n3
seq-H-inexper-mokb.n3
[corpora.orig.sricaloy2.transcripts]/
Meeting Sequence G/
...
Meeting 1
...
Transcriptions
2005_05_24_14_38_26_065_jmarlow.trs
2005_05_24_14_38_26_336_jpark.trs
2005_05_24_14_38_26_662_john_pedersen.trs
trans-14.dtd
trans-13.dtd
Importing
Open NOMOS and choose Run > Run Import Script.... Choose the class named ImportCaloY2SriRecordings. This script creates NOMOS-compatible datasets in the directory specified by the opi.file.archive parameter.
Using
The meetings are now available as NOMOS sessions. Open the one you want by following the normal procedure for opening a session.