csli.dialog.app.calo.topic.classification.topicextraction
Class Topic

java.lang.Object
  extended by csli.dialog.app.calo.topic.classification.topicextraction.WordDistribution
      extended by csli.dialog.app.calo.topic.classification.topicextraction.Topic
All Implemented Interfaces:
Serializable

public class Topic
extends WordDistribution

A word distribution class restricted to content words.

Author:
laidebeu
See Also:
Serialized Form

Field Summary
protected static Topic global
           
static org.apache.log4j.Logger logger
           
protected  HashSet<Integer> removedWords
           
protected  boolean removeRareWords
          Words with a count below 0.5 are removed, since they cannot be significant.
 
Fields inherited from class csli.dialog.app.calo.topic.classification.topicextraction.WordDistribution
distribution, ratios, stemmer
 
Constructor Summary
Topic(boolean rrw)
           
Topic(String s)
           
Topic(Topic a)
           
Topic(WordDistribution a, boolean rrw)
           
 
Method Summary
protected static void addNullWords(Collection<Integer> cs)
           
protected static void addNullWords(File f)
          Completes the stopword list by reading from a file.
 void clean()
           
static void clearFiles()
          Clears the files that have been saved on the disk.
 void delete()
           
static Set<Topic> getCachedTopics()
          Deprecated.  
static Topic getCriticalVector(ArrayList<SausageUtterance> sausages, int beg, int end)
           
static Topic getCriticalVector(ArrayList<SausageUtterance> sausages, int beg, int end, int offsegBeg, int offsegEnd)
           
static Topic getCriticalVector(Topic wd, Topic meetingWD)
           
static Topic getCriticalVector(WordDistribution worddist, ArrayList<SausageUtterance> sausages)
           
 String getDesc(int n)
           
static String[] getMeetingNames()
          Get the list of meetings in the corpus.
 String getName()
           
protected static Set<Integer> getNullWords()
           
 Boolean getTemp()
           
static Topic getTopicFromName(String name)
          Gets a named Topic from the pool.
 edu.stanford.nlp.util.Counter<String> getTopWordsCounter(int n)
           
 double getWeight(Integer i)
           
static boolean initGlobal()
          Ensures that the global word distribution used for computing topics, and its associated variables, are loaded.
 void keepSignificant(int n)
           
static Topic mixture(Collection<Pair<Topic,Double>> toMix)
          Computes the mixture of given topics.
 void printToStream(PrintStream out, boolean withNullWords)
           
 void readFromStream(BufferedReader in)
           
static boolean reinitGlobal(String meeting)
          Ensures that the global (corpus-wide) word distribution is loaded and includes the given meeting; if not, forces it to be recalculated.
 void removeIrrelevant()
           
 void save(boolean temp)
           
 void setName()
           
 void setName(String name)
           
 void setRemoveableWords(WordDistribution ref)
          Removes all the words that are too rare in a set of WordDistributions.
 void shrink()
           
 void shrink(int n)
           
 void temporarySave()
           
 String toString()
           
protected  void updateNullwords()
           
 void userSave()
           
 
Methods inherited from class csli.dialog.app.calo.topic.classification.topicextraction.WordDistribution
addSausage, addSausage, addSausageUtterance, addSausageUtterance, addSausageUtterances, getCount, getDistribution, getOrthogonalDifference, getStemmer, myexp, mylog, positiveLogSimilarity, positiveSimilarity, removeSausage, removeSausageUtterance, similarity, size, splitWords, topKeys, toString, totalWeight
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

logger

public static org.apache.log4j.Logger logger

global

protected static Topic global

removeRareWords

protected boolean removeRareWords
Words with a count below 0.5 are removed, since they cannot be significant.
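
The threshold rule described above can be sketched as follows. This is a minimal illustration that assumes counts are held in a plain map from word ID to count. The class RareWordFilter and its method are hypothetical and are not part of this package.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of this package: drops words whose count
// is below 0.5 from a word-ID -> count map.
final class RareWordFilter {
    // Threshold taken from the field description.
    static final double MIN_COUNT = 0.5;

    // Removes rare entries in place and returns the IDs that were removed.
    static Set<Integer> removeRare(Map<Integer, Double> counts) {
        Set<Integer> removed = new HashSet<>();
        counts.entrySet().removeIf(e -> {
            boolean rare = e.getValue() < MIN_COUNT;
            if (rare) {
                removed.add(e.getKey());
            }
            return rare;
        });
        return removed;
    }
}
```

Returning the removed IDs mirrors the idea of a removedWords set, although how Topic actually populates that field is not documented here.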


removedWords

protected HashSet<Integer> removedWords

Constructor Detail

Topic

public Topic(String s)

Topic

public Topic(Topic a)

Topic

public Topic(boolean rrw)

Topic

public Topic(WordDistribution a,
             boolean rrw)

Method Detail

addNullWords

protected static void addNullWords(File f)
Completes the stopword list by reading from a file. Must be called before working with topics.

Parameters:
f - the stopwords list (one word per line)
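
As a rough sketch of the expected input format (one word per line), the following reads such a list into a set. StopwordListReader is a hypothetical name, and the mapping from words to the Integer IDs that getNullWords() returns is deliberately left out, since it is not described here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical reader, not this class's implementation: parses a stopword
// list with one word per line, skipping blank lines and duplicates.
final class StopwordListReader {
    static Set<String> read(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

To read from a file, a java.io.FileReader could be passed as the source.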

updateNullwords

protected void updateNullwords()

getNullWords

protected static Set<Integer> getNullWords()

addNullWords

protected static void addNullWords(Collection<Integer> cs)

clearFiles

public static void clearFiles()
Clears the files that have been saved to disk. Intended for use during development only.


initGlobal

public static boolean initGlobal()
Ensures that the global word distribution used for computing topics, and its associated variables, are loaded. If it is not already loaded, it is loaded from topic.extraction.globalWordDistribution.filename, or calculated over the whole corpus if that file does not exist.

Returns:
true if the global word distribution was loaded successfully, false otherwise.

reinitGlobal

public static boolean reinitGlobal(String meeting)
Ensures that the global (corpus-wide) word distribution is loaded and includes the given meeting; if not, forces it to be recalculated.

Parameters:
meeting - the meeting that the global word distribution must include
Returns:
false on error

getWeight

public double getWeight(Integer i)

setRemoveableWords

public void setRemoveableWords(WordDistribution ref)
Removes all the words that are too rare in a set of WordDistributions.

Parameters:
ref - the WordDistribution from which the too-rare words are determined

removeIrrelevant

public void removeIrrelevant()

keepSignificant

public void keepSignificant(int n)

shrink

public void shrink()

shrink

public void shrink(int n)

clean

public void clean()

getCriticalVector

public static Topic getCriticalVector(ArrayList<SausageUtterance> sausages,
                                      int beg,
                                      int end,
                                      int offsegBeg,
                                      int offsegEnd)

getCriticalVector

public static Topic getCriticalVector(WordDistribution worddist,
                                      ArrayList<SausageUtterance> sausages)

getCriticalVector

public static Topic getCriticalVector(Topic wd,
                                      Topic meetingWD)

getCriticalVector

public static Topic getCriticalVector(ArrayList<SausageUtterance> sausages,
                                      int beg,
                                      int end)

mixture

public static Topic mixture(Collection<Pair<Topic,Double>> toMix)
Computes the mixture of the given topics. The topics are reweighted so that they all have equal importance, rather than letting the topics with the largest total word weight dominate the others.

Parameters:
toMix - the set of topics to be merged, with their coefficients
Returns:
the mixed topic
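
One plausible reading of this reweighting is sketched below: each topic is scaled to a total weight of 1 before its coefficient is applied. This is an assumption about the behaviour, not the documented implementation. MixtureSketch and Weighted are hypothetical stand-ins for Topic and Pair<Topic,Double>.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not this class's implementation: each input
// distribution is scaled to total weight 1 before its coefficient is applied.
final class MixtureSketch {
    // Simple stand-in for Pair<Topic, Double>.
    static final class Weighted {
        final Map<Integer, Double> weights;
        final double coefficient;

        Weighted(Map<Integer, Double> weights, double coefficient) {
            this.weights = weights;
            this.coefficient = coefficient;
        }
    }

    static Map<Integer, Double> mix(List<Weighted> toMix) {
        Map<Integer, Double> mixed = new HashMap<>();
        for (Weighted w : toMix) {
            double total = 0.0;
            for (double v : w.weights.values()) {
                total += v;
            }
            if (total <= 0.0) {
                continue; // an empty distribution contributes nothing
            }
            for (Map.Entry<Integer, Double> e : w.weights.entrySet()) {
                double scaled = w.coefficient * e.getValue() / total;
                mixed.merge(e.getKey(), scaled, Double::sum);
            }
        }
        return mixed;
    }
}
```

With this normalization, a topic whose raw weights sum to 200 and one whose weights sum to 2 contribute equally when given the same coefficient.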

getTopWordsCounter

public edu.stanford.nlp.util.Counter<String> getTopWordsCounter(int n)
Parameters:
n - the number of top words we want
Returns:
the subpart of the counter corresponding to the top n keys.
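
Selecting the top n keys can be sketched as below. The example uses a plain Map<String, Double> in place of edu.stanford.nlp.util.Counter so that it is self-contained. TopWordsSketch is a hypothetical name and does not reflect how this method is actually implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch using a plain map instead of a Counter: keeps the
// n entries with the highest counts, in descending order of count.
final class TopWordsSketch {
    static Map<String, Double> topN(Map<String, Double> counts, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // Clamp n so that negative or oversized values are handled safely.
        int limit = Math.min(Math.max(n, 0), entries.size());
        Map<String, Double> top = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, limit)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

A LinkedHashMap is used for the result so that iteration follows the ranking.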

getDesc

public String getDesc(int n)

getName

public String getName()

setName

public void setName(String name)

setName

public void setName()

toString

public String toString()
Overrides:
toString in class WordDistribution

getTopicFromName

public static Topic getTopicFromName(String name)
Gets a named Topic from the pool.

Parameters:
name - the name of the Topic
Returns:
the named Topic, or null if no Topic of this name exists

delete

public void delete()

save

public void save(boolean temp)
          throws IOException
Throws:
IOException

userSave

public void userSave()
              throws IOException
Throws:
IOException

temporarySave

public void temporarySave()
                   throws IOException
Throws:
IOException

getCachedTopics

public static Set<Topic> getCachedTopics()
Deprecated. 


getTemp

public Boolean getTemp()

printToStream

public void printToStream(PrintStream out,
                          boolean withNullWords)

readFromStream

public void readFromStream(BufferedReader in)
                    throws IOException
Throws:
IOException

getMeetingNames

public static String[] getMeetingNames()
Get the list of meetings in the corpus.

Returns:
a sorted array of meeting names.