ISCSLP Tutorial 2

Prof. Chung-Hsien

* Arousal & Valence coordinates (the 2-D emotion space)
* separate the emotion process into sub-emotions

* available databases / database collection:
:* acted: GEneva Multimodal Emotion Portrayals (GEMEP)
:* induced: eNTERFACE'05 EMOTION Database
:* spontaneous: SEMAINE, AFEW
:* others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC

* static vs dynamic modeling (see the sketch below)

STATIC:
:* low-level descriptors (LLDs) summarized by functionals
:* good for discriminating high- and low-arousal emotions
:* temporal information is lost; not suitable for long utterances; cannot detect changes in emotion

DYNAMIC:
:* frame-based: LLDs are extracted per frame and modeled by GMMs, HMMs, or DTW
:* temporal information is preserved
:* difficult to model context well
:* a large number of local features need to be extracted
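
A minimal sketch (not from the tutorial) contrasting the two representations: functionals over LLDs give one fixed-length vector per utterance (static), while the frame-level LLD sequence itself is kept for GMM/HMM/DTW modeling (dynamic). The LLDs used here (log energy, ZCR) are only placeholders.

<pre>
# Static vs dynamic representation of one utterance (toy LLDs).
import numpy as np

def frame_llds(signal, frame_len=400, hop=160):
    """Toy frame-level LLDs: log energy and zero-crossing rate."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    feats = []
    for f in frames:
        log_energy = np.log(np.sum(f ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)
        feats.append([log_energy, zcr])
    return np.array(feats)                    # dynamic: (n_frames, n_llds)

def functionals(llds):
    """Static representation: summarize each LLD track with functionals."""
    return np.concatenate([llds.mean(0), llds.std(0), llds.min(0), llds.max(0)])

signal = np.random.randn(16000)               # 1 s of stand-in audio at 16 kHz
dynamic_feats = frame_llds(signal)            # sequence kept: temporal info preserved
static_feats = functionals(dynamic_feats)     # fixed length: temporal info lost
</pre>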
 
* Unit choice for dynamic modeling
:* technical units: frame, time slice, equally divided segments
:* meaningful units: word, syllable, phrase
:* emotionally consistent units: emotion profiles, emotograms
:* different aspects of the speech task take place at different scales

* feature concatenation or decision fusion to exploit the information from the segmented units
 
* speech features:
:* prosodic features: pitch, formants, energy, speaking rate; good for arousal-related emotions
:* ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
:* MFCC
:* Teager energy features are good for detecting stress
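
A hedged sketch of extracting several of the listed LLDs; librosa is an assumption here (the tutorial does not prescribe a toolkit), and formants and the harmonics-to-noise ratio are omitted.

<pre>
# Frame-level LLDs: ZCR, RMS energy, F0 and MFCCs for one waveform.
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.random.randn(2 * sr).astype(np.float32)   # stand-in waveform

zcr = librosa.feature.zero_crossing_rate(y)             # (1, n_frames)
rms = librosa.feature.rms(y=y)                          # (1, n_frames)
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)           # (n_frames,)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, n_frames)

# Frame counts can differ slightly between extractors; trim to the shortest.
n = min(zcr.shape[1], rms.shape[1], len(f0), mfcc.shape[1])
llds = np.vstack([zcr[:, :n], rms[:, :n], f0[None, :n], mfcc[:, :n]]).T
print(llds.shape)                                       # (n_frames, 16)
</pre>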
 
* recognition models
:* SVM, ANN, HMM, GMM, CART
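
A small sketch with one of the listed recognizers (an SVM) trained on utterance-level functional vectors; the data below is synthetic and purely illustrative.

<pre>
# SVM emotion classifier on static (functional) feature vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))            # 200 utterances x 24 functionals
y = rng.integers(0, 4, size=200)          # 4 emotion classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, probability=True).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
</pre>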
 
* Emotion distillation framework
:* distill emotion-specific features from the original high-dimensional features
:* from the speech signal, SVMs generate emotiongrams, which are then decoded by HMMs, n-grams, LDA, or a simple sum to give the emotion output
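
A sketch of the emotiongram idea under the same synthetic-data assumption as above: per-segment SVM posteriors over the emotion classes form the emotiongram, and a simple sum over segments gives the utterance-level decision (the HMM / n-gram / LDA back-ends are not shown).

<pre>
# Emotiongram: segment-level posteriors, fused by a simple sum.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_seg = rng.normal(size=(500, 24))        # training segments x features
y_seg = rng.integers(0, 4, size=500)      # segment-level emotion labels
seg_svm = SVC(probability=True).fit(X_seg, y_seg)

utt_segments = rng.normal(size=(12, 24))                 # 12 segments of one utterance
emotiongram = seg_svm.predict_proba(utt_segments)        # (12, 4) posteriors
utterance_emotion = int(np.argmax(emotiongram.sum(axis=0)))   # "simple sum" fusion
</pre>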
 
* Hierarchical classification structure
:* first detect high/low arousal, then classify within each branch
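
A sketch of that two-stage structure with placeholder classifiers and synthetic data: stage one predicts high vs low arousal, stage two applies a branch-specific classifier.

<pre>
# Hierarchical recognition: arousal detector, then per-branch classifiers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 24))
arousal = rng.integers(0, 2, size=300)        # 0 = low, 1 = high arousal
emotion = rng.integers(0, 2, size=300)        # within-branch emotion label

arousal_clf = SVC().fit(X, arousal)
branch_clf = {a: SVC().fit(X[arousal == a], emotion[arousal == a]) for a in (0, 1)}

def classify(x):
    a = int(arousal_clf.predict(x[None])[0])          # stage 1: high vs low arousal
    return a, int(branch_clf[a].predict(x[None])[0])  # stage 2: within-branch emotion

print(classify(X[0]))
</pre>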
 
* Fusion-based recognition
:* feature-level fusion
:* decision-level fusion
:* model-based fusion: multi-stream HMM
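
A sketch contrasting the first two fusion schemes on synthetic audio and video streams: feature-level fusion concatenates the streams before a single classifier, decision-level fusion averages per-stream posteriors; the model-based (multi-stream HMM) variant is not shown.

<pre>
# Feature-level vs decision-level fusion of two feature streams.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
audio_feats = rng.normal(size=(200, 24))
video_feats = rng.normal(size=(200, 16))
y = rng.integers(0, 4, size=200)

# feature-level fusion: one classifier on the concatenated feature vector
feat_fused = SVC(probability=True).fit(np.hstack([audio_feats, video_feats]), y)

# decision-level fusion: one classifier per stream, average their posteriors
audio_clf = SVC(probability=True).fit(audio_feats, y)
video_clf = SVC(probability=True).fit(video_feats, y)
posterior = 0.5 * (audio_clf.predict_proba(audio_feats[:1]) +
                   video_clf.predict_proba(video_feats[:1]))
decision = int(np.argmax(posterior))
</pre>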
 
* Temporal phase-based modeling
:* divide the emotion into onset, apex, offset phases
:* use an HMM to characterize one emotional sub-state instead of the entire emotional state
:* 6 states in total: (onset, apex, offset) × (high, low)
:* temporal course modeling
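
A hedged sketch of this phase-based idea using hmmlearn (an assumption, not named in the notes): one 3-state HMM (onset, apex, offset) per arousal class, i.e. 6 emotional sub-states in total, with classification by the highest log-likelihood.

<pre>
# Phase-based HMMs: one onset/apex/offset HMM per arousal class.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(4)

def toy_sequences(n_utts=20, dim=4):
    """Placeholder frame-level LLD sequences of varying length."""
    return [rng.normal(size=(rng.integers(40, 80), dim)) for _ in range(n_utts)]

def fit_phase_hmm(sequences):
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                     # 3 states ~ onset, apex, offset
    return model

models = {label: fit_phase_hmm(toy_sequences()) for label in ("high", "low")}

test_seq = toy_sequences(1)[0]
label = max(models, key=lambda k: models[k].score(test_seq))
</pre>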
 
* Structure-based modeling
:* three levels of units: utterance, emotion units, sub-emotion units
:* use statistical models across the different levels
 
Hsin-Min Wang

* Music information retrieval (MIR)
:* title search
:* search by query:
:* the emotion of a song labelled by multiple annotators forms a Gaussian in the VA space
:* represent the acoustic features of a song by a probability histogram vector
:* acoustic GMM posterior representation as a feature (see the sketch after this list)
:* GMM codebook constructed during training (VA GMM)
:* tags can also be mapped into the VA space
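
A hedged sketch of that GMM-posterior representation: a GMM codebook is fit on pooled acoustic frames at training time, and each song is then described by its average component posterior (a fixed-length histogram); the VA regression on top is not shown.

<pre>
# Acoustic GMM posterior representation of a song.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
training_frames = rng.normal(size=(5000, 20))        # pooled MFCC-like frames
codebook = GaussianMixture(n_components=32, covariance_type="diag",
                           random_state=0).fit(training_frames)

def song_representation(frames):
    """Average component posterior over the song's frames (32-dim histogram)."""
    return codebook.predict_proba(frames).mean(axis=0)

song_frames = rng.normal(size=(300, 20))             # frames of one test song
rep = song_representation(song_frames)               # fixed-length song feature
</pre>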
 
* Video-to-Audio Retrieval
:* first predict the emotion of the video
:* then retrieve audio with a matching emotion
:* this can also be reversed (audio-to-video retrieval)
 
Emotion variability, by Prof. Vidhyasaharan Sethu:

* GMM supervector based emotion recognition
:* t-SNE for visualization in 2-D space
:* remove phonetic variability with phone-dependent GMMs
:* speaker normalization is important for emotion recognition
:* two approaches: speaker adaptation & speaker normalization
:* KL-divergence-based estimation of speaker and emotion variability
:* speaker normalization by feature warping (see the sketch after this list)
:* '''speaker variation modeling with JFA'''
:* Speaker adaptation: speaker library
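
A hedged sketch of feature warping for speaker normalization: within a sliding window, each feature value is replaced by the standard-normal quantile of its rank, which flattens speaker-dependent shifts in the feature distribution.

<pre>
# Feature warping: map each LLD to N(0,1) quantiles within a sliding window.
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=301):
    """feats: (n_frames, dim) LLDs; returns warped features of the same shape."""
    n, d = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = feats[lo:hi]
        for j in range(d):
            # rank of the current value in the window -> standard-normal quantile
            rank = np.sum(window[:, j] < feats[t, j]) + 0.5
            warped[t, j] = norm.ppf(rank / len(window))
    return warped

warped = feature_warp(np.random.randn(500, 3) * 2.0 + 1.0)
</pre>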
 
* Cognitive load, by Julien Epps
:* cognitive load = arousal?
:* load measures: analytical measures (number of ++); physiological measures: EEG, ECG/HRV, GSR, respiration; task measures: speech, drawing, ...
:* glottal features
:* SDC (shifted delta cepstra): similar to delta-MFCCs but with a much longer shift, capturing longer-range temporal information (see the sketch below)

* Future: relationship between cognitive load and arousal; multimodal data; improved discrimination; testing under less constrained conditions
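
A hedged sketch of the standard shifted delta cepstra computation with the usual N-d-P-k parameterization: k delta blocks, each a +/-d frame difference, spaced P frames apart and stacked, i.e. delta-like features spanning a much longer time window.

<pre>
# Shifted delta cepstra (SDC) from an MFCC matrix.
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (n_frames, N) MFCCs; returns (n_frames, N * k) SDC features."""
    n, _ = cepstra.shape
    pad = d + P * (k - 1)
    padded = np.pad(cepstra, ((d, pad), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        # delta at offset i*P: c[t + i*P + d] - c[t + i*P - d]
        plus = padded[2 * d + i * P: 2 * d + i * P + n]
        minus = padded[i * P: i * P + n]
        blocks.append(plus - minus)
    return np.hstack(blocks)

feats = sdc(np.random.randn(200, 13))   # (200, 91) for a 13-dim, d=1, P=3, k=7 setup
</pre>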
