ISCSLP Tutorial 2

Prof. Chung-Hsien

  • Arousal & Valence coordinator
  • separate emotion process to sub emotions
  • available databases:
  • database collection methods:
  • acted: GEneva Multimodal Emotion Portrayals (GEMEP)
  • induced: eNTERFACE'05 EMOTION Database
  • spontaneous: SEMAINE, AFEW
  • others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
  • static vs dynamic modeling

STATIC:

  • low level descriptors (LLDs) and functionals
  • good for discriminating high- and low-arousal emotions
  • temporal information is lost, so it is not suitable for long utterances and cannot detect changes in emotion
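As a rough illustration of the static approach, the sketch below (Python with librosa/numpy is an assumption, as are the specific LLDs and functionals) collapses frame-level descriptors into one fixed-length vector per utterance:

    # Minimal sketch: frame-level LLDs -> utterance-level functionals (static modeling).
    import numpy as np
    import librosa

    def utterance_functionals(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
        zcr = librosa.feature.zero_crossing_rate(y)          # (1, n_frames)
        rms = librosa.feature.rms(y=y)                       # (1, n_frames)
        llds = np.vstack([mfcc, zcr, rms])
        # Functionals: statistics over the whole utterance; temporal order is discarded.
        stats = [llds.mean(axis=1), llds.std(axis=1), llds.min(axis=1), llds.max(axis=1)]
        return np.concatenate(stats)   # one fixed-length vector per utterance

    # The resulting vectors can then feed a static classifier such as an SVM.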

DYNAMIC:

  • frames are the basic unit; LLDs are extracted per frame and modeled by GMMs, HMMs, or DTW
  • temporal information is preserved
  • difficult to model context well
  • a large number of local features need to be extracted
  • Unit choice for dynamic modeling
  • technical unit: frame, time slice, equally-divided unit
  • meaningful unit: word, syllable, phrases
  • emotionally consistent unit: emotion profiles, emotograms
  • different aspects of speech take place at different time scales
  • feature concatenation or decision fusion to exploit the information from segmented units
  • speech features:
  • prosodic features (pitch, formants, energy, speaking rate) are good for arousal-related emotions
  • ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
  • Teager energy features are good for detecting stress
  • recognition models
  • SVM, ANN, HMM, GMM, CART
  • Emotion distillation framework
  • distill emotion-specific features from the original high-dimensional features
  • from the speech signal, SVMs generate emotiongrams, which are then decoded by an HMM, n-gram, LDA, or a simple sum to give the emotion output (a rough sketch follows this list)
  • Hierarchical classification structure
  • first detect high/low arousal, then classify the emotion within each arousal group (a sketch also follows this list)
  • Fusion based recognition
  • Feature level fusion
  • decision level fusion
  • Model-based fusion: multi-stream HMM
  • Temporal phase-based modeling
  • divide the emotion into onset, apex, offset
  • use an HMM to characterize each emotional sub-state, instead of the entire emotional state
  • 6 states in total: (onset, apex, offset) × (high, low)
  • Temporal course modeling
  • Structure-based modeling
  • three levels of units: utterance, emotion units, sub-emotion units
  • use statistical models across the different levels
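A rough sketch of the emotion-distillation idea above (scikit-learn is an assumption, as are the data shapes): a segment-level SVM produces emotion posteriors, i.e. an emotiongram, and a simple sum over time gives the utterance decision; the HMM / n-gram / LDA back-ends mentioned in the notes would consume the same emotiongram instead of the sum.

    # Emotiongram + simple-sum back-end (illustrative, not the tutorial's implementation).
    import numpy as np
    from sklearn.svm import SVC

    def train_segment_svm(seg_feats, seg_labels):
        # seg_feats: (n_segments, n_dims) features pooled over the training utterances
        return SVC(kernel="rbf", probability=True).fit(seg_feats, seg_labels)

    def emotiongram(svm, utt_segments):
        # Posterior over emotions for each segment of one utterance: (n_segments, n_emotions)
        return svm.predict_proba(utt_segments)

    def classify_utterance(svm, utt_segments):
        gram = emotiongram(svm, utt_segments)
        # "Simple sum" decoding; an HMM, n-gram, or LDA could decode `gram` instead.
        return svm.classes_[np.argmax(gram.sum(axis=0))]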
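And a minimal sketch of the hierarchical classification structure (again scikit-learn; the arousal grouping of the labels is an assumption): a first classifier detects high vs. low arousal, and a per-group classifier assigns the final emotion.

    # Two-stage hierarchical classification (illustrative grouping and features).
    import numpy as np
    from sklearn.svm import SVC

    HIGH_AROUSAL = {"angry", "happy"}   # assumed grouping of emotion labels

    def train_hierarchy(X, y):
        # X: (n_utterances, n_dims) numpy array, y: emotion labels
        y = np.asarray(y)
        arousal = np.array(["high" if e in HIGH_AROUSAL else "low" for e in y])
        stage1 = SVC().fit(X, arousal)                              # high vs. low arousal
        stage2 = {a: SVC().fit(X[arousal == a], y[arousal == a])    # emotion within each group
                  for a in ("high", "low")}
        return stage1, stage2

    def predict_emotion(stage1, stage2, x):
        a = stage1.predict(x.reshape(1, -1))[0]
        return stage2[a].predict(x.reshape(1, -1))[0]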


Hsin-Min Wang

  • Music information retrieval (MIR)
  • title search
  • search by query:
  • the emotion of a song, as labelled by multiple annotators, forms a Gaussian in the valence-arousal (VA) space
  • represent the acoustic features of a song by a probability histogram vector
  • acoustic GMM posterior representation as a feature (a rough sketch follows this list)
  • GMM codebook constructed during training (VA GMM)
  • tags can also be placed in the VA space
  • Video to Audio Retrieval
  • first predict the video's emotion
  • then retrieve audio with a matching emotion
  • the process can also be reversed (audio-to-video retrieval)
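A small sketch of the acoustic GMM posterior representation (scikit-learn, the codebook size, and the frame-feature matrices are assumptions): a GMM codebook is fit on training frames, and each song is represented by its average posterior over the mixture components.

    # GMM-posterior song representation (illustrative codebook size and features).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_codebook(train_frames, n_components=64):
        # train_frames: (n_frames, n_dims) acoustic frames pooled over the training songs
        return GaussianMixture(n_components=n_components,
                               covariance_type="diag").fit(train_frames)

    def song_posterior_vector(gmm, song_frames):
        # Average component posterior over one song's frames -> fixed-length vector
        return gmm.predict_proba(song_frames).mean(axis=0)   # (n_components,)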

Emotion variability, by Prof. Vidhyasaharan Sethu:

  • GMM supervector-based emotion recognition
  • t-SNE for visualization in 2-D space
  • remove phonetic variability with phone-dependent GMMs
  • speaker normalization is important for emotion recognition
  • two approaches: speaker adaptation & speaker signal normalization
  • KL-divergence-based estimation of speaker and emotion variability
  • speaker normalization by feature warping (a rough sketch follows this list)
  • speaker variation modeling with JFA
  • speaker adaptation: speaker library
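A compact sketch of feature warping (numpy/scipy and the window length are assumptions): within a sliding window, each feature value is replaced by the standard-normal value corresponding to its rank, which suppresses speaker- and channel-dependent shifts in the feature distribution.

    # Feature-warping sketch; operates independently on each feature dimension.
    import numpy as np
    from scipy.stats import norm

    def feature_warp(feats, win=300):
        # feats: (n_frames, n_dims) -> warped features of the same shape
        warped = np.empty_like(feats, dtype=float)
        half = win // 2
        n = len(feats)
        for t in range(n):
            lo, hi = max(0, t - half), min(n, t + half + 1)
            window = feats[lo:hi]
            # rank of the current frame's value within the window, mapped through
            # the inverse standard-normal CDF
            rank = (window < feats[t]).sum(axis=0) + 0.5
            warped[t] = norm.ppf(rank / len(window))
        return warped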

Cognitive load, by Prof. Julien Epps:

  • cognitive load = arousal?
  • load measures: analytical measures (number of ++); physiological measures: EEG, ECG/HRV, GSR, respiration; task measures: speech, drawing...
  • Glottal features
  • SDC (shifted delta cepstra): longer-gap MFCC deltas, quite similar to delta-MFCC but computed with a longer shift (see the sketch below)
  • Future work: the relationship between cognitive load and arousal; multimodal data; improved discrimination; testing under less constrained conditions
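A small sketch of shifted delta cepstra (numpy; the d/P/k values follow the common 7-1-3-7-style configuration and are an assumption, not from the tutorial): delta features are computed over a gap of d frames and stacked at shifts of P frames.

    # Shifted delta cepstra (SDC); parameter values are illustrative.
    import numpy as np

    def sdc(mfcc, d=1, P=3, k=7):
        # mfcc: (n_frames, N) cepstral matrix -> (n_frames, N*k) SDC features
        n_frames, N = mfcc.shape
        out = np.zeros((n_frames, N * k))
        for t in range(n_frames):
            for i in range(k):
                plus = min(n_frames - 1, t + i * P + d)
                minus = min(n_frames - 1, max(0, t + i * P - d))
                # delta over a gap of 2*d frames, shifted by i*P frames
                out[t, i * N:(i + 1) * N] = mfcc[plus] - mfcc[minus]
        return out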