ISCSLP Tutorial 2

Revision as of 06:38, 13 September 2014

Prof. Chung-Hsien

  • Arousal & Valence coordinates
  • separate the emotion process into sub-emotions
  • available databases:
  • database collection:
  • acted: Geneva Multimodal Emotion Portrayals (GEMEP)
  • induced: eNTERFACE'05 Emotion Database
  • spontaneous: SEMAINE, AFEW
  • others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
  • static vs. dynamic modeling

STATIC:

  • low-level descriptors (LLDs) and functionals
  • good for discriminating high- and low-arousal emotions
  • temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
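
A minimal sketch of the static approach above, assuming frame-level LLDs (e.g., MFCCs, pitch, energy) have already been extracted for the utterance: applying functionals to each LLD track yields one fixed-length vector per utterance regardless of duration. The functional set used here (mean, std, min, max, range, linear slope) is an illustrative choice, not a fixed standard.

```python
import numpy as np

def functionals(llds: np.ndarray) -> np.ndarray:
    """Summarize frame-level LLDs (n_frames x n_llds) with utterance-level functionals."""
    t = np.arange(len(llds))
    slope = np.polyfit(t, llds, deg=1)[0]          # linear trend of each LLD over time
    return np.concatenate([
        llds.mean(axis=0),
        llds.std(axis=0),
        llds.min(axis=0),
        llds.max(axis=0),
        llds.max(axis=0) - llds.min(axis=0),       # range
        slope,
    ])

# toy usage: 200 frames x 13 LLDs -> 13 * 6 = 78 static features
print(functionals(np.random.randn(200, 13)).shape)  # (78,)
```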

DYNAMIC:

  • frame as the basic unit; LLDs are extracted and modeled by GMMs, HMMs, DTW (see the GMM sketch after this list)
  • temporal information is captured
  • difficult to model context well
  • a large number of local features need to be extracted
  • Unit choice for dynamic modeling
  • technical units: frame, time slice, equally-divided unit
  • meaningful units: word, syllable, phrase
  • emotionally consistent units: emotion profiles, emotograms
  • different aspects of the speech task take place at different time scales
  • feature concatenation or decision fusion to exploit the information from segmented units
  • speech features (see the feature-extraction sketch after this list):
  • prosodic features: pitch, formants, energy, speaking rate; good for arousal-related emotions
  • ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
  • Teager energy features are good for detecting stress
  • recognition models
  • SVM, ANN, HMM, GMM, CART
  • Emotion distillation framework
  • distill emotion-specific features from the original high-dimensional features
  • from speech signals, use an SVM to generate emotograms, then apply HMM, n-gram, LDA, or a simple sum to produce the emotion output (see the emotogram sketch after this list)
  • Hierarchical classification structure
  • first detect high/low arousal
  • Fusion-based recognition
  • Feature-level fusion
  • decision-level fusion
  • Model-based fusion: multi-stream HMM
  • Temporal phase-based modeling
  • divide the emotion into onset, apex, offset
  • use an HMM to characterize one emotional sub-state instead of the entire emotional state
  • 6 states in total: (onset, apex, offset) × (high, low)
  • Temporal course modeling
  • Structure-based modeling
  • three levels of units: utterance, emotion units, sub-emotion units
  • use statistical models across the different levels
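
A small sketch of the frame-level features mentioned above (ZCR, RMS energy, and the Teager energy operator, which the notes single out for stress detection), using plain NumPy; the 25 ms frame / 10 ms hop at 16 kHz is only an illustrative choice.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a mono signal into overlapping frames (one frame per row)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def frame_features(x, frame_len=400, hop=160):
    frames = frame_signal(x, frame_len, hop)
    # zero-crossing rate: fraction of sign changes within each frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # RMS energy per frame
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]
    teo = frames[:, 1:-1] ** 2 - frames[:, :-2] * frames[:, 2:]
    return np.stack([zcr, rms, teo.mean(axis=1)], axis=1)   # (n_frames, 3)

# toy usage: 1 s of 16 kHz "audio" -> about 98 frames x 3 LLDs
print(frame_features(np.random.randn(16000)).shape)
```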
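A sketch of the frame-based dynamic modeling referenced in the list (the GMM case): one GMM per emotion is trained on pooled frame-level LLDs, and a test utterance is assigned to the emotion whose GMM gives the highest total log-likelihood over its frames. The label set and mixture sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["angry", "happy", "neutral", "sad"]      # illustrative label set

def train_frame_gmms(train_utts, n_components=32, seed=0):
    """train_utts: dict emotion -> list of (n_frames, n_llds) arrays."""
    models = {}
    for emo in EMOTIONS:
        frames = np.vstack(train_utts[emo])          # pool all frames of this emotion
        models[emo] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag",
                                      random_state=seed).fit(frames)
    return models

def classify(utt_frames, models):
    """Pick the emotion whose GMM best explains the utterance's frames."""
    scores = {emo: gmm.score_samples(utt_frames).sum() for emo, gmm in models.items()}
    return max(scores, key=scores.get)

# toy usage with synthetic "LLDs": each emotion gets a different mean
rng = np.random.default_rng(0)
train = {emo: [rng.normal(i, 1.0, size=(300, 13)) for _ in range(5)]
         for i, emo in enumerate(EMOTIONS)}
models = train_frame_gmms(train, n_components=4)
print(classify(rng.normal(2.0, 1.0, size=(200, 13)), models))   # expect "neutral"
```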
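A sketch of the distillation idea, assuming the utterance has already been segmented and each segment reduced to a static feature vector: a segment-level SVM yields per-segment emotion posteriors (an "emotogram"), and simple summing over segments gives the utterance decision; the HMM / n-gram / LDA back-ends mentioned above would replace the summing step. A hierarchical variant would first run the same pipeline for high/low arousal and then classify within each branch.

```python
import numpy as np
from sklearn.svm import SVC

def fit_segment_svm(seg_feats, seg_labels):
    """Train a segment-level SVM that outputs class posteriors."""
    return SVC(kernel="rbf", probability=True).fit(seg_feats, seg_labels)

def emotogram(utt_segments, svm):
    """Posterior over emotion classes for every segment of one utterance."""
    return svm.predict_proba(utt_segments)           # (n_segments, n_emotions)

def utterance_decision(utt_segments, svm):
    """Simple-sum fusion of the emotogram into one utterance-level label."""
    summed = emotogram(utt_segments, svm).sum(axis=0)
    return svm.classes_[np.argmax(summed)]

# toy usage: random segment features, binary arousal labels
rng = np.random.default_rng(1)
svm = fit_segment_svm(rng.normal(size=(120, 10)),
                      rng.choice(["high_arousal", "low_arousal"], size=120))
print(utterance_decision(rng.normal(size=(8, 10)), svm))
```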


Hsin-Min Wang

  • Music information retrieval (MIR)
  • title search
  • search by query
  • the emotion of a song labelled by multiple people forms a Gaussian
  • represent the acoustic features of a song by a probabilistic history vector
  • acoustic GMM posterior representation as a feature (see the sketch after this list)
  • GMM codebook constructed during training (VA GMM)
  • tags can be placed in the VA space
  • Video-to-audio retrieval
  • first predict the video's emotion
  • then retrieve audio with a matching emotion
  • this can be reversed (audio-to-video retrieval)
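
A sketch of the acoustic GMM posterior representation above, with illustrative sizes and function names: a GMM codebook is fitted on frame-level features pooled over training songs, each song is summarized by its mean posterior vector over the codebook, and query-by-example retrieval ranks the collection by cosine similarity between these vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(song_frames, n_components=64, seed=0):
    """Fit a GMM codebook on frame-level features pooled over training songs."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(np.vstack(song_frames))

def song_vector(frames, codebook):
    """Represent one song by its mean posterior over the GMM components."""
    return codebook.predict_proba(frames).mean(axis=0)       # sums to 1

def retrieve(query_frames, collection_vectors, codebook, top_k=5):
    """Rank the collection by cosine similarity to the query song."""
    q = song_vector(query_frames, codebook)
    sims = collection_vectors @ q / (
        np.linalg.norm(collection_vectors, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:top_k]

# toy usage: 20 "songs", each 500 frames x 12 acoustic features
rng = np.random.default_rng(0)
songs = [rng.normal(size=(500, 12)) for _ in range(20)]
cb = fit_codebook(songs, n_components=8)
vecs = np.stack([song_vector(s, cb) for s in songs])
print(retrieve(songs[3], vecs, cb))   # song 3 should rank itself first
```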

Emotion variability, by Prof. Julien Epps:

  • GMM supervector-based emotion recognition (see the sketch after this list)
  • t-SNE for visualization in 2-D space
  • remove phone variability with phone-dependent GMMs
  • speaker normalization is important for emotion recognition
  • two ways: speaker adaptation & speaker signal normalization
  • KL-divergence-based estimation of speaker and emotion variability
  • speaker normalization by feature warping (sketched after this list)
  • speaker variation modeling with JFA
  • Speaker adaptation: speaker library
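
A sketch of the GMM supervector representation and its 2-D visualization, under common assumptions (a diagonal-covariance UBM, relevance-MAP adaptation of the means only, illustrative sizes): each utterance's adapted component means are stacked into a supervector, and t-SNE projects the supervectors for inspection of emotion and speaker variability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

def train_ubm(all_frames, n_components=16, seed=0):
    """Universal background GMM trained on frames pooled over all utterances."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(all_frames)

def supervector(frames, ubm, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a supervector."""
    post = ubm.predict_proba(frames)                  # responsibilities, (T, K)
    n_k = post.sum(axis=0)                            # zeroth-order statistics
    f_k = post.T @ frames                             # first-order statistics, (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.ravel()                            # (K * D,)

# toy usage: 60 utterances -> supervectors -> 2-D t-SNE embedding
rng = np.random.default_rng(0)
utts = [rng.normal(rng.integers(3), 1.0, size=(200, 13)) for _ in range(60)]
ubm = train_ubm(np.vstack(utts), n_components=8)
sv = np.stack([supervector(u, ubm) for u in utts])
print(TSNE(n_components=2, perplexity=15, init="pca").fit_transform(sv).shape)  # (60, 2)
```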
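A sketch of feature warping as a speaker/session normalization step, in its usual sliding-window formulation (the window length and the exact rank-to-quantile mapping are implementation choices): each feature value is replaced by the standard-normal quantile of its rank within a window of roughly 3 seconds, so every feature stream is warped towards an N(0, 1) distribution.

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Warp each feature stream to a standard normal within a sliding window.

    feats: (n_frames, n_dims) features of one recording; win: odd window length
    in frames (about 3 s at a 10 ms hop).
    """
    n, _ = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(n):
        window = feats[max(0, t - half):min(n, t + half + 1)]        # (w, d)
        # rank of the current frame's value within the window, per dimension
        rank = (window < feats[t]).sum(axis=0) + 0.5 * (window == feats[t]).sum(axis=0)
        warped[t] = norm.ppf((rank + 0.5) / (len(window) + 1))
    return warped

# toy usage: heavily skewed features come out roughly zero-mean, unit-variance
x = np.random.default_rng(0).exponential(size=(1000, 3))
w = feature_warp(x)
print(w.mean(axis=0).round(2), w.std(axis=0).round(2))
```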