ISCSLP Tutorial 2
Latest revision as of 07:05, 13 September 2014
Prof. Chung-Hsien
- Arousal & valence coordinates (dimensional representation of emotion)
- Separate the emotion recognition process into sub-emotions
- Available databases / database collections:
  - acted: GEneva Multimodal Emotion Portrayals (GEMEP)
  - induced: eNTERFACE'05 EMOTION Database
  - spontaneous: SEMAINE, AFEW
  - others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
- Static vs. dynamic modeling (see the sketch after this list)
  STATIC:
  - low-level descriptors (LLDs) summarized by functionals over the utterance
  - good for discriminating high- vs. low-arousal emotions
  - temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
  DYNAMIC:
  - frame-based: LLDs are extracted per frame and modeled with GMMs, HMMs, or DTW
  - temporal information is preserved
  - difficult to model context well
  - a large number of local features needs to be extracted
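To make the contrast concrete, here is a minimal sketch under assumed placeholder data (a random frames-by-descriptors matrix stands in for real LLDs, and two toy Gaussian "emotion" models stand in for trained GMMs/HMMs): the static path collapses the frames with functionals into one fixed-length vector, while the dynamic path keeps the frame sequence and scores it frame by frame.

```python
import numpy as np

# Hypothetical frame-level LLD matrix for one utterance:
# T frames x D descriptors (e.g. F0, energy, MFCCs); values are placeholders.
rng = np.random.default_rng(0)
llds = rng.normal(size=(300, 16))                # (T, D)

# STATIC modeling: collapse the time axis with functionals
# (mean, std, min, max, range) into one fixed-length vector per utterance;
# temporal order is discarded.
functionals = np.concatenate([
    llds.mean(axis=0),
    llds.std(axis=0),
    llds.min(axis=0),
    llds.max(axis=0),
    llds.max(axis=0) - llds.min(axis=0),
])                                               # shape: (5 * D,)

# DYNAMIC modeling: keep the frame sequence and score it with a sequence
# model (GMM/HMM/DTW); per-frame log-likelihoods under two toy Gaussian
# "emotion" models stand in for that idea here.
mu_a, mu_b = np.zeros(16), np.full(16, 0.5)
ll_a = -0.5 * ((llds - mu_a) ** 2).sum(axis=1)   # (T,) frame scores, model A
ll_b = -0.5 * ((llds - mu_b) ** 2).sum(axis=1)   # (T,) frame scores, model B
decision = "A" if ll_a.sum() > ll_b.sum() else "B"

print(functionals.shape, decision)
```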
- Unit choice for dynamic modeling
  - technical units: frame, time slice, equally divided unit
  - meaningful units: word, syllable, phrase
  - emotionally consistent units: emotion profiles, emotograms
  - different aspects of speech take place at different time scales
- Feature concatenation or decision fusion is used to exploit the information from the segmented units
- Speech features (a minimal extraction sketch follows this list):
  - prosodic features: pitch, formants, energy, speaking rate; good for arousal-related emotions
  - ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCCs
  - Teager energy features are good for detecting stress
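Below is a minimal extraction sketch for several of the LLDs listed above, using librosa (a tool choice of these notes, not the tutorial's); the file name "speech.wav", the 25 ms / 10 ms framing, and the F0 search range are assumptions.

```python
import numpy as np
import librosa

# 'speech.wav' is a placeholder path; 25 ms frames with 10 ms hop are assumed.
y, sr = librosa.load("speech.wav", sr=16000)

zcr  = librosa.feature.zero_crossing_rate(y, frame_length=400, hop_length=160)[0]
rms  = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)        # (13, T)
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=160)

# Stack the per-frame LLDs (unvoiced F0 frames become 0 so shapes line up).
T = min(len(zcr), len(rms), mfcc.shape[1], len(f0))
llds = np.vstack([zcr[:T], rms[:T], np.nan_to_num(f0[:T]), mfcc[:, :T]]).T
print(llds.shape)   # (T, 16): ZCR, RMS energy, F0, 13 MFCCs per frame
```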
- Recognition models
  - SVM, ANN, HMM, GMM, CART
- Emotion distillation framework (sketched below)
  - distill emotion-specific features from the original high-dimensional features
  - from the speech signal, use an SVM to generate emotiongrams, then use an HMM, n-gram, LDA, or a simple sum to produce the emotion output
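A minimal sketch of the emotiongram idea under assumed placeholder data (the features, labels, and four-class setup are illustrative, not the tutorial's): a segment-level SVM produces per-class posteriors for each segment of an utterance, and a simple sum over segments gives the utterance-level decision; an HMM or n-gram over the emotiongram could replace that last step.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes = 4                                      # e.g. angry/happy/neutral/sad
X_train = rng.normal(size=(400, 24))               # placeholder segment features
y_train = rng.integers(0, n_classes, size=400)     # placeholder segment labels

# Segment-level SVM with probability outputs (Platt scaling).
seg_clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

utt_segments = rng.normal(size=(12, 24))           # one utterance = 12 segments
emotiongram = seg_clf.predict_proba(utt_segments)  # (12, n_classes) posteriors

# "Simple sum" aggregation over the emotiongram gives the utterance label.
utt_emotion = emotiongram.sum(axis=0).argmax()
print(emotiongram.shape, utt_emotion)
```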
- Hierarchical classification structure (sketched below)
  - first detect high vs. low arousal, then classify the emotion within each arousal group
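A minimal two-stage sketch under assumed data and an assumed arousal grouping (angry/happy treated as high arousal, sad/neutral as low): a first SVM detects high vs. low arousal, and a per-group SVM then picks the final emotion.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 24))                     # placeholder features
emotion = rng.integers(0, 4, size=300)             # 0=angry 1=happy 2=sad 3=neutral
arousal = np.isin(emotion, [0, 1]).astype(int)     # assumed: angry/happy = high arousal

stage1 = SVC().fit(X, arousal)                                  # high vs. low arousal
stage2 = {a: SVC().fit(X[arousal == a], emotion[arousal == a])  # emotion within group
          for a in (0, 1)}

x_new = rng.normal(size=(1, 24))
a_hat = stage1.predict(x_new)[0]
print(stage2[a_hat].predict(x_new)[0])             # final emotion label
```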
- Fusion-based recognition (see the sketch after this list)
  - feature-level fusion
  - decision-level fusion
  - model-based fusion: multi-stream HMM
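A minimal sketch contrasting feature-level and decision-level fusion of two assumed streams (placeholder "audio" and "visual" features); model-based fusion with a multi-stream HMM is not shown.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
Xa = rng.normal(size=(200, 16))                    # stream 1 (e.g. audio) features
Xv = rng.normal(size=(200, 8))                     # stream 2 (e.g. visual) features
y  = rng.integers(0, 3, size=200)                  # placeholder emotion labels

# Feature-level fusion: concatenate the streams, train one classifier.
early = SVC(probability=True).fit(np.hstack([Xa, Xv]), y)

# Decision-level fusion: one classifier per stream, average the posteriors.
clf_a = SVC(probability=True).fit(Xa, y)
clf_v = SVC(probability=True).fit(Xv, y)

xa, xv = rng.normal(size=(1, 16)), rng.normal(size=(1, 8))
p_early = early.predict_proba(np.hstack([xa, xv]))
p_late  = 0.5 * clf_a.predict_proba(xa) + 0.5 * clf_v.predict_proba(xv)
print(p_early.argmax(), p_late.argmax())
```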
- Temporal phase-based modeling (a minimal HMM sketch follows this list)
  - divide the emotion into onset, apex, and offset phases
  - use an HMM to characterize one emotional sub-state instead of the entire emotional state
  - 6 states in total: (onset, apex, offset) x (high, low)
  - temporal course modeling
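A minimal sketch of the six-state idea using hmmlearn (a toolkit choice of these notes, not the tutorial's), with placeholder frame features: a single Gaussian HMM whose hidden states play the role of the (onset, apex, offset) x (high, low) sub-states.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(3)
# Placeholder frame-level features for 5 training utterances.
utts = [rng.normal(size=(rng.integers(80, 120), 12)) for _ in range(5)]
X = np.vstack(utts)
lengths = [len(u) for u in utts]

# One Gaussian HMM whose 6 hidden states stand for
# (onset, apex, offset) x (high, low) emotional sub-states.
model = hmm.GaussianHMM(n_components=6, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# Decoding yields a sub-state sequence; contiguous runs of states can then
# be read as onset / apex / offset phases of the utterance.
states = model.predict(utts[0])
print(states[:20])
```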
- Structure-based modeling
  - three levels of units: utterance, emotion units, sub-emotion units
  - use a statistical model across the different levels
Hsin-Min Wang
- Music information retrieval (MIR)
  - title search
  - search by query
  - the emotion of a song, labelled by multiple annotators, forms a Gaussian in the valence-arousal (VA) space
  - represent the acoustic features of a song by a probabilistic histogram vector
  - acoustic GMM posterior representation as a feature (see the sketch after this list)
  - GMM codebook constructed in training (VA GMM)
  - tags can also be placed in the VA space
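A minimal sketch of the acoustic GMM posterior representation under assumed placeholder data: a GMM codebook is fit on frame features pooled from training songs, and each song is then represented by its averaged component posteriors (a fixed-length vector regardless of song length).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
train_frames = rng.normal(size=(5000, 20))          # pooled frames from training songs
codebook = GaussianMixture(n_components=32, covariance_type="diag").fit(train_frames)

song_frames = rng.normal(size=(400, 20))            # frame features of one query song
posteriors = codebook.predict_proba(song_frames)    # (400, 32) component posteriors
song_vector = posteriors.mean(axis=0)               # fixed-length song representation
print(song_vector.shape)                            # (32,)
```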
- Video-to-audio retrieval
  - first predict the emotion of the video
  - then retrieve audio whose emotion matches it
  - the direction can also be reversed (audio-to-video retrieval)
Emotion variability, by Prof. Vidhyasaharan Sethu:
- GMM supervector based emotion recognition (a supervector sketch follows this list)
  - t-SNE for visualization in a 2-D space
  - remove phone variability with phone-dependent GMMs
  - speaker normalization is important for emotion recognition
  - two approaches: speaker adaptation & normalization of the speaker signal
  - KL-divergence-based estimation of speaker and emotion variability
  - speaker normalization by feature warping
  - speaker variability modeling with JFA
  - speaker adaptation: speaker library
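A minimal sketch of a GMM mean-supervector (the standard MAP-adaptation recipe, assumed here rather than taken from the talk): the UBM means are adapted to one utterance's frames and concatenated into a single vector; the data, model sizes, and relevance factor are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
ubm = GaussianMixture(n_components=16, covariance_type="diag")
ubm.fit(rng.normal(size=(4000, 13)))               # UBM on pooled background frames

def mean_supervector(frames, ubm, relevance=16.0):
    """MAP-adapt the UBM means to one utterance's frames, then concatenate."""
    post = ubm.predict_proba(frames)               # (T, C) responsibilities
    n_c = post.sum(axis=0)                         # zeroth-order statistics
    f_c = post.T @ frames                          # first-order statistics (C, D)
    alpha = (n_c / (n_c + relevance))[:, None]
    adapted = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.ravel()                         # (C * D,) supervector

sv = mean_supervector(rng.normal(size=(300, 13)), ubm)
print(sv.shape)                                    # (208,) = 16 components x 13 dims
```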
- Cognitive load, by Julien Epps
  - cognitive load = arousal?
  - load measures: analytical measures (number of ++); physiological measures: EEG, ECG/HRV, GSR, respiration; task measures: speech, drawing, ...
  - glottal features
  - SDC (shifted delta cepstra): similar to delta-MFCCs but computed over a longer time shift, stacking deltas from several shifted frames (a small sketch follows)
- Future work: relationship between cognitive load and arousal; multimodal data; improved discrimination; testing under less constrained conditions
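A minimal sketch of shifted delta cepstra (SDC) computed from an MFCC matrix with the common N-d-P-k parameterization; the parameter values and the input MFCCs are placeholders, not the speaker's configuration.

```python
import numpy as np

def sdc(mfcc, d=1, P=3, k=7):
    """mfcc: (T, N) frame-level cepstra -> (T, N*k) shifted delta cepstra."""
    T = mfcc.shape[0]
    # Pad with edge frames so every shifted delta is defined.
    pad = np.pad(mfcc, ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        s = i * P
        # Delta at shift s: c[t + s + d] - c[t + s - d] (simple difference).
        blocks.append(pad[s + 2 * d : s + 2 * d + T] - pad[s : s + T])
    return np.hstack(blocks)

mfcc = np.random.default_rng(6).normal(size=(200, 13))   # placeholder MFCCs
print(sdc(mfcc).shape)                                    # (200, 91) = 13 x 7 blocks
```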