"ISCSLP Tutorial 2": Difference between revisions
From cslt Wiki
Latest revision as of 07:05, 13 September 2014 (Saturday)
Prof. Chung-Hsien
- arousal and valence coordinates
- separate the emotion process into sub-emotions
- available databases:
- database collection:
- acted: Geneva Multimodal Emotion Portrayals (GEMEP)
- induced: eNTERFACE'05 EMOTION Database
- spontaneous: SEMAINE, AFEW
- others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
- static vs dynamic modeling
STATIC:
- low level descriptors (LLDs) and functionals
- good for discriminating between high- and low-arousal emotions
- temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
DYNAMIC:
- frame-based: LLDs are extracted per frame and modeled by GMMs, HMMs, or DTW
- temporal information is retained
- difficult to model context well
- a large number of local features need to be extracted
- Unit choice for dynamic modeling
- technical unit: frame, time slice, equally-divided unit
- meaningful unit: word, syllable, phrases
- emotionally consistent unit: emotion profiles, emotograms
- different aspects of speech tasks take place at different time scales
- feature concatenation or decision fusion to exploit the information from segmented units
- speech features:
- prosody features (pitch, formants, energy, speaking rate): good for arousal-related emotions
- ZCR, RMS energy, F0, harmonics-to-noise ratio, MFCC
- MFCC
- the Teager feature is good for detecting stress
- recognition models
- SVM, ANN, HMM, GMM, CART
- Emotion distillation framework
- extract emotion-specific features from the original high-dimensional features
- from speech signals, use SVMs to generate emotograms, then use HMM, n-gram, LDA, or a simple sum to give the emotion output
- Hierarchical classification structure
- first detect high/low arousal
- Fusion based recognition
- Feature level fusion
- decision level fusion
- Model-based fusion: multi-stream HMM
- Temporal phase-based modeling
- divide the emotion into onset, apex, offset
- use an HMM to characterize one emotional sub-state instead of the entire emotional state
- 6 states in total: (onset, apex, offset) x (high, low)
- Temporal course modeling
- Structure-based modeling
- three levels of units: utterance, emotion units, sub-emotion units
- use statistical models across the different levels
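The static-modeling notes above (LLDs plus functionals, with temporal information lost) can be illustrated with a minimal sketch. The function names `functionals` and `utterance_vector`, the choice of five statistics, and the toy F0/energy contours are all assumptions made for illustration; they are not taken from the tutorial.

```python
import statistics

STAT_NAMES = ("mean", "std", "min", "max", "range")

def functionals(contour):
    """Collapse one frame-level LLD contour into fixed-length statistics."""
    lo, hi = min(contour), max(contour)
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }

def utterance_vector(llds):
    """Concatenate the functionals of every LLD (sorted by name) into one vector."""
    vec = []
    for name in sorted(llds):
        stats = functionals(llds[name])
        vec.extend(stats[k] for k in STAT_NAMES)
    return vec

# Hypothetical contours: the same values in opposite temporal order.
rising = {"f0": [100.0, 120.0, 140.0, 160.0], "energy": [0.1, 0.2, 0.3, 0.4]}
falling = {"f0": [160.0, 140.0, 120.0, 100.0], "energy": [0.4, 0.3, 0.2, 0.1]}

v_rising = utterance_vector(rising)
v_falling = utterance_vector(falling)
```

Both toy utterances map to the same 10-dimensional vector even though one contour rises and the other falls, which is the sense in which static modeling discards temporal information.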
Hsin-Min Wang
- Music information retrieval (MIR)
- title search
- search by query:
- the emotion labels that different listeners give a song form a Gaussian
- represent the acoustic features of a song by a probabilistic history vector
- acoustic GMM posterior representation as a feature
- GMM codebook constructed in training (VA GMM)
- tags can also be placed in the VA space
- Video to Audio Retrieval
- First predict video emotion
- put audio
- this can be reversed
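The "acoustic GMM posterior representation" noted above can be sketched as follows: each frame is scored against a fixed GMM codebook, and the per-frame component posteriors are averaged over the clip. This is only a sketch under assumptions: the codebook is taken to be diagonal-covariance, and the two-component codebook, the frames, and the function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def frame_posterior(x, weights, means, variances):
    """Posterior probability of each codebook component given one frame."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def clip_posterior(frames, weights, means, variances):
    """Average the frame posteriors to get one fixed-length clip vector."""
    acc = [0.0] * len(weights)
    for x in frames:
        for k, p in enumerate(frame_posterior(x, weights, means, variances)):
            acc[k] += p
    return [a / len(frames) for a in acc]

# Hypothetical 2-component codebook and two frames close to component 0.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.3, 0.0]]

song_vector = clip_posterior(frames, weights, means, variances)
```

The resulting vector has one entry per codebook component and sums to 1, so clips of different lengths become comparable fixed-length features.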
Emotion variability, by Prof. Vidhyasaharan Sethu:
- GMM supervector based emotion
- t-SNE for visualization in 2-D space
- remove phone variability by phone-dependent GMMs
- speaker normalization is important for emotion recognition
- two ways: speaker adaptation & speaker signal
- KL-based estimation of speaker and emotion variability
- speaker normalization by feature warping
- speaker variation modeling with JFA
- speaker adaptation: speaker library
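Speaker normalization by feature warping, mentioned above, maps each feature stream so that its values follow a standard normal distribution, based only on the rank of each value. The sketch below is a simplification under stated assumptions: warping is usually applied over a sliding window of a few seconds, whereas here the whole sequence is warped at once, and the function name `warp_feature` and the toy data are hypothetical.

```python
from statistics import NormalDist

def warp_feature(values):
    """Map one feature stream to standard-normal quantiles by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    warped = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the target probability strictly inside (0, 1).
        warped[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return warped

# Hypothetical streams: speaker B is a shifted and scaled copy of speaker A.
speaker_a = [1.0, 3.0, 2.0, 5.0, 4.0]
speaker_b = [10.0 * v + 100.0 for v in speaker_a]

warped_a = warp_feature(speaker_a)
warped_b = warp_feature(speaker_b)
```

Because only ranks matter, the offset and scale that distinguish the two toy speakers disappear after warping, while the ordering of values within each stream is preserved.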
- Cognitive load by Julien Epps
- cognitive load = arousal?
- load measure: analytical measure (number of ++); physical measure: EEG, ECG/HRV, GSR, respiration; task measure: speech, drawing...
- Glottal features
- SDC (shifted delta cepstra): MFCC deltas stacked over a longer span; quite similar to delta-MFCC, but with long shifts
- Future: relationship between cognitive load and arousal; multimodal data; improved discrimination; testing under less constrained conditions
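The SDC note above can be made concrete with a small sketch of shifted delta cepstra: for each frame, k delta vectors are computed at centers spaced P frames apart and stacked into one long vector. The default parameters (d=1, P=3, k=7) are a common configuration from language recognition and are an assumption here, as are the edge handling (indices clamped to the sequence) and the function names; the notes do not state which configuration was used in the talk.

```python
def delta(frames, t, d):
    """Delta at frame t: c(t + d) - c(t - d), with indices clamped to the sequence."""
    last = len(frames) - 1
    hi = frames[min(t + d, last)]
    lo = frames[max(t - d, 0)]
    return [h - l for h, l in zip(hi, lo)]

def sdc(frames, d=1, P=3, k=7):
    """Stack k deltas, taken P frames apart, for every frame."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        vec = []
        for i in range(k):
            center = min(t + i * P, last)  # clamp shifted centers at the end
            vec.extend(delta(frames, center, d))
        out.append(vec)
    return out

# Hypothetical 1-dimensional "cepstral" ramp, 30 frames long.
ramp = [[float(t)] for t in range(30)]
features = sdc(ramp)
```

With one coefficient per frame the output has k = 7 values per frame; with N cepstral coefficients it would have N * k. A plain delta-MFCC corresponds to the first block only, which is why SDC is described as similar to delta-MFCC but covering a longer context.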