Difference between revisions of "ISCSLP Tutorial 2"
From cslt Wiki
Revision as of 05:57, 13 September 2014
Prof. Chung-Hsien
- Arousal & Valence coordinates
- separate the emotion process into sub-emotions
- available databases:
- database collection:
- acted: Geneva Multimodal Emotion Portrayals (GEMEP)
- induced: eNTERFACE'05 EMOTION Database
- spontaneous: SEMAINE, AFEW
- others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
- static vs dynamic modeling
STATIC:
- low-level descriptors (LLDs) and functionals
- good for discriminating between high- and low-arousal emotions
- temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
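A minimal sketch of the static approach (toy values, hypothetical function names): a variable-length per-frame LLD track is collapsed into a fixed set of utterance-level functionals, which is exactly where the temporal information is discarded.

```python
import math

def functionals(lld_track):
    """Collapse a per-frame LLD track into fixed-length utterance-level statistics."""
    n = len(lld_track)
    mean = sum(lld_track) / n
    var = sum((v - mean) ** 2 for v in lld_track) / n
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "min": min(lld_track),
        "max": max(lld_track),
        "range": max(lld_track) - min(lld_track),
    }

# Toy per-frame energy values; the result has the same length for any utterance,
# but the temporal order of the frames no longer matters.
energy_track = [0.1, 0.4, 0.9, 0.4, 0.1]
stats = functionals(energy_track)
```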
DYNAMIC:
- frame-based: LLDs are extracted per frame and modeled by GMMs, HMMs, or DTW
- temporal information is obtained
- difficult to model context well
- a large number of local features need to be extracted
- Unit choice for dynamic modeling
- technical unit: frame, time slice, equally-divided unit
- meaningful unit: word, syllable, phrases
- emotionally consistent unit: emotion profiles, emotograms
- different aspects of the speech task take place at different time scales
- feature concatenation or decision fusion to exploit the information from segmented units
- speech features:
- prosodic features (pitch, formants, energy, speaking rate) are good for arousal-related emotions
- ZCR, RMS energy, F0, harmonic-to-noise ratio, MFCC
- Teager energy features are good for detecting stress
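Two of the listed frame-level features are simple enough to sketch directly in pure Python (hypothetical function names, illustrative frame values):

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

# Toy frame: alternating signs, so every adjacent pair crosses zero.
frame = [0.5, -0.5, 0.5, -0.5, 0.5]
zcr = zero_crossing_rate(frame)
rms = rms_energy(frame)
```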
- recognition models
- SVM, ANN, HMM, GMM, CART
- Emotion distillation framework
- distill emotion-specific features from the original high-dimensional features
- from the speech signal, use an SVM to generate emotiongrams, then apply an HMM, n-gram, LDA, or a simple sum to produce the emotion output
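A minimal sketch of the "simple sum" back-end over an emotiongram (hypothetical names, toy posteriors; the HMM / n-gram / LDA back-ends would replace the summation step):

```python
def emotiongram_decision(emotiongram):
    """Combine per-frame emotion posteriors (an 'emotiongram') by simple summation."""
    labels = emotiongram[0].keys()
    totals = {lab: sum(frame[lab] for frame in emotiongram) for lab in labels}
    return max(totals, key=totals.get)

# Toy emotiongram: per-frame class posteriors as a frame-level classifier
# (e.g. an SVM with probability outputs) might produce them.
emotiongram = [
    {"angry": 0.7, "happy": 0.2, "neutral": 0.1},
    {"angry": 0.5, "happy": 0.3, "neutral": 0.2},
    {"angry": 0.2, "happy": 0.6, "neutral": 0.2},
]
decision = emotiongram_decision(emotiongram)
```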
- Hierarchical classification structure
- first detect high/low arousal, then classify the emotion within each arousal group
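The hierarchical idea can be sketched as a two-stage decision; the threshold rules below are hypothetical stand-ins for trained classifiers:

```python
def hierarchical_classify(feats, arousal_clf, high_clf, low_clf):
    """Stage 1: decide high/low arousal. Stage 2: classify within that group."""
    if arousal_clf(feats) == "high":
        return high_clf(feats)
    return low_clf(feats)

# Hypothetical stand-in classifiers keyed on simple features.
arousal_clf = lambda f: "high" if f["energy"] > 0.5 else "low"
high_clf = lambda f: "angry" if f["pitch"] > 200 else "happy"
low_clf = lambda f: "sad" if f["pitch"] < 120 else "neutral"

label = hierarchical_classify({"energy": 0.8, "pitch": 250},
                              arousal_clf, high_clf, low_clf)
```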
- Fusion based recognition
- feature-level fusion
- decision-level fusion
- model-based fusion: multi-stream HMM
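Feature-level vs. decision-level fusion in toy form (hypothetical names; the feature and score values are illustrative):

```python
def feature_level_fusion(audio_feats, visual_feats):
    """Feature-level fusion: concatenate modality features before classification."""
    return audio_feats + visual_feats

def decision_level_fusion(scores_per_modality, weights):
    """Decision-level fusion: weighted sum of per-modality class scores."""
    labels = scores_per_modality[0].keys()
    fused = {lab: sum(w * s[lab] for w, s in zip(weights, scores_per_modality))
             for lab in labels}
    return max(fused, key=fused.get)

audio_scores = {"high": 0.8, "low": 0.2}
visual_scores = {"high": 0.4, "low": 0.6}
fused_vector = feature_level_fusion([0.1, 0.2], [0.3])
fused_label = decision_level_fusion([audio_scores, visual_scores], [0.5, 0.5])
```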
- Temporal phase-based modeling
- divide the emotion into onset, apex, and offset phases
- use an HMM to characterize each emotional sub-state, instead of the entire emotional state
- 6 states in total: (onset, apex, offset) × (high, low)
- Temporal course modeling
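The 6-state inventory above can be written down directly as the product of temporal phases and arousal levels (the state naming is my own):

```python
from itertools import product

# Each HMM state models one (temporal phase, arousal level) sub-state
# rather than an entire emotional state.
PHASES = ("onset", "apex", "offset")
AROUSAL = ("high", "low")
STATES = [f"{phase}/{level}" for phase, level in product(PHASES, AROUSAL)]
```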
- Structure-based modeling
- three levels of units: utterance, emotion units, sub-emotion units
- use statistical models to link the different levels
Hsin-Min Wang
- Music information retrieval (MIR)
- title search
- search by query
- the emotion of a song, as labelled by multiple annotators, forms a Gaussian
- represent the acoustic features of a song by a probabilistic histogram vector
- acoustic GMM posterior representation as a feature
- GMM codebook constructed in training (VA GMM)
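A 1-D toy sketch of the GMM-posterior song representation (hypothetical names and values; a real VA GMM codebook would use multivariate components learned from training data): each frame's posterior over the codewords is averaged into one fixed-length vector per song.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density (one dimension for brevity)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def song_posterior_vector(frames, codebook):
    """Average per-frame posteriors over GMM codewords -> fixed-length song feature."""
    dim = len(codebook)
    acc = [0.0] * dim
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v) for (w, m, v) in codebook]
        total = sum(likes)
        for k in range(dim):
            acc[k] += likes[k] / total
    return [a / len(frames) for a in acc]

# Toy 2-codeword codebook: (weight, mean, variance); values are illustrative only.
codebook = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
frames = [0.1, -0.2, 4.9, 5.2]  # two frames near each codeword
posterior = song_posterior_vector(frames, codebook)
```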