ISCSLP Tutorial 2
From cslt Wiki
Revision as of 06:38, 13 September 2014
Prof. Chung-Hsien
- Arousal & Valence coordinates
- separate the emotion process into sub-emotions
- available databases:
- database collection:
- acted: Geneva Multimodal Emotion Portrayals (GEMEP)
- induced: eNTERFACE'05 EMOTION Database
- spontaneous: SEMAINE, AFEW
- others: RML, VAM, FAU AIBO, SAVEE, TUM AVIC, IEMOCAP, SEMAINE, MHMC
- static vs dynamic modeling
STATIC:
- low level descriptors (LLDs) and functionals
- good for discriminating between high- and low-arousal emotions
- temporal information is lost; not suitable for long utterances; cannot detect changes in emotion
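The static approach above (frame-level LLDs summarized by functionals into one fixed-length utterance vector) can be sketched as follows; the particular LLDs (log-energy, zero-crossing rate) and functionals (mean, std, max, range) are illustrative assumptions, not the tutorial's exact feature set.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def lld_features(frames):
    """Frame-level low-level descriptors: log-energy and zero-crossing rate."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)        # (n_frames, n_llds)

def functionals(llds):
    """Summarize each LLD contour with utterance-level statistics."""
    stats = [llds.mean(0), llds.std(0), llds.max(0), llds.max(0) - llds.min(0)]
    return np.concatenate(stats)                  # fixed-length vector

x = np.random.RandomState(0).randn(16000)         # 1 s of fake audio at 16 kHz
vec = functionals(lld_features(frame_signal(x)))
print(vec.shape)                                  # one vector per utterance
```

With 2 LLDs and 4 functionals the utterance vector has 8 dimensions regardless of utterance length, which is exactly why temporal information is lost.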
DYNAMIC:
- frame as the basis; LLDs are extracted and modeled by GMMs, HMMs, or DTW
- temporal information is retained
- difficult to model context well
- a large number of local features needs to be extracted
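A minimal sketch of the dynamic GMM approach: train one GMM per emotion on frame-level features, then classify a test utterance by the log-likelihood of its frame sequence. scikit-learn's `GaussianMixture` is used, and the synthetic 2-D "frames" are stand-ins for real LLDs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic frame-level features for two emotions (stand-ins for real LLDs).
frames = {"angry": rng.randn(500, 2) + [3, 0],
          "sad":   rng.randn(500, 2) + [-3, 0]}

# Train one GMM per emotion on its pooled training frames.
models = {emo: GaussianMixture(n_components=2, random_state=0).fit(X)
          for emo, X in frames.items()}

# Classify a test utterance by the average log-likelihood of its frames.
test_utt = rng.randn(100, 2) + [3, 0]             # frames drawn near "angry"
scores = {emo: gmm.score(test_utt) for emo, gmm in models.items()}
print(max(scores, key=scores.get))                # -> angry
```

Because each frame is scored independently, temporal context beyond the frame is not modeled here, which is the limitation the notes point out.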
- Unit choice for dynamic modeling
- technical unit: frame, time slice, equally-divided unit
- meaningful unit: word, syllable, phrases
- emotionally consistent unit: emotion profiles, emotograms
- different aspects of speech take place at different scales
- feature concatenation or decision fusion to exploit the information from segmented units
- speech features:
- prosody features: pitch, formants, energy, speaking rate; good for arousal-related emotions
- ZCR, RMS energy, F0, harmonic noise ratio, MFCC
- MFCC
- Teager feature is good for detecting stress
- recognition models
- SVM, ANN, HMM, GMM, CART
- Emotion distillation framework
- emotion specific features from the original high-dimensional feature
- from speech signals, use an SVM to generate emotograms, then use an HMM, n-gram, LDA, or simple sum to give the emotion output
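The distillation pipeline above, with the simplest back-end, might look like this: an SVM with probability outputs emits per-segment emotion posteriors (the emotogram), and a "simple sum" over segments gives the utterance decision. The segment features are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Synthetic segment features for 3 emotion classes (illustrative stand-ins).
X = np.concatenate([rng.randn(200, 4) + m
                    for m in ([3, 0, 0, 0], [0, 3, 0, 0], [0, 0, 3, 0])])
y = np.repeat([0, 1, 2], 200)

svm = SVC(probability=True, random_state=0).fit(X, y)

# One test utterance = a sequence of segments; the per-segment posterior
# matrix is the "emotogram" (segments x emotion classes).
segments = rng.randn(20, 4) + [0, 3, 0, 0]        # segments near class 1
emotogram = svm.predict_proba(segments)

# "Simple sum" decision: accumulate posteriors over segments, take the argmax.
print(emotogram.sum(axis=0).argmax())
```

The HMM, n-gram, or LDA back-ends mentioned in the notes would replace the simple sum with a model over the emotogram's temporal structure.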
- Hierarchical classification structure
- first detect high/low arousal
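A two-stage sketch of the hierarchical structure: a first classifier separates high vs. low arousal, then a per-branch classifier picks the emotion. The classifiers (logistic regression), the four emotion labels, and the 2-D features are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic utterance features; emotions grouped by arousal level.
emotions = {"angry": ([3, 3], "high"), "happy": ([3, -3], "high"),
            "sad":   ([-3, 3], "low"), "calm":  ([-3, -3], "low")}
X = np.concatenate([rng.randn(100, 2) + mu for mu, _ in emotions.values()])
labels = np.repeat(list(emotions), 100)
arousal = np.array([emotions[l][1] for l in labels])

# Stage 1: high/low arousal; Stage 2: one classifier per arousal branch.
stage1 = LogisticRegression().fit(X, arousal)
stage2 = {a: LogisticRegression().fit(X[arousal == a], labels[arousal == a])
          for a in ("high", "low")}

def classify(x):
    a = stage1.predict([x])[0]          # first detect high/low arousal
    return stage2[a].predict([x])[0]    # then decide within that branch

print(classify([3, 3]))
```

Splitting on arousal first plays to the strength noted above: prosodic features separate high from low arousal well, so the harder within-branch decisions get a cleaner problem.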
- Fusion based recognition
- Feature level fusion
- decision level fusion
- model-based fusion: multi-stream HMM
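Decision-level fusion can be as simple as a weighted average of the per-class posteriors from independent stream classifiers followed by an argmax; the posterior values below are made up for illustration.

```python
import numpy as np

# Hypothetical per-class posteriors over (angry, happy, sad) from two streams.
audio_post = np.array([0.6, 0.3, 0.1])   # audio classifier output
video_post = np.array([0.3, 0.5, 0.2])   # video classifier output

# Decision-level fusion: combine stream outputs, here by weighted average.
weights = [0.5, 0.5]
fused = weights[0] * audio_post + weights[1] * video_post
print(fused, fused.argmax())             # fused posterior -> class 0 (angry)
```

Feature-level fusion would instead concatenate the streams' feature vectors before a single classifier, while model-based fusion (e.g. a multi-stream HMM) combines the streams inside the model itself.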
- Temporal phase-based modeling
- divide the emotion into onset, apex, offset
- use an HMM to characterize one emotional sub-state, instead of the entire emotional state
- 6 states in total: (onset, apex, offset) × (high, low)
- Temporal course modeling
- Structure-based modeling
- three level units: utterance, emotion units, sub emotion units
- use statistic model among different levels
Hsin-Min Wang
- Music information retrieval (MIR)
- title search
- search by query:
- the emotion labels assigned to a song by human annotators form a Gaussian
- represent the acoustic features of a song by a probabilistic history vector
- acoustic GMM posterior representation as a feature
- GMM code book constructed in training (VA GMM)
- can put the tag into VA space
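The acoustic GMM posterior representation above — a codebook GMM trained on pooled frames, each song represented by its average posterior over the mixture components — can be sketched as follows; the codebook size and features are assumptions (in the tutorial the components are tied to the VA space).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Pooled frame-level acoustic features from a training collection (synthetic).
train_frames = rng.randn(2000, 4)

# The GMM codebook constructed in training.
codebook = GaussianMixture(n_components=8, random_state=0).fit(train_frames)

def song_representation(frames):
    """Average component posterior over a song's frames: a fixed-length
    probability vector usable as a retrieval feature."""
    return codebook.predict_proba(frames).mean(axis=0)

song = rng.randn(300, 4)                 # one song's frame features
rep = song_representation(song)
print(rep.shape, rep.sum())              # fixed-length, sums to ~1
```

Because the representation is a probability vector over shared components, songs (and, as the notes say, tags) can be compared in the same space regardless of length.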
- Video to Audio Retrieval
- first predict the emotion of the video
- then retrieve audio whose emotion matches in the VA space
- the process can also be reversed
Emotion variability, by Prof. Julien Epps:
- GMM supervector-based emotion recognition
- t-SNE for visualization in 2-D space
- remove phone variability by phone dependent GMMs
- speaker normalization is important for emotion recognition
- two ways: speaker adaptation & speaker signal normalization
- KL-based estimation on speaker and emotion variability
- speaker normalization by feature warping
- speaker variation modeling with JFA
- speaker adaptation: speaker library
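Feature warping, mentioned above as a speaker normalization method, maps each feature dimension's empirical distribution onto a standard normal. A simplified per-utterance sketch via rank statistics and the inverse normal CDF (in practice a sliding window is used, which this sketch omits):

```python
import numpy as np
from scipy.stats import norm

def feature_warp(X):
    """Warp each feature dimension of X (frames x dims) to a standard
    normal via its empirical ranks. Simplified: warps over the whole
    utterance rather than a sliding window."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0)     # ranks 0..n-1 per dimension
    # Map each rank to the Gaussian quantile of its empirical CDF position.
    return norm.ppf((ranks + 0.5) / n)

rng = np.random.RandomState(0)
X = rng.exponential(size=(1000, 3)) * [1, 5, 10]  # skewed, mismatched scales
W = feature_warp(X)
print(W.mean(axis=0).round(3), W.std(axis=0).round(3))  # ~0 mean, ~1 std
```

After warping, every dimension follows (approximately) the same standard normal regardless of the speaker's feature distribution, which removes much of the speaker-dependent scale and offset variation.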