(43 intermediate revisions by 4 users not shown) |
Line 1: |
Line 1: |
| + | ==DNN architecture== |
| | | |
− | ===1 SPEECH PERCEPTION, PRODUCTION AND ACQUISITION===
| + | * [http://www.isca-speech.org/archive/Interspeech_2016/pdfs/1446.pdf Ying Zhang et al. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks] |
| + | * [[媒体文件:OUTRAGEOUSLYLARGENEURALNETWORKSTHESPARSELY-GATEDMIXTURE-OF-EXPERTSLAYER.pdf|ICLR 2017: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer]] |
| + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/fb/LightRNN.pdf LightRNN from Microsoft] |
| + | * [https://arxiv.org/pdf/1512.03385v1.pdf Kaiming He et al. Deep Residual Learning for Image Recognition] |
| + | * [http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0515.pdf Wei-Ning Hsu et al. Exploiting Depth and Highway Connections in Convolutional Recurrent Deep Neural Networks for Speech Recognition] |
| + | * [http://t.cn/RfZHxko MICRO 2016] |
| + | * [[媒体文件:Cambricon-X.pdf|Cambricon-X: An Accelerator for Sparse Neural Networks]] |
| + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/2/26/REVISE_SATURATED_ACTIVATION_FUNCTIONS.pdf Revise Saturated Activation Functions] |
| | | |
| + | ==Visualization== |
| | | |
| + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/b/b9/Visualizing_and_Understanding_Genomic.pdf Visualizing and Understanding Genomic Sequences Using Deep Neural Networks] |
| + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/4/43/On_the_Role_of_Nonlinear_Transformations_in_Deep_Neural_Network_Acoustic_Models.PDF On the Role of Nonlinear Transformations in Deep Neural Network Acoustic Models] |
| + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f/f6/Understanding_intermediate_layers_using_linear_classifier_probes.pdf Understanding Intermediate Layers Using Linear Classifier Probes] |
| | | |
− | [[1.1 Models of speech production]]
| + | ==Speaker recognition== |
| | | |
− | [[1.2 Physiology and neurophysiology of speech production]] | + | * [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/1/1b/RedDots.rar# INTERSPEECH 2016 Fri-O-2-2: Special Session: The RedDots Challenge: Towards Characterizing Speakers from Short Utterances] |
| + | * [http://192.168.0.51:8888/2016/interspeech2016/WELCOME.html# INTERSPEECH 2016 Fri-O-3-2: Special Session: The Speakers in the Wild (SITW) Speaker Recognition Challenge] |
| | | |
− | [[1.3 Neural basis of speech production]]
| |
| | | |
− | [[1.4 Coarticulation]]
| + | ==Review== |
| | | |
− | [[1.5 Models of speech perception]] | + | * [[媒体文件:Note icassp16.pdf|Zhiyuan Tang 20160520 - ICASSP 2016 summary]] |
− | | + | * [[媒体文件:Nn analysis.pdf|Zhiyuan Tang 20160802 - Visualizing, Measuring and Understanding Neural Networks: A Brief Survey]] |
− | [[1.6 Physiology and neurophysiology of speech perception]]
| + | * [[媒体文件:Interspeech16 review.pdf|Zhiyuan Tang 20161122 - INTERSPEECH 2016 summary]] |
− | | + | |
− | [[1.7 Neural basis of speech perception]]
| + | |
− | | + | |
− | [[1.8 Acoustic and articulatory cues in speech perception]]
| + | |
− | | + | |
− | [[1.9 Interaction speech production-speech perception]]
| + | |
− | | + | |
− | [[1.10 Multimodal speech perception]]
| + | |
− | | + | |
− | [[1.11 Cognition and brain studies on speech]]
| + | |
− | | + | |
− | [[1.12 Multilingual studies]]
| + | |
− | | + | |
− | [[1.13 L1 acquisition and bilingual acquisition]]
| + | |
− | | + | |
− | [[1.14 L2 acquisition by children and adults]]
| + | |
− | | + | |
− | [[1.15 Speech and hearing disorders]]
| + | |
− | | + | |
− | [[1.16 Singing voice: production and perception]]
| + | |
− | | + | |
− | [[1.17 Speech and other biosignals]]
| + | |
− | | + | |
− | [[1.18 Special Session: Intelligibility under the microscope]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[2 PHONETICS, PHONOLOGY, AND PROSODY]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[2.1 Phonetics and phonology]]
| + | |
− | | + | |
− | [[2.2 Language descriptions]]
| + | |
− | | + | |
− | [[2.3 Linguistic systems]]
| + | |
− | | + | |
− | [[2.4 Discourse and dialog structures]]
| + | |
− | | + | |
− | [[2.5 Acoustic phonetics]]
| + | |
− | | + | |
− | [[2.6 Phonation, voice quality]]
| + | |
− | | + | |
− | [[2.7 Articulatory and acoustic features of prosody]]
| + | |
− | | + | |
− | [[2.8 Perception of prosody]]
| + | |
− | | + | |
− | [[2.9 Phonological processes and models]]
| + | |
− | | + | |
− | [[2.10 Laboratory phonology]]
| + | |
− | | + | |
− | [[2.11 Phonetic universals]]
| + | |
− | | + | |
− | [[2.12 Sound changes]]
| + | |
− | | + | |
− | [[2.13 Sociophonetics]]
| + | |
− | | + | |
− | [[2.14 Phonetics of L1-L2 interaction]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[3 ANALYSIS OF PARALINGUISTICS IN SPEECH AND LANGUAGE]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[3.1 Analysis of speaker states]]
| + | |
− | | + | |
− | [[3.2 Analysis of speaker traits]]
| + | |
− | | + | |
− | [[3.3 Automatic analysis of speaker states and traits]]
| + | |
− | | + | |
− | [[3.4 Pathological speech and language]]
| + | |
− | | + | |
− | [[3.5 Non-verbal communication]]
| + | |
− | | + | |
− | [[3.6 Social and vocal signals]]
| + | |
− | | + | |
− | [[3.7 Sentiment analysis and opinion mining]]
| + | |
− | | + | |
− | [[3.8 Paralinguistics in singing]]
| + | |
− | | + | |
− | [[3.9 Perception of paralinguistic phenomena]]
| + | |
− | | + | |
− | [[3.10 Phonetic and linguistic aspects of paralinguistics]]
| + | |
− | | + | |
− | [[3.11 Special Session: Interspeech 2016 Computational Paralinguistics Challenge (ComParE): Deception & Sincerity]]
| + | |
− | | + | |
− | [[3.12 Special Session: Clinical and neuroscience-inspired vocal biomarkers of neurological and psychiatric disorders]] | + | |
− | | + | |
− | | + | |
− | | + | |
− | [[4 SPEAKER AND LANGUAGE IDENTIFICATION]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[4.1 Language identification and verification]]
| + | |
− | | + | |
− | [[4.2 Dialect and accent recognition]]
| + | |
− | | + | |
− | [[4.3 Speaker verification and identification]]
| + | |
− | | + | |
− | [[4.4 Features for speaker and language recognition]]
| + | |
− | | + | |
− | [[4.5 Robustness to variable and degraded channels]]
| + | |
− | | + | |
− | [[4.6 Speaker confidence estimation]]
| + | |
− | | + | |
− | [[4.7 Speaker diarization]]
| + | |
− | | + | |
− | [[4.8 Higher-level knowledge in speaker and language recognition]]
| + | |
− | | + | |
− | [[4.9 Evaluation of speaker and language identification systems]]
| + | |
− | | + | |
− | [[4.10 Special Session: The RedDots Challenge: Towards Characterizing Speakers from Short Utterances]]
| + | |
− | | + | |
− | [[4.11 Special Session: The Speakers in the Wild (SITW) Speaker Recognition Challenge]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[5 ANALYSIS OF SPEECH AND AUDIO SIGNALS]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[5.1 Speech acoustics]]
| + | |
− | | + | |
− | [[5.2 Speech analysis and representation]]
| + | |
− | | + | |
− | [[5.3 Audio signal analysis and representation]]
| + | |
− | | + | |
− | [[5.4 Speech and audio segmentation and classification]]
| + | |
− | | + | |
− | [[5.5 Voice activity detection]]
| + | |
− | | + | |
− | [[5.6 Pitch and harmonic analysis]]
| + | |
− | | + | |
− | [[5.7 Source separation and computational auditory scene analysis]]
| + | |
− | | + | |
− | [[5.8 Speaker spatial localization]]
| + | |
− | | + | |
− | [[5.9 Voice separation]]
| + | |
− | | + | |
− | [[5.10 Music signal processing and understanding]]
| + | |
− | | + | |
− | [[5.11 Singing analysis]]
| + | |
− | | + | |
− | [[5.12 Special Session: Speech, audio, and language processing techniques applied to bird and animal vocalisations ]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[6 SPEECH CODING AND ENHANCEMENT]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[6.1 Speech coding and transmission]]
| + | |
− | | + | |
− | [[6.2 Low-bit-rate speech coding]]
| + | |
− | | + | |
− | [[6.3 Perceptual audio coding of speech signals]]
| + | |
− | | + | |
− | [[6.4 Noise reduction for speech signals]]
| + | |
− | | + | |
− | [[6.5 Speech enhancement: single-channel]]
| + | |
− | | + | |
− | [[6.6 Speech enhancement: multi-channel]]
| + | |
− | | + | |
− | [[6.7 Speech intelligibility]]
| + | |
− | | + | |
− | [[6.8 Active noise control]]
| + | |
− | | + | |
− | [[6.9 Speech enhancement in hearing aids]]
| + | |
− | | + | |
− | [[6.10 Adaptive beamforming for speech enhancement]]
| + | |
− | | + | |
− | [[6.11 Dereverberation for speech signals]]
| + | |
− | | + | |
− | [[6.12 Echo cancelation for speech signals]]
| + | |
− | | + | |
− | [[6.13 Evaluation of speech transmission, coding and enhancement]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[7 SPEECH SYNTHESIS AND SPOKEN LANGUAGE GENERATION]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[7.1 Grapheme-to-phoneme conversion for synthesis]]
| + | |
− | | + | |
− | [[7.2 Text processing for speech synthesis]]
| + | |
− | | + | |
− | [[7.3 Signal processing/statistical models for synthesis]]
| + | |
− | | + | |
− | [[7.4 Speech synthesis paradigms and methods]]
| + | |
− | | + | |
− | [[7.5 Articulatory speech synthesis]]
| + | |
− | | + | |
− | [[7.6 Segment-level and/or concatenative synthesis]]
| + | |
− | | + | |
− | [[7.7 Unit selection speech synthesis]]
| + | |
− | | + | |
− | [[7.8 Statistical parametric speech synthesis]]
| + | |
− | | + | |
− | [[7.9 Prosody modeling and generation]]
| + | |
− | | + | |
− | [[7.10 Expression, emotion and personality generation]]
| + | |
− | | + | |
− | [[7.11 Synthesis of singing voices]]
| + | |
− | | + | |
− | [[7.12 Voice modification, conversion and morphing]]
| + | |
− | | + | |
− | [[7.13 Concept-to-speech conversion]]
| + | |
− | | + | |
− | [[7.14 Cross-lingual and multilingual aspects in speech synthesis]]
| + | |
− | | + | |
− | [[7.15 Avatars and talking faces]]
| + | |
− | | + | |
− | [[7.16 Tools and data for speech synthesis]]
| + | |
− | | + | |
− | [[7.17 Evaluation of speech synthesis]]
| + | |
− | | + | |
− | [[7.18 Special Session: Singing Synthesis Challenge: Fill-In the Gap]]
| + | |
− | | + | |
− | [[7.19 Special Session: Voice Conversion Challenge 2016]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[8 SPEECH RECOGNITION: SIGNAL PROCESSING, ACOUSTIC MODELING, ROBUSTNESS, ADAPTATION]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[8.1 Feature extraction and low-level feature modeling for ASR]]
| + | |
− | | + | |
− | [[8.2 Prosodic features and models]]
| + | |
− | | + | |
− | [[8.3 Robustness against noise, reverberation]]
| + | |
− | | + | |
− | [[8.4 Far field and microphone array speech recognition]]
| + | |
− | | + | |
− | [[8.5 Speaker normalization (e.g., VTLN)]]
| + | |
− | | + | |
− | [[8.6 New types of neural network models and learning (e.g., new variants of DNN, CNN)]]
| + | |
− | | + | |
− | [[8.7 Discriminative acoustic training methods for ASR]]
| + | |
− | | + | |
− | [[8.8 Acoustic model adaptation (speaker, bandwidth, emotion, accent)]]
| + | |
− | | + | |
− | [[8.9 Speaker adaptation, speaker adapted training methods]]
| + | |
− | | + | |
− | [[8.10 Pronunciation variants and modeling for speech recognition]]
| + | |
− | | + | |
− | [[8.11 Acoustic confidence measures]]
| + | |
− | | + | |
− | [[8.13 Cross-lingual and multilingual aspects, non-native accents]]
| + | |
− | | + | |
− | [[8.14 Acoustic modeling for conversational speech (dialog, interaction)]]
| + | |
− | | + | |
− | [[8.15 Evaluation of speech recognition]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[9 SPEECH RECOGNITION - ARCHITECTURE, SEARCH, AND LINGUISTIC COMPONENTS]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[9.1 Lexical modeling and access: units and models]]
| + | |
− | | + | |
− | [[9.2 Automatic lexicon learning]]
| + | |
− | | + | |
− | [[9.3 Supervised/unsupervised morphological models]]
| + | |
− | | + | |
− | [[9.4 Prosodic features and models for language modeling]]
| + | |
− | | + | |
− | [[9.5 Discriminative training methods for language modeling]]
| + | |
− | | + | |
− | [[9.6 Language model adaptation (domain, diachronic adaptation)]]
| + | |
− | | + | |
− | [[9.7 Language modeling for conversational speech (dialog, interaction)]]
| + | |
− | | + | |
− | [[9.8 Neural networks for language modeling]]
| + | |
− | | + | |
− | [[9.9 Search methods, decoding algorithms, lattices, multipass strategies]]
| + | |
− | | + | |
− | [[9.10 New computational strategies, data-structures for ASR]]
| + | |
− | | + | |
− | [[9.11 Computational resource constrained speech recognition]]
| + | |
− | | + | |
− | [[9.12 Confidence measures]]
| + | |
− | | + | |
− | [[9.13 Cross-lingual and multilingual components for speech recognition]]
| + | |
− | | + | |
− | [[9.14 Structured classification approaches]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[10 SPEECH RECOGNITION - TECHNOLOGIES AND SYSTEMS FOR NEW APPLICATIONS]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[10.1 Multimodal systems]]
| + | |
− | | + | |
− | [[10.2 Applications in education and learning (incl. CALL, assessment of fluency)]]
| + | |
− | | + | |
− | [[10.3 Applications in medical practice (CIS, voice assessment, etc.)]]
| + | |
− | | + | |
− | [[10.4 Speech science in end-user applications]]
| + | |
− | | + | |
− | [[10.5 Rich transcription]]
| + | |
− | | + | |
− | [[10.6 Innovative products and services based on speech technologies]]
| + | |
− | | + | |
− | [[10.7 Sparse, template-based representations]]
| + | |
− | | + | |
− | [[10.8 New paradigms (e.g. artic. models, silent speech interfaces, topic models)]]
| + | |
− | | + | |
− | [[10.9 Special Session: Sub-Saharan African languages: from speech fundamentals to applications]]
| + | |
− | | + | |
− | [[10.10 Special Session: Realism in robust speech processing ]] | + | |
− | | + | |
− | [[10.11 Special Session: Sharing Research and Education Resources for Understanding Speech Processing]]
| + | |
− | | + | |
− | [[10.12 Special Session: Speech and Language Technologies for Human-Machine Conversation-based Language Education]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[11 SPOKEN LANGUAGE PROCESSING - DIALOG, SUMMARIZATION, UNDERSTANDING]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[11.1 Spoken dialog systems]]
| + | |
− | | + | |
− | [[11.2 Multimodal human-machine interaction (conversat. agents, human-robot)]]
| + | |
− | | + | |
− | [[11.3 Analysis of verbal, co-verbal and nonverbal behavior]]
| + | |
− | | + | |
− | [[11.4 Interactive systems for speech/language training, therapy, communication aids]]
| + | |
− | | + | |
− | [[11.5 Stochastic modeling for dialog]]
| + | |
− | | + | |
− | [[11.6 Question-answering from speech]]
| + | |
− | | + | |
− | [[11.7 Spoken document summarization]]
| + | |
− | | + | |
− | [[11.8 Systems for spoken language understanding]]
| + | |
− | | + | |
− | [[11.9 Topic spotting and classification]]
| + | |
− | | + | |
− | [[11.10 Entity extraction from speech]]
| + | |
− | | + | |
− | [[11.11 Semantic analysis and classification]]
| + | |
− | | + | |
− | [[11.12 Conversation and interaction]]
| + | |
− | | + | |
− | [[11.13 Evaluation of speech and multimodal dialog systems]]
| + | |
− | | + | |
− | [[11.14 Evaluation of summarization and understanding]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[12 SPOKEN LANGUAGE PROCESSING: TRANSLATION, INFORMATION RETRIEVAL, RESOURCES]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[12.1 Spoken machine translation]]
| + | |
− | | + | |
− | [[12.2 Speech-to-speech translation systems]]
| + | |
− | | + | |
− | [[12.3 Transliteration]]
| + | |
− | | + | |
− | [[12.4 Voice search]]
| + | |
− | | + | |
− | [[12.5 Spoken term detection]]
| + | |
− | | + | |
− | [[12.6 Audio indexing]]
| + | |
− | | + | |
− | [[12.7 Spoken document retrieval]]
| + | |
− | | + | |
− | [[12.8 Systems for mining spoken data, search or retrieval of speech documents]]
| + | |
− | | + | |
− | [[12.9 Speech and multimodal resources and annotation]]
| + | |
− | | + | |
− | [[12.10 Metadata descriptions of speech, audio and text resources]]
| + | |
− | | + | |
− | [[12.11 Metadata for semantic or content markup]]
| + | |
− | | + | |
− | [[12.12 Metadata for ling./discourse structure (disfluencies, boundaries, speech acts)]]
| + | |
− | | + | |
− | [[12.13 Methodologies and tools for language resource construction and annotation]]
| + | |
− | | + | |
− | [[12.14 Automatic segmentation and labeling of resources]]
| + | |
− | | + | |
− | [[12.15 Multilingual resources]]
| + | |
− | | + | |
− | [[12.16 Evaluation and quality insurance of language resources]]
| + | |
− | | + | |
− | [[12.17 Evaluation of translation and information retrieval systems]]
| + | |
− | | + | |
− | [[12.18 Special Session: Open Data for Under-Resourced Languages]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[13 SPEECH AND SPOKEN-LANGUAGE BASED MULTIMODAL PROCESSING AND SYSTEMS]]
| + | |
− | | + | |
− | | + | |
− | | + | |
− | [[13.1 Multimodal Speech Recognition]]
| + | |
− | | + | |
− | [[13.2 Multimodal LVCSR Systems]]
| + | |
− | | + | |
− | [[13.3 Multimodal Speech Analysis]]
| + | |
− | | + | |
− | [[13.4 Multimodal Synthesis]]
| + | |
− | | + | |
− | [[13.5 Multimodal Language Analysis ]]
| + | |
− | | + | |
− | [[13.6 Multimodal and multimedia language trait recognition ]]
| + | |
− | | + | |
− | [[13.7 Multimodal paralinguistics ]]
| + | |
− | | + | |
− | [[13.8 Multimodal interactions, interfaces]]
| + | |
− | | + | |
− | [[13.9 Special Session: Auditory-visual expressive speech and gesture in humans and machines]]
| + | |