* Ziwei Bai (白子薇)

==Work Process==

===Paper Share===

====2016-06-23====

Learning Better Embeddings for Rare Words Using Distributional Representations [http://aclweb.org/anthology/D15-1033 pdf]

Hierarchical Attention Networks for Document Classification [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf pdf]

Hierarchical Recurrent Neural Network for Document Modeling [http://www.aclweb.org/anthology/D/D15/D15-1106.pdf pdf]

Learning Distributed Representations of Sentences from Unlabelled Data [http://arxiv.org/pdf/1602.03483.pdf pdf]

Speech Synthesis Based on Hidden Markov Models [http://www.research.ed.ac.uk/portal/files/15269212/Speech_Synthesis_Based_on_Hidden_Markov_Models.pdf pdf]

===Research Task===

====Binary Word Embedding (Aiting)====

[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/9/97/Binary.pdf binary]

2016-06-05: find out that TensorFlow does not provide gradients for logical operations.
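
A common workaround in binary-embedding work (not necessarily what this project adopted) is a straight-through estimator: binarize on the forward pass, but let gradients pass through as if the op were the identity. A minimal sketch against the current TensorFlow API; all names and sizes are illustrative:

<syntaxhighlight lang="python">
import tensorflow as tf

def binarize_straight_through(x):
    # Forward pass: sign(x) in {-1, +1}. Backward pass: identity.
    # tf.sign has a zero gradient and logical ops have none at all, so the
    # trick routes gradients around the binarization via tf.stop_gradient.
    return x + tf.stop_gradient(tf.sign(x) - x)

# Toy check: gradients still reach the real-valued embedding table.
emb = tf.Variable(tf.random.normal([5, 8]))
with tf.GradientTape() as tape:
    code = binarize_straight_through(emb)
    loss = -tf.reduce_sum(code[0] * code[1])  # stand-in objective
print(tape.gradient(loss, emb) is not None)   # True
</syntaxhighlight>
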
2016-06-01: complete the first version of the binary word embedding model.

2016-05-28: complete the word2vec model in TensorFlow.

2016-05-25: write my own version of the word2vec model.

2016-05-23:
1. Get TensorFlow's word2vec model from https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/embedding
2. Learn the word2vec_basic model.
3. Run word2vec.py and word2vec_optimized.py.

2016-05-22:
1. Find the tf.logical_xor(x, y) method in TensorFlow to compute Hamming distance.
2. Learn TensorFlow's word2vec model.
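
A minimal sketch of that Hamming-distance computation; the 8-bit codes below are hypothetical, the real codes come from the binary embedding model:

<syntaxhighlight lang="python">
import tensorflow as tf

# Hypothetical binary codes for two words.
a = tf.constant([True, False, True, True, False, False, True, False])
b = tf.constant([True, True, False, True, False, True, True, False])

# XOR marks the positions where the codes disagree; the Hamming distance
# is the count of such positions.
hamming = tf.reduce_sum(tf.cast(tf.logical_xor(a, b), tf.int32))
print(int(hamming))  # 3
</syntaxhighlight>
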
2016-05-21:
1. Read Lantian's paper 'Binary Speaker Embedding'.
2. Try to find a formula in TensorFlow to compute Hamming distance.

====Ordered Word Embedding (Aodong)====

: 2016-07-25, 26, 27, 28, 29 : Share the ACL paper and implement rare word embedding
: 2016-07-18, 19, 20, 21, 22 : Find a paper to start from and ask for the corpora
: 2016-07-13, 14, 15 : Debug the mixed version of Rare Word Embedding
: 2016-07-12 : Complete the mixed-initialization version of Rare Word Embedding and start training
: 2016-07-11 :
:: Improve the prediction process of the chatting model
:: Change some hyperparameters of the chatting model to speed up training
: 2016-07-09, 10 : Try to reproduce the paper's low-frequency word experiment, and do some reading in PRML and Hang Li's paper
: 2016-07-08 : Do model selection; the model finally set off on the server
: 2016-07-07 : Complete the chatting model and run it on Huilan's server, to get around the GPU memory problem
: 2016-07-06 : Finally complete the translation model in TensorFlow! Now dealing with the out-of-memory problem
: 2016-07-05 :
:: Code the prediction process
:: Although the cost value is low, the prediction results are not as expected, even when the inputs come from the training set
:: When trying the Weibo data, the program crashed with an out-of-memory error
: 2016-07-04 : Complete coding the training process
: 2016-07-01, 02 : The cost function is very bumpy; debug it, though it is quite difficult!
: 2016-06-27, 28, 29 : Coding
: 2016-06-26 : Code TF's GRU and attention model
: 2016-06-25 : Read TF's source code rnn_cell.py and seq2seq.py
: 2016-06-24 :
:: Code the Spearman correlation coefficient (a sketch follows this list) and experiment
:: Read Li's paper "Neural Responding Machine for Short-Text Conversation"
: 2016-06-23 :
:: Share the paper "Learning Better Embeddings for Rare Words Using Distributional Representations"
:: Experiment and receive a new task
: 2016-06-22 :
:: Experiment on low-frequency words
:: Roughly read "Online Learning of Interpretable Word Embeddings"
:: Roughly read "Learning Better Embeddings for Rare Words Using Distributional Representations"
: 2016-06-21 : Experiment and calculate the cosine distance between words (a sketch follows this list)
: 2016-06-20 : Something went wrong with my program; fix it and start all over again
: 2016-06-04 : Run the semantic & syntactic analysis of the retrained word vectors
: 2016-06-03 : Complete coding the retraining process for low-frequency words and run the semantic & syntactic analysis
: 2016-06-02 : Complete coding the prediction process for low-frequency words and run the semantic & syntactic analysis
: 2016-06-01 : Read "Distributed Representations of Words and Phrases and their Compositionality"
: 2016-05-31 :
:: Read Mikolov's slides about his word embedding papers
:: Test the randomness of word2vec; reruns in a single thread give identical results
:: Download the datasets "Microsoft syntactic test set", "wordsim353", and "simlex-999"
: 2016-05-30 : Read "Hierarchical Probabilistic Neural Network Language Model" and "word2vec Explained: Deriving Mikolov's Negative-Sampling Word-Embedding Method"
: 2016-05-27 : Reread the word2vec paper and read the C version of word2vec
: 2016-05-24 : Understand word2vec in TensorFlow; because of some incomplete functions, decide to adapt the source of the C version of word2vec
: 2016-05-23 :
:: Basic setup of TensorFlow
:: Read the code of word2vec in TensorFlow
: 2016-05-22 :
:: Learn about the algorithms in word2vec
:: Read the low-frequency word paper and learn about its 6 strategies
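
Sketch for the 2016-06-24 entry above: a simple Spearman correlation, as used to compare model similarity scores against human ratings on sets like wordsim353. The arrays below are illustrative, not real results:

<syntaxhighlight lang="python">
import numpy as np

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the ranks; this
    # double-argsort ranking ignores tie correction, which is usually
    # fine for real-valued similarity scores.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# Hypothetical evaluation: model scores vs. human ratings for 4 word pairs.
model_scores = np.array([0.91, 0.12, 0.55, 0.43])
human_ratings = np.array([9.1, 1.5, 6.0, 3.9])
print(spearman_rho(model_scores, human_ratings))  # 1.0: identical ordering
</syntaxhighlight>
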
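Sketch for the 2016-06-21 entry above: the cosine distance between two word vectors (toy vectors, non-zero assumed):

<syntaxhighlight lang="python">
import numpy as np

def cosine_distance(u, v):
    # 1 - cos(u, v); assumes non-zero word vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0
</syntaxhighlight>
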
[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/3/39/How_to_deal_with_low_frequency_words.pdf low_freq]

[http://cslt.riit.tsinghua.edu.cn/mediawiki/images/2/2c/Lowv.pdf order_rep]
====Matrix Factorization (Ziwei)====

[http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf matrix-factorization]

2016-06-23: prepare for the report

2016-05-28:
learn the code 'matrix-factorization.py', 'count_word_frequence.py', and 'reduce_rawtext_matrix_factorization.py'
Problem: I have no idea how to run the program or where the data is.

2016-05-23: read the code 'map_rawtext_matrix_factorization.py'

2016-05-22: learn the rest of the paper 'Neural Word Embedding as Implicit Matrix Factorization'

2016-05-21: learn the 'Abstract' and 'Introduction' of the paper 'Neural Word Embedding as Implicit Matrix Factorization'
===Question Answering System===

====Chao Xing====

2016-05-30 ~ 2016-06-04 :
Deliver the CDSSM model to Huilan.

2016-05-29 :
Package the chatting model for practical use.

2016-05-28 :
Fix bugs.

2016-05-27 :
Train a large-scale model; find some problems.

2016-05-26 :
Modify the test program for the large-scale testing process.

2016-05-24 :
Build the CDSSM model on Huilan's machine.

2016-05-23 :
Found three things to do:
1. Change the cost function to maximize QA+ - QA- (a margin-loss sketch follows below).
2. Use different parameter spaces for the Q space and the A space.
3. For HRNN, two tricky choices: use the output layer or the hidden layer as the input to the decoder's softmax layer.
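
A minimal sketch of item 1, assuming a hinge-style formulation with an illustrative margin of 0.5 (the project's actual cost function may differ):

<syntaxhighlight lang="python">
import tensorflow as tf

def margin_ranking_loss(pos_score, neg_score, margin=0.5):
    # Pushes score(Q, A+) above score(Q, A-) by at least `margin`:
    # loss = mean(max(0, margin - (pos - neg))).
    return tf.reduce_mean(tf.maximum(0.0, margin - (pos_score - neg_score)))

# Toy cosine-style scores for a batch of (Q, A+, A-) triples.
pos = tf.constant([0.9, 0.4, 0.7])
neg = tf.constant([0.2, 0.5, 0.6])
print(margin_ranking_loss(pos, neg).numpy())
</syntaxhighlight>
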
2016-05-22 :
1. Investigate different loss functions for the chatting model.

2016-05-21 :
1. Hand out different research tasks to the intern students.

2016-05-20 :
1. Test the denoising RNN generation model.

2016-05-19 :
1. Explore the denoising RNN.

2016-05-18 :
1. Modify the model for the crawled data.

2016-05-17 :
1. Code & test the HRNN model.

2016-05-16 :
1. Finish work on the CDSSM model.

2016-05-15 :
1. Test the packaged version of the CDSSM model.

2016-05-13 :
1. Finish coding the packaged version of the CDSSM model; waiting to test.

2016-05-12 :
1. Begin to package the CDSSM model for Huilan.

2016-05-11 :
1. Prepare for paper sharing.
2. Finish the CDSSM model in the chatting process.
3. Start to set up the model & experiments in the dialogue system.

2016-05-10 :
1. Finish testing the CDSSM model in chatting; find that the original data has some problems.
2. Read papers:
A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion
A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Neural Responding Machine for Short-Text Conversation

2016-05-09 :
1. Test the CDSSM model in the chatting model.
2. Read papers:
Learning from Real Users: Rating Dialogue Success with Neural Networks for Reinforcement Learning in Spoken Dialogue Systems
SimpleDS: A Simple Deep Reinforcement Learning Dialogue System
3. Code an RNN by myself in TensorFlow.

2016-05-08 :
Fix some problems in the dialogue system team, and continue reading papers on dialogue systems.

2016-05-07 :
Read some papers on dialogue systems.

2016-05-06 :
Try to fix the RNN-DSSM model in TensorFlow. Failure.

2016-05-05 :
Code the RNN-DSSM in TensorFlow. Hit an error when running the RNN-DSSM model on CPU: memory keeps increasing.
Huilan's TensorFlow is version 0.7.0, installed by pip; this causes an error when creating the GPU graph.
One possible solution is to build TensorFlow from source.
====Aiting Liu====

2016-08-08 ~ 2016-08-12 :
1. Finish the first version of chapter 2.
2. Read the paper "A Sentence Interaction Network for Modeling Dependence between Sentences" [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/0/04/A_Sentence_Interaction_Network_for_Modeling_Dependence_between_Sentences.pdf pdf]
2016-08-05: write the section on probabilistic PCA

2016-08-04: write the section on softmax regression

2016-08-03: write the section on logistic regression

2016-08-02: write the section on polynomial fitting and linear regression

2016-08-01: learn about linear models; determine the content of chapter 2

2016-07-28 ~ 2016-07-29: learn the lesson on linear models

2016-07-25 ~ 2016-07-27: read the paper Intrinsic Subspace Evaluation of Word Embedding Representations [http://cslt.riit.tsinghua.edu.cn/mediawiki/images/6/68/Intrinsic_Subspace_Evaluation_of_Word_Embedding_Representations.pdf pdf]

2016-07-22 ~ 2016-07-23: run and modify the lyrics generation model

2016-07-19 ~ 2016-07-21: preprocess the 200,000 lyrics

2016-07-18: get 200,000 songs from http://www.cnlyric.com/ (singer list from a-z)

2016-07-08: preprocess the lyrics from Baidu Music

2016-07-07: get 27,963 songs from http://www.cnlyric.com/ (singer list from A/B/C)

2016-07-06: try to get lyrics from http://www.kuwo.cn/ http://www.kugou.com/ http://y.qq.com/#type=index http://music.163.com/

2016-07-05: write a lyrics spider, and get 56,306 songs from http://music.baidu.com/

2016-07-04: learn TensorFlow

2016-07-01: submit the APSIPA2016 paper

2016-06-30: polish the paper

2016-06-29: complete the ordered word embedding paper

2016-06-26: modify the ordered word embedding paper

2016-06-25: complete the ordered word embedding experiments and get 54 figures

2016-06-23: read Bengio's paper https://arxiv.org/pdf/1605.06069v3.pdf

2016-06-22: read Bengio's paper http://arxiv.org/pdf/1507.04808v3.pdf

2016-06-13:

[[File:Classification.jpg]]

2016-06-12:

[[File:Similarity.jpg]]
2016-06-05: complete the binary word embedding; find that TensorFlow does not provide gradients for logical operations

2016-06-04: write the binary word embedding model

2016-06-01:
1. Record a demo video of our Personalized Chatterbot.
2. Program the binary word embedding model.

2016-05-31: debug our Personalized Chatterbot

2016-05-30: complete our Personalized Chatterbot

2016-05-29:
1. Scan Chao's code and modify it.
2. Run the modified program to get the whole matrix of the eight hundred thousand sentences.

2016-05-28:
1. Complete the word2vec model in TensorFlow.
2. Complete the first version of the binary word embedding model.

2016-05-25: write my own version of the word2vec model

2016-05-23:
1. Get TensorFlow's word2vec model from https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/embedding
2. Learn the word2vec_basic model.
3. Run word2vec.py and word2vec_optimized.py; we need a Chinese evaluation dataset if we want to use it directly.

2016-05-22:
1. Find the tf.logical_xor(x, y) method in TensorFlow to compute Hamming distance.
2. Learn TensorFlow's word2vec model.

2016-05-21:
1. Read Lantian's paper 'Binary Speaker Embedding'.
2. Try to find a formula in TensorFlow to compute Hamming distance.

2016-05-18:
Fetch American TV subtitles and process them into a specific format (12.6M)
(1. Sex and the City 2. Gossip Girl 3. Desperate Housewives 4. The IT Crowd 5. Empire 6. 2 Broke Girls)

2016-05-16: process the data collected from the interview site, interview books, and American TV subtitles (38.2M + 23.2M)

2016-05-11:
Fetch American TV subtitles
(1. Friends 2. The Big Bang Theory 3. Descendants of the Sun 4. Modern Family 5. House M.D. 6. Grey's Anatomy)

2016-05-08: fetch data from 'http://news.ifeng.com/' and 'http://www.xinhuanet.com/' (13.4M)

2016-05-07: fetch data from 'http://fangtan.china.com.cn/' and interview books (10M)

2016-05-04: establish the overall framework of our chat robot, and continue to build the database
====Ziwei Bai====

2016-07-29:
download & learn LaTeX

2016-07-25 ~ 2016-07-28:
1. Debug the RNN-based TTS (results not ideal).
2. Run the RNN-based TTS.
3. Write the template.

2016-07-21 ~ 2016-07-23:
build an RNN model for TTS

2016-07-18 ~ 2016-07-19:
1. Run the bottleneck model with different parameters.
2. Prepare the bi-weekly report.
3. Draw a chart to compare the different models.

2016-07-14 ~ 2016-07-15:
build the bottleneck model (non-linear layers: sigmoid, relu, tanh)

2016-07-12 ~ 2016-07-13:
modify the TTS program:
1. Separate the classification and transfer parts.
2. Separate lf0 and mgc.

2016-07-11:
finish the patent

2016-07-07 ~ 2016-07-08:
1. Program an LSTM with TensorFlow (still has some bugs).
2. Learn the paper 'Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices'.

2016-07-06:
finish the second version of the patent

2016-07-05:
finish the first version of the patent

2016-07-04:
1. Debug and run the chatting model with softmax.
2. Determine the model for the patent on 'LSTM-based modern text to poetry conversion technology'.

2016-07-01:
the model updated yesterday can't converge; try to learn tf.nn.sampled_softmax_loss() (a sketch follows below)
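
A minimal sketch of how tf.nn.sampled_softmax_loss is typically wired up. All sizes and tensors below are illustrative, and keyword arguments are used because the positional order has changed across TensorFlow versions:

<syntaxhighlight lang="python">
import tensorflow as tf

vocab, dim, batch, num_sampled = 10000, 128, 4, 64     # illustrative sizes
weights = tf.Variable(tf.random.normal([vocab, dim]))  # output projection
biases = tf.Variable(tf.zeros([vocab]))
hidden = tf.random.normal([batch, dim])                 # stand-in decoder states
labels = tf.constant([[7], [42], [3], [9]], tf.int64)   # [batch, 1] target ids

# Sampled softmax scores each true word against `num_sampled` random
# negatives instead of the full vocabulary; it is a training-time
# approximation, so evaluation should still use the full softmax.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=weights, biases=biases, labels=labels, inputs=hidden,
    num_sampled=num_sampled, num_classes=vocab))
print(loss.numpy())
</syntaxhighlight>
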
2016-06-30:
convert our chatting model from negative sampling to a full softmax (tf.nn.softmax) and convert the cost from cosine to cross-entropy (a sketch follows below)
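
A minimal sketch of the full-softmax cross-entropy cost, using TensorFlow's fused op; shapes and tensors are illustrative:

<syntaxhighlight lang="python">
import tensorflow as tf

batch, vocab = 4, 10000                         # illustrative sizes
logits = tf.random.normal([batch, vocab])       # stand-in decoder outputs
targets = tf.constant([7, 42, 3, 9], tf.int64)  # gold next-word ids

# Cross-entropy over the full vocabulary: equivalent to
# -log(tf.nn.softmax(logits)[i, targets[i]]) averaged over the batch,
# but computed in a numerically stable fused op.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))
print(loss.numpy())
</syntaxhighlight>
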
2016-06-29:
learn the paper 'Neural Responding Machine for Short-Text Conversation'

2016-06-23:
learn the paper 'Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models'
http://arxiv.org/pdf/1507.04808v3.pdf

2016-06-22:
1. Construct vectors for words segmented by jieba.
2. Retrain the CDSSM model with the new word vectors (still running).

2016-06-04:
1. Modify the interface for the QA system.
2. Pull together the interface and the QA system.

2016-06-01:
1. Add the data source and performance test results to the work report.
2. Learn PyQt.

2016-05-30:
complete the work report

2016-05-29:
write code that takes an input question and returns the answer set whose question is most similar to the input (a retrieval sketch follows below)
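
A minimal sketch of that retrieval step, assuming questions are already embedded as vectors; the index and answer sets below are toy data:

<syntaxhighlight lang="python">
import numpy as np

def retrieve(query_vec, question_vecs, answers):
    # Score the query against every stored question vector by cosine
    # similarity and return the answers of the best-matching question.
    q = query_vec / np.linalg.norm(query_vec)
    m = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    return answers[int(np.argmax(m @ q))]

# Toy index of two stored questions and their answer sets.
index = np.array([[1.0, 0.0], [0.0, 1.0]])
answers = [["answer set A"], ["answer set B"]]
print(retrieve(np.array([0.9, 0.1]), index, answers))  # ['answer set A']
</syntaxhighlight>
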
2016-05-25:
1. Learn DSSM.
2. Complete the first version of the work report.
3. Construct basic Q&A (name, age, job, and so on).

2016-05-23:
write code for searching for questions on 'zhihu.sogou.com' and for answers on Zhihu

2016-05-21:
learn the second half of the paper 'A Neural Conversational Model'

2016-05-18:
1. Crawl QA pairs from http://www.chinalife.com.cn/publish/zhuzhan/index.html and http://www.pingan.com/
2. Find the paper 'A Neural Conversational Model' on Google Scholar and learn the first half of it.

2016-05-16:
1. Find the datasets used in the paper 'Neural Responding Machine for Short-Text Conversation'.
2. Reconstruct 15 scripts into our expected format.

2016-05-15:
1. Find 130 scripts.
2. Reconstruct 11 scripts into our expected format.
Problem: for many files, the program cannot distinguish dialogue from scene descriptions.

2016-05-11:
1. Read the paper "Movie-DiC: a Movie Dialogue Corpus for Research and Development".
2. Reconstruct a new film script into our expected format.

2016-05-08: convert the PDFs we found yesterday into txt, and reconstruct the data into our expected format

2016-05-07: find 9 drama scripts and 20 film scripts

2016-05-04: find and process the data for the QA system
====Andi Zhang====

2016-08-05:
Give a report on my research on sentence similarity

2016-08-04:
Give a presentation on NNLMs; review the papers read earlier this week

2016-08-03:
Read the paper ''Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks''

2016-08-02:
Read the paper ''Modeling Interestingness with Deep Neural Networks''

2016-08-01:
Read papers about ABCNN for modeling sentence pairs

2016-07-25 ~ 2016-07-29:
Read papers about the theory and implementation of NNLM, RNNLM & word2vec; prepare a presentation on this topic

2016-07-22:
Read papers about CBOW & Skip-gram
===Generation Model (Aodong Li)===

: 2016-05-21 : Complete my biweekly report and take over a new task -- low-frequency words
: 2016-05-20 :
:: Optimize my code to speed it up
:: Train the models with a GPU
:: However, it does not converge :(
: 2016-05-19 : Code a simple version of the keywords-to-sequence model and train it
: 2016-05-18 : Debug the keywords-to-sequence model and train it
: 2016-05-17 : Make the technical details clear and code the keywords-to-sequence model
: 2016-05-16 : Denoise and segment more lyrics and prepare for the keywords-to-sequence model
: 2016-05-15 : Train some different models and analyze performance: song to song, paragraph to paragraph, etc.
: 2016-05-12 : Complete the sequence-to-sequence model's prediction process and the whole standard LSTM-based sequence-to-sequence model v0.0
: 2016-05-11 : Complete the sequence-to-sequence model's training process in Theano
: 2016-05-10 : Complete the LSTM-based sequence-to-sequence model in Theano
: 2016-05-09 : Try to code the sequence-to-sequence model
: 2016-05-08 :
:: Denoise and train word vectors on Lijun Deng's lyrics (110+ pieces)
:: Decide on using the raw sequence-to-sequence model
: 2016-05-07 :
:: Study the attention-based model
:: Learn some details about the poem generation model
:: Change my focus to the lyrics generation model
: 2016-05-06 : Read the paper about poem generation and learn about LSTM
: 2016-05-05 : Check in and get an overview of the generation model
===Jiyuan Zhang===

: 2016-05-01 ~ 06 : Modify the input format and run the LSTM-RBM model (16-beat, 32-beat, bar)
: 2016-05-09 ~ 13 :
:: Modify the model parameters and run the model; the results are not ideal yet
:: Following Prof. Wang's suggestion, in the generation stage, replace random sampling with maximum-probability generation (a sketch follows this list)
: 2016-05-24 ~ 27 : Check the blog's code and understand the model and input format details from the blog
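
A minimal sketch of that change for a single generation step; the function and the probability vector below are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def next_event(probs, greedy=True):
    # Greedy generation picks the most probable next event; the earlier
    # behaviour sampled the next event at random from the distribution.
    if greedy:
        return int(np.argmax(probs))
    return int(np.random.choice(len(probs), p=probs))

probs = np.array([0.1, 0.6, 0.3])
print(next_event(probs))                # 1, always
print(next_event(probs, greedy=False))  # 0, 1, or 2 at random
</syntaxhighlight>
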
| |
==Past progress==