Difference between revisions of "2013-08-02"

From cslt Wiki
(Created page with "== Data sharing == * LM count files still undelivered! == DNN progress == === Experiments === * Discriminative DNN Use sequential MPE/MMI/bMMI(0.1) (with the DNN-b...")
 

Revision as of 05:54, 2 August 2013

Data sharing

  • LM count files still undelivered!

DNN progress

Experiments

  • Discriminative DNN

Sequence-discriminative training with MPE/MMI/bMMI (boost factor 0.1), using DNN-based alignments and denominator lattices. 100-hour training set; network structure: 100 + 4 X 800 + 2100:

TASK         cross-entropy (original)  MPE (it1)  MPE (it2)  MPE (it3)  MMI (it1)  MMI (it2)  MMI (it3)  bMMI (it1)  bMMI (it2)
map          22.98                     23.91      23.26      22.84      22.30      21.92      21.64      21.99       21.82
2044         21.94                     25.92      24.47      24.10      21.30      21.13      21.11      21.50       22.06
notetp3      14.73                     21.64      18.83      19.16      14.68      14.57      14.25      14.52       15.06
record1900    8.45                      8.93       7.60       8.46       6.64       6.27       6.07       6.76        6.20
general      34.0                      35.29      33.72      33.62      33.80      33.85      33.68      33.27       33.25
online1      34.16                     31.70      31.45      31.33      32.70      32.39      32.27      32.51       32.05
online2      27.10                     24.56      24.42      24.37      25.18      24.90      24.76      25.02       24.70
speedup      24.1                      22.93      21.86      21.60      21.94      22.00      22.26      21.92       21.35
  • MMI seems more robust than MPE: the former provides consistent gains, while the latter behaves differently on different test sets. bMMI appears more robust than MPE but less robust than MMI; more investigation could be done with different boost factors (the bMMI objective is sketched below for reference). These observations might be explained by the discrepancy between training and test data: discriminative training suits test sets whose conditions are consistent with the training condition.
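For reference, a standard form of the boosted-MMI objective behind "bMMI(0.1)" above; the notation below is not from this page, b is the boost factor (0.1 in the table), and b = 0 recovers plain MMI:

<pre>
% Boosted-MMI objective over training utterances u with reference s_u;
% kappa is the acoustic scale and A(s, s_u) the raw accuracy of hypothesis s
% against the reference. MPE instead maximizes the expected phone accuracy,
% which may explain its different behavior across test sets.
\mathcal{F}_{\mathrm{bMMI}}(\lambda)
  = \sum_{u} \log
    \frac{p_{\lambda}(X_u \mid M_{s_u})^{\kappa} \, P(s_u)}
         {\sum_{s} p_{\lambda}(X_u \mid M_{s})^{\kappa} \, P(s) \, e^{-b \, A(s, s_u)}}
</pre>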


Tencent exps

GPU & CPU merge

  1. Hold

Confidence estimation

DNN confidence

  • We are interested in estimating confidence directly from the DNN output. Such a confidence is naturally a posterior and does not rely on the decoding graph, so it generalizes easily, e.g., to judging which output is best among the results of multiple decoders (a minimal sketch follows below).
  • Silence frames are now excluded when computing the confidence.
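A minimal sketch of this confidence, assuming the frame-level state posteriors from the DNN and the best-path state sequence from decoding/alignment are already available (function and variable names are illustrative, not from our code):

<pre>
import numpy as np

def dnn_confidence(posteriors, best_path, silence_states):
    """Average DNN posterior of the aligned state along the best path.

    posteriors     : (T, S) array of per-frame state posteriors from the DNN
    best_path      : length-T sequence of state indices from decoding/alignment
    silence_states : set of state indices excluded from the average
    """
    scores = [posteriors[t, s]
              for t, s in enumerate(best_path)
              if s not in silence_states]
    if not scores:                 # utterance aligned entirely to silence
        return 0.0
    return float(np.mean(scores))
</pre>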


  • To be done:
  1. Large-scale test.
  2. CI-phone posterior-based (instead of state posterior-based), full-path (instead of best-path) confidence estimation.

Multi-graph decoding based on DNN confidence

  • Code done; currently only serial multi-graph support is implemented. A simple test validated the change.
  • General multi-graph decoding relies on a more flexible framework in which each graph runs in a separate process and a central controller collects the results and selects among them based on the DNN confidence (see the sketch below).
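A minimal sketch of the serial scheme, assuming the central control simply keeps the result with the highest DNN confidence; decode() and confidence() are caller-supplied placeholders for the real decoder and the confidence measure above, not functions from our code:

<pre>
def multi_graph_decode(utterance, graphs, decode, confidence):
    """Serial multi-graph decoding: decode the utterance against each graph
    in turn and keep the hypothesis with the highest DNN-based confidence.
    decode(utterance, graph) -> hypothesis; confidence(hypothesis) -> float.
    The general version would run one decoding process per graph and let a
    central controller collect and compare the results."""
    scored = [(graph, decode(utterance, graph)) for graph in graphs]
    return max(scored, key=lambda pair: confidence(pair[1]))
</pre>

For example, one graph can be a general big graph and another a user-specific graph; the result with the higher confidence wins.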


Embedded progress

  • Test done on the car-1000 test set:
    1. 100_800_800_800_800_3620:
      %WER 1.66 [ 194 / 11710, 45 ins, 56 del, 93 sub ]
      %SER 2.40 [ 71 / 2953 ]
      Scored 2953 sentences, 0 not present in hyp.

    2. 100_800_800_800_800_2108:
      %WER 1.61 [ 188 / 11710, 45 ins, 54 del, 89 sub ]
      %SER 2.24 [ 66 / 2953 ]
      Scored 2953 sentences, 0 not present in hyp.

    3. 100_600_600_600_600_1264:
      %WER 1.61 [ 189 / 11710, 44 ins, 48 del, 97 sub ]
      %SER 2.47 [ 73 / 2953 ]
      Scored 2953 sentences, 0 not present in hyp.
  • It looks like the simpler and faster 600X1264 net (item 3 above) is already good enough for grammar-based tasks.
  • A simple test shows RT 1.5 on the ARM board and RT 0.5 on a popular mobile phone (bought in 2012).
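As a reading aid for the scoring lines above (the layout matches Kaldi's compute-wer output), %WER and %SER for item 1 can be recomputed from the raw counts:

<pre>
# %WER = (insertions + deletions + substitutions) / reference words
# %SER = sentences containing at least one error / total sentences
# Counts copied from item 1 (100_800_800_800_800_3620) above.
ins, dels, subs, ref_words = 45, 56, 93, 11710
bad_sents, total_sents = 71, 2953
print("%%WER %.2f" % (100.0 * (ins + dels + subs) / ref_words))   # %WER 1.66
print("%%SER %.2f" % (100.0 * bad_sents / total_sents))           # %SER 2.40
</pre>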



  • To be done:
  1. Implement a DNN-based demo.
  2. Shrink the NN structure (4 layers to 2 layers) and test the performance.
  3. The Kaldi decoder is costly when the graph is large; the indexing of the FST structure needs to be improved.
  4. Integrate the DNN FE with the pocket-sphinx decoder.