
Something about the HTK tool

February 24, 2014

    HTK is the first speech recognizer and trainer that most speech researchers work with if they started after 1999. (I guess it could be much earlier.) I have used it for most of my life as a speech hacker. Many describe it as "very modular" and as implementing "good engineering practice". My understanding of these statements is that each application you can find in the toolbox has a single, well-defined use. For example, HLEd is a very versatile editor for HTK transcriptions in the .MLF format. Using HLEd is usually much better than using Unix tools. The interface itself is also very handy. Many people who are familiar with HTK can run training just by typing commands at the prompt.
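
To give a feel for what an .MLF transcription looks like and why a dedicated editor beats line-oriented Unix tools, here is a minimal Python sketch of a reader. It is not part of HTK and it only covers the simple layout I assume here: a `#!MLF!#` header, a quoted file pattern per utterance, one label per line (optionally preceded by integer start/end times), and a lone `.` closing each utterance. The names `parse_mlf` and `SAMPLE_MLF` are made up for illustration.

```python
# Hypothetical helper, not an HTK tool: reads a simple MLF string into a dict.
SAMPLE_MLF = '''#!MLF!#
"*/utt1.lab"
sil
one
two
sil
.
"*/utt2.lab"
0 3000000 sil
3000000 7000000 three
.
'''


def parse_mlf(text):
    """Return {file_pattern: [label, ...]} for a simple MLF string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")
    entries = {}
    current = None
    for ln in lines[1:]:
        if ln.startswith('"') and ln.endswith('"'):
            # A quoted pattern starts the transcription of a new utterance.
            current = ln.strip('"')
            entries[current] = []
        elif ln == ".":
            # A lone period ends the current utterance.
            current = None
        elif current is not None:
            fields = ln.split()
            # Assumption: leading integer fields are start/end times; skip them.
            while len(fields) > 1 and fields[0].lstrip("-").isdigit():
                fields.pop(0)
            entries[current].append(fields[0])
    return entries
```

Calling `parse_mlf(SAMPLE_MLF)` gives one label list per utterance. Even this toy reader has to track utterance boundaries across lines, which is exactly the bookkeeping that grep or sed makes awkward.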

    On the algorithmic side, HTK uses the token passing algorithm for decoding and Baum-Welch re-estimation for training. The most amazing thing is that both the derivations of these algorithms and their implementation are very elegant. Even small minutiae, such as the handling of null nodes, are correct. This lets many newcomers to speech recognition enjoy these advanced training facilities. Many advisors ask their students to match the output of their own speech recognizer against HTK. This is a very good way to start programming speech recognition. (However, HTK can have bugs. So when you put this amazing ability of yours on your resume, remember to phrase it as "my recognizer has an HTK-compatible mode". :-) )
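
For readers who have not seen token passing before, here is a minimal Python sketch of the idea on a single small HMM. It is my own toy version, not HTK's code: every state holds one token (a log score plus its state history), tokens are copied along every allowed transition at each frame, and each state keeps only the best token it receives. The function name and the input layout (`log_trans`, `log_emit`) are assumptions made for this example, and null nodes, pruning and word-level history are all left out.

```python
NEG_INF = float("-inf")


def token_passing(log_trans, log_emit, start_state=0):
    """Toy token passing over one HMM.

    log_trans[i][j]: log probability of moving from state i to state j.
    log_emit[t][j]:  log likelihood of frame t in state j.
    Returns (best_log_score, best_state_path).
    """
    n = len(log_trans)
    # One token per state; only the start state holds a live token at frame 0.
    tokens = [(NEG_INF, [])] * n
    tokens[start_state] = (log_emit[0][start_state], [start_state])
    for t in range(1, len(log_emit)):
        new_tokens = [(NEG_INF, [])] * n
        for i, (score, path) in enumerate(tokens):
            if score == NEG_INF:
                continue  # no live token in this state
            for j in range(n):
                if log_trans[i][j] == NEG_INF:
                    continue  # transition not allowed
                cand = score + log_trans[i][j] + log_emit[t][j]
                # Each state keeps only the best incoming token.
                if cand > new_tokens[j][0]:
                    new_tokens[j] = (cand, path + [j])
        tokens = new_tokens
    return max(tokens, key=lambda tok: tok[0])
```

With a two-state left-to-right model this returns the same path as a textbook Viterbi search. That is the point of the formulation: the bookkeeping lives in the tokens rather than in a separate backtrace table.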

   Yet there are many people who don't really like HTK. Most of the time, they are already PhDs or very experienced researchers in the field. The reason they don't like HTK is that it is pretty hard to change. You may say this is because it strictly follows many software design principles. For example, HTK's recognizer (HVite) and trainer (HERest) share the same model data structure. It is good to have code re-used, but it also makes the code hard to change: every time you modify the shared structure, you also need to consider every other routine that depends on it.

    People also feel lost when they finally realize that HVite actually does full fan-in and fan-out for cross-word triphones. (Usually, they find out when they run phone recognition on TIMIT. :-) ) This is obviously the most "correct" way to implement a recognizer. However, it may not be the "clever" way. Approximations that replace full fan-in/fan-out have existed for a long time, and some of them have been found to be quite close to full fan-in/fan-out in terms of performance.
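
To see why full fan-in/fan-out gets expensive, here is a small Python sketch that enumerates the cross-word triphone variants each word needs. It is only an illustration under my own assumptions, not how HVite builds its network: the word-initial phone needs one copy per possible last phone of the previous word (fan-in), and the word-final phone needs one copy per possible first phone of the next word (fan-out). Silence and short-pause contexts are ignored, and `crossword_triphones` is a made-up name. Triphones are written in the usual `l-p+r` form.

```python
def crossword_triphones(lexicon):
    """Map each word to (fan_in, fan_out) lists of cross-word triphone names.

    lexicon: {word: [phone, ...]}, every pronunciation non-empty.
    """
    # Every phone that can start or end a word is a possible neighbour context.
    firsts = sorted({pron[0] for pron in lexicon.values()})
    lasts = sorted({pron[-1] for pron in lexicon.values()})
    result = {}
    for word, pron in lexicon.items():
        if len(pron) == 1:
            # A one-phone word depends on both neighbours at once.
            both = [f"{l}-{pron[0]}+{r}" for l in lasts for r in firsts]
            result[word] = (both, both)
        else:
            fan_in = [f"{l}-{pron[0]}+{pron[1]}" for l in lasts]
            fan_out = [f"{pron[-2]}-{pron[-1]}+{r}" for r in firsts]
            result[word] = (fan_in, fan_out)
    return result
```

In a three-word toy lexicon every multi-phone word already needs three entry and three exit variants, and a one-phone word needs nine. In phone recognition every unit is one phone long, so the quadratic case is the only case, which is probably why people notice it there first.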

   There are also places where the code is not consistent. The impression I got when I changed HHEd was that it seems to have been written by another programmer. I was also pretty lost when I saw data structures lying everywhere. Maybe that was because I was not using emacs at the time. :-)

   Application developers, obviously, often don't like HTK because of its license. You can use HTK-trained models in your application, but you have to write your own recognizer. This usually causes a lot of trouble for many groups, because writing a recognizer these days still takes about three months to half a year to get well polished.
