PocketSphinx

Sphinx is a pretty old (early 2000s, latest stable release 2015) speech-to-text system developed at Carnegie Mellon University.

Some websites suggest that the project is now abandoned. Others simply state that "active development has largely ceased and it has become very, very far from the state of the art".

It's available as Debian / Devuan packages, but getting it to actually do something is, in my opinion, extremely non-trivial.

Python

Someone seems to have worked out how to get it to do something under Python, however:

  • This blog entry is pretty old (March 2016) and the code doesn't work with Python 3
    • There are comments further down saying that someone else did get it working with Python 3; however, the code provided lacks all indentation and, because Python is such an irritatingly fussy language where indentation is concerned, it simply doesn't work.
  • Following the link to the GitHub page lets you move up one level from srli/stt.py and discover srli/stt_py3.py, which seems encouraging; however, this also lacks all indentation
    • Fortunately someone noticed this and posted correctly indented code in a comment on that GitHub page, and if you copy and paste this it almost starts to work
    • The instructions are inadequate; you also need to install the package python3-pyaudio and this gets you in business at last

However, the code has been written to assume that you have a machine with a sound card and microphone, and you simply want to talk to it and see what it thinks you're saying.

This means that once you've got the Python interpreter to think it's code worth trying to run, the application's output starts with the error message ALSA lib confmisc.c:767:(parse_card) cannot find card '0' and simply continues from there, expecting to find something to listen to. There doesn't even appear to be a command-line option to tell it whether you want live recognition from a microphone / sound card, or batch recognition from pre-recorded files.

I'm sure it should be possible to adapt this code to work with a pre-recorded WAV file (in fact, I think it should be simpler) but given that it's written in Python this might take me a bit of time to work out how to do. I do not like Python and I have only extremely basic Python programming skills (with no intention of ever improving them beyond what is needed for specific tasks such as this), but I think "it can't be that hard".

Without Python

It turns out, as is so often the case, that I am far from the first person to wonder "how do I get PocketSphinx to do something useful? Say, something trivial, like convert a WAV file to text?"

So, if only the documentation were adequate, it would be obvious.

pocketsphinx_continuous -infile recording.wav

Note that the recording must be 16 kHz, 16-bit, mono.

If you have something else, sox can almost certainly sort that out for you, for example:

sox Some.other.format.wav -r 16000 -c 1 -b 16 Nice.wav
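
If you're not sure whether a file already meets that requirement, soxi (normally installed alongside sox) can report the sample rate, channel count and bit depth. The following is only a sketch: check_wav is a made-up helper name, and it assumes soxi is on the PATH.

```shell
# check_wav FILE
# Succeeds only if FILE is 16 kHz, 16-bit, mono, which is what
# pocketsphinx_continuous expects. check_wav is a hypothetical helper
# name; it assumes soxi (shipped with sox) is on the PATH.
check_wav() {
    cw_rate=$(soxi -r "$1") || return 2      # sample rate in Hz
    cw_channels=$(soxi -c "$1") || return 2  # number of channels
    cw_bits=$(soxi -b "$1") || return 2      # bits per sample
    if [ "$cw_rate" = 16000 ] && [ "$cw_channels" = 1 ] && [ "$cw_bits" = 16 ]; then
        return 0
    fi
    echo "$1: ${cw_rate} Hz, ${cw_channels} channel(s), ${cw_bits}-bit; needs converting" >&2
    return 1
}
```

With that defined, a line such as this converts only when it has to:

check_wav recording.wav || sox recording.wav -r 16000 -c 1 -b 16 Nice.wav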

Avoiding excessive output

PocketSphinx's documentation is, in my opinion, pitched at far too technical a level, and it completely lacks any "I just want to use it" documentation.

For example, the command shown above

pocketsphinx_continuous -infile recording.wav

produces vast quantities of highly technical output showing what the thing is doing internally to try to work out what the words are. I haven't yet found the bit in the man page where it says "redirect standard error to /dev/null if all you want is to get the decoded text output".

It's as simple as that, but you shouldn't have to work it out for yourself.
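
Spelled out, that amounts to something like the following. It is a sketch rather than anything taken from the PocketSphinx documentation: quiet and transcribe are made-up names, and the only PocketSphinx option used is the -infile one shown above.

```shell
# quiet COMMAND [ARGS...]
# Run a command with its standard error thrown away, so that only
# standard output (the decoded text, in PocketSphinx's case) is left.
quiet() {
    "$@" 2>/dev/null
}

# transcribe FILE
# Hypothetical wrapper: decode one WAV file and print just the text.
transcribe() {
    quiet pocketsphinx_continuous -infile "$1"
}
```

The decoded text can then be captured in the usual way:

transcribe recording.wav > recording.txt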

The next challenge

The Debian-packaged version of PocketSphinx contains only an American English speech model and, as far as my experiments have shown so far, it does a terrible job when presented with British English recordings.

So, the next step is to try to work out how to use the British English speech model files to get it to understand plain English.

Other languages are also apparently available.

Borrowing some documentation from some random website (albeit based on doing this under Windows), the language models seem to expect the following files and directory structure (using Russian as an example):

└──pocketsphinx-data
   └──ru-RU
      ├──language-model.lm.bin
      ├──pronounciation-dictionary.dict
      └──acoustic-model
         ├──feat.params  
         ├──mdef  
         ├──means  
         └──mixture_weights

There are some clues online for when this process turns out not to be as simple as it should be.
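
As a rough sketch of how that layout might be used from the command line: check_model_dir and decode_with_model below are made-up helper names, and the -hmm, -lm and -dict options are an assumption about how the packaged pocketsphinx_continuous is told where the acoustic model directory, language model and dictionary live, so check them against the man page before relying on this. The file names (including the spelling of pronounciation-dictionary.dict) are copied from the listing above.

```shell
# check_model_dir DIR
# Succeeds only if DIR contains every file in the layout shown above;
# reports each missing file on standard error.
check_model_dir() {
    cm_missing=0
    for cm_file in \
        language-model.lm.bin \
        pronounciation-dictionary.dict \
        acoustic-model/feat.params \
        acoustic-model/mdef \
        acoustic-model/means \
        acoustic-model/mixture_weights
    do
        if [ ! -f "$1/$cm_file" ]; then
            echo "missing: $1/$cm_file" >&2
            cm_missing=1
        fi
    done
    return "$cm_missing"
}

# decode_with_model DIR FILE
# Assumption: -hmm, -lm and -dict select the acoustic model directory,
# language model and dictionary respectively.
decode_with_model() {
    check_model_dir "$1" || return 1
    pocketsphinx_continuous \
        -hmm "$1/acoustic-model" \
        -lm "$1/language-model.lm.bin" \
        -dict "$1/pronounciation-dictionary.dict" \
        -infile "$2" 2>/dev/null
}
```

For the Russian example that would be called as:

decode_with_model pocketsphinx-data/ru-RU recording.wav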

