It's not an easy application to write. You would need to break the input voice stream into chunks, probably at the points where the type of sound changes, then analyse each chunk to determine which phoneme it is likely to be, based on its frequency characteristics. Listen to some sounds, say AAAAA, EEEEEEE, SSSSSSS, ZZZZZZZZZZZ, and you will hear that they have different characteristics. You would need to start by feeding such recordings into a frequency analyser to determine what distinguishes, say, AAA from EEE, and you would have to do that for a number of speakers to get an overall view of what makes a phoneme.
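To give a flavour of the chunking step, here is a rough Python sketch. It is only an illustration under some big assumptions: the "voice" is a synthetic signal (a pure 250 Hz tone standing in for a vowel, followed by random noise standing in for a hiss like SSS), the sample rate and frame size are arbitrary choices of mine, and the only feature used is the spectral centroid (the magnitude-weighted average frequency). Real phonemes need far richer features than this, and the function names are made up for the example.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # Hz; an assumed value for this sketch
FRAME = 128         # samples per analysis frame; also an assumption


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. len(frame)//2 - 1."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def spectral_centroid(frame):
    """Magnitude-weighted mean frequency of one frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    bin_hz = SAMPLE_RATE / len(frame)
    return sum(k * bin_hz * m for k, m in enumerate(mags)) / total


def chunk_boundaries(samples, jump_hz=800.0):
    """Sample indexes where the centroid jumps sharply between frames."""
    centroids = [
        spectral_centroid(samples[i:i + FRAME])
        for i in range(0, len(samples) - FRAME + 1, FRAME)
    ]
    return [
        i * FRAME
        for i in range(1, len(centroids))
        if abs(centroids[i] - centroids[i - 1]) > jump_hz
    ]


# Synthetic stand-ins: a steady tone ("vowel") then noise ("hiss").
rng = random.Random(0)
vowel = [math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
         for t in range(FRAME * 4)]
hiss = [rng.uniform(-1.0, 1.0) for _ in range(FRAME * 4)]
signal = vowel + hiss

print(chunk_boundaries(signal))
```

The tone's energy sits in one low bin while the noise spreads across the whole spectrum, so the centroid jumps at the join and that is where the sketch splits the stream. Telling AAA from EEE is much harder than telling a tone from a hiss, which is rather the point of the paragraph above.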

There's loads to take into account. The stress or pitch within a word can change its meaning (cf. incense as used in churches and incense as in making someone mad), and can even change the meaning of the overall sentence. Words can sound the same, with only the context determining what they mean (cf. fare and fair), and even a word like fare can have multiple meanings (a charge for using a service, "bus fare"; to have a good life, "fare well"; etc). Word splits can be difficult too: cf. "in sensitive areas" and "insensitive areas", where the rest of the sentence is needed to determine which was meant, e.g. "in sensitive areas we need a light touch" and "insensitive areas can manage quite a good scrub".
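The word split problem is easy to demonstrate. The sketch below is my own toy illustration, not how any real recogniser works: it assumes the sounds have already been turned into an unbroken string of letters, and it uses a tiny made-up vocabulary. It simply lists every way the string can be carved into known words.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words from vocabulary."""
    if text == "":
        return [[]]  # one way to split nothing: no words at all
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this word and split whatever is left over.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical vocabulary, just big enough for the example.
VOCAB = {"in", "sensitive", "insensitive", "areas"}

for words in segmentations("insensitiveareas", VOCAB):
    print(" ".join(words))
# prints:
# in sensitive areas
# insensitive areas
```

Both splits are perfectly valid, so nothing at this level can choose between them. Only the rest of the sentence can.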

Regional differences can throw another spanner in the works. Northern Brits pronounce the a in bath like the a in cat, but southerners pronounce bath to rhyme with Darth. It doesn't have to be regional either; I pronounce book as buk, but my wife pronounces the oo as in too.

Oh, and to/too/two is another good one.

The whole voice recognition business is spectacularly complicated. I've watched computers trying to make sense of it for over 20 years now, and even with the best brains in the business working on it, voice recognition is still, quite frankly, pants. Humans need stuff to be repeated too, so clearly even we haven't got it completely nailed yet.

Programming language is largely irrelevant. You would probably use whatever you know best. In just about any language there'll be a lot of low-level work to do in analysing the input stream, splitting out phonemes, considering the possible meanings, and reconsidering phonemes based on whether or not the sentence makes any sense.
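As a very rough idea of the "does the sentence make any sense" step, here is a toy Python sketch. Everything in it is an assumption for illustration: the upper-case tokens stand for sounds the recogniser could not pin to one spelling, the homophone table and the list of plausible word pairs are invented by me, and a real system would use a proper language model rather than a hand-written set.

```python
from itertools import product

# Hypothetical table: an ambiguous sound and the words it could be.
HOMOPHONES = {
    "FAIR": ["fare", "fair"],
    "TOO": ["to", "too", "two"],
}

# Hypothetical list of adjacent word pairs we consider sensible.
PLAUSIBLE_PAIRS = {
    ("bus", "fare"),
    ("fare", "is"),
    ("is", "too"),
    ("too", "expensive"),
    ("two", "tickets"),
}


def candidates(heard):
    """Expand each ambiguous token into every possible spelling."""
    options = [HOMOPHONES.get(token, [token]) for token in heard]
    return [list(choice) for choice in product(*options)]


def score(words):
    """Count how many adjacent word pairs look plausible."""
    return sum((a, b) in PLAUSIBLE_PAIRS for a, b in zip(words, words[1:]))


def best_reading(heard):
    """Pick the candidate sentence with the most plausible pairs."""
    return max(candidates(heard), key=score)


heard = ["the", "bus", "FAIR", "is", "TOO", "expensive"]
print(" ".join(best_reading(heard)))
# prints: the bus fare is too expensive
```

Six candidate sentences come out of that one input, and the neighbouring words are the only thing that separates them. Scale that up to a full vocabulary and real speech and you can see where the low-level work goes.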

Have a look at and linked articles for more info on phonemes than you ever thought was possible.