> Is it possible for one person to design this program, even after 20 years?

I don't see why not. The project you outline is a lot simpler than generic speech recognition, because you have a good idea of what is supposed to come next. For example, if the test is "The quick brown fox", you know you should be getting the phonemes "voiced-th", "uh", "k", "oo-i", "k" and so on. Instead of having to work out what the incoming phonemes are from a position of knowing nothing about them, all you have to do is check the first against "voiced-th", the second against "uh", and so on.
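As a rough illustration of that "check the first, then the second" idea, here is a minimal sketch. It assumes the hard part is already done, i.e. that the audio has somehow been turned into a list of phoneme labels. The labels are the informal spellings used above, and `check_phonemes` is a made-up name, not part of any library.

```python
# Hypothetical phoneme labels for "The quick", spelled informally as in the post.
EXPECTED = ["voiced-th", "uh", "k", "oo-i", "k"]

def check_phonemes(expected, heard):
    """Compare heard phonemes against the expected ones, position by position.

    Returns a list of (index, expected, heard) tuples, one per mismatch.
    A missing or extra phoneme is reported with None on the short side.
    """
    mismatches = []
    for i in range(max(len(expected), len(heard))):
        want = expected[i] if i < len(expected) else None
        got = heard[i] if i < len(heard) else None
        if want != got:
            mismatches.append((i, want, got))
    return mismatches

# The speaker said "v" where "voiced-th" was expected.
heard = ["v", "uh", "k", "oo-i", "k"]
print(check_phonemes(EXPECTED, heard))  # [(0, 'voiced-th', 'v')]
```

A strict position-by-position check like this falls over as soon as the speaker drops or adds a sound, because everything after that point shifts by one, so treat it only as a starting point.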

Obviously the software will need to handle people saying something clearly wrong. For example, given "The quick brown fox", if they say "Ten green bottles" then you need to flag that as completely wrong. Explaining why might be difficult, but one solution could be to say "no, it should sound like this" and play a sample recording.

It will also need to be able to spot people nearly pronouncing things correctly, for example "Ve kooik brone focus": that's wrong, but in many ways it sounds very similar. As the aim is to bring the student's language up to scratch, you will need to determine when a sound is wrong, and that may depend on the ability of the student. For someone just beginning, "Ve kooik brone focus" might be good enough; after all, you can actually understand what someone means if they say "Ve kooik brone focus zhamps hovver ze lie-zee dawg".
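One possible way to put a number on "nearly right" versus "completely wrong" is a weighted edit distance between the expected and heard phoneme lists, with a tolerance that depends on the student's level. The sketch below is only that, a sketch: the table of similar-sounding phonemes, the cost values and the tolerances are all invented for illustration, and it again assumes the speech has already been turned into phoneme labels.

```python
# Hypothetical pairs of phonemes that sound close enough to count as a near miss.
SIMILAR = {
    frozenset(("voiced-th", "v")),
    frozenset(("voiced-th", "z")),
    frozenset(("w", "oo")),
}

def sub_cost(a, b):
    """Cost of hearing phoneme b where a was expected (values are assumptions)."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SIMILAR:
        return 0.5
    return 1.0

def distance(expected, heard):
    """Weighted edit distance between two phoneme lists (insert/delete cost 1)."""
    prev = [float(j) for j in range(len(heard) + 1)]
    for i, e in enumerate(expected, start=1):
        cur = [float(i)]
        for j, h in enumerate(heard, start=1):
            cur.append(min(prev[j] + 1.0,                 # expected phoneme was skipped
                           cur[j - 1] + 1.0,              # extra phoneme was spoken
                           prev[j - 1] + sub_cost(e, h))) # match or substitution
        prev = cur
    return prev[-1]

# Hypothetical tolerances: how much error per expected phoneme each level is allowed.
TOLERANCE = {"beginner": 0.4, "intermediate": 0.2, "advanced": 0.05}

def grade(expected, heard, level):
    error = distance(expected, heard) / max(len(expected), 1)
    return "accept" if error <= TOLERANCE[level] else "retry"

expected = ["voiced-th", "uh", "k", "w", "i", "k"]  # "The quick"
near_miss = ["v", "uh", "k", "oo", "i", "k"]        # "Ve kooik"
print(grade(expected, near_miss, "beginner"))  # accept
print(grade(expected, near_miss, "advanced"))  # retry
```

With these made-up numbers, "Ten green bottles" shares nothing with the expected phrase, so it would come out as "retry" at every level, while the near miss passes for a beginner but not for an advanced student.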

> Do you require me to study one of the languages of programming? Or algorithms?

Me? I don't require you to do anything. You can just get started, and I think that if you take a structured approach and think things through carefully you will naturally work out any algorithms you need. You will need to be fluent in at least one programming language before you start this project though. I would suggest a medium to low level compiled language like C, C++ or Java (ok, Java isn't truly compiled, but it's a possibility). Not assembly, because that is too low level, and not an interpreted language like BASIC, because those are too high level: the computations you'll need to perform will be hampered by the interpretive nature of the language, and you may find the language gets in the way of what you're trying to do. These are generic languages though, and you may find there are languages out there that are more suited to what you need. Research is the key here, but if you just need a pointer then C++ would be able to handle the job.

> Do I need to study sound engineering? Artificial intelligence? Compared to audio?

I think those subjects could be of help, but you should get started and find out what you need to study on the way.

> What is the first step in the work?

Start with a few short test sentences/phrases, no more than 4-5 words long.
Get recordings of yourself and some friends saying those sentences. Get fluent English speakers that are male, female, have varying regional accents, and maybe some recordings from some non-native English speakers whose English is "good enough". Recordings from some non-native English speakers who have heavy accents could be useful as well, in a comparative sense, because you will need to determine when a sound goes "too far" wrong to be acceptable.
Split the recordings into phonemes. If you do a good few of these "manually" then you'll get a feel for how a computer might go about it. Maybe you won't need to use phonemes per se, but you will need to do some separation of the input stream, especially if you want to point out where the speaker went wrong. Again, phoneme analysis is more a pointer than a "you must do it this way", as in fact is anything I'm saying.
Then start studying the characteristics of each sound, looking in particular at the same phoneme across the different recordings to get a feel for how it varies. It will be different for different people, but in some sense a "v" sound will always sound like a "v" no matter who says it. You'll probably need to take a close look at each sound in both the time domain and the frequency domain. A good audio editor like Audacity will let you zoom right in to look at the detail of the wave, and it also has a spectrogram view and a Plot Spectrum tool for the frequency side. You'll need to do some research into whether that is enough, and maybe consider writing your own utilities that suit your needs in this area.
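If you do end up writing your own utility for the frequency domain, the core of it can be quite small. The sketch below computes a naive discrete Fourier transform of one short frame of samples and reports the strongest frequency. The function names are made up, and the test signal is a synthetic 440 Hz sine wave rather than real speech. It is written for readability rather than speed; anything serious would use a proper FFT and apply a window to each frame first.

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Naive discrete Fourier transform of one short frame of samples.

    Returns (frequency_hz, magnitude) pairs for the bins from 0 Hz up to
    half the sample rate. This is O(n^2), which is fine for poking at a
    single frame but far too slow for whole recordings.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append((k * sample_rate / n, abs(total)))
    return spectrum

def dominant_frequency(frame, sample_rate):
    """Frequency (in Hz) of the bin with the largest magnitude."""
    return max(magnitude_spectrum(frame, sample_rate), key=lambda p: p[1])[0]

# Synthetic test frame: 50 ms of a 440 Hz sine sampled at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
print(dominant_frequency(frame, sample_rate))  # 440.0
```

Running the same thing over successive frames of a real recording gives you the raw material for a spectrogram, which is where the differences between, say, a "v" and a "voiced-th" should start to show up.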