
The web service töggl.ch is supposed to be able to understand and transcribe Swiss German dialects. No easy task. I tried out how well it works.
You speak, the software writes it down. You know this from your smartphone by now, and it works far better than it used to. But only with standard language, not with dialects. Bad luck for Switzerland, where dialect is spoken almost everywhere, even at official occasions such as speeches or community meetings. And in YouTube videos. Our videos currently have to be subtitled by hand so that they can be understood outside German-speaking Switzerland.
The web service Töggl promises a solution: it can transcribe Swiss dialects. Before you click away to try it out: Töggl is not free. Every minute of Swiss German costs one franc. Spoken High German and French cost half as much. Töggl also handles Romansh, which costs the same as Swiss German. To get started, you receive a free credit of ten francs.
In this article, I will limit myself to Swiss German. Will our subtitlers soon be made redundant?
Only residents of Switzerland are allowed to register. The T&Cs state that customers may not use Töggl to process data to which the EU's General Data Protection Regulation (GDPR) applies. For companies that want to make their content available to customers in Germany, Töggl is therefore out of the question. Töggl is aimed at private individuals, journalists and students.
Before transcribing, you specify the language. You do not have to name a specific dialect such as Bernese German or Valais German; "Swiss dialect" is sufficient.
You also have to provide a few details about pronunciation and recording quality. The creators of Töggl emphasize that the quality of the results depends strongly on these factors and give tips for recording. However, in a video with several people and scenes, these questions cannot always be answered unambiguously.
The transcribed text can be post-edited in an editor. That is badly needed, as you will see in a moment.
When the speaker changes, a new section with a time code begins, so you can listen to the passage directly. A double click on a text passage also starts the audio at the corresponding place. The playback speed can be set in fine steps from 0.1 times to 3.5 times.
The finished text can be exported as a text or Word file and in various subtitle formats. So far, so good.
The first task: Töggl is to subtitle this digitec video. Colleague Simon slips into the role of a local TV reporter and mercilessly reveals that not even our own employees subscribe to the digitec Instagram account.
Many different people appear in this video. Dividing the text according to who is speaking would therefore be very useful. However, the speaker recognition does not work reliably. In the second block, five people speak, four of whom can easily be distinguished by the sound of their voices. Töggl lumps all of this together into a single mush of text. For example, one speaker estimates the number of followers at "two million," whereupon the woman next to him says "250,000." Töggl turns this into the number "2,250,000," ignoring the fact that two different people with completely different voices were speaking.
Later, one person speaks High German, so even the language changes, and still no new paragraph is created.
At minute 2:37, Töggl assigns Simon's speech to a new speaker in the middle of a sentence. The reason is probably that applause was played in the background. The segmentation is clearly based not on the voices, but on the ambient sounds.
The quality of the transcription leaves a mixed impression. Without post-processing, the text is incomprehensible. One reason is the faulty speaker separation. Another is that the speech recognition makes some errors and, above all, leaves many gaps. The software simply omits words and parts of sentences that it does not understand. This leads to completely meaningless sentences and also makes post-processing more difficult. It would be helpful if Töggl marked incomprehensible parts with something like [[unverständlich]] (unintelligible).
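The kind of gap marker suggested above can be sketched in a few lines. This is only an illustration under stated assumptions: Töggl's internal data format is not known to me, so the `Word` record with start and end times, the `mark_gaps` function and the 1.5-second threshold are all hypothetical. A long silence can of course also be a genuine pause, so a real implementation would need a better signal than timing alone.

```python
from dataclasses import dataclass

# Marker proposed in the article; everything else here is hypothetical.
GAP_MARKER = "[[unverständlich]]"


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def mark_gaps(words: list[Word], max_silence: float = 1.5) -> str:
    """Join recognized words and insert a marker wherever the time between
    two consecutive words exceeds max_silence seconds."""
    parts: list[str] = []
    previous_end = None
    for word in words:
        if previous_end is not None and word.start - previous_end > max_silence:
            parts.append(GAP_MARKER)
        parts.append(word.text)
        previous_end = word.end
    return " ".join(parts)


if __name__ == "__main__":
    # Invented sample data: nothing was recognized between 1.0 s and 4.0 s.
    sample = [
        Word("wie", 0.0, 0.4),
        Word("viele", 0.5, 1.0),
        Word("Follower", 4.0, 4.6),
    ]
    print(mark_gaps(sample))
```

With a marker like this, the person editing the text would at least see where something is missing instead of having to guess from a broken sentence.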
The source material is not simple: the audio contains interjections, incomplete sentences, English expressions and different recording scenarios with more or less background noise. Simon, however, speaks slowly and clearly.
It seems odd to me that the word "follower" is transcribed differently each time he says it:
The same goes for digitec.ch: it is sometimes rendered as digitec.ch, sometimes as digi.ch and once as dete.ch.
In the next test, only two people are involved and there are no cuts. On the other hand, the recording quality is quite poor. This type of audio is probably very common for interviews. It is a conversation with a mask carver from Central Switzerland, which my colleague Caro recorded with her smartphone.
The conversation lasts over an hour, which would mean over 60 francs in transcription costs. Stingy as I am, I only uploaded twelve minutes of it to Töggl. That is more than enough for a rough impression.
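The cost arithmetic can be written down as a small sketch. The per-minute rates and the ten-franc starting credit are the ones quoted in this article; the function names are my own, and the assumption that every started minute is billed in full is a guess, not something the Töggl price list confirms. Amounts are kept in rappen (1 franc = 100 rappen) to avoid rounding problems.

```python
import math

# Per-minute rates in rappen, as quoted in the article.
RATES_RAPPEN = {
    "swiss_german": 100,
    "romansh": 100,
    "high_german": 50,
    "french": 50,
}
FREE_CREDIT_RAPPEN = 1000  # ten francs of starting credit


def transcription_cost(seconds: int, language: str = "swiss_german") -> int:
    """Return the cost in rappen.

    Assumption: every started minute is billed in full.
    """
    minutes = math.ceil(seconds / 60)
    return minutes * RATES_RAPPEN[language]


def amount_due(seconds: int, language: str = "swiss_german",
               credit: int = FREE_CREDIT_RAPPEN) -> int:
    """Return what is left to pay in rappen after applying the credit."""
    return max(0, transcription_cost(seconds, language) - credit)


if __name__ == "__main__":
    for label, secs in [("twelve-minute excerpt", 12 * 60), ("full hour", 60 * 60)]:
        cost = transcription_cost(secs)
        due = amount_due(secs)
        print(f"{label}: {cost / 100:.2f} CHF, {due / 100:.2f} CHF after credit")
```

Under these assumptions, the twelve-minute excerpt comes to 12 francs, which an unused starting credit would reduce to 2 francs, while a full hour of Swiss German comes to 60 francs before any credit.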
Töggl turns the two people into eight. Continuous passages of speech are cut up, sometimes in the middle of a sentence. I cannot tell why; the entire conversation took place in the same room.
This test reveals a new problem, but it has nothing to do with Töggl - it is a general difficulty in transcribing conversations.
Yes, they immediately declared themselves ready, so, it is then nevertheless still some at financial expenditure meant and um also the premises where now no more are available for the city hall.
Töggl transcribed this sentence correctly; the man said it word for word. But it is unintelligible. Practically no one speaks in print-ready sentences, certainly not in dialect. When we speak, we often produce only half sentences, start over, mix two thoughts together, and so on. Not to mention the many ums and clumsy phrases. In speech, this is so normal that we don't notice it. It only bothers us in a verbatim transcription.
This is more pronounced in interviews than in video clips. The interviewees speak more freely, without prepared sentences. As a rule, they are not media professionals. Oral interviews usually have to be heavily rewritten to make them easy to understand and pleasant to read.
Here's another example. The transcription is close to what was said. Nevertheless, these chunks of text would be completely incomprehensible without sound.
Does Töggl do better when only one person speaks? In good recording quality and with complete sentences? To test this, I use the first two minutes from Phil's review of the PlayStation 5.
In this case, the result is also incomprehensible. The errors cannot be corrected without listening to the audio. That is disappointing, because the task was clearly easier here.
That is, bad luck can the can you still not just where we need, if you want your PSA glasses need to then you plagues, but also data order the free but that is not included and otherwise they can not use.
Maybe you already noticed it above with the mask carver: Töggl writes High German words, but it does not translate the dialect. Swiss German idioms and grammatical peculiarities are transcribed word for word, even if they are not correct in High German. The result is an awkward pseudo-High German.
Swiss German: "[die Variable Refresh Rate], wo ebe macht, dass es kei Bildstörige git"
Töggl: "[the variable refresh rate], where ebe makes that there are no picture disturbances".
High German: "[the variable refresh rate], which just makes that there are no picture disturbances"
Another example:
Swiss German: "de quere Weg hiistelle"
Töggl: "to put the cross way away".
High German: "quer hinstellen".
Automatically transcribed texts almost always need post-processing. This is also the case with automatic translations. These serve as a raw version that is given the finishing touches by hand. This is faster than translating a text completely manually.
The question now is: How much time do I save when I revise a Töggl transcription compared to a transcription without software help? I transcribe two minutes each of Phil's review with and without Töggl and compare the time.
Result: I need 20 minutes to make the Töggl text halfway understandable. But the text is still far from good. It still has awkward formulations and also a few small errors.
For the second two minutes, transcribed entirely by hand, I need 17 minutes. Not only is it faster, the text quality is also higher. This is despite the fact that this part of the review is harder to transcribe. It goes into more detail, with things about the user interface that are difficult to explain. In addition, there are game names that I didn't know.
The main reason: I find it easier to get a sentence right from the start than to fix a sentence that is wrong. If I hear a sentence first and then write it down, I can also translate it correctly into High German, which increases the quality compared to the Töggl text.
Another reason is that I struggled with the editor at first. A double click on the word I want to correct resumes playback against my will, and I didn't yet know the key combination to stop it (Alt-K). I therefore try a second time with the next two minutes. Result: 19 minutes of work, and the text reads better, although Phil produces many half sentences in this part.
Nevertheless, it is clear: the Töggl transcript does not save any time on the way to a final, flawless text. If the text doesn't have to be correct, but just barely understandable, you'll get there a bit faster with the automatic transcript.
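To put the three measurements side by side, here is a small sketch that converts them into minutes of work per minute of audio. The figures are the ones from my test; the function names are made up for illustration, and the extrapolation to a full hour assumes the effort scales linearly, which three short samples cannot prove.

```python
AUDIO_MINUTES = 2  # each test segment covered two minutes of the review

# Minutes of work per two-minute segment, as measured in the article.
RUNS = {
    "Töggl, first attempt": 20,
    "manual transcription": 17,
    "Töggl, second attempt": 19,
}


def effort_ratio(work_minutes: float, audio_minutes: float = AUDIO_MINUTES) -> float:
    """Minutes of work needed per minute of audio."""
    return work_minutes / audio_minutes


def extrapolate(work_minutes: float, target_audio_minutes: float,
                audio_minutes: float = AUDIO_MINUTES) -> float:
    """Estimated work for a longer recording, assuming linear scaling."""
    return effort_ratio(work_minutes, audio_minutes) * target_audio_minutes


if __name__ == "__main__":
    for label, work in RUNS.items():
        ratio = effort_ratio(work)
        hours = extrapolate(work, 60) / 60
        print(f"{label}: {ratio:.1f} min per audio minute, "
              f"about {hours:.1f} h for one hour of audio")
```

The ratios come out at 10, 8.5 and 9.5 minutes of work per minute of audio. The differences are small, and in both attempts the automatic transcript was slower than typing from scratch.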
It sounds contradictory: I'm impressed by what Töggl can do, but I still think the service is barely usable.
The task that the creators of Töggl have set themselves is extremely difficult. Even the speech recognition itself is a challenge. For example, recognizing word boundaries - when we speak, we don't pause between words. It is further complicated by the fact that Swiss German has neither a uniform pronunciation nor a uniform vocabulary. The translation into High German would be another task in itself, which Töggl does not even attempt. Töggl does not produce High German, but Swiss German with words written in High German.
The web editor for correcting is good. Nevertheless, you save little or no time compared to a manual transcription. One reason is that Töggl simply omits incomprehensible words and parts of sentences. This makes it difficult to find your way around the text.
In my tests, good recording quality did not improve the results much. The result was never good enough for me to understand the text without the audio.
What I find really disappointing is that Töggl can't keep the voices apart and produces an incomprehensible text mush when, for example, a man and a woman are speaking.
Even if Töggl worked better: because of the T&Cs, commercial use is hardly possible. And the service is simply too expensive for private use.
Those who subtitle our videos have nothing to fear at the moment.
My interest in IT and writing landed me in tech journalism early on (2000). I want to know how we can use technology without being used. Outside of the office, I’m a keen musician who makes up for lacking talent with excessive enthusiasm.