Just needs a little oil.
Part One of this series covered some of the reasons why captions vanish, as suddenly and mysteriously as if they were written by JJ Abrams. That remains the worst-case scenario, the captioning equivalent of a diplomat sending guests of state into anaphylactic shock while making them guess the secret ingredient in the marinade. But once captions are technically rolling as they should, captioners keep a finely-peeled eye on their personal accuracy. This is true both word-by-word, and for rates at the macro level. As we have seen this week, companies also audit their caption quality as a whole. I’ll go into some more detail later in this post regarding the different formulas we use for different purposes.
Don't go strayin', little red dot.
As with Olympic figure skating hopefuls trying to land their 400th twizzle, perfectionism is the order of the day, and to the uninitiated eye the numbers might make it look like we’re the proverbial Russian judges quibbling over trifles in an essentially perfect output. But no doubt our regular viewers will be more acutely and instinctively aware of the difference between OK and good captions.
So first the basics. Offline captioners are held to a standard of 100%. Of course that’s an asymptotic journey: in the long run they can only ever fall fractionally short, and a man’s reach must exceed his grasp, or what’s a heaven for? But in any case, that’s the goal, and the mindset, and the time devoted per hour of captioning reflects that. Time enough to pore over the show several times, rewind things you don’t hear (every “mmm” in every cooking show…), fix lags in timing, and see that you’re not covering anything up onscreen. Any lapse, any individual error of either accuracy or pacing which makes it to air and is discovered will be brought to the captioner’s attention. A few of those, and management (with a nod to Carnivàle) will start to become scary.
Alright children, let's shake some dust!
I don’t know so much about the standards to which stenographers are held, but I know 99% is fairly routine for their output. If I sound terse it’s the bitter envy at being the unevolved caption-type Pokemon. Find the right magic stone, and a Voicemander evolves into a Stenozard.
Compared with those standards, voice captioning is the Wild West. The blunt, brutal, minimum standard required before being allowed to hit the airwaves is a consistent 97.5%. Drop below that again too many times and you might soon be captioning the Tiddlywinks World Championships live from Cuernavaca at 4:00am. So most of our voice captioners average 97-point-something. For those who wrestle their average above 98, there’s a sweet pay-rise, and the enjoyment of captioning more exacting and higher-profile programs.
But I'll always look back with fondness on my Premier League Tiddlywinkin' days.
Now superficially, those numbers sound reasonable enough to me. I mean, it’s a test, right? And they seem like good, solid test scores, the kind the kids who study on restricted amphetamines get. The thing is, that last 3% is where the art lies.
You're looking swell.
Picture yourself reading a newspaper article and finding a typo. If you’re as obsessive-compulsive as I am, or if it’s particularly funny, you’ll probably show the other people at the breakfast table. Now what if you found another one in the next article, and another? A typo in every article in the paper. You’d consider it pretty much a shambles, and you’d probably change your subscription. Well, taking a rough average of 400 words per article, that means you’re finding fault with one word in 400. On that metric, the paper’s accuracy is 99.75%. Suddenly our captioning accuracy rates look not-so-shiny. 99% means one wrong word in 100, 98% is one in 50, 97.5% is one in 40. At 140 words per minute, we’re talking several errors every minute (around three and a half at 97.5%), which also gives some insight into why captioners sometimes have elaborate nightmares about homophones.
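For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The function names are my own invention for illustration, and the 140 words per minute default is just the speaking rate quoted above, not a fixed property of live captioning.

```python
def one_error_in(accuracy_pct: float) -> float:
    """Return N such that the given accuracy means one wrong word in N."""
    error_pct = 100.0 - accuracy_pct
    if error_pct <= 0:
        raise ValueError("100% accuracy means no errors at all")
    return 100.0 / error_pct


def errors_per_minute(accuracy_pct: float, words_per_minute: float = 140.0) -> float:
    """Expected wrong words per minute at a given accuracy and speaking rate."""
    return words_per_minute * (100.0 - accuracy_pct) / 100.0


if __name__ == "__main__":
    # The newspaper (99.75%) alongside the captioning thresholds from the post.
    for pct in (99.75, 99.0, 98.0, 97.5):
        print(f"{pct}%: one error in {one_error_in(pct):.0f} words, "
              f"{errors_per_minute(pct):.1f} errors per minute")
```

Run as a script, it prints the one-in-N figure and the per-minute error count for the newspaper and for each of the captioning thresholds.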
Make it stop.
The global standard which normalises perfection across our beloved English language is part of the challenge of turning live speech into text. In written text, we’re predisposed to proofreading and precision, to ponderous production producing preternaturally perfect products. But at these levels of accuracy, the difference isn’t between perfect and flawed, or published and draft, but between usually comprehensible and sometimes comprehensible. Having to make a contextual guess at what two words per minute are supposed to be is profoundly less disruptive to the viewer than having to do it with three. If it’s down to one, and it’s not a complete shocker, it might sometimes escape the attentive viewer’s notice entirely, as your brain silently makes the correction. That’s why these levels matter, and why individual captioners can sometimes obsess over them, and why they’re increasingly enshrined in caption-quality legislation around the world.
So how are these accuracy measurements calculated? Well, you put your hands on these cans, and then the ghost of L. Ron Hubbard makes you caption a sample from Battlefield Earth, and then…no. For the basic, regular checks of our personal accuracy rates, we use a very simple metric called the word error rate, or WER model. You take a sample of text, and do a word count. You count the words successfully corrected with the industry-standard double-dash (“--”), and remove them from the word count, as they effectively don’t count either way. We’ll call the remaining number T for total. Then you count words missing, words added, and words with errors, along with punctuation errors which affect meaning (fail to close parentheses and it counts; start a sentence with “but” after the speaker gave every aural indication of a full-stop and it probably doesn’t). Add them up and you get a number we’ll call E for errors. Your percentage is 100×(T-E)/T. Really simple. You can also break it into cued and live elements if you’re hybrid captioning. You can do it in a few minutes using nothing but your text log. Of course, it’s a pretty blunt instrument, given that for practical purposes, not all errors are created equal. If the phrase “a crowd of activists” came out instead as “a crowd of activist”, it counts as one error. If “chalk” comes out as “Wensleydale”, that too is one error.
It was cracking chalk though.
But if an error covers multiple words, such as Kafelnikov rendered as “car fell nick off”, then you’ve got four errors. There’s an exemption for compound words: “sometimes” and “some times” are considered interchangeable. Those vicissitudes are mostly ironed out by taking a big enough sample, and by all captioners being subject to the same advantages and disadvantages, making it a sturdy, if flawed, workhorse model for quickly measuring individual accuracy.
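The WER sum described above can be sketched in a few lines of Python. This is only an illustration of 100×(T-E)/T as laid out in this post: the function and parameter names are hypothetical, and the counting itself (deciding what is a missing word, an added word, or a meaning-changing punctuation slip) still has to be done by a human with the text log.

```python
def wer_accuracy(word_count: int, corrected: int, missing: int,
                 added: int, wrong: int, punctuation: int = 0) -> float:
    """Accuracy percentage under the simple WER model described in the post.

    word_count  -- raw word count of the sample
    corrected   -- words successfully corrected with "--" (dropped from the total)
    missing     -- words the speaker said that never appeared
    added       -- words that appeared but were never said
    wrong       -- words that came out as something else (one per word,
                   so "car fell nick off" is four)
    punctuation -- punctuation errors which affect meaning
    """
    total = word_count - corrected              # T
    if total <= 0:
        raise ValueError("sample has no countable words")
    errors = missing + added + wrong + punctuation  # E
    return 100.0 * (total - errors) / total


if __name__ == "__main__":
    # Made-up sample: 410 words logged, 10 of them successfully corrected,
    # leaving T = 400; 3 missing + 2 added + 4 wrong + 1 punctuation gives E = 10.
    print(wer_accuracy(410, corrected=10, missing=3, added=2, wrong=4, punctuation=1))
```

For a hybrid session you could call it twice, once on the cued portion and once on the live portion, and compare the two figures.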
But it can be gamed a little. It doesn’t involve checking against vision, so you can increase your stats by being risk-averse. By skipping the mention of a tricky name you’re only 70% sure you trained in. By eliding adjectives and trimming down the content to a minimal paraphrase. By missing entire sentences to stop and correct errors (remembering a successful correction makes it like the error never happened, for assessment purposes). But <Alec Baldwin voice> here’s the thing </Alec>. We don’t do this for careerist reasons. There’s a future post brewing on exactly who we are and why we choose this line of work, but self-aggrandising ambition is way, way down the list. If correcting an error would be more disruptive than not, we’ll move on. If a slightly risky phrase will add colour and texture to the captions if it comes out successfully, we’ll roll the dice virtually every time.
Every time.
In practice, the viewer’s experience comes first. And I mentioned earlier the pay-rise for 98-percenters. Well, the next precision level up from them takes into account words-per-minute and corrected errors. I’ll get there someday.
In the meantime, there’s also a need for a more elaborate metric assessing quality, as defined by reference to viewer experience, rather than captioner output. And they’ve built one. The NER model (see this Australian white paper on the different accuracy models), which stands for number, edition, recognition, compares live captions with a perfect transcript, and weights errors based on their seriousness and comprehensibility. It’s much more time-consuming, as the word-perfect comparison transcript takes several viewings to create (the white paper found NER reviewing ends up taking 10 to 15 times the length of the content being reviewed). It’s good, though, for a thorough periodic audit of a whole company’s average output. In practice, it tends to make good captioners look even better, as while all captioners make mistakes, good captioners catch the big ones and just miss the smaller ones. Such company-wide audits also consider factors largely beyond the individual captioners’ control, such as pace and lag, making them a truer measure of the quality of the viewer’s experience. With few losses, and minimal and minor errors, you have the fixings of a substantial day’s captioning.
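To make the contrast with WER concrete, here is a rough sketch of an NER-style score in Python. Treat the details as assumptions rather than anything this post spells out: the formula 100×(N−E−R)/N and the severity weights of 0.25, 0.5 and 1.0 are how I understand the published NER model, and the function name and severity labels are made up for illustration. Grading each error's severity against the perfect transcript is the slow, human part that no snippet replaces.

```python
# Assumed severity weights (my understanding of the published NER model,
# not something stated in this post).
SEVERITY_WEIGHTS = {"minor": 0.25, "standard": 0.5, "serious": 1.0}


def ner_accuracy(n_words: int, edition_errors: list[str],
                 recognition_errors: list[str]) -> float:
    """NER-style accuracy: 100 * (N - E - R) / N with weighted errors.

    n_words            -- N, the number of words in the captions
    edition_errors     -- severity label for each edition error
                          (content the captioner dropped or changed)
    recognition_errors -- severity label for each recognition error
                          (words that came out wrong)
    """
    if n_words <= 0:
        raise ValueError("sample has no words")
    e = sum(SEVERITY_WEIGHTS[s] for s in edition_errors)
    r = sum(SEVERITY_WEIGHTS[s] for s in recognition_errors)
    return 100.0 * (n_words - e - r) / n_words


if __name__ == "__main__":
    # Made-up sample: four errors in 400 words. Plain WER would call that 99.0%;
    # with most of them graded as small, the weighted score comes out higher.
    print(ner_accuracy(400, ["minor", "minor"], ["standard", "serious"]))
```

That gap between the weighted and unweighted figures is the effect described above: captioners who catch the big errors and only let small ones through score better under a weighted model.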