Saturday 31 May 2014

Mishaps galore

So it's been a zany cavalcade of software mishearings over the past week. First of all, Thailand. Now one news outlet suggested they have a "precedent of insurgency", but Dragon was adamant I referred to the fundamentally paradoxical position of "President of insurgency".

Perhaps more poignant and poetic, a slightly mumbled respeak of Osama bin Laden came out as "or some of the blood and".

The British municipal and European elections over the past week provided their fair share of drama, political soul-searching, and hour upon hour of totally unscripted live news. You know where this is going. Firstly, a fiercely talented colleague accidentally made the municipal elections the "unicycle elections," which sounds far more whimsical. Dragon weighed in with a political opinion when a UKIP pundit was welcomed to the studio with "Nice to see EU". Finally, we crossed live to a press conference with that delicious German fish, "angler Merkel".

Cooking shows are always a treasure trove of hilarity. They're all pre-records, so none of these went to air, but live captioners often respeak the first draft of the script, and it's fair to say that Dragon just isn't comfortable in the kitchen.


So it seems fair enough that a home-made cake is "better than abort one". Just common sense really. I was more concerned by the suggestion that we season the tomatoes with "salt and papa". "Unsalted butter" came out as "onslaught of butter", which... I like the way Dragon thinks. More alarming was the advice to whip batter until it "becomes froggy"; I'd hate to end up with amphibious pastry by accident. Spices were equally fumbled - "two teaspoons of human seeds" seems a little exotic for a prime-time TV chef. "Nice and grungy" and "nice and crunchy" are two fundamentally different levels of cooking. Separating eggs was just as fraught - "I just need the yolk" was rendered as "I just mediocre." We know, Dragon, we know.

This one comes from an unknown colleague. I pulled up the text from a previous newshour to harvest the VTs, and they must have a verbal shortcut trained in for Putin. Makes sense, he's a little bit homophone (as well as a little bit homophobe) and can come out as Britain, written, putting, pudding, put in, etc. But something went awry, and the shortcut itself went to air instead. The shortcut was "poot poot". I giggled in Ukrainian.

Finally, a little bit of sentimental sweetness from my Dragon. A science program was exploring egg-fertilisation and larva production for a particular species. Anyway, apparently "the egg and sperm, mixed together, produces love." Not very scientific, but oddly elegant.



Disclaimer.

Sunday 25 May 2014

Is Captioner No More Than This?

So I'm taking a short break from my series about caption quality, but will finish the final part sooner or later. In the meantime, I wanted to zoom out a little.

At the risk of getting a little lofty and philosophical, I’ve been mulling (over wine) just what it is, and what it means, to be a voice captioner. So perhaps a trigger warning is in order: this post may contain flagrant speculation, gratuitous Platonic forms, and wanton abstraction.



Voice captioning involves thinking about yourself in a few unusual ways: as a broadcaster, as a viewer, as a content creator, as a narrator, as an interpreter, as a passive conduit or mouthpiece for someone else’s views.

With each passing season, the mouthpiece garrisons grow bolder and more ruthless.


As a broadcaster, a captioner aligns himself or herself with both commentator and newsreader. Like a commentator, we decide on the spot what information needs to be made explicit to enhance the viewer’s understanding of a visual text. Like a newsreader, we funnel and compile the information from autocues, scripts, and audio sources into a coherent narrative thread. I’ve discussed before how such compilation works. I work for a third-party captioning company, but when it’s flowing fairly smoothly, when we can access the same rundowns and script fragments as our clients, it can feel like we are “in the studio”, part of the allied industries of truth, worthy of being a background character in an Aaron Sorkin show, with all attendant privileges, folder-carrying, and stirring music.

Sadly, my walk-and-talk is a tad rusty.


Of course, it isn’t always like that. Whether we're treated as valued colleagues, or a necessary evil or annoyance, or a mistrusted source of intellectual property leaks, or (rather often) not given any thought at all, depends greatly upon the personalities of the relevant station managers and studio directors. Some of my co-workers have worked in the past as in-house captioners for networks, and have remarked on the cultural difference in how our work is conceived. After all, it’s hard to imagine such necessary information as scripts, rundowns, outcomes or ready-to-air media being routinely withheld from the writers, directors, producers, stars, competitors (on reality or game shows), or anyone else who is “in the team”, as readily as it sometimes is from us. It’s different when you’re in-house. With outsourcing comes efficiency, and concentration of expertise, but also alienation.

BRB, floating round my tin can.


In the backs of our minds, we know we may also be considered broadcasters as a category of legal liability. It’s certainly rare for captioners to be held responsible for libel, slander, sedition and other forms of culpable speech, which more likely originate with their client networks. But someday the landmark test case will inevitably come, and a nasty lawsuit will lay down in black and white the extent of our identity as broadcasters. In some jurisdictions, accuracy legislation surely offers an argument in defence of a comfortable margin of error – a 98% rate of accuracy inherently licenses 2% of our output to be wrong, scandalously or otherwise, so a tendentious error or two here and there would seem to be protected by statute. But I’d hate to be the one to test it. A particular source of angst is pre-scripted news content. Suppose we push the proverbial Big Red Button to cue out a prepared, edited and scripted sentence which we hear the newsreader begin to read. Now imagine that unbeknownst to us, the newsroom’s legal team has pulled, at the last minute (not a figure of speech, the kind containing 60 seconds), someone’s name right at the end of the sentence. It turns out they weren’t allowed to reveal it, and thus neither were we – but it’s too late to swallow back in. That barn door could prove hard to close. There’s only so much legal caution we can exercise without creating delay, but we’re often pre-emptive where we feel there ought to be an “allegedly”. More often than not, in that event, the newsreader makes the same on-the-fly script amendment that we do. You can’t be too careful.



Sometimes, though, we feel more like viewers. Like we’re “out of the studio”, with no privileged access, no inside knowledge, just reacting to whatever is going to air and trying to explain it as best we can to someone sitting alongside us, who is intelligent and informed, but having trouble following. That sense is most acute when there’s something which I, too, can’t make out, such as when I can’t make sense of someone’s accent, and my position as just another flawed and human viewer and listener is thrown into stark relief. The in-jokes we PM to our co-pilots enhance this experience of captioner-as-viewer. After all, we are literally talking over an audio-visual artefact – does that make what we do the equivalent of Mystery Science Theatre 3000? In that vein, I keep thinking we should also offer alternate, novelty closed caption tracks (idea copyright Rogue Captioner, 2014) which take the piss in real time. It’s gonna be a hit.



Then again, sometimes we think of ourselves as content creators. We create the first rough written transcript of whatever we caption live, and some of our clients take advantage of that as the basis for what will then be tidied (perhaps by us) into a verbatim online transcript. That also gives us an odd little margin of creative freedom. Future posts will go into more detail about what that freedom can entail, but suffice to say it’s never quite as simple as saying what you hear. Punctuation is one example. Extemporised speech is punctuated with comparative informality, but on the page or screen, the game changes. Imagine the shift in tone if the following phrase exclusively used commas in place of its full stops:
Misogyny. Sexism. Every day from this Leader of the Opposition. Every day. In every way. Across the time the Leader of the Opposition has sat in that chair, and I've sat in this chair, that is all we have heard from him.


It would be the same speech, but the effect on the page would be very different. As phrased above, it reads like a boxer, landing a self-assured one-two punch with each staccato sentence. But with all commas, it becomes a kind of cumulative litany, ideas and grievances piling up with each subordinate clause, more like an adept freestyler converting his anguish into passionate, deeply personal rhyme. Dashes and colons would be different again, and exclamation marks could read either as forceful or shrill. Either way that creative choice, however small, is routinely ours. So when I see quotes from a politician later that day, sourced from a web transcript I created, the sense of recognition contains within it a peculiar kernel of ownership. Like the child in the end credits of the X-Files, I Made This!



Of course, another lens through which to interpret our role is that of an interpreter. We facilitate communication between people who can’t necessarily share an immediate and direct linguistic connection. I recognised both the work of colleagues in the accessibility industry and a unique and unexpected practical challenge recently, while voice captioning a live session of the United Nations Security Council. Some Eastern Bloc leaders and ambassadors were speaking through a live interpreter, as they often do. The thing is, the interpreters were adjusting, just like I do, to the rhythms of speech of their subjects. They would pause momentarily to hear what came next, and to take a breath, and then maybe 10 words would pour out in a rush. Perfectly comprehensible to the listener, just subtly uneven in pace. But for me, trying to do the same thing on top of that was really hard, like a kind of Captionception. Three people simultaneously saying the same thing, to three different audiences (Ukrainian, hearing-Anglophone, and hearing-impaired-Anglophone), presented interesting difficulties in pacing, as well as that whole pesky breath thing.



And lastly (for now), there’s something strange and at times slightly horrific about having to rearticulate, with my own voice-box, the views of some truly awful people. To have to hear myself say out loud, for instance, that Romanians are destroying English society, or that Stopping The Boats would somehow help to pay off Australia’s modest national debt, or that the science of climate change is anything but iron-clad. Or, more pointedly, that the axing of the position of Disability Commissioner on the Australian Human Rights Commission is a good and necessary thing, because Freedom. Of course I know, at some level, it isn’t me doing this. I’m just the messenger. But is that the Nuremberg Defence? Am I just taking orders? Of course, I reassure myself that the best thing I can do is transmit, in its unadulterated ugliness, the abhorrent things said by those in power. All that I am empowered to do is to make sure that all viewers understand, with perfect clarity, the capacity of awful people to say awful things. Give them enough rope etc. I think that can only be a good thing, even if it makes me a passive conduit for something grotesque. But it’s a singular feeling, to just say everything. To be a voice and nothing more.

Albeit a cyborg micro-processor-equipped online voice from the future.


So anyway, tying this up nicely into a bow might be against the spirit of this post. Better perhaps to end as we started, with the question: as a captioner, just what the hell am I?




Disclaimer.

Monday 19 May 2014

Quality and Accuracy Part Two: Errors

So earlier I mentioned the different kinds of caption production, and I threw some accuracy percentages out there, noting that stenographers achieve a higher percentage than voice captioners, having spent many more years levelling up and going on sweet dungeon raids and so forth. But just what do these percentages mean? How do we measure them, and what kind of accuracy can we achieve? Why does my blogging persona ask so many rhetorical questions, and can it be prevented medicinally?

Just needs a little oil.


Part One of this series covered some of the reasons why captions vanish, as suddenly and mysteriously as if they were written by JJ Abrams. That remains the worst-case scenario, the captioning equivalent of a diplomat sending guests of state into anaphylactic shock while making them guess the secret ingredient in the marinade. But once captions are technically rolling as they should, captioners keep a finely-peeled eye on their personal accuracy. This is true both word-by-word, and for rates at the macro level. As we have seen this week, companies also audit their caption quality as a whole. I’ll go into some more detail later in this post regarding the different formulas we use for different purposes.

Don't go strayin', little red dot.


As with Olympic figure skating hopefuls trying to land their 400th twizzle, perfectionism is the order of the day, and to the uninitiated eye the numbers might make it look like we’re the proverbial Russian judges quibbling over trifles in an essentially perfect output. But no doubt our regular viewers will be more acutely and instinctively aware of the difference between OK and good captions.



So first the basics. Offline captioners are held to a standard of 100%. Of course that’s an asymptotic journey: in the long run they can only ever fall fractionally short, and a man’s reach must exceed his grasp, or what’s a heaven for? But in any case, that’s the goal, and the mindset, and the time devoted per hour of captioning reflects that. Time enough to pore over the show several times, rewind things you don’t hear (every “mmm” in every cooking show…), fix lags in timing, and see that you’re not covering anything up onscreen. Any lapse, any individual error of either accuracy or pacing which makes it to air and is discovered will be brought to the captioner’s attention. A few of those, and management (with a nod to Carnivàle) will start to become scary.

Alright children, let's shake some dust!


I don’t know so much about the standards to which stenographers are held, but I know 99% is fairly routine for their output. If I sound terse it’s the bitter envy at being the unevolved caption-type Pokemon. Find the right magic stone, and a Voicemander evolves into a Stenozard.

Compared with those standards, voice captioning is the Wild West. The blunt, brutal minimum standard required before being allowed to hit the airwaves is a consistent 97.5%. Drop below that too many times and you might soon be captioning the Tiddlywinks World Championships live from Cuernavaca at 4:00am. So most of our voice captioners average 97-point-something. For those who wrestle their average above 98, there’s a sweet pay-rise, and the enjoyment of captioning more exacting and higher-profile programs.

But I'll always look back with fondness on my Premier League Tiddlywinkin' days.


Now superficially, those numbers sound reasonable enough to me. I mean, it’s a test, right? And they seem like good, solid test scores, the kind the kids who study on restricted amphetamines get. The thing is, that last 3% is where the art lies.

You're looking swell.


Picture yourself reading a newspaper article and finding a typo. If you’re as obsessive-compulsive as I am, or if it’s particularly funny, you’ll probably show the other people at the breakfast table. Now what if you found another one in the next article, and another? A typo in every article in the paper. You’d consider it pretty much a shambles, and you’d probably change your subscription. Well, taking a rough average of 400 words per article, that means you’re finding fault with one word in 400. On that metric, the paper’s accuracy is 99.75%. Suddenly our captioning accuracy rates look not-so-shiny. 99% means one wrong word in 100, 98% is one in 50, 97.5% is one in 40. At 140 words per minute, we’re talking numerous errors every minute, as well as suddenly giving an insight into why captioners sometimes have elaborate nightmares about homophones.

Make it stop.
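
If you want to play along at home, the arithmetic is trivial. Here’s a quick Python sketch, using the figures above, of how an accuracy percentage translates into wrong words per minute:

```python
# Back-of-the-envelope: accuracy percentage -> wrong words per minute.
def errors_per_minute(accuracy_pct, words_per_minute=140):
    return words_per_minute * (100 - accuracy_pct) / 100

for accuracy in (99.75, 99.0, 98.0, 97.5):
    print(f"{accuracy}%: {errors_per_minute(accuracy):.2f} errors/min")

# 99.75%: 0.35 errors/min  (the newspaper)
# 99.0%: 1.40 errors/min   (a stenographer)
# 98.0%: 2.80 errors/min   (the pay-rise threshold)
# 97.5%: 3.50 errors/min   (the bare minimum)
```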


Part of the challenge of turning live speech into text is that written English comes with an expectation of normalised perfection. In written text, we’re predisposed to proofreading and precision, to ponderous production producing preternaturally perfect products. But at these levels of accuracy, the difference isn’t between perfect and flawed, or published and draft, but between usually comprehensible and sometimes comprehensible. Having to make a contextual guess at what two words per minute are supposed to be is profoundly less disruptive to the viewer than having to do it with three. If it’s down to one, and it’s not a complete shocker, it might sometimes escape the attentive viewer’s notice entirely, as your brain silently makes the correction. That’s why these levels matter, and why individual captioners can sometimes obsess over them, and why they’re increasingly enshrined in caption-quality legislation around the world.

So how are these accuracy measurements calculated? Well, you put your hands on these cans, and then the ghost of L. Ron Hubbard makes you caption a sample from Battlefield Earth, and then…no. For the basic, regular checks of our personal accuracy rates, we use a very simple metric called the word error rate, or WER model. You take a sample of text and do a word count. You count the words successfully corrected with the industry-standard double-dash (“--“) and remove them from the word count, as they effectively don’t count either way. We’ll call the remaining number T, for total. Then you count words missing, words added, and words with errors, along with punctuation errors which affect meaning (fail to close parentheses and it counts; start a sentence with “but” after the speaker gave every aural indication of a full stop and it probably doesn’t). Add them up and you get a number we’ll call E, for errors. Your percentage is 100×(T-E)/T. It’s really simple, and you can also break it into cued and live elements if you’re hybrid captioning. You can do it in a few minutes using nothing but your text log. Of course, it’s a pretty blunt instrument, given that for practical purposes, not all errors are created equal. If the phrase “a crowd of activists” came out instead as “a crowd of activist”, it counts as one error. If “chalk” comes out as “Wensleydale”, that too is one error.

It was cracking chalk though.


But if an error covers multiple words, such as Kafelnikov rendered as “car fell nick off”, then you’ve got four errors. There’s an exemption for compound words: “sometimes” and “some times” are considered interchangeable. Those vicissitudes are mostly ironed out by taking a big enough sample, and by all captioners being subject to the same advantages and disadvantages, making it a sturdy, if flawed, workhorse model for quickly measuring individual accuracy.
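
For the terminally curious, here’s that WER arithmetic as a minimal Python sketch. The tallies are assumed to have been counted by hand from the text log, exactly as described above; the session numbers are invented.

```python
def wer_accuracy(words, corrected, missing, added, wrong, punctuation=0):
    """WER accuracy: 100 x (T - E) / T, as described above.

    words:       raw word count of the sample
    corrected:   words fixed with the double-dash ("--"), removed
                 from the total as they count neither way
    missing, added, wrong: the word-level error tallies
    punctuation: punctuation errors which affect meaning
    """
    t = words - corrected
    e = missing + added + wrong + punctuation
    return 100 * (t - e) / t

# An invented 15-minute session at 140wpm: 2,100 words.
print(round(wer_accuracy(2100, corrected=12, missing=20,
                         added=5, wrong=22, punctuation=2), 2))  # 97.65
```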

But it can be gamed a little. It doesn’t involve checking against vision, so you can increase your stats by being risk-averse. By skipping the mention of a tricky name you’re only 70% sure you trained in. By eliding adjectives and trimming down the content to a minimal paraphrase. By missing entire sentences to stop and correct errors (remembering a successful correction makes it like the error never happened, for assessment purposes). But <Alec Baldwin voice> here’s the thing </Alec>. We don’t do this for careerist reasons. There’s a future post brewing on exactly who we are and why we choose this line of work, but self-aggrandising ambition is way, way down the list. If correcting an error would be more disruptive than not, we’ll move on. If a slightly risky phrase will add colour and texture to the captions if it comes out successfully, we’ll roll the dice virtually every time.

Every time.


In practice, the viewer’s experience comes first. And I mentioned earlier the pay-rise for 98-percenters. Well, the next precision level up from them takes into account words-per-minute and corrected errors. I’ll get there someday.



In the meantime, there’s also a need for a more elaborate metric assessing quality, as defined by reference to viewer experience, rather than captioner output. And they’ve built one. The NER model (see this Australian white paper on the different accuracy models), which stands for number, edition, recognition, compares live captions with a perfect transcript, and weights errors based on their seriousness and comprehensibility. It’s much more time-consuming, as the word-perfect comparison transcript takes several viewings to create (the white paper found NER reviewing ends up taking 10 to 15 times the length of the content being reviewed). It’s good, though, for a thorough periodic audit of a whole company’s average output. In practice, it tends to make good captioners look even better, as while all captioners make mistakes, good captioners catch the big ones and just miss the smaller ones. Such company-wide audits also consider factors largely beyond the individual captioners’ control, such as pace and lag, making them a truer measure of the quality of the viewer’s experience. With few losses, and minimal and minor errors, you have the fixings of a substantial day’s captioning.
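
And since I’ve shown you WER in code, here’s the NER equivalent, hedged heavily: the weightings (serious errors cost 1, standard 0.5, minor 0.25) are my reading of common NER practice, not anything signed off by the white paper’s authors or my employers. A sketch, assuming those weights:

```python
# NER sketch: accuracy = 100 * (N - E - R) / N, where E (edition) and
# R (recognition) errors are weighted by seriousness before subtraction.
# ASSUMPTION: the 1 / 0.5 / 0.25 weights reflect common NER practice.
WEIGHT = {"serious": 1.0, "standard": 0.5, "minor": 0.25}

def ner_accuracy(n_words, edition, recognition):
    e = sum(WEIGHT[s] for s in edition)
    r = sum(WEIGHT[s] for s in recognition)
    return 100 * (n_words - e - r) / n_words

# A good captioner catches the big ones and misses only small ones:
print(round(ner_accuracy(2100, edition=["minor"] * 10,
                         recognition=["minor"] * 20 + ["standard"] * 4), 2))
# -> 99.55, rather more flattering than the same session's raw WER
```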




Disclaimer.

Thursday 15 May 2014

Quality and Accuracy Part One: Losses

I mentioned briefly the notion of "accuracy" in closed captions. Now I already had a post brewing on what exactly we mean by that, and then I saw this TED talk:



I was all fixed to talk about accuracy, really I was. But after that talk, it Tristram Shandied its way into broader territory. As Robson says, captioning quality is the latest technical and legislative frontier of accessible television. And again, the watchword is comprehensibility (he went with “understandability”, but he’s American so I’ll allow it). So this will be the first part of a three-part post on caption intelligibility. Part two will cover errors and measuring accuracy. Part three will look at style and standards. For now though, here is a little look at a more fundamental problem: what happens when there are suddenly no captions at all.

Just how it is sometimes.


In passing, Robson drops a very important industry term: loss. A loss occurs when captions are absent, when relevant and necessary auditory information is completely elided. A loss is the worst possible outcome in any attempt to provide closed captions – far more alienating to viewers than the still-comprehensible word-soup I routinely file here under mishaps.

I See What They Did There. And that's ultimately what matters.


And modest losses occur with what some might regard as startling regularity (I suspect, however, that our more regular viewers may be less than startled to hear this).



I’ve discussed the fairly immutable limits to how long a captioner can broadcast in one sitting. The flesh is weak, the voice goes all Tom Waits.



That necessitates an endless succession of potentially awkward handovers. It also requires a large number of computers, all of which must be powerful both individually and as a network, and any one of which can, from time to time, do what all computers do, and suddenly fail to compute.

And Dragon can get huffy.


Additionally, there are two main processes which need to go smoothly in order for captions to appear without loss. Captions need somewhere to go, and captioners need an uninterrupted source of audio. Each of these can be the locus for both technical hiccups and human error.

A path through which captions are sent is called a “gateway”, and choosing the appropriate gateway is a simple but high-stakes click of a mouse. A very careless captioner may select Sydney rather than Perth. Far more likely, they may select Sydney+Brisbane, when they should be targeting Sydney+Brisbane+Melbourne. In the former case, the captions would not appear at all in the target market, while in the latter the captions would only appear in two out of three of the target markets. Either way, it’s a loss. It’s not very common; of the many thousands of gateway selections I’ve made, I’ve only selected the wrong gateway once or twice.

But it can happen.
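
Conceptually (and only conceptually: the real interface is a high-stakes mouse click in proprietary software, and these market names are purely illustrative), the failure mode is simple set arithmetic:

```python
# Illustration only: the check and the market names are invented;
# the real thing is a mouse click, not a function call.
def gateway_check(target, selected):
    missed = target - selected
    if missed:
        return "LOSS: captions never reach " + ", ".join(sorted(missed))
    return "All target markets covered."

target = {"Sydney", "Brisbane", "Melbourne"}
print(gateway_check(target, {"Sydney", "Brisbane"}))
# LOSS: captions never reach Melbourne
print(gateway_check(target, {"Sydney", "Brisbane", "Melbourne"}))
# All target markets covered.
```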


A far more common cause of losses is the technology of gateways itself. Now I’m not a technician, just a user. So I don’t know the detail, but I know enough to respect and fear gateways as temperamental and fragile beasts. Faults can occur both on our end and the studio’s, so any troubleshooting has to be collaborative. Both ends need to be tested, and whether the solution is a reset, or switching to a backup gateway, it needs to be agreed upon by both sides. And as with roadworks on a major freeway, some of the gateways are in 24-hour use, so urgent repairs can temporarily shut them down, causing losses. For the more sporadically used gateways, test captions are routinely sent before going live (no mischievous captions in case it accidentally goes to air). If you see a caption of a dot flash onscreen briefly during an ad break, you could be witnessing the aftermath of a gateway issue. They might have had to quickly reset, and then send a brief live test, and a bunch of people are likely very relieved at that moment.

Captioners also need an audio source. This might sound like a no-brainer, but it’s a little more intricate than that. We’re not in the studio with the presenter; in fact as voice captioners we couldn’t be, it’s a noisy job.

Rather like the cruel, sadistic world of stock photography.


So we need to hear what’s going to air, somehow. One way is to watch TV. The trouble is, many live shows are broadcast at delays in excess of five seconds. And as Robson noted, long delays between visuals and captions wreak havoc on comprehensibility. They can also cause losses in themselves when it comes to an ad break or handover and the last words disappear into the Aether. Our work introduces its own short delays, as do our gateways – so being five seconds behind before you even get started makes timely captions utterly impossible. Additionally, if we’re in Sydney and captioning Adelaide, or London, we can’t just tune in an ordinary television and start watching, we need the signal piped in. The solutions vary depending on the technological capabilities of the broadcaster, but a lag-free audio source is often acquired, by any available means.



It can be deceptively low-tech – some programs use a telephone line, and back in the analogue days, the analogue signal was much less delayed than the digital. Sometimes we have a direct feed to the studio. Often we have a high-speed audio feed, but only (delayed) off-air visuals, so they don't sync up. Sometimes we caption blind, with only audio, and have to call someone to ensure they can see our captions. Sometimes there is a backup source (albeit one with a delay), sometimes not.

So again, there are a range of possible sources of both human and technological error. If the audio drops out, captions can’t happen, and if you’re hearing the wrong feed, or even if your headphones are just slightly unplugged, there will be a loss.

By far the dumbest loss is, of course, when you respeak a sentence, look up, and see the red icon indicating your mic is off. The entirely hypothetical captioner to whom that happened must surely feel silly…I bet.



We have a backup generator if we lose power. But needless to say, things get truly hairy on the rare occasions when we completely lose our internet connection.

In our contracts with our clients, minimising losses is the factor most directly tied to our profit margins. In our clients’ statutory obligations, losses are what make the percentage-based captioning requirements hardest to fulfil. And in our moral duty to our viewers, the last thing we want is a loss.



Disclaimer.

Tuesday 13 May 2014

Mishaps round-up

Well, we *could* "crack our legs" to make mayonnaise, captioning software, but let's not. The same cooking show had us "doubting our chicken", rather than adding it. "Doubting Our Chicken" kinda sounds like an indie band. And I'm not sure that "salt farmers" should be mixed in with the dried fruits.

In supermarkets, I suppose, next to the "Miced Volvos":

 

Also from the other night's Eurovision, I felt for several colleagues. I spotted "weirded lady", and a UK colleague tweeted me that he'd had "bearded drag queen" come out as "bearded dragon." I can see the logic, really. We should also all take a moment to acknowledge the ridiculous difficulties posed by Sweden's entry, "Undo my Sad". Our software makes predictions based on grammatical context, so a phrase full of short words, in an order which doesn't really resemble any recognisably human syntax, presents the sort of captioning errors best treated with paracetamol.



Disclaimer.

Sunday 11 May 2014

Conchita is Top Banana

Unfortunate indeed that Dragon decided to render "going for it" as "going fruit" in reference to Austria's truly fabulous Eurovision contender.



Disclaimer.

Thursday 8 May 2014

Exquisite rage

A cooking show referenced "gorgeous broth", and it came out as "gorgeous wrath". I like that so, so much better.



Disclaimer.

Wednesday 7 May 2014

Rhythms and Co-pilots

Live voice captioning can be arduous. Physically draining, mentally onerous. There’s a peculiar kind of multitasking required to simultaneously listen, speak and proofread the same content, speaking a few words behind what you hear, and reading a few words behind what you speak. It’s a bit like rubbing your stomach while patting your head and also playing mahjong.

Demonic multitasking


There’s also the brute physicality of uttering as many as 2000 words in 15 minutes, with the pace and breath placement dictated by someone else, someone who you can’t ask to slow down, because they’re more famous than you and also because that’s not how TVs work. A non-stop 15 minute soliloquy would be considered stern work for even the more hardened of avant-garde thespians. And I’ve already written about the nimble real-time curation of script fragments the news requires. And if your enunciation wanes with weariness, just watch Dragon’s recognition tumble, like dominoes kicked over by a confused dragon.

As loveable as it is frustrating.


On top of which, being ready to create text, like, now, takes planning, work, and time – although while on the air, I’ll have neither time nor attention to spare. I have to train in new words, and retrain words which have come out wrong. I may have to write new autocorrect rules, either universal or context-specific (leads united=Leeds United can be safely universal; city=City should probably be football-exclusive). If it’s the news, I have to tidy the last session’s VTs and prepare the next session’s rundown. Needless to say, it can’t all be done at the start of the shift, as old news curdles like suspect dairy. If the whole program is going to be replayed at any point, I also have to tidy the whole thing afterwards, so next time it can be perfect (a pro-tip, therefore, for caption viewers: repeat broadcasts of live shows may have much better captions). And I have to file the text for our records.
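
To illustrate that universal-versus-context-specific distinction, a toy sketch in a rule format of my own invention (Dragon’s actual machinery looks nothing like this):

```python
# Toy illustration of universal vs context-specific autocorrect rules.
# The rule format is invented for this sketch, not Dragon's own.
UNIVERSAL = {"leads united": "Leeds United"}   # safe anywhere
BY_CONTEXT = {"football": {"city": "City"}}    # only safe mid-match

def autocorrect(text, context=None):
    rules = dict(UNIVERSAL)
    rules.update(BY_CONTEXT.get(context, {}))
    for wrong, right in rules.items():
        text = text.replace(wrong, right)
    return text

print(autocorrect("leads united beat city", context="football"))
# -> Leeds United beat City
print(autocorrect("traffic in the city is grim"))
# -> unchanged: no football context, so "city" stays lower-case
```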

Don’t get me wrong – we do get to sit in a soundproof booth watching the news and talking back to it, and I love the work something fierce. But it means live captioning can’t be effectively accomplished in long, uninterrupted on-air slabs. As with transcontinental aviation, we require co-pilots.

Goggles sold separately.


The exact rhythms vary, but most live captioning sessions are either 15 or 30 minutes long, depending on the degree of difficulty and accuracy required. 30 minutes is typical for live sport (which tends to be less taxing, for reasons I’ll explore in future posts) and for more heavily scripted news. 15 minutes is the norm for our more exacting and relentlessly live news programs. After that, the co-pilot takes the wheel, and you get ready for the next time you have to take it back. These sessions can go for up to two hours, with each captioner contributing an hour in total, before there’ll be a pause in scheduling for both of you, to be filled with admin, accuracy work, reviewing, training, and helping the offline department. A nine-hour shift will usually involve about three hours live on the air.

Handing over control is tricky. You look to line it up with a convenient lull in conversation, but much of live television is designed to avoid such lulls in conversation like it’s on an over-caffeinated first date. If you look carefully, you might see the telltale signs of a handover at the 15 and 45 minute marks of a live program. It could be a missing line, a duplicate line, or a five second lapse in captions. You probably won’t catch it at the 30 or 60 minute marks, because these we line up with the end of segments and programs.

One advantage of this division of labour in news captioning is that you tend to cover the same content a few times. Captioner A at the top of the hour and half hour will get up close and personal with the headlines, the top stories, the rarely-changing priorities of the day, while captioner B will become intimately familiar with the weather forecast, which they’ll cover twice each hour, and with the more variable summary coverage of minor and local stories, and quirky end features. Captioner A has the benefit of a fresh rundown which they can tidy from the top, captioner B has to listen while tidying so they can keep track of what’s already gone – they’re editing a document which is also being whittled away towards obsolescence, in real-time – a task with considerable figurative heft if you like your stories with eagles eating livers, snakes eating tails, or stones rolling uphill.



Of course, the other role of a co-pilot is to render assistance in an emergency, or otherwise as they are able. Accordingly, if your computer is unexpectedly sucked into a watery deep by Cthulhu, someone is hopefully poised to seamlessly take over. The most entertaining instance of this I’ve encountered was when a captioner working from home had to cede control as a highly venomous Australian spider marched across her keyboard.

Die. :)


These things happen. More mundanely, a co-pilot has time to spell-check difficult and unexpected names, or to find unexpectedly reused scripts, and message them to the live captioner. Along with plenty of commentary and banter.

Still, it’s busy. Those 15 minute off-air windows are where all that prep has to fit. You just have to get into the rhythm.



Disclaimer.

Monday 5 May 2014

Put 'em together and what have you got?

Doing some offline scripting, and the phrase "bibbidi-bobbidi-boo" emerged. Now you never quite know what Dragon does and doesn't know, so I rolled the dice. "BP property boom". Kinda less magical than the Disney one!



Disclaimer.

A tricky comma nation.

There's a special frustration when a Dragon misfire affects the syntax of the sentence. Had a doozy this morning. ", essentially" (comma essentially) came out as "Thomas urges Pete". Oh well.



Disclaimer.

Sunday 4 May 2014

Lines, composed above (and below)

So in between last week's Russian novel of a post about news captioning, and an in-depth piece which is brewing away about measuring accuracy in captioning, I thought I'd throw in a little about one of those tiny-but-significant things we do - caption positioning.

Captioners have the whole surface of the shiny rectangle to work with, so describing positions onscreen requires a system. As live captioners, ours is very simple. The screen is divided evenly into 20 lines, and they're numbered from 1 to 20. Line 1 is at the top of the screen, line 20 is at the bottom. Our scrolling captions can take up any two sequential lines. </technical stuff>
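
In code, that numbering system amounts to something like this (a sketch of the convention only, not our actual captioning software; I'm assuming here that the named line is the top of the pair):

```python
# The live positioning grid: 20 lines, numbered top (1) to bottom (20),
# with scrolling captions taking up any two sequential lines.
# ASSUMPTION: the named line is the top of the pair.
SCREEN_LINES = 20

def caption_block(line):
    if not 1 <= line <= SCREEN_LINES - 1:
        raise ValueError("a caption needs two sequential on-screen lines")
    return (line, line + 1)

print(caption_block(16))  # (16, 17) - the sweet spot discussed below
print(caption_block(1))   # (1, 2) - the top of the screen
```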

Captions are scrawled like marginalia across the surface of audiovisual content; they cover up anything underneath, like a moustache on the Mona Lisa.

It's so nice to have you back where you belong.


How many problems that causes for the viewer, and how best to avoid causing them, depends heavily on client (i.e. TV channel or broadcasting company) preferences, industry conventions and practical expectations. The paramount consideration, of course, is that the most essential visual information is least likely to be obscured. The next consideration is to avoid endlessly bouncing around like a dated Flubber reference.

Timelessly elegant.


But it also takes a few seconds for line changes to kick in, so as captioners we need to be both on our toes and stoical about the fact that sometimes we'll know that we're covering something important, but it will already be too late.

So what do we look for? For news, panel discussion and interviews, the key visual elements are supers (superimposed graphic bars containing explanatory material, generally in the lower third, an industry term which usually doesn't mean the entire third), moving lips, eyes, occasional infographics, weather maps, 'hardwired' subtitles which all viewers see (used for translation or to clarify bad audio recordings), and incidental recorded vision, particularly around sport. As a general rule though, there's a near-perfect sweet spot, which we unceremoniously dub "line 16". It's about chest-level on any talking heads, about lectern-level on any press conference, above most supers and hardwired captions, and below any vision which frames the action in the centre of the screen. It looks a little something like this:

It's always the manic pixies.


I note in passing that these examples are from The Google and not from my employers. The networks shown may have their own standards, rules and technologies, and my opinions about the success of their caption placement are mine alone. In any case, you see above that the captions fall nicely between the crime scene photographs, and the supers containing the time, the network, the temperature, the scrolling newsbar, the location and headline and the 'breaking news' graphic. This too shows the magic of line 16 in action:

New age parents, man.


You see how neatly it gives viewers access to both the speaker's elegant bone structure and classical good looks, and the informative captions. Most speakers in these kinds of programs are framed in exactly the same way, so it's widely applicable. These, on the other hand, are less successful:

Lard, gross.


It's a tough gig.


Both of these confine themselves to lines 18-20. In the first, we lose whatever is being said about dieting and intimacy in the super, while the second is more serious: that super may have featured important evacuation information, which viewers need.

I mentioned lips and eyes a moment ago. I'll just note in passing that there's more to that than aesthetic considerations. If the captions are in any way less than complete, deaf viewers may wish to lip-read. Anything which prevents that is indeed a fail.

So that just leaves weather and other graphics. Here it's as simple as raising captions to the top, dropping them to the bottom, or parking them anywhere there's a reliable gap. For the weather, the shape of the continents comes into play. In Australia, line 1 just clears Darwin.

Not actually kidding when I say we sometimes bend that rule in cyclone season.


While for the UK, where we just repeat the word "rain" for four minutes, line 20 falls safely between continental Europe and this gentleman's trouserline.

Rakishly dapper fellow.


For sport, it's the scores, other incidental graphics, and the play itself which need to be considered. Luckily, here too there are some useful trends. Sport is unerringly visual, and it often foregrounds bodies over faces. In many sports, you could imagine following the play if you could only see legs, but perhaps not if you could only see chests and heads. It's convenient that we're usually dealing with the extreme top left or extreme bottom left for scoreboards; it puts them out of our way. Between top and bottom, we usually choose line 1. This illustrates the problem with the alternative:

Captioner multitasks, with mixed results.

These captions could very easily obscure the player's footwork, and if they move to a second line they will also clip the scoreboard. Again though, it can be varied, and a larger score insert might necessitate getting freaky with some line 2 action.

That's about it for live caption placement. There is just one other thing I wanted to mention. It's from the world of offline captioning, which is not my area, but film studies is and it's cool. It's horizontal placement - the use of left- and right-justified captions in dialogue. So there's an editing convention in cinema called the 180-degree rule. It's demonstrated here:


It basically means that when you're shooting a conversation conventionally, one character will tend to mostly be left of frame, the other right. The main purpose is to give viewers a coherent sense of the space. As a bonus though, a corresponding left or right caption alignment can clearly and stylishly show who is speaking. You end up with something which looks like this:



Neat, huh?



Disclaimer.

Friday 2 May 2014

Och Aye.

Kind of like that my captioning software thought "okie doke" should be "oaky dirk". Didn't go to air, because I knew it might fail, but evidently there's a fine line between being folksy and bringing up your hardwood-hilted Highland hand-hacker.



Disclaimer.

Thursday 1 May 2014

The headlines this hour:

Captioning the news might seem a counterintuitive role for a live captioner. After all, your evening newshour is a tightly regimented machine, diligently fact-checked, read by a flamboyantly stodgy and resolutely Anglo older upper-middle-class baritone in a suit, abetted by a tersely professional, shoulder-padded alto, who is five years younger and has to work twice as hard and earn 30% less to survive and excel in her tenaciously chauvinist newsroom.

You know the guy.

Or, you watch SBS/ABC. Point is, news is surely meticulously pre-written, right? Even the obligatory meteorological larrikinism is tightly contained, relegated to its half-hourly whirlwinds of high-pressure-low-pressure interactivity. There’s a reason why both cute animal TV news stories and news anchor bloopers enjoy such guaranteed virality: broadcast news canonises and dignifies its content, and that implacable prestige makes the anchor a perfect comic straight-man on par with Sideshow Bob.

Don't blame me, I voted Quimby.

It’s safe, it’s true, it’s the first draft of history written right there on the autocue (or teleprompter if that's your hemisphere). So if it’s all written ahead of time, why couldn’t it be relegated to an offline captioner? Why, in effect, don’t they just flip around the autocue for hearing-impaired viewers to read? Why do they need us?

Errant question marks notwithstanding.

Well, it’s basically because the TV news isn’t just written. It’s also compiled. You’ve probably got five separate headlines, written by five separate people, then the opening music sting, then the anchor’s greeting written or improvised by him or herself, then the anchor throwing to the Sport Reporter With Eyes Too Close Together for a summary of the top sport story, then to the Enthusiastic Weatherperson for a 10-words-or-less version of the weather, before the anchor reads an intro to the top story (written by a sixth reporter). The anchor will then say “Our correspondent, Reporter Number Six sent this report, from In Front Of Something.” Then there will be a VT, or prerecorded package sent by Reporter Number Six. It could include background information, CCTV footage, interviews with key players or random vox pops, allegedly intercepted phone calls, shots of documents and plenty more. Anything which can be compiled and tidily put together before the news goes to air. But the story isn’t done yet! The anchor may then cross live to Reporter Number Six, who is still on the scene In Front Of Something, for the latest breaking developments. Then the anchor crosses live to an Academic With A Beard, for an interview discussing the geopolitical ramifications of The Thing That Happened.

I'm a sceptic about the anthropogenic nature of Donetsk.

That brings us to the end of one story. Probably four minutes into the broadcast. At least five people have spoken, at least nine have contributed either written or improvised verbal content. Between two and four were reading from an autocue, at least two were not. And any part of the story is subject to any amount of revision, and if Surprise Momentous Thing happens, or the VT or satellite feed fails, then The Thing That Happened might be bumped down the order, or out of the broadcast entirely.

So the role of a news captioner is similarly one of compiling a linear narrative from an array of moveable building blocks. At some point, close to the time of broadcast, the running order or rundown becomes available*. The exact mechanism varies between newsrooms, but it typically includes the full text for whatever is going in the anchor’s autocue, the order of stories with estimated times, blocks mapped out for commercial breaks, instructions (usually in a different colour) for those in charge of the visual aspects of the broadcast (cuts, wipes, VTs, crosses etc). The news will usually ultimately be captioned in a hybrid way, with some aspects cued out live from available text, and others voiced. So the captioner’s first task is to tidy the immediately available autocue script. It will be designed to be read aloud, so things like expressive punctuation (a comma, for every pause, even those, which would look odd in print), phonetic spelling, longhand numbers (“six-point-two-five-billion-dollars were wiped off the A-S-X this morning…”) need to be removed. This script needs to be kept in identifiably separate segments, so that stories can be bounced around the rather fluid running order in real time.

*Or, the technology gets the hiccups, and it doesn’t. In which case the captioner gets an extra glass of water, and a tragically non-Irish coffee, and gets ready to do an awful lot of fast talking.
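
For a flavour of what that tidying involves, here’s a toy pass over the examples above. The patterns are invented for illustration; the real clean-up is bigger, messier and partly human judgement, and the comma-for-every-pause problem in particular resists regex:

```python
import re

# Toy tidy-up of autocue text, using the examples from this post.
def tidy_autocue(script):
    # Longhand spoken numbers back to print style (one hand-made rule).
    script = script.replace("six-point-two-five-billion-dollars",
                            "$6.25 billion")
    # Spelled-out initialisms: "A-S-X" -> "ASX". (Greedy toy rule:
    # it would also eat the hyphen in "US-UK".)
    script = re.sub(r"(?<=[A-Z])-(?=[A-Z])", "", script)
    return script

print(tidy_autocue("six-point-two-five-billion-dollars were wiped "
                   "off the A-S-X this morning"))
# -> $6.25 billion were wiped off the ASX this morning
```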



Now, that VT. It’s true, someone will have written it. But it was probably a reporter in the field, and they probably didn’t send a script for it back to the network. And it’s very possibly not on the rundown, because it doesn’t need to be. It’s a prerecorded video, it’s not like it will forget or change its lines. All they’ll need is the first and last words, so the anchor knows when to come in. Even if the reporter did write a script, they probably didn’t see the need to transcribe their interview subjects’ responses. But we want it all, our viewers need the full auditory experience. Luckily, that’s not the end of the story. VTs don’t change, so they can be reused. If we’re captioning a 24-hour news channel, or a breakfast program that goes for a few hours, we can lift the previous hour’s output, find the VTs, check the proper nouns, fix the errors, and use it again. Same goes for different timezones – if your colleagues have captioned the east coast, and you’re on the west coast, they might have done some of your heavy lifting.

After all, some stories develop more quickly than others.

So now, we just need to get ready for that live cross to Reporter Number Six, and the interview with Academic With A Beard. Now that will have to be done live, no way around that, but looking through the rundown and the VT, we know what they’re going to talk about. We can see what new words Dragon will need to be taught, and what hard-to-say words might be programmed in with verbal shortcuts.

Or as I might say, "Word One"

So the news starts. Exciting, bracing music by an Oscar-Nominated Composer plays. We cue out the headlines, they switch the third and fourth stories but some fancy clicking and dragging sorts it out. They also throw in a grab from the Prime Minister after the second headline, which gets respoken. You voice the witty banter with the Sports Reporter With Eyes Too Close Together and Enthusiastic Weatherperson, changing colour for different speakers. The anchor’s intro to the Thing That Happened gets cued out, and the VT. A few extra words the previous captioner missed get spliced in live as you go. You also voice the last sentence, as you were pressed for time and couldn’t tidy the whole thing. You voice the cross to Reporter Number Six with some breaking developments from the last 10 minutes, and the interview with Academic With A Beard in light of these developments, and on it goes.

Not quite live, not quite scripted, the news calls for hybrid captioning.


Disclaimer.