Thursday 8 January 2015

Quality and Accuracy Part Three: Style and the Great Offline Caper

Happy New Year, captioning enthusiasts! We swan-dive into the thick, opaque molasses of 2015 in interesting captioning times. Australia may or may not have exorcised the spectre of deregulation and quality-cutting. Industry-wide security is being tightened against the threat of hacking (as many have mentioned, captions comprise valuable metadata, and a corollary is that it’s data which unsavoury types may covet), albeit at the cost of some convenience and productivity for captioners. And I discovered on New Year’s Day that captioning live fireworks instead of news headlines at the top of each hour is really quite fun (even if “auld” isn’t the most Dragon-friendly word). So I basically hope 2015 will see continued public support for high-quality captioning, smooth and user-friendly security protocols, and colourful explosions just about every day.

If fireworks persist, see your doctor.


Another development over the past few months has been a diversifying of skills for your friendly neighbourhood Rogue Captioner. I mentioned in the very beginning that there are two basic strands of TV captioning – live and offline. Live captioning involves either stenography or respeaking in real time, like an uncommonly literate hamster, as things go to air, sometimes combined with cueing out prescripted elements. While I remain primarily a live captioner, I’ve been gradually learning most of the steps in the shadowy world of offline captioning, and filling in the sometimes-unpredictable gaps between manic live times with the more methodical offline work. I thought I’d share a little about how it’s done. As it makes sense here to go into style and standards, this post also forms the much-belated third part to my “quality and accuracy” series. The other parts covered losses and errors.



You may be surprised how much more time goes into offline than live captioning. Live captioners typically produce output at a rate of around 5:2 – that’s five captioner-hours for every two hours of broadcast – since they require some prep beforehand (and of course a sandwich after), and then usually share the load 50-50 with a co-pilot. So fully captioning a two-hour segment requires two captioners to each prep for roughly half an hour, then alternate 15-minute or 30-minute slots on air, for a total of five captioner-hours. But offline is a really different, and much more chronovorous, ballgame. It’s rarely less than 10:1 – that’s 10 hours of work for one hour of captioned content – and even that kind of efficiency may only happen if you’re an unusually dextrous millipede with opposable thumbs.
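If you like your back-of-the-envelope sums in code form, here’s the same arithmetic as a tiny, entirely unofficial Python sketch using the staffing pattern described above:

# Rough back-of-the-envelope comparison of live vs offline captioner-hours.
# The figures are the illustrative ones from this post, not official benchmarks.

def live_captioner_hours(program_hours, captioners=2, prep_hours_each=0.5):
    # Both captioners prep, then both are rostered across the full broadcast,
    # alternating 15- or 30-minute slots on air.
    return captioners * (prep_hours_each + program_hours)

def offline_captioner_hours(program_hours, ratio=10):
    # Offline is rarely less than 10 hours of work per hour of content.
    return ratio * program_hours

print(live_captioner_hours(2))     # 5.0 -> the 5:2 ratio
print(offline_captioner_hours(1))  # 10  -> the 10:1 ratio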



So where does the extra time come from? Well, the glib answer is that offliners are lazy; the real answer is perfection and timing. Live captions slither and snake their way onto the air, a word at a time, with around a five-second delay and an accuracy rate of between 97.5 and 99 percent. Offline captions appear in carefully sculpted blocks, adhering to a long list of style guidelines and timed to match exactly when the speaker is talking. This post will take you through the process of getting a captioned program ready for broadcast, up to the point of a final edit (that part isn’t yet among my responsibilities, so there be sea monsters), and let you in on the sorts of things we need to keep in mind. Two main processes need to happen – first scripting, and then file/fix-up – with some inevitable overlap between them.



So we first receive an episode, in the form of an MPEG file, from a broadcaster. Captioners assign themselves all or part of the runtime of the episode, depending on how much time and caffeine they have available, then get to work creating a script. The first step in scripting is to import the episode into our offline captioning software. This software is designed to stop, collaborate and listen with both Dragon, the speech-recognition software for respeakers, and the shorthand software used by stenographers. It combines video-navigation functions like play, stop, slow-motion, or (very usefully) jump-back-one-second-and-play, aka the “what was that?” button, with captioning functions like colour change, positioning on screen, and adding, deleting and combining captions. It adds up to an absolutely dizzying array of keyboard shortcuts, and watching someone really experienced use it can be quite baffling. I ain’t there yet, so the shortcut I’ve most mastered is Ctrl-Z, to undo. Before you get started, you also need to make sure the timecode on the video matches that in the caption file; think of this step as the captioning equivalent of the clapper used to synchronise audio and visuals in film.
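To give a flavour of what “matching the timecode” actually means, here’s a hypothetical little sketch in Python. It assumes PAL-style 25 frames per second and HH:MM:SS:FF timecodes, which is my assumption for illustration rather than a description of what our software does internally:

# Hypothetical sketch of timecode alignment, assuming PAL-style 25 fps and
# HH:MM:SS:FF timecodes. Real software handles other frame rates and formats;
# this is just the idea of "make the numbers agree before anything else".

FPS = 25

def timecode_to_frames(tc):
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * FPS + ff

def frames_to_timecode(frames):
    ff = frames % FPS
    ss = (frames // FPS) % 60
    mm = (frames // (FPS * 60)) % 60
    hh = frames // (FPS * 3600)
    return f"{hh:02}:{mm:02}:{ss:02}:{ff:02}"

# If the video starts at 10:00:00:00 but the caption file assumes 00:00:00:00,
# every caption needs this offset added before anything will line up.
offset = timecode_to_frames("10:00:00:00")
print(frames_to_timecode(timecode_to_frames("00:01:30:12") + offset))  # 10:01:30:12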



The software then takes a moment to create some invaluable metadata (I barely knew her data!) which maps out the audio track, calculating the “shape” of the sound in a way which will help it to guess where each caption should fall. Interestingly though, it also maps out the visuals, marking out where all the shot changes fall. When I discuss the second process, file/fix-up, that will come in handy.

Metadata: Nothing Whatsoever to do with Envelopes.
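I don’t know exactly what’s going on under the bonnet there, but the general idea is easy enough to sketch: measure how loud the audio is in small windows to guess where the speech sits, and compare successive video frames to spot the cuts. A toy, hypothetical version, with NumPy arrays standing in for the decoded media and thresholds plucked from thin air:

import numpy as np

# Toy sketch of the metadata pass: where is there probably speech, and where
# are the cuts? Window sizes and thresholds here are made up for illustration.

def speech_regions(audio, sample_rate, window_s=0.05, threshold=0.02):
    """Return (start, end) times, in seconds, of stretches whose RMS energy
    suggests someone is probably talking."""
    window = int(sample_rate * window_s)
    regions, start = [], None
    for i in range(0, len(audio) - window, window):
        loud = np.sqrt(np.mean(audio[i:i + window].astype(float) ** 2)) > threshold
        t = i / sample_rate
        if loud and start is None:
            start = t
        elif not loud and start is not None:
            regions.append((start, t))
            start = None
    if start is not None:
        regions.append((start, len(audio) / sample_rate))
    return regions

def shot_changes(frames, fps, threshold=30.0):
    """Return times, in seconds, where the mean pixel difference between
    consecutive frames jumps - a crude stand-in for shot-change detection."""
    diffs = [np.abs(frames[i].astype(float) - frames[i - 1]).mean()
             for i in range(1, len(frames))]
    return [i / fps for i, d in enumerate(diffs, start=1) if d > threshold]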


So now the main work of scripting can begin (at the very beginning, which Julie Andrews tells me is a bonza place to kick off). On the first play-through of the file, or “first pass”, we respeak it in much the same way we would for live content, with a few differences. Firstly, since we can pause and go back, there is no sense in paraphrasing to avoid words we don’t know or can’t catch, or to get around cross-talk (characters speaking over each other) or fast dialogue. The first pass can be far from perfect, but it must at least be reasonably complete. While in live captioning it sometimes makes sense to skim or just convey the gist, offline has no place for that. Secondly, the first pass is where we begin to create the timing. We do this by setting markers (more keyboard shortcuts) where a conversation or section of narration or whatever begins and ends. The software then gets fancy. It guesses the breakdown of the captions, based on colour changes, punctuation and a two-line limit, then looks at the shape of the audio from that metadata I mentioned, and roughly matches each caption to a stretch of sound, like an audiovisual OkCupid. So let’s say I’m captioning this scene:


I would put a section opener before “Big man in a suit of armour. Take that off, what are you?” and a section closer after “Genius, billionaire, playboy, philanthropist.” The software would recognise three sentences, each comfortably within two lines in length, and would split it accordingly into three captions. Then it would read the audio track metadata and see three little corresponding spikes at frequencies consistent with human speech. I’ve told it where the first caption begins and the third one ends, so it should make an educated guess that the second caption begins after Steve pauses, and the third begins where the voice soundwave shifts to Tony’s taunting tone. Those sentences were all similar in length, but where they vary, the software can include sentence length in its calculations. It often gets it wrong (single words which are held a bit long like “Stella!”, rapid-fire sentences, and lyrics cause particular problems), but it gives us something to work with when I eventually come back and fix the timing.
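If you’d like a rough feel for that guess in code, here’s a hypothetical sketch: one caption per sentence, with the marked section’s time shared out in proportion to sentence length (which, as mentioned, the software can factor in). The real thing also consults the audio spikes and colour changes, and the section timings below are invented purely for illustration:

import re

# Rough illustration of the timing guess, far simpler than the real thing:
# one caption per sentence (each of those lines fits comfortably within two
# caption lines), with the marked section's time shared out in proportion to
# sentence length. The section start and end times here are invented.

def split_into_captions(section_text):
    # Sentences longer than two lines would need further splitting - not shown.
    return re.split(r'(?<=[.?!])\s+', section_text.strip())

def guess_times(captions, section_start, section_end):
    total_chars = sum(len(c) for c in captions)
    timed, t = [], section_start
    for caption in captions:
        duration = (section_end - section_start) * len(caption) / total_chars
        timed.append((round(t, 2), round(t + duration, 2), caption))
        t += duration
    return timed

section = ("Big man in a suit of armour. Take that off, what are you? "
           "Genius, billionaire, playboy, philanthropist.")
for start, end, caption in guess_times(split_into_captions(section), 12.0, 21.0):
    print(f"{start:5.2f} -> {end:5.2f}  {caption}")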



So after the first pass we have a rough script, roughly timed. The aim of the second pass is to get it word-perfect. We watch through again, pausing to fix any Dragon errors, verify any proper nouns, and standardise spelling to our style guides. For Dragon errors, there’s a handy keyboard shortcut which cycles through homophones of a selected word, so you can quickly turn nay into neigh if the politician voting against the motion turns out to be Mr Ed. For proper noun verification we can use the credits, IMDb, or the enviable database my company maintains, filled with soap opera family trees, the baffling names of reality TV contestants who Didn’t Come Here To Make Friends, and street directories for places that never were. I won’t subject you to all of our spelling standards (with some exceptions, reciting the dictionary isn’t the best way to dazzle readers), but one of my favourite documents is our official spelling list of non-verbal sounds. So “eugh” expresses disgust, but “ew” is used to “express disgust, Valley Girl-style”. “Ah” always expresses discovery and “uh” uncertainty (except where it’s within “uh-oh,” “uh-uh” or “uh-huh”), even though what you actually hear may sometimes be the other way around. “Oh” is surprise or an interjection, but “Ohh” is emotional pain (“O” sometimes comes up as a religious invocation, but usually needs specific verification). If it’s quizzical or contemplative it’s “hmm”, but if it conveys agreement or pleasure it’s “mmm”.

Another acceptable usage.
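That list practically begs to be a lookup table. Here’s a hypothetical rendering of the conventions above in Python (my paraphrase, not the official wording, and the real document is considerably longer):

# A hypothetical lookup table for the non-verbal sound conventions described
# above. The real style document is longer, and this is my paraphrase of it.

NON_VERBAL_SOUNDS = {
    "eugh": "disgust",
    "ew":   "disgust, Valley Girl-style",
    "ah":   "discovery",
    "uh":   "uncertainty (except within uh-oh, uh-uh and uh-huh)",
    "oh":   "surprise, or an interjection",
    "ohh":  "emotional pain",
    "o":    "religious invocation (usually needs specific verification)",
    "hmm":  "quizzical or contemplative",
    "mmm":  "agreement or pleasure",
}

def house_meaning(spelling):
    return NON_VERBAL_SOUNDS.get(spelling.lower(),
                                 "not on the list - check with an editor")

print(house_meaning("Ohh"))  # emotional pain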


We also resplit captions at this point to be as readable as ephemeral text on a glowy rectangle can be. Here there can be some trade-offs. We try and keep sentences, or clauses, or concepts, or individual speakers, together. We try not to end either a line or a caption with a preposition (of, for, under…), a conjunction (and, but, so…), an article (the, a, an) or a verb – anything, basically, which belongs with the word that follows. We try not to have a book or movie title go over a line or caption break. We don’t use semicolons, as they’re difficult to read without the ability to glance back over the first part. We avoid colons in most cases, because in captioning they specifically mean “read the screen”. So if a phone number is printed onscreen, we might caption “For a free trial appendectomy, call:”. And more generally we err on the side of shorter sentences, as they’re more readable onscreen.
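For the curious, here’s a toy sketch of the “don’t strand a word from its friends” rule. The word lists and the 37-character line length are my own illustrative guesses, and in practice this judgement is applied by humans rather than a script:

# Toy sketch of choosing a line break inside a caption: prefer a break that
# doesn't leave a "leaning" word stranded at the end of the first line.
# The word lists and the 37-character line limit are illustrative guesses.

DONT_END_A_LINE_WITH = {
    "of", "for", "under", "to", "in", "on", "at",  # prepositions
    "and", "but", "so", "or", "because",           # conjunctions
    "the", "a", "an",                              # articles
}

def best_line_break(words, max_line_chars=37):
    """Return (line1, line2), picking the latest break that fits and doesn't
    end line one with a word that belongs with what follows."""
    candidates = []
    for i in range(1, len(words)):
        line1, line2 = " ".join(words[:i]), " ".join(words[i:])
        if len(line1) <= max_line_chars and len(line2) <= max_line_chars:
            strands_a_word = words[i - 1].lower() in DONT_END_A_LINE_WITH
            candidates.append((strands_a_word, -i, line1, line2))
    if not candidates:
        return " ".join(words), ""  # doesn't fit on two lines - resplit the caption
    _, _, line1, line2 = min(candidates)  # good endings sort ahead of bad ones
    return line1, line2

line1, line2 = best_line_break(
    "I'm heading down to the shops for milk and a newspaper".split())
print(line1)  # I'm heading down to the shops
print(line2)  # for milk and a newspaper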



So once we finish the second pass and a quick spell-check, that wraps up the scripting phase. Next comes a grab bag of chores under the heading “file/fix-up”. We now have a word-perfect script; the next big task is timing. A handy keyboard shortcut which moves the video to the beginning of the caption you’re editing becomes your friend at this point. Hopefully many of the captions will be sitting roughly where they need to be, but we go through and meticulously adjust the start and end times of each caption to correspond with when the speaker is doing their speak thing. There are a few exceptions, though – a caption should be no shorter than one second, even if the utterance is, because while a very short caption doesn’t take long to read, it might take a moment to notice. And if the dialogue races past 300 words per minute, a caption can linger a little longer than the speech for readability. If a pause is very short, like on a dachshund, we don’t put a gap between captions. It looks more polished, and it also means that when the captions do cease, the viewer will subconsciously know the conversation has paused and they can safely look around the rest of the mise-en-scène without having to be immediately yanked right back to the captions. For similar reasons, we try and align the beginning and end of captions with any relevant shot changes, even if the speaker begins talking slightly before or after the cut. And they often do – there’s a film editing technique called a “sound bridge” which involves using sound to smooth over a visual cut. Sound bridges can also mimic the way our senses work – we begin hearing a sound and then look up. But a cut involves a whole new slab of (visual) information to take in. If a caption comes just before, the viewer might be intently reading it and miss something visual. If it comes just after, they might be taking in the visuals and run out of time to read the caption. If it’s simultaneous, it maximises the time to take both in.
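To make that bookkeeping concrete, here’s a hypothetical sketch of those timing rules. The one-second minimum and the 300-words-per-minute figure come from above; the snapping window and the “very short pause” threshold are numbers I’ve made up for illustration:

# Hypothetical sketch of the timing clean-up described above. The one-second
# minimum and 300 wpm figure are from the post; the snapping window and the
# "very short pause" threshold are made up for illustration.

MIN_DURATION = 1.0        # seconds - even a one-word caption needs noticing
MAX_WORDS_PER_MINUTE = 300
SNAP_WINDOW = 0.5         # snap to a shot change within half a second (made up)
MIN_GAP = 0.3             # pauses shorter than this get no gap (made up)

def tidy_caption(start, end, text, shot_changes):
    # Enforce the minimum duration.
    end = max(end, start + MIN_DURATION)
    # Let the caption linger if the implied reading rate is too fast.
    words = len(text.split())
    end = max(end, start + words / (MAX_WORDS_PER_MINUTE / 60.0))
    # Align with a nearby shot change where there is one.
    # (Real software would re-check durations and overlaps after snapping.)
    for cut in shot_changes:
        if abs(cut - start) <= SNAP_WINDOW:
            start = cut
        if abs(cut - end) <= SNAP_WINDOW:
            end = cut
    return start, end

def close_short_gaps(captions):
    # captions: list of (start, end, text), already in broadcast order.
    tidied = list(captions)
    for i in range(1, len(tidied)):
        prev_start, prev_end, prev_text = tidied[i - 1]
        start, _, _ = tidied[i]
        if 0 < start - prev_end < MIN_GAP:
            tidied[i - 1] = (prev_start, start, prev_text)  # run them together
    return tidied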



So, back to the Avengers clip above: let’s look at the timing considerations. No-one is talking fast enough to present serious problems with reading speed, and not many of the pauses would be long enough to justify a gap in the captions. Tony’s “Why shouldn’t the guy let off a little steam?” gives a nice example of a sound bridge, as the cut from a long shot to a mid shot happens just before he finishes talking. So we’d probably clear it right on the cut, which frees up the viewer to take in the visual tension between Steve and Tony. Whereas for Tony’s last close-up, he starts saying “Genius, billionaire, playboy, philanthropist” just after the cut, but close enough that you’d probably start the caption on the cut. As an added bonus, this makes it easy to see who is talking, as it ties the shot and the caption together, like a comic panel and a speech bubble.



We also do caption positioning at this point. I’ve mentioned the principles of this; the main things are avoiding speakers’ mouths and any important visual information. By default we hug line 20 at the bottom of the screen, raising over any supers when necessary. Interestingly, we also have to raise for 10 seconds after ad breaks in case network promos are added in post-production, after we’ve done our thing (or when the show is repeated). We have to make sure there’s at least a second clear just before and after ad breaks, as captions that run too close to a break can cause “hanging caption” glitches. For the same reason, a blank caption is needed at the beginning so any late-running ad captions don’t get stuck. We insert labels where it isn’t clear who is speaking, sound effects where relevant, and a captioning company credit at the end. We run a battery of tests which check for errors, short gaps, minimum and maximum lengths, word rates, overlaps, captions too close to shot changes, spelling, homophones and invalid characters (some text we copy in comes with the wrong kinds of apostrophes, which is a headache), and then we…uh, watch the show. We watch it as it will air, or at double speed if we’re running short of time, and look for anything that seems wrong or unclear.

Nup, all seems in order.
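And since the battery of tests is really just a pile of small, fussy checks, here’s a hypothetical handful of them, run over (start, end, text) captions. The thresholds, and the assumption that curly apostrophes are the “wrong kind”, are mine rather than the real test suite’s:

# A hypothetical handful of the automated checks mentioned above. Thresholds
# are illustrative, and treating curly apostrophes as the "wrong kind" is an
# assumption rather than the actual house style.

def check_captions(captions, shot_changes):
    """captions: list of (start, end, text) in broadcast order. Returns a list
    of (caption_index, problem) pairs for a human to look at."""
    problems = []
    for i, (start, end, text) in enumerate(captions):
        if end - start < 1.0:
            problems.append((i, "shorter than one second"))
        words = len(text.split())
        if end > start and words / ((end - start) / 60.0) > 300:
            problems.append((i, "reading rate over 300 words per minute"))
        if "\u2019" in text:
            problems.append((i, "invalid character (wrong kind of apostrophe)"))
        for cut in shot_changes:
            if 0 < abs(cut - start) < 0.25 or 0 < abs(cut - end) < 0.25:
                problems.append((i, "hovering just off a shot change"))
                break
        if i and start < captions[i - 1][1]:
            problems.append((i, "overlaps the previous caption"))
        if i and 0 < start - captions[i - 1][1] < 0.3:
            problems.append((i, "awkwardly short gap after the previous caption"))
    return problems

print(check_captions([(0.0, 0.4, "Hi."), (1.6, 4.0, "That\u2019s better.")],
                     shot_changes=[1.5]))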


And then we send it off to the editor, and go get a sandwich.



Disclaimer.