Stenographers Are on Their Way Out: In the Future, AI Will Transcribe Everything

Published: 2024/7/27 15:39:17

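Artificial intelligence is unstoppable. Although still imperfect, it is very likely to replace typists in the future, freeing people from the tedium of typing and even from being tethered to their devices. What other effects will convenient, efficient, and cheap AI transcription have on society? This article is translated from a piece by Greg Noone published in The Atlantic.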

What is the best way to describe Rupert Murdoch having a foam pie thrown at his face? This wasn’t much of a problem for the world’s press, who were content to run articles depicting the incident during the media mogul’s testimony at a 2011 parliamentary committee hearing as everything from high drama to low comedy. It was another matter for the hearing’s official transcriptionist. Typically, a transcriptionist’s job only involves typing out the words as they were actually said. After the pie attack, either by choice or hemmed in by the conventions of house style, the transcriptionist decided to go the simplest route by marking it as an “[interruption].”

Across professional fields, a whole multitude of conversations (meetings, interviews, and conference calls) need to be transcribed and recorded for future reference. This can be a daily, onerous task, but for those willing to pay, the job can be outsourced to a professional transcription service. The service, in turn, will employ staff to transcribe audio files remotely or, as in my own couple of months in the profession, attend meetings to type out what is said in real time.

Despite the recent emergence of browser-based transcription aids, transcription remains an area of drudgery in the modern Western economy where machines can’t quite squeeze human beings out of the equation. That is, until last year, when Microsoft built one that could.

Automatic speech recognition, or ASR, is an area that has gripped Microsoft’s chief speech scientist, Xuedong Huang, since he entered a doctoral program at Scotland’s Edinburgh University. “I’d just left China,” he says, remembering the difficulty he had in using his undergraduate knowledge of American English to parse the Scottish brogue of his lecturers. “I wished every lecturer and every professor, when they talked in the classroom, could have subtitles.”

In order to reach that kind of real-time service, Huang and his team would first have to create a program capable of retrospective transcription. Advances in artificial intelligence allowed them to employ a technique called deep learning, wherein a program is trained to recognize patterns from vast amounts of data. Huang and his colleagues used their software to transcribe the NIST 2000 CTS test set, a bundle of recorded conversations that’s served as the benchmark for speech recognition work for more than 20 years. The error rates of professional transcriptionists in reproducing two different portions of the test are 5.9 and 11.3 percent. The system built by the team at Microsoft edged past both.

“It wasn’t a real-time system,” acknowledges Huang. “It was very much like we wanted to see, with all the horsepower we have, what is the limit. But the real-time system is not that far off.”

Indeed, the promise of ASR programs capable of accurately transcribing interviews or meetings as they happen no longer seems so outlandish. At Microsoft’s Build conference last month, the company’s vice-president, Harry Shum, demonstrated a PowerPoint transcription service that would allow the spoken words of the presentation to be tied to individual slides. The firm is also in a close race with the likes of Apple and Google to perfect the transcripts produced by its real-time mobile translation app.

Huang believes the point at which transcription software will overtake human capabilities is open to interpretation. “The definition of a perfect result would be controversial,” he says, citing the error rates among human transcriptionists. “How ‘perfect’ this is depends on the scenario and the application.”

An ASR system tasked with transcribing speech in real time is only deemed successful if every word is interpreted correctly, something that largely has been achieved with mobile assistants like Cortana and Siri, but has yet to be mastered in real-time translation apps. However, a growing number of computer scientists are realizing that standards do not need to be as high when it comes to the automatic transcription of recorded audio, where any mistakes in the text can be amended after the fact.

Two companies, Trint, a start-up in London, and Baidu, the Chinese internet giant with an application called SwiftScribe, have begun to offer browser-based tools that can convert recordings of up to an hour into text with a word-error rate of 5 percent or less. On the page, their output looks very similar to the raw documents I typed out in real time during the many meetings I attended as a freelance transcriptionist: at best, a Joycean stream-of-consciousness marvel, and at worst, gobbledygook. But by turning the user from a scribe into an editor, both programs can shave hours off an onerous and distracting task.

The amount of time saved, of course, is contingent on the quality of the audio. Trint and SwiftScribe tend to make short work of face-to-face interviews with the bare minimum of ambient noise, but struggle to transcribe recordings of crowded rooms, telephone interviews with bad reception, or anyone who speaks with an accent that isn’t American or British English. My attempt to run a recording of a German-accented speaker through Trint, for example, saw the engine interpret “it was rather cold, but the atmosphere was great” as “That heart is also all barf. Yes. His first face.”
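For readers curious how a “word-error rate” is typically scored: the conventional definition is the word-level edit distance between a reference transcript and the machine’s output, divided by the number of words in the reference. The sketch below is only an illustration under that assumption. The function names `tokenize` and `word_error_rate` are hypothetical, the punctuation handling is a simplification, and nothing here is claimed to be how Trint, Baidu, or NIST actually score their systems. It is applied to the German-accent example above.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters, digits, and apostrophes.

    Simplified normalization chosen for this sketch; real scoring tools
    apply their own (usually more elaborate) text normalization rules.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "it was rather cold, but the atmosphere was great"
    hypothesis = "That heart is also all barf. Yes. His first face."
    print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
```

Because the garbled output shares no words with what was actually said and is one word longer, the rate under this definition comes out above 100 percent, which is possible since insertions count as errors too.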

“We don’t claim that this turnaround in a couple of minutes of an interview like this is perfect,” says Jeff Kofman, Trint’s CEO. “But, with good audio, it can be close to perfect. You can search it, you can hear it, you [can] find the errors, and you know within seconds what was actually said.”

According to Kofman, most of the people using Trint are journalists, followed by academics doing qualitative research and clients in business and healthcare. In other words, they are professions expected to transcribe a large volume of audio on tight deadlines. That’s in keeping with the anonymized data on user behavior being collected by the developer Ryan Prenger and his colleagues at SwiftScribe. While there is a long tail of users who Prenger speculates are simply AI enthusiasts eager to test out SwiftScribe’s capabilities, he’s also spotted several “power users” that are running audio through the program on almost a daily basis. It’s left him optimistic about the range of people the tool could attract as ASR technology continues to improve.

“That’s the thing with transcription technology in general,” says Prenger. “Once the accuracy gets above a certain bar, everyone will probably start doing their transcriptions that way, at least for the first several rounds.” He predicts that, ultimately, automated transcription tools will increase both the supply of and the demand for transcripts. “There could be a virtuous circle where more people expect more of their audio that they produce to be transcribed, because it’s now cheaper and easier to get things transcribed quickly. And so, it becomes the standard to transcribe everything.”

It’s a future that Trint is consciously maneuvering itself to exploit. The company just raised $3.1 million in seed money to fund its next round of expansion. Kofman and his team plan to demonstrate its capabilities later this month at the Global Editors Network in Vienna. Their aim is to have the transcription of the event’s keynote address up on the Washington Post’s website within the hour.

It’s difficult to predict precisely what this new order could look like, although casualties are expected. The stenographer would likely join the ranks of the costermonger and the iceman in the list of forgotten professions. Journalists could spend more time reporting and writing, aided by a plethora of assistive writing tools, while detectives could analyze the contradictions in suspect testimony earlier. Captioning on YouTube videos could be standard, while radio shows and podcasts could become accessible to the hard of hearing on a mass scale. Calls to acquaintances, friends, and old flames could be archived and searched in the same way that social-media messages and emails are, or intercepted and hoarded by law-enforcement agencies.

For Huang, transcription is just one of a whole range of changes ASR is set to provide that will fundamentally change society itself, a shift that can already be glimpsed in voice assistants like Cortana, Siri, and Amazon’s Alexa. “The next wave, clearly, is beyond the devices that you have to touch,” he says, envisioning computing technology discreetly woven into a range of working environments. “UI technology that can free people from being tethered to the device will be in the front and center.”

For the moment, however, the engineers behind automated transcribers will have to content themselves with more germane users: the journalist sweating a deadline, or the transcriptionist working out the right way to describe a man being pied in a parliamentary select committee.
