Wednesday, June 17, 2009

The business implications of Project Natal, part two - automated captioning? (pulling inspiration out of your rear end)



Untitled by dno1967 used under a Creative Commons License


On June 1, I wrote a post entitled The business implications of Project Natal that speculated about whether this project, initially aimed at the gaming community, might eventually result in advances in business applications.

Project Natal attempts to remove the hardware devices that we normally use to communicate with computers, and instead lets us communicate with our hands, feet, and other parts of our bodies. If you think about it, a "Control-X" keystroke or even a mouse click is not a natural movement, and (at least theoretically) a Project Natal-like interface supports more natural movements.

But here's the question that interests me - how long will it take to transfer this technology from the game console to the corporate computer? This goes above and beyond Microsoft's announced plans to move the Xbox 360 into movies, television, and social networking. I'm talking about Microsoft server operating systems running on computers that don't need mice.


At the time I wasn't able to come up with any brilliant ideas. But today Ffundercats got me thinking.

For those who don't know, Ffundercats is a weekly podcast, hosted by Josh Haley and Johnny Worthington, that discusses both the FriendFeed application and the things people talk about on FriendFeed. One of the items discussed (and embedded) on last week's podcast was this video, sourced from this FriendFeed discussion:



The video shows Felicia Day (whoever she is; I'm ignorant) trying the Project Natal technology. And during the podcast (go here to listen), Haley and Worthington engaged in some speculation about how these types of interfaces could be used. The discussion about Project Natal starts about 25 minutes into the podcast, and they really get into the topic starting at minute 27.

As an example, Josh noted (29:52) that if you were using a speech-to-text interface, it is unnatural for a person to literally say the letters L O L (for "laughing out loud"). What if the system were instead able to detect laughter, and substitute the "LOL" acronym in the text transcription? Johnny followed (30:15) by noting that visual recognition would be necessary to detect the activity referenced by the acronym "ROFL" (for "rolling on the floor laughing"). The discussion fell apart, however, when Josh and Johnny thought about the phrase "LMAO."
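
Just to make the idea concrete, here is a minimal sketch (in Python) of the substitution step alone. Everything here is my own invention for illustration, not anything Project Natal actually does: it assumes a hypothetical upstream recognizer has already labeled each audio segment as speech or laughter, and simply swaps in the acronym where the sound was detected.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AudioSegment:
        start: float    # seconds
        end: float
        kind: str       # "speech" or "laughter" - assumed output of a recognizer
        text: str = ""  # transcribed words; empty for non-speech sounds

    # Map detected non-speech sounds to chat acronyms.
    SOUND_TO_ACRONYM = {"laughter": "LOL"}

    def render_transcript(segments: List[AudioSegment]) -> str:
        """Build a chat-style transcript, substituting acronyms for detected sounds."""
        parts = []
        for seg in segments:
            if seg.kind == "speech":
                parts.append(seg.text)
            elif seg.kind in SOUND_TO_ACRONYM:
                parts.append(SOUND_TO_ACRONYM[seg.kind])
        return " ".join(parts)

    # Example: segments as a (hypothetical) recognizer might label them.
    segments = [
        AudioSegment(0.0, 1.8, "speech", "Did you see that?"),
        AudioSegment(1.8, 3.2, "laughter"),
        AudioSegment(3.2, 5.0, "speech", "I can't believe it worked."),
    ]
    print(render_transcript(segments))
    # Did you see that? LOL I can't believe it worked.

The hard part, of course, is the recognizer itself; the substitution is trivial once the sounds have been identified.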

A silly conversation, perhaps, but it does suggest some possible business applications for the Project Natal technology. You can say that I pulled this idea out of my rear end:

Think of closed captioning. Currently this is all done manually: both the actual spoken words and the non-spoken sounds are transcribed by hand (for example, "JAKE: Let's go. [starts car]"). What if you could use a Project Natal-like technology to automate the closed captioning process?

In case you didn't know, closed captioning is very labor-intensive, and according to the National Captioning Institute this is true for both pre-recorded and live programs. For example, here is one step in the pre-recorded captioning process:

2) Caption Preparation: At one of NCI's networked caption preparation workstations, a caption editor watches and listens to the program and enters a verbatim text of the dialogue, sound effects and other essential non-verbal features into NCI's proprietary captioning system. The editor breaks the text into discrete captions, assigns appropriate screen placement to each caption and times the appearance and disappearance of each caption with the associated audio and video.

At least part of this step could be eliminated via a Project Natal-like technology. In the ideal scenario, the following would happen:

  • The system would analyze the video and audio from a pre-recorded tape (or a live event).

  • The system would listen to the audio and detect the text being spoken, identify who is doing the speaking (video analysis may be needed for this), and recognize other sounds (laughter, gunshots, etc.). These would be converted into a series of captions.

  • Software could also be developed for the other functions, including breaking the text into discrete captions, positioning the captions, and calculating the timing (see the sketch after this list).
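
To show how mechanical those last functions could be, here is a rough sketch in Python. It assumes the genuinely hard steps (speech recognition, speaker identification, sound detection) have already happened upstream and produced labeled, timestamped segments. The Segment and Caption types, the 32-character caption limit, and the placement rules are all assumptions I made up for the sketch, not anything from Microsoft or NCI.

    from dataclasses import dataclass
    from typing import List

    MAX_CAPTION_CHARS = 32  # rough per-caption limit; an assumption for this sketch

    @dataclass
    class Segment:    # output of the (hypothetical) audio/video analysis
        start: float  # seconds
        end: float
        speaker: str  # "" for non-speech sounds
        text: str     # dialogue, or a sound label like "[starts car]"

    @dataclass
    class Caption:
        start: float
        end: float
        text: str
        position: str  # e.g. "bottom-left" for dialogue, "top-center" for sounds

    def break_into_captions(seg: Segment) -> List[Caption]:
        """Split one segment's text into captions and time them proportionally."""
        lines, current = [], ""
        for word in seg.text.split():
            candidate = (current + " " + word).strip()
            if len(candidate) > MAX_CAPTION_CHARS and current:
                lines.append(current)
                current = word
            else:
                current = candidate
        if current:
            lines.append(current)

        # Divide the segment's duration evenly across its captions.
        duration = (seg.end - seg.start) / max(len(lines), 1)
        position = "bottom-left" if seg.speaker else "top-center"
        label = seg.speaker.upper() + ": " if seg.speaker else ""
        return [
            Caption(seg.start + i * duration,
                    seg.start + (i + 1) * duration,
                    (label if i == 0 else "") + line,
                    position)
            for i, line in enumerate(lines)
        ]

    # Example input, as if produced by the analysis steps above.
    for seg in [Segment(0.0, 2.0, "Jake", "Let's go."),
                Segment(2.0, 3.0, "", "[starts car]")]:
        for cap in break_into_captions(seg):
            print(f"{cap.start:5.2f}-{cap.end:5.2f}  {cap.position:12}  {cap.text}")

Crude, certainly - a real captioning system would respect sentence boundaries, reading speed, and shot changes - but it suggests that once the recognition problem is solved, the formatting steps are straightforward programming.
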
Granted, we're not there yet, but the Project Natal technology suggests where we could go.

And this is just one example of how the Project Natal technology could be transferred to business uses. I'm sure there are plenty of others.