Andy Havens: April 2012 Archives
In the first of what (we hope) will be a series of "Behind the Research" interviews, we spoke to Jean Godby, Ph.D., Senior Research Scientist. If you're wondering about the connection between Noam Chomsky and library metadata, now's your chance to find out.
Thanks for taking the time to talk with me today, Jean.
Jean Godby: Glad to.
Andy: This is the first one of these we've done. So I'm not 100% sure what we'll end up talking about. But my goal is to get some background on what brought you to OCLC, what got you interested in the research you're doing, and what motivates you going forward. Why don't we start with where you went to school?
Jean: OK. I got my M.A. and Ph.D. degrees in Linguistics from Ohio State University and a B.A. in German Literature from the University of Delaware.
Andy: And how did that lead to a career in library research?
Jean: Well, when I was in school we weren't too far removed from Noam Chomsky's original work on linguistics. He's still, obviously, very influential. But the way in which computers could be used to study language was still relatively new, and I found it fascinating.
Andy: In what way?
Jean: We think of language, generally, as very fluid and complex. Which it is, of course. But Chomsky wondered if computer analysis of text could also provide clues to help us understand the basic structures of language. He invented a hierarchy of grammars that is still taught, one that is much more amenable to computer analysis than the traditional grade-school method of diagramming sentences that most of us were taught.
Andy: Can you give me a brief, layman's idea of the difference?
Jean: Sure. We're taught to pull apart sentences into chunks based on the parts of speech: subject, object, prepositional phrases, etc. The problem is that this requires an intimate knowledge of grammar, rules, exceptions, usage and context. That's harder to model using computers, and isn't as helpful for understanding key universal properties of language.
Andy: Like studying geography as opposed to physics, maybe? Examples instead of rules?
Jean: Something like that. Whereas Chomsky's hierarchical approach to studying language lets you break it down in a much more binary fashion, which is something that can then be parsed by computer programs. Would an example help?
Andy: Couldn't hurt...
Jean: Let's take the sentence, "My brother ran up a bill." Traditionally, we'd break that down into a subject ("my brother"), a verb ("ran"), a preposition ("up") and a predicate object ("a bill"). Chomsky, on the other hand, would break that into pairs, and then further into pairs, like so:
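The diagram that accompanied this answer is not reproduced here. As a rough stand-in, the pairwise breakdown Jean describes can be sketched in Python with nested pairs. The exact grouping and the helper name `bracket` are assumptions for illustration, not a reconstruction of the original figure.

```python
# A binary (two-way) breakdown of the sentence as nested pairs.
# The grouping is an illustrative assumption, not the original figure.
sentence = (("My", "brother"), (("ran", "up"), ("a", "bill")))

def bracket(node):
    """Render a nested pair as a bracketed string."""
    if isinstance(node, str):
        return node
    left, right = node
    return f"[{bracket(left)} {bracket(right)}]"

print(bracket(sentence))
# [[My brother] [[ran up] [a bill]]]
```

Every bracket holds exactly two parts, which is what makes the structure easy for a program to walk.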
Andy: And that's helpful because...
Jean: Being able to break down language in this way means that we can study it in binary, logical ways. Which means we can use computer analysis to determine the meaning of text, rather than relying solely on human interpretation. It also helps us understand why the sentence "My brother ran up a bill" can be reworded as "My brother ran a bill up," but the sentence "My brother ran up a hill" can't be rephrased as "My brother ran a hill up."
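One way to picture the contrast, as a hedged sketch: when "up" pairs with the verb ("ran up") it behaves as a particle and can move after the object, and when it pairs with the noun phrase ("up a hill") it is a preposition and stays put. The two structures and the functions `can_shift` and `shift` below are illustrative assumptions, not OCLC code.

```python
# Each verb phrase is a binary pair; only the grouping differs.
bill = (("ran", "up"), ("a", "bill"))   # "up" grouped with the verb
hill = ("ran", ("up", ("a", "hill")))   # "up" grouped with "a hill"

def can_shift(verb_phrase):
    """True if the left half is a verb+particle pair,
    so the particle may move after the object."""
    left, _ = verb_phrase
    return isinstance(left, tuple)

def shift(verb_phrase):
    """Move the particle after the object."""
    (verb, particle), obj = verb_phrase
    return " ".join([verb, *obj, particle])

print(can_shift(bill), can_shift(hill))  # True False
print(shift(bill))                       # ran a bill up
```

The words are nearly identical in both sentences. Only the pairing differs, and that pairing is what a program can inspect.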
Andy: Interesting. And that obviously has applications for library science.
Jean: Of course. Text that is "machine processable," as we say, can provide us with easier and, in some cases, very different and very useful metadata.
Andy: And what is that used for?
Jean: Well, for one thing, providing a way to map metadata from one system to another. To create a kind of "Rosetta Stone" for metadata. Or, more accurately, many Rosetta Stones.
Andy: That sounds incredibly useful.
Jean: It is. In many cases, libraries have to deal with metadata standards--or a lack of standards--that arise from the data creation activities of non-libraries. If we can come up with translation analysis and software, it means we don't have to support as many standards directly. We can wait until useful standards emerge from the environment and then use them in our work, regardless of how they came about and how appropriate, initially, they are to our systems.
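As a hedged illustration of the "Rosetta Stone" idea, a metadata crosswalk can be sketched as a lookup table from one schema's fields to another's. The few MARC-to-Dublin-Core pairs below are simplified, and the function name `crosswalk` is hypothetical. Real mappings handle subfields, repeated fields and many more elements.

```python
# Simplified MARC tag -> Dublin Core element pairs (illustrative only).
MARC_TO_DC = {
    "100": "creator",
    "245": "title",
    "650": "subject",
}

def crosswalk(record, mapping=MARC_TO_DC):
    """Translate a {tag: [values]} record; unmapped tags are dropped."""
    out = {}
    for tag, values in record.items():
        target = mapping.get(tag)
        if target is not None:
            out.setdefault(target, []).extend(values)
    return out

record = {
    "245": ["The Grapes of Wrath"],
    "100": ["Steinbeck, John"],
    "999": ["local note"],  # no mapping, so it is left out
}
print(crosswalk(record))
```

Supporting a new standard then means adding another table rather than rewriting the system, which is the sense in which there can be "many Rosetta Stones."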
Andy: Can you give us an example of work you've done in this area at OCLC?
Jean: Sure. Name extraction is an excellent example. It's not always easy to automatically recognize and extract useful names of people, places, groups, organizations, etc. from text. When we can do that, analytically, though, we can provide a whole new set of tools for discovery. Our work on names and authority files here at OCLC has led, among other things, to WorldCat Identities, which provides a way to search for works by and about named entities.
Andy: I love WorldCat Identities. I call it "Six Degrees of Francis Bacon."
Jean: Nice. Also, this work is useful for more routine aspects of cataloging and discovery, such as FRBR and GLIMIR. Determining what an "item" or "title" actually is.
Andy: And this all tracks back to Chomsky?
Jean: Yes. Though that might be a little bold on our part. I like to think that we start with the same questions he did. What concepts do our systems rely on? How can we model them in ways that allow for better communication between them? How do we start with complex, analytical models but then deliver a coherent view that's useful to our communities of users?
Andy: Grammar, linguistics, computer programming and library science all in one. Very interesting stuff. Before we go, let me ask you... on a more personal note, what have you been reading recently?
Jean: Well, I just finished "The Big Short," by Michael Lewis.
Andy: He wrote "Moneyball," too, right?
Jean: Yes. Both very interesting. But "The Big Short" is one of the scariest things I've read in a long time.
Andy: Anything else?
Jean: Well, I've finally gotten around to reading "The Grapes of Wrath," which, I'm ashamed to admit, I hadn't ever read before. I've also always been a fan of Appalachian fiction, as I grew up in that part of the country.
Andy: Thanks, Jean. And thanks for taking the time with us to talk about how linguistics and programming fit into your areas of research.
Jean: Glad to oblige.