Where Einstein Meets Edison

Semantic Technologies Series: Interview with Thomas Tague of OpenCalais

Jul 19, 2010

We are continuing our 6-part series on semantic technologies, bringing you interviews with industry thought leaders. To follow this week are interviews with Chris Messina (Google), David Recordon (Facebook), Will Hunsinger (Evri) and Jamie Taylor (Metaweb).

I talked to Thomas Tague, VP Platform Strategy at Thomson Reuters and in charge of OpenCalais, a free service to analyze and extract concepts from user-submitted texts or web sources of any kind. Its main users include publishers, bloggers and content managers who are looking to incorporate semantic functionality into their content. OpenCalais is linked into the Linked Open Data cloud, a project driven by Sir Tim Berners-Lee to connect distributed data across the web. Calais is working with CNET/CBSi, Huffington Post, DailyMe, Associated Content, The British Library and many more.

Thomas joined Thomson Reuters through the acquisition of ClearForest, where he was the VP of Solutions and Marketing. ClearForest builds software solutions to search facts and derive meaning from unstructured data sources such as news articles, blog posts or research reports.

Take us back to the early days of Calais in 2008 and the motivation behind the project…

Why is Thomson Reuters doing this? Well, they acquired ClearForest with the goal of using it inside Thomson Reuters to [derive more value from its data]. A large portion of Thomson Reuters, which has a large variety of content – for example 100,000 ticks/second of stock market data – […] is to create value by the ability to link content more tightly, derive insights – that’s why they bought us. And we are well down that path, and there are a lot of things we are doing internally that we can’t talk about. Thomson Reuters is probably, outside of the CIA, the world’s largest repository of structured content; we have x petabytes of structured content and add value to it.

There are two reasons why we did OpenCalais. First, ClearForest was enterprise software, a very protected environment. We wanted to get it out in the world, make it tougher, more agile, more resilient, and by opening it up to now 30,000 developers, believe me, it’s broken many times, and then we fix it and the software is just far better than it was 2 years ago – both in terms of functionality and stability. A few days ago we hit just a crazy load of over 200 transactions/second for an hour – we’d never seen even 1/100 of that in the past.

The other motivation is more strategic. Thomson Reuters is a content company; it spends a lot of money to bring content into the top of the silo, to integrate and validate that content and pull it out through applications and tools. And then we have that other thing going on – the web, other content. 20 years ago, that was perfect – now that is orders of magnitude larger than what is inside our walled garden and there is value in that content. So now the question is: what do you do?

We are always going to be the source of the most trusted content in the world. We would prefer if the explosion of the web did not detract from that but in fact was additive. So one of the values we get out of Calais is that, as more open content becomes available, we learn about the intersection of highly trusted monetized content and less trusted free content on the web. And what we are learning is that there are interesting business cases for both. There are businesses that can be built and value propositions derived just by harvesting wild [i.e. unstructured] content.

Can you provide one example of how the intersection between trusted paid content and free web content could play out?

We are starting to hear from our clients [how they] get value out of taking highly trusted Thomson Reuters content and linking it to the world’s content. [People] are not going to make $100 million bets based on blog postings. But that blog posting may be an outlier, may be an initial indicator, maybe about a layoff at a factory or something like that, that [people] can now immediately link back to trusted Thomson Reuters data and gain some insight and take some action. They might be a little bit faster; they might have a little better insight. In that case, the explosion of wild content and its importance for professionals becomes great – it’s not cannibalistic. We are starting to see some interesting first steps in that direction.

One of the widely discussed questions, also here at the Semantic Technologies Conference, is how to monetize linked data. One of the key conditions is to find a valuable data set and become a trusted source, but where do you go from there?

This is going to be one of the fascinating questions once linked data takes off. It is interesting: it seems to be technically solid but people aren’t actually using it very much. But assuming it does take off, it will be about which domains you own, and who is the trusted resource for certain data.

So, for example, one of our clients, CNET/CBSi, will be publishing linked data around electronic products. They will probably be the default go-to source for linked data for 8 megapixel cameras. Now, how do you monetize that?

In our case, the deepest linked data we publish is around companies. And we give away a little bit and then there is a link at the bottom that says “if you want more” …

… basically a freemium model …

Right now, we don’t actually expose this apart from some experiments with some of our clients, but it is essentially a freemium model. We give this away because it’s essentially exhaust out of our business, and if you want to go down one more level we have a contractual relationship where you have to pay for it. The number of data and content types this is applicable to? I don’t know yet.

In the case of CNET/CBSi, if they had sales numbers, review information, sentiment information about their products, I could see them hiding this behind a paywall. Other stuff is already out there for free.

How about from a technological standpoint?

The technology doesn’t support it very well. Linked Data was designed by academics, with no built-in monetization model other than one standard which says “subscription required beyond that point”. People will have to invent this.

So will Thomson Reuters become the data market place of the future, taking into consideration the long tail of the market, for example individuals looking to buy a very well-defined data subset for a low, one-time access fee?

I can’t speak about future products. What I can say is that there are a lot of conversations at the most senior levels within Thomson Reuters around how we are going to play with content. Those conversations are going on all the time – cannibalization, value-add – we are not unaware of the issue or the opportunity and we are thinking about it aggressively. And I am sure ten years from now the model will look very different but I can’t say exactly how. I advocate that we continue to experiment, making portions of our content available either for free or under some kind of transactional license, allowing others to go off and create solutions on top of it.

A vantage point many content companies are facing …

One of the interesting things going on with all the content companies right now is that they are asking themselves: am I a content company or am I a platform; do I want to be a platform on which others will use my content and capabilities to build tools? We are going through the same questions right now and the answer, in the 10-year time frame, will probably be yes.

Let’s talk a little bit more about how Calais interacts with developers and start-ups.

One of the joys of my job is talking to a lot of smart people – I joke about how many NDAs I have to sign every year. We have a lot of discussions, and 70% of the time we are telling people it’s not a good idea: either we have seen it before or there is no clear path to monetization; they should maybe go back and think about the value proposition a little bit more. There is an enormous number of me-too companies: “let’s take a piece of text and tag it and then search by tag”, “let’s put news on a map” – it’s been done! To some extent, what we offer is business advice: let’s talk about a domain you are passionate about and look at how we can use Calais and linked data to make you more successful. With the more intellectually interesting ones, we spend hours brainstorming.

How big is the team?

It is a virtual team, drawing what we need for Calais from the 80 people within ClearForest; most of our effort is dedicated to servicing Thomson Reuters.

What would be your key message to startups in the semantic technologies space?

You have to focus on the business; technical capabilities alone are not adequate. It is the hard lesson of every new technology.

What role do Google and Facebook play in defining the future of linked data, given the power of these companies?

These are interesting and sensitive questions but I will give you my direct answers. Facebook, I don’t think they will be able to; you need the right balance between contributing to the community and doing stuff for the good of open and linked data, and I think they made poor choices in that balance in the past. This is a world where you have to give capabilities away in order to get your monetization. We’ll see where they go. They are coming back in the right direction. Google, you never know, you just never know. Well, there is the adoption of rich snippets. It is my assumption that they are baking up some cover ideas, they probably have nothing to do with the standard. But if they come out they’ll be simple and they will make sense, and if they are in any way tied to SEM they will possibly become dominant.

I hope there is a reasonable level of skepticism by users; they should look at the motivations, the lock-in in adopting certain formats.

What is your prediction for the near future concerning linked data?

The whole linked-data world is puzzling and problematic; I believe in it fully but we invested massive amounts of money – when Calais first came out, there was no linked data aspect to it – to create a linked-data aspect and push high-value information, company fundamentals. We publish the metadata, as an optional thing, for privacy reasons, but utilization is extremely low. There is a lot of talk about it, but with respect to our linked-data company information, people aren’t picking it up very much yet.

I think we will continue to see pick-up in the truly open linked data space, particularly around government data; we will be playing in that area just to put fuel on the fire, to see examples. It’s still just a technology; we need some more real businesses.