Google Base and Data Sharing Among Scientists (Nature Vol 438 | 24 November 2005)


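Below is a news report from a November issue of Nature. Google's databases span the globe, and Google Base calls to mind the Semantic Web; together, the two call to mind the “data grid”. Perhaps Google will mark the point at which data grids start being used by ordinary people.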

Google Makes data free to all

translated by Jacquette (http://jacquette.cnblogs.com)

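Last week Google launched a new service, Google Base. It lets anyone upload files to Google's servers for free, making the data immediately searchable. Although the service is aimed mainly at online markets, such as those for homes and jobs, scientists say it also holds great potential for sharing scientific data and may help make the whole Web more “intelligent”.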

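Google Base lets users upload data and describe it with simple tags, which other people can then use in their searches. It also lets users add fields at any time, turning unstructured data into structured data. A web page holding a scientific article might use fields such as “author”, “journal” and “publication date”, or other metadata about the article.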
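The tag-and-field idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Google Base's actual data model or interface: the `Item` class, the `search` function and all sample records are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One uploaded item: free-form tags plus user-defined fields (hypothetical model)."""
    title: str
    tags: set[str] = field(default_factory=set)
    fields: dict[str, str] = field(default_factory=dict)


def search(items, tag=None, **field_filters):
    """Return items that carry `tag` (if given) and match every field filter exactly."""
    results = []
    for item in items:
        if tag is not None and tag not in item.tags:
            continue
        if all(item.fields.get(name) == value for name, value in field_filters.items()):
            results.append(item)
    return results


# Made-up sample data; the fields differ per item because users add them freely.
items = [
    Item("Paper A", {"article", "genomics"},
         {"author": "Lee", "journal": "Nature", "publicationdate": "2005-11-24"}),
    Item("Flat to rent", {"housing"}, {"city": "Geneva"}),
    Item("Paper B", {"article"}, {"author": "Chen", "journal": "Science"}),
]

# A tag narrows the search; a field filter makes it a structured query.
nature_articles = search(items, tag="article", journal="Nature")
```

Because every item carries its own dictionary of fields, a housing listing and a journal article can sit in the same collection and still be queried by the fields each one happens to have.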

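That may not sound like sophisticated technology. But advocates say it allows web content to be structured as databases on a large scale. To begin with, it makes it simpler for scientists to share data and to store it in a form that can easily be searched.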

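When it comes to sharing data, scientists are still in the “Dark Ages”, says David Haussler, director of the Center for Biomolecular Science and Engineering at the University of California at Santa Cruz. Data that fall outside the relatively few large international databases, such as those for gene sequences, protein structures and astronomy data, mostly end up in supplementary tables accompanying journal articles. They are, he says, “stored in some non-indexable, inconsistent and inconvenient format, if indeed they are kept at all”.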

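Google Base or similar services could help, says Ian Foster, co-inventor of the Grid concept, in which many computers work together to provide large amounts of processing power and data storage. Foster believes science badly needs something that makes it easy for individuals and communities to create and share data, along with the programs that operate on those data.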

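A way to easily cross-examine multiple kinds and sources of data would be a real boon to research, says Paul Myers, a bioinformatician at the University of Minnesota. “I think Google is getting in early on what could be an immensely important tool.”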

Smart systems

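Google Base may also signal a modest start for the Web's move towards an intelligent network. That idea was originally envisaged by Tim Berners-Lee, the inventor of the Web, in 1989 at CERN, the European Laboratory for Particle Physics.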

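Web pages are designed to be read by people and do not contain additional descriptive information that computers can process. This limits their usefulness, especially for users searching the Web. For example, it is not currently possible to search the Web for “peer-reviewed papers about experiments in which the CCR5 protein activates PYK2”. And when reading a paper online, we cannot ask the computer to replot a graph so that it includes extra data sets.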

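Berners-Lee champions what he calls the “Semantic Web”, in which tags added to web pages allow computers to understand what the pages contain. This means computers can ask whether the data on those pages meet certain criteria, and can merge data from different sources.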
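The two operations named above, checking whether data meet certain criteria and merging data from different sources, can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not Semantic Web technology as standardised by the W3C: the function `merge_by_key`, the field names and the records are all invented for illustration, and the query mirrors the CCR5/PYK2 example from the text.

```python
def merge_by_key(key, *sources):
    """Combine records from several sources that share the same value of `key`."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


# Two hypothetical sources describing the same papers with different fields.
journal_index = [
    {"paper_id": "p1", "peer_reviewed": True},
    {"paper_id": "p2", "peer_reviewed": False},
    {"paper_id": "p3", "peer_reviewed": True},
]
experiment_index = [
    {"paper_id": "p1", "activator": "CCR5", "target": "PYK2"},
    {"paper_id": "p2", "activator": "CCR5", "target": "PYK2"},
    {"paper_id": "p3", "activator": "CCR5", "target": "other"},
]

papers = merge_by_key("paper_id", journal_index, experiment_index)

# "Only peer-reviewed papers where CCR5 activates PYK2", expressed as a filter.
matches = [
    p["paper_id"]
    for p in papers
    if p.get("peer_reviewed")
    and p.get("activator") == "CCR5"
    and p.get("target") == "PYK2"
]
```

Neither source alone can answer the query; it becomes answerable only once both describe their records with machine-readable fields that share a common identifier.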

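But although the Semantic Web is quickly gaining ground in certain specialist areas such as bioinformatics, it has yet to be widely adopted. Scientists say Google Base could change that by bringing structured web pages to the masses. “The big issue here is whether services like this will help bootstrap the semantic web,” says Greg Tyrelle, a bioinformatics researcher at Chang Gung University in Taiwan.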

Google power

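“Flexible online storage of arbitrary data, including scientific data, is going to be a major area of research over the next couple of years,” says Leigh Dodds, a web expert at publisher Ingenta. “Google Base takes that a step further by widening it out to everyone,” although he adds that he would like to see governments and universities promote such services too, rather than leaving it to Google.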

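Scientists point out, however, that Google has been conspicuously absent from the W3C's work on Semantic Web standards. They also acknowledge that Google Base is still a fairly crude service compared with specialist databases such as GenBank and UniProt. All you can do is put in information and search it; there is no way to extract the data or compute on it.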

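But most researchers believe that will change fast. Google has already opened application programming interfaces (APIs) to its other services, and Google Base is not expected to be an exception. Such interfaces allow anyone to write programs that access Google's databases and mix and match their content with other data to create completely new products.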

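“If Google wants to turn Google Base into more than just a tool for finding information, and into something scientists can actually use to explore data, then more is needed,” says Mark Gerstein, a bioinformatician at Yale University.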

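But observers such as Foster believe such progress could happen fast. “Google has much relevant technology and expertise,” he says. “If it forms the right partnerships and dedicates sufficient resources, it could have a tremendous impact.”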

“Google Base looks a little simple right now, and it’s not clear exactly how to tap into Google’s power,” says Myers. “But we’ve got to start somewhere.”

Original text:

Google launched a new service last week, Google Base. It allows anyone to upload files for free to its massive server farms, making the data instantly searchable. Although mainly aimed at online markets for such things as homes and jobs, scientists say the facility could have important implications for data-sharing in science, and perhaps boost efforts to make the web more ‘intelligent’.

As well as letting people upload data, Google Base lets users describe the data with simple tags that others can then use in searches. It also allows users to structure the data by adding fields on the fly. So a web page holding a scientific article might have fields for ‘author’, ‘journal’, ‘publicationdate’ and other bibliometric information.

That might not sound like a very big deal. But advocates say that this allows web content to be structured as databases on a large scale. For a start, that makes it simple for any scientist to share data, and store it in ways that allow computers to search and retrieve it.

Scientists are still “in the Dark Ages” when it comes to sharing data, says David Haussler, director of the Center for Biomolecular Science and Engineering at the University of California at Santa Cruz. Data falling outside the relatively few big international databases, such as those for gene sequences, protein structures and astronomy data, mainly end up in supplementary tables accompanying journal articles, he says, and are “stored in some non-indexable, inconsistent and inconvenient format, if indeed they are kept at all”.

Google Base or similar services could help, says Ian Foster, a computer scientist at Argonne National Laboratory in Illinois and co-inventor of the Grid concept, in which many computers work together to provide large amounts of processing power and data storage. Science badly needs “something that would make it trivial for individuals and communities to create and share scientific data, and the programs that operate on those data”, he says.

“To have a way to easily cross-examine multiple kinds and sources of data would be a real boon to research,” agrees Paul Myers, a bioinformatician from the University of Minnesota, Morris. “I think Google is getting in early on what could be an immensely important tool.”

Smart systems

Google Base may also signal a modest start for the web to move towards the ‘intelligent’ network originally envisaged by Tim Berners-Lee when he invented the web at CERN, the European Laboratory for Particle Physics in Geneva, Switzerland, in 1989.

Most web pages are designed to be read by humans, and don’t contain additional descriptive information that can be interpreted by computers. This limits their usefulness, especially for users carrying out searches. For example, it’s not currently possible to search the web to find “only peer-reviewed papers dealing with experiments where the CCR5 protein activates the PYK2 protein”. And when reading a paper online, you can’t ask the computer to replot a graph adding in extra data sets.

Berners-Lee champions what he calls a ‘semantic web’, where tags added to pages would allow computers to ‘understand’ what the pages contain. This means computers can ask whether the data meet certain criteria and merge data sets from different sources.

But although the semantic web is fast gaining ground in certain specialist areas such as bioinformatics, it has yet to take off in a big way. Scientists say Google Base could change that by bringing structured web pages to the masses. “The big issue here is whether services like this will help bootstrap the semantic web,” says Greg Tyrelle, a proteomics researcher at Chang Gung University in Taiwan.

Google power

“Flexible online storage of arbitrary data, including scientific data, is going to be a major area of research over the next couple of years,” says Leigh Dodds, a web expert at publisher Ingenta. “Google Base takes that a step further by widening it out to everyone,” although he adds that he would like to see governments and universities doing more to promote such services, rather than leaving it to Google.

Scientists point out, however, that Google has been prominent in its absence from work on the semantic web in the World Wide Web Consortium (W3C), the body that creates web standards.

They also acknowledge that Google Base is a pretty crude service so far, especially compared with sophisticated specialist databases such as GenBank and UniProt. All you can do is put in information, and then search it — there’s no way to extract or compute the data.

But most researchers believe that will change fast. Google has been a pioneer in creating what are known as ‘application programming interfaces’ to its other services, such as Google Maps. These allow anyone to write programs that can access Google’s databases, and mix and match its content with other data to create completely new products.

“If Google wants to turn Google Base into more than just a tool for finding information, and into something scientists can actually use to explore data, then more is needed,” says Mark Gerstein, a bioinformatician at Yale University in New Haven, Connecticut.

But observers such as Foster believe such progress could happen fast. “Google has much relevant technology and expertise,” he says. “If it forms the right partnerships and dedicates sufficient resources, it could have a tremendous impact.”

“Google Base looks a little simple right now, and it’s not clear exactly how to tap into Google’s power,” adds Myers. “But we’ve got to start somewhere.”

posted on 2005-12-17 20:17 Jacquette.wang
