Welcome

What is big data? What do we do about it?

1. The challenge: big data

When needs for data collection, processing, management, and analysis go beyond the capacity and capability of available methods and software systems.

2. The solution: data science

Scalable architectural approaches, techniques, software, and algorithms that alter the paradigm by which data is collected, managed, and analyzed.

3. The relevance

Addressing the challenge of Big Data is on the critical path for many organizations.

(1) The size and distribution of data sets and predictive models continue to burgeon.

(2) Core objectives (such as reproducibility of analytic results) are becoming compromised.

NRC Report
Frontiers in the Analysis of Massive Data

A Major Shift from Compute-Intensive to Data-Intensive 

(1) Chartered in 2010 by the National Research Council.

(2)Chaired by Michael Jordan, Berkeley, AMP Lab (Algorithms, Machines, People).

(3)Co-author: Dan Crichton, JPL.

(4) Consideration of the architecture for big data management and analysis.

(5) Importance of systematizing the analysis of data.

(6)Need for end-to-end lifecycle: from point of capture to analysis.

(7) Integration of experts from multiple disciplines.

(8)Application of novel statistical and machine learning approaches for data discovery.

Data Lifecycle Model for Space Missions 

NCI Early Detection Research Network: Moving towards Data-Driven Science for Cancer Biomarkers

The Need for Data Science 

(1) Data Science is focused research to develop principled techniques and scalable architectures that address challenges across the entire data lifecycle.

 

(2) Data lifecycle: data generation -> data triage -> data curation -> data transport -> data processing -> data mining/visualization -> data analytics.
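The lifecycle stages above can be sketched as a chain of small functions, each consuming the previous stage's output. This is purely illustrative Python; the stage functions and the `run_lifecycle` helper are hypothetical names invented for this example, not JPL software.

```python
# Illustrative sketch of the data lifecycle as a chain of stages.
# Each stage is a plain function from data to data; all names below
# are hypothetical and exist only for this example.

def generate():
    # Stand-in for instrument readout: raw samples with some noise.
    return [0.1, 0.3, 5.2, 0.2, 4.8, 0.1]

def triage(samples, threshold=1.0):
    # Keep only samples worth storing (an irrevocable choice).
    return [s for s in samples if s > threshold]

def curate(samples):
    # Attach minimal metadata so the data remains interpretable later.
    return {"units": "volts", "samples": samples}

def process(product):
    # Example processing step: convert to calibrated values.
    product["samples"] = [round(s * 2.0, 2) for s in product["samples"]]
    return product

def analyze(product):
    # Final analytics step: a summary statistic.
    return max(product["samples"])

def run_lifecycle():
    return analyze(process(curate(triage(generate()))))

print(run_lifecycle())  # peak calibrated value
```

The point of the chain is the one made in the notes: a poor choice at the `triage` step silently limits everything `analyze` can ever see.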

Common Challenges in Massive, Distributed, Heterogeneous Data

OODT: Object-Oriented Data Technologies
An Open Source Data Science Framework

Triage, Analysis, and Understanding of Massive Data

(1)Detection: fast identification of signals of interest (triage).

(2)Prioritization: use triage decisions to inform adaptive data compression.

(3)Classification: online, real-time source type classification.

(4)Understanding: generate human-understandable explanations for decisions.
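As a toy illustration of the detection and prioritization steps above, the sketch below flags windows whose mean amplitude exceeds a noise threshold, keeps flagged windows at full resolution, and aggressively downsamples the rest. All function names, thresholds, and data are invented for this example; this is not flight software.

```python
# Toy detection-and-prioritization sketch: flag high-energy windows
# (detection/triage), then keep flagged windows at full resolution and
# downsample the rest (adaptive compression). Hypothetical example code.

def detect(window, threshold):
    # A window is "interesting" if its mean absolute amplitude
    # exceeds the noise threshold.
    energy = sum(abs(x) for x in window) / len(window)
    return energy > threshold

def prioritize(stream, window_size=4, threshold=1.0, decimate=4):
    out = []
    for i in range(0, len(stream), window_size):
        window = stream[i:i + window_size]
        if detect(window, threshold):
            out.extend(window)              # keep signal at full resolution
        else:
            out.extend(window[::decimate])  # keep 1 sample in `decimate`
    return out

quiet = [0.1, -0.2, 0.1, 0.0]
burst = [3.0, -2.5, 4.0, 3.5]
compressed = prioritize(quiet + burst)
print(len(compressed))  # 1 quiet sample + 4 burst samples = 5
```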

Application of Machine Learning Methodologies to Cancer Biomarker Research

Summary:

1. Data Science is a growing area that requires new thinking across the data lifecycle: in software/system architectures, in the application of intelligent algorithms, and in addressing uncertainty.

2. JPL and Caltech have both established centers to respond to this important growing area, and the two institutions are partnering to maximize the value of their combined expertise. JPL is working with NASA, other government agencies, academia, and industry to bring together solutions.


Appendix: original transcript:

Hello and Welcome to the Caltech and JPL virtual summer school on Big Data Analytics.

I'm Richard Doyle.

I'm with the Information and Data Science Program Office at JPL, joined by my colleague Dan Crichton, who's at the Center for Data Science and Technology at JPL.

We're going to set a little context for you as to why JPL is interested. I'll give you a little sense of the big data challenges that we're engaged with and the kinds of solutions we're seeking.

So, there are a few objectives for this joint summer school.

First of all,

we're going to be focusing on what it means to do data-intensive science well, ranging from basic programming approaches to deep computational methods for dealing with massive data, particularly distributed massive data. We're going to be focusing more on computational science than traditional computer science; if that distinction isn't clear to you yet, it will become clear as we unfold this material. And it's going to be a very hands-on laboratory over these several days, with the content delivered to you 100% online as a MOOC.

Okay, so the first question to ask is: what is big data, and what do we do about it?

Everybody talks about Big Data these days; it's kind of a buzzword. For us, the challenge boils down to this: datasets are burgeoning, and they are becoming more and more highly distributed as well as heterogeneous. Traditional methods are falling short of being able to grapple directly with that kind of growth in data.

In particular, the end goal is always to extract understanding from data.

And that's very much the goal within the NASA context.

So, some of the challenges that we're seeing: NASA ultimately is about delivering capability into very remote places, like the far reaches of the solar system, and bringing data back so that we can study the planets and beyond. Just like a lot of other institutions, we're seeing that the size, number, and heterogeneity of these datasets keep growing. And a core challenge that's beginning to rear its ugly head, so to speak, concerns the core tenet of reproducibility in doing science well.

In the era of Big Data, everyone's dealing with vanishingly small slices of the available data. And when you have a publishable result, you want your peers to be able to reproduce that result. That's becoming increasingly challenging in the Big Data era.

So in the JPL and NASA context we're hardly unique; these challenges are being recognized at the national level.

In fact, there was a National Research Council activity looking at Big Data challenges, which in 2013 produced a publication called Frontiers in the Analysis of Massive Data.

And I'm happy to report that my colleague Dan Crichton is a co-author of that report. One of the takeaways of the report is the need to look across the full end-to-end data lifecycle, and that's going to be an important theme in all of our material.

Okay. So, here's a nice image of Curiosity landing on Mars.

And I think you all recall that, a little more than a year ago, we landed Curiosity successfully on Mars, and that was a huge challenge.

We did it through the seven minutes of terror, and part of the challenge in doing that successfully was validating the techniques for safe landing on Mars, which involved processing lots of test data even before launch.

Those challenges continue with typical Mars missions, where, for example, one of our orbiters has a high-resolution instrument called HiRISE.

And we typically only use 1% of the capacity of that instrument, because of the difficulties of moving massive data back from remote locations like Mars.

So this points up the lifecycle view that makes sense to us in the context of JPL's space missions.

The life cycle starts with a spacecraft, a remote platform which could be in the far reaches of the solar system. Getting data back is not only subject to the traditional light-time delays; depending on the specifics of the location, the ability to move data back to Earth can be quite limited. And increasingly we're exploiting environments which are unpredictable, whether that's the surface of Mars or trying to conduct operations near a comet, where the activity can be volatile.

So this all translates into a need to do some forms of data analysis onboard the spacecraft, at least prioritizing data, and sometimes making decisions about how to react to data, even in the onboard context.

So, continuing through the life cycle view: that's what we call the flight system part of the life cycle. The data is communicated to the ground, where various processing is involved to create usable data products that the end users, the scientists, can use.

Even in the ground-based context, the massive data streams are challenging our traditional methods to keep up with the stream of data.

And then finally, when the data arrives in the archives, that's when the challenge really begins.

We're typically seeing data arriving at multiple institutions, the need to reconcile predictive models with these large data sets, and so on.

Now, the point about the data lifecycle is that often, when people talk about Big Data, they're really referring to what happens after the data has arrived in the archives, focusing on the massive datasets and their distributed aspect.

But the point about the full life cycle view is that, if you don't take it, you may be making choices, perhaps inadvertent choices, all the way from the collection point, or through the various processing steps that occur before the data arrives in the archives.

And those choices can be limiting or even compromising to the kind of understanding you'd like to extract from your data ultimately after it's arrived in the archives.

So if you don't take that full lifecycle view, you may have lost part of the game before you've even begun.

So again, that's a point we're going to be coming back to again and again over the next several days. So this is the lifecycle view that applies to space missions.

It turns out there's a very similar view in other arenas where massive data occurs.

For example, in the context of the National Cancer Institute's so-called Early Detection Research Network, there's a pretty direct mapping between the NASA challenges and the challenges here. There are observational systems as the starting point for collecting data; they may not be in remote parts of the solar system, but they're highly distributed.

And then there's the challenge of collecting data in who knows how many different formats, and ultimately getting it into the hands of the end users, in this case the researchers in cancer research.

They are interested in understanding biomarkers, and want not only to share their research results, but to work meaningfully across these highly distributed datasets.

So it's a very similar set of big data challenges, not just in NASA, but in healthcare and many other arenas.

And in fact, we're seeking solutions that can be applied across disciplines, looking for common principles that will work as effectively in a NASA science context as in many other contexts.

Okay, so now, to expand further on this notion of the data life cycle, here's a notional view of data from the point of collection all the way through to when it arrives in the archives.

And when people talk about Big Data challenges, they're usually talking about some kind of a capacity shortfall that occurs in some phase of the data lifecycle.

So let's just pick out one of these for illustration purposes: the second in the chain on the right side, second from the top, the one we call Data Triage.

Sometimes your data collection source is so extreme that you have to make choices, in fact irrevocable choices, about which data you're even going to keep and store.

And obviously, if you make poor choices there, that limits what you'll be able to do later in the life cycle, when you're actually prepared to apply your analytics and extract understanding.

So that's just a simple example, but the point here is that, again, you need to take a full lifecycle view to be effective.
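The triage idea described here, making an irrevocable keep-or-discard decision under a fixed storage budget, can be sketched as a streaming top-k selection: score each observation as it arrives and keep only the k best seen so far. This is a hypothetical illustration with an invented scoring rule, not actual mission software.

```python
# Toy sketch of irrevocable triage under a fixed storage budget: score
# each incoming observation and keep only the top-k seen so far; data
# that is dropped is gone for good. Hypothetical example code.
import heapq

def triage_stream(observations, budget):
    kept = []  # min-heap of (score, observation): weakest item on top
    for obs in observations:
        score = abs(obs)  # stand-in "interestingness" score
        if len(kept) < budget:
            heapq.heappush(kept, (score, obs))
        elif score > kept[0][0]:
            # Evict the weakest kept item; that choice is irrevocable.
            heapq.heapreplace(kept, (score, obs))
    return sorted(obs for _, obs in kept)

stream = [0.2, -3.1, 0.5, 4.4, -0.1, 2.2]
print(triage_stream(stream, budget=2))  # [-3.1, 4.4]
```

If the scoring rule is poorly chosen, interesting observations are discarded before analysis ever sees them, which is exactly the lifecycle risk being described.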

Now, at this point I'm going to hand off to my colleague, Dan Crichton, who's going to give a little more introduction to the JPL activities in data science, which will lead directly into the instructional material to follow.

>> Very good.

Thank you, Rich.

So, as Rich has spoken to, we really see big data as something that is not only affecting how we build data-driven missions for NASA, but also raising the question of how we address this problem and develop the techniques for analyzing massive amounts of data.

So these are data where we need to be able to look at this life cycle, and we're looking at the challenges across the sciences.

We're looking at other challenges in engineering and business.

If you go back to that National Academies report that Rich mentioned, we see these challenges of massive data occurring and affecting several industries.

So we need to look at how we build well-architected data repositories. This is the foundation if you want to be able to actually do data analysis.

We first have to make sure we can actually capture the data, so we can begin to do that analysis.

Several science areas that we're familiar with (astronomy, Earth science, planetary science, and biomedicine) have done a good job of beginning to build repositories.

How do we move beyond that to begin to do the data analysis?

Much of this class is going to look at techniques for enabling that Big Data Analytics.

Part of that is enabling the cyber infrastructure to access and integrate the data, and developing ways to analyze it using various statistical approaches.

So some of my colleagues will be talking about that.

Looking at the analysis and computation across highly distributed repositories: one of the challenges we have in Earth science, for example, is that we collect instrument data in highly distributed repositories that would benefit from being brought together.

But these are different measurements that require statistical methods for data fusion to actually integrate the data.

How do we develop mechanisms for extracting and looking for the features in the data?

Machine learning techniques will help us look for those patterns, pull them out, classify the data, and annotate interesting features of the data, so we can further the analysis and understanding.
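As a minimal illustration of this kind of pattern classification, the sketch below labels feature vectors by their distance to per-class centroids (a nearest-centroid classifier). The toy data and class names are invented for the example; a real pipeline would extract features from instrument or image data and use trained statistical models.

```python
# Minimal nearest-centroid classifier: represent each class by the mean
# of its training feature vectors, then label new data by the closest
# centroid. Toy data invented for illustration only.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(x, centroids):
    # Return the label whose centroid is nearest in Euclidean distance.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

training = {
    "noise":  [[0.1, 0.2], [0.2, 0.1], [0.0, 0.1]],
    "signal": [[5.0, 4.8], [4.9, 5.1], [5.2, 5.0]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}
print(classify([4.7, 5.3], centroids))  # prints "signal"
```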

And then, how do we look for ways to compare our results against things like predictive models? This is really important in areas like climate science, where we have models of climate and want to validate those models against actual measurements we might be making from satellites, airborne missions, and so forth. So that's an important aspect that we really need to address.

And then, once we've actually analyzed this data, how do we visualize it?

That's part of the analysis process, but the ability to actually visualize massive amounts of data is important as well. So our colleagues here in the summer school will be talking about that aspect, and providing ways in which we can do analysis.

One of the things we have done at JPL reflects the belief that open source technologies are important; they're a key part of how we can build the cyber infrastructure.

We have developed something that we call the Object-Oriented Data Technology framework, or OODT. It's part of the Apache Software Foundation, and it provides a way to actually build this data science framework to capture, manage, and analyze data.

And I encourage you to take a look at that, and at others that can provide this kind of foundation. We've actually used it quite a bit in constructing Earth science missions; it's flying in several Earth science missions as their foundation, and we're using it in our planetary and biomedicine work, and in some of our work with other agencies.

And so, really, the idea is how you build the whole framework for developing an analytics capability long term. Once you have that framework, you want to be able to plug in and run the analytics.

So we'll be talking about some of the techniques to do machine learning.

Some of my colleagues, who you'll hear from, will talk about techniques in detection, prioritization of the data, classification, and pattern recognition, as well as ways in which we can look at and extract features from images.

So these are capabilities that hopefully you'll get some background in by the time you finish the class.

As I mentioned a moment ago, one of our big interest areas is how we begin to look at the reconciliation of climate models with actual observations.

So, this is an example where a lot of the data is distributed: the model data itself, the output of models running in computing centers, and the observations. And so we have a real challenge of how to actually do this in any kind of efficient way.

Right now, there are cases where it might take a couple of weeks to download these climate models and begin to do an experiment. As these datasets grow toward the petabyte scale, this challenge is only going to get worse.

So how do we get our arms around doing the analysis in some kind of efficient way?

We need techniques that help us reduce that data.

Particularly at the source, where the data is actually collected and managed, so that we can increase the efficiency of the analysis.
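One simple form of reduction at the source is to compute the summary a scientist actually needs, say a per-timestep spatial mean, at the archive where the data lives, and transfer only the reduced series instead of the full model output. This is a hedged sketch under that assumption; the function names and the toy grid are invented for illustration.

```python
# Sketch of server-side reduction: instead of shipping a full model
# field, compute a per-timestep spatial mean where the data is stored
# and transfer only the reduced time series. Hypothetical example.

def spatial_mean(field):
    # field: a 2-D grid (list of rows) for one time step.
    values = [v for row in field for v in row]
    return sum(values) / len(values)

def reduce_at_source(time_steps):
    # The full dataset stays at the archive; only one float per
    # time step needs to move over the network.
    return [spatial_mean(field) for field in time_steps]

# Two time steps of a tiny 2x2 grid (e.g., surface temperature in K).
data = [
    [[280.0, 282.0], [284.0, 286.0]],
    [[281.0, 283.0], [285.0, 287.0]],
]
print(reduce_at_source(data))  # [283.0, 284.0]
```

For a petabyte-scale model run, shipping the reduced series instead of the raw fields is what turns a multi-week download into a tractable analysis step.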

In addition to the examples I've shown you in climate and other areas, what we also find interesting is that many of these techniques are not unique to Earth science or astronomy; we can apply them to other areas such as cancer research.

And so you see the ability to develop and apply algorithms for the detection of features in images.

So if we're looking at pathology images, or images for cancer research, we can begin to look at trends over time. We can identify features in that data that might give us insight and better understanding; we can annotate those features, and provide the data to discovery agents and software that can help scientists access, extract, and find the information they're looking for.

So the computational techniques that we're talking about are extremely powerful across all these various disciplines.

So in summary we'd like to welcome you to this class.

We hope you enjoy it.

We believe that data science, the whole area of doing what we're calling Big Data Analytics in this class, is a growing area.

We believe it's critical that we have computationally trained scientists who are able to work alongside traditional scientists to really change the paradigm by which we do data analysis.

And that's a goal for this class: to help provide a foundation that begins to train the next generation of computational scientists who can help us really begin to solve these kinds of challenges.

At JPL and Caltech, we've just connected our centers together to really be able to go after these challenges.

So we've got the Center for Data-Driven Discovery at Caltech and the Center for Data Science and Technology at JPL. These are sister centers that we believe will be a really powerful combination; as a joint entity, they'll help us pursue and provide advances in research and capability in the area of computational science.

So we really see the value of bringing these capabilities together.

We at JPL are working with NASA and other agencies, and we really believe that this is our future. So we hope you join the class, and we wish you well.

Posted @ 2014-09-24 10:37 by behappylee