Review Notes | Lecture 1: Introduction to Big Data (Big Data Processing Technology)
1.Evolution of Big Data
Evolution of technology
Here are some numbers that convince you that data is exploding…
The constant creation of data using social media, business applications, telecom and various other domains is leading to the formation of Big Data.
2.What is Big Data
Big Data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database system tools or traditional data processing applications.
The 5V's of Big Data
Volume: huge amount of data
Variety: different formats of data from various sources
Value: extract useful information
Velocity: high speed of accumulation of data
Veracity: inconsistencies and uncertainty in data
Mining is not useful if the data is messy and poor in quality, because messy data is hard to analyze.
V’s associated with Big Data may grow with time…
Types of Big Data
• Structured: Data that can be stored and processed in a fixed format is called structured data. Data stored in a relational database management system (RDBMS) is one example. Structured data is easy to process because it has a fixed schema. Structured Query Language (SQL) is often used to manage this kind of data.
• Semi-structured: Semi-structured data does not follow the formal structure of a data model (e.g. a table definition in a relational DBMS), but it still has organizational properties such as tags and other markers that separate semantic elements, which makes it easier to analyze. XML files and JSON documents are examples of semi-structured data.
• Unstructured: Data that has no known form, cannot be stored directly in RDBMS tables, and cannot be analyzed unless it is transformed into a structured format. Text files and multimedia content such as images, audio, and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured.
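The three types above can be contrasted in a few lines of Python. This is only a minimal sketch using the standard library: the `users` table, the JSON document, and the sample sentence are invented for illustration and do not come from the lecture.

```python
import json
import sqlite3

# Structured: a fixed schema enforced by a relational table (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("Alice", 30), ("Bob", 25)])
structured_names = [row[0] for row in
                    conn.execute("SELECT name FROM users WHERE age > 26")]
conn.close()

# Semi-structured: no table definition, but keys/tags mark the semantic elements.
doc = json.loads('{"name": "Carol", "tags": ["vip"], "address": {"city": "Paris"}}')
semi_city = doc["address"]["city"]

# Unstructured: free text has no schema, so structure must be imposed before
# analysis; here the text is turned into word counts.
text = "big data is big"
word_counts = {}
for word in text.split():
    word_counts[word] = word_counts.get(word, 0) + 1

print(structured_names, semi_city, word_counts)
```

The SQL query relies on the schema being known in advance, the JSON lookup relies only on key names, and the free text has to be converted into counts before any question can be asked of it.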
3.Big Data Challenges & Opportunities
Challenges with Big Data
• Storage: The more data an organization has, the more complex the problems of managing it become. The question that arises is "Where do we store it?" We need a storage system that can easily scale up or down on demand.
• Data quality: This challenge corresponds to Veracity, listed above as one of the 5 V's. The data is often messy, inconsistent, and incomplete. Dirty data is said to cost companies in the United States $600 billion every year.
• Discovery: Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data to find patterns and insights is very difficult, even with extremely powerful algorithms.
• Analytics: With Big Data, most of the time we do not know in advance what kind of data we are dealing with, so analyzing that data (with its complex structure) is even more difficult.
• Efficiency: The data volume is growing rapidly, and queries often have tight response time limits.
• Security: Since the data is huge in size, keeping it secure is another challenge. This includes user authentication, restricting user access, recording data access histories, proper use of data encryption, etc.
• Lack of talent: There are many Big Data projects in major organizations, but assembling a sophisticated team of developers, data scientists, and analysts who also have sufficient domain knowledge is still a challenge.
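To make the data quality challenge concrete, the following sketch scans a handful of records for the three problems named above (messy, inconsistent, incomplete). The records, the field names, and the validity rule for `age` are all hypothetical; real pipelines would use rules specific to their data.

```python
# Hypothetical raw records; None and a negative age stand in for "dirty" values.
records = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": None, "age": 25},      # incomplete: missing name
    {"id": 3, "name": "Carol", "age": -4},   # inconsistent: impossible age
    {"id": 1, "name": "Alice", "age": 30},   # duplicate id
]

def find_quality_issues(rows):
    """Return a list of (record id, issue description) pairs."""
    issues = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            issues.append((row["id"], "duplicate id"))
        seen_ids.add(row["id"])
        if row["name"] is None:
            issues.append((row["id"], "missing name"))
        if row["age"] is None or not 0 <= row["age"] <= 130:
            issues.append((row["id"], "invalid age"))
    return issues

issues = find_quality_issues(records)
print(issues)
```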
Big Data as an Opportunity
Big Data analytics examines large and varied types of data to uncover hidden patterns, correlations, and other insights. Companies use Big Data analytics largely to support their growth and development. It mainly involves applying various data mining algorithms to a given data set, which then aids better decision making.
Descriptive Analytics: It uses data aggregation and data mining to provide insight into the past and answer: "What has happened?" Descriptive analytics does exactly what the name implies: it "describes" or summarizes raw data and makes it interpretable by humans.
Predictive Analytics: It uses statistical models and forecasting techniques to understand the future and answer: "What could happen?" Predictive analytics provides companies with actionable insights based on data, in the form of estimates of the likelihood of a future outcome.
Prescriptive Analytics: It uses optimization and simulation algorithms to advise on possible outcomes and answer: "What should we do?" It allows users to "prescribe" a number of different possible actions and guides them towards a solution. In a nutshell, this kind of analytics is all about providing advice.
Diagnostic Analytics: It is used to determine why something happened in the past. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations. Diagnostic analytics takes a deeper look at data to understand the root causes of events.
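The difference between the first two kinds of analytics can be shown on a toy series. This sketch assumes hypothetical monthly sales figures and uses an ordinary least-squares line as the predictive model; the lecture does not prescribe any particular model, so treat the choice as an illustration only.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold), months numbered 1..5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive analytics: "What has happened?" -> summarize the raw data.
summary = {"total": sum(sales), "average": mean(sales), "max": max(sales)}

# Predictive analytics: "What could happen?" -> fit a least-squares line
# sales = slope * month + intercept and extrapolate one month ahead.
mx, my = mean(months), mean(sales)
slope = (sum((x - mx) * (y - my) for x, y in zip(months, sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx
forecast_month_6 = slope * 6 + intercept

print(summary, forecast_month_6)
```

The summary only restates the past, while the forecast is an estimate that depends entirely on the assumed linear trend.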
4.Big Data Case Study
• Making Smarter and More Efficient Organizations
The New York Police Department (NYPD) uses Big Data and analytics to detect and identify crimes before they occur. They analyze historical arrest patterns and then map them against events such as federal holidays, paydays, and traffic flows. This Big Data and analytics strategy helps them identify likely crime locations, so that they can deploy officers to those locations and prevent crime from occurring.
• Optimize Business Operations by Analyzing Customer Behavior
Amazon uses customer click-stream data and historical purchase data to provide customers with customized results on customized web pages. Analyzing the clicks of every visitor on the website helps Amazon understand site-navigation behavior, the paths users took to buy a product, the paths that led them to leave the site, and more. All this information helps Amazon improve the user experience, thereby improving sales and marketing.
• Cost Reduction
Patients nowadays use new sensor devices at home or outside, which send constant streams of data that can be monitored and analyzed in real time to help patients avoid hospitalization by self-managing their conditions. For hospitalized patients, physicians can use predictive analytics to optimize outcomes and reduce readmissions. Parkland Hospital uses analytics and predictive modelling to identify high-risk patients and predict likely outcomes once patients are sent home. As a result, Parkland reduced 30-day readmissions for patients with heart failure by 31%, saving $500,000 annually.
5.Big Data Technologies
A Brief History of Databases
Relational DBMS
DBMS - Database Management System: a software package of computer programs that controls the creation, maintenance, and use of a database
• A Database Management System (DBMS) provides efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data.
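• Safe and reliable: critical applications; hardware/software failures, power outages, malicious users; 99.99999% uptime guarantee
• Multi-user: concurrency control; ensures data stays consistent and is never left half-overwritten
• Convenient: physical data independence; high-level, declarative query languages
• Efficient: performance, performance, performance; thousands of queries/updates per second; indexing, query optimization, concurrent execution of user programs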
• Benefits of Relational databases:
Mathematical background (Relational Algebra)
Transaction management (ACID)
Query optimization
Strong consistency, concurrency, recovery
Standard Query language (SQL)
Many tools to use with them, e.g. reporting services
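The transaction management (ACID) benefit listed above can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, the `transfer` helper, and the balances are hypothetical, and SQLite stands in here for any relational DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)        # valid transfer
failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice, so rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(ok, failed, balances)
```

In the second call the credit to `bob` succeeds first, but the debit violates the `CHECK` constraint, so the whole transaction is rolled back and `bob` keeps only the 30 from the first transfer. That all-or-nothing behavior is the atomicity part of ACID.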
• Batch processing vs. Stream processing
Under the batch processing model, a set of data is collected over time and then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing. Under the streaming model, data is fed into analytics tools piece by piece, and the processing is usually done in real time.
• Batch processing
Batch processing is most often used when dealing with very large amounts of data, and/or when data sources are legacy systems that are not capable of delivering data in streams.
Batch processing works well in situations where you don't need real-time analytics results, and when it is more important to process large volumes of information than to get fast analytics results.
• Stream processing
Stream processing is key if you want analytics results in real time. By building data streams, you can feed data into analytics tools as soon as it is generated and get near-instant analytics results.
Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
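The two models can be contrasted in plain Python. The sketch below is not tied to any particular streaming engine: the transaction tuples, the `factor` threshold, and the `warmup` parameter are assumptions made up for illustration, and the anomaly rule is deliberately naive.

```python
def batch_total(transactions):
    """Batch model: the whole collected data set is processed in one pass."""
    return sum(amount for _, amount in transactions)

def stream_flag_anomalies(transaction_stream, factor=3.0, warmup=3):
    """Stream model: handle records one by one, keeping only running state.

    A transaction is flagged when it exceeds `factor` times the running mean
    of previously seen amounts (a deliberately naive, hypothetical rule).
    """
    count, total = 0, 0.0
    for tx_id, amount in transaction_stream:
        if count >= warmup and amount > factor * (total / count):
            yield tx_id  # flagged immediately, before later records arrive
        count += 1
        total += amount

transactions = [("t1", 20.0), ("t2", 25.0), ("t3", 15.0), ("t4", 400.0), ("t5", 22.0)]
total = batch_total(transactions)
flagged = list(stream_flag_anomalies(iter(transactions)))
print(total, flagged)
```

The batch function needs the complete list before it can answer, whereas the generator reacts to each record as it arrives and stores only a count and a running sum, which is what makes stopping a suspicious transaction in time possible.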
• Graph processing
A graph processing framework (GPF) is a set of tools designed for processing graphs. Vertices model data items, and edges model the relationships between vertices. Examples of graphs: social networks, knowledge graphs, road networks.
Graph processing systems: Pregel, GraphX, Giraph
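The vertex/edge model can be sketched with an adjacency list and a breadth-first search. This is plain single-machine Python, not the API of Pregel, GraphX, or Giraph; the small friendship graph and the `hops_from` helper are hypothetical.

```python
from collections import deque

# Hypothetical social network: vertices are people, edges are friendships.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

def hops_from(graph, source):
    """Breadth-first search: number of edges on a shortest path to each vertex."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph[vertex]:
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance

distances = hops_from(graph, "ann")
print(distances)
```

Distributed graph systems exist because real social or road graphs are far too large for one machine's memory, but the underlying questions (reachability, shortest paths, neighborhoods) are the same as in this toy case.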
6.Trends in Big Data
Cloud Computing
Cloud computing is the on-demand availability of computer system resources, especially data storage (cloud storage) and computing power, without direct active management by the user. It is commonly described as the delivery of computing services, including servers, storage, databases, and intelligence, over the Internet.
Internet of Things (IoT)
IoT describes the network of physical objects ("things") that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the Internet.