Spark official documentation in translation: pyspark.sql.SQLContext

Posted on 2016-08-09 14:58 by 来碗酸梅汤

class pyspark.sql.SQLContext(sparkContext, sparkSession=None, jsqlContext=None)
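　　The entry point for working with structured data (rows and columns) in Spark 1.x.
　　As of Spark 2.0, it is replaced by SparkSession; the class is kept for backward compatibility.
　　A SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
　　Parameters: sparkContext – The SparkContext backing this SQLContext.
　　　　　　　　sparkSession – The SparkSession around which this SQLContext wraps.
　　　　　　　　jsqlContext – An optional JVM Scala SQLContext. If set, no new SQLContext is instantiated in the JVM; all calls are made to this object instead.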

　　cacheTable(tableName)

　　　　Caches the specified table in memory.

　　　　New in version 1.0.

　　clearCache()

　　　　Removes all cached tables from the in-memory cache.

　　　　New in version 1.3.

　　createDataFrame(data, schema=None, samplingRatio=None)

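　　　　Creates a DataFrame from an RDD, a list, or a pandas.DataFrame.

　　　　When schema is a list of column names, the type of each column is inferred from data.

　　　　When schema is None, it tries to infer the schema (column names and types) from data, which should be an RDD of Row, namedtuple, or dict.

　　　　When schema is a DataType or a datatype string, it must match the real data, or an exception is thrown at runtime. If the given schema is not a StructType, it is wrapped into a StructType as its only field, and the field name is "value"; each record is also wrapped into a tuple, which can later be converted to a Row.

　　　　If schema inference is needed, samplingRatio determines the ratio of rows used for inference. If samplingRatio is None, the first row is used.

　　　　Parameters: data – an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame.

　　　　　　　　　　 schema – a DataType, a datatype string, or a list of column names; default is None. The datatype string format is the same as DataType.simpleString, except that the top-level struct type can omit the struct<> and atomic types use typeName() as their format, e.g. use byte instead of tinyint for ByteType. int can also be used as a short name for IntegerType.

　　　　　　　　　　 samplingRatio – the ratio of rows sampled for schema inference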

　　　　Returns: DataFrame

　　　　Changed in version 2.0: The schema parameter can be a DataType or a datatype string after 2.0. If it’s not a StructType, it will be wrapped into a StructType and each record will also be wrapped into a tuple.

>>> a = [('Alice', 1)]
>>> spark.createDataFrame(a).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(a, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name=u'Alice')]

>>> rdd = sc.parallelize(a)
>>> spark.createDataFrame(rdd).collect()
[Row(_1=u'Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1)]

>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name=u'Alice', age=1)]

>>> from pyspark.sql.types import *
>>> schema = StructType([
...    StructField("name", StringType(), True),
...    StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name=u'Alice', age=1)]

>>> spark.createDataFrame(df.toPandas()).collect()  
[Row(name=u'Alice', age=1)]
>>> import pandas
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]

>>> spark.createDataFrame(rdd, "a: string, b: int").collect()
[Row(a=u'Alice', b=1)]
>>> rdd = rdd.map(lambda row: row[1])
>>> spark.createDataFrame(rdd, "int").collect()
[Row(value=1)]
>>> spark.createDataFrame(rdd, "boolean").collect() 
Traceback (most recent call last):
    ...
Py4JJavaError: ...

　　　　New in version 1.3.
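
　　　　The wrapping rule for a schema that is not a StructType can be pictured in plain Python, without Spark. The sketch below is only an illustration: wrap_atomic_records and ValueRow are hypothetical names, not part of pyspark, and they mimic only the shape of the result that the "int" schema example above produces.

```python
from collections import namedtuple

# Hypothetical stand-in for a one-field row; this is NOT pyspark.sql.Row.
ValueRow = namedtuple("ValueRow", ["value"])

def wrap_atomic_records(records):
    """Wrap each bare record into a 1-tuple whose only field is named 'value'."""
    return [ValueRow(record) for record in records]

rows = wrap_atomic_records([1, 2])
print(rows)           # [ValueRow(value=1), ValueRow(value=2)]
print(rows[0].value)  # 1
```

　　　　Each bare value ends up as a single-field record, which is why the result column in the "int" example is called value.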

　　createExternalTable(tableName, path=None, source=None, schema=None, **options)

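　　　　Creates an external table based on the dataset in a data source. It returns the DataFrame associated with the external table.

　　　　The data source is specified by source and a set of options. If source is not specified, the default data source configured by spark.sql.sources.default is used.

　　　　Optionally, a schema can be provided as the schema of the returned DataFrame and of the created external table.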

　　　　Returns: DataFrame
　　　　New in version 1.3.

　　dropTempTable(tableName)

　　　　Removes the temporary table from the catalog.

>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> sqlContext.dropTempTable("table1")

　　　　New in version 1.6.

　　getConf(key, defaultValue=None)

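　　　　Returns the value of the Spark SQL configuration property for the given key.

　　　　If the key is not set and defaultValue is not None, returns defaultValue. If the key is not set and defaultValue is None, returns the system default value.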

>>> sqlContext.getConf("spark.sql.shuffle.partitions")
u'200'
>>> sqlContext.getConf("spark.sql.shuffle.partitions", u"10")
u'10'
>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"50")
>>> sqlContext.getConf("spark.sql.shuffle.partitions", u"10")
u'50'

　　　　New in version 1.3.
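
　　　　The fallback order of getConf can be sketched with plain dictionaries. This is a simplified model, not Spark's implementation: get_conf and SYSTEM_DEFAULTS are hypothetical names, and the sketch simply returns None for a key that has no system default, a case the text above does not describe.

```python
# Hypothetical system defaults; the real values live inside Spark.
SYSTEM_DEFAULTS = {"spark.sql.shuffle.partitions": "200"}

def get_conf(explicit_settings, key, default_value=None):
    """Mimic the documented fallback order of getConf with plain dicts."""
    if key in explicit_settings:        # 1. a value set explicitly wins
        return explicit_settings[key]
    if default_value is not None:       # 2. otherwise the caller's default
        return default_value
    return SYSTEM_DEFAULTS.get(key)     # 3. otherwise the system default

settings = {}
print(get_conf(settings, "spark.sql.shuffle.partitions"))        # 200
print(get_conf(settings, "spark.sql.shuffle.partitions", "10"))  # 10
settings["spark.sql.shuffle.partitions"] = "50"                  # plays the role of setConf
print(get_conf(settings, "spark.sql.shuffle.partitions", "10"))  # 50
```

　　　　The three calls follow the same sequence as the sqlContext example above: system default, caller's default, then the explicitly set value.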

　　classmethod getOrCreate(sc)

　　　　Gets the existing SQLContext, or creates a new one with the given SparkContext.

　　　　Parameters: sc – SparkContext
　　　　New in version 1.6.

　　newSession()

　　　　Returns a new SQLContext as a new session, which has its own SQLConf, registered temporary views, and UDFs, but shares the SparkContext and the table cache.

　　　　New in version 1.6.

　　range(start, end=None, step=1, numPartitions=None)

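　　　　Creates a DataFrame with a single LongType column named id, containing the elements in a range from start to end (exclusive) with step value step.

　　　　Parameters: start – the start value
　　　　　　　　　　end – the end value (exclusive)
　　　　　　　　　　step – the incremental step (default: 1)
　　　　　　　　　　numPartitions – the number of partitions of the DataFrame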
　　　　Returns: DataFrame

>>> sqlContext.range(1, 7, 2).collect()
[Row(id=1), Row(id=3), Row(id=5)]

　　　　If only one argument is specified, it is used as the end value.

>>> sqlContext.range(3).collect()
[Row(id=0), Row(id=1), Row(id=2)]

　　　　New in version 1.4.
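
　　　　The argument handling matches that of Python's built-in range, so the id values can be previewed without a Spark session. The helper below is a hypothetical illustration (range_ids is not a pyspark function) and returns a plain list rather than a DataFrame.

```python
def range_ids(start, end=None, step=1):
    """Return the id values that the documented range semantics would produce."""
    if end is None:
        # A single argument is treated as the end value; start defaults to 0.
        start, end = 0, start
    return list(range(start, end, step))  # end is exclusive

print(range_ids(1, 7, 2))  # [1, 3, 5]
print(range_ids(3))        # [0, 1, 2]
```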

　　read

　　　　Returns a DataFrameReader that can be used to read data in as a DataFrame.

　　　　Returns: DataFrameReader
　　　　New in version 1.4.

　　readStream

　　　　Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.

　　　　Note Experimental.
　　　　Returns: DataStreamReader

>>> import tempfile
>>> text_sdf = sqlContext.readStream.text(tempfile.mkdtemp())
>>> text_sdf.isStreaming
True

　　　　New in version 2.0.

　　registerDataFrameAsTable(df, tableName)

　　　　Registers the given DataFrame (df) as a temporary table in the catalog.

　　　　Temporary tables exist only during the lifetime of this SQLContext instance.

>>> sqlContext.registerDataFrameAsTable(df, "table1")

　　　　New in version 1.3.

　　registerFunction(name, f, returnType=StringType)

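　　　　Registers a Python function (including a lambda function) as a UDF so that it can be used in SQL statements.

　　　　In addition to a name and the function itself, the return type can optionally be specified. When the return type is not given, it defaults to string and the conversion is done automatically. For any other return type, the produced object must match the specified type.

　　　　Parameters: name – name of the UDF
　　　　　　　　　　　　f – Python function
　　　　　　　　　　　　returnType – a DataType object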

>>> sqlContext.registerFunction("stringLengthString", lambda x: len(x))
>>> sqlContext.sql("SELECT stringLengthString('test')").collect()
[Row(stringLengthString(test)=u'4')]

>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("stringLengthInt", lambda x: len(x), IntegerType())
>>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]

>>> from pyspark.sql.types import IntegerType
>>> sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
>>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]

　　　　New in version 1.2.

　　setConf(key, value)

　　　　Sets the given Spark SQL configuration property.

　　　　New in version 1.3.

　　sql(sqlQuery)

　　　　Returns a DataFrame representing the result of the given SQL query.

　　　　Returns: DataFrame

>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> df2 = sqlContext.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> df2.collect()
[Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]

　　　　New in version 1.0.

　　streams

　　　　Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

　　　　Note　　Experimental.

　　　　New in version 2.0.

　　table(tableName)

　　　　Returns the specified table as a DataFrame.

　　　　Returns: DataFrame

>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> df2 = sqlContext.table("table1")
>>> sorted(df.collect()) == sorted(df2.collect())
True

　　　　New in version 1.0.

　　tableNames(dbName=None)

　　　　Returns a list of the names of the tables in the database dbName.

　　　　Parameters: dbName – string, name of the database to use. Defaults to the current database.

　　　　Returns: list of table names, in string

>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> "table1" in sqlContext.tableNames()
True
>>> "table1" in sqlContext.tableNames("default")
True

　　　　New in version 1.3.

　　tables(dbName=None)

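　　　　Returns a DataFrame containing the names of the tables in the given database.

　　　　If dbName is not specified, the current database is used.

　　　　The returned DataFrame has two columns: tableName and isTemporary (a BooleanType column indicating whether the table is temporary).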

　　　　Parameters: dbName – string, name of the database to use.

　　　　Returns: DataFrame

>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> df2 = sqlContext.tables()
>>> df2.filter("tableName = 'table1'").first()
Row(tableName=u'table1', isTemporary=True)

　　　　New in version 1.3.

　　udf

　　　　Returns a UDFRegistration for UDF registration.

　　　　Returns: UDFRegistration

　　　　New in version 1.3.1.

　　uncacheTable(tableName)

　　　　Removes the specified table from the in-memory cache.

　　　　New in version 1.0.

class pyspark.sql.HiveContext(sparkContext, jhiveContext=None)

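　　　　A variant of Spark SQL that integrates with data stored in Hive.

　　　　Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands.

　　　　Parameters: sparkContext – The SparkContext to wrap.

　　　　jhiveContext – An optional JVM Scala HiveContext. If set, no new HiveContext is instantiated in the JVM; all calls are made to this object instead.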

　　　　Note Deprecated in 2.0.0. Use SparkSession.builder.enableHiveSupport().getOrCreate().

　　refreshTable(tableName)

　　　　Invalidates and refreshes all the cached metadata of the given table.

　　　　For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks.

　　　　When those change outside of Spark SQL, users should call this function to invalidate the cache.
