1-sparkSQL

Mongo Spark Connector Spark SQL

The following code snippets can be found in SparkSQL.scala.

Prerequisites

Have MongoDB up and running and Spark 2.2.x downloaded. This tutorial will use the Spark Shell allowing for instant feedback.
See the introduction for more information.

Insert some sample data into an empty database:

import org.bson.Document
import com.mongodb.spark._

val docs = """
 |{"name": "Bilbo Baggins", "age": 50}
 |{"name": "Gandalf", "age": 1000}
 |{"name": "Thorin", "age": 195}
 |{"name": "Balin", "age": 178}
 |{"name": "Kíli", "age": 77}
 |{"name": "Dwalin", "age": 169}
 |{"name": "Óin", "age": 167}
 |{"name": "Glóin", "age": 158}
 |{"name": "Fíli", "age": 82}
 |{"name": "Bombur"}""".trim.stripMargin.split("[\\r\\n]+").toSeq
sc.parallelize(docs.map(Document.parse)).saveToMongoDB()

Spark SQL

The entry point to Spark SQL is the SparkSession class or one of its descendants. To create a basic SparkSession all you need is the
SparkSession.builder().

import org.apache.spark.sql.SparkSession

val sc: SparkContext // An existing SparkContext.
val sparkSession = SparkSession.builder().getOrCreate()

First enable the Mongo Connector specific functions on the SparkSession:

import com.mongodb.spark.sql._

Datasets

The Mongo Spark Connector provides the com.mongodb.spark.sql.DefaultSource class that creates Datasets from MongoDB.
However, the easiest way to create a DataFrame is by using the MongoSpark helper:

val df = MongoSpark.load(sparkSession)  // Uses the SparkConf
df.printSchema()

Will return:

root
 |-- _id: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Note: In Spark 2.0 the DataFrame class became a type alias to DataSet[Row].


MongoSpark.load(sparkSession) is shorthand for configuring and loading via the DataFrameReader. The following examples are alternative
methods for creating DataFrames:

sparkSession.loadFromMongoDB() // Uses the SparkConf for configuration
sparkSession.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://example.com/database.collection"))) // Uses the ReadConfig

sparkSession.read.mongo()
sparkSession.read.format("mongo").load()

// Set custom options:
sparkSession.read.mongo(customReadConfig)
sparkSession.read.format("mongo").options.(customReadConfig.asOptions).load()

In the following example we can filter and output the characters with ages under 100:

df.filter(df("age") < 100).show()
+--------------------+---+-------------+
|                 _id|age|         name|
+--------------------+---+-------------+
|56eaa0d7ba58b9043...| 50|Bilbo Baggins|
|56eaa0d7ba58b9043...| 77|         Kíli|
|56eaa0d7ba58b9043...| 82|         Fíli|
+--------------------+---+-------------+

Note: Unlike RDD's when using filters with Datasets or SparkSQL the underlying Mongo Connector code will construct an aggregation
pipeline to filter the data in MongoDB before sending it to Spark.


Schema inference and explicitly declaring a schema

By default reading from MongoDB in a SparkSession infers the schema by sampling documents from the database.
If you know the shape of your documents then you can use a simple case class to define the schema instead, thus preventing the extra queries.


Note: When providing a case class for the schema only the declared fields will be returned by MongoDB, helping minimize the data sent across the wire.


The following example creates Character case class and then uses it to represent the documents / schema for the collection:

case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)()
explicitDF.printSchema()

Outputs:

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

The following example converts the DataFrame into a Dataset. Or more correctly converts Dataset[Row] to Dataset[Character]:

explicitDF.as[Character]

Conflicting Types

When inferring the schema, if a collection contains a field with two incompatible data types then a StringType will be used. To work
around conflicting datatypes you can filter out the invalid datatypes by explicitly providing your own pipeline filtering them out. Note
that filters passed down to MongoDB on fields with Conflicts will not work as expected and the filtering must happen inside Spark. The
special functional helpers below can be used to ensure correct filtering.

RDD to DataFrame / Datasets

The MongoRDD class provides helpers to create DataFrames and Datasets directly:

val dataframeInferred = MongoSpark.load[Character](sparkSession)
val dataframeExplicit = MongoSpark.load[Character](sparkSession)
val dataset = MongoSpark.load[Character](sparkSession).as[Character]()

SQL queries

Spark SQL works on top of Datasets, to be able to use SQL you need to register a temporary table first and then you can run SQL queries
over the data.

The following example creates the "characters" temporary table and then queries it to find all characters that are 100 or older.

val characters = MongoSpark.load(sparkSession).toDF[Character]()
characters.createOrReplaceTempView("characters")

val centenarians = sparkSession.sql("SELECT name, age FROM characters WHERE age >= 100")
centenarians.show()

Note: You must use the same SparkSession that registers the table when querying it.


Saving Datasets

The connector provides the ability to persist data into MongoDB.

In the following example we save the centenarians into the "hundredClub" collection:

MongoSpark.save(centenarians.write.option("collection", "hundredClub"))
println("Reading from the 'hundredClub' collection:")
MongoSpark.load[Character](sparkSession, ReadConfig(Map("collection" -> "hundredClub"), Some(ReadConfig(sparkSession)))).show()

Outputs:

+-------+----+
|   name| age|
+-------+----+
|Gandalf|1000|
| Thorin| 195|
|  Balin| 178|
| Dwalin| 169|
|    Óin| 167|
|  Glóin| 158|
+-------+----+

MongoSpark.save(dataFrameWriter) is shorthand for configuring and saving via the DataFrameWriter. The following examples are alternative
methods for writing DataFrames to MongoDB:

dataFrameWriter.write.mongo()
dataFrameWriter.write.format("mongo").save()

Note: Saving Datasets to MongoDB that contain an _id field will replace and upsert any existing documents.

The WriteConfig.replaceDocument configuration setting allows you to configure if you want to replace the whole document or just the
fields in the Dataset.


DataTypes

Spark supports a limited number of data types, to ensure that all bson types can be round tripped in and out of Spark Datasets.
Custom StructTypes are created for any unsupported Bson Types. The following table shows the mapping between the Bson Types and
Spark Types:

Bson Type Spark Type
Document StructType
Array ArrayType
32-bit integer Integer
64-bit integer Long
Decimal128 Decimal
Binary data Array[Byte] or StructType: { subType: Byte, data: Array[Byte]}
Boolean Boolean
Date java.sql.Timestamp
DBPointer StructType: { ref: String , oid: String}
Double Double
JavaScript StructType: { code: String }
JavaScript with scope StructType: { code: String , scope: String }
Max key StructType: { maxKey: Integer }
Min key StructType: { minKey: Integer }
Null null
ObjectId StructType: { oid: String }
Regular Expression StructType: { regex: String , options: String }
String String
Symbol StructType: { symbol: String }
Timestamp StructType: { time: Integer , inc: Integer }
Undefined StructType: { undefined: Boolean }

Dataset support

To help better support Datasets, the following Scala case classes and JavaBean classes have been created to represent the unsupported Bson
Types:

Bson Type Scala case class JavaBean
  com.mongodb.spark.sql.fieldTypes com.mongodb.spark.sql.fieldTypes.api.java.
Binary data Binary Binary
DBPointer DBPointer DBPointer
JavaScript JavaScript JavaScript
JavaScript with scope JavaScriptWithScope JavaScriptWithScope
Max key MaxKey MaxKey
Min key MinKey MinKey
ObjectId ObjectId ObjectId
Regular Expression RegularExpression RegularExpression
Symbol Symbol Symbol
Timestamp Timestamp Timestamp
Undefined Undefined Undefined

For convenience all Bson Types can be represented as a String value as well, however these values lose all their type information and if
saved back to MongoDB they would be stored as a String.

Defining and filtering unsupported bson data types

As not all Bson types have equivalent Spark types querying them outside of using Datasets can be verbose and requires in depth knowledge of
the StructType that can be used to represent those types.

To help users with these types there are StructField helpers that can be used when defining the schema for a DataFrame. There are also
custom helpers that can be used as User Defined Functions
to aid the queryability of the data. To enable all helper functions set the registerSQLHelperFunctions configuration option to true,
alternatively you can manually register the required methods.

Bson Type StructField helpers User Defined Function helpers
  com.mongodb.spark.sql.StructFields com.mongodb.spark.sql.helpers.UDF
Binary data binary binary / binaryWithSubType
DBPointer dbPointer dbPointer
JavaScript javascript javascript
JavaScript with scope javascriptWithScope javascriptWithScope
Max key maxKey maxKey
Min key minKey minKey
ObjectId objectId objectId
Regular Expression regularExpression regularExpression / regularExpressionWithOptions
Symbol symbol symbol
Timestamp timestamp timestamp
Undefined undefined undefined

Note: Performance warning: UDF functions are high level functions that cannot be passed to MongoDB and converted into aggregation queries.


Below is an example of using the helpers when defining and querying an ObjectId field:

// Load sample data
import org.bson.Document
import org.bson.types.ObjectId
import com.mongodb.spark._
import com.mongodb.spark.sql._

val objectId = "123400000000000000000000"
val newDocs = Seq(new Document("_id", new ObjectId(objectId)).append("a", 1), new Document("_id", new ObjectId()).append("a", 2))
MongoSpark.save(sc.parallelize(newDocs))

// Set the schema using the ObjectId StructFields helper
import org.apache.spark.sql.types.DataTypes
import com.mongodb.spark.sql.helpers.StructFields

val schema = DataTypes.createStructType(Array(
  StructFields.objectId("_id", nullable = false),
  DataTypes.createStructField("a", DataTypes.IntegerType, false))
)

// Create a dataframe with the helper functions registered
val df1 = MongoSpark.read(sparkSession).schema(schema).option("registerSQLHelperFunctions", "true").load()

// Query using the ObjectId string
df1.filter(s"_id = ObjectId('$objectId')").show()

Outputs:

+--------------------+---+
|                 _id|  a|
+--------------------+---+
|[1234000000000000...|  1|
+--------------------+---+

Next - Configuring

posted @ 2022-09-13 14:53  三里清风18  阅读(53)  评论(0编辑  收藏  举报