Spark SQL JSON Datasets

Updated 2018-11-26 16:34


Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. This conversion can be done with either of two methods:

  • jsonFile : loads data from a directory of JSON files, where each line of each file is a JSON object
  • jsonRDD : loads data from an existing RDD, where each element of the RDD is a string containing a JSON object

Note that the file offered as jsonFile is not a typical JSON file. Each line must be self-contained and hold a complete, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
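For example, an input file acceptable to jsonFile might look like the following, with one complete JSON object per line (these sample records are an illustration in the spirit of the people.json file that ships with Spark's examples, not its verbatim contents):

```json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```

A pretty-printed document, in which a single JSON object spans several lines, would generally fail to parse line by line.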

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
// Create a SchemaRDD from the file(s) pointed to by path
val people = sqlContext.jsonFile(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
//  |-- age: integer (nullable = true)
//  |-- name: string (nullable = true)

// Register this SchemaRDD as a table.
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a SchemaRDD can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
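As a sketch of how the results above could be inspected (this continues the session from the listing and assumes the same sqlContext; the exact printSchema output reflects Spark's alphabetical ordering of inferred JSON fields):

```scala
// Collect and print the names returned by the SQL query above.
// Rows are accessed positionally; t(0) is the "name" column.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// Inspect the nested schema inferred from the RDD[String].
anotherPeople.printSchema()
// root
//  |-- address: struct (nullable = true)
//  |    |-- city: string (nullable = true)
//  |    |-- state: string (nullable = true)
//  |-- name: string (nullable = true)
```

Nested fields such as address.city can then be queried with ordinary dotted names after registering anotherPeople as a temporary table.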