syntax.us Let the syntax do the talking
Blog Contact Posts Questions Tags Hire Me

Question:
In Spark, how do I query my data with SQL?

I first encountered this question while studying the Spark documentation:

https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

The above doc suggests I should do the following:
  • Create a SQLContext
  • Use a case class to define a schema
  • Create a SchemaRDD from some data constrained by the schema
  • Use the SchemaRDD to register a TempTable
  • Use SQL to SELECT rows from TempTable
  • Collect the rows in a second SchemaRDD
  • Visualize the data in the second SchemaRDD
Steps two and three can be tedious, depending on the complexity of the schema and the relationship(s) between the schema and the underlying data.

The last four steps seem straightforward.
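To make the tedium concrete, here is a minimal sketch of steps two and three using plain Scala collections in place of an RDD. The field names and sample rows are made up for illustration; a real schema would have many more columns, each needing its own conversion.

```scala
// Hedged sketch: a case class plays the role of the schema (step two),
// and each raw CSV line is split and converted by hand (step three).
// Field names and sample data are hypothetical.
case class Trip(days: Int, startStationId: Int, bikes: Int)

val csvLines = Seq("1,3000,5", "2,3001,7")

// Every column must be indexed and converted explicitly --
// this is the part that grows tedious as the schema widens.
val trips = csvLines.map { line =>
  val cols = line.split(",")
  Trip(cols(0).toInt, cols(1).toInt, cols(2).toInt)
}

println(trips.head)  // Trip(1,3000,5)
```

With a wide schema, every added column means another `cols(n).toX` conversion to write and keep in sync with the data.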

H2O offers an alternate set of steps:
  • Create a SQLContext
  • Create a DataFrame
  • Create a SchemaRDD from DataFrame
  • Use the SchemaRDD to register a TempTable
  • Use SQL to SELECT rows from TempTable
  • Collect the rows in a second SchemaRDD
  • Visualize the data in the second SchemaRDD
I like the "H2O way" because in step two, I can easily create a DataFrame from a CSV.

It frees me from coding up a schema. Instead, the data describes itself.

Then at step three I only need to write one short line of syntax.

It feels much less complicated than the method described in the Spark documentation.

Here is some syntax:
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.examples.h2o.DemoUtils._
import org.apache.spark.sql.{SQLContext, SchemaRDD}
import water.fvec._

implicit val sqlContext = new SQLContext(sc)
import sqlContext._

// Create H2O context and start H2O services around the Spark cluster
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._

//
// Load and parse bike data (year 2013) into H2O by using H2O parser
//

val dataFiles = Array[String](
    "2013-07.csv", "2013-08.csv", "2013-09.csv", "2013-10.csv",
    "2013-11.csv", "2013-12.csv").map(f => new java.io.File(DIR_PREFIX, f))

// Load and parse data
val bikesDF = new DataFrame(dataFiles:_*)

//
// Transform DataFrame into SchemaRDD
//
val bikesRdd = asSchemaRDD(bikesDF)

// Register the SchemaRDD as a SQL table
sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")

//
// Do grouping with help of Spark SQL
//
val bikesPerDayRdd = sql(
  """SELECT Days, start_station_id, count(*) bikes
    |FROM bikesRdd
    |GROUP BY Days, start_station_id """.stripMargin)
The above syntax comes from the script at the URL below:

https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/Meetup20150226.script.scala
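The GROUP BY in that script counts rows per (Days, start_station_id) pair. As a hedged sketch, the same aggregation can be expressed with plain Scala collections; the sample pairs below are made up, and in the real script the work is distributed by Spark SQL rather than done in local memory.

```scala
// Hedged sketch: what "SELECT Days, start_station_id, count(*) ...
// GROUP BY Days, start_station_id" computes, on hypothetical
// (days, startStationId) sample pairs.
val rows = Seq((1, 3000), (1, 3000), (1, 3001), (2, 3000))

// count(*) per (Days, start_station_id) group
val bikesPerDay = rows
  .groupBy(identity)  // group identical (day, station) pairs
  .map { case (key, group) => (key._1, key._2, group.size) }

bikesPerDay.foreach(println)
// e.g. (1,3000,2), (1,3001,1), (2,3000,1), in some order
```

The SQL form and the `groupBy` form compute the same counts; the SQL version simply lets Spark plan and distribute the work.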
