syntax.us Let the syntax do the talking
Blog Contact Posts Questions Tags Hire Me

Question:
In H2O Sparkling-Water how does TimeSplit work?

I bumped into TimeSplit here:

https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/Meetup20150226.script.scala

The API call looks like this:
val startTimeF = bikesDF('starttime)
bikesDF.add(new TimeSplit().doIt(startTimeF))
The use-case concerns a column of continuous data named starttime which resides in a DataFrame called bikesDF.

This data is in the form of milliseconds since 1970-01-01 AKA 'The Epoch'.

The use-case implies that I want days since the Epoch.

So, this boils down to syntax which converts every value in the starttime column from milliseconds to days and writes the results to a new column.
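The conversion itself is plain integer arithmetic, so it can be tried without H2O at all. Below is a minimal sketch in standalone Scala. The names EpochDays, MillisPerDay and daysSinceEpoch are my own, and the sample value is made up for illustration; none of them come from Sparkling-Water.

```scala
import java.time.LocalDate

object EpochDays {
  // Milliseconds in one day; the L suffix keeps the arithmetic in Long.
  val MillisPerDay: Long = 1000L * 60 * 60 * 24

  // Whole days since 1970-01-01; integer division drops the time of day.
  def daysSinceEpoch(millis: Long): Long = millis / MillisPerDay

  def main(args: Array[String]): Unit = {
    val sampleMillis = 1424908800000L // hypothetical starttime value
    val days = daysSinceEpoch(sampleMillis)
    println(days)                       // 16492
    println(LocalDate.ofEpochDay(days)) // 2015-02-26
  }
}
```

The call to LocalDate.ofEpochDay() is only there as a sanity check that the day count maps back to a sensible calendar date.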

I notice that the API-call does not allow me to specify the name of the new column.

What is the name of the new column?

To find that name I could run the API-call and then inspect the bikesDF DataFrame.

Or I could study the syntax which supports the API call.

I found that syntax near the bottom of the page listed below:

https://github.com/h2oai/sparkling-water/blob/master/examples/src/main/scala/org/apache/spark/examples/h2o/CitiBikeSharingDemo.scala

Currently that syntax looks like this:
class TimeSplit extends MRTask[TimeSplit] {
  def doIt(time: DataFrame):DataFrame =
      DataFrame(doAll(1, time).outputFrame(Array[String]("Days"), null))

  override def map(msec: Chunk, day: NewChunk):Unit = {
    for (i <- 0 until msec.len) {
      day.addNum(msec.at8(i) / (1000 * 60 * 60 * 24)); // Days since the Epoch
    }
  }
}
Currently I do not know Scala well enough to have a deep understanding of the above syntax.

It is obvious the syntax creates a class with two methods inside.

The class is a subclass of MRTask.

I do not know why the author placed [TimeSplit] to the right of MRTask.
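A likely explanation: [TimeSplit] is a type argument. H2O declares MRTask in Java as MRTask<T extends MRTask<T>>, so each subclass passes its own name in, and methods such as doAll() can then return the subclass type instead of a plain MRTask. The sketch below imitates that pattern in standalone Scala. ToyTask and Doubler are hypothetical names I made up for illustration; this is not H2O code.

```scala
// ToyTask and Doubler are hypothetical names, not part of H2O.
abstract class ToyTask[T <: ToyTask[T]] { self: T =>
  var result: Vector[Long] = Vector.empty

  // Subclasses say what happens to one value.
  def map(value: Long): Long

  // Returns T, so callers get the subclass back, not just ToyTask.
  def doAll(values: Seq[Long]): T = {
    result = values.map(v => map(v)).toVector
    this
  }
}

class Doubler extends ToyTask[Doubler] {
  override def map(value: Long): Long = value * 2

  // Only Doubler has this method.
  def total: Long = result.sum
}

object SelfTypeDemo {
  def main(args: Array[String]): Unit = {
    // doAll() returns a Doubler, so .total compiles without a cast.
    println(new Doubler().doAll(Seq(1L, 2L, 3L)).total) // 12
  }
}
```

If ToyTask had no type parameter, doAll() could only promise to return a ToyTask, and the call to .total would not compile. I assume the same reasoning is why doAll(1, time).outputFrame(...) can be chained inside TimeSplit.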

I can see that .doIt() takes an argument of type DataFrame. I also see an example input in the API call: startTimeF.

When I first saw how startTimeF was created, I was not sure what kind of object it was. Now I know it is a DataFrame.

It also teaches me how to create a DataFrame from a single column of another DataFrame.

I should use syntax like this:
val startTimeF = bikesDF('starttime)
Anyway, when I look at the body of doIt(), I see that it calls doAll(), then calls outputFrame() on whatever doAll() returns, and finally passes that result to DataFrame().

Back to my question: what is the name of the new column?

The arguments to outputFrame() suggest to me that the name of the new column I am curious about is: "Days".

Now back to the original question: in H2O Sparkling-Water, how does TimeSplit work?

I see now that TimeSplit works by me creating an object using syntax like this:
new TimeSplit()
Then I can call a method like this:
ts_object.doIt(startTimeF)
Control then passes to syntax inside .doIt().

Inside that method is a call to doAll().

The name doAll() suggests to me that it does something to all the objects which make up the DataFrame.

I suspect that doAll() implicitly calls the second method in the class.

That second method is map().

I do not see that method called anywhere, but I think it is being used by doAll(). Why else would that method reside there?

Another big clue is the syntax inside map().

It is obviously converting milliseconds to days, which is the intent behind the API call according to the comment in the loop.

So, TimeSplit works by coordinating two methods.

The first method declares that it wants a DataFrame and the output is also a DataFrame. It defines the name of the column in the output.

The first method also calls doAll(), which in turn calls the second method, .map(). That second method overrides a method of the MRTask class.

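Inside .map(), a loop iterates through the millisecond values of one chunk of the column and converts each of them to days. H2O stores a column as a series of chunks and calls .map() once per chunk.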

Once that is done, outputFrame() gathers the day values into a frame with a single column named "Days", and doIt() wraps that frame and hands me a DataFrame full of day values.
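To check my understanding of that flow, here is a toy model of it in standalone Scala. It is only a sketch under my own assumptions: ToyMRTask and ToyTimeSplit are hypothetical names, a column is modelled as a Seq of Long arrays (one array per chunk), and the output frame is just a Map from column name to values. The real MRTask distributes the chunks across a cluster, which this sketch does not attempt.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for MRTask; real H2O spreads chunks across a cluster.
abstract class ToyMRTask[T <: ToyMRTask[T]] { self: T =>
  private val outputs = ArrayBuffer.empty[Vector[Long]]

  // Called once per chunk of the input column.
  def map(in: Array[Long], out: ArrayBuffer[Long]): Unit

  // Runs map() on every chunk, keeps the results in order, returns the task.
  def doAll(chunks: Seq[Array[Long]]): T = {
    outputs.clear()
    for (chunk <- chunks) {
      val out = ArrayBuffer.empty[Long]
      map(chunk, out)
      outputs += out.toVector
    }
    this
  }

  // Glues the per-chunk results into one named column.
  def outputFrame(name: String): Map[String, Vector[Long]] =
    Map(name -> outputs.flatten.toVector)
}

class ToyTimeSplit extends ToyMRTask[ToyTimeSplit] {
  def doIt(chunks: Seq[Array[Long]]): Map[String, Vector[Long]] =
    doAll(chunks).outputFrame("Days")

  override def map(msec: Array[Long], day: ArrayBuffer[Long]): Unit =
    for (i <- msec.indices) day += msec(i) / (1000L * 60 * 60 * 24) // Days since the Epoch
}

object ToyTimeSplitDemo {
  def main(args: Array[String]): Unit = {
    // One column stored as two chunks of millisecond values.
    val starttime = Seq(Array(0L, 86400000L), Array(172800000L))
    println(new ToyTimeSplit().doIt(starttime)) // Map(Days -> Vector(0, 1, 2))
  }
}
```

Nothing in ToyTimeSplit calls map() directly, yet it runs once for each chunk because doAll() invokes it. That matches what I suspected about the real TimeSplit.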
