
Question:
How to start with Spark?

On Ubuntu it is easy to install Spark.

We need Java. Let's download Java:

http://www.oracle.com/technetwork/java/javase/downloads

After I downloaded Java, I did this:
tar zxf jdk-8u92-linux-x64.tar.gz
mv jdk1.8.0_92 ~/
cd             ~/
rm -rf            jdk
ln -s jdk1.8.0_92 jdk
I added this syntax to my .bashrc:
export JAVA_HOME=${HOME}/jdk
export PATH="${JAVA_HOME}/bin:${PATH}"
I tested Java:
java -showversion
Next I downloaded Spark:
cd ~/Downloads/
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
tar zxf spark-1.6.1-bin-hadoop2.6.tgz
mv spark-1.6.1-bin-hadoop2.6 ~/
cd                           ~/
rm -rf spark
ln -s spark-1.6.1-bin-hadoop2.6 spark
I added this syntax to my .bashrc (notice my use of PYSPARK_PYTHON):
export SPARK_HOME=${HOME}/spark
export PATH="${SPARK_HOME}/bin:${PATH}"
export PYSPARK_PYTHON=${HOME}/anaconda3/bin/python
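To verify that pyspark actually picks up PYSPARK_PYTHON, one quick check (a sketch; the exact path depends on your home directory) is to ask the REPL which interpreter it is running:

pyspark
>>> import sys
>>> sys.executable
'/home/dan/anaconda3/bin/python'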
I tested Spark:
spark-shell
I saw this:

dan@nia111:~ $ 
dan@nia111:~ $ 
dan@nia111:~ $ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/   _/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> 
scala> 
Currently I see Spark as having three main command-line interfaces:
  • spark-shell
  • pyspark
  • spark-submit
Here is a demo of spark-shell:
spark-shell -i demo11.scala
Here is a copy of demo11.scala:

/*

Demo:
$SPARK_HOME/bin/spark-shell -i demo11.scala

 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import scala.math.random

import org.apache.spark._
import org.apache.spark.SparkContext._

// Monte Carlo estimate of Pi: throw 100,000 random darts at the
// 2x2 square centered on the origin and count how many land
// inside the unit circle.
var count = 0
for (i <- 1 to 100000) {
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
// The circle covers Pi/4 of the square, so Pi ~ 4 * hits / throws.
println("Pi is roughly " + 4 * count / 100000.0)

// Exit explicitly; otherwise spark-shell -i stays in the interactive REPL.
System.exit(0)

Here is a demo of pyspark:
pyspark
I only use pyspark for interactive work.
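For example, a tiny interactive session might look like this (a sketch; sc is the SparkContext that pyspark creates for you at startup):

>>> rdd = sc.parallelize(range(10))
>>> rdd.map(lambda x: x * x).sum()
285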

My favorite of the three is spark-submit. I use it to submit Python scripts to Spark:
spark-submit demo10.py
Here is a copy of demo10.py:

# demo10.py

# This script should demonstrate how to connect Python syntax to Spark.

# ref:
# http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications

# Demo:
# $SPARK_HOME/bin/spark-submit demo10.py

from pyspark import SparkContext

sc = SparkContext("local", "Simple_App")
myfile = "/etc/hosts"  # Should be some file on your system
lines_rdd = sc.textFile(myfile).cache()

# Count the lines containing a '1' and the lines containing a '2'.
num1s = lines_rdd.filter(lambda s: '1' in s).count()
num2s = lines_rdd.filter(lambda s: '2' in s).count()

print("Lines with 1: %i, lines with 2: %i" % (num1s, num2s))

sc.stop()
'bye'
That should be enough information to get you over the initial hurdle of running Spark on your Linux laptop.

Exercises:
  • Convert demo11.scala to demo11.py
  • Try: spark-submit demo11.py
  • Convert demo10.py to demo10.scala
  • Try: spark-shell -i demo10.scala
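If the first exercise stumps you, here is one possible sketch of demo11.py. Using sc.parallelize to spread the dart throws across Spark is my choice here; a plain Python loop like the Scala original would also work:

# demo11.py

# One possible conversion of demo11.scala.

# Demo:
# $SPARK_HOME/bin/spark-submit demo11.py

import random

from pyspark import SparkContext

sc = SparkContext("local", "Pi_Estimate")

n = 100000

def inside(_):
    # Sample a point in the 2x2 square; report 1 if it lands in the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x*x + y*y < 1 else 0

# The circle covers Pi/4 of the square, so Pi ~ 4 * hits / throws.
count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))

sc.stop()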


Questions?

E-me: bikle101@gmail.com
