
Question:
In Spark what does reduceByKey() do?

Spark's reduceByKey() reminds me of SQL's GROUP BY feature.

I might have a collection which looks like this:
mylist = [ {'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4} ]
Question: What is the sum of values for the 'k1' pairs?
Answer: 1 + (-2) + 4 = 3

So if I use reduceByKey() to sum the values for every key, I should get this result:
[ {'k1': 3}, {'k2': -3}, {'k3': 4} ]
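
Before touching Spark, here is a plain-Python sketch (my own, just to pin down the arithmetic) of the behavior I expect; it is roughly SQL's SELECT key, SUM(value) ... GROUP BY key:

from collections import defaultdict

mylist = [ {'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4} ]

sums = defaultdict(int)
for d in mylist:
    for key, value in d.items():   # each one-entry dict holds one pair
        sums[key] += value         # 'k1': 1 + (-2) + 4 = 3

print(dict(sums))                  # {'k1': 3, 'k2': -3, 'k3': 4} (key order may vary)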
I started Spark on my laptop and studied the behavior of reduceByKey():
dan@feb ~/spark $ bin/pyspark
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Aug 21 2014, 18:22:21) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/02/01 22:52:04 INFO BlockManagerMaster: Trying to register BlockManager

15/02/01 22:52:04 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Python version 2.7.8 (default, Aug 21 2014 18:22:21)
SparkContext available as sc.
>>> 
>>> mylist = [ {'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4} ]
>>> myrdd = sc.parallelize(mylist)
>>> myrdd.take(6)

[{'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4}]
>>> 
>>> 
>>> mylam = lambda v1,v2: v1+v2
>>> 
>>> reduced_rdd = myrdd.reduceByKey(mylam)
>>> 
>>> 
>>> reduced_rdd.count()

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/dan/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/home/dan/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/dan/spark/python/pyspark/rdd.py", line 2073, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/dan/spark/python/pyspark/rdd.py", line 247, in func
    return f(iterator)
  File "/home/dan/spark/python/pyspark/rdd.py", line 1561, in combineLocally
    merger.mergeValues(iterator)
  File "/home/dan/spark/python/pyspark/shuffle.py", line 252, in mergeValues
    for k, v in iterator:
ValueError: need more than 1 value to unpack

It seems that reduceByKey() does not want an RDD built from a list of dictionaries. The traceback hints at why: the worker unpacks each element with "for k, v in iterator", and iterating a dict yields only its keys, so a one-entry dict like {'k1': 1} supplies one value where two are expected.
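
In hindsight, there is a workaround if you insist on starting from dicts (a sketch I have not benchmarked, using only the standard flatMap() and dict.items()): unpack each one-entry dict into a (key, value) tuple before reducing:

pairs = sc.parallelize(mylist).flatMap(lambda d: d.items())
pairs.reduceByKey(lambda v1, v2: v1 + v2).collect()
# the three summed pairs, e.g. [('k2', -3), ('k1', 3), ('k3', 4)], in no guaranteed order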

The Spark quick-start guide (quick-start.html) builds an RDD from a list of tuples.

So I built an RDD from a list of tuples:

>>> 
>>> 
>>> mylist2 = [ ('k1', 1), ('k2', 2), ('k1', -2), ('k3', 4), ('k2', -5), ('k1', 4) ]
>>> 
>>> myrdd2 = sc.parallelize(mylist2)
>>> 
>>> myrdd2.count()

6
>>> 
>>> myrdd2.take(6)

[('k1', 1), ('k2', 2), ('k1', -2), ('k3', 4), ('k2', -5), ('k1', 4)]
>>> 
>>> 
>>> reduced_rdd2 = myrdd2.reduceByKey(mylam)
>>> 
>>> reduced_rdd2.count()

3
>>> 
>>> 
>>> reduced_rdd2.take(reduced_rdd2.count())

[('k2', -3), ('k1', 3), ('k3', 4)]
>>> 
>>> 
>>> mylist3 = [ ['k1', 1], ['k2', 2], ['k1', -2], ['k3', 4], ['k2', -5], ['k1', 4] ]
>>> 
>>> myrdd3 = sc.parallelize(mylist3)
>>> 
>>> myrdd3.count()

6
>>> 
>>> 
>>> myrdd3.take(6)

[['k1', 1], ['k2', 2], ['k1', -2], ['k3', 4], ['k2', -5], ['k1', 4]]
>>> 
>>> 
>>> reduced_rdd3 = myrdd3.reduceByKey(mylam)
>>> 
>>> 
>>> reduced_rdd3.count()

3
>>> 
>>> 
>>> reduced_rdd3.take(reduced_rdd3.count())

[('k2', -3), ('k1', 3), ('k3', 4)]
>>> 
>>> 
>>> 
It worked. Yay! Notice that reduceByKey() accepted two-element lists as pairs too, and in both runs it handed the results back as tuples, in no particular order.

But it seems strange that Spark pushed me toward tuples.

If reduceByKey() is meant to work on a collection of key-value pairs, my first instinct was that each pair should live in the Python type built for key-value data: a dictionary, not a tuple. On reflection, though, a dict models a whole mapping of many keys, while a single pair is naturally a 2-tuple.
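
Python itself leans the same way: ask a dict for its individual pairs and it hands back tuples (Python 2 shown, to match the session above):

>>> {'k1': 1}.items()
[('k1', 1)]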

Fortunately, though, I now know how to interact with Spark's reduceByKey().
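
As a small aside, the hand-rolled lambda can be swapped for the standard library's operator.add, which performs the same pairwise addition:

from operator import add

reduced = myrdd2.reduceByKey(add)   # equivalent to reduceByKey(mylam)
reduced.take(3)                     # the same three summed pairs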
