Word Count with Apache Spark and Python (Mac OS X)

Overview

As a first step in trying out Apache Spark, I ran a word count. As anyone with Hadoop experience knows, this counts how many times each word appears in a file. The environment is Mac OS X, but the steps should be almost the same on Linux. The complete code is here.

Installation

$ brew install apache-spark

Verifying the installation

The installation is fine if spark-shell starts and the `scala>` prompt is displayed.

$ /usr/local/bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

scala>

Word count on local files with Python

The code below follows the example on the official site.

Directory structure

Set up the files as follows.

$ tree
.
├── input
│   └── data # Text to read
└── wordcount.py # Execution script

1 directory, 2 files

Writing the code

Here we use Python. You could also write this in Scala or Java, but Python is what I'm most comfortable with, so that's what I'll use. It looks like this.

wordcount.py


#!/usr/bin/env python
# coding:utf-8

from pyspark import SparkContext

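def execute(sc, src, dest):
    '''
    Perform a word count on src and write the result to dest
    '''
    # Read the src files as an RDD of lines
    text_file = sc.textFile(src)
    # Split lines into words, pair each word with 1, then sum per word
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    # Write the results (fails if dest already exists)
    counts.saveAsTextFile(dest)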

if __name__ == '__main__':
    sc = SparkContext('local', 'WordCount')
    src  = './input'
    dest = './output'
    execute(sc, src, dest)
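
To see what the three chained transformations do, here is a rough pure-Python sketch of the same pipeline that runs without Spark. The function name `word_count_local` is made up for illustration. It only mimics flatMap, map, and reduceByKey on an in-memory list, and it is not how Spark executes the job internally.

```python
from collections import defaultdict


def word_count_local(lines):
    """Mimic flatMap -> map -> reduceByKey on a plain list of lines."""
    # flatMap: split every line and flatten the pieces into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts of pairs that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    sample = ["aaa", "bbb", "ccc", "aaa", "bbbb", "ccc", "aaa"]
    print(word_count_local(sample))
```

The difference in the real job is that Spark spreads these steps over partitions of the data, and reduceByKey merges the partial sums for each key.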

Preparing the input file

Any text will do. For example:

./input/data


aaa
bbb
ccc
aaa
bbbb
ccc
aaa

Run

Run it with the following command.

$ which pyspark
/usr/local/bin/pyspark

# Run
$ pyspark ./wordcount.py

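When you execute it, log output scrolls by, much like Hadoop Streaming. Note that this works on the Spark 1.6.1 used here; on Spark 2.0 and later, running a script through pyspark is no longer supported, so use spark-submit ./wordcount.py instead.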

Verification

./output/part-00000


(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)

It was counted correctly.
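
If you would rather inspect the result from Python than open the part files by hand, something like the sketch below may help. It assumes the output lines look like the tuples shown above, and the helper name `load_counts` is hypothetical. The `u'...'` prefix is still a valid string literal in Python 3, so `ast.literal_eval` can parse these lines as they are.

```python
import ast
import glob
import os


def load_counts(output_dir):
    """Read every part-* file in output_dir and return a {word: count} dict."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Each line is the text form of a (word, count) tuple
                word, n = ast.literal_eval(line)
                counts[word] = n
    return counts


if __name__ == "__main__":
    # Print the words in descending order of count
    for word, n in sorted(load_counts("./output").items(), key=lambda kv: -kv[1]):
        print(word, n)
```

With more than one partition Spark writes several part files, which is why the sketch globs for `part-*` rather than reading only part-00000.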

Bonus

Note that if the output directory (./output) already exists, the next run will fail. It is a good idea to keep a shell script like the one below in the same directory.

exec.sh


#!/bin/bash

rm -fR ./output
/usr/local/bin/pyspark ./wordcount.py

echo ">>>>> result"
cat ./output/*

$ sh exec.sh
・ ・ ・
>>>>> result
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
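
As an alternative to the wrapper script, the stale directory could be removed from Python before the job writes its result. This is only a sketch: `clear_output` is a hypothetical helper, and it assumes the destination is a local path as in this article (it would not work for an HDFS path).

```python
import os
import shutil


def clear_output(dest):
    """Delete a local output directory if it exists; return True if removed."""
    if os.path.isdir(dest):
        shutil.rmtree(dest)
        return True
    return False
```

Calling clear_output(dest) just before execute(sc, src, dest) in wordcount.py would have the same effect as the rm -fR line in exec.sh.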
