Reference: http://spark.apache.org/docs/1.6.2/sql-programming-guide.html


// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)



Adding jars

  • spark-shell --jars <jar_name1,jar_name2, ...>
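
A concrete invocation might look like the following; the jar paths and the imported class are hypothetical placeholders, and classes contained in the listed jars become importable in the shell:

$ spark-shell --jars /home/hadoop/libs/mylib.jar,/home/hadoop/libs/other-dep.jar

scala> import com.example.mylib.SomeClass   // hypothetical class shipped in mylib.jar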



Just as in the plain Scala REPL, the following prints the Scala version.


scala> scala.util.Properties.versionString

res0: String = version 2.10.5
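
Two related calls, shown here as a small sketch rather than a captured session: versionNumberString returns just the number, and sc.version reports the Spark version of the running shell.

scala> scala.util.Properties.versionNumberString   // e.g. "2.10.5"

scala> sc.version                                  // Spark version of the running shell, e.g. "1.6.0"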



One constraint when calling saveAsTextFile from spark-shell is that the output directory must not already contain files.

So the old files have to be deleted beforehand; for S3 this can be done with a single hadoop fs -rmr command.

To run a hadoop command from inside spark-shell, do the following.


scala> import sys.process._

import sys.process._


scala> "hadoop fs -rmr s3://버킷명/지우고_싶은_디렉토리" !



You could also use the AWS S3 Java SDK, but that is more hassle.
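
A middle ground, without shelling out or pulling in the S3 SDK, is the Hadoop FileSystem API that is already on the spark-shell classpath. A rough sketch; the path is a hypothetical placeholder:

scala> import java.net.URI
scala> import org.apache.hadoop.fs.{FileSystem, Path}

scala> val output = "s3://bucket_name/directory_to_delete"
scala> val fs = FileSystem.get(new URI(output), sc.hadoopConfiguration)
scala> fs.delete(new Path(output), true)   // true = delete recursively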

Reference
* http://alvinalexander.com/scala/scala-execute-exec-external-system-commands-in-scala




Dependencies can also be pulled from Maven Central at launch with --packages. The session below (Spark 1.6.0) loads the lamma date library and uses it to print a date range.

$ spark-shell --packages "io.lamma:lamma_2.11:2.3.0"
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar!/org/apache/ivy/core/settings/ivysettings.xml
joda-time#joda-time added as a dependency
io.lamma#lamma_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found joda-time#joda-time;2.9.2 in central
        found io.lamma#lamma_2.11;2.3.0 in central
downloading https://repo1.maven.org/maven2/io/lamma/lamma_2.11/2.3.0/lamma_2.11-2.3.0.jar ...
        [SUCCESSFUL ] io.lamma#lamma_2.11;2.3.0!lamma_2.11.jar (68ms)
:: resolution report :: resolve 1799ms :: artifacts dl 78ms
        :: modules in use:
        io.lamma#lamma_2.11;2.3.0 from central in [default]
        joda-time#joda-time;2.9.2 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   1   |   1   |   0   ||   2   |   1   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        1 artifacts copied, 1 already retrieved (338kB/13ms)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
16/02/11 05:49:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context available as sc.
Thu Feb 11 05:49:54 UTC 2016 Thread[main,5,main] java.io.FileNotFoundException: derby.log (Permission denied)
----------------------------------------------------------------
Thu Feb 11 05:49:55 UTC 2016:
Booting Derby version The Apache Software Foundation - Apache Derby - 10.10.1.1 - (1458268): instance a816c00e-0152-cee0-f31c-000025557508
on database directory /mnt/tmp/spark-9e5a20ed-acea-4dc8-ab4b-5803c4a3dca9/metastore with class loader sun.misc.Launcher$AppClassLoader@753d556f
Loaded from file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar
java.vendor=Oracle Corporation
java.runtime.version=1.7.0_91-mockbuild_2015_10_27_19_01-b00
user.dir=/etc/spark/conf.dist
os.name=Linux
os.arch=amd64
os.version=4.1.13-19.30.amzn1.x86_64
derby.system.home=null
Database Class Loader started - derby.database.classpath=''
16/02/11 05:50:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/02/11 05:50:04 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

scala> import io.lamma.Date
import io.lamma.Date

scala> Date("2016-02-01") to Date("2016-02-11") foreach println
Date(2016,2,1)
Date(2016,2,2)
Date(2016,2,3)
Date(2016,2,4)
Date(2016,2,5)
Date(2016,2,6)
Date(2016,2,7)
Date(2016,2,8)
Date(2016,2,9)
Date(2016,2,10)
Date(2016,2,11)

scala>
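
For multiple dependencies, --packages takes a comma-separated list of Maven coordinates. The pairing below is only an illustration, reusing the coordinates seen in the resolution log above:

$ spark-shell --packages "io.lamma:lamma_2.11:2.3.0,joda-time:joda-time:2.9.2"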


