One constraint when using saveAsTextFile in spark-shell: the destination directory must not already contain files.

So you have to delete them first; for S3 this takes just a single hadoop fs -rmr command.

To run a hadoop command from inside spark-shell, do the following.


scala> import sys.process._

import sys.process._


scala> "hadoop fs -rmr s3://BUCKET_NAME/DIRECTORY_TO_DELETE" !



You could also use the AWS S3 Java SDK, but that is more hassle.
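As a sketch of how sys.process behaves (plain commands here, nothing hadoop-specific): .! runs a command and returns its exit code, while .!! returns its stdout, so you can also verify that the delete succeeded before saving.

```scala
import sys.process._

// .! runs the command and returns its exit code (0 means success).
val code = Seq("echo", "done").!
// .!! runs the command and returns its stdout as a String.
val out = Seq("echo", "hello").!!.trim

// In spark-shell the same idea could guard the save, e.g.
//   if (Seq("hadoop", "fs", "-rmr", path).! == 0) rdd.saveAsTextFile(path)
println(s"exit=$code, out=$out")
```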

Reference
* http://alvinalexander.com/scala/scala-execute-exec-external-system-commands-in-scala




$ spark-shell --packages "io.lamma:lamma_2.11:2.3.0"

Ivy Default Cache set to: /home/hadoop/.ivy2/cache

The jars for the packages stored in: /home/hadoop/.ivy2/jars

:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar!/org/apache/ivy/core/settings/ivysettings.xml

joda-time#joda-time added as a dependency

io.lamma#lamma_2.11 added as a dependency

:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0

        confs: [default]

        found joda-time#joda-time;2.9.2 in central

        found io.lamma#lamma_2.11;2.3.0 in central

downloading https://repo1.maven.org/maven2/io/lamma/lamma_2.11/2.3.0/lamma_2.11-2.3.0.jar ...

        [SUCCESSFUL ] io.lamma#lamma_2.11;2.3.0!lamma_2.11.jar (68ms)

:: resolution report :: resolve 1799ms :: artifacts dl 78ms

        :: modules in use:

        io.lamma#lamma_2.11;2.3.0 from central in [default]

        joda-time#joda-time;2.9.2 from central in [default]

        ---------------------------------------------------------------------

        |                  |            modules            ||   artifacts   |

        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|

        ---------------------------------------------------------------------

        |      default     |   2   |   1   |   1   |   0   ||   2   |   1   |

        ---------------------------------------------------------------------

:: retrieving :: org.apache.spark#spark-submit-parent

        confs: [default]

        1 artifacts copied, 1 already retrieved (338kB/13ms)

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0

      /_/


Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_91)

Type in expressions to have them evaluated.

Type :help for more information.

16/02/11 05:49:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

Spark context available as sc.

Thu Feb 11 05:49:54 UTC 2016 Thread[main,5,main] java.io.FileNotFoundException: derby.log (Permission denied)

----------------------------------------------------------------

Thu Feb 11 05:49:55 UTC 2016:

Booting Derby version The Apache Software Foundation - Apache Derby - 10.10.1.1 - (1458268): instance a816c00e-0152-cee0-f31c-000025557508

on database directory /mnt/tmp/spark-9e5a20ed-acea-4dc8-ab4b-5803c4a3dca9/metastore with class loader sun.misc.Launcher$AppClassLoader@753d556f

Loaded from file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar

java.vendor=Oracle Corporation

java.runtime.version=1.7.0_91-mockbuild_2015_10_27_19_01-b00

user.dir=/etc/spark/conf.dist

os.name=Linux

os.arch=amd64

os.version=4.1.13-19.30.amzn1.x86_64

derby.system.home=null

Database Class Loader started - derby.database.classpath=''

16/02/11 05:50:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

16/02/11 05:50:04 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

SQL context available as sqlContext.


scala> import io.lamma.Date

import io.lamma.Date


scala> Date("2016-02-01") to Date("2016-02-11") foreach println

Date(2016,2,1)

Date(2016,2,2)

Date(2016,2,3)

Date(2016,2,4)

Date(2016,2,5)

Date(2016,2,6)

Date(2016,2,7)

Date(2016,2,8)

Date(2016,2,9)

Date(2016,2,10)

Date(2016,2,11)


scala>
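For reference, if you would rather avoid the extra dependency, the same inclusive date-range iteration can be sketched with the JDK's java.time (note this needs Java 8+, newer than the Java 7 runtime shown in the log above; the dateRange helper is my own name, not a lamma API):

```scala
import java.time.LocalDate

// Inclusive day-by-day range built with only the JDK.
def dateRange(start: LocalDate, end: LocalDate): Seq[LocalDate] =
  Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toSeq

dateRange(LocalDate.of(2016, 2, 1), LocalDate.of(2016, 2, 11)).foreach(println)
```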



Let's install Scala the easy way using SDKMAN.


First, install sdkman.


$ curl -s get.sdkman.io | bash


Then open a new terminal window, or run the following command to make sdkman available.

$ source "$HOME/.sdkman/bin/sdkman-init.sh"
Install Scala with the following command and you are done.

[~]# sdk install scala

==== BROADCAST =================================================================

* 08/02/16: Gradle 2.11 released on SDKMAN! #gradle

* 06/02/16: Vertx 3.2.1 released on SDKMAN! #vertx

* 04/02/16: Kotlin 1.0.0-rc-1036 released on SDKMAN! #kotlin

================================================================================


Downloading: scala 2.11.7


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0

100 27.1M  100 27.1M    0     0   360k      0  0:01:17  0:01:17 --:--:--  344k


Installing: scala 2.11.7

Done installing!


Do you want scala 2.11.7 to be set as default? (Y/n):  y


Setting scala 2.11.7 as default.

[~]# which scala

/Users/gilbird/.sdkman/candidates/scala/current/bin/scala
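To double-check which Scala library the shell is actually running, the standard library's scala.util.Properties reports the version at runtime:

```scala
// Prints something like "version 2.11.7" for the running Scala library.
println(scala.util.Properties.versionString)
```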





Installing sbteclipse lets you generate an Eclipse project with the sbt eclipse command.


First, set up the sbteclipse plugin.

In your project directory, create a project directory and a plugins.sbt file inside it.


mkdir project

vi project/plugins.sbt


Add the following line:


addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")


Run the following command in the project directory:


# sbt eclipse

[info] Loading global plugins from /Users/gilbird/.sbt/0.13/plugins

[info] Updating {file:/Users/gilbird/.sbt/0.13/plugins/}global-plugins...

[info] Resolving org.fusesource.jansi#jansi;1.4 ...

[info] downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbteclipse/sbteclipse-plugin/scala_2.10/sbt_0.13/4.0.0/jars/sbteclipse-plugin.jar ...

[info] [SUCCESSFUL ] com.typesafe.sbteclipse#sbteclipse-plugin;4.0.0!sbteclipse-plugin.jar (6861ms)

[info] downloading https://jcenter.bintray.com/org/scalaz/scalaz-core_2.10/7.1.0/scalaz-core_2.10-7.1.0.jar ...

[info] [SUCCESSFUL ] org.scalaz#scalaz-core_2.10;7.1.0!scalaz-core_2.10.jar(bundle) (9362ms)

[info] downloading https://jcenter.bintray.com/org/scalaz/scalaz-effect_2.10/7.1.0/scalaz-effect_2.10-7.1.0.jar ...

[info] [SUCCESSFUL ] org.scalaz#scalaz-effect_2.10;7.1.0!scalaz-effect_2.10.jar(bundle) (1953ms)

[info] Done updating.

[info] Set current project to ScalaTest (in build file:/Users/gilbird/Work/scala/ScalaTest/)

[info] About to create Eclipse project files for your project(s).

[info] Updating {file:/Users/gilbird/Work/scala/ScalaTest/}scalatest...

[info] Resolving org.fusesource.jansi#jansi;1.4 ...

[info] downloading https://jcenter.bintray.com/org/scala-lang/scala-library/2.10.1/scala-library-2.10.1.jar ...

[info] [SUCCESSFUL ] org.scala-lang#scala-library;2.10.1!scala-library.jar (6513ms)

[info] downloading https://jcenter.bintray.com/org/scala-lang/scala-compiler/2.10.1/scala-compiler-2.10.1.jar ...

[info] [SUCCESSFUL ] org.scala-lang#scala-compiler;2.10.1!scala-compiler.jar (10524ms)

[info] downloading https://jcenter.bintray.com/org/scala-lang/scala-reflect/2.10.1/scala-reflect-2.10.1.jar ...

[info] [SUCCESSFUL ] org.scala-lang#scala-reflect;2.10.1!scala-reflect.jar (3837ms)

[info] downloading https://jcenter.bintray.com/org/scala-lang/jline/2.10.1/jline-2.10.1.jar ...

[info] [SUCCESSFUL ] org.scala-lang#jline;2.10.1!jline.jar (2989ms)

[info] Done updating.

[info] Successfully created Eclipse project files for project(s):

[info] ScalaTest


Running ls shows that a .project file has been created.


# ls

total 24

0 ./          8 .classpath  0 .settings/  0 lib/        0 src/

0 ../         8 .project    8 build.sbt   0 project/    0 target/


Then just import it into Eclipse as an existing project.
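For completeness, a minimal build.sbt for such a project might look like the sketch below (the project name matches the log above; the version numbers are placeholders, not required values):

```scala
name := "ScalaTest"

version := "1.0"

scalaVersion := "2.10.1"
```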



Download the sbtmkdirs.sh script from the following address:

https://gist.github.com/alvinj/3194379


Run the script in the directory where you want the project, answer a few prompts, and the project is created underneath it.


# sbtmkdirs.sh

This script creates an SBT project directory beneath the current directory.


Directory/Project Name (MyFirstProject): ScalaTest

Create .gitignore File? (Y/n): n

Create README.md File? (Y/n): n


-----------------------------------------------

Directory/Project Name: ScalaTest

Create .gitignore File?: n

Create README.md File?: n

-----------------------------------------------

Create Project? (Y/n): y


Project created. See the following URL for build.sbt examples:

http://alvinalexander.com/scala/sbt-syntax-examples

[~/work/scala]#


After the script runs, you can confirm that the following directories and files were created.


# ls ScalaTest/

total 8

0 ./         0 ../        8 build.sbt  0 lib/       0 project/   0 src/       0 target/


Visit the following site and download sbt:


http://www.scala-sbt.org/download.html


After unpacking, I moved to the bin directory and ran sbt; it spent a long time downloading related libraries.


[~/gilbird/scala/sbt/bin]# ./sbt

Getting org.scala-sbt sbt 0.13.9 ...

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.9/jars/sbt.jar ...

[SUCCESSFUL ] org.scala-sbt#sbt;0.13.9!sbt.jar (4724ms)

downloading https://jcenter.bintray.com/org/scala-lang/scala-library/2.10.5/scala-library-2.10.5.jar ...

[SUCCESSFUL ] org.scala-lang#scala-library;2.10.5!scala-library.jar (2985ms)

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/main/0.13.9/jars/main.jar ...

[SUCCESSFUL ] org.scala-sbt#main;0.13.9!main.jar (9263ms)

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.9/jars/compiler-interface-bin.jar ...

[SUCCESSFUL ] org.scala-sbt#compiler-interface;0.13.9!compiler-interface-bin.jar (6848ms)

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.9/jars/compiler-interface-src.jar ...

[SUCCESSFUL ] org.scala-sbt#compiler-interface;0.13.9!compiler-interface-src.jar (4634ms)

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/precompiled-2_8_2/0.13.9/jars/compiler-interface-bin.jar ...

[SUCCESSFUL ] org.scala-sbt#precompiled-2_8_2;0.13.9!compiler-interface-bin.jar (6416ms)

downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/precompiled-2_9_2/0.13.9/jars/compiler-interface-bin.jar ...

[SUCCESSFUL ] org.scala-sbt#precompiled-2_9_2;0.13.9!compiler-interface-bin.jar (5475ms)

...



Running it again confirmed that it starts right away, since the libraries were already downloaded.

I added the path to .profile so sbt can be run from any directory.


export SBT_HOME=[전체경로]/scala/sbt

export PATH=$SBT_HOME/bin:$PATH




To enter multiple lines in the Scala shell, do the following:

  1. Type :paste (:p also works)
  2. Enter your multi-line code
  3. Press Ctrl + D to finish the input


Reference link: http://alvinalexander.com/scala/how-to-enter-paste-multiline-commands-statements-into-scala-repl
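For example, a block like the following pastes cleanly in :paste mode, whereas entering it line by line can trip up the REPL (the Point class is just an illustration):

```scala
// A multi-line definition of the kind you would enter after :paste,
// finishing the input with Ctrl + D.
case class Point(x: Int, y: Int) {
  def +(other: Point): Point = Point(x + other.x, y + other.y)
}

println(Point(1, 2) + Point(3, 4))
```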



Even with the same number of cores, the underlying CPU specs differ, so this is a performance figure that takes that into account as well.


Q: What is a “EC2 Compute Unit” and why did you introduce it?

Transitioning to a utility computing model fundamentally changes how developers have been trained to think about CPU resources. Instead of purchasing or leasing a particular processor to use for several months or years, you are renting capacity by the hour. Because Amazon EC2 is built on commodity hardware, over time there may be several different types of physical hardware underlying EC2 instances. Our goal is to provide a consistent amount of CPU capacity no matter what the actual underlying hardware.

Amazon EC2 uses a variety of measures to provide each instance with a consistent and predictable amount of CPU capacity. In order to make it easy for developers to compare CPU capacity between different instance types, we have defined an Amazon EC2 Compute Unit. The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. The EC2 Compute Unit (ECU) provides the relative measure of the integer processing power of an Amazon EC2 instance. Over time, we may add or substitute measures that go into the definition of an EC2 Compute Unit, if we find metrics that will give you a clearer picture of compute capacity.


http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it

