https://github.com/open-guides/og-aws/blob/master/README.md




On EMR, Spark can run on top of YARN.

To inspect a currently running Spark job in its web UI,

go through the following steps.

  • Open an SSH tunnel and connect to the ResourceManager
  • Click the ApplicationMaster link of the job

Caution
  • If you open the job UI without going through the ResourceManager, the images on the page come out broken.
  • So always reach it through the YARN ResourceManager.
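
The tunneling step can be sketched as below. This is only a minimal sketch under assumptions: it uses dynamic port forwarding (a SOCKS proxy), logs in to the master node as the hadoop user, and expects the browser to be configured to use the proxy. The function name open_emr_proxy, the local port 8157, and the key file and host name passed to it are placeholders chosen for illustration. Setting DRY_RUN prints the command instead of running it.

```shell
# Sketch: open a SOCKS proxy to the EMR master node over SSH.
# open_emr_proxy is a hypothetical helper, not part of any AWS tool.
open_emr_proxy() {
  key_file="$1"            # path to the EC2 key pair file (assumption)
  master_dns="$2"          # public DNS name of the EMR master node (assumption)
  local_port="${3:-8157}"  # local SOCKS port; any free port works

  # -N: run no remote command, -D: dynamic (SOCKS) port forwarding
  set -- ssh -i "$key_file" -N -D "$local_port" "hadoop@${master_dns}"

  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"              # only print the command
  else
    "$@"                   # run ssh; blocks until interrupted
  fi
}
```

With the proxy open and the browser pointed at localhost:8157 as its SOCKS proxy, the ResourceManager UI should be reachable on port 8088 of the master node, and the ApplicationMaster link of the job can be followed from there.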


On EC2, sudo pip install sometimes fails.


[ec2-user@ip-172-31-17-194 ~]$ sudo pip install boto3
Traceback (most recent call last):
  File "/usr/bin/pip", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3138, in <module>
    @_call_aside
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3124, in _call_aside
    f(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3151, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 663, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 676, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 849, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pip==6.1.1' distribution was not found and is required by the application
[ec2-user@ip-172-31-17-194 ~]$ which pip
/usr/local/bin/pip


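If you get output like the above, note that the traceback comes from /usr/bin/pip, while which pip reports /usr/local/bin/pip. Calling pip by its full path fixes it.


sudo /usr/local/bin/pip install boto3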


To extract the ID of the cluster whose name starts with ADHOC-JOB, run the following in the shell.



CLUSTER_ID=$(aws emr list-clusters --active --query 'Clusters[?starts_with(Name, `ADHOC-JOB`) == `true`].Id' | jq -r '.[]')
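
As an alternative that avoids post-processing the JSON output, the CLI can print plain text directly with --output text. The following is a minimal sketch; the function name adhoc_cluster_ids is hypothetical, and the prefix is taken as an argument and assumed to contain no single quotes.

```shell
# Sketch: print the IDs of active clusters whose name starts with a prefix.
# adhoc_cluster_ids is a hypothetical helper name.
adhoc_cluster_ids() {
  prefix="${1:-ADHOC-JOB}"   # name prefix to match (defaults to ADHOC-JOB)

  # '...' inside the JMESPath expression is a raw string literal;
  # --output text prints bare values instead of JSON.
  aws emr list-clusters --active \
    --query "Clusters[?starts_with(Name, '${prefix}')].Id" \
    --output text
}
```

Usage would look like CLUSTER_ID=$(adhoc_cluster_ids ADHOC-JOB). If several clusters match, the IDs come back on one line separated by tabs, so this fits best when a single match is expected.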



Passing the --query option to an aws cli command filters the JSON result.



Find the RUNNING instances among the EMR CORE instances and print only their IDs


$ aws emr list-instances --cluster-id j-1RFIG3YHLDOXX --instance-group-types CORE --query 'Instances[?Status.State==`RUNNING`].Ec2InstanceId'
[
    "i-51d9f1xx",
    "i-55d9f1xx",
    "i-085653xx",
    "i-656d67xx",
    "i-1a6d67xx"
]



Get the names and normalized instance hours of the currently active EMR clusters.

$ aws emr list-clusters --active --query 'Clusters[*].[Name,NormalizedInstanceHours]' | jq '.'
[
  [
    "20160825-1135",
    192
  ],
  [
    "20160825-1109",
    448
  ],
  [
    "20160825-1100",
    672
  ]
]





Filter for clusters over 1000 hours and return the values as Name:Value pairs (note: use {} instead of [])

$ aws emr list-clusters --query 'Clusters[?NormalizedInstanceHours > `1000`].{Name:Name, Time:NormalizedInstanceHours}' | jq '.'
[
  {
    "Name": "20160822-0900",
    "Time": 1632
  },
  {
    "Name": "20160804",
    "Time": 1320
  },
  {
    "Name": "20160804",
    "Time": 1848
  },
  {
    "Name": "20160802-2",
    "Time": 29064
  }
]





Basic command

$ aws ec2 describe-volumes
{
    "Volumes": [
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-17T00:55:03.000Z",
                    "InstanceId": "i-a071c394",
                    "VolumeId": "vol-e11a5288",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-e11a5288",
            "State": "in-use",
            "SnapshotId": "snap-f23ec1c8",
            "CreateTime": "2013-09-17T00:55:03.000Z",
            "Size": 30
        },
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-18T20:26:16.000Z",
                    "InstanceId": "i-4b41a37c",
                    "VolumeId": "vol-2e410a47",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-2e410a47",
            "State": "in-use",
            "SnapshotId": "snap-708e8348",
            "CreateTime": "2013-09-18T20:26:15.000Z",
            "Size": 8
        }
    ]
}


Show only the data that matches a condition

$ aws ec2 describe-volumes --query 'Volumes[?AvailabilityZone==`us-west-2a`]'


Show only selected fields

$ aws ec2 describe-volumes --query 'Volumes[*].{ID:VolumeId,InstanceId:Attachments[0].InstanceId,AZ:AvailabilityZone,Size:Size}'
[
    {
        "InstanceId": "i-a071c394",
        "AZ": "us-west-2a",
        "ID": "vol-e11a5288",
        "Size": 30
    },
    {
        "InstanceId": "i-4b41a37c",
        "AZ": "us-west-2a",
        "ID": "vol-2e410a47",
        "Size": 8
    }
]


Show only the values of selected fields

$ aws ec2 describe-volumes --query 'Volumes[*].[VolumeId, Attachments[0].InstanceId, AvailabilityZone, Size]'
[
    [
        "vol-e11a5288",
        "i-a071c394",
        "us-west-2a",
        30
    ],
    [
        "vol-2e410a47",
        "i-4b41a37c",
        "us-west-2a",
        8
    ]
]
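
The condition filter and the field selection shown above can also be combined in one --query expression. A minimal sketch, assuming the same describe-volumes output; the function name volumes_in_az and the chosen output columns are illustrative only, and the zone name is assumed to contain no single quotes.

```shell
# Sketch: list ID, attached instance and size of the volumes in one AZ.
# volumes_in_az is a hypothetical helper name.
volumes_in_az() {
  az="$1"   # availability zone to filter on, e.g. us-west-2a

  # [?...] filters the list, then {...} picks and renames fields per element.
  aws ec2 describe-volumes \
    --query "Volumes[?AvailabilityZone=='${az}'].{ID:VolumeId,InstanceId:Attachments[0].InstanceId,Size:Size}" \
    --output table
}
```

Calling volumes_in_az us-west-2a would then print one row per matching volume; --output table is used here only to make the result easier to read in a terminal.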





Reference: Controlling command output

http://docs.aws.amazon.com/ko_kr/cli/latest/userguide/controlling-output.html







Restarting the ResourceManager



sudo /sbin/stop hadoop-yarn-resourcemanager

sudo /sbin/start hadoop-yarn-resourcemanager
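
The two commands above can be wrapped into one step. This is a minimal sketch that assumes the node manages its services with upstart, as the /sbin/stop and /sbin/start commands suggest; restart_upstart_job is a hypothetical helper name.

```shell
# Sketch: stop, start and then report the status of an upstart job.
# restart_upstart_job is a hypothetical helper name.
restart_upstart_job() {
  job="$1"   # upstart job name, e.g. hadoop-yarn-resourcemanager

  # stop fails when the job is not running; continue to start in that case
  sudo /sbin/stop "$job" || echo "stop failed (job may not be running)" >&2
  sudo /sbin/start "$job"
  sudo /sbin/status "$job"
}
```

For the ResourceManager this would be called as restart_upstart_job hadoop-yarn-resourcemanager.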




ECU: a performance figure that also reflects the CPU specification, because the core count alone does not say how capable the CPUs are


Q: What is a “EC2 Compute Unit” and why did you introduce it?

Transitioning to a utility computing model fundamentally changes how developers have been trained to think about CPU resources. Instead of purchasing or leasing a particular processor to use for several months or years, you are renting capacity by the hour. Because Amazon EC2 is built on commodity hardware, over time there may be several different types of physical hardware underlying EC2 instances. Our goal is to provide a consistent amount of CPU capacity no matter what the actual underlying hardware.

Amazon EC2 uses a variety of measures to provide each instance with a consistent and predictable amount of CPU capacity. In order to make it easy for developers to compare CPU capacity between different instance types, we have defined an Amazon EC2 Compute Unit. The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. The EC2 Compute Unit (ECU) provides the relative measure of the integer processing power of an Amazon EC2 instance. Over time, we may add or substitute measures that go into the definition of an EC2 Compute Unit, if we find metrics that will give you a clearer picture of compute capacity.


http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it


How to change the Zeppelin service port on emr-4.2.0


Step 1: Change the configuration


cd /usr/lib/zeppelin/conf

sudo vi zeppelin-env.sh


Set the following variable to the port you want

export ZEPPELIN_PORT
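
The same change can be made without opening an editor. The sketch below is an assumption-laden illustration, not the official procedure: set_zeppelin_port is a hypothetical helper, the port number in the usage note is arbitrary, and the function must run with write permission on the file and its directory (for example from a root shell).

```shell
# Sketch: set ZEPPELIN_PORT in a zeppelin-env.sh style file.
# set_zeppelin_port is a hypothetical helper name.
set_zeppelin_port() {
  env_file="$1"                # path to zeppelin-env.sh
  port="$2"                    # port Zeppelin should listen on
  tmp_file="${env_file}.tmp"   # scratch file next to the original

  # Keep every line except an existing active ZEPPELIN_PORT export.
  # grep exits non-zero when nothing is left to print, hence "|| true".
  grep -Ev '^export ZEPPELIN_PORT([= ]|$)' "$env_file" > "$tmp_file" || true

  # Append the new setting, then write back in place to keep file permissions.
  printf 'export ZEPPELIN_PORT=%s\n' "$port" >> "$tmp_file"
  cat "$tmp_file" > "$env_file"
  rm -f "$tmp_file"
}
```

For example, set_zeppelin_port /usr/lib/zeppelin/conf/zeppelin-env.sh 9090 would leave exactly one active ZEPPELIN_PORT line in the file; the new port takes effect only after the restart in the next step.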



Step 2: Restart the server


cd ../bin

./zeppelin-daemon.sh restart
