
程序员小站

J2EE | Spring | JVM | Scala

 
 
 

Spark & Shark: Installation and Usage

2014-03-21 17:13:31 | Category: hadoop


Apache Spark is a fast and general engine for large-scale data processing: http://spark.incubator.apache.org/
Shark is a Hive-compatible query engine based on Spark: http://shark.cs.berkeley.edu/
Spark installation:

Download:

     http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating.tgz

Build:

sbt/sbt assembly

Edit the configuration files:

    conf/spark-env.sh
    export SCALA_HOME=/etc/scala-2.10.3
    export JAVA_HOME=/usr/jdk64/jdk1.7.0_51
    export SPARK_MASTER_IP=192.168.1.100
    export SPARK_WORKER_MEMORY=4G
    export SPARK_MASTER_WEBUI_PORT=8280  
    export SPARK_WORKER_WEBUI_PORT=8200

    vim /app/spark/conf/slaves
    slave1
    slave2

Start & stop:

    spark/sbin/start-all.sh
    spark/sbin/stop-all.sh

Deploy the slave nodes:

       scp -r spark 192.168.1.101:/    # the directory layout must be identical on every node
Run the examples:

    ./bin/run-example org.apache.spark.examples.SparkPi local
    ./bin/run-example org.apache.spark.examples.SparkPi spark://192.168.1.100:7077
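
For context, SparkPi estimates π by Monte Carlo sampling: it scatters random points over the unit square and counts how many land inside the unit circle. A minimal sketch of the same idea, assuming the two-argument SparkContext constructor of the 0.9-era Scala API:

    import org.apache.spark.SparkContext
    import scala.math.random

    object MiniSparkPi {
      def main(args: Array[String]) {
        // args(0) is the master URL: "local" or spark://192.168.1.100:7077
        val sc = new SparkContext(args(0), "MiniSparkPi")
        val n = 100000
        // count random points that fall inside the unit circle
        val hits = sc.parallelize(1 to n).map { _ =>
          val x = random * 2 - 1
          val y = random * 2 - 1
          if (x * x + y * y < 1) 1 else 0
        }.reduce(_ + _)
        // the circle-to-square area ratio is pi/4
        println("Pi is roughly " + 4.0 * hits / n)
        sc.stop()
      }
    }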
Browse the master's web UI at http://master:8280 (default: http://master:8080).

Reading from HDFS:

    /app/spark-0.9/bin/spark-shell
    scala> val file = sc.textFile("hdfs://wemeetcluster:8020/tmp/action.20140218_12.node12.intra.hiwemeet.com.log")
    scala> file.first    // prints the first line
    scala> file.collect
    14/03/06 17:03:05 INFO SparkContext: Starting job: collect at <console>:15
    14/03/06 17:03:05 INFO DAGScheduler: Got job 1 (collect at <console>:15) with 1 output partitions (allowLocal=false)
    14/03/06 17:03:05 INFO DAGScheduler: Final stage: Stage 1 (collect at <console>:15)
    14/03/06 17:03:05 INFO DAGScheduler: Parents of final stage: List()
    14/03/06 17:03:05 INFO DAGScheduler: Missing parents: List()
    14/03/06 17:03:05 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[1] at textFile at <console>:12), which has no missing parents
    14/03/06 17:03:06 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[1] at textFile at <console>:12)
    14/03/06 17:03:06 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
    14/03/06 17:03:06 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
    14/03/06 17:03:06 INFO TaskSetManager: Serialized task 1.0:0 as 1634 bytes in 9 ms
    14/03/06 17:03:06 INFO Executor: Running task ID 0
    14/03/06 17:03:06 INFO BlockManager: Found block broadcast_0 locally
    14/03/06 17:03:06 INFO HadoopRDD: Input split: hdfs://wemeetcluster:8020/tmp/action.20140218_12.node12.intra.hiwemeet.com.log:0+160
    14/03/06 17:03:06 INFO Executor: Serialized size of result for 0 is 693
    14/03/06 17:03:06 INFO Executor: Sending result for 0 directly to driver
    14/03/06 17:03:06 INFO Executor: Finished task ID 0
    14/03/06 17:03:06 INFO TaskSetManager: Finished TID 0 in 93 ms on localhost (progress: 0/1)
    14/03/06 17:03:06 INFO TaskSchedulerImpl: Remove TaskSet 1.0 from pool
    14/03/06 17:03:06 INFO DAGScheduler: Completed ResultTask(1, 0)
    14/03/06 17:03:06 INFO DAGScheduler: Stage 1 (collect at <console>:15) finished in 0.109 s
    14/03/06 17:03:06 INFO SparkContext: Job finished: collect at <console>:15, took 0.259848716 s
    res2: Array[String] = Array({"action":"yuanzhou","time":1392697879}, {"action":"yuanzhou","time":1392697882}, {"action":"yuanzhou","time":1392697981}, {"action":"yuanzhou","time":1392697984})
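
Building on the same RDD, a couple of follow-up transformations (a sketch; `file` is the RDD defined above, and the action value is simply the one visible in the sample output):

    scala> file.filter(_.contains("\"action\":\"yuanzhou\"")).count   // records carrying this action
    scala> file.map(_.length).reduce(_ + _)                           // total characters across all lines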

Shark installation:

Shark 0.9.0 requires:

        Scala 2.10.3
        AMPLab's Hive 0.11
        Spark 0.9.x

Download:

         https://github.com/amplab/shark/archive/v0.9.0.zip

          wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz

Build:

      ./sbt/sbt package

  

Configuration

Edit /shark/conf/shark-env.sh:

export SPARK_MEM=4g
export SCALA_HOME="/etc/scala-2.10.3"
export SHARK_MASTER_MEM=1g
export HIVE_CONF_DIR="/etc/hive"
export HADOOP_HOME="/etc/hadoop"
export SPARK_HOME="/app/spark-0.9"

Testing:

    ./bin/shark-withinfo    # Shark CLI that also prints INFO-level logs

    CREATE TABLE src(key INT, value STRING);
    LOAD DATA LOCAL INPATH '/app/shark/data/examples/files/kv1.txt' INTO TABLE src;
    SELECT COUNT(1) FROM src;    
    -- a table whose name ends in "_cached" is kept in memory by Shark
    CREATE TABLE src_cached AS SELECT * FROM src;
    SELECT COUNT(1) FROM src_cached;
    The same statements can also be run in Hive; Shark executes them tens of times faster.

Shell usage:

    ./bin/shark -H                        # print help
    ./bin/shark -e "SELECT * FROM src"    # run a single query
    ./bin/shark -i file.hql               # run an initialization script first

Server mode:

    ./bin/shark --service sharkserver2
    $ bin/beeline
    beeline > !connect jdbc:hive2://localhost:10000/default
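
The same endpoint can also be reached programmatically over JDBC. A minimal sketch, assuming the Hive 0.11 JDBC driver (hive-jdbc plus its dependencies) is on the classpath; SharkServer2 speaks the HiveServer2 protocol, which is why the stock Hive driver applies:

    import java.sql.DriverManager

    object SharkJdbcDemo {
      def main(args: Array[String]) {
        // register the HiveServer2 JDBC driver shipped in the hive-jdbc artifact
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // same URL as the beeline example; empty user/password on an unsecured setup
        val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        val rs = stmt.executeQuery("SELECT COUNT(1) FROM src")
        while (rs.next()) {
          println("count = " + rs.getLong(1))
        }
        rs.close(); stmt.close(); conn.close()
      }
    }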

Data analysis:

    ./bin/shark-shell
    scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
    scala> println(youngUsers.count)
    scala> val featureMatrix = youngUsers.map(extractFeatures(_))
    scala> kmeans(featureMatrix)
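
`extractFeatures` and `kmeans` above are user-defined, not part of Shark's API. Purely as an illustration of the shape they would take (the case class, its fields, and the clustering hook below are all hypothetical):

    // Hypothetical stand-ins for the user code referenced above.
    // In shark-shell the inputs would be the query's result rows rather
    // than a case class, but the mapping is the same idea.
    case class User(age: Int, income: Double)

    // turn one record into a numeric feature vector
    def extractFeatures(u: User): Array[Double] =
      Array(u.age.toDouble, u.income)

    // kmeans then stands for any clustering routine over RDD[Array[Double]],
    // e.g. the SparkKMeans program shipped with Spark's examples.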