Hulb's Homepage

Welcome to My Website

This is the place where I share my ideas.You can also visit me on @CSDN or @oschina.

Give Chapter and Verse and Author for Transshipment, copy or other using. Thanks!

Spark性能调优之——在实际项目中分配更多的资源

2016-08-31
分配更多资源：

性能调优的王道，就是增加和分配更多的资源，性能和速度上提升，是显而易见的，基本上，在一定范围内，增加资源与性能的提升，是成正比的，

写完一个复杂的spark作业之后，进行性能调优的时候

首先第一步，我决定就是要来调节最优的资源配置，在这个基础之上，如果说你的spark作业，能够分配的资源达到你的能力范围的顶端之后，无法分配更多资源了，公司资源有限，那么才是考虑去做后面的这些性能调优的点。

1.分配哪些资源？

2.在哪里分配资源？

3.调节到多大，算是最大呢？

4.为什么分配了这些资源以后，性能会得到提升？

答案：

1.分配哪些资源？
```
executor
cpu per executor
memory per executor
driver memory
```
2.在哪里分配资源？
```
在我们生产环境中，提交spark作业时，用的spark-submit shell脚本，里面调整对应的参数。 ``` /usr/local/spark/bin/spark-submit \ --class cn.spark.sparktest.core.WordCountCluster \ --num-executors 3 \  配置executor的数量 --driver-memory 100m \  配置driver的内存（影响不大） --executor-memory 100m \  配置每个executor的内存大小 --executor-cores 3 \  配置每个executor的cpu core数量 /usr/local/SparkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar \ ```
```
3.调节到多大，算是最大呢？

第一种

Spark Standalone 公司集群上，搭建一套spark 集群，你心里应该清楚每台机器能够给你使用的，大概有多少内存。多少cpu core ，那么，设置的时候，就根据这个实际的情况，去调节每个spark作业的资源分配。比如说你的

每台机器能够给你使用4G内存，2个cpu core； 20台机器；总共： 80G内存，40core。

Executor 20个；所以：80/ 20 = 4G 内存。。。。4G内存， 40/20 = 2 所以 2个cpu core，平均每个executor。

第二种

Yarn资源队列，资源调度。应该去查看你的spark作业要提交到的资源队列，大概有多少资源？ 500G内存，100个cpu core；

executor 50；因此计算得到每个executor 10G内存； 2个 core；

时间记录“12.48‘
Read All
Spark大数据常见错误分享总结（来自苏宁）

2016-08-30
使用案例机器学习
- 1.商品特征未读降维：SVD 、PCA
- 2.商品挂错页面检查：TF-IDF、SVM、Logistic - -Regression
- 3.相关推荐算法模型训练：Loginistic Regression、kmeans、SVM
- 4.商品爆品预测：Loginistic Regression
- 5.关联性分析：FPGrowth
- 6.开发了基于Mllib的机器学习平台
经验分享

用户常见错误

1. 问题：Collect 大量数据到Driver端，导致driver oom；算法开发的时候没有注意

解决：driver不能堆积大量数据，尽量不要在driver保存数据

2. 问题：维表数据没用cache内存或者repartition数目太多解决：将维表数据cache到内存，分区数目不能太多

3. 问题：未对Spark的持久化级别进行选择，需要根据实际的业务需求进行选择解决：统计RDD的数据量，大数据量将Memory_AND_DISK作为首选

4. 问题：读写DB没有设置合理的分区数目，并发量太高，影响业务解决：统计DB的表分区结构，监控DB服务load，压测到位

5. 问题：Spark使用Hbase scan性能不稳定解决：Get性能相对稳定，尽量使用Get

6. 问题：History server 重启需要回放180G日志，需要4个小时，新完成的app在History server无法立即看到解决：改为多线程会放 SPARK-13988

7. 问题经常回出现class not found ，但是class文件再包里面存在解决办法打印classloadder分析，建议不要轻易修改源码classloader

8. PCA算法只能支持小于14W feature特性解决办法使用SVD进行降维

9. 问题 FPGrowth不支持 KryoSerializer 解决办法 1.6.2 之前使用java Serializer

10. Spark在使用JDBC接口建立DataFrame时，需通过执行SQL来获取该JDBC数据源的Schema，导致创建大量的DataFrame的时候非常耗时

解决办法：Schema相同的table可以不用重复获取schema 地址：https://github.com/ouyangshourui/SparkJDBCSchema/wiki 4000个DataFrame的初始化时间从原先的25分钟缩短为10分钟以内
Read All

Spark 之dataframe与rdd 转换

2016-08-30

DataFrame可以从结构化文件、hive表、外部数据库以及现有的RDD加载构建得到。具体的结构化文件、hive表、外部数据库的相关加载可以参考其他章节。这里主要针对从现有的RDD来构建DataFrame进行实践与解析。

Spark SQL 支持两种方式将存在的RDD转化为DataFrame。

第一种方法是使用反射来推断包含特定对象类型的RDD的模式。在写Spark程序的同时，已经知道了模式，这种基于反射的方法可以使代码更简洁并且程序工作得更好。

第二种方法是通过一个编程接口来实现，这个接口允许构造一个模式，然后在存在的RDD上使用它。虽然这种方法代码较为冗长，但是它允许在运行期间之前不止列以及列的类型的情况下构造DataFrame。

DataFrame是一个带有列名的分布式数据集合。等同于一张关系型数据库中的表或者R/Python中的data frame，不过在底层做了很多优化；我们可以使用结构化数据文件、Hive tables，外部数据库或者RDDS来构造DataFrames。

一、利用反射推断Schema

Spark SQL能够将含Row对象的RDD转换成DataFrame，并推断数据类型。通过将一个键值对（key/value）列表作为kwargs传给Row类来构造Rows。key定义了表的列名，类型通过看第一列数据来推断。（所以这里RDD的第一列数据不能有缺失）未来版本中将会通过看更多数据来推断数据类型，像现在对JSON文件的处理一样。

二、编程指定Schema

通过编程指定Schema需要3步：

1.从原来的RDD创建一个元祖或列表的RDD。

2.用StructType 创建一个和步骤一中创建的RDD中元祖或列表的结构相匹配的Schema。

3.通过SQLContext提供的createDataFrame方法将schema 应用到RDD上。

##三、实战

1.使用Java实战RDD与DataFrame转换

简单介绍：动态构造有时候有些麻烦：spark开发了一个API就是DataSet ，DataSet可以基于RDD，RDD里面有类型。他可以基于这种类型。 sparkSQL+DataFrame+DataSet:三者都相当重要，在2.0的时候编码会使用大量使用DataSet。 DataSet上可以直接查询。Spark的核心RDD+DataFrame+DataSet:最终会形成三足鼎立。RDD实际是服务SparkSQL的。DataSet是想要用所有的子框架都用DataSet进行计算。DataSet的底层是无私计划。这就让天然的性能优势体现出来。官方建议使用hiveContext，在功能上比SQLContext的更好更高级的功能。

 public class RDD2DataFrameByProgrammatically {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setMaster("local").setAppName("RDD2DataFrameByProgrammatically");
		JavaSparkContext sc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(sc);
		JavaRDD<String> lines = sc.textFile("E://test.txt");	
		/** * 第一步：在RDD的基础上创建类型为Row的RDD */
		JavaRDD<Row> personsRDD = lines.map(new Function<String, Row>() {
			@Override
			public Row call(String line) throws Exception {
				String[] splited = line.split(",");
				return RowFactory.create(Integer.valueOf(splited[0]), splited[1],Integer.valueOf(splited[2]));
			}
		});
		/*** 第二步：动态构造DataFrame的元数据，一般而言，有多少列以及每列的具体类型可能来自于JSON文件，也可能来自于DB */
		List<StructField> structFields = new ArrayList<StructField>();
		structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
		structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
		structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
		//构建StructType，用于最后DataFrame元数据的描述
		StructType structType =DataTypes.createStructType(structFields);
		
		/*** 第三步：基于以后的MetaData以及RDD<Row>来构造DataFrame*/
		DataFrame personsDF = sqlContext.createDataFrame(personsRDD, structType);
		/** 第四步：注册成为临时表以供后续的SQL查询操作*/
		personsDF.registerTempTable("persons");
		/** 第五步，进行数据的多维度分析*/
		DataFrame result = sqlContext.sql("select * from persons where age >20");
		/**第六步：对结果进行处理，包括由DataFrame转换成为RDD<Row>，以及结构持久化*/
		List<Row> listRow = result.javaRDD().collect();
		for(Row row : listRow){
			System.out.println(row);
		}
	}
}

2.使用Scala实战RDD与DataFrame转换

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext


class RDD2DataFrameByProgrammaticallyScala {
  def main(args:Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("RDD2DataFrameByProgrammaticallyScala") //设置应用程序的名称，在程序运行的监控界面可以看到名称
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val people = sc.textFile("C://Users//DS01//Desktop//persons.txt")
    val schemaString = "name age"
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.{StructType,StructField,StringType};
    val schema = StructType(schemaString.split(" ").map(fieldName=>StructField(fieldName,StringType,true)))
    val rowRDD = people.map(_.split(",")).map(p=>Row(p(0),p(1).trim))
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD,schema)
    peopleDataFrame.registerTempTable("people")
    val results = sqlContext.sql("select name from people")
    results.map(t=>"Name: "+t(0)).collect().foreach(println)
  }
}

Read All

Spark二次排序学习总结

2016-08-17

1. 二次排序

Spark二次排序，即组装一个新的key并在这个key里实现排序接口所定义的方法。

例如一组数据：（点击次数，下单次数，支付次数） A:(30,35,40) B:(35,35,40) C:(30,38,40) D:(35,35,45)

需要分别对点击次数，下单次数，支付次数做比较。比较完35【点击次数】相等，则要对【下单次数】二次比较，若【下单次数】还是相等，则要对【支付次数再次比较】直到返回正确比较结果。

二次排序即需要自定义key 以及比较方法并返回比较结果。

2.Java代码实现如下：

import java.io.Serializable;

import scala.math.Ordered;

/**
 * 品类二次排序key
 * 
 * 封装你要排序的那几个字段
 * 
 * 实现ordered接口要求的几个方法
 * 
 * 跟其他key相比，如何来判定大于，大于等于，小于，小于等于
 * 
 * 必须实现序列化接口(否则会报错)
 * 
 * @author lxh
 *
 */
public class CategorySortKey implements Ordered<CategorySortKey> ,Serializable{

	private long clickCount;
	private long orderCount;
	private long payCount;
	
	

	
	
	public CategorySortKey(long clickCount, long orderCount, long payCount) {
		super();
		this.clickCount = clickCount;
		this.orderCount = orderCount;
		this.payCount = payCount;
	}


	/**
	 * 大于
	 */
	@Override
	public boolean $greater(CategorySortKey other) {
		
		if(clickCount > other.getClickCount()){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount>other.getOrderCount()){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount == other.getOrderCount() &&
				payCount > other.getPayCount()){
			return true;
		}
		return false;
	}

	
	/**
	 * 大于等于
	 */
	@Override
	public boolean $greater$eq(CategorySortKey other) {
		if($greater(other)){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount == other.getOrderCount() &&
				payCount == other.getPayCount()){
			return true;
		}
		return false;
	}

	/**
	 * 小于
	 */
	@Override
	public boolean $less(CategorySortKey other) {
		if(clickCount < other.getClickCount()){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount < other.getOrderCount()){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount == other.getOrderCount() &&
				payCount < other.getPayCount()){
			return true;
		}
		return false;
	}

	/**
	 * 小于等于
	 */
	@Override
	public boolean $less$eq(CategorySortKey other) {
		if($less(other)){
			return true;
		}else if(clickCount == other.getClickCount() &&
				orderCount == other.getOrderCount() &&
				payCount == other.getPayCount()){
			return true;
		}
		return false;
	}

	@Override
	public int compare(CategorySortKey other) {
		if(clickCount-other.getClickCount()!=0){
			return (int)(clickCount-other.getClickCount());
		}else if(orderCount - other.getOrderCount()!=0){
			return (int)(orderCount - other.getOrderCount());
		}else if(payCount - other.getPayCount()!=0){
			return (int)(payCount - other.getPayCount());
		}
		return 0;
	}

	@Override
	public int compareTo(CategorySortKey other) {
		if(clickCount-other.getClickCount()!=0){
			return (int)(clickCount-other.getClickCount());
		}else if(orderCount - other.getOrderCount()!=0){
			return (int)(orderCount - other.getOrderCount());
		}else if(payCount - other.getPayCount()!=0){
			return (int)(payCount - other.getPayCount());
		}
		return 0;
	}
	
	
	
	
	public final long getClickCount() {
		return clickCount;
	}

	public final void setClickCount(long clickCount) {
		this.clickCount = clickCount;
	}

	public final long getOrderCount() {
		return orderCount;
	}

	public final void setOrderCount(long orderCount) {
		this.orderCount = orderCount;
	}

	public final long getPayCount() {
		return payCount;
	}

	public final void setPayCount(long payCount) {
		this.payCount = payCount;
	}

}

3.Scala代码实现：

/**
  * 自定义的key
  * 
 * @author lxh
  *
 */
class SessionSortKey(val clickCount:Int, val orderCount:Int,val payCount:Int) extends Ordered[SessionSortKey] with Serializable{


  def compare(other:SessionSortKey):Int = {
    if(this.clickCount - other.clickCount != 0){
      this.clickCount - other.clickCount
    }else if(this.orderCount - other.orderCount!= 0){
      this.orderCount - other.orderCount
    }else if( this.payCount - other.payCount != 0){
      this.payCount - other.payCount
    }else{
      /**
        * 一定要有else！！！
        */
      0
    }
  }
}

Scala测试类：

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by lxh on 2016/8/17.
  */
object SessionSortKeyTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("sortKeyTest").setMaster("local")

    val sc = new SparkContext(conf)

    val ar = Array(new Tuple2(new SessionSortKey(30,35,40),"1"),
    new Tuple2(new SessionSortKey(35,35,40),"2"),
    new Tuple2(new SessionSortKey(30,38,40),"3"))


    val rdd = sc.parallelize(ar,10)
    val sortedRdd = rdd.sortByKey(false)

    for(tuple <- sortedRdd.collect()){
      println(tuple._2)
    }

    /**
      * 输出结果
      * 2
      * 3
      * 1
      * 正确
      */


  }

}

Read All

Spark自定义累加器的使用

2016-08-07

1.为什么要使用自定义累加器

前文讲解过spark累加器的简单使用：http://blog.csdn.net/lxhandlbb/article/details/51931713

但是若业务较为复杂,需要使用多个广播变量时，就会使得程序变得非常复杂，不便于扩展维护，因此可以考虑自定义累加器。

##2.怎么使用自定义累加器

###Java版本：

package com.luoxuehuan.sparkproject.spark;


import org.apache.spark.AccumulatorParam;

import com.luoxuehuan.sparkproject.constant.Constants;
import com.luoxuehuan.sparkproject.util.StringUtils;

/**
 * 
 * @author lxh
 * 
 * AccumulatorParam<String>
 * String 针对 String格式 进行分布式计算
 * 也可以用自己的model ，但必须是可以序列化的！
 * 然后基于这种特殊的数据格式，可以实现自己复杂的分布式计算逻辑
 * 
 * 各个task 分布式在运行，可以根据你需求，task给Accumulator传入不同的值。
 * 
 * 根据不同的值，去做复杂的逻辑。
 * Spark Core里面很实用的高端技术！
 * 
 * 
 *
 */
public class SessionAggrAccumulator implements AccumulatorParam<String> {

	/**
	 * 
	 */
	private static final long serialVersionUID = 8528303091681331462L;
	/**
	 * Zoro方法，其实主要用于数据的初始化
	 * 那么，我们这里，就返回一个值，就是初始化中，所有范围区间的数量，多少0
	 * 
	 * 各个范围区间的统计数量的拼接，还是采用|分割。
	 */
	@Override
	public String zero(String v) {
		return Constants.SESSION_COUNT + "=0|"
				+ Constants.TIME_PERIOD_1s_3s + "=0|"
				+ Constants.TIME_PERIOD_4s_6s + "=0|"
				+ Constants.TIME_PERIOD_7s_9s + "=0|"
				+ Constants.TIME_PERIOD_10s_30s + "=0|"
				+ Constants.TIME_PERIOD_30s_60s + "=0|"
				+ Constants.TIME_PERIOD_1m_3m + "=0|"
				+ Constants.TIME_PERIOD_3m_10m + "=0|"
				+ Constants.TIME_PERIOD_10m_30m + "=0|"
				+ Constants.TIME_PERIOD_30m + "=0|"
				+ Constants.STEP_PERIOD_1_3 + "=0|"
				+ Constants.STEP_PERIOD_4_6 + "=0|"
				+ Constants.STEP_PERIOD_7_9 + "=0|"
				+ Constants.STEP_PERIOD_10_30 + "=0|"
				+ Constants.STEP_PERIOD_30_60 + "=0|"
				+ Constants.STEP_PERIOD_60 + "=0";
	}

	
	/**
	 * 这两个方法可以理解为一样的。
	 * 这两个方法，其实主要就是实现，v1可能就是我们初始化的那个连接串
	 * v2，就是我们在遍历session的时候，判断出某个session对应的区间，然后会用Constants.TIME_PERIOD_1s_3s
	 * 所以，我们，要做的事情就是
	 * 在v1中，找到v2对应的value，累加1，然后再更新回连接串里面去
	 */
	@Override
	public String addInPlace(String v1, String v2) {
		return add(v1, v2);
	}
	
	@Override
	public String addAccumulator(String v1, String v2) {
		return add(v1, v2);
	}
	
	/**
	 * session统计计算逻辑。
	 * @param v1 连接串
	 * @param v2 范围区间
	 * @return 更新以后的连接串
	 */
	private String add(String v1,String v2){
		//校验：v1位空的话，直接返回v2
		if(StringUtils.isEmpty(v1)) {
			return v2;
		}
		
		// 使用StringUtils工具类，从v1中，提取v2对应的值，并累加1
		String oldValue = StringUtils.getFieldFromConcatString(v1, "\\|", v2);
		if(oldValue != null) {
			// 将范围区间原有的值，累加1
			int newValue = Integer.valueOf(oldValue) + 1;
			// 使用StringUtils工具类，将v1中，v2对应的值，设置成新的累加后的值
			return StringUtils.setFieldInConcatString(v1, "\\|", v2, String.valueOf(newValue));  
		}
		
		return v1;
	}

}

Scala版本：

package com.core

import com.util.{Constants, StringUtils}
import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

/**
  * Created by lxh on 2016/8/7.
  */
object SessionAggrStatAccumulatorTest {

  def main(args: Array[String]) {

    object SessionAggrStatAccumulator extends AccumulatorParam[String]{

      /**
        * 初始化和方法
        * @param initialValue
        * @return
        */
      def zero(initialValue:String):String = {
        Constants.SESSION_COUNT + "=0|" + Constants.TIME_PERIOD_1s_3s + "=0|" + Constants.TIME_PERIOD_4s_6s + "=0|" + Constants.TIME_PERIOD_7s_9s + "=0|" + Constants.TIME_PERIOD_10s_30s + "=0|" + Constants.TIME_PERIOD_30s_60s + "=0|" + Constants.TIME_PERIOD_1m_3m + "=0|" + Constants.TIME_PERIOD_3m_10m + "=0|" + Constants.TIME_PERIOD_10m_30m + "=0|" + Constants.TIME_PERIOD_30m + "=0|" + Constants.STEP_PERIOD_1_3 + "=0|" + Constants.STEP_PERIOD_4_6 + "=0|" + Constants.STEP_PERIOD_7_9 + "=0|" + Constants.STEP_PERIOD_10_30 + "=0|" + Constants.STEP_PERIOD_30_60 + "=0|" + Constants.STEP_PERIOD_60 + "=0";
      }
      /**
        * 其次需要实现一个累加方法
        * @param v1
        * @param v2
        * @return
        */
      def addInPlace(v1:String,v2:String):String  = {

        /**
          * 如果初始值为空,那么返回v2
          */
        if(v1 ==""){
          v2
        }else{
          /**
            * 从现有的连接串中提取v2所对应的值
            */
          val oldValue = StringUtils.getFieldFromConcatString(v1,"\\|",v2)
          /**
            * 累加1
            */
          val newValue = Integer.valueOf(oldValue)+1
          /**
            * 改链接串中的v2设置新的累加后的值
            */
          StringUtils.setFieldInConcatString(v1,"\\|",v2,String.valueOf(newValue))
        }
      }
    }


    val sparkConf  = new SparkConf().setAppName("accutest").setMaster("local")

    val sc = new SparkContext(sparkConf)
    /**
      * 使用Accumulator()()方法（curry）,创建自定义的accumulator
      * 柯里化
      */
    val sessionAggrStatAccumulator = sc.accumulator("")(SessionAggrStatAccumulator)
    val arr  = Array(Constants.TIME_PERIOD_1s_3s , Constants.TIME_PERIOD_4s_6s)
    val rdd = sc.parallelize(arr,1)
    rdd.foreach{sessionAggrStatAccumulator.add(_)}
    println(sessionAggrStatAccumulator.value)
  }

}

Read All

6/12

Welcome to My Website

Spark性能调优之——在实际项目中分配更多的资源

分配更多资源：

1.分配哪些资源？

2.在哪里分配资源？

3.调节到多大，算是最大呢？

4.为什么分配了这些资源以后，性能会得到提升？

答案：

1.分配哪些资源？

2.在哪里分配资源？

3.调节到多大，算是最大呢？

第一种

第二种

Spark大数据常见错误分享总结（来自苏宁）

使用案例 机器学习

经验分享

用户常见错误

Spark 之dataframe与rdd 转换

一、利用反射推断Schema

二、编程指定Schema

1.从原来的RDD创建一个元祖或列表的RDD。

2.用StructType 创建一个和步骤一中创建的RDD中元祖或列表的结构相匹配的Schema。

3.通过SQLContext提供的createDataFrame方法将schema 应用到RDD上。

1.使用Java实战RDD与DataFrame转换

2.使用Scala实战RDD与DataFrame转换

Spark二次排序学习总结

1. 二次排序

2.Java代码实现如下：

3.Scala代码实现：

Spark自定义累加器的使用

1.为什么要使用自定义累加器

Scala版本：

使用案例机器学习