Hulb's Homepage

Welcome to My Website

This is the place where I share my ideas. You can also visit me on @CSDN or @oschina.

Please credit the author and cite the source when reposting, copying, or otherwise reusing this content. Thanks!

Some big data interview questions I was asked recently

2016-12-10
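Company A:
- 1.Talk about the projects you have worked on. What were the hard parts, the key points, and the things to watch out for?
- 2.Talk about multithreading. How would you implement a thread pool yourself?
- 3.Explain the principles and mechanisms of MapReduce or HDFS. (Map reads the input splits.)
- 4.What is shuffle? How do you tune it?
- 5.What language is the project written in? Scala? What are Scala's characteristics? How does it differ from Java?
- 6.How are your theoretical foundations, for example data structures: quicksort, or trees? Tell me what you know about trees.
- 7.How is your math?
- 8.Talk about databases: SQL, left outer join, how it works and how it is implemented.
- 9.What else do you know about data? Database engines?
- 10.How are racks configured in Hadoop?
- 11.What have you learned about HBase design?
- 12.Do you operate HBase through an API or through some tool?
- 13.How do you understand scheduling? Do you use any tool for it?
- 14.Do you use a tool like Kettle, or write your own programs? How does your company do it?
- 15.How long is your data center's development cycle?
- 16.What kind of data do you store in HBase?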
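Second round. Three interviewers.
- 1.Talk about the projects you have worked on.
- 2.How do you usually handle multithreading? How do you think about asynchrony? How did you deal with locks, for example when two people operate on the same thing at the same time? How do you handle variables that are touched by concurrent operations?
- 3.You mostly use HTTP, right? Any special headers? Talk about your understanding of TCP/IP.
- 4.Have you used ZooKeeper? What are its use cases? (HA, state maintenance, distributed locks, global configuration file management.) What do you use to operate ZooKeeper?
Spark:
- 5.Spark development has two sides? Which two?
- 6.Take an operation that reads a file on HDFS and counts its lines: can you walk through the process? Is the count computed in memory or on disk? On disk.
- 7.Is Spark faster than MapReduce? Why, and where does the speed come from? 1. In-memory iteration. 2. The RDD design. 3. The operator design.
- 8.Why is Spark SQL faster than Hive?
- 10.What does the RDD data structure look like? An array of Partitions, plus dependencies.
- 11.The Hadoop ecosystem: describe your understanding. HDFS for underlying storage, HBase as the database, Hive as the data warehouse, ZooKeeper for distributed locks, Spark for big data analysis.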
Company B:
- 1.The workflow of a Spark job.
```
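Submitting a job.
The user submits a job. The entry point is sc (the SparkContext). sc creates a TaskScheduler; depending on the submit mode, the corresponding TaskSchedulerImpl handles task scheduling.
It also creates the SchedulerBackend and the DAGScheduler. The DAGScheduler divides the job into stages according to the RDDs' wide and narrow dependencies. Once divided, the tasks are put into a TaskSet and handed to the TaskScheduler.
The AppClient registers with the Master. Data locality is evaluated first, and the best available locality level is preferred for execution.
Executors are spread out, and the appropriate Executor is chosen to run each task. The ExecutorRunner creates the CoarseGrainedExecutorBackend process, which runs tasks through a thread pool.

In the reverse direction:
The Executor registers back with the SchedulerBackend.

In Spark on YARN mode, the driver handles computation scheduling and the ApplicationMaster handles resource requests.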
```
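The stage-division step in the notes above can be sketched in plain Java, without Spark. This is a toy model under stated assumptions: the lineage is a simple linear chain, `Op` and `splitIntoStages` are hypothetical names invented for the illustration, and each operation is just flagged as wide (needs a shuffle) or narrow. The real DAGScheduler walks a dependency graph backwards from the final RDD, so treat this only as a picture of the rule "a wide dependency starts a new stage".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StageSplitSketch {
    /** One transformation in a linear lineage (hypothetical model, not a Spark class). */
    static final class Op {
        final String name;
        final boolean wide; // true if the operation needs a shuffle

        Op(String name, boolean wide) {
            this.name = name;
            this.wide = wide;
        }
    }

    /** Cuts the lineage into stages: every wide dependency opens a new stage. */
    static List<List<String>> splitIntoStages(List<Op> lineage) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Op op : lineage) {
            if (op.wide && !current.isEmpty()) {
                // Shuffle boundary: close the running stage before this operation.
                stages.add(current);
                current = new ArrayList<>();
            }
            current.add(op.name);
        }
        if (!current.isEmpty()) {
            stages.add(current);
        }
        return stages;
    }

    public static void main(String[] args) {
        List<Op> lineage = Arrays.asList(
            new Op("textFile", false),
            new Op("flatMap", false),
            new Op("map", false),
            new Op("reduceByKey", true),
            new Op("map", false));
        // Narrow operations are pipelined together; reduceByKey begins the second stage.
        System.out.println(splitIntoStages(lineage));
    }
}
```

Narrow operations stay in one stage because each output partition depends on a single input partition, so they can run back to back inside one task.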
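- 2.The process of an HBase PUT.
- 3.Inside an RDD operator you modify an external map, for example by putting data into it, and then you iterate over the map outside the operator. Is there any problem with that?
- 4.The shuffle process, and how to tune it.
- 5.The numbers 1 2 3 4 5 6 7 8 9 10 are spread across 5 partitions. Use operators to compute the maximum or the sum. You can't use broadcast variables or accumulators. Or sortByKey.
- 6.Joining a large table with a small table.
- 7.Do you know how Spark reads HBase? Spark on HBase, the one from Huawei.
- 8.Have you built secondary indexes on HBase?
- 9.What are the advantages of sort shuffle?
- 10.How are stages divided? What are wide and narrow dependencies?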
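For question 5, one possible answer is to reduce each partition locally and then merge the partial results, which is roughly what a `reduce` action does and needs neither broadcast variables nor accumulators. The sketch below simulates that in plain Java with no Spark dependency. The way the ten numbers are laid out across the five partitions is an assumption (the question does not say), and `aggregate` is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.function.IntBinaryOperator;

public class PartitionAggregateSketch {
    /** Folds every partition into a partial value, then merges the partials. */
    static int aggregate(List<List<Integer>> partitions, IntBinaryOperator op) {
        OptionalInt result = OptionalInt.empty();
        for (List<Integer> partition : partitions) {
            // Executor side: reduce one partition to a single partial value.
            OptionalInt partial = partition.stream().mapToInt(Integer::intValue).reduce(op);
            // Driver side: merge partials; an empty partition contributes nothing.
            if (partial.isPresent()) {
                result = result.isPresent()
                    ? OptionalInt.of(op.applyAsInt(result.getAsInt(), partial.getAsInt()))
                    : partial;
            }
        }
        return result.orElseThrow(() -> new IllegalArgumentException("empty dataset"));
    }

    public static void main(String[] args) {
        // Assumed layout: two numbers per partition.
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(3, 4), Arrays.asList(5, 6),
            Arrays.asList(7, 8), Arrays.asList(9, 10));
        System.out.println("sum = " + aggregate(partitions, Integer::sum)); // sum = 55
        System.out.println("max = " + aggregate(partitions, Math::max));    // max = 10
    }
}
```

The approach only works because sum and max are associative and commutative, so the order in which partial results arrive does not matter.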
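Company W:
- 1.Talk about a project you have worked on (the overall approach).
- 2.General background questions: the company's cluster size, HBase data volume, data scale.
- 3.Then they picked the data factory project and asked about it in detail. Questions on HBase, plus some small talk.
- 4.What is secondary sort? What is top-N? Which interface do you have to implement for secondary sort?
- 5.Where does the data you compute on come from?
- 6.What is the Kafka direct stream? Why use it, what are its advantages, and how does it differ from the alternatives?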
```
http://blog.csdn.net/erfucun/article/details/52275369

  /**
   * Create an input stream that directly pulls messages from Kafka Brokers
   * without using any receiver. This stream can guarantee that each message
   * from Kafka is included in transformations exactly once (see points below).
   *
   * Points to note:
   *  - No receivers: This stream does not use any receiver. It directly queries Kafka
   *  - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
   *    by the stream itself. For interoperability with Kafka monitoring tools that depend on
   *    Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
   *    You can access the offsets used in each batch from the generated RDDs (see
   *    [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
   *  - Failure Recovery: To recover from driver failures, you have to enable checkpointing
   *    in the [[StreamingContext]]. The information on consumed offset can be
   *    recovered from the checkpoint. See the programming guide for details (constraints, etc.).
   *  - End-to-end semantics: This stream ensures that every records is effectively received and
   *    transformed exactly once, but gives no guarantees on whether the transformed data are
   *    outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
   *    that the output operation is idempotent, or use transactions to output records atomically.
   *    See the programming guide for more details.
   *
   * @param ssc StreamingContext object
   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
   *    configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
   *    host1:port1,host2:port2 form.
   * @param fromOffsets Per-topic/partition Kafka offsets defining the (inclusive)
   *    starting point of the stream
   * @param messageHandler Function for translating each message and metadata into the desired type
   */
```
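The Javadoc quoted above says that end-to-end exactly-once semantics needs either an idempotent output operation or transactions. Below is a minimal plain-Java sketch of the idempotent option, assuming the sink can upsert by key. The `Sink` class, the key format, and the topic name are all hypothetical; an in-memory map stands in for an external store. Each record is keyed by topic, partition, and offset, so replaying a batch after a failure overwrites the same rows instead of adding duplicates.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentSinkSketch {
    /** Stand-in for an external store that supports upsert-by-key (hypothetical). */
    static final class Sink {
        final Map<String, String> rows = new HashMap<>();

        void write(String topic, int partition, long offset, String value) {
            // The key is derived from the record's position in Kafka, so a
            // replayed record lands on the same row it was written to before.
            rows.put(topic + "-" + partition + "-" + offset, value);
        }
    }

    /** Writes a batch whose first record sits at fromOffset (inclusive). */
    static void writeBatch(Sink sink, String topic, int partition, long fromOffset, String[] values) {
        for (int i = 0; i < values.length; i++) {
            sink.write(topic, partition, fromOffset + i, values[i]);
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        String[] batch = {"a", "b", "c"};
        writeBatch(sink, "events", 0, 100L, batch);
        // Simulate recovery from a checkpoint: the same offset range is processed again.
        writeBatch(sink, "events", 0, 100L, batch);
        System.out.println(sink.rows.size()); // 3, not 6
    }
}
```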
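- 7.Asked about the shuffle process.
- 8.How do you tune things? How do you tune the JVM?
- 9.The structure of the JVM? How many regions are there in the heap?
- 10.How do you do data cleaning?
- 11.How do you do data cleaning with Spark?
- 12.Chatted with me about applications of Spark: ad placement in shopping malls, and scalper detection.
- 13.When Spark reads data, how many Partitions are there? One Partition per HDFS block?
- 14.The two modes of Spark on YARN? Client mode and cluster mode?
- 15.JDBC? What is the MySQL driver package called?
- 16.How large does a region get before it splits?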
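Company Q:
- 1.Talk about MapReduce. Explain your understanding of the whole process.
- 2.What rowkey do you use when storing data in HBase? If you add a timestamp, can timestamps collide, and how do you deal with that?
- 3.Spring's two major modules? How are AOP and IoC each used in your project?
- 4.Your cluster size and data volume?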
Spark performance tuning: broadcasting large variables in a real project

2016-12-10
Part of the [数澜 Spark] performance tuning series.

Presented by the 「数澜·大数据」 technical team.

I. Why use broadcast variables?

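1. A Spark application

The Driver process

It is simply the Spark job we write, packaged as a jar and run as the main process.

Without broadcasting, every task receives its own copy of any external variable it uses. Take a 1 MB map (an arbitrary example): 1000 copies are created and sent over the network! Spread over 1000 machines, they take up 1 GB of memory in total.

That is unnecessary network and memory overhead.

2. The bad case that can occur:

Suppose you read some dimension data from a table, say the category information for all products, and use it inside an operator function, and it amounts to 100 MB.

With 1000 tasks, 100 GB of data has to travel over the network, and cluster performance drops instantly.

3. Solution:

If tasks use a large variable (1 MB to 100 MB) and you already know it will be costly, what should you do?

Use a broadcast variable:
- 1.A broadcast variable keeps an initial copy on the Driver. Each executor has its own BlockManager!
- 2.When a task runs and needs the data in the broadcast variable, it first looks in the BlockManager of its local Executor. If the data is not there, the BlockManager pulls the map from the Driver (or possibly from the BlockManager of an Executor on a nearby node, which is more efficient!).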
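The saving can be put into numbers with a small plain-Java simulation, no Spark required. It is a sketch under stated assumptions: the executor count of 50 and the round-robin placement of tasks are invented for the illustration, and a boolean array stands in for each executor's BlockManager cache. The 1 MB variable and the 1000 tasks come from the example above.

```java
public class BroadcastCostSketch {
    /** Without broadcast: every task is shipped its own copy of the variable. */
    static long bytesShippedPerTask(long variableBytes, int tasks) {
        return variableBytes * tasks;
    }

    /** With broadcast: an executor fetches the variable once, then serves it from its local cache. */
    static long bytesShippedWithBroadcast(long variableBytes, int tasks, int executors) {
        boolean[] cached = new boolean[executors]; // stand-in for each executor's BlockManager
        long shipped = 0;
        for (int task = 0; task < tasks; task++) {
            int executor = task % executors; // assumption: round-robin task placement
            if (!cached[executor]) {
                // First task on this executor: fetch from the driver or a nearby executor.
                cached[executor] = true;
                shipped += variableBytes;
            }
            // Later tasks on the same executor read the cached copy, with no network traffic.
        }
        return shipped;
    }

    public static void main(String[] args) {
        long oneMb = 1L << 20;
        int tasks = 1000;
        int executors = 50; // assumed cluster size
        System.out.println(bytesShippedPerTask(oneMb, tasks) / oneMb + " MB");                  // 1000 MB
        System.out.println(bytesShippedWithBroadcast(oneMb, tasks, executors) / oneMb + " MB"); // 50 MB
    }
}
```

In a real job the variable would be wrapped on the driver with `sc.broadcast(...)` and read inside operators through `.value()`; the traffic then scales with the number of executors rather than the number of tasks.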
Spark performance tuning: using Kryo serialization in a real project

2016-09-22

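I. Java's serialization mechanism

Serialization is done through the ObjectOutputStream/ObjectInputStream object input/output stream mechanism.

The advantage of this default mechanism is convenience: you don't have to do anything by hand, as long as the variables used inside operators implement the Serializable interface.

The drawback is that the default mechanism is inefficient: serialization is fairly slow, and the serialized data still takes up a relatively large amount of memory.

The serialization format can be optimized manually.

Spark supports the Kryo serialization mechanism. Kryo is faster than the default Java serialization and produces smaller output, roughly 1/10 the size of Java's.

This reduces both the data transferred and the memory consumed.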
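To see why the default format is bulky, the sketch below serializes one small object twice using only the JDK: once with ObjectOutputStream, and once with a hand-written compact encoding. The compact encoding is not Kryo; it only stands in for the general idea of a leaner format, and the `Category` class is a hypothetical example type. Default Java serialization writes a stream header and a class descriptor along with the field values, which is where most of the extra bytes come from.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSizeSketch {
    /** Hypothetical record type, e.g. one row of product-category data. */
    static final class Category implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;

        Category(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Size in bytes under default Java serialization. */
    static int javaSerializedSize(Category c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Size in bytes when only the field values are written (no class metadata). */
    static int compactSize(Category c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(c.id);    // 4 bytes
            out.writeUTF(c.name);  // 2-byte length prefix + encoded characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Category c = new Category(7, "books");
        System.out.println("java serialization: " + javaSerializedSize(c) + " bytes");
        System.out.println("compact encoding:   " + compactSize(c) + " bytes");
    }
}
```

In Spark itself, Kryo is switched on by setting `spark.serializer` to `org.apache.spark.serializer.KryoSerializer`; registering the classes you serialize lets Kryo avoid writing full class names.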

Hadoop Maven project error: Missing artifact jdk.tools:jdk.tools:jar:1.6

2016-09-22

Error:

pom.xml reports: Missing artifact jdk.tools:jdk.tools:jar:1.6

Fix:

Add:

    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.6</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>

HDFS file operations: FileSystem API error, copyToLocalFile NullPointerException

2016-09-22

Error:

Exception in thread "main" java.lang.NullPointerException
	at java.lang.ProcessBuilder.start(Unknown Source)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
	at org.apache.hadoop.util.Shell.run(Shell.java:455)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:774)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:646)
	at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:472)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:460)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:365)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1968)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1937)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1913)
	at com.beifeng.TestCopy.myCopyToLocal(TestCopy.java:44)
	at com.beifeng.TestCopy.main(TestCopy.java:15)

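Fix:

// Before (throws the NullPointerException above):
// fs.copyToLocalFile(new Path("/hadoop/put/111.txt"), new Path("e:/txt/copyFormHDFS.txt"));
// After: delSrc = false, useRawLocalFileSystem = true (bypasses the checksum local file system).
fs.copyToLocalFile(false, new Path("/hadoop/put/111.txt"), new Path("e:/txt/copyFormHDFS.txt"), true);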
