The reference resources for this book fall into four groups: books, papers, web resources, and Hadoop JIRA issues. They are listed below.

【Reference Books】

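  • [1] Tom White. Hadoop权威指南 (Hadoop: The Definitive Guide), 2nd ed. Translated by 周敏奇, 王晓玲, 金澈清, and 钱卫宁. Beijing: Tsinghua University Press, 2011.
  • [2] Chuck Lam. Hadoop实战 (Hadoop in Action). Translated by 韩冀中. Beijing: Posts & Telecom Press, 2011.
  • [3] Eric Sammer. Hadoop Operations. O'Reilly Media, 2012.
  • [4] 孙玉琴. Java网络编程精解. Beijing: Publishing House of Electronics Industry, 2007.
  • [5] Ron Hitchens. Java NIO. O'Reilly Media, 2002.
  • [6] George Coulouris, Jean Dollimore, Tim Kindberg. 分布式系统概念与设计 (Distributed Systems: Concepts and Design). Translated by 金蓓弘 et al. Beijing: China Machine Press, 2004.
  • [7] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides. 设计模式:可复用面向对象软件的基础 (Design Patterns: Elements of Reusable Object-Oriented Software). Translated by 李英军 et al. Beijing: China Machine Press, 2000.
  • [8] Eric Freeman, Elisabeth Freeman, Kathy Sierra, Bert Bates. Head First 设计模式 (Head First Design Patterns). O'Reilly. Beijing: China Electric Power Press, 2007.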

【Reference Papers】

  • [1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6. Berkeley, CA, USA: USENIX Association, 2004, pp. 107–113.
  • [2] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29-43, Lake George, New York, 2003.
  • [3] Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, Jens Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. In Serge Abiteboul, Klemens Böhm, Christoph Koch, Kian-Lee Tan, editors, Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany.
  • [4] Matei Zaharia, Andrew Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, Improving MapReduce Performance in Heterogeneous Environments, 8th USENIX Symposium on Operating Systems Design and Implementation, pp. 29-42, San Diego, CA, December, 2008.
  • [5] Quan Chen, Daqiang Zhang, Minyi Guo, Qianni Deng, Song Guo, "SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment," Computer and Information Technology (CIT), 2010 IEEE 10th International Conference.
  • [6] 梁李印, “阿里Hadoop集群架构及服务体系” (Alibaba Hadoop Cluster Architecture and Service System), slides, Hadoop and Big Data Technology Conference (HBTC 2012).
  • [7] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In USENIX NSDI, 2011.
  • [8] Hong Mao, Shengqiu Hu, Zhenzhong Zhang, Limin Xiao, Li Ruan: A Load-Driven Task Scheduler with Adaptive DSC for MapReduce. GreenCom 2011: 28-33.
  • [9] Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, Dhiraj Sehgal. Hadoop Acceleration through Network Levitated Merging. SC11. Seattle, WA.
  • [10] Herodotos Herodotou. Hadoop Performance Models, Technical Report, CS-2011-05, Computer Science Department, Duke University.
  • [11] 连林江, “百度分布式计算技术发展” (The Development of Distributed Computing Technology at Baidu), 2012.07.08.
  • [12] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Job scheduling for multi-user mapreduce clusters,” EECS Department, University of California, Berkeley, Tech. Rep., Apr 2009.
  • [13] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Efficient Fair Scheduling for MapReduce”, PPT.
  • [14] Todd Lipcon, Cloudera, “Optimizing MapReduce Job Performance”, Hadoop Summit 2012.
  • [15] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling,” in Proc. of EuroSys. ACM, 2010, pp. 265-278.
  • [16] Thomas Sandholm and Kevin Lai. Dynamic proportional share scheduling in hadoop. In JSSPP ’10: 15th Workshop on Job Scheduling Strategies for Parallel Processing, 2010.
  • [17] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguade, M. Steinder, and I. Whalley, “Performance-driven task co-scheduling for mapreduce environments,” in Network Operations and Management Symposium (NOMS), 2010 IEEE, 2010, pp. 373–380.
  • [18] Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, and T. N. Vijaykumar, “MapReduce with Communication Overlap (MaRCO)”, ECE Technical Reports, 2007.11.01.
  • [19] Owen O’Malley, “Plugging the Holes: Security and Compatibility”, slides.
  • [20] “Kerberos认证协议的教学设计” (Instructional Design for the Kerberos Authentication Protocol), Computer System and Network Security Design Research Group, School of Science and Engineering, University of Electronic Science and Technology of China.
  • [21] Owen O’Malley, Kan Zhang, Sanjay Radia, Ram Marti, and Christopher Harrell, “Hadoop Security Design”, Yahoo!
  • [22] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, NSDI 2011, March 2011.
  • [23] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, NSDI 2011, March 2011.
  • [24] “yarn(hadoop2)框架的一些软件设计模式” (Some Software Design Patterns in the YARN (Hadoop 2) Framework), CSDN.
  • [25] AMD white paper: “Hadoop Performance Tuning Guide”.

【Reference Web Resources】

  • [1] Apache log4j website: http://logging.apache.org/log4j/index.html.
  • [2] Nutch official website: http://nutch.apache.org/.
  • [3] Lucene official website: http://lucene.apache.org/.
  • [4] HDFS RAID introduction: http://wiki.apache.org/hadoop/HDFS-RAID.
  • [5] An update on Apache Hadoop 1.0: http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/.
  • [6] Fault injection framework introduction: http://hadoop.apache.org/docs/hdfs/r0.21.0/faultinject_framework.html.
  • [7] Spark official homepage: http://www.spark-project.org/.
  • [8] Oozie official homepage: http://incubator.apache.org/oozie/.
  • [9] Sort Benchmark: http://sortbenchmark.org/.
  • [10] HBase official homepage: http://hbase.apache.org/.
  • [11] Hive official homepage: http://hive.apache.org/.
  • [12] Pig official homepage: http://pig.apache.org/.
  • [13] Cascading official homepage: http://www.cascading.org/.
  • [14] Azkaban official homepage: http://sna-projects.com/azkaban/.
  • [15] Using Hadoop IPC/RPC for distributed applications: http://www.supermind.org/blog/520.
  • [16] Architecture of a Highly Scalable NIO-Based Server: http://today.java.net/pub/a/today/2007/02/13/architecture-of-highly-scalable-nio-server.html.
  • [17] New I/O APIs: http://docs.oracle.com/javase/1.4.2/docs/guide/nio/.
  • [18] Thrift official homepage: http://thrift.apache.org/.
  • [19] Protocol Buffers official homepage: http://code.google.com/p/protobuf/.
  • [20] Avro official homepage: http://avro.apache.org/.
  • [21] “在Hadoop上调试HadoopStreaming程序的方法详解” (A Detailed Guide to Debugging HadoopStreaming Programs on Hadoop), 道凡.
  • [22] Hanborq optimized Hadoop Distribution: https://github.com/hanborq/hadoop.
  • [23] MapReduce:详解Shuffle过程 (MapReduce: The Shuffle Process Explained): http://langyu.iteye.com/blog/992916.
  • [24] Quicksort and its optimization: http://rdc.taobao.com/team/jm/archives/252.
  • [25] Hadoop source code analysis: http://caibinbupt.iteye.com/.
  • [26] nativetask code and documentation: https://github.com/decster/nativetask.
  • [27] HOD documentation: http://hadoop.apache.org/docs/stable/hod_scheduler.html.
  • [28] Torque official website: http://www.adaptivecomputing.com/products/open-source/torque/.
  • [29] Capacity Scheduler documentation: http://hadoop.apache.org/docs/stable/capacity_scheduler.html.
  • [30] Fair Scheduler documentation: http://hadoop.apache.org/docs/stable/fair_scheduler.html.
  • [31] Max-Min Fairness (Wikipedia): http://en.wikipedia.org/wiki/Max-min_fairness.
  • [32] Kerberos wiki introduction: http://jianlee.ylinux.org/Computer/Wiki/kerberos.html.
  • [33] Cloudera CDH3 documentation: https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide.
  • [34] Comparison of YARN and Mesos: http://www.quora.com/How-does-YARN-compare-to-Mesos.
  • [35] Hortonworks official blog: http://hortonworks.com/blog/.
  • [36] Cloudera official blog: http://blog.cloudera.com/blog/.
  • [37] Facebook Hadoop code: https://github.com/facebook/hadoop-20.
  • [38] Mesos official website: http://www.mesosproject.org/.
  • [39] http://www.oberhumer.com/opensource/lzo/.
  • [40] http://code.google.com/p/snappy/.
  • [41] https://github.com/toddlipcon/hadoop-lzo.

【Reference Hadoop JIRA Issues】

  • [1] HDFS-1052: HDFS scalability with multiple namenodes.
  • [2] HDFS-1623: High Availability Framework for HDFS NN; HDFS-200: In HDFS, sync() not yet guarantees data available to the new readers.
  • [3] HDFS-265: Revisit append.
  • [4] HDFS-503: Implement erasure coding as a layer on HDFS.
  • [5] HDFS-245: Create symbolic links in HDFS.
  • [6] HADOOP-4487: Security features for Hadoop.
  • [7] HADOOP-6332: Large-scale Automated Test Framework.
  • [8] HADOOP-1230: Replace parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes.
  • [9] MAPREDUCE-334: Change mapred.lib code to use new api.
  • [10] HADOOP-1722: Make streaming to handle non-utf8 byte array.
  • [11] HADOOP-7775: RPC Layer improvements to support protocol compatibility.
  • [12] HADOOP-7347: IPC Wire Compatibility.
  • [13] HADOOP-4797: RPC Server can leave a lot of direct buffers.
  • [14] HDFS-2676: Remove Avro RPC.
  • [15] HDFS-2058: DataTransfer Protocol using protobufs.
  • [16] MAPREDUCE-1099: Setup and cleanup tasks could affect job latency if they are caught running on bad nodes.
  • [17] MAPREDUCE-463: The job setup and cleanup tasks should be optional.
  • [18] MAPREDUCE-744: Support in DistributedCache to share cache files with other users after HADOOP-4493.
  • [19] HADOOP-153: skip records that fail Task.
  • [20] HADOOP-2141: speculative execution start up condition based on completion time.
  • [21] MAPREDUCE-2657: TaskTracker should handle disk failures.
  • [22] MAPREDUCE-1906: Lower minimum heartbeat interval for tasktracker > Jobtracker.
  • [23] HADOOP-3245: Provide ability to persist running jobs (extend HADOOP-1876).
  • [24] MAPREDUCE-873: Simplify Job Recovery.
  • [25] MAPREDUCE-211: Provide a node health check script and run it periodically to check the node health status.
  • [26] HADOOP-4305: repeatedly blacklisted tasktrackers should get declared dead.
  • [27] HADOOP-5643: Ability to blacklist tasktracker.
  • [28] MAPREDUCE-2657: TaskTracker should handle disk failures.
  • [29] MAPREDUCE-2415: Distribute TaskTracker userlogs onto multiple disks.
  • [30] HADOOP-692: Rack-aware Replica Placement.
  • [31] MAPREDUCE-2415: Distribute TaskTracker userlogs onto multiple disks.
  • [32] MAPREDUCE-2364: Shouldn't hold lock on rjob while localizing resources.
  • [33] HADOOP-5883: TaskMemoryMonitorThread might shoot down tasks even if their processes momentarily exceed the requested memory.
  • [34] MAPREDUCE-1221: Kill tasks on a node if the free physical memory on that machine falls below a configured threshold.
  • [35] MAPREDUCE-211: Provide a node health check script and run it periodically to check the node health status.
  • [36] MAPREDUCE-4039: Sort Avoidance.
  • [37] MAPREDUCE-4049:plugin for generic shuffle service.
  • [38] HADOOP-331:map outputs should be written to a single output file with an index.
  • [39] MAPREDUCE-240:Improve the shuffle phase by using the "connection: keep-alive" and doing batch transfers of files.
  • [40] MAPREDUCE-2841:Task level native optimization.
  • [41] MAPREDUCE-64:Map-side sort is hampered by io.sort.record.percent.
  • [42] HADOOP-1965:Handle map output buffers better.
  • [43] MAPREDUCE-1380:Adaptive Scheduler.
  • [44] MAPREDUCE-1439:Learning Scheduler.
  • [45] MAPREDUCE-4360:Capacity Scheduler Hierarchical leaf queue does not honor the max capacity of container queue.
  • [46] MAPREDUCE-2905:CapBasedLoadManager incorrectly allows assignment when assignMultiple is true (was: assignmultiple per job).
  • [47] HADOOP-4487:Security features for Hadoop.
  • [48] MAPREDUCE-2405:MR-279: Implement uber-AppMaster (in-cluster LocalJobRunner for MRv2).
  • [49] YARN-3:Add support for CPU isolation/monitoring of containers.
  • [50] YARN-2:Enhance CS to schedule accounting for both memory and cpu cores.
  • [51] YARN-137:Change the default scheduler to the CapacityScheduler.
  • [52] MAPREDUCE-211:Provide a node health check script and run it periodically to check the node health status.
  • [53] MAPREDUCE-1906:Lower default minimum heartbeat interval for tasktracker > Jobtracker.
  • [54] MAPREDUCE-2355:Add an out of band heartbeat damper.
  • [55] HADOOP-3136:Assign multiple tasks per TaskTracker heartbeat.
  • [56] HADOOP-7206:Integrate Snappy compression.
  • [57] HADOOP-7714:Umbrella for usage of native calls to manage OS cache and readahead.