MapReduce的核心资料索引 [转]

转自http://prinx.blog.163.com/blog/static/190115275201211128513868/和http://www.cnblogs.com/jie465831735/archive/2013/03/06.html

按如下顺序看效果最佳:

1. MapReduce Simplied Data Processing on Large Clusters

2. Hadoop环境的安装 By 徐伟

3. Parallel K-Means Clustering Based on MapReduce

4. 《Hadoop权威指南》的第一章和第二章

5. 迭代式MapReduce框架介绍董的博客

6. HaLoop: Efficient Iterative Data Processing on Large Clusters

7. Twister: A Runtime for Iterative MapReduce

8. 迭代式MapReduce解决方案（一）

9. 迭代式MapReduce解决方案（二）

10. 迭代式MapReduce解决方案（三）

11. Granules: A Lightweight, Streaming Runtime for Cloud Computing With Support for Map-Reduce

12. On the Performance of Distributed Data Clustering Algorithms in File and Streaming Processing Systems

13. Spark: Cluster Computing with Working Set

14. iMapReduce: A Distributed Computing Framework for Iterative Computation

15. 《Hadoop权威指南》的第三章到第十章

16. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

17. Clustering Very Large Multi-dimensional Datasets with MapReduce

18. HBase环境的安装 By 徐伟 + HBase 测试程序

Ps：简单讲解一下上面的流程，MapReduce计算模型就是Google在(1)中提出来的，一定要仔细看这篇论文，我当初因为看的不够仔细走了很多的弯路。Hadoop是一个开源的MapReduce计算模型实现，按照(2)来安装，以及跑一遍Word Count程序，基本上就算是入门了。(3)这篇文章价值不大，但是可以通过其看一下K-Means算法是如何MapReduce化的，以后就可以举一反三了。(4)的作用就是加深对(1-3)的理解。从(5)开始就可以进入迭代MapReduce的子领域了，董是这方面的大牛。(6)(7)是(5)中提到的两篇论文，(5-7)都要仔细的看，把迭代MapReduce的基础打牢。(8-10)也是董的文章，加深一下对迭代MapReduce问题的理解。(11)(12)是Jaliya Ekanayake、Shrideep Pallickara合作的文章，他们是国外迭代MapReduce领域的发文章最多的两个人。(13)是伯克利大学的迭代MapReduce的文章，Spark是所有实验室产品中唯一已经商用推广的，赞！(14)这篇文章，我看的不是很细致，但是Collector的灵感就是来源于这篇文章。这个时候估计你已经有自己的解决方案了，要编程实现自己的设计了，需要仔细的看(15)了。(16) Map-Reduce-Merge咱们实验室曾经做过的一个问题。(17)这篇文章+Canopy算法，可以得出一些关于用MapReduce实现高质量数据抽样的思路。(18)如果需要使用HBase，可以参考这篇文章。

posted @ 2013-03-06 21:36 南宫星海阅读(25) 评论(0) 编辑

研究云计算与大数据分析处理领域建议看的学术论文列表

转自http://cloud.dlmu.edu.cn/cloudsite/index.php?action-viewnews-itemid-123-php-1

[1] Zhou AY. Data intensive computing-challenges of data management techniques. Communications of CCF, 2009,5(7):50.53 (in Chinese with English abstract).
[2] Cohen J, Dolan B, Dunlap M, Hellerstein JM, Welton C. MAD skills: New analysis practices for big data. PVLDB, 2009,2(2): 1481.1492.
[3] Schroeder B, Gibson GA. Understanding failures in petascale computers. Journal of Physics: Conf. Series, 2007,78(1):1.11.
[4] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In: Brewer E, Chen P, eds. Proc. of the OSDI. California: USENIX Association, 2004. 137.150.
[5] Pavlo A, Paulson E, Rasin A, Abadi DJ, Dewitt DJ, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Cetintemel U, Zdonik SB, Kossmann D, Tatbul N, eds. Proc. of the SIGMOD. Rhode Island: ACM Press, 2009. 165.178.
[6] Chu CT, Kim SK, Lin YA, Yu YY, Bradski G, Ng AY, Olukotun K. Map-Reduce for machine learning on multicore. In: Scholkopf B, Platt JC, Hoffman T, eds. Proc. of the NIPS. Vancouver: MIT Press, 2006. 281.288.
[7] Wang CK, Wang JM, Lin XM, Wang W, Wang HX, Li HS, Tian WP, Xu J, Li R. MapDupReducer: Detecting near duplicates over massive datasets. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 1119.1122.
[8] Liu C, Guo F, Faloutsos C. BBM: Bayesian browsing model from petabyte-scale data. In: Elder JF IV, Fogelman-Soulié F, Flach PA, Zaki MJ, eds. Proc. of the KDD. Paris: ACM Press, 2009. 537.546.
[9] Panda B, Herbach JS, Basu S, Bayardo RJ. PLANET: Massively parallel learning of tree ensembles with MapReduce. PVLDB, 2009,2(2):1426.1437.
[10] Lin J, Schatz M. Design patterns for efficient graph algorithms in MapReduce. In: Rao B, Krishnapuram B, Tomkins A, Yang Q, eds. Proc. of the KDD. Washington: ACM Press, 2010. 78.85.
[11] Zhang CJ, Ma Q, Wang XL, Zhou AY. Distributed SLCA-based XML keyword search by Map-Reduce. In: Yoshikawa M, Meng XF, Yumoto T, Ma Q, Sun LF, Watanabe C, eds. Proc. of the DASFAA. Tsukuba: Springer-Verlag, 2010. 386.397.
[12] Stupar A, Michel S, Schenkel R. RankReduce—Processing K-nearest neighbor queries on top of MapReduce. In: Crestani F, Marchand-Maillet S, Chen HH, Efthimiadis EN, Savoy J, eds. Proc. of the SIGIR. Geneva: ACM Press, 2010. 13.18.
[13] Wang GZ, Salles MV, Sowell B, Wang X, Cao T, Demers A, Gehrke J, White W. Behavioral simulations in MapReduce. PVLDB, 2010,3(1-2):952.963.
[14] Gunarathne T, Wu TL, Qiu J, Fox G. Cloud computing paradigms for pleasingly parallel biomedical applications. In: Hariri S, Keahey K, eds. Proc. of the HPDC. Chicago: ACM Press, 2010. 460−469.
[15] Delmerico JA, Byrnesy NA, Brunoz AE, Jonesz MD, Galloz SM, Chaudhary V. Comparing the performance of clusters, hadoop, and active disks on microarray correlation computations. In: Yang YY, Parashar M, Muralidhar R, Prasanna VK, eds. Proc. of the HiPC. Kochi: IEEE Press, 2009. 378−387.
[16] Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J. Ricardo: Integrating R and hadoop. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 987−998.
[17] Wegener D, Mock M, Adranale D, Wrobel S. Toolkit-Based high-performance data mining of large data on MapReduce clusters. In: Saygin Y, Yu JX, Kargupta H, Wang W, Ranka S, Yu PS, Wu XD, eds. Proc. of the ICDM Workshop. Washington: IEEE Computer Society, 2009. 296−301.
[18] Kovoor G, Singer J, Lujan M. Building a Java Map-Reduce framework for multi-core architectures. In: Ayguade E, Gioiosa R, Stenstrom P, Unsal O, eds. Proc. of the HiPEAC. Pisa: HiPEAC Endowment, 2010. 87−98.
[19] De Kruijf M, Sankaralingam K. MapReduce for the cell broadband engine architecture. IBM Journal of Research and Development, 2009,53(5):1−12.
[20] Becerra Y, Beltran V, Carrera D, Gonzalez M, Torres J, Ayguade E. Speeding up distributed MapReduce applications using hardware accelerators. In: Barolli L, Feng WC, eds. Proc. of the ICPP. Vienna: IEEE Computer Society, 2009. 42−49.
[21] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In: Dally WJ, ed. Proc. of the HPCA. Phoenix: IEEE Computer Society, 2007. 13−24.
[22] Ma WJ, Agrawal G. A translation system for enabling data mining applications on GPUs. In: Zhou P, ed. Proc. of the Supercomputing (SC). New York: ACM Press, 2009. 400−409.
[23] He BS, Fang WB, Govindaraju NK, Luo Q, Wang TY. Mars: A MapReduce framework on graphics processors. In: Moshovos A, Tarditi D, Olukotun K, eds. Proc. of the PACT. Ontario: ACM Press, 2008. 260−269.
[24] Stuart JA, Chen CK, Ma KL, Owens JD. Multi-GPU volume rendering using MapReduce. In: Hariri S, Keahey K, eds. Proc. of the MapReduce Workshop (HPDC 2010). New York: ACM Press, 2010. 841−848.
[25] Hong CT, Chen DH, Chen WG, Zheng WM, Lin HB. MapCG: Writing parallel program portable between CPU and GPU. In: Salapura V, Gschwind M, Knoop J, eds. Proc. of the PACT. Vienna: ACM Press, 2010. 217−226.
[26] Jiang W, Ravi VT, Agrawal G. A Map-Reduce system with an alternate API for multi-core environments. In: Chiba T, ed. Proc. of the CCGRID. Melbourne: IEEE Press, 2010. 84−93.
[27] Liao HJ, Han JZ, Fang JY. Multi-Dimensional index on hadoop distributed file system. In: Xu ZW, ed. Proc. of the Networking, Architecture, and Storage (NAS). Macau: IEEE Computer Society, 2010. 240−249.
[28] Zou YQ, Liu J, Wang SC, Zha L, Xu ZW. CCIndex: A complemental clustering index on distributed ordered tables for multi- dimensional range queries. In: Ding C, Shao ZY, Zheng R, eds. Proc. of the NPC. Zhengzhou: Springer-Verlag, 2010. 247−261.
[29] Zhang SB, Han JZ, Liu ZY, Wang K, Feng SZ. Accelerating MapReduce with distributed memory cache. In: Huang XX, ed. Proc. of the ICPADS. Shenzhen: IEEE Press, 2009. 472−478.
[30] Dittrich J, Quian′e-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 2010,3(1-2):518−529.
[31] Chen ST. Cheetah: A high performance, custom data warehouse on top of MapReduce. PVLDB, 2010,3(1-2):1459−1468.
[32] Iu MY, Zwaenepoel W. HadoopToSQL: A MapReduce query optimizer. In: Morin C, Muller G, eds. Proc. of the EuroSys. Paris: ACM Press, 2010. 251−264.
[33] Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian YY. A comparison of join algorithms for log processing in MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 975−986.
[34] Zhou MQ, Zhang R, Zeng DD, Qian WN, Zhou AY. Join optimization in the MapReduce environment for column-wise data store. In: Fang YF, Huang ZX, eds. Proc. of the SKG. Ningbo: EEE Computer Society, 2010. 97−104.
[35] Afrati FN, Ullman JD. Optimizing joins in a Map-Reduce environment. In: Manolescu I, Spaccapietra S, Teubner J, Kitsuregawa M, Léger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 99−110.
[36] Sandholm T, Lai K. MapReduce optimization using regulated dynamic prioritization. In: Douceur JR, Greenberg AG, Bonald T, Nieh J, eds. Proc. of the SIGMETRICS. Seattle: ACM Press, 2009. 299.310.
[37] Hoefler T, Lumsdaine A, Dongarra J. Towards efficient MapReduce using MPI. In: Oster P, ed. Proc. of the EuroPVM/MPI. Berlin: Springer-Verlag, 2009. 240.249.
[38] Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010, 3(1-2):494.505.
[39] Kambatla K, Rapolu N, Jagannathan S, Grama A. Asynchronous algorithms in MapReduce. In: Moreira JE, Matsuoka S, Pakin S, Cortes T, eds. Proc. of the CLUSTER. Crete: IEEE Press, 2010. 245.254.
[40] Polo J, Carrera D, Becerra Y, Torres J, Ayguade E, Steinder M, Whalley I. Performance-Driven task co-scheduling for MapReduce environments. In: Tonouchi T, Kim MS, eds. Proc. of the IEEE Network Operations and Management Symp. (NOMS). Osaka: IEEE Press, 2010. 373.380.
[41] Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I. Improving MapReduce performance in heterogeneous environments. In: Draves R, van Renesse R, eds. Proc. of the ODSI. Berkeley: USENIX Association, 2008. 29.42.
[42] Xie J, Yin S, Ruan XJ, Ding ZY, Tian Y, Majors J, Manzanares A, Qin X. Improving MapReduce performance through data placement in heterogeneous hadoop clusters. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshop on Heterogeneity in Computing (IPDPS 2010). Atlanta: IEEE Press, 2010. 1.9.
[43] Polo J, Carrera D, Becerra Y, Beltran V, Torres J, Ayguade E. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In: Qin F, Barolli L, Cho SY, eds. Proc. of the ICPP. San Diego: IEEE Press, 2010. 653.662.
[44] Papagiannis A, Nikolopoulos DS. Rearchitecting MapReduce for heterogeneous multicore processors with explicitly managed memories. In: Qin F, Barolli L, Cho SY, eds. Proc. of the ICPP. San Diego: IEEE Press, 2010. 121.130.
[45] Jiang DW, Ooi BC, Shi L, Wu S. The performance of MapReduce: An in-depth study. PVLDB, 2010,3(1-2):472.483.
[46] Berthold J, Dieterle M, Loogen R. Implementing parallel Google Map-Reduce in Eden. In: Sips HJ, Epema DHJ, Lin HX, eds. Proc. of the Euro-Par. Delft: Springer-Verlag, 2009. 990.1002.
[47] Verma A, Zea N, Cho B, Gupta I, Campbell RH. Breaking the MapReduce stage barrier. In: Moreira JE, Matsuoka S, Pakin S, Cortes T, eds. Proc. of the CLUSTER. Crete: IEEE Press, 2010. 235.244.
[48] Yang HC, Dasdan A, Hsiao RL, Parker DS. Map-Reduce-Merge simplified relational data processing on large clusters. In: Chan CY, Ooi BC, Zhou AY, eds. Proc. of the SIGMOD. Beijing: ACM Press, 2007. 1029.1040.
[49] Seo SW, Jang I, Woo KC, Kim I, Kim JS, Maeng S. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In: Rana O, Tang FL, Kosar T, eds. Proc. of the CLUSTER. New Orleans: IEEE Press, 2009. 1.8.
[50] Babu S. Towards automatic optimization of MapReduce programs. In: Kansal A, ed. Proc. of the ACM Symp. on Cloud Computing (SoCC). New York: ACM Press, 2010. 137.142.
[51] Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: A not-so-foreign language for data processing. In: Wang JTL, ed. Proc. of the SIGMOD. Vancouver: ACM Press, 2008. 1099.1110.
[52] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 2007,41(3):59.72.
[53] Isard M, Yu Y. Distributed data-parallel computing using a high-level programming language. In: Cetintemel U, Zdonik SB, Kossmann D, Tatbul N, eds. Proc. of the SIGMOD. Rhode Island: ACM Press, 2009. 987.994.
[54] Chaiken R, Jenkins B, Larson P, Ramsey B, Shakib D, Weaver S, Zhou JR. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 2008,1(2):1265.1276.
[55] Condie T, Conway N, Alvaro P, Hellerstein JM, Gerth J, Talbot J, Elmeleegy K, Sears R. Online aggregation and continuous query support in MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 1115.1118.
[56] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive a warehousing solution over a MapReduce framework. PVLDB, 2009,2(2):938.941.
[57] Ghoting A, Pednault E. Hadoop-ML: An infrastructure for the rapid implementation of parallel reusable analytics. In: Culotta A, ed. Proc. of the Large-Scale Machine Learning: Parallelism and Massive Datasets Workshop (NIPS 2009). Vancouver: MIT Press, 2009. 6.
[58] Yang C, Yen C, Tan C, Madden S. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Li FF, Moro MM, Ghandeharizadeh S, Haritsa JR, Weikum G, Carey MJ, Casati F, Chang EY, Manolescu I, Mehrotra S, Dayal U, Tsotras VJ, eds. Proc. of the ICDE. Long Beach: IEEE Press, 2010. 657.668.
[59] Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologes for analytical workloads. PVLDB, 2009,2(1):922.933.
[60] Abouzied A, Bajda-Pawlikowski K, Huang JW, Abadi DJ, Silberschatz A. HadoopDB in action: Building real world applications. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 1111.1114.
[61] Friedman E, Pawlowski P, Cieslewicz J. SQL/MapReduce: A practical approach to self describing, polymorphic, and parallelizable user defined functions. PVLDB, 2009,2(2):1402.1413.
[62] Stonebraker M, Abadi D, DeWitt DJ, Maden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 2010,53(1):64.71.
[63] Dean J, Ghemawat S. MapReduce: A flexible data processing tool. Communications of ACM, 2010,53(1):72.77.
[64] Xu Y, Kostamaa P, Gao LK. Integrating hadoop and parallel DBMS. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 969.974.
[65] Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sarma JS, Murthy R, Liu H. Data warehousing and analytics infrastructure at facebook. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 1013.1020.
[66] Mcnabb AW, Monson CK, Seppi KD. MRPSO: MapReduce particle swarm optimization. In: Ryan C, Keijzer M, eds. Proc. of the GECCO. Atlanta: ACM Press, 2007. 177.185.
[67] Kang U, Tsourakakis CE, Faloutsos C. PEGASUS: A peta-scale graph mining system—Implementation and observations. In: Wang W, Kargupta H, Ranka S, Yu PS, Wu XD, eds. Proc. of the ICDM. Miami: IEEE Computer Society, 2009. 229.238.
[68] Kang S, Bader DA. Large scale complex network analysis using the hybrid combination of a MapReduce cluster and a highly multithreaded system. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshops and Phd Forum (IPDPS 2010). Atlanta: IEEE Presss, 2010. 11.19.
[69] Logothetis D, Yocum K. AdHoc data processing in the cloud. PVLDB, 2008,1(1):1472.1475.
[70] Olston C, Bortnikov E, Elmeleegy K, Junqueira F, Reed B. Interactive analysis of WebScale data. In: DeWitt D, ed. Proc. of the CIDR. Asilomar: Online www.crdrdb.org, 2009.
[71] Bose JH, Andrzejak A, Hogqvist M. Beyond online aggregation: Parallel and incremental data mining with online Map-Reduce. In: Tanaka K, Zhou XF, Zhang M, Jatowt A, eds. Proc. of the Workshop on Massive Data Analytics on the Cloud (WWW 2010). Raleigh: ACM Press, 2010. 3.
[72] Kumar V, Andrade H, Gedik B, Wu KL. DEDUCE: At the intersection of MapReduce and stream processing. In: Manolescu I, Spaccapietra S, Teubner J, Kitsuregawa M, Léger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 657.662.
[73] Abramson D, Dinh MN, Kurniawan D, Moench B, DeRose L. Data centric highly parallel debugging. In: Hariri S, Keahey K, eds. Proc. of the HPDC. Chicago: ACM Press, 2010. 119.129.
[74] Morton K, Friesen A, Balazinska M, Grossman D. Estimating the progress of MapReduce pipelines. In: Li FF, Moro MM, Ghandeharizadeh S, et al., eds. Proc. of the ICDE. Long Beach: IEEE Press, 2010. 681.684.
[75] Morton K, Balazinska M, Grossman D. ParaTimer: A progress indicator for MapReduce DAGs. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 507.518.
[76] Lang W, Patel JM. Energy management for MapReduce clusters. PVLDB, 2010,3(1-2):129.139.
[77] Wieder A, Bhatotia P, Post A, Rodrigues R. Brief announcement: Modeling MapReduce for optimal execution in the cloud. In: Richa AW, Guerraoui R, eds. Proc. of the PODC. Zurich: ACM Press, 2010. 408.409.
[78] Zheng Q. Improving MapReduce fault tolerance in the cloud. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshops and Phd Forum (IPDPS 2010). Atlanta: IEEE Presss, 2010. 1.6.
[79] Groot S. Jumbo: Beyond MapReduce for workload balancing. In: Mylopoulos J, Zhou LZ, Zhou XF, eds. Proc. of the PhD Workshop (VLDB 2010). Singapore: VLDB Endowment, 2010. 7.12.
[80] Chatziantoniou D, Tzortzakakis E. ASSET queries: A declarative alternative to MapReduce. SIGMOD Record, 2009,38(2):35.41.
[81] Bu YY, Howe B, Balazinska M, Ernst MD. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 2010,3(1-2): 285−296.
[82] Wang HJ, Qin XP, Zhang YS, Wang S, Wang ZW. LinearDB: A relational approach to make data warehouse scale like MapReduce. In: Yu JX, Kim MH, Unland R, eds. Proc. of the DASFAA. Hong Kong: Springer-Verlag, 2011. 306−320

posted @ 2013-03-06 21:30 南宫星海阅读(17) 评论(0) 编辑

云计算核心论文 .

转自 http://blog.csdn.net/zhaomirong/article/details/7832215

Google
1. nosqldbs-NOSQL Introduction and Overview
2. system and method for data distribution(2009)
3. System and method for large-scale data processing using an application-independent framework(2010)
4. MapReduce: Simplified Data Processing on Large Clusters;
5. MapReduce-- a flexible data processing tool(2010)
6. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
7. MapReduce and Parallel DBMSs--Friends or Foes(2010)
8. Presentation:MapReduce and Parallel DBMSs:Together at Last (2010)
9. Twister: A Runtime for Iterative MapReduce(2010)
10. MapReduce Online(2009)
11. Megastore: Providing Scalable, Highly Available Storage for Interactive Services (2011,CIDR)
12. Interpreting the Data:Parallel Analysis with Sawzall
13. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (technical report 2010)
14. Large-scale Incremental Processing Using Distributed Transactions and Notifications(2010)
15. Improving MapReduce Performance in Heterogeneous Environments
16. Dremel: Interactive Analysis of WebScale Datasets(2011)
17. Large-scale Incremental Processing Using Distributed Transactions and Notifications
18. Chukwa: a scalable cloud monitoring System (presentation)
19. The Chubby lock service for loosely-coupled distributed systems
20. Paxos Made Simple(2001,Lamport)
21. Fast Paxos(2006)
22. Paxos Made Live - An Engineering Perspective(2007)
23. Classic Paxos vs. Fast Paxos: Caveat Emptor
24. On the Coordinator’s Rule for Fast Paxos(2005)
25. Paxos made code:Implementing a high throughput Atomic Broadcast (2009)
26. Bigtable: A Distributed Storage System for Structured Data(2006)
27. The Google File System

Google patent papers
1. Data processing system and method for financial debt instruments(1999)
2. Data processing system and method to enforce payment of royalties when copying softcopy books(1996)
3. Data processing systems and methods(2005)
4. Large-scale data processing in a distributed and parallel processing environment(2010)
5. METHODS AND SYSTEMS FOR MANAGEMENT OF DATA()
6. SEARCH OVER STRUCTURED DATA(2011)
7. System and method for maintaining replicated data coherency in a data processing system(1995)
8. System and method of using data mining prediction methodology(2006)
9. System and Methodology for Data Processing Combining Stream Processing and spreadsheet computation(2011)
10. Patent Factor index report of system and method of using data mining prediction methodology
11. Pregel: A System for Large-Scale Graph Processing(2010)

Hadoop
1. A simple totally ordered broadcast protocol
2. ZooKeeper: Wait-free coordination for Internet-scale systems
3. Zab: High-performance broadcast for primary-backup systems(2011)
4. wait-free syschronization(1991)
5. ON SELF-STABILIZING WAIT-FREE CLOCK SYNCHRONIZATION(1997)
6. Wait-free clock synchronization(ps format)
7. Programming with ZooKeeper - A basic tutorial
8. Hive – A Petabyte Scale Data Warehouse Using Hadoop
9. Thrift: Scalable Cross-Language Services Implementation(Facebook)
10. Hive other files: HiveMetaStore class picture, Chinese docs
11. Scaling out data preprocessing with Hive (2011)
12. HBase The Definitive Guide - 2011
13. Nova: Continuous Pig/Hadoop Workflows(yahoo,2011)
14. Pig Latin: A Not-So-Foreign Language for Data Processing(2008)
15. Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?(2009)
a. Some docs about HStreaming,Zebra
16. HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks
17. System Anomaly Detection in Distributed Systems through MapReduce-Based Log Analysis(2010)
18. Benchmarking Cloud Serving Systems with YCSB(2010)
19. Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework (2009)

SmallFile Combine in hadoop world
1. TidyFS: A Simple and Small Distributed File System(Microsoft)
2. Improving the storage efficiency of small files in cloud storage(chinese,2011)
3. Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications(2010)
4. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems(Facebook)
5. A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files(IBM,2010)

Job schedule
1. Job Scheduling for Multi-User MapReduce Clusters(Facebook)
2. MapReduce Scheduler Using Classifiers for Heterogeneous Workloads(2011)
3. Performance-Driven Task Co-Scheduling for MapReduce Environments
4. Towards a Resource Aware Scheduler in Hadoop(2009)
5. Delay Scheduling: A Simple Technique for Achieving
6. Locality and Fairness in Cluster Scheduling(yahoo,2010)
7. Dynamic Proportional Share Scheduling in Hadoop(HP)
8. Adaptive Task Scheduling for MultiJob MapReduce Environments(2010)
9. A Dynamic MapReduce Scheduler for Heterogeneous Workloads(2009)

HStreaming
1. HStreaming Cloud Documentation
2. S4: Distributed Stream Computing Platform(yahoo,2010)
3. Complex Event Processing(2009)
4. Hstreaming : http://www.hstreaming.com/resources/manuals/
5. StreamBase: http://streambase.com/developers-docs-pdfindex.htm
6. Twitter storm: http://www.infoq.com/cn/news/2011/09/twitter-storm-real-time-hadoop
7. Bulk Synchronous Parallel(BSP) computing
8. MPI

SQL/Mapreduce
1. Aster Data whilepaper:Deriving Deep Insights from Large Datasets with SQL-MapReduce (2004)
2. SQL/MapReduce: A practical approach to self-describing,polymorphic, and parallelizable user-defined functions(2009,aster)
3. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads(2009)
4. HadoopDB in Action: Building Real World Applications(2010)
5. Aster Data presentation: Making Advanced Analytics on Big Data Fast and Easy(2010)
6. A Scalable, Predictable Join Operator for
7. Highly Concurrent Data Warehouses(2009)
8. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce(2010)
9. Greenplum whilepaper:A Unified Engine for RDBMS and MapReduce(2004)
10. A Comparison of Approaches to Large-Scale Data Analysis(2009)
11. MAD Skills: New Analysis Practices for Big Data (2009)
12. C Store A Column oriented DBMS(2005)
13. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations(Microsoft)

Microsoft
1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (2007)

Amazon
1. Dynamo: Amazon’s Highly Available Key-value Store(2007)
2. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
3. The Eucalyptus Open-source Cloud-computing System
4. Eucalyptus: An Open-source Infrastructure for Cloud Computing(presentation)
5. Eucalyptus : A Technical Report on an Elastic Utility Computing Archietcture Linking Your Programs to Useful Systems (2008)
6. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms(2011)
7. Database-Agnostic Transaction Support for Cloud Infrastructures
8. CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems(2011)
9. ELT: Efficient Log-based Troubleshooting System for Cloud Computing Infrastructures

Books
1. Distributed Systems Concepts and Design (5th Edition)
2. Principles of Computer Systems (7-11)
3. Distributed system(chapter)
4. Data-Intensive Text Processing with MapReduce (2010)
5. Hadoop in Action
6. 21 Recipes for Mining Twitter
7. Hadoop.The.Definitive.Guide.2nd.Edition
8. Pro hadoop

Other papers about Distributed system
1. Flexible Update Propagation for Weakly Consistent Replication(1997)
2. Providing High Availability Using Lazy Replication(1992)
3. Managing Update Conflicts in Bayou,a Weakly Connected Replicated Storage System(1995)
4. XMIDDLE: A Data-Sharing Middleware for Mobile Computing(2002)
5. design and implementation of sun network filesystem
6. Chord: A Scalable Peertopeer Lookup Service for Internet Applications(2001)
7. A Survey and Comparison of Peer-to-Peer Overlay Network Schemes(2004)
8. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing(2001)

BI
1. 21 Recipes for Mining Twitter(Book)
2. Web Data Mining(Book)
3. Web Mining and Social Networking(Book)
4. mining the social web(book)
5. TEXTUAL BUSINESS INTELLIGENCE (Inmon)
6. Social Network Analysis and Mining for Business Applications(yahoo,2011)
7. Data Mining in Social Networks(2002)
8. Natural Language Processing with Python(book)
9. data_mining-10_methods(Chinese editation)
10. Mahout in Action(Book)
11. Text Mining Infrastructure in R(2008)
12. Text Mining Handbook(2010)

Web search engine
1. Building Efficient Multi-Threaded Search Nodes(Yahoo,2010)
2. The Anatomy of a Large-Scale Hypertextual Web Search Engine(google)

posted @ 2013-03-06 21:29 南宫星海阅读(17) 评论(0) 编辑

Hadoop（初步）

Hadoop

一个分布式系统基础架构，由Apache基金会开发。用户可以在不了解分布式底层细节的情况下，开发分布式程序。充分利用集群的威力高速运算和存储。Hadoop实现了一个分布式文件系统（Hadoop Distributed File System），简称HDFS。HDFS有着高容错性的特点，并且设计用来部署在低廉的（low-cost）硬件上。而且它提供高传输率（high throughput）来访问应用程序的数据，适合那些有着超大数据集（large data set）的应用程序。HDFS放宽了（relax）POSIX的要求（requirements）这样可以流的形式访问（streaming access）文件系统中的数据。

名字起源

Hadoop^[1]这个名字不是一个缩写，它是一个虚构的名字。该项目的创建者，Doug Cutting如此解释Hadoop的得名：“这个名字是我孩子给一个棕黄色的大象样子的填充玩具命名的。我的命名标准就是简短，容易发音和拼写，没有太多的意义，并且不会被用于别处。小孩子是这方面的高手。”[Hadoop: The Definitive Guide]

起源

Hadoop 由 Apache Software Foundation 公司于 2005 年秋天作为 Lucene的子

Hadoop logo

项目 Nutch的一部分正式引入。它受到最先由 Google Lab 开发的 Map/Reduce 和 Google File System(GFS) 的启发。2006 年 3 月份，Map/Reduce 和 Nutch Distributed File System (NDFS) 分别被纳入称为 Hadoop 的项目中。

Hadoop 是最受欢迎的在 Internet 上对搜索关键字进行内容分类的工具，但它也可以解决许多要求极大伸缩性的问题。例如，如果您要 grep 一个 10TB 的巨型文件，会出现什么情况？在传统的系统上，这将需要很长的时间。但是 Hadoop 在设计时就考虑到这些问题，采用并行执行机制，因此能大大提高效率。

诸多优点

Hadoop 是一个能够对大量数据进行分布式处理的软件框架。但是 Hadoop 是以一种可靠、高效、可伸缩的方式进行处理的。Hadoop 是可靠的，因为它假设计算元素和存储会失败，因此它维护多个工作数据副本，确保能够针对失败的节点重新分布处理。Hadoop 是高效的，因为它以并行的方式工作，通过并行处理加快处理速度。Hadoop 还是可伸缩的，能够处理 PB 级数据。此外，Hadoop 依赖于社区服务器，因此它的成本比较低，任何人都可以使用。

Hadoop是一个能够让用户轻松架构和使用的分布式计算平台。用户可以轻松地在Hadoop上开发和运行处理海量数据的应用程序。它主要有以下几个优点：

⒈高可靠性。Hadoop按位存储和处理数据的能力值得人们信赖。

⒉高扩展性。Hadoop是在可用的计算机集簇间分配数据并完成计算任务的，这些集簇可以方便地扩展到数以千计的节点中。

⒊高效性。Hadoop能够在节点之间动态地移动数据，并保证各个节点的动态平衡，因此处理速度非常快。

⒋高容错性。Hadoop能够自动保存数据的多个副本，并且能够自动将失败的任务重新分配。

Hadoop带有用 Java 语言编写的框架，因此运行在 Linux 生产平台上是非常理想的。Hadoop 上的应用程序也可以使用其他语言编写，比如 C++。

架构

Hadoop 有许多元素构成。其最底部是 Hadoop Distributed File System^[2]（HDFS），它存储 Hadoop 集群中所有存储节点上的文件。HDFS（对于本文）的上一层是 MapReduce 引擎，该引擎由 JobTrackers 和 TaskTrackers 组成。

HDFS

对外部客户机而言，HDFS 就像一个传统的分级文件系统。可以创建、删除、移动或重命名文件，等等。但是 HDFS 的架构是基于一组特定的节点构建的（参见图 1），这是由它自身的特点决定的。这些节点包括 NameNode（仅一个），它在 HDFS 内部提供元数据服务；DataNode，它为 HDFS 提供存储块。由于仅存在一个 NameNode，因此这是 HDFS 的一个缺点（单点失败）。

存储在 HDFS 中的文件被分成块，然后将这些块复制到多个计算机中（DataNode）。这与传统的 RAID 架构大不相同。块的大小（通常为 64MB）和复制的块数量在创建文件时由客户机决定。NameNode 可以控制所有文件操作。HDFS 内部的所有通信都基于标准的 TCP/IP 协议。

NameNode

NameNode 是一个通常在 HDFS 实例中的单独机器上运行的软件。它负责管理文件系统名称空间和控制外部客户机的访问。NameNode 决定是否将文件映射到 DataNode 上的复制块上。对于最常见的 3 个复制块，第一个复制块存储在同一机架的不同节点上，最后一个复制块存储在不同机架的某个节点上。注意，这里需要您了解集群架构。

实际的 I/O 事务并没有经过 NameNode，只有表示 DataNode 和块的文件映射的元数据经过 NameNode。当外部客户机发送请求要求创建文件时，NameNode 会以块标识和该块的第一个副本的 DataNode IP 地址作为响应。这个 NameNode 还会通知其他将要接收该块的副本的 DataNode。

NameNode 在一个称为 FsImage 的文件中存储所有关于文件系统名称空间的信息。这个文件和一个包含所有事务的记录文件（这里是 EditLog）将存储在 NameNode 的本地文件系统上。FsImage 和 EditLog 文件也需要复制副本，以防文件损坏或 NameNode 系统丢失。

DataNode

DataNode 也是一个通常在 HDFS 实例中的单独机器上运行的软件。Hadoop 集群包含一个 NameNode 和大量 DataNode。DataNode 通常以机架的形式组织，机架通过一个交换机将所有系统连接起来。Hadoop 的一个假设是：机架内部节点之间的传输速度快于机架间节点的传输速度。

DataNode 响应来自 HDFS 客户机的读写请求。它们还响应来自 NameNode 的创建、删除和复制块的命令。NameNode 依赖来自每个 DataNode 的定期心跳（heartbeat）消息。每条消息都包含一个块报告，NameNode 可以根据这个报告验证块映射和其他文件系统元数据。如果 DataNode 不能发送心跳消息，NameNode 将采取修复措施，重新复制在该节点上丢失的块。

文件操作

可见，HDFS 并不是一个万能的文件系统。它的主要目的是支持以流的形式访问写入的大型文件。如果客户机想将文件写到 HDFS 上，首先需要将该文件缓存到本地的临时存储。如果缓存的数据大于所需的 HDFS 块大小，创建文件的请求将发送给 NameNode。NameNode 将以 DataNode 标识和目标块响应客户机。同时也通知将要保存文件块副本的 DataNode。当客户机开始将临时文件发送给第一个 DataNode 时，将立即通过管道方式将块内容转发给副本 DataNode。客户机也负责创建保存在相同 HDFS 名称空间中的校验和（checksum）文件。在最后的文件块发送之后，NameNode 将文件创建提交到它的持久化元数据存储（在 EditLog 和 FsImage 文件）。

Linux 集群

Hadoop 框架可在单一的 Linux 平台上使用（开发和调试时），但是使用存放在机架上的商业服务器才能发挥它的力量。这些机架组成一个 Hadoop 集群。它通过集群拓扑知识决定如何在整个集群中分配作业和文件。Hadoop 假定节点可能失败，因此采用本机方法处理单个计算机甚至所有机架的失败。

集群系统

Google的数据中心使用廉价的Linux PC机组成集群，在上面运行各种应用。即使是分布式开发的新手也可以迅速使用Google的基础设施。核心组件是3个：

⒈GFS（Google File System）。一个分布式文件系统，隐藏下层负载均衡，冗余复制等细节，对上层程序提供一个统一的文件系统API接口。Google根据自己的需求对它进行了特别优化，包括：超大文件的访问，读操作比例远超过写操作，PC机极易发生故障造成节点失效等。GFS把文件分成64MB的块，分布在集群的机器上，使用Linux的文件系统存放。同时每块文件至少有3份以上的冗余。中心是一个Master节点，根据文件索引，找寻文件块。详见Google的工程师发布的GFS论文。

⒉MapReduce。Google发现大多数分布式运算可以抽象为MapReduce操作。Map是把输入Input分解成中间的Key/Value对，Reduce把Key/Value合成最终输出Output。这两个函数由程序员提供给系统，下层设施把Map和Reduce操作分布在集群上运行，并把结果存储在GFS上。

⒊BigTable。一个大型的分布式数据库，这个数据库不是关系式的数据库。像它的名字一样，就是一个巨大的表格，用来存储结构化的数据。

以上三个设施Google均有论文发表。

应用程序

Hadoop 的最常见用法之一是 Web 搜索。虽然它不是惟一的软件框架应用程序，但作为一个并行数据处理引擎，它的表现非常突出。Hadoop 最有趣的方面之一是 Map and Reduce 流程，它受到 Google开发的启发。这个流程称为创建索引，它将 Web 爬行器检索到的文本 Web 页面作为输入，并且将这些页面上的单词的频率报告作为结果。然后可以在整个 Web 搜索过程中使用这个结果从已定义的搜索参数中识别内容。

MapReduce

最简单的 MapReduce 应用程序至少包含 3 个部分：一个 Map 函数、一个 Reduce 函数和一个 main 函数。main 函数将作业控制和文件输入/输出结合起来。在这点上，Hadoop 提供了大量的接口和抽象类，从而为 Hadoop 应用程序开发人员提供许多工具，可用于调试和性能度量等。

MapReduce 本身就是用于并行处理大数据集的软件框架。MapReduce 的根源是函数性编程中的 map 和 reduce 函数。它由两个可能包含有许多实例（许多 Map 和 Reduce）的操作组成。Map 函数接受一组数据并将其转换为一个键/值对列表，输入域中的每个元素对应一个键/值对。Reduce 函数接受 Map 函数生成的列表，然后根据它们的键（为每个键生成一个键/值对）缩小键/值对列表。

这里提供一个示例，帮助您理解它。假设输入域是 one small step for man,one giant leap for mankind。在这个域上运行 Map 函数将得出以下的键/值对列表：

（one,1） (small,1） (step,1） (for,1） (man,1）

MapReduce 流程的概念流

(one,1） (giant,1） (leap,1） (for,1） (mankind,1）

如果对这个键/值对列表应用 Reduce 函数，将得到以下一组键/值对：

（one,2） (small,1） (step,1） (for,2） (man,1）（giant,1） (leap,1） (mankind,1）

结果是对输入域中的单词进行计数，这无疑对处理索引十分有用。但是，现在假设有两个输入域，第一个是 one small step for man，第二个是 one giant leap for mankind。您可以在每个域上执行 Map 函数和 Reduce 函数，然后将这两个键/值对列表应用到另一个 Reduce 函数，这时得到与前面一样的结果。换句话说，可以在输入域并行使用相同的操作，得到的结果是一样的，但速度更快。这便是 MapReduce 的威力；它的并行功能可在任意数量的系统上使用。图 2 以区段和迭代的形式演示这种思想。

现在回到 Hadoop 上，它是如何实现这个功能的？一个代表客户机在单个主系统上启动的 MapReduce 应用程序称为 JobTracker。类似于 NameNode，它是 Hadoop 集群中惟一负责控制 MapReduce 应用程序的系统。在应用程序提交之后，将提供包含在 HDFS 中的输入和输出目录。JobTracker 使用文件块信息（物理量和位置）确定如何创建其他 TaskTracker 从属任务。MapReduce 应用程序被复制到每个出现输入文件块的节点。将为特定节点上的每个文件块创建一个惟一的从属任务。每个 TaskTracker 将状态和完成信息报告给 JobTracker。图 3 显示一个示例集群中的工作分布。

Hadoop 的这个特点非常重要，因为它并没有将存储移动到某个位置以供处理，而是将处理移动到存储。这通过根据集群中的节点数调节处理，因此支持高效的数据处理。

Hadoop系统安装于配置

海量数据处理平台架构介绍

Hadoop能解决哪些问题

Hadoop在国内的情景

Hadoop简介

Hadoop生态系统介绍

HDFS简介

HDFS设计原则

HDFS系统结构

HDFS文件权限

HDFS文件读取

HDFS文件写入

HDFS文件存储

HDFS文件存储结构

HDFS开发常用命令

Hadoop管理员常用命令

HDFS API简介

用Java对HDFS编程

Mapreduce简介

编写MapReduce程序的步骤

MapReduce模型

MapReduce运行步骤

MapReduce执行流程

MapReduce基本流程

JobTracker(JT)和TaskTracker(TT)简介

Mapreduce原理

使用ZooKeeper来协作JobTracker

Hadoop Job Scheduler

mapreduce的类型与格式

mapreduce的数据类型与java类型对应关系

Writable接口

实现自定义的mapreduce类型

mapreduce驱动默认的设置

Combiners和Partitioner编程