1. I am Charles Humble and I am here at QCon London with Eva Andreasson from Cloudera. Eva, can you introduce yourself to the InfoQ community?

Who am I? I am Eva Andreasson and I am a product manager working for Cloudera at the moment. I also have a past as a JVM developer and I love Java.

2. So Cloudera is the main distributor for Apache Hadoop. Can you tell us a bit about what Hadoop is?

For someone who does not know what Hadoop is, it is a new way of processing data. Many people are familiar with a database or an enterprise data warehouse, and how you structure data up front to serve different queries, the business insight queries you want to ask of your data to serve your business.

Hadoop has come in as a very new way of thinking about how to process data. It kind of turns the world upside down. It allows you to store your data as is; you do not have to structure it at the point of storage. You can decide later, as use cases come up, how and what structure to apply to your data to serve various questions, meaning you do not have to scale down or lose any of your data to fit into traditional systems. You can actually store everything and decide later what you want to do with it. That is the power of Hadoop. In addition, it is open source, it is linearly scalable and it is very, very cost efficient. So it is this flexibility, on top of a very cost-efficient storage and processing framework, that has revolutionized the world of data processing.

3. Let's drill into that a little bit. So you have a couple of components; you have got a distributed file system [EVA: Yes] and then you've got HBase as well, which is kind of like a key value store, as I understand it, on top of HDFS. Is that right?

It is a column-oriented key-value store where you can have sparsely populated data. Many people use HBase, for instance, to save all the clicks on a web site. A visitor might not click every link, but for each visitor you save the data that visitor A clicked on these 10 links, or visitor B clicked on 50 links, allowing you to match behavior. So you can look up a customer’s behavior on your web site, work out who else is similar as they are visiting, and maybe dynamically change the experience: present similar items that a previous customer with similar behavior actually purchased. So you can adapt the customer experience and deliver a better user experience, a shorter sales cycle, and a more efficient path to the information a visitor wants on a web page, while they are there. There are very interesting use cases you can build with HBase.
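As a rough sketch of that click-tracking idea, the HBase Java client API can be used along these lines; the table name, column family and row keys are hypothetical, chosen only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table clicks = conn.getTable(TableName.valueOf("visitor_clicks"))) {

            // One row per visitor, one column per clicked link; rows stay sparse
            // because a visitor only gets columns for the links they actually clicked.
            Put put = new Put(Bytes.toBytes("visitorA"));
            put.addColumn(Bytes.toBytes("links"), Bytes.toBytes("/red-sweater"), Bytes.toBytes(1L));
            clicks.put(put);

            // Later, fetch visitor A's behavior to compare against other visitors.
            Result row = clicks.get(new Get(Bytes.toBytes("visitorA")));
            row.getFamilyMap(Bytes.toBytes("links")).keySet()
               .forEach(link -> System.out.println(Bytes.toString(link)));
        }
    }
}
```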

4. If I am executing a MapReduce query on a set of files on HDFS, where does that actually happen? Is the data being brought in and executed centrally, or is it being executed on all the individual nodes and then assembled afterwards? How does it work?

MapReduce is a Java framework and it follows the MapReduce algorithm, where in the mapping phase you write some code describing what data is relevant to a specific use case, a question you want to ask. So basically: return all the rows from all my files, stored in a distributed way on HDFS, that contain “red sweater”, or something like that. Then you get all the data that contains “red sweater”. The map phase just maps it out, extracts that data, and puts it into temporary work files that you can then apply logic to. Then you do the reducing step. You actually apply the structure in the mapping phase, while applying the logic in the reducing phase. So now you can count all those entries containing “red sweater”, and you can do this over a petabyte of data, or 100 terabytes, or 1 terabyte. It does not matter, because it is parallel processing and it scales linearly with as many nodes as you add to it. The aggregation and logic in the reducing step returns the end result to you.
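As a minimal sketch of that “red sweater” example, assuming the classic Hadoop MapReduce Java API, a job could extract the matching lines in the map phase and count them in the reduce phase; the class and path names here are made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RedSweaterCount {

    // Map phase: extract only the rows that are relevant to the question.
    public static class MatchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text KEY = new Text("red sweater");
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            if (line.toString().contains("red sweater")) {
                ctx.write(KEY, ONE);   // one intermediate record per matching line
            }
        }
    }

    // Reduce phase: apply the logic, here a simple aggregation (a count).
    public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "red sweater count");
        job.setJarByClass(RedSweaterCount.class);
        job.setMapperClass(MatchMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of logs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same job runs unchanged whether the input is one terabyte or a petabyte; the framework spreads the map tasks over however many nodes hold the data.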

So it is a two-step algorithm, executed in parallel on a distributed file system and then aggregated back to you, with no need for structure up front. You decide that when you design your MapReduce job. But that is the simple way of using Hadoop. There are more advanced ways today.

5. That is quite a big shift, I think, in terms of how we think about working with data. So, if I am a large traditional enterprise, perhaps I have a big BI team and a bunch of DBAs who are very used to working with relational data, how easy is it for me to make the switch to working with something like Hadoop?

That is a very good question and I get it a lot. Working for Cloudera, we help customers do that all the time. A lot has happened in recent years. It is not just MapReduce, the batch-oriented Java framework. You have a number of tools, a number of workloads, that you can execute on the same integrated file storage. You have Hive, which is a batch-oriented query engine, and you have Impala, which is a real-time query engine, both supporting standard SQL as well as JDBC and ODBC. So at the query and application layer, many of the BI tools can already work on top of Hadoop and its ecosystem today.
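To give a flavour of that, a BI tool or any Java application can talk to Hive or Impala over plain JDBC. The sketch below assumes the HiveServer2 JDBC driver on the classpath and uses a made-up host, port and table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SweaterQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2-style JDBC URL; Impala accepts a similar connection string on its own port.
        String url = "jdbc:hive2://example-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM clicks WHERE item = 'red sweater'")) {
            while (rs.next()) {
                System.out.println("matching rows: " + rs.getLong(1));
            }
        }
    }
}
```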

Then you also have search, for instance free-text search; we have recently integrated Solr into Cloudera’s distribution of Hadoop. Then you have other frameworks, machine learning, and even running SAS workloads natively on this platform, which allows you to do many of the workloads you previously did in your enterprise data warehouse, for instance, in a much more scalable and cost-efficient way on Hadoop.
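For the free-text search piece, querying a Solr collection from Java with SolrJ might look like the following sketch; the Solr URL and collection name are illustrative only.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ProductSearch {
    public static void main(String[] args) throws Exception {
        // Point at a Solr collection served from the cluster (the URL is made up).
        HttpSolrServer solr = new HttpSolrServer("http://example-host:8983/solr/products");

        SolrQuery query = new SolrQuery("red sweater");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("name"));
        }
    }
}
```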

I would, though, recommend starting with something very well defined, something you know is repetitive, where you have known results in your enterprise data warehouse and where you struggle on the business side, perhaps with scale: you have to finish more data processing within the same fixed time window. That is a very common scenario. Take a well scoped-out project and work with someone like Cloudera, or some other vendor that has experience, or an SI that has experience doing this kind of migration, and then learn from that. After that you will be an expert, you will see for sure the value of having Hadoop in your enterprise, and you can grow incrementally. That is how our customers have been successful.

6. That is interesting. So I have got very different workloads there if I've got Impala and MapReduce and Solr all running on the same platform. How do you deal with things like the resource management issues that presumably come up?

Yes. That is a very good question. We also get that a lot. With the new version coming out in March, we have a productionized YARN, which is a new way of doing resource scheduling. It is now production ready and it is going to be the default scheduling capability within the entire platform. We have also added ways to control resources through our production tool, Cloudera Manager, to make sure you can isolate various workloads from starving each other. We are starting with static resource allocation for workloads that are more RAM focused, in-memory focused, such as Solr and HBase, so you can ensure they can serve those real-time use cases securely while your data scientists or developer team explore with MapReduce, modeling and machine learning in the background, without affecting those running services.

But the whole idea is to control, through Cloudera Manager, which user group a workload belongs to, what priority it has, and what resources should be firmly allocated or dynamically allocated to various use cases; that is key. That is what we provide through our production tool.
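Much of this maps onto scheduler queues in YARN underneath. As a small sketch, assuming the standard YARN client API, an administrator or tool can inspect how cluster capacity has been divided between queues, for example one queue per team or workload:

```java
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListQueues {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each queue has its own share of cluster resources, so a background batch
        // job cannot starve another team's queue of containers.
        for (QueueInfo queue : yarn.getAllQueues()) {
            System.out.printf("%s capacity=%.2f current=%.2f%n",
                    queue.getQueueName(), queue.getCapacity(), queue.getCurrentCapacity());
        }
        yarn.stop();
    }
}
```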

7. Do you have tools for failover as well, if, say, an HBase server goes down?

That is actually handled by ZooKeeper, which is the ... all these funny animal names in the Hadoop ecosystem. The Apache Hadoop ecosystem has ZooKeeper handling the failover and the process management; a very key component that is seldom highlighted but definitely deserves the spotlight. It handles HBase process management, it handles SolrCloud process management, and it handles a lot of the failover that you just expect from Cloudera’s distribution in production.
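The mechanism underneath that failure detection is essentially ZooKeeper's ephemeral znodes and watches: a process registers itself with an ephemeral node, the node disappears automatically when the process's session dies, and anyone watching is notified. A minimal sketch with made-up paths:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LivenessWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("example-host:2181", 30000, event -> { });

        // Parent node for registrations (persistent).
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // A server process registers itself with an ephemeral node; the node
        // vanishes automatically if the process (its ZooKeeper session) dies.
        zk.create("/servers/regionserver-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Another process can watch that node; the disappearance event is what
        // lets the system trigger failover.
        Watcher onChange = (WatchedEvent event) ->
                System.out.println("event: " + event.getType() + " on " + event.getPath());
        zk.exists("/servers/regionserver-1", onChange);
    }
}
```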

8. If I am, say, a Java developer and I want to get to grips with Hadoop, where would you suggest I start?

Come to our free online training on our web site; that is a good place to start. Or download the quick-start VM to experiment and wrap your head around this, because it opens the door to so many possibilities. You need that learning, you need that experimentation. You can also go and download HUE, the Hadoop User Experience. It is a project launched by Cloudera, also open source. There are a lot of “how to” videos that show different applications and how you can interact with the back end, the various workloads on the enterprise data hub today. And ask us questions in our forums or in the open-source community. There are many, many people involved; many engineers happy to help.

9. So of course you have a background working on GC for Java, and I am interested in ... We have talked a little bit about RAM-intensive workloads with the search engine, with Solr and so on. Are there specific unsolved GC problems that have a significant impact on the work you are doing with Hadoop?

There are multiple questions in there. Let me try to think and dissect them. I think as we bring more real-time workloads to Hadoop, to the enterprise data hub, there is more need for in-memory workloads such as Solr, HBase and Impala. Solr and HBase are very RAM intensive and they are both based on Java, which means they rely on good garbage collection algorithms. And, as you might have heard in my talk, the biggest problem I see with garbage collection algorithms today is the “stop-the-world” operations; that is especially evident when it comes to compaction and the way fragmented heaps are handled, and fragmented heaps are the result of a long-running Java process such as, for instance, a Solr process or an HBase process. So I definitely see GC problems coming down the road, if they are not already there, with these kinds of workloads.

If you look at the other workloads, like suddenly spinning up processes, applying structure to data and returning the result, it is a more dynamic up-and-down pattern; that is a different profile. There I do not see garbage collection innovation helping much.

But for those profiles where there are long-running processes, we need to solve how we get all the threads to a safepoint, which I think is the biggest bottleneck for stop-the-world operations in garbage collectors today.
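One simple way to see how much time a long-running JVM such as an HBase or Solr process has accumulated in collection, without attaching a profiler, is the standard GarbageCollectorMXBean API; a minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {
    public static void main(String[] args) {
        // Allocate some garbage so the collectors have something to do.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] chunk = new byte[1024];
        }

        // One bean per collector (for example young and old generation);
        // getCollectionTime() is the approximate accumulated collection time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

For per-pause detail, HotSpot GC logging flags such as -verbose:gc and -XX:+PrintGCApplicationStoppedTime (which reports how long application threads were stopped) are typically used.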

10. Obviously we've got Java 8 coming out any minute. Is there anything in Java 8 that is sort of exciting for you?

That is a trick question. There is a lot of good stuff in Java 8. I have not been able to catch up on everything yet, but I am excited, for reasons not related to my work, about the Nashorn project, which will help JavaScript a lot. I am also excited about G1, which many people talk about, but it hasn’t really solved the problem of stop-the-world operations, so I have yet to see something that gets me very excited about the garbage collection part of Java going forward.
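For reference, Nashorn ships with Java 8 and is reachable through the standard javax.script API; a tiny, self-contained example:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class NashornHello {
    public static void main(String[] args) throws Exception {
        // Java 8 bundles the Nashorn JavaScript engine under the name "nashorn".
        ScriptEngine js = new ScriptEngineManager().getEngineByName("nashorn");
        js.eval("var greeting = 'Hello from JavaScript on the JVM'; print(greeting);");
        Object sum = js.eval("1 + 2");
        System.out.println("evaluated in Java: " + sum);
    }
}
```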

Charles: Great. OK. Thank you very much.

Thank you.
