1. I am Charles Humble and I am here at QCon London with Eva Andreasson from Cloudera. Eva, can you introduce yourself to the InfoQ community?

Who am I? I am Eva Andreasson and I am a product manager working for Cloudera at the moment. I also have a past as a JVM developer and I love Java.

2. So Cloudera is the main distributor for Apache Hadoop. Can you tell us a bit about what Hadoop is?

For someone who does not know what Hadoop is, it is a new way of processing data. Many people might be familiar with a database or an enterprise data warehouse, and with how you structure data up front to serve different queries – the business-insight questions you want to ask of your data to serve your business.

Hadoop has come in as a very new way of thinking about how to process data. It kind of turns the world upside down. It actually allows you to store your data as is; you do not have to structure it at the point of storage. You can decide later, as use cases come up, how and what structure to apply to your data to serve various questions; meaning you do not have to scale down or lose any of your data to fit into traditional systems – you can actually store everything and decide later what you want to do with it. That is kind of the power of Hadoop. In addition, it is open source, it is linearly scalable, and it is very, very cost efficient. So it is this flexibility, on top of a very cost-efficient storage and processing framework, that has revolutionized the world in the data processing space.

3. Let's drill into that a little bit. So you have a couple of components; you have got a distributed file system [EVA: Yes] and then you've got HBase as well, which is kind of like a key value store, as I understand it, on top of HDFS. Is that right?

It is a column-oriented key-value store where you can have sparsely populated data. Many people use HBase, for instance, to save all the clicks on a web site. A visitor might not click all the links, but for each visitor you save the data that visitor A clicked on these 10 links, or visitor B clicked on 50 links, allowing you to match behavior. So you can look up a customer’s behavior on your web site and work out who else is similar, as they are visiting your web site, and maybe dynamically change the experience – present similar items that a previous customer with similar behavior actually purchased. So you can adapt the customer experience and deliver a better user experience, a shorter sales cycle, and a more efficient path to the information you want on a web page, while the visitors are there. There are very interesting use cases you can address with HBase.
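To make that concrete, here is a minimal sketch of what storing and reading clickstream data through the HBase Java client could look like. The table name "clicks", the column family "link", and the row keys are illustrative assumptions, not details from the interview, and the exact client classes depend on the HBase version you run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table clicks = conn.getTable(TableName.valueOf("clicks"))) {

            // Row key = visitor id; one column per clicked link (sparse by design).
            Put put = new Put(Bytes.toBytes("visitorA"));
            put.addColumn(Bytes.toBytes("link"), Bytes.toBytes("/red-sweater"),
                          Bytes.toBytes(System.currentTimeMillis()));
            clicks.put(put);

            // Look up everything visitor A has clicked so far.
            Result row = clicks.get(new Get(Bytes.toBytes("visitorA")));
            row.getFamilyMap(Bytes.toBytes("link"))
               .forEach((qualifier, value) ->
                   System.out.println(Bytes.toString(qualifier)));
        }
    }
}
```

Because each row only carries columns for the links that visitor actually clicked, the sparseness described above comes for free: unclicked links simply never appear in the row.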

4. If I am executing a MapReduce query on a set of files on HDFS where does that actually happen? Is the data being brought in, and is it being executed centrally, or is it being executed on all the individual nodes and then assembled afterwards? How does it work?

MapReduce is a Java framework and it follows the MapReduce algorithm, where in the mapping phase you write some code describing what data is of relevance to solve a specific use case, a question you want to ask. So, basically: return all the rows from all my files, which are stored in a distributed way on HDFS, that contain “red sweater”, or something like that. Then you get all the data that contains “red sweater”. The map step just maps it out, extracts that data, and puts it into temporary work files that you can then apply logic to. Then you do the reducing step. You actually apply the structure in the mapping phase, while applying the logic in the reducing phase. So now you can count all those entries containing “red sweater”, and you can do this over a petabyte of data, or 100 terabytes, or 1 terabyte. It does not matter, because it is parallel processing and linearly scalable with as many nodes as you add to it. So the aggregation and logic in the reducing step returns the end result to you.

So it is a two-step algorithm, executed in parallel on a distributed file system and then aggregated back to you, with no need for structure up front. You decide the structure when you design your MapReduce job. But that is the simple way of using Hadoop. There are more advanced ways today.
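As a rough illustration of the two phases, here is a minimal MapReduce job in the spirit of the “red sweater” example above; the input and output paths, the job name, and the exact counting logic are assumptions made for the sketch, not something from the interview.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RedSweaterCount {

    // Map phase: pick out only the rows that mention "red sweater".
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("red sweater");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            if (line.toString().toLowerCase().contains("red sweater")) {
                ctx.write(KEY, ONE);
            }
        }
    }

    // Reduce phase: apply the logic – here, simply count the matching rows.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            ctx.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "red-sweater-count");
        job.setJarByClass(RedSweaterCount.class);
        job.setMapperClass(FilterMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a JAR and submit it with `hadoop jar`, pointing it at an input and an output directory on HDFS; Hadoop then runs the mappers next to the data blocks and the reducer aggregates the partial counts into the end result.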

5. That is quite a big shift, I think, in terms of how we think about working with data. So, if I am a large traditional enterprise, perhaps I have a big BI team and a bunch of DBAs who are very used to working with relational data, how easy is it for me to make the switch to working with something like Hadoop?

That is a very good question, and I get it a lot. Working for Cloudera, we help customers do that all the time. A lot has happened in recent years. It is not just MapReduce, the batch-oriented Java framework. You have a number of tools, a number of workloads, that you can actually execute on the same integrated file storage. You have Hive, which is a batch-oriented query engine, and you have Impala, which is a real-time query engine, both supporting standard SQL as well as JDBC and ODBC. So at the application layer, many of the BI tools can already work on top of Hadoop and its ecosystem today.
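For example, because Hive and Impala both speak the HiveServer2 protocol, a BI tool or a plain Java program can query them over JDBC with ordinary SQL. The host name, port, table, and connection options below are placeholder assumptions for the sketch; the driver class is the standard Hive JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
    public static void main(String[] args) throws Exception {
        // Impala exposes the HiveServer2 protocol, so the Hive JDBC driver can be used.
        // Host, port, database, and auth settings are placeholders for this sketch.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT product, COUNT(*) AS orders " +
                 "FROM sales WHERE product = 'red sweater' GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + ": " + rs.getLong("orders"));
            }
        }
    }
}
```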

Then you also have search, for instance free-text search; we have recently integrated Solr into Cloudera’s distribution of Hadoop. Then you have other frameworks, machine learning, and even the ability to run SAS workloads natively on this platform, which allows you to do many of the workloads you previously did in your enterprise data warehouse, for instance, in a much more scalable and cost-efficient way on Hadoop.
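A free-text search against a Solr collection looks similarly conventional from Java. This sketch assumes a recent SolrJ client and an illustrative collection called "products"; none of the names come from the interview.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FreeTextSearch {
    public static void main(String[] args) throws Exception {
        // Host name and collection are placeholders; assumes a SolrJ version with the Builder API.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://solr-host:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("description:\"red sweater\"");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> "
                                   + doc.getFieldValue("description"));
            }
        }
    }
}
```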

I would, though, recommend starting with something very well defined, something you know is repetitive, where you have known results in your enterprise data warehouse and where you struggle on the business side, perhaps with scale – you have to finish more data processing within the same fixed time window, which is a very common scenario. Take a well-scoped project and work with someone like Cloudera, some other vendor that has experience, or a systems integrator that has experience doing this kind of migration, and then learn from that. After that, you will be an expert, you will see for sure the value of having Hadoop in your enterprise, and you can grow incrementally. That is how our customers have been successful.

6. That is interesting. So I have got very different workloads there if I've got Impala and MapReduce and Solr all running on the same platform. How do you deal with the resource management issues that I'm presuming that would throw up?

Yes. That is a very good question. We also get that a lot. With the new version coming out in March, we have a productionized YARN, which is a new way of resource scheduling. It is now production ready and it is going to be the default resource scheduler within the entire platform. We have also added ways to control resources through our production tool – Cloudera Manager – to make sure you can isolate various workloads from starving each other. We are starting with static resource allocation for workloads that are more RAM focused, in-memory focused, such as, for instance, Solr and HBase; so you can ensure they can serve those real-time use cases securely while your data scientists or developer teams explore with MapReduce, modeling, and machine learning in the background, without affecting those running services.

But the whole idea is to control, through Cloudera Manager, which user group a workload belongs to, what priority it has, and what resources should be firmly allocated or dynamically allocated to various use cases; that is key. That is what we provide through our production tool.
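One small, hedged illustration of workload isolation from the developer’s side: a batch job can ask to run in a specific YARN queue, and that queue’s share of cluster resources is then governed by the scheduler configuration, which a management tool such as Cloudera Manager can maintain. The queue name and memory figures below are assumptions made for the sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueueSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Route this batch job to a dedicated YARN queue so it cannot starve
        // latency-sensitive services; "analytics" is a placeholder queue name
        // defined in the scheduler configuration.
        conf.set("mapreduce.job.queuename", "analytics");

        // Per-task memory requests the scheduler uses when sizing containers.
        conf.set("mapreduce.map.memory.mb", "1024");
        conf.set("mapreduce.reduce.memory.mb", "2048");

        // A pass-through job (default identity mapper/reducer) just to show submission.
        Job job = Job.getInstance(conf, "background-model-training");
        job.setJarByClass(QueueSubmission.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```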

7. Do you have tools for failover as well if, say, an HBase server goes down?

That is actually handled by ZooKeeper – all these funny animal names in the Apache Hadoop ecosystem! ZooKeeper handles the failover and the process management; it is a very key component that is seldom highlighted, but it definitely deserves the spotlight. It handles HBase process management, it handles SolrCloud process management, and it handles a lot of the failover that you just expect from Cloudera’s distribution in production.
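Underneath those services, the building block ZooKeeper provides is quite simple: processes register ephemeral znodes, and watchers are notified when a znode disappears because its owner died. The sketch below shows that mechanism with the plain ZooKeeper Java client; the host name and paths are illustrative assumptions, not how HBase or SolrCloud actually lay out their znodes.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LivenessWatcher {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Ensure the parent path exists (persistent node).
        if (zk.exists("/services", false) == null) {
            zk.create("/services", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // A service announces itself with an EPHEMERAL znode; ZooKeeper deletes it
        // automatically if the session dies, which is how peers learn about failures.
        zk.create("/services/region-server-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Another process can watch the parent path and react to membership changes.
        List<String> live = zk.getChildren("/services", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                System.out.println("Membership changed - trigger failover logic here");
            }
        });
        System.out.println("Currently live: " + live);
    }
}
```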

8. If I am, say, a Java developer and I want to get to grips with Hadoop, where would you suggest I start?

Come to our free online training on our website; that is a good place. Or download the quick-start VM to experiment and to wrap your head around this, because it opens the door to so many possibilities. You need that learning, you need that experimentation. You can also go and download HUE – Hadoop User Experience. It is a project launched by Cloudera, also open source. There are a lot of “how to” videos that show different applications and how you can interact with the back end – the various workloads on the enterprise data hub today. And ask us questions in our forums or in the open-source community. There are many, many people involved; many engineers happy to help.

9. So of course you have a background with working on GC for Java and I am interested in ... We have talked a little bit about RAM intensive workloads with the search engine, with Solr and so on. Are there specific unsolved GC problems that have a significant impact on the work you are doing with Hadoop?

There are multiple questions in there. Let me try to think and dissect them. I think as we have brought more real-time workloads to Hadoop, to the enterprise data hub, there is more need for in-memory workloads such as Solr, HBase, and Impala. Solr and HBase are very RAM intensive and they are both based on Java, which means they rely on good garbage collection algorithms. And, as you might have heard in my talk, the biggest problem today that I see with garbage collection algorithms is the “stop-the-world” operations; that is especially evident when it comes to compaction and the way fragmented heaps are handled; and fragmented heaps are the result of long-running Java processes such as, for instance, Solr processes or HBase processes. So I definitely see GC problems coming down the road, if not already here, with these kinds of workloads.

If you look at the other workloads – spinning up processes on demand, applying structure to data, returning the result – it is a more dynamic, up-and-down spin; that is a different profile. There I do not see much garbage collection innovation helping.

But for those profiles where there are long-running processes, we need to solve how we get all the threads to a safepoint, which I think is the biggest bottleneck for stop-the-world operations in garbage collectors today.

10. Obviously we've got Java 8 coming out any minute. Is there anything in Java 8 that is sort of exciting for you?

That is a trick question. There is a lot of good stuff in Java 8. I have not been able to catch up on everything yet, but I am excited, for other reasons not related to my work, about the Nashorn project, which will help JavaScript a lot. And I am also excited about G1, which many people talk about, but it hasn’t really solved the problem of stop-the-world operations, so I have yet to get very excited about the garbage collection side of Java going forward.
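For anyone curious, Nashorn is exposed through the standard javax.script API in Java 8, so a one-liner is enough to try it; this tiny example is just an illustration, not something discussed in the interview.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class NashornHello {
    public static void main(String[] args) throws Exception {
        // Java 8 ships the Nashorn JavaScript engine behind the standard scripting API.
        ScriptEngine nashorn = new ScriptEngineManager().getEngineByName("nashorn");
        Object result = nashorn.eval(
            "var greet = function(n) { return 'Hello, ' + n; }; greet('InfoQ');");
        System.out.println(result); // prints: Hello, InfoQ
    }
}
```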

Charles: Great. OK. Thank you very much.

Thank you.
