<Impala><Overview><UDF>
Overview
- Apache Impala (incubating) is the open source, native analytic database for apache Hadoop.
Features
- Do BI-style Queries on Hadoop:
- low latency and high concurrency for BI/analytic queries on Hadoop(not delivered by batch frameworks such as Apache Hive).
- scales linearly, even in multitenant environments.
- Unify ur Infrasturecture: Utilize the same file and data formats and metadata, security, and resource management frameworks as your Hadoop deployment—no redundant infrastructure or data conversion/duplication.
- Implement Quickly: supports SQL
- Count on Enterprise-class Security
- Retain Freedom from Lock-in: open-source
- Expand the Hadoop User-verse
Architecuture
- Circumvents MapReduce to avoid latency, directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.
- Some advantages:
- Thx to local processing on data nodes, network bottlenecks are avoided.
- A signle, open, and unified metadata store can be utilized.
- Costly data format conversion is unnecessary and thus no overhead is incurred.
- All data is immediately query-able, with no delays for ETL.
- All hardware is utilized for Impala queries as well as for MR.
- Only a single machine pool is needed to scale.
Documentation
... skip
Impala User-Defined Functions(UDFs)
- UDF let you code ur own application logic for processing column values during an Impala query.
UDFS Concepts
- U can code either scalar functions for producing results one row at a time.
- Or more complex aggregate functions for doing analysis across.
UDFs and UDAFs
- The most general kind of udf takes single input value and produces a single output value. When used in a query, it is called once for each row in the result set. eg:
select customer_name, is_frequent_customer(customer_id) from customers;
select obfuscate(sensitive_column) from sensitive_data; - A user-defined aggergate function(UDAF) accepts a group of values and returns a single value. U can use UDAFs to summarize and condense sets of rows, in the same style as the built-in COUNT, MAX(), SUM(), and AVG() functions. When called in a query that uses the GROUP BY clause, the function is called once for each combination of GROUP BY values. eg:
-- Evaluates multiple rows but returns a single value
select closest_restaurant(latitude, longitude) from places; -- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_locartion(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year; - Currently, Impala does not support other categories of udf, such as user-defined table functions(UDTFs) or window functions.
Native Impala UDFs
- Impala supports UDFs written in C++, in addition to supporting existing Hive UDFs written in Java.
- Where practical, use C++ UDFs because the compiled native code can yield higher performance, with UDF execution time often 10x faster for a C++ UDF than the equivalent Java UDF.
Using Hive UDFs with Impala
- Impala can run Java-based user-defined functions (UDFs), originally written for Hive, with no changes, subject to the following conditions:
- The parameter and return value must all use scalar data types supported by Impala. That's to say, complex or nested types are not supported.
- Currently, Hive UDFs that accept or return the TIMESTAMP type are not supported.
- Hive UDAFs and UDTFs are not supported.
- Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF written in C++.
- What to do next?
- write ur udf
- upload the jar to a hdfs path(where impala can read)
- for each Java-based UDF that u want to call through Impala, issue a CREATE FUNCTION statement, with a LOCATION clause containing the full HDFS path or the JAR file, and a SYMBOL clause with the fully qualified name of the class, using dots as separators and without the .class extension. eg:
create function my_neg(bigint)
returns bigint location '/user/hive/udfs/hive.jar'
symbol = 'org.apache.hadoop.hive.ql.udf.UDFOPNegative'; - call the function from ur queries, passing arguments of the correct type to match the function signature.
FYI
<Impala><Overview><UDF>的更多相关文章
- 简单物联网:外网访问内网路由器下树莓派Flask服务器
最近做一个小东西,大概过程就是想在教室,宿舍控制实验室的一些设备. 已经在树莓上搭了一个轻量的flask服务器,在实验室的路由器下,任何设备都是可以访问的:但是有一些限制条件,比如我想在宿舍控制我种花 ...
- 利用ssh反向代理以及autossh实现从外网连接内网服务器
前言 最近遇到这样一个问题,我在实验室架设了一台服务器,给师弟或者小伙伴练习Linux用,然后平时在实验室这边直接连接是没有问题的,都是内网嘛.但是回到宿舍问题出来了,使用校园网的童鞋还是能连接上,使 ...
- 外网访问内网Docker容器
外网访问内网Docker容器 本地安装了Docker容器,只能在局域网内访问,怎样从外网也能访问本地Docker容器? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Docker容器 ...
- 外网访问内网SpringBoot
外网访问内网SpringBoot 本地安装了SpringBoot,只能在局域网内访问,怎样从外网也能访问本地SpringBoot? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装Java 1 ...
- 外网访问内网Elasticsearch WEB
外网访问内网Elasticsearch WEB 本地安装了Elasticsearch,只能在局域网内访问其WEB,怎样从外网也能访问本地Elasticsearch? 本文将介绍具体的实现步骤. 1. ...
- 怎样从外网访问内网Rails
外网访问内网Rails 本地安装了Rails,只能在局域网内访问,怎样从外网也能访问本地Rails? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Rails 默认安装的Rails端口 ...
- 怎样从外网访问内网Memcached数据库
外网访问内网Memcached数据库 本地安装了Memcached数据库,只能在局域网内访问,怎样从外网也能访问本地Memcached数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装 ...
- 怎样从外网访问内网CouchDB数据库
外网访问内网CouchDB数据库 本地安装了CouchDB数据库,只能在局域网内访问,怎样从外网也能访问本地CouchDB数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动Cou ...
- 怎样从外网访问内网DB2数据库
外网访问内网DB2数据库 本地安装了DB2数据库,只能在局域网内访问,怎样从外网也能访问本地DB2数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动DB2数据库 默认安装的DB2 ...
- 怎样从外网访问内网OpenLDAP数据库
外网访问内网OpenLDAP数据库 本地安装了OpenLDAP数据库,只能在局域网内访问,怎样从外网也能访问本地OpenLDAP数据库? 本文将介绍具体的实现步骤. 1. 准备工作 1.1 安装并启动 ...
随机推荐
- calc_load
http://www.penglixun.com/tech/system/how_to_calc_load_cpu.html #define FSHIFT 11 /* nr of bits of pr ...
- ubuntu opencv
sudo apt-get updatesudo apt-get upgrade sudo apt-get install build-essential libgtk2.0-dev libjpeg-d ...
- XXE漏洞
原理:XML外部实体注入,简称XXE漏洞,XML数据在传输中数据被修改,服务器执行被恶意插入的代码.当允许引用外部实体时,通过构造恶意内容,就可能导致任意文件读取.系统命令执行.内网端口探测.攻击内网 ...
- hbase知识
HBASE是一个高可靠性.高性能.面向列.可伸缩的分布式存储系统 HBASE的目标是存储并处理大型的数据,更具体来说是仅需使用普通的硬件配置,就能够处理由成千上万的行和列所组成的大型数据. HBASE ...
- excel导入 导出
PHP页面 //设置header header("content-Type:text/html;charset=utf-8"); //设置文件大小的限制 ini_set(" ...
- dubbo初认知(dubbo和springCloud关系,在微服务架构中的作用等)(持续更新中)
一:dubbo是什么? dobbuo是阿里开源的一个高性能优秀的服务框架, 可通过高性能的 RPC 实现服务的输出和输入功能,使得应用可以和 高性能的rpc实现输入和输出的功能,可以了 Spring ...
- 二十、MVC的WEB框架(Spring MVC)
一.Spring MVC 运行原理:客户端请求提交到DispatcherServlet,由DispatcherServlet控制器查询HandlerMapping,找到并分发到指定的Controlle ...
- 十二、持久层框架(MyBatis)
一.PageHelper分页插件的使用 PageHelper是一款MyBatis的分页插件,只需要简单的配置,然后直接调用方法就可以. 1.配置PageHelper插件 在mybatis-config ...
- 3.Liunx网络管理命令
大纲: 1.网络信息:hostname.netstat.ifconfig ,route 2.网络配置:netconfig 3.网络测试:ping
- Oracle Shared Pool之Library Cache
1. Shared Pool组成 Shared Pool由许多区间(Extent)组成,这些区间又由多个连续的内存块(Chunk)组成,这些内存块大小不一.从逻辑功能角度,Shared pool主要包 ...