hadoop入门之海量Web日志分析 用Hadoop提取KPI统计指标
转载自:
http://blog.fens.me/hadoop-mapreduce-log-kpi/
今天学习了这一篇博客,写得十分好,照着这篇博客敲了一遍。
发现几个问题,
一是这篇博客中采用的hadoop版本过低,如果在hadoop2.x上面跑的话,可能会出现结果文件没有写入任何数据,为了解决这个问题,我试着去参照官网http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html的API进行操作,发现官网里讲得十分详细,只要有一点英文基础的同行都可以看得懂,直白简单。hadoop2.x相比较hadoop1.x而言编写Mapper类,可以直接继承import org.apache.hadoop.mapreduce.Mapper;无需再实现Mapper接口了,其中关于map方法的写法也变了改成如下:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// TODO Auto-generated method stub
KPI kpi = KPI.filterPVS(value.toString());
System.out.println(kpi);
if (kpi.isValid()) {
word.set(kpi.getIp());
context.write(word, one);
}
}
hadoop1.x的写法如下:
@Override
public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {
KPI kpi = KPI.filterPVs(value.toString());
if (kpi.isValid()) {
word.set(kpi.getRequest());
output.collect(word, one);
}
}
hadoop2.x的写法就必须改变了,相应的Reducer中的reduce方法随之改变。一开始没有发现文中的github网址去百度了一下费了很大劲找到了一个150多M的文件,需要自取:
链接: https://pan.baidu.com/s/1hz5dTX69Hc_l9Aj-axvfqw 提取码: ssys 复制这段内容后打开百度网盘手机App,操作更方便哦,当然这个日志文件内容与博客的不一致,少了两个属性,请自行对照代码修改。
二、在hadoop2.x上面运行,在main方法里配置运行参数我这次使用的hadoop2.9.2这个版本的,需要用到winuitil.exe和hadoop.dll这两个工具。已经上传到百度网盘上面,地址如下,链接: https://pan.baidu.com/s/1RTSeGjV2VwWxRAvsUMkkrA 提取码: dkxt ,有三个文件分别是hadoop.2.9.2,eclipse插件,以及winutil,需要把hadoo2.6x里面的文件全部复制到hadoop.2.9.2/bin文件夹下,其中hadoop2.6.x中的haoop.dll需要复制到c:/Windows/System32目录下。关闭所有应用重启计算机,在main方法中设置如下系统属性:
System.setProperty("HADOOP_HOME", "E:\\hadoop\\hadoop2.6");
System.setProperty("hadoop.home.dir", "E:\\hadoop\\hadoop-2.9.2");
System.setProperty("HADOOP_USER_NAME", "hadoop");
设置好以后运行会报错:Acess$0之类的错误:遇到这种情况,在项目src下新建NativeIO.java文件,修改如下:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.io.nativeio; import java.io.File;
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap; import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeys;
import org.apache.hadoop.fs.HardLink;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SecureIOUtils.AlreadyExistsException;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.util.PerformanceAdvisory; import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import sun.misc.Unsafe; import com.google.common.annotations.VisibleForTesting; /**
* JNI wrappers for various native IO-related calls not available in Java.
* These functions should generally be used alongside a fallback to another
* more portable mechanism.
*/
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class NativeIO {
public static class POSIX {
// Flags for open() call from bits/fcntl.h - Set by JNI
public static int O_RDONLY = -1;
public static int O_WRONLY = -1;
public static int O_RDWR = -1;
public static int O_CREAT = -1;
public static int O_EXCL = -1;
public static int O_NOCTTY = -1;
public static int O_TRUNC = -1;
public static int O_APPEND = -1;
public static int O_NONBLOCK = -1;
public static int O_SYNC = -1; // Flags for posix_fadvise() from bits/fcntl.h - Set by JNI
/* No further special treatment. */
public static int POSIX_FADV_NORMAL = -1;
/* Expect random page references. */
public static int POSIX_FADV_RANDOM = -1;
/* Expect sequential page references. */
public static int POSIX_FADV_SEQUENTIAL = -1;
/* Will need these pages. */
public static int POSIX_FADV_WILLNEED = -1;
/* Don't need these pages. */
public static int POSIX_FADV_DONTNEED = -1;
/* Data will be accessed once. */
public static int POSIX_FADV_NOREUSE = -1; // Updated by JNI when supported by glibc. Leave defaults in case kernel
// supports sync_file_range, but glibc does not.
/* Wait upon writeout of all pages
in the range before performing the
write. */
public static int SYNC_FILE_RANGE_WAIT_BEFORE = 1;
/* Initiate writeout of all those
dirty pages in the range which are
not presently under writeback. */
public static int SYNC_FILE_RANGE_WRITE = 2;
/* Wait upon writeout of all pages in
the range after performing the
write. */
public static int SYNC_FILE_RANGE_WAIT_AFTER = 4; private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class); // Set to true via JNI if possible
public static boolean fadvisePossible = false; private static boolean nativeLoaded = false;
private static boolean syncFileRangePossible = true; static final String WORKAROUND_NON_THREADSAFE_CALLS_KEY =
"hadoop.workaround.non.threadsafe.getpwuid";
static final boolean WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT = true; private static long cacheTimeout = -1; private static CacheManipulator cacheManipulator = new CacheManipulator(); public static CacheManipulator getCacheManipulator() {
return cacheManipulator;
} public static void setCacheManipulator(CacheManipulator cacheManipulator) {
POSIX.cacheManipulator = cacheManipulator;
} /**
* Used to manipulate the operating system cache.
*/
@VisibleForTesting
public static class CacheManipulator {
public void mlock(String identifier, ByteBuffer buffer,
long len) throws IOException {
POSIX.mlock(buffer, len);
} public long getMemlockLimit() {
return NativeIO.getMemlockLimit();
} public long getOperatingSystemPageSize() {
return NativeIO.getOperatingSystemPageSize();
} public void posixFadviseIfPossible(String identifier,
FileDescriptor fd, long offset, long len, int flags)
throws NativeIOException {
NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset,
len, flags);
} public boolean verifyCanMlock() {
return NativeIO.isAvailable();
}
} /**
* A CacheManipulator used for testing which does not actually call mlock.
* This allows many tests to be run even when the operating system does not
* allow mlock, or only allows limited mlocking.
*/
@VisibleForTesting
public static class NoMlockCacheManipulator extends CacheManipulator {
public void mlock(String identifier, ByteBuffer buffer,
long len) throws IOException {
LOG.info("mlocking " + identifier);
} public long getMemlockLimit() {
return 1125899906842624L;
} public long getOperatingSystemPageSize() {
return 4096;
} public boolean verifyCanMlock() {
return true;
}
} static {
if (NativeCodeLoader.isNativeCodeLoaded()) {
try {
Configuration conf = new Configuration();
workaroundNonThreadSafePasswdCalls = conf.getBoolean(
WORKAROUND_NON_THREADSAFE_CALLS_KEY,
WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT); initNative();
nativeLoaded = true; cacheTimeout = conf.getLong(
CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_KEY,
CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_DEFAULT) *
1000;
LOG.debug("Initialized cache for IDs to User/Group mapping with a " +
" cache timeout of " + cacheTimeout/1000 + " seconds."); } catch (Throwable t) {
// This can happen if the user has an older version of libhadoop.so
// installed - in this case we can continue without native IO
// after warning
PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
}
}
} /**
* Return true if the JNI-based native IO extensions are available.
*/
public static boolean isAvailable() {
return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
} private static void assertCodeLoaded() throws IOException {
if (!isAvailable()) {
throw new IOException("NativeIO was not loaded");
}
} /** Wrapper around open(2) */
public static native FileDescriptor open(String path, int flags, int mode) throws IOException;
/** Wrapper around fstat(2) */
private static native Stat fstat(FileDescriptor fd) throws IOException; /** Native chmod implementation. On UNIX, it is a wrapper around chmod(2) */
private static native void chmodImpl(String path, int mode) throws IOException; public static void chmod(String path, int mode) throws IOException {
if (!Shell.WINDOWS) {
chmodImpl(path, mode);
} else {
try {
chmodImpl(path, mode);
} catch (NativeIOException nioe) {
if (nioe.getErrorCode() == 3) {
throw new NativeIOException("No such file or directory",
Errno.ENOENT);
} else {
LOG.warn(String.format("NativeIO.chmod error (%d): %s",
nioe.getErrorCode(), nioe.getMessage()));
throw new NativeIOException("Unknown error", Errno.UNKNOWN);
}
}
}
} /** Wrapper around posix_fadvise(2) */
static native void posix_fadvise(
FileDescriptor fd, long offset, long len, int flags) throws NativeIOException; /** Wrapper around sync_file_range(2) */
static native void sync_file_range(
FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException; /**
* Call posix_fadvise on the given file descriptor. See the manpage
* for this syscall for more information. On systems where this
* call is not available, does nothing.
*
* @throws NativeIOException if there is an error with the syscall
*/
static void posixFadviseIfPossible(String identifier,
FileDescriptor fd, long offset, long len, int flags)
throws NativeIOException {
if (nativeLoaded && fadvisePossible) {
try {
posix_fadvise(fd, offset, len, flags);
} catch (UnsatisfiedLinkError ule) {
fadvisePossible = false;
}
}
} /**
* Call sync_file_range on the given file descriptor. See the manpage
* for this syscall for more information. On systems where this
* call is not available, does nothing.
*
* @throws NativeIOException if there is an error with the syscall
*/
public static void syncFileRangeIfPossible(
FileDescriptor fd, long offset, long nbytes, int flags)
throws NativeIOException {
if (nativeLoaded && syncFileRangePossible) {
try {
sync_file_range(fd, offset, nbytes, flags);
} catch (UnsupportedOperationException uoe) {
syncFileRangePossible = false;
} catch (UnsatisfiedLinkError ule) {
syncFileRangePossible = false;
}
}
} static native void mlock_native(
ByteBuffer buffer, long len) throws NativeIOException; /**
* Locks the provided direct ByteBuffer into memory, preventing it from
* swapping out. After a buffer is locked, future accesses will not incur
* a page fault.
*
* See the mlock(2) man page for more information.
*
* @throws NativeIOException
*/
static void mlock(ByteBuffer buffer, long len)
throws IOException {
assertCodeLoaded();
if (!buffer.isDirect()) {
throw new IOException("Cannot mlock a non-direct ByteBuffer");
}
mlock_native(buffer, len);
} /**
* Unmaps the block from memory. See munmap(2).
*
* There isn't any portable way to unmap a memory region in Java.
* So we use the sun.nio method here.
* Note that unmapping a memory region could cause crashes if code
* continues to reference the unmapped code. However, if we don't
* manually unmap the memory, we are dependent on the finalizer to
* do it, and we have no idea when the finalizer will run.
*
* @param buffer The buffer to unmap.
*/
public static void munmap(MappedByteBuffer buffer) {
if (buffer instanceof sun.nio.ch.DirectBuffer) {
sun.misc.Cleaner cleaner =
((sun.nio.ch.DirectBuffer)buffer).cleaner();
cleaner.clean();
}
} /** Linux only methods used for getOwner() implementation */
private static native long getUIDforFDOwnerforOwner(FileDescriptor fd) throws IOException;
private static native String getUserName(long uid) throws IOException; /**
* Result type of the fstat call
*/
public static class Stat {
private int ownerId, groupId;
private String owner, group;
private int mode; // Mode constants - Set by JNI
public static int S_IFMT = -1; /* type of file */
public static int S_IFIFO = -1; /* named pipe (fifo) */
public static int S_IFCHR = -1; /* character special */
public static int S_IFDIR = -1; /* directory */
public static int S_IFBLK = -1; /* block special */
public static int S_IFREG = -1; /* regular */
public static int S_IFLNK = -1; /* symbolic link */
public static int S_IFSOCK = -1; /* socket */
public static int S_ISUID = -1; /* set user id on execution */
public static int S_ISGID = -1; /* set group id on execution */
public static int S_ISVTX = -1; /* save swapped text even after use */
public static int S_IRUSR = -1; /* read permission, owner */
public static int S_IWUSR = -1; /* write permission, owner */
public static int S_IXUSR = -1; /* execute/search permission, owner */ Stat(int ownerId, int groupId, int mode) {
this.ownerId = ownerId;
this.groupId = groupId;
this.mode = mode;
} Stat(String owner, String group, int mode) {
if (!Shell.WINDOWS) {
this.owner = owner;
} else {
this.owner = stripDomain(owner);
}
if (!Shell.WINDOWS) {
this.group = group;
} else {
this.group = stripDomain(group);
}
this.mode = mode;
} @Override
public String toString() {
return "Stat(owner='" + owner + "', group='" + group + "'" +
", mode=" + mode + ")";
} public String getOwner() {
return owner;
}
public String getGroup() {
return group;
}
public int getMode() {
return mode;
}
} /**
* Returns the file stat for a file descriptor.
*
* @param fd file descriptor.
* @return the file descriptor file stat.
* @throws IOException thrown if there was an IO error while obtaining the file stat.
*/
public static Stat getFstat(FileDescriptor fd) throws IOException {
Stat stat = null;
if (!Shell.WINDOWS) {
stat = fstat(fd);
stat.owner = getName(IdCache.USER, stat.ownerId);
stat.group = getName(IdCache.GROUP, stat.groupId);
} else {
try {
stat = fstat(fd);
} catch (NativeIOException nioe) {
if (nioe.getErrorCode() == 6) {
throw new NativeIOException("The handle is invalid.",
Errno.EBADF);
} else {
LOG.warn(String.format("NativeIO.getFstat error (%d): %s",
nioe.getErrorCode(), nioe.getMessage()));
throw new NativeIOException("Unknown error", Errno.UNKNOWN);
}
}
}
return stat;
} private static String getName(IdCache domain, int id) throws IOException {
Map<Integer, CachedName> idNameCache = (domain == IdCache.USER)
? USER_ID_NAME_CACHE : GROUP_ID_NAME_CACHE;
String name;
CachedName cachedName = idNameCache.get(id);
long now = System.currentTimeMillis();
if (cachedName != null && (cachedName.timestamp + cacheTimeout) > now) {
name = cachedName.name;
} else {
name = (domain == IdCache.USER) ? getUserName(id) : getGroupName(id);
if (LOG.isDebugEnabled()) {
String type = (domain == IdCache.USER) ? "UserName" : "GroupName";
LOG.debug("Got " + type + " " + name + " for ID " + id +
" from the native implementation");
}
cachedName = new CachedName(name, now);
idNameCache.put(id, cachedName);
}
return name;
} static native String getUserName(int uid) throws IOException;
static native String getGroupName(int uid) throws IOException; private static class CachedName {
final long timestamp;
final String name; public CachedName(String name, long timestamp) {
this.name = name;
this.timestamp = timestamp;
}
} private static final Map<Integer, CachedName> USER_ID_NAME_CACHE =
new ConcurrentHashMap<Integer, CachedName>(); private static final Map<Integer, CachedName> GROUP_ID_NAME_CACHE =
new ConcurrentHashMap<Integer, CachedName>(); private enum IdCache { USER, GROUP } public final static int MMAP_PROT_READ = 0x1;
public final static int MMAP_PROT_WRITE = 0x2;
public final static int MMAP_PROT_EXEC = 0x4; public static native long mmap(FileDescriptor fd, int prot,
boolean shared, long length) throws IOException; public static native void munmap(long addr, long length)
throws IOException;
} private static boolean workaroundNonThreadSafePasswdCalls = false; public static class Windows {
// Flags for CreateFile() call on Windows
public static final long GENERIC_READ = 0x80000000L;
public static final long GENERIC_WRITE = 0x40000000L; public static final long FILE_SHARE_READ = 0x00000001L;
public static final long FILE_SHARE_WRITE = 0x00000002L;
public static final long FILE_SHARE_DELETE = 0x00000004L; public static final long CREATE_NEW = 1;
public static final long CREATE_ALWAYS = 2;
public static final long OPEN_EXISTING = 3;
public static final long OPEN_ALWAYS = 4;
public static final long TRUNCATE_EXISTING = 5; public static final long FILE_BEGIN = 0;
public static final long FILE_CURRENT = 1;
public static final long FILE_END = 2; public static final long FILE_ATTRIBUTE_NORMAL = 0x00000080L; /**
* Create a directory with permissions set to the specified mode. By setting
* permissions at creation time, we avoid issues related to the user lacking
* WRITE_DAC rights on subsequent chmod calls. One example where this can
* occur is writing to an SMB share where the user does not have Full Control
* rights, and therefore WRITE_DAC is denied.
*
* @param path directory to create
* @param mode permissions of new directory
* @throws IOException if there is an I/O error
*/
public static void createDirectoryWithMode(File path, int mode)
throws IOException {
createDirectoryWithMode0(path.getAbsolutePath(), mode);
} /** Wrapper around CreateDirectory() on Windows */
private static native void createDirectoryWithMode0(String path, int mode)
throws NativeIOException; /** Wrapper around CreateFile() on Windows */
public static native FileDescriptor createFile(String path,
long desiredAccess, long shareMode, long creationDisposition)
throws IOException; /**
* Create a file for write with permissions set to the specified mode. By
* setting permissions at creation time, we avoid issues related to the user
* lacking WRITE_DAC rights on subsequent chmod calls. One example where
* this can occur is writing to an SMB share where the user does not have
* Full Control rights, and therefore WRITE_DAC is denied.
*
* This method mimics the semantics implemented by the JDK in
* {@link java.io.FileOutputStream}. The file is opened for truncate or
* append, the sharing mode allows other readers and writers, and paths
* longer than MAX_PATH are supported. (See io_util_md.c in the JDK.)
*
* @param path file to create
* @param append if true, then open file for append
* @param mode permissions of new directory
* @return FileOutputStream of opened file
* @throws IOException if there is an I/O error
*/
public static FileOutputStream createFileOutputStreamWithMode(File path,
boolean append, int mode) throws IOException {
long desiredAccess = GENERIC_WRITE;
long shareMode = FILE_SHARE_READ | FILE_SHARE_WRITE;
long creationDisposition = append ? OPEN_ALWAYS : CREATE_ALWAYS;
return new FileOutputStream(createFileWithMode0(path.getAbsolutePath(),
desiredAccess, shareMode, creationDisposition, mode));
} /** Wrapper around CreateFile() with security descriptor on Windows */
private static native FileDescriptor createFileWithMode0(String path,
long desiredAccess, long shareMode, long creationDisposition, int mode)
throws NativeIOException; /** Wrapper around SetFilePointer() on Windows */
public static native long setFilePointer(FileDescriptor fd,
long distanceToMove, long moveMethod) throws IOException; /** Windows only methods used for getOwner() implementation */
private static native String getOwner(FileDescriptor fd) throws IOException; /** Supported list of Windows access right flags */
public static enum AccessRight {
ACCESS_READ (0x0001), // FILE_READ_DATA
ACCESS_WRITE (0x0002), // FILE_WRITE_DATA
ACCESS_EXECUTE (0x0020); // FILE_EXECUTE private final int accessRight;
AccessRight(int access) {
accessRight = access;
} public int accessRight() {
return accessRight;
}
}; /** Windows only method used to check if the current process has requested
* access rights on the given path. */
private static native boolean access0(String path, int requestedAccess); /**
* Checks whether the current process has desired access rights on
* the given path.
*
* Longer term this native function can be substituted with JDK7
* function Files#isReadable, isWritable, isExecutable.
*
* @param path input path
* @param desiredAccess ACCESS_READ, ACCESS_WRITE or ACCESS_EXECUTE
* @return true if access is allowed
* @throws IOException I/O exception on error
*/
public static boolean access(String path, AccessRight desiredAccess)
throws IOException {
return true;
} /**
* Extends both the minimum and maximum working set size of the current
* process. This method gets the current minimum and maximum working set
* size, adds the requested amount to each and then sets the minimum and
* maximum working set size to the new values. Controlling the working set
* size of the process also controls the amount of memory it can lock.
*
* @param delta amount to increment minimum and maximum working set size
* @throws IOException for any error
* @see POSIX#mlock(ByteBuffer, long)
*/
public static native void extendWorkingSetSize(long delta) throws IOException; static {
if (NativeCodeLoader.isNativeCodeLoaded()) {
try {
initNative();
nativeLoaded = true;
} catch (Throwable t) {
// This can happen if the user has an older version of libhadoop.so
// installed - in this case we can continue without native IO
// after warning
PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
}
}
}
} private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class); private static boolean nativeLoaded = false; static {
if (NativeCodeLoader.isNativeCodeLoaded()) {
try {
initNative();
nativeLoaded = true;
} catch (Throwable t) {
// This can happen if the user has an older version of libhadoop.so
// installed - in this case we can continue without native IO
// after warning
PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
}
}
} /**
* Return true if the JNI-based native IO extensions are available.
*/
public static boolean isAvailable() {
return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
} /** Initialize the JNI method ID and class ID cache */
private static native void initNative(); /**
* Get the maximum number of bytes that can be locked into memory at any
* given point.
*
* @return 0 if no bytes can be locked into memory;
* Long.MAX_VALUE if there is no limit;
* The number of bytes that can be locked into memory otherwise.
*/
static long getMemlockLimit() {
return isAvailable() ? getMemlockLimit0() : 0;
} private static native long getMemlockLimit0(); /**
* @return the operating system's page size.
*/
static long getOperatingSystemPageSize() {
try {
Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
Unsafe unsafe = (Unsafe)f.get(null);
return unsafe.pageSize();
} catch (Throwable e) {
LOG.warn("Unable to get operating system page size. Guessing 4096.", e);
return 4096;
}
} private static class CachedUid {
final long timestamp;
final String username;
public CachedUid(String username, long timestamp) {
this.timestamp = timestamp;
this.username = username;
}
}
private static final Map<Long, CachedUid> uidCache =
new ConcurrentHashMap<Long, CachedUid>();
private static long cacheTimeout;
private static boolean initialized = false; /**
* The Windows logon name has two part, NetBIOS domain name and
* user account name, of the format DOMAIN\UserName. This method
* will remove the domain part of the full logon name.
*
* @param Fthe full principal name containing the domain
* @return name with domain removed
*/
private static String stripDomain(String name) {
int i = name.indexOf('\\');
if (i != -1)
name = name.substring(i + 1);
return name;
} public static String getOwner(FileDescriptor fd) throws IOException {
ensureInitialized();
if (Shell.WINDOWS) {
String owner = Windows.getOwner(fd);
owner = stripDomain(owner);
return owner;
} else {
long uid = POSIX.getUIDforFDOwnerforOwner(fd);
CachedUid cUid = uidCache.get(uid);
long now = System.currentTimeMillis();
if (cUid != null && (cUid.timestamp + cacheTimeout) > now) {
return cUid.username;
}
String user = POSIX.getUserName(uid);
LOG.info("Got UserName " + user + " for UID " + uid
+ " from the native implementation");
cUid = new CachedUid(user, now);
uidCache.put(uid, cUid);
return user;
}
} /**
* Create a FileDescriptor that shares delete permission on the
* file opened at a given offset, i.e. other process can delete
* the file the FileDescriptor is reading. Only Windows implementation
* uses the native interface.
*/
public static FileDescriptor getShareDeleteFileDescriptor(
File f, long seekOffset) throws IOException {
if (!Shell.WINDOWS) {
RandomAccessFile rf = new RandomAccessFile(f, "r");
if (seekOffset > 0) {
rf.seek(seekOffset);
}
return rf.getFD();
} else {
// Use Windows native interface to create a FileDescriptor that
// shares delete permission on the file opened, and set it to the
// given offset.
//
FileDescriptor fd = NativeIO.Windows.createFile(
f.getAbsolutePath(),
NativeIO.Windows.GENERIC_READ,
NativeIO.Windows.FILE_SHARE_READ |
NativeIO.Windows.FILE_SHARE_WRITE |
NativeIO.Windows.FILE_SHARE_DELETE,
NativeIO.Windows.OPEN_EXISTING);
if (seekOffset > 0)
NativeIO.Windows.setFilePointer(fd, seekOffset, NativeIO.Windows.FILE_BEGIN);
return fd;
}
} /**
* Create the specified File for write access, ensuring that it does not exist.
* @param f the file that we want to create
* @param permissions we want to have on the file (if security is enabled)
*
* @throws AlreadyExistsException if the file already exists
* @throws IOException if any other error occurred
*/
public static FileOutputStream getCreateForWriteFileOutputStream(File f, int permissions)
throws IOException {
if (!Shell.WINDOWS) {
// Use the native wrapper around open(2)
try {
FileDescriptor fd = NativeIO.POSIX.open(f.getAbsolutePath(),
NativeIO.POSIX.O_WRONLY | NativeIO.POSIX.O_CREAT
| NativeIO.POSIX.O_EXCL, permissions);
return new FileOutputStream(fd);
} catch (NativeIOException nioe) {
if (nioe.getErrno() == Errno.EEXIST) {
throw new AlreadyExistsException(nioe);
}
throw nioe;
}
} else {
// Use the Windows native APIs to create equivalent FileOutputStream
try {
FileDescriptor fd = NativeIO.Windows.createFile(f.getCanonicalPath(),
NativeIO.Windows.GENERIC_WRITE,
NativeIO.Windows.FILE_SHARE_DELETE
| NativeIO.Windows.FILE_SHARE_READ
| NativeIO.Windows.FILE_SHARE_WRITE,
NativeIO.Windows.CREATE_NEW);
NativeIO.POSIX.chmod(f.getCanonicalPath(), permissions);
return new FileOutputStream(fd);
} catch (NativeIOException nioe) {
if (nioe.getErrorCode() == 80) {
// ERROR_FILE_EXISTS
// 80 (0x50)
// The file exists
throw new AlreadyExistsException(nioe);
}
throw nioe;
}
}
} private synchronized static void ensureInitialized() {
if (!initialized) {
cacheTimeout =
new Configuration().getLong("hadoop.security.uid.cache.secs",
4*60*60) * 1000;
LOG.info("Initialized cache for UID to User mapping with a cache" +
" timeout of " + cacheTimeout/1000 + " seconds.");
initialized = true;
}
} /**
* A version of renameTo that throws a descriptive exception when it fails.
*
* @param src The source path
* @param dst The destination path
*
* @throws NativeIOException On failure.
*/
public static void renameTo(File src, File dst)
throws IOException {
if (!nativeLoaded) {
if (!src.renameTo(dst)) {
throw new IOException("renameTo(src=" + src + ", dst=" +
dst + ") failed.");
}
} else {
renameTo0(src.getAbsolutePath(), dst.getAbsolutePath());
}
} /**
* Creates a hardlink "dst" that points to "src".
*
* This is deprecated since JDK7 NIO can create hardlinks via the
* {@link java.nio.file.Files} API.
*
* @param src source file
* @param dst hardlink location
* @throws IOException
*/
@Deprecated
public static void link(File src, File dst) throws IOException {
if (!nativeLoaded) {
HardLink.createHardLink(src, dst);
} else {
link0(src.getAbsolutePath(), dst.getAbsolutePath());
}
} /**
* A version of renameTo that throws a descriptive exception when it fails.
*
* @param src The source path
* @param dst The destination path
*
* @throws NativeIOException On failure.
*/
private static native void renameTo0(String src, String dst)
throws NativeIOException; private static native void link0(String src, String dst)
throws NativeIOException; /**
* Unbuffered file copy from src to dst without tainting OS buffer cache
*
* In POSIX platform:
* It uses FileChannel#transferTo() which internally attempts
* unbuffered IO on OS with native sendfile64() support and falls back to
* buffered IO otherwise.
*
* It minimizes the number of FileChannel#transferTo call by passing the the
* src file size directly instead of a smaller size as the 3rd parameter.
* This saves the number of sendfile64() system call when native sendfile64()
* is supported. In the two fall back cases where sendfile is not supported,
* FileChannle#transferTo already has its own batching of size 8 MB and 8 KB,
* respectively.
*
* In Windows Platform:
* It uses its own native wrapper of CopyFileEx with COPY_FILE_NO_BUFFERING
* flag, which is supported on Windows Server 2008 and above.
*
* Ideally, we should use FileChannel#transferTo() across both POSIX and Windows
* platform. Unfortunately, the wrapper(Java_sun_nio_ch_FileChannelImpl_transferTo0)
* used by FileChannel#transferTo for unbuffered IO is not implemented on Windows.
* Based on OpenJDK 6/7/8 source code, Java_sun_nio_ch_FileChannelImpl_transferTo0
* on Windows simply returns IOS_UNSUPPORTED.
*
* Note: This simple native wrapper does minimal parameter checking before copy and
* consistency check (e.g., size) after copy.
* It is recommended to use wrapper function like
* the Storage#nativeCopyFileUnbuffered() function in hadoop-hdfs with pre/post copy
* checks.
*
* @param src The source path
* @param dst The destination path
* @throws IOException
*/
public static void copyFileUnbuffered(File src, File dst) throws IOException {
if (nativeLoaded && Shell.WINDOWS) {
copyFileUnbuffered0(src.getAbsolutePath(), dst.getAbsolutePath());
} else {
FileInputStream fis = new FileInputStream(src);
FileChannel input = null;
try {
input = fis.getChannel();
try (FileOutputStream fos = new FileOutputStream(dst);
FileChannel output = fos.getChannel()) {
long remaining = input.size();
long position = 0;
long transferred = 0;
while (remaining > 0) {
transferred = input.transferTo(position, remaining, output);
remaining -= transferred;
position += transferred;
}
}
} finally {
IOUtils.cleanupWithLogger(LOG, input, fis);
}
}
} private static native void copyFileUnbuffered0(String src, String dst)
throws NativeIOException;
}
三、关于这个使用maven构建的项目,我在运行时因为使用公司内网,速度很慢,所以改变策略。创建java项目,然后把hadoop2.9.2里面的share目录下的common、hdfs、httpfs、yarn、mapreduce目录下的jar文件都拷了进来,运行中出了不少bug。
hadoop-hdfs-2.9.2.jar
hadoop-hdfs-client-2.9.2.jar
hadoop-mapreduce-client-app-2.9.2.jar
hadoop-mapreduce-client-common-2.9.2.jar
hadoop-mapreduce-client-core-2.9.2.jar
hadoop-mapreduce-client-hs-2.9.2.jar
hadoop-mapreduce-client-jobclient-2.9.2-tests.jar
hadoop-mapreduce-client-shuffle-2.9.2.jar
hadoop-yarn-api-2.9.2.jar
hadoop-yarn-applications-distributedshell-2.9.2.jar
hadoop-yarn-applications-unmanaged-am-launcher-2.9.2.jar
hadoop-yarn-client-2.9.2.jar
activation-1.1.jar
aopalliance-1.0.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
asm-3.2.jar
avro-1.7.7.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2.jar
commons-codec-1.4.jar
commons-collections-3.2.2.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-lang3-3.4.jar
commons-logging-1.1.3.jar
commons-math3-3.1.1.jar
commons-net-3.1.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
ehcache-3.3.1.jar
fst-2.50.jar
geronimo-jcache_1.0_spec-1.0-alpha-1.jar
gson-2.2.4.jar
guava-11.0.2.jar
guice-3.0.jar
guice-servlet-3.0.jar
HikariCP-java7-2.4.12.jar
htrace-core4-4.1.0-incubating.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
jackson-core-asl-1.9.13.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-xc-1.9.13.jar
java-util-1.9.0.jar
java-xmlbuilder-0.4.jar
javax.inject-1.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jcip-annotations-1.0-1.jar
jersey-client-1.9.jar
jersey-core-1.9.jar
jersey-guice-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jets3t-0.9.0.jar
jettison-1.1.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
jsch-0.1.54.jar
json-io-2.5.1.jar
json-smart-1.3.1.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
leveldbjni-all-1.8.jar
log4j-1.2.17.jar
metrics-core-3.0.1.jar
mssql-jdbc-6.2.1.jre7.jar
netty-3.6.2.Final.jar
nimbus-jose-jwt-4.41.1.jar
paranamer-2.3.jar
protobuf-java-2.5.0.jar
servlet-api-2.5.jar
snappy-java-1.0.5.jar
stax-api-1.0-2.jar
stax2-api-3.1.4.jar
woodstox-core-5.0.3.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.6.jar
hadoop-common-2.9.2.jar
slf4j-api-1.7.25.jar
slf4j-log4j12-1.7.25.jar
hadoop-yarn-server-nodemanager-2.9.2.jar
hadoop-yarn-server-resourcemanager-2.9.2.jar
hadoop-yarn-server-router-2.9.2.jar
hadoop-yarn-server-sharedcachemanager-2.9.2.jar
hadoop-yarn-server-timeline-pluginstorage-2.9.2.jar
hadoop-yarn-server-web-proxy-2.9.2.jar
hadoop-yarn-ui-2.9.2.war
hadoop-annotations-2.9.2.jar
hadoop-auth-2.9.2.jar
hadoop-nfs-2.9.2.jar
hamcrest-core-1.3.jar
junit-4.11.jar
hadoop-mapreduce-client-jobclient-2.9.2.jar
mockito-all-1.8.5.jar
ojdbc7.jar
orai18n.jar
hadoop-yarn-common-2.9.2.jar
hadoop-yarn-registry-2.9.2.jar
hadoop-yarn-server-applicationhistoryservice-2.9.2.jar
hadoop-yarn-server-common-2.9.2.jar
前言
Web日志包含着网站最重要的信息,通过日志分析,我们可以知道网站的访问量,哪个网页访问人数最多,哪个网页最有价值等。一般中型的网站(10W的PV以上),每天会产生1G以上Web日志文件。大型或超大型的网站,可能每小时就会产生10G的数据量。
对于日志的这种规模的数据,用Hadoop进行日志分析,是最适合不过的了。
目录
- Web日志分析概述
- 需求分析:KPI指标设计
- 算法模型:Hadoop并行算法
- 架构设计:日志KPI系统架构
- 程序开发1:用Maven构建Hadoop项目
- 程序开发2:MapReduce程序实现
1. Web日志分析概述
Web日志由Web服务器产生,可能是Nginx, Apache, Tomcat等。从Web日志中,我们可以获取网站每类页面的PV值(PageView,页面访问量)、独立IP数;稍微复杂一些的,可以计算得出用户所检索的关键词排行榜、用户停留时间最高的页面等;更复杂的,构建广告点击模型、分析用户行为特征等等。
在Web日志中,每条日志通常代表着用户的一次访问行为,例如下面就是一条nginx日志:
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939
"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
拆解为以下8个变量
- remote_addr: 记录客户端的ip地址, 222.68.172.190
- remote_user: 记录客户端用户名称, –
- time_local: 记录访问时间与时区, [18/Sep/2013:06:49:57 +0000]
- request: 记录请求的url与http协议, “GET /images/my.jpg HTTP/1.1”
- status: 记录请求状态,成功是200, 200
- body_bytes_sent: 记录发送给客户端文件主体内容大小, 19939
- http_referer: 用来记录从那个页面链接访问过来的, “http://www.angularjs.cn/A00n”
- http_user_agent: 记录客户浏览器的相关信息, “Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36”
注:要更多的信息,则要用其它手段去获取,通过js代码单独发送请求,使用cookies记录用户的访问信息。
利用这些日志信息,我们可以深入挖掘网站的秘密了。
少量数据的情况
少量数据的情况(10Mb,100Mb,10G),在单机处理尚能忍受的时候,我可以直接利用各种Unix/Linux工具,awk、grep、sort、join等都是日志分析的利器,再配合perl, python,正则表达工,基本就可以解决所有的问题。
例如,我们想从上面提到的nginx日志中得到访问量最高前10个IP,实现很简单:
~ cat access.log.10 | awk '{a[$1]++} END {for(b in a) print b"\t"a[b]}' | sort -k2 -r | head -n 10
163.177.71.12 972
101.226.68.137 972
183.195.232.138 971
50.116.27.194 97
14.17.29.86 96
61.135.216.104 94
61.135.216.105 91
61.186.190.41 9
59.39.192.108 9
220.181.51.212 9
海量数据的情况
当数据量每天以10G、100G增长的时候,单机处理能力已经不能满足需求。我们就需要增加系统的复杂性,用计算机集群,存储阵列来解决。在Hadoop出现之前,海量数据存储,和海量日志分析都是非常困难的。只有少数一些公司,掌握着高效的并行计算,分步式计算,分步式存储的核心技术。
Hadoop的出现,大幅度的降低了海量数据处理的门槛,让小公司甚至是个人都能力,搞定海量数据。并且,Hadoop非常适用于日志分析系统。
2.需求分析:KPI指标设计
下面我们将从一个公司案例出发来全面的解释,如何用进行海量Web日志分析,提取KPI数据。
案例介绍
某电子商务网站,在线团购业务。每日PV数100w,独立IP数5w。用户通常在工作日上午10:00-12:00和下午15:00-18:00访问量最大。日间主要是通过PC端浏览器访问,休息日及夜间通过移动设备访问较多。网站搜索浏量占整个网站的80%,PC用户不足1%的用户会消费,移动用户有5%会消费。
通过简短的描述,我们可以粗略地看出,这家电商网站的经营状况,并认识到愿意消费的用户从哪里来,有哪些潜在的用户可以挖掘,网站是否存在倒闭风险等。
KPI指标设计
- PV(PageView): 页面访问量统计
- IP: 页面独立IP的访问量统计
- Time: 用户每小时PV的统计
- Source: 用户来源域名的统计
- Browser: 用户的访问设备统计
注:商业保密限制,无法提供电商网站的日志。
下面的内容,将以我的个人网站为例提取数据进行分析。
百度统计,对我个人网站做的统计!http://www.fens.me
基本统计指标:
用户的访问设备统计指标:
从商业的角度,个人网站的特征与电商网站不太一样,没有转化率,同时跳出率也比较高。从技术的角度,同样都关注KPI指标设计。
3.算法模型:Hadoop并行算法
并行算法的设计:
注:找到第一节有定义的8个变量
PV(PageView): 页面访问量统计
- Map过程{key:$request,value:1}
- Reduce过程{key:$request,value:求和(sum)}
IP: 页面独立IP的访问量统计
- Map: {key:$request,value:$remote_addr}
- Reduce: {key:$request,value:去重再求和(sum(unique))}
Time: 用户每小时PV的统计
- Map: {key:$time_local,value:1}
- Reduce: {key:$time_local,value:求和(sum)}
Source: 用户来源域名的统计
- Map: {key:$http_referer,value:1}
- Reduce: {key:$http_referer,value:求和(sum)}
Browser: 用户的访问设备统计
- Map: {key:$http_user_agent,value:1}
- Reduce: {key:$http_user_agent,value:求和(sum)}
4.架构设计:日志KPI系统架构
上图中,左边是Application业务系统,右边是Hadoop的HDFS, MapReduce。
- 日志是由业务系统产生的,我们可以设置web服务器每天产生一个新的目录,目录下面会产生多个日志文件,每个日志文件64M。
- 设置系统定时器CRON,夜间在0点后,向HDFS导入昨天的日志文件。
- 完成导入后,设置系统定时器,启动MapReduce程序,提取并计算统计指标。
- 完成计算后,设置系统定时器,从HDFS导出统计指标数据到数据库,方便以后的即使查询。
上面这幅图,我们可以看得更清楚,数据是如何流动的。蓝色背景的部分是在Hadoop中的,接下来我们的任务就是完成MapReduce的程序实现。
5.程序开发1:用Maven构建Hadoop项目
请参考文章:用Maven构建Hadoop项目
win7的开发环境 和 Hadoop的运行环境 ,在上面文章中已经介绍过了。
我们需要放日志文件,上传的HDFS里/user/hdfs/log_kpi/目录,参考下面的命令操作
~ hadoop fs -mkdir /user/hdfs/log_kpi
~ hadoop fs -copyFromLocal /home/conan/datafiles/access.log.10 /user/hdfs/log_kpi/
我已经把整个MapReduce的实现都放到了github上面:
https://github.com/bsspirit/maven_hadoop_template/releases/tag/kpi_v1
6.程序开发2:MapReduce程序实现
开发流程:
- 对日志行的解析
- Map函数实现
- Reduce函数实现
- 启动程序实现
1). 对日志行的解析
新建文件:org.conan.myhadoop.mr.kpi.KPI.java
package org.conan.myhadoop.mr.kpi;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
/*
* KPI Object
*/
public class KPI {
private String remote_addr;// 记录客户端的ip地址
private String remote_user;// 记录客户端用户名称,忽略属性"-"
private String time_local;// 记录访问时间与时区
private String request;// 记录请求的url与http协议
private String status;// 记录请求状态;成功是200
private String body_bytes_sent;// 记录发送给客户端文件主体内容大小
private String http_referer;// 用来记录从那个页面链接访问过来的
private String http_user_agent;// 记录客户浏览器的相关信息
private boolean valid = true;// 判断数据是否合法
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append("valid:" + this.valid);
sb.append("\nremote_addr:" + this.remote_addr);
sb.append("\nremote_user:" + this.remote_user);
sb.append("\ntime_local:" + this.time_local);
sb.append("\nrequest:" + this.request);
sb.append("\nstatus:" + this.status);
sb.append("\nbody_bytes_sent:" + this.body_bytes_sent);
sb.append("\nhttp_referer:" + this.http_referer);
sb.append("\nhttp_user_agent:" + this.http_user_agent);
return sb.toString();
}
public String getRemote_addr() {
return remote_addr;
}
public void setRemote_addr(String remote_addr) {
this.remote_addr = remote_addr;
}
public String getRemote_user() {
return remote_user;
}
public void setRemote_user(String remote_user) {
this.remote_user = remote_user;
}
public String getTime_local() {
return time_local;
}
public Date getTime_local_Date() throws ParseException {
SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
return df.parse(this.time_local);
}
public String getTime_local_Date_hour() throws ParseException{
SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
return df.format(this.getTime_local_Date());
}
public void setTime_local(String time_local) {
this.time_local = time_local;
}
public String getRequest() {
return request;
}
public void setRequest(String request) {
this.request = request;
}
public String getStatus() {
return status;
}
public void setStatus(String status) {
this.status = status;
}
public String getBody_bytes_sent() {
return body_bytes_sent;
}
public void setBody_bytes_sent(String body_bytes_sent) {
this.body_bytes_sent = body_bytes_sent;
}
public String getHttp_referer() {
return http_referer;
}
public String getHttp_referer_domain(){
if(http_referer.length()<8){
return http_referer;
}
String str=this.http_referer.replace("\"", "").replace("http://", "").replace("https://", "");
return str.indexOf("/")>0?str.substring(0, str.indexOf("/")):str;
}
public void setHttp_referer(String http_referer) {
this.http_referer = http_referer;
}
public String getHttp_user_agent() {
return http_user_agent;
}
public void setHttp_user_agent(String http_user_agent) {
this.http_user_agent = http_user_agent;
}
public boolean isValid() {
return valid;
}
public void setValid(boolean valid) {
this.valid = valid;
}
public static void main(String args[]) {
String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
System.out.println(line);
KPI kpi = new KPI();
String[] arr = line.split(" ");
kpi.setRemote_addr(arr[0]);
kpi.setRemote_user(arr[1]);
kpi.setTime_local(arr[3].substring(1));
kpi.setRequest(arr[6]);
kpi.setStatus(arr[8]);
kpi.setBody_bytes_sent(arr[9]);
kpi.setHttp_referer(arr[10]);
kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
System.out.println(kpi);
try {
SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd:HH:mm:ss", Locale.US);
System.out.println(df.format(kpi.getTime_local_Date()));
System.out.println(kpi.getTime_local_Date_hour());
System.out.println(kpi.getHttp_referer_domain());
} catch (ParseException e) {
e.printStackTrace();
}
}
}
从日志文件中,取一行通过main函数写一个简单的解析测试。
控制台输出:
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
valid:true
remote_addr:222.68.172.190
remote_user:-
time_local:18/Sep/2013:06:49:57
request:/images/my.jpg
status:200
body_bytes_sent:19939
http_referer:"http://www.angularjs.cn/A00n"
http_user_agent:"Mozilla/5.0 (Windows
2013.09.18:06:49:57
2013091806
www.angularjs.cn
我们看到日志行,被正确的解析成了kpi对象的属性。我们把解析过程,单独封装成一个方法。
private static KPI parser(String line) {
System.out.println(line);
KPI kpi = new KPI();
String[] arr = line.split(" ");
if (arr.length > 11) {
kpi.setRemote_addr(arr[0]);
kpi.setRemote_user(arr[1]);
kpi.setTime_local(arr[3].substring(1));
kpi.setRequest(arr[6]);
kpi.setStatus(arr[8]);
kpi.setBody_bytes_sent(arr[9]);
kpi.setHttp_referer(arr[10]);
if (arr.length > 12) {
kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
} else {
kpi.setHttp_user_agent(arr[11]);
}
if (Integer.parseInt(kpi.getStatus()) >= 400) {// 大于400,HTTP错误
kpi.setValid(false);
}
} else {
kpi.setValid(false);
}
return kpi;
}
对map方法,reduce方法,启动方法,我们单独写一个类来实现
下面将分别介绍MapReduce的实现类:
- PV:org.conan.myhadoop.mr.kpi.KPIPV.java
- IP: org.conan.myhadoop.mr.kpi.KPIIP.java
- Time: org.conan.myhadoop.mr.kpi.KPITime.java
- Browser: org.conan.myhadoop.mr.kpi.KPIBrowser.java
1). PV:org.conan.myhadoop.mr.kpi.KPIPV.java
package org.conan.myhadoop.mr.kpi;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPIPV {
public static class KPIPVMapper extends MapReduceBase implements Mapper<object, text,="" intwritable=""> {
private IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(Object key, Text value, OutputCollector<text, intwritable=""> output, Reporter reporter) throws IOException {
KPI kpi = KPI.filterPVs(value.toString());
if (kpi.isValid()) {
word.set(kpi.getRequest());
output.collect(word, one);
}
}
}
public static class KPIPVReducer extends MapReduceBase implements Reducer<text, intwritable,="" text,="" intwritable=""> {
private IntWritable result = new IntWritable();
@Override
public void reduce(Text key, Iterator values, OutputCollector<text, intwritable=""> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
result.set(sum);
output.collect(key, result);
}
}
public static void main(String[] args) throws Exception {
String input = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/";
String output = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv";
JobConf conf = new JobConf(KPIPV.class);
conf.setJobName("KPIPV");
conf.addResource("classpath:/hadoop/core-site.xml");
conf.addResource("classpath:/hadoop/hdfs-site.xml");
conf.addResource("classpath:/hadoop/mapred-site.xml");
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(KPIPVMapper.class);
conf.setCombinerClass(KPIPVReducer.class);
conf.setReducerClass(KPIPVReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
System.exit(0);
}
}
在程序中会调用KPI类的方法
KPI kpi = KPI.filterPVs(value.toString());
通过filterPVs方法,我们可以实现对PV,更多的控制。
在KPK.java中,增加filterPVs方法
/**
* 按page的pv分类
*/
public static KPI filterPVs(String line) {
KPI kpi = parser(line);
Set pages = new HashSet();
pages.add("/about");
pages.add("/black-ip-list/");
pages.add("/cassandra-clustor/");
pages.add("/finance-rhive-repurchase/");
pages.add("/hadoop-family-roadmap/");
pages.add("/hadoop-hive-intro/");
pages.add("/hadoop-zookeeper-intro/");
pages.add("/hadoop-mahout-roadmap/");
if (!pages.contains(kpi.getRequest())) {
kpi.setValid(false);
}
return kpi;
}
在filterPVs方法,我们定义了一个pages的过滤,就是只对这个页面进行PV统计。
我们运行一下KPIPV.java
2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-9 11:53:28 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757
2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757
2013-10-9 11:53:30 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0001_m_000000_0' done.
2013-10-9 11:53:30 org.apache.hadoop.mapred.Task initialize
信息: Using ResourceCalculatorPlugin : null
2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息:
2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 213 bytes
2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息:
2013-10-9 11:53:30 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息:
2013-10-9 11:53:30 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0001_r_000000_0 is allowed to commit now
2013-10-9 11:53:30 org.apache.hadoop.mapred.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv
2013-10-9 11:53:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: map 100% reduce 0%
2013-10-9 11:53:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-9 11:53:33 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0001_r_000000_0' done.
2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: map 100% reduce 100%
2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0001
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Counters: 20
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: File Input Format Counters
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Bytes Read=3025757
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: File Output Format Counters
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Bytes Written=183
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: FileSystemCounters
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: FILE_BYTES_READ=545
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: HDFS_BYTES_READ=6051514
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: FILE_BYTES_WRITTEN=83472
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: HDFS_BYTES_WRITTEN=183
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map-Reduce Framework
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map output materialized bytes=217
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map input records=14619
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Reduce shuffle bytes=0
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Spilled Records=16
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map output bytes=2004
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Total committed heap usage (bytes)=376569856
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map input bytes=3025757
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: SPLIT_RAW_BYTES=110
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Combine input records=76
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Reduce input records=8
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Reduce input groups=8
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Combine output records=8
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Reduce output records=8
2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
信息: Map output records=76
用hadoop命令查看HDFS文件
~ hadoop fs -cat /user/hdfs/log_kpi/pv/part-00000
/about 5
/black-ip-list/ 2
/cassandra-clustor/ 3
/finance-rhive-repurchase/ 13
/hadoop-family-roadmap/ 13
/hadoop-hive-intro/ 14
/hadoop-mahout-roadmap/ 20
/hadoop-zookeeper-intro/ 6
这样我们就得到了,刚刚日志文件中的,指定页面的PV值。
指定页面,就像网站的站点地图一样,如果没有指定所有访问链接都会被找出来,通过“站点地图”的指定,我们可以更容易地找到,我们所需要的信息。
后面,其他的统计指标的提取思路,和PV的实现过程都是类似的,大家可以直接下载源代码,运行看到结果!!
后面我会把我代码上传到github上面:
https://github.com/blench/
hadoop入门之海量Web日志分析 用Hadoop提取KPI统计指标的更多相关文章
- 海量Web日志分析 用Hadoop提取KPI统计指标
http://blog.fens.me/hadoop-mapreduce-log-kpi/ http://dongxicheng.org/search-engine/scribe-installati ...
- 海量WEB日志分析
Hadoop家族系列文章,主要介绍Hadoop家族产品,常用的项目包括Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, ...
- Hadoop应用开发实战案例 第2周 Web日志分析项目 张丹
课程内容 本文链接: 张丹博客 http://www.fens.me 用Maven构建Hadoop项目 http://blog.fens.me/hadoop-maven-eclipse/程序源代码下载 ...
- linux系统web日志分析脚本
linux系统web日志分析这方面工具比较多,比如logwatch或awstats等使用perl语言开发,功能都非常强大.但这些软件都需要进行一些配置,很多朋友往往在技术方面没有投入太多力量,即便参照 ...
- Hadoop:实战Web日志分析
示例场景 日志说明 有两台Web服务器,日志文件存放在/usr/local/nginx/logs/目录,日志默认为nginx定义格式.如: 123.13.17.13 - - [25/Aug/2016: ...
- Hadoop学习笔记—20.网站日志分析项目案例(二)数据清洗
网站日志分析项目案例(一)项目介绍:http://www.cnblogs.com/edisonchou/p/4449082.html 网站日志分析项目案例(二)数据清洗:当前页面 网站日志分析项目案例 ...
- Hadoop学习笔记—20.网站日志分析项目案例
1.1 项目来源 本次要实践的数据日志来源于国内某技术学习论坛,该论坛由某培训机构主办,汇聚了众多技术学习者,每天都有人发帖.回帖,如图1所示. 图1 项目来源网站-技术学习论坛 本次实践的目的就在于 ...
- [spark案例学习] WEB日志分析
数据准备 数据下载:美国宇航局肯尼迪航天中心WEB日志 我们先来看看数据:首先将日志加载到RDD,并显示出前20行(默认). import sys import os log_file_path =' ...
- 可视化实时Web日志分析工具-goaccess
说到web服务器就不得不说Nginx,目前已成为企业建站的首选.但由于种种历史原因,Nginx日志分析工具相较于传统的apache.lighthttp等还是少很多. 今天就和大家分享一个非常强大的实时 ...
随机推荐
- Windows Server - SVN 服务器搭建与项目配置、客户端安装与配置
本教程以Windows Server 2012 R12 为例搭建SVN服务器,安装部署完成后,客户端可通过SVN客户端访问SVN服务器上的项目,也可以访问网上其他SVN服务器上的项目. 一.首先准备三 ...
- JS函数提升和变量提升
1.1什么是函数提升和变量的提升? JS引擎在运行整个JS代码的过程中,分为俩步. 第一步是读取和解析JS代码,第二部是执行. 在引擎解析JS代码的时候,当解析器遇见变量声明(var 变量名)和函数声 ...
- docker运行原理与使用总结
docker运行原理概述 Client-Server架构 docker守护进程运行在宿主机上systemctl start docker daemon进程通过socket从客户端(docker命令)接 ...
- 【转载】Windows api数据类型
最近在接触windows api函数,看到了很多之前没有看到过的数据类型,发现“个人图书馆”中有个帖子说的挺详细的,特地搬运过来 Windows 数据类型 Delphi 数据类型 描述 LPSTR P ...
- 2019nc#10
题号 标题 已通过代码 题解/讨论 通过率 团队的状态 A Blackjack 点击查看 背包DP 32/109 补好了 B Coffee Chicken 点击查看 进入讨论 738/2992 通过 ...
- lightoj 1145 - Dice (I)(dp+空间优化+前缀和)
题目链接:http://www.lightoj.com/volume_showproblem.php?problem=1145 题解:首先只要是dp的值只和上一个状态有关系那么就可以优化一维,然后这题 ...
- hdu1255 覆盖的面积(线段树面积交)
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=1255 面积交与面积并相似相比回了面积并,面积交一定会有思路,当然就是cover标记大于等于两次时. 但 ...
- codeforces 789 C. Functions again(dp求区间和最大)
题目链接:http://codeforces.com/contest/789/problem/C 题意:就是给出一个公式 然后给出一串数求一个区间使得f(l,r)最大. 这题需要一个小小的处理 可以设 ...
- CSU 1803 2016 湖南省2016省赛
1803: 2016 Submit Page Summary Time Limit: 5 Sec Memory Limit: 128 Mb Submitted: 1416 ...
- 前端利器躬行记(4)——webpack进阶
webpack是一个非常强大的工具,除了前文所介绍的基础概念之外,还有各种进阶应用,例如Source Map.模块热替换.集成等,本文会对这些内容做依次讲解. 一. runtime和manifest ...