Getting Started with Hadoop: Extracting KPI Metrics from Massive Web Logs

Reposted from:
http://blog.fens.me/hadoop-mapreduce-log-kpi/

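I studied this blog post today. It is very well written, and I typed out the code while following along.

I ran into a few problems.

First, the Hadoop version used in that post is quite old. If you run the code on Hadoop 2.x, the result file may end up with no data written to it. To solve this, I followed the API described in the official tutorial at http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html. The official documentation is detailed and plainly written, so anyone with basic English can follow it. Compared with Hadoop 1.x, a Mapper class in Hadoop 2.x can extend org.apache.hadoop.mapreduce.Mapper directly, with no need to implement the Mapper interface. The map method is also written differently: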

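        // word and one are fields of the enclosing Mapper subclass
        // (presumably a Text and an IntWritable holding 1).
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Parse one log line into a KPI object.
            KPI kpi = KPI.filterPVS(value.toString());
            System.out.println(kpi);
            // Emit (ip, 1) for every valid line.
            if (kpi.isValid()) {
                word.set(kpi.getIp());
                context.write(word, one);
            }
        }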

The Hadoop 1.x version looks like this:

        @Override
        public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {
            KPI kpi = KPI.filterPVs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                output.collect(word, one);
            }
        }

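The Hadoop 2.x version therefore has to change, and the reduce method in the Reducer changes accordingly. At first I did not notice the GitHub link in the post, so I searched Baidu and, after a lot of effort, found a log file of a little over 150 MB. Help yourself if you need it:

Link: https://pan.baidu.com/s/1hz5dTX69Hc_l9Aj-axvfqw Extraction code: ssys. Note that this log file does not match the one used in the blog post: it has two fewer fields, so adjust the parsing code accordingly.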
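Since the downloaded log has fewer fields than the one in the post, a parser that insists on the full field count would reject every line. The following is a minimal plain-Java sketch (no Hadoop dependency) of a more tolerant parser plus the per-IP count the job is meant to produce. The class AccessLogSketch, the Entry holder, the field positions, and the "status below 400" rule are all assumptions made for illustration; they are not the KPI class from the original post, so check the positions against your own log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch (no Hadoop dependency): tolerant log parsing plus a per-IP count. */
final class AccessLogSketch {

    /** Hypothetical holder for the fields used below; not the KPI class from the original post. */
    static final class Entry {
        final String ip;
        final String request;
        final boolean valid;

        Entry(String ip, String request, boolean valid) {
            this.ip = ip;
            this.request = request;
            this.valid = valid;
        }
    }

    // Assumed layout after splitting on single spaces:
    // [0]=ip [1],[2]=identity/user [3]=[time [4]=zone] [5]="METHOD [6]=path [7]=PROTO" [8]=status [9]=bytes
    // Referer and user agent, when present, follow at index 10 and beyond.
    private static final int MIN_FIELDS = 10;

    static Entry parse(String line) {
        String[] f = line.split(" ");
        if (f.length < MIN_FIELDS) {
            return new Entry(null, null, false); // too short to be a request line
        }
        boolean ok;
        try {
            // Assumption: error responses (status >= 400) are excluded from the statistics.
            ok = Integer.parseInt(f[8]) < 400;
        } catch (NumberFormatException e) {
            ok = false; // status field is not a number, so the layout does not match
        }
        return new Entry(f[0], f[6], ok);
    }

    /** Counts valid lines per IP. */
    static Map<String, Integer> countByIp(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Entry e = parse(line);
            if (e.valid) {
                counts.merge(e.ip, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String shortLine = "10.0.0.1 - - [04/Jan/2012:00:00:02 +0800] \"GET /index.html HTTP/1.1\" 200 1024";
        String fullLine = shortLine + " \"-\" \"Mozilla/5.0\"";
        // prints {10.0.0.1=2}
        System.out.println(countByIp(Arrays.asList(shortLine, fullLine, "not a log line")));
    }
}
```

Both the short line (no referer or user agent) and the full line are accepted, because only the first ten fields are required. In the real job the same check would live in the KPI parsing method.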

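Second, running on Hadoop 2.x requires some runtime settings in the main method. I used Hadoop 2.9.2 this time, which needs two tools: winutils.exe and hadoop.dll. I have uploaded them to Baidu Netdisk. Link: https://pan.baidu.com/s/1RTSeGjV2VwWxRAvsUMkkrA Extraction code: dkxt. The share contains three items: hadoop.2.9.2, the Eclipse plugin, and winutils. Copy all the files from the hadoop2.6.x folder into the hadoop.2.9.2/bin folder, and copy hadoop.dll from hadoop2.6.x into c:/Windows/System32. Close all applications, restart the computer, and then set the following system properties in the main method: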

		System.setProperty("HADOOP_HOME", "E:\\hadoop\\hadoop2.6");
		System.setProperty("hadoop.home.dir", "E:\\hadoop\\hadoop-2.9.2");
		System.setProperty("HADOOP_USER_NAME", "hadoop");
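Before launching the job, it can save time to confirm that winutils.exe really sits under the bin folder of the directory named by hadoop.home.dir. The following is a small plain-Java sketch of such a check; the class HadoopHomeCheck and its method names are hypothetical helpers, not part of Hadoop. It does not look for hadoop.dll, since that file is loaded through the library path rather than from a fixed location.

```java
import java.io.File;

/** Sketch: check that winutils.exe is where hadoop.home.dir says it should be. */
final class HadoopHomeCheck {

    /** Returns the expected location of winutils.exe: <hadoopHome>/bin/winutils.exe. */
    static File winutils(String hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    /** True if hadoopHome is set and <hadoopHome>/bin/winutils.exe exists as a regular file. */
    static boolean hasWinutils(String hadoopHome) {
        return hadoopHome != null && winutils(hadoopHome).isFile();
    }

    public static void main(String[] args) {
        // Prefer the system property; fall back to the first command-line argument.
        String home = System.getProperty("hadoop.home.dir", args.length > 0 ? args[0] : null);
        if (hasWinutils(home)) {
            System.out.println("Found " + winutils(home).getPath());
        } else {
            System.out.println("winutils.exe not found under bin of: " + home);
        }
    }
}
```

Calling HadoopHomeCheck.hasWinutils right after the System.setProperty lines gives an early, readable message instead of a failure deep inside the job.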

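With these properties set, running the job still fails with an error mentioning access0 (typically an UnsatisfiedLinkError from NativeIO$Windows.access0). If you hit this, create a NativeIO.java file under the project's src directory, in the package org.apache.hadoop.io.nativeio, with the content below. The only change that matters is in Windows.access, which returns true instead of calling the native access0: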

/**

 * Licensed to the Apache Software Foundation (ASF) under one

 * or more contributor license agreements.  See the NOTICE file

 * distributed with this work for additional information

 * regarding copyright ownership.  The ASF licenses this file

 * to you under the Apache License, Version 2.0 (the

 * "License"); you may not use this file except in compliance

 * with the License.  You may obtain a copy of the License at

 *

 *     http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS,

 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

 * See the License for the specific language governing permissions and

 * limitations under the License.

 */

package org.apache.hadoop.io.nativeio;

import java.io.File;

import java.io.FileDescriptor;

import java.io.FileInputStream;

import java.io.FileOutputStream;

import java.io.IOException;

import java.io.RandomAccessFile;

import java.lang.reflect.Field;

import java.nio.ByteBuffer;

import java.nio.MappedByteBuffer;

import java.nio.channels.FileChannel;

import java.util.Map;

import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.classification.InterfaceAudience;

import org.apache.hadoop.classification.InterfaceStability;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.CommonConfigurationKeys;

import org.apache.hadoop.fs.HardLink;

import org.apache.hadoop.io.IOUtils;

import org.apache.hadoop.io.SecureIOUtils.AlreadyExistsException;

import org.apache.hadoop.util.NativeCodeLoader;

import org.apache.hadoop.util.Shell;

import org.apache.hadoop.util.PerformanceAdvisory;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

import sun.misc.Unsafe;

import com.google.common.annotations.VisibleForTesting;

/**

 * JNI wrappers for various native IO-related calls not available in Java.

 * These functions should generally be used alongside a fallback to another

 * more portable mechanism.

 */

@InterfaceAudience.Private

@InterfaceStability.Unstable

public class NativeIO {

  public static class POSIX {

    // Flags for open() call from bits/fcntl.h - Set by JNI

    public static int O_RDONLY = -1;

    public static int O_WRONLY = -1;

    public static int O_RDWR = -1;

    public static int O_CREAT = -1;

    public static int O_EXCL = -1;

    public static int O_NOCTTY = -1;

    public static int O_TRUNC = -1;

    public static int O_APPEND = -1;

    public static int O_NONBLOCK = -1;

    public static int O_SYNC = -1;

    // Flags for posix_fadvise() from bits/fcntl.h - Set by JNI

    /* No further special treatment.  */

    public static int POSIX_FADV_NORMAL = -1;

    /* Expect random page references.  */

    public static int POSIX_FADV_RANDOM = -1;

    /* Expect sequential page references.  */

    public static int POSIX_FADV_SEQUENTIAL = -1;

    /* Will need these pages.  */

    public static int POSIX_FADV_WILLNEED = -1;

    /* Don't need these pages.  */

    public static int POSIX_FADV_DONTNEED = -1;

    /* Data will be accessed once.  */

    public static int POSIX_FADV_NOREUSE = -1;

    // Updated by JNI when supported by glibc.  Leave defaults in case kernel

    // supports sync_file_range, but glibc does not.

    /* Wait upon writeout of all pages

       in the range before performing the

       write.  */

    public static int SYNC_FILE_RANGE_WAIT_BEFORE = 1;

    /* Initiate writeout of all those

       dirty pages in the range which are

       not presently under writeback.  */

    public static int SYNC_FILE_RANGE_WRITE = 2;

    /* Wait upon writeout of all pages in

       the range after performing the

       write.  */

    public static int SYNC_FILE_RANGE_WAIT_AFTER = 4;

    private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class);

    // Set to true via JNI if possible

    public static boolean fadvisePossible = false;

    private static boolean nativeLoaded = false;

    private static boolean syncFileRangePossible = true;

    static final String WORKAROUND_NON_THREADSAFE_CALLS_KEY =

      "hadoop.workaround.non.threadsafe.getpwuid";

    static final boolean WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT = true;

    private static long cacheTimeout = -1;

    private static CacheManipulator cacheManipulator = new CacheManipulator();

    public static CacheManipulator getCacheManipulator() {

      return cacheManipulator;

    }

    public static void setCacheManipulator(CacheManipulator cacheManipulator) {

      POSIX.cacheManipulator = cacheManipulator;

    }

    /**

     * Used to manipulate the operating system cache.

     */

    @VisibleForTesting

    public static class CacheManipulator {

      public void mlock(String identifier, ByteBuffer buffer,

          long len) throws IOException {

        POSIX.mlock(buffer, len);

      }

      public long getMemlockLimit() {

        return NativeIO.getMemlockLimit();

      }

      public long getOperatingSystemPageSize() {

        return NativeIO.getOperatingSystemPageSize();

      }

      public void posixFadviseIfPossible(String identifier,

        FileDescriptor fd, long offset, long len, int flags)

            throws NativeIOException {

        NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset,

            len, flags);

      }

      public boolean verifyCanMlock() {

        return NativeIO.isAvailable();

      }

    }

    /**

     * A CacheManipulator used for testing which does not actually call mlock.

     * This allows many tests to be run even when the operating system does not

     * allow mlock, or only allows limited mlocking.

     */

    @VisibleForTesting

    public static class NoMlockCacheManipulator extends CacheManipulator {

      public void mlock(String identifier, ByteBuffer buffer,

          long len) throws IOException {

        LOG.info("mlocking " + identifier);

      }

      public long getMemlockLimit() {

        return 1125899906842624L;

      }

      public long getOperatingSystemPageSize() {

        return 4096;

      }

      public boolean verifyCanMlock() {

        return true;

      }

    }

    static {

      if (NativeCodeLoader.isNativeCodeLoaded()) {

        try {

          Configuration conf = new Configuration();

          workaroundNonThreadSafePasswdCalls = conf.getBoolean(

            WORKAROUND_NON_THREADSAFE_CALLS_KEY,

            WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT);

          initNative();

          nativeLoaded = true;

          cacheTimeout = conf.getLong(

            CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_KEY,

            CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_DEFAULT) *

            1000;

          LOG.debug("Initialized cache for IDs to User/Group mapping with a " +

            " cache timeout of " + cacheTimeout/1000 + " seconds.");

        } catch (Throwable t) {

          // This can happen if the user has an older version of libhadoop.so

          // installed - in this case we can continue without native IO

          // after warning

          PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);

        }

      }

    }

    /**

     * Return true if the JNI-based native IO extensions are available.

     */

    public static boolean isAvailable() {

      return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;

    }

    private static void assertCodeLoaded() throws IOException {

      if (!isAvailable()) {

        throw new IOException("NativeIO was not loaded");

      }

    }

    /** Wrapper around open(2) */

    public static native FileDescriptor open(String path, int flags, int mode) throws IOException;

    /** Wrapper around fstat(2) */

    private static native Stat fstat(FileDescriptor fd) throws IOException;

    /** Native chmod implementation. On UNIX, it is a wrapper around chmod(2) */

    private static native void chmodImpl(String path, int mode) throws IOException;

    public static void chmod(String path, int mode) throws IOException {

      if (!Shell.WINDOWS) {

        chmodImpl(path, mode);

      } else {

        try {

          chmodImpl(path, mode);

        } catch (NativeIOException nioe) {

          if (nioe.getErrorCode() == 3) {

            throw new NativeIOException("No such file or directory",

                Errno.ENOENT);

          } else {

            LOG.warn(String.format("NativeIO.chmod error (%d): %s",

                nioe.getErrorCode(), nioe.getMessage()));

            throw new NativeIOException("Unknown error", Errno.UNKNOWN);

          }

        }

      }

    }

    /** Wrapper around posix_fadvise(2) */

    static native void posix_fadvise(

      FileDescriptor fd, long offset, long len, int flags) throws NativeIOException;

    /** Wrapper around sync_file_range(2) */

    static native void sync_file_range(

      FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException;

    /**

     * Call posix_fadvise on the given file descriptor. See the manpage

     * for this syscall for more information. On systems where this

     * call is not available, does nothing.

     *

     * @throws NativeIOException if there is an error with the syscall

     */

    static void posixFadviseIfPossible(String identifier,

        FileDescriptor fd, long offset, long len, int flags)

        throws NativeIOException {

      if (nativeLoaded && fadvisePossible) {

        try {

          posix_fadvise(fd, offset, len, flags);

        } catch (UnsatisfiedLinkError ule) {

          fadvisePossible = false;

        }

      }

    }

    /**

     * Call sync_file_range on the given file descriptor. See the manpage

     * for this syscall for more information. On systems where this

     * call is not available, does nothing.

     *

     * @throws NativeIOException if there is an error with the syscall

     */

    public static void syncFileRangeIfPossible(

        FileDescriptor fd, long offset, long nbytes, int flags)

        throws NativeIOException {

      if (nativeLoaded && syncFileRangePossible) {

        try {

          sync_file_range(fd, offset, nbytes, flags);

        } catch (UnsupportedOperationException uoe) {

          syncFileRangePossible = false;

        } catch (UnsatisfiedLinkError ule) {

          syncFileRangePossible = false;

        }

      }

    }

    static native void mlock_native(

        ByteBuffer buffer, long len) throws NativeIOException;

    /**

     * Locks the provided direct ByteBuffer into memory, preventing it from

     * swapping out. After a buffer is locked, future accesses will not incur

     * a page fault.

     *

     * See the mlock(2) man page for more information.

     *

     * @throws NativeIOException

     */

    static void mlock(ByteBuffer buffer, long len)

        throws IOException {

      assertCodeLoaded();

      if (!buffer.isDirect()) {

        throw new IOException("Cannot mlock a non-direct ByteBuffer");

      }

      mlock_native(buffer, len);

    }

    /**

     * Unmaps the block from memory. See munmap(2).

     *

     * There isn't any portable way to unmap a memory region in Java.

     * So we use the sun.nio method here.

     * Note that unmapping a memory region could cause crashes if code

     * continues to reference the unmapped code.  However, if we don't

     * manually unmap the memory, we are dependent on the finalizer to

     * do it, and we have no idea when the finalizer will run.

     *

     * @param buffer    The buffer to unmap.

     */

    public static void munmap(MappedByteBuffer buffer) {

      if (buffer instanceof sun.nio.ch.DirectBuffer) {

        sun.misc.Cleaner cleaner =

            ((sun.nio.ch.DirectBuffer)buffer).cleaner();

        cleaner.clean();

      }

    }

    /** Linux only methods used for getOwner() implementation */

    private static native long getUIDforFDOwnerforOwner(FileDescriptor fd) throws IOException;

    private static native String getUserName(long uid) throws IOException;

    /**

     * Result type of the fstat call

     */

    public static class Stat {

      private int ownerId, groupId;

      private String owner, group;

      private int mode;

      // Mode constants - Set by JNI

      public static int S_IFMT = -1;    /* type of file */

      public static int S_IFIFO  = -1;  /* named pipe (fifo) */

      public static int S_IFCHR  = -1;  /* character special */

      public static int S_IFDIR  = -1;  /* directory */

      public static int S_IFBLK  = -1;  /* block special */

      public static int S_IFREG  = -1;  /* regular */

      public static int S_IFLNK  = -1;  /* symbolic link */

      public static int S_IFSOCK = -1;  /* socket */

      public static int S_ISUID = -1;  /* set user id on execution */

      public static int S_ISGID = -1;  /* set group id on execution */

      public static int S_ISVTX = -1;  /* save swapped text even after use */

      public static int S_IRUSR = -1;  /* read permission, owner */

      public static int S_IWUSR = -1;  /* write permission, owner */

      public static int S_IXUSR = -1;  /* execute/search permission, owner */

      Stat(int ownerId, int groupId, int mode) {

        this.ownerId = ownerId;

        this.groupId = groupId;

        this.mode = mode;

      }

      Stat(String owner, String group, int mode) {

        if (!Shell.WINDOWS) {

          this.owner = owner;

        } else {

          this.owner = stripDomain(owner);

        }

        if (!Shell.WINDOWS) {

          this.group = group;

        } else {

          this.group = stripDomain(group);

        }

        this.mode = mode;

      }

      @Override

      public String toString() {

        return "Stat(owner='" + owner + "', group='" + group + "'" +

          ", mode=" + mode + ")";

      }

      public String getOwner() {

        return owner;

      }

      public String getGroup() {

        return group;

      }

      public int getMode() {

        return mode;

      }

    }

    /**

     * Returns the file stat for a file descriptor.

     *

     * @param fd file descriptor.

     * @return the file descriptor file stat.

     * @throws IOException thrown if there was an IO error while obtaining the file stat.

     */

    public static Stat getFstat(FileDescriptor fd) throws IOException {

      Stat stat = null;

      if (!Shell.WINDOWS) {

        stat = fstat(fd);

        stat.owner = getName(IdCache.USER, stat.ownerId);

        stat.group = getName(IdCache.GROUP, stat.groupId);

      } else {

        try {

          stat = fstat(fd);

        } catch (NativeIOException nioe) {

          if (nioe.getErrorCode() == 6) {

            throw new NativeIOException("The handle is invalid.",

                Errno.EBADF);

          } else {

            LOG.warn(String.format("NativeIO.getFstat error (%d): %s",

                nioe.getErrorCode(), nioe.getMessage()));

            throw new NativeIOException("Unknown error", Errno.UNKNOWN);

          }

        }

      }

      return stat;

    }

    private static String getName(IdCache domain, int id) throws IOException {

      Map<Integer, CachedName> idNameCache = (domain == IdCache.USER)

        ? USER_ID_NAME_CACHE : GROUP_ID_NAME_CACHE;

      String name;

      CachedName cachedName = idNameCache.get(id);

      long now = System.currentTimeMillis();

      if (cachedName != null && (cachedName.timestamp + cacheTimeout) > now) {

        name = cachedName.name;

      } else {

        name = (domain == IdCache.USER) ? getUserName(id) : getGroupName(id);

        if (LOG.isDebugEnabled()) {

          String type = (domain == IdCache.USER) ? "UserName" : "GroupName";

          LOG.debug("Got " + type + " " + name + " for ID " + id +

            " from the native implementation");

        }

        cachedName = new CachedName(name, now);

        idNameCache.put(id, cachedName);

      }

      return name;

    }

    static native String getUserName(int uid) throws IOException;

    static native String getGroupName(int uid) throws IOException;

    private static class CachedName {

      final long timestamp;

      final String name;

      public CachedName(String name, long timestamp) {

        this.name = name;

        this.timestamp = timestamp;

      }

    }

    private static final Map<Integer, CachedName> USER_ID_NAME_CACHE =

      new ConcurrentHashMap<Integer, CachedName>();

    private static final Map<Integer, CachedName> GROUP_ID_NAME_CACHE =

      new ConcurrentHashMap<Integer, CachedName>();

    private enum IdCache { USER, GROUP }

    public final static int MMAP_PROT_READ = 0x1;

    public final static int MMAP_PROT_WRITE = 0x2;

    public final static int MMAP_PROT_EXEC = 0x4; 

    public static native long mmap(FileDescriptor fd, int prot,

        boolean shared, long length) throws IOException;

    public static native void munmap(long addr, long length)

        throws IOException;

  }

  private static boolean workaroundNonThreadSafePasswdCalls = false;

  public static class Windows {

    // Flags for CreateFile() call on Windows

    public static final long GENERIC_READ = 0x80000000L;

    public static final long GENERIC_WRITE = 0x40000000L;

    public static final long FILE_SHARE_READ = 0x00000001L;

    public static final long FILE_SHARE_WRITE = 0x00000002L;

    public static final long FILE_SHARE_DELETE = 0x00000004L;

    public static final long CREATE_NEW = 1;

    public static final long CREATE_ALWAYS = 2;

    public static final long OPEN_EXISTING = 3;

    public static final long OPEN_ALWAYS = 4;

    public static final long TRUNCATE_EXISTING = 5;

    public static final long FILE_BEGIN = 0;

    public static final long FILE_CURRENT = 1;

    public static final long FILE_END = 2;

    public static final long FILE_ATTRIBUTE_NORMAL = 0x00000080L;

    /**

     * Create a directory with permissions set to the specified mode.  By setting

     * permissions at creation time, we avoid issues related to the user lacking

     * WRITE_DAC rights on subsequent chmod calls.  One example where this can

     * occur is writing to an SMB share where the user does not have Full Control

     * rights, and therefore WRITE_DAC is denied.

     *

     * @param path directory to create

     * @param mode permissions of new directory

     * @throws IOException if there is an I/O error

     */

    public static void createDirectoryWithMode(File path, int mode)

        throws IOException {

      createDirectoryWithMode0(path.getAbsolutePath(), mode);

    }

    /** Wrapper around CreateDirectory() on Windows */

    private static native void createDirectoryWithMode0(String path, int mode)

        throws NativeIOException;

    /** Wrapper around CreateFile() on Windows */

    public static native FileDescriptor createFile(String path,

        long desiredAccess, long shareMode, long creationDisposition)

        throws IOException;

    /**

     * Create a file for write with permissions set to the specified mode.  By

     * setting permissions at creation time, we avoid issues related to the user

     * lacking WRITE_DAC rights on subsequent chmod calls.  One example where

     * this can occur is writing to an SMB share where the user does not have

     * Full Control rights, and therefore WRITE_DAC is denied.

     *

     * This method mimics the semantics implemented by the JDK in

     * {@link java.io.FileOutputStream}.  The file is opened for truncate or

     * append, the sharing mode allows other readers and writers, and paths

     * longer than MAX_PATH are supported.  (See io_util_md.c in the JDK.)

     *

     * @param path file to create

     * @param append if true, then open file for append

     * @param mode permissions of new directory

     * @return FileOutputStream of opened file

     * @throws IOException if there is an I/O error

     */

    public static FileOutputStream createFileOutputStreamWithMode(File path,

        boolean append, int mode) throws IOException {

      long desiredAccess = GENERIC_WRITE;

      long shareMode = FILE_SHARE_READ | FILE_SHARE_WRITE;

      long creationDisposition = append ? OPEN_ALWAYS : CREATE_ALWAYS;

      return new FileOutputStream(createFileWithMode0(path.getAbsolutePath(),

          desiredAccess, shareMode, creationDisposition, mode));

    }

    /** Wrapper around CreateFile() with security descriptor on Windows */

    private static native FileDescriptor createFileWithMode0(String path,

        long desiredAccess, long shareMode, long creationDisposition, int mode)

        throws NativeIOException;

    /** Wrapper around SetFilePointer() on Windows */

    public static native long setFilePointer(FileDescriptor fd,

        long distanceToMove, long moveMethod) throws IOException;

    /** Windows only methods used for getOwner() implementation */

    private static native String getOwner(FileDescriptor fd) throws IOException;

    /** Supported list of Windows access right flags */

    public static enum AccessRight {

      ACCESS_READ (0x0001),      // FILE_READ_DATA

      ACCESS_WRITE (0x0002),     // FILE_WRITE_DATA

      ACCESS_EXECUTE (0x0020);   // FILE_EXECUTE

      private final int accessRight;

      AccessRight(int access) {

        accessRight = access;

      }

      public int accessRight() {

        return accessRight;

      }

    };

    /** Windows only method used to check if the current process has requested

     *  access rights on the given path. */

    private static native boolean access0(String path, int requestedAccess);

    /**

     * Checks whether the current process has desired access rights on

     * the given path.

     *

     * Longer term this native function can be substituted with JDK7

     * function Files#isReadable, isWritable, isExecutable.

     *

     * @param path input path

     * @param desiredAccess ACCESS_READ, ACCESS_WRITE or ACCESS_EXECUTE

     * @return true if access is allowed

     * @throws IOException I/O exception on error

     */

    public static boolean access(String path, AccessRight desiredAccess)

        throws IOException {

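      // Modified for local runs on Windows: upstream calls
      // access0(path, desiredAccess.accessRight()) here; returning true
      // skips the native access check that triggers the error.
      return true;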

    }

    /**

     * Extends both the minimum and maximum working set size of the current

     * process.  This method gets the current minimum and maximum working set

     * size, adds the requested amount to each and then sets the minimum and

     * maximum working set size to the new values.  Controlling the working set

     * size of the process also controls the amount of memory it can lock.

     *

     * @param delta amount to increment minimum and maximum working set size

     * @throws IOException for any error

     * @see POSIX#mlock(ByteBuffer, long)

     */

    public static native void extendWorkingSetSize(long delta) throws IOException;

    static {

      if (NativeCodeLoader.isNativeCodeLoaded()) {

        try {

          initNative();

          nativeLoaded = true;

        } catch (Throwable t) {

          // This can happen if the user has an older version of libhadoop.so

          // installed - in this case we can continue without native IO

          // after warning

          PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);

        }

      }

    }

  }

  private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class);

  private static boolean nativeLoaded = false;

  static {

    if (NativeCodeLoader.isNativeCodeLoaded()) {

      try {

        initNative();

        nativeLoaded = true;

      } catch (Throwable t) {

        // This can happen if the user has an older version of libhadoop.so

        // installed - in this case we can continue without native IO

        // after warning

        PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);

      }

    }

  }

  /**

   * Return true if the JNI-based native IO extensions are available.

   */

  public static boolean isAvailable() {

    return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;

  }

  /** Initialize the JNI method ID and class ID cache */

  private static native void initNative();

  /**

   * Get the maximum number of bytes that can be locked into memory at any

   * given point.

   *

   * @return 0 if no bytes can be locked into memory;

   *         Long.MAX_VALUE if there is no limit;

   *         The number of bytes that can be locked into memory otherwise.

   */

  static long getMemlockLimit() {

    return isAvailable() ? getMemlockLimit0() : 0;

  }

  private static native long getMemlockLimit0();

  /**

   * @return the operating system's page size.

   */

  static long getOperatingSystemPageSize() {

    try {

      Field f = Unsafe.class.getDeclaredField("theUnsafe");

      f.setAccessible(true);

      Unsafe unsafe = (Unsafe)f.get(null);

      return unsafe.pageSize();

    } catch (Throwable e) {

      LOG.warn("Unable to get operating system page size.  Guessing 4096.", e);

      return 4096;

    }

  }

  private static class CachedUid {

    final long timestamp;

    final String username;

    public CachedUid(String username, long timestamp) {

      this.timestamp = timestamp;

      this.username = username;

    }

  }

  private static final Map<Long, CachedUid> uidCache =

      new ConcurrentHashMap<Long, CachedUid>();

  private static long cacheTimeout;

  private static boolean initialized = false;

  /**

   * The Windows logon name has two part, NetBIOS domain name and

   * user account name, of the format DOMAIN\UserName. This method

   * will remove the domain part of the full logon name.

   *

   * @param name the full principal name containing the domain

   * @return name with domain removed

   */

  private static String stripDomain(String name) {

    int i = name.indexOf('\\');

    if (i != -1)

      name = name.substring(i + 1);

    return name;

  }

  public static String getOwner(FileDescriptor fd) throws IOException {

    ensureInitialized();

    if (Shell.WINDOWS) {

      String owner = Windows.getOwner(fd);

      owner = stripDomain(owner);

      return owner;

    } else {

      long uid = POSIX.getUIDforFDOwnerforOwner(fd);

      CachedUid cUid = uidCache.get(uid);

      long now = System.currentTimeMillis();

      if (cUid != null && (cUid.timestamp + cacheTimeout) > now) {

        return cUid.username;

      }

      String user = POSIX.getUserName(uid);

      LOG.info("Got UserName " + user + " for UID " + uid

          + " from the native implementation");

      cUid = new CachedUid(user, now);

      uidCache.put(uid, cUid);

      return user;

    }

  }

  /**

   * Create a FileDescriptor that shares delete permission on the

   * file opened at a given offset, i.e. other process can delete

   * the file the FileDescriptor is reading. Only Windows implementation

   * uses the native interface.

   */

  public static FileDescriptor getShareDeleteFileDescriptor(

      File f, long seekOffset) throws IOException {

    if (!Shell.WINDOWS) {

      RandomAccessFile rf = new RandomAccessFile(f, "r");

      if (seekOffset > 0) {

        rf.seek(seekOffset);

      }

      return rf.getFD();

    } else {

      // Use Windows native interface to create a FileDescriptor that

      // shares delete permission on the file opened, and set it to the

      // given offset.

      //

      FileDescriptor fd = NativeIO.Windows.createFile(

          f.getAbsolutePath(),

          NativeIO.Windows.GENERIC_READ,

          NativeIO.Windows.FILE_SHARE_READ |

              NativeIO.Windows.FILE_SHARE_WRITE |

              NativeIO.Windows.FILE_SHARE_DELETE,

          NativeIO.Windows.OPEN_EXISTING);

      if (seekOffset > 0)

        NativeIO.Windows.setFilePointer(fd, seekOffset, NativeIO.Windows.FILE_BEGIN);

      return fd;

    }

  }

  /**

   * Create the specified File for write access, ensuring that it does not exist.

   * @param f the file that we want to create

   * @param permissions we want to have on the file (if security is enabled)

   *

   * @throws AlreadyExistsException if the file already exists

   * @throws IOException if any other error occurred

   */

  public static FileOutputStream getCreateForWriteFileOutputStream(File f, int permissions)

      throws IOException {

    if (!Shell.WINDOWS) {

      // Use the native wrapper around open(2)

      try {

        FileDescriptor fd = NativeIO.POSIX.open(f.getAbsolutePath(),

            NativeIO.POSIX.O_WRONLY | NativeIO.POSIX.O_CREAT

                | NativeIO.POSIX.O_EXCL, permissions);

        return new FileOutputStream(fd);

      } catch (NativeIOException nioe) {

        if (nioe.getErrno() == Errno.EEXIST) {

          throw new AlreadyExistsException(nioe);

        }

        throw nioe;

      }

    } else {

      // Use the Windows native APIs to create equivalent FileOutputStream

      try {

        FileDescriptor fd = NativeIO.Windows.createFile(f.getCanonicalPath(),

            NativeIO.Windows.GENERIC_WRITE,

            NativeIO.Windows.FILE_SHARE_DELETE

                | NativeIO.Windows.FILE_SHARE_READ

                | NativeIO.Windows.FILE_SHARE_WRITE,

            NativeIO.Windows.CREATE_NEW);

        NativeIO.POSIX.chmod(f.getCanonicalPath(), permissions);

        return new FileOutputStream(fd);

      } catch (NativeIOException nioe) {

        if (nioe.getErrorCode() == 80) {

          // ERROR_FILE_EXISTS

          // 80 (0x50)

          // The file exists

          throw new AlreadyExistsException(nioe);

        }

        throw nioe;

      }

    }

  }

  private synchronized static void ensureInitialized() {

    if (!initialized) {

      cacheTimeout =

          new Configuration().getLong("hadoop.security.uid.cache.secs",

              4*60*60) * 1000;

      LOG.info("Initialized cache for UID to User mapping with a cache" +

          " timeout of " + cacheTimeout/1000 + " seconds.");

      initialized = true;

    }

  }

  /**

   * A version of renameTo that throws a descriptive exception when it fails.

   *

   * @param src                  The source path

   * @param dst                  The destination path

   *

   * @throws NativeIOException   On failure.

   */

  public static void renameTo(File src, File dst)

      throws IOException {

    if (!nativeLoaded) {

      if (!src.renameTo(dst)) {

        throw new IOException("renameTo(src=" + src + ", dst=" +

          dst + ") failed.");

      }

    } else {

      renameTo0(src.getAbsolutePath(), dst.getAbsolutePath());

    }

  }

  /**

   * Creates a hardlink "dst" that points to "src".

   *

   * This is deprecated since JDK7 NIO can create hardlinks via the

   * {@link java.nio.file.Files} API.

   *

   * @param src source file

   * @param dst hardlink location

   * @throws IOException

   */

  @Deprecated

  public static void link(File src, File dst) throws IOException {

    if (!nativeLoaded) {

      HardLink.createHardLink(src, dst);

    } else {

      link0(src.getAbsolutePath(), dst.getAbsolutePath());

    }

  }

  /**

   * A version of renameTo that throws a descriptive exception when it fails.

   *

   * @param src                  The source path

   * @param dst                  The destination path

   *

   * @throws NativeIOException   On failure.

   */

  private static native void renameTo0(String src, String dst)

      throws NativeIOException;

  private static native void link0(String src, String dst)

      throws NativeIOException;

  /**

   * Unbuffered file copy from src to dst without tainting OS buffer cache

   *

   * In POSIX platform:

   * It uses FileChannel#transferTo() which internally attempts

   * unbuffered IO on OS with native sendfile64() support and falls back to

   * buffered IO otherwise.

   *

   * It minimizes the number of FileChannel#transferTo call by passing the the

   * src file size directly instead of a smaller size as the 3rd parameter.

   * This saves the number of sendfile64() system call when native sendfile64()

   * is supported. In the two fall back cases where sendfile is not supported,

   * FileChannle#transferTo already has its own batching of size 8 MB and 8 KB,

   * respectively.

   *

   * In Windows Platform:

   * It uses its own native wrapper of CopyFileEx with COPY_FILE_NO_BUFFERING

   * flag, which is supported on Windows Server 2008 and above.

   *

   * Ideally, we should use FileChannel#transferTo() across both POSIX and Windows

   * platform. Unfortunately, the wrapper(Java_sun_nio_ch_FileChannelImpl_transferTo0)

   * used by FileChannel#transferTo for unbuffered IO is not implemented on Windows.

   * Based on OpenJDK 6/7/8 source code, Java_sun_nio_ch_FileChannelImpl_transferTo0

   * on Windows simply returns IOS_UNSUPPORTED.

   *

   * Note: This simple native wrapper does minimal parameter checking before copy and

   * consistency check (e.g., size) after copy.

   * It is recommended to use wrapper function like

   * the Storage#nativeCopyFileUnbuffered() function in hadoop-hdfs with pre/post copy

   * checks.

   *

   * @param src                  The source path

   * @param dst                  The destination path

   * @throws IOException

   */

  public static void copyFileUnbuffered(File src, File dst) throws IOException {

    if (nativeLoaded && Shell.WINDOWS) {

      copyFileUnbuffered0(src.getAbsolutePath(), dst.getAbsolutePath());

    } else {

      FileInputStream fis = new FileInputStream(src);

      FileChannel input = null;

      try {

        input = fis.getChannel();

        try (FileOutputStream fos = new FileOutputStream(dst);

             FileChannel output = fos.getChannel()) {

          long remaining = input.size();

          long position = 0;

          long transferred = 0;

          while (remaining > 0) {

            transferred = input.transferTo(position, remaining, output);

            remaining -= transferred;

            position += transferred;

          }

        }

      } finally {

        IOUtils.cleanupWithLogger(LOG, input, fis);

      }

    }

  }

  private static native void copyFileUnbuffered0(String src, String dst)

      throws NativeIOException;

}

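3. About the Maven build: I was on a slow corporate intranet when I ran this project, so I changed strategy. I created a plain Java project and copied in the jar files from the common, hdfs, httpfs, yarn, and mapreduce directories under share in Hadoop 2.9.2. Quite a few bugs came up while running it. The jars in the project are: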

hadoop-hdfs-2.9.2.jar

hadoop-hdfs-client-2.9.2.jar

hadoop-mapreduce-client-app-2.9.2.jar

hadoop-mapreduce-client-common-2.9.2.jar

hadoop-mapreduce-client-core-2.9.2.jar

hadoop-mapreduce-client-hs-2.9.2.jar

hadoop-mapreduce-client-jobclient-2.9.2-tests.jar

hadoop-mapreduce-client-shuffle-2.9.2.jar

hadoop-yarn-api-2.9.2.jar

hadoop-yarn-applications-distributedshell-2.9.2.jar

hadoop-yarn-applications-unmanaged-am-launcher-2.9.2.jar

hadoop-yarn-client-2.9.2.jar

activation-1.1.jar

aopalliance-1.0.jar

apacheds-i18n-2.0.0-M15.jar

apacheds-kerberos-codec-2.0.0-M15.jar

api-asn1-api-1.0.0-M20.jar

api-util-1.0.0-M20.jar

asm-3.2.jar

avro-1.7.7.jar

commons-beanutils-1.7.0.jar

commons-beanutils-core-1.8.0.jar

commons-cli-1.2.jar

commons-codec-1.4.jar

commons-collections-3.2.2.jar

commons-compress-1.4.1.jar

commons-configuration-1.6.jar

commons-digester-1.8.jar

commons-io-2.4.jar

commons-lang-2.6.jar

commons-lang3-3.4.jar

commons-logging-1.1.3.jar

commons-math3-3.1.1.jar

commons-net-3.1.jar

curator-client-2.7.1.jar

curator-framework-2.7.1.jar

curator-recipes-2.7.1.jar

ehcache-3.3.1.jar

fst-2.50.jar

geronimo-jcache_1.0_spec-1.0-alpha-1.jar

gson-2.2.4.jar

guava-11.0.2.jar

guice-3.0.jar

guice-servlet-3.0.jar

HikariCP-java7-2.4.12.jar

htrace-core4-4.1.0-incubating.jar

httpclient-4.5.2.jar

httpcore-4.4.4.jar

jackson-core-asl-1.9.13.jar

jackson-jaxrs-1.9.13.jar

jackson-mapper-asl-1.9.13.jar

jackson-xc-1.9.13.jar

java-util-1.9.0.jar

java-xmlbuilder-0.4.jar

javax.inject-1.jar

jaxb-api-2.2.2.jar

jaxb-impl-2.2.3-1.jar

jcip-annotations-1.0-1.jar

jersey-client-1.9.jar

jersey-core-1.9.jar

jersey-guice-1.9.jar

jersey-json-1.9.jar

jersey-server-1.9.jar

jets3t-0.9.0.jar

jettison-1.1.jar

jetty-6.1.26.jar

jetty-sslengine-6.1.26.jar

jetty-util-6.1.26.jar

jsch-0.1.54.jar

json-io-2.5.1.jar

json-smart-1.3.1.jar

jsp-api-2.1.jar

jsr305-3.0.0.jar

leveldbjni-all-1.8.jar

log4j-1.2.17.jar

metrics-core-3.0.1.jar

mssql-jdbc-6.2.1.jre7.jar

netty-3.6.2.Final.jar

nimbus-jose-jwt-4.41.1.jar

paranamer-2.3.jar

protobuf-java-2.5.0.jar

servlet-api-2.5.jar

snappy-java-1.0.5.jar

stax-api-1.0-2.jar

stax2-api-3.1.4.jar

woodstox-core-5.0.3.jar

xmlenc-0.52.jar

xz-1.0.jar

zookeeper-3.4.6.jar

hadoop-common-2.9.2.jar

slf4j-api-1.7.25.jar

slf4j-log4j12-1.7.25.jar

hadoop-yarn-server-nodemanager-2.9.2.jar

hadoop-yarn-server-resourcemanager-2.9.2.jar

hadoop-yarn-server-router-2.9.2.jar

hadoop-yarn-server-sharedcachemanager-2.9.2.jar

hadoop-yarn-server-timeline-pluginstorage-2.9.2.jar

hadoop-yarn-server-web-proxy-2.9.2.jar

hadoop-yarn-ui-2.9.2.war

hadoop-annotations-2.9.2.jar

hadoop-auth-2.9.2.jar

hadoop-nfs-2.9.2.jar

hamcrest-core-1.3.jar

junit-4.11.jar

hadoop-mapreduce-client-jobclient-2.9.2.jar

mockito-all-1.8.5.jar

ojdbc7.jar

orai18n.jar

hadoop-yarn-common-2.9.2.jar

hadoop-yarn-registry-2.9.2.jar

hadoop-yarn-server-applicationhistoryservice-2.9.2.jar

hadoop-yarn-server-common-2.9.2.jar

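Preface

Web logs hold some of a website's most important information. By analyzing them we can learn the site's traffic, which pages attract the most visitors, which pages are the most valuable, and so on. A typical mid-sized site (more than 100,000 PV) produces over 1 GB of web logs per day. Large or very large sites may produce 10 GB of data per hour.

For data at this scale, Hadoop is a very good fit for log analysis.

Web log analysis overview
Requirements analysis: KPI design
Algorithm model: Hadoop parallel algorithms
Architecture design: log KPI system architecture
Development 1: building the Hadoop project with Maven
Development 2: MapReduce implementation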

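1. Web log analysis overview

Web logs are produced by the web server, which may be Nginx, Apache, Tomcat, and so on. From the logs we can obtain the PV (PageView) count and the number of unique IPs for each kind of page. With a little more work we can compute a ranking of the keywords users searched for, or the pages where users stay the longest. Going further, we can build ad click models and analyze user behavior.

Each log entry usually represents one user request. For example, here is one nginx log entry: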



222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939

 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1)

 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

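It breaks down into the following 8 variables:

remote_addr: the client IP address, 222.68.172.190
remote_user: the client user name, -
time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
status: the request status (200 means success), 200
body_bytes_sent: the size of the response body sent to the client, 19939
http_referer: the page that linked to this request, "http://www.angularjs.cn/A00n"
http_user_agent: information about the client browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

Note: collecting more information requires other means, such as sending separate requests from JavaScript and recording the user's visits with cookies.

With this log information we can start digging into how the site is really used.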
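To make the 8-variable breakdown concrete, here is a minimal sketch that splits one log line with a regular expression. The class name LogLineFields and the exact pattern are assumptions made for illustration; they are not part of the original project, and the pattern only covers lines shaped like the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of the original project.
final class LogLineFields {

    static final String[] NAMES = {
        "remote_addr", "remote_user", "time_local", "request",
        "status", "body_bytes_sent", "http_referer", "http_user_agent"
    };

    // Assumed layout: addr, ident, user, [time], "request", status, bytes, "referer", "agent".
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the 8 field values in NAMES order, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[NAMES.length];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
            + "\"http://www.angularjs.cn/A00n\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";
        String[] fields = parse(line);
        for (int i = 0; i < NAMES.length; i++) {
            System.out.println(NAMES[i] + ": " + fields[i]);
        }
    }
}
```

Because the quoted fields are matched as a whole, the user agent keeps its embedded spaces, which a plain split on spaces would not preserve.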

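Small data volumes

With small data volumes (10 MB, 100 MB, 10 GB), as long as single-machine processing is still tolerable, we can use the usual Unix/Linux tools directly. awk, grep, sort, and join are all excellent for log analysis, and together with perl, python, and regular expressions they can solve practically every problem.

For example, getting the 10 IPs with the most requests from the nginx log above is simple. Note that sort -k2 -r compares the counts as text, which is why 97 comes after 971 in the output; use sort -k2 -nr for a numeric ranking.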



~ cat access.log.10 | awk '{a[$1]++} END {for(b in a) print b"\t"a[b]}' | sort -k2 -r | head -n 10

163.177.71.12   972

101.226.68.137  972

183.195.232.138 971

50.116.27.194   97

14.17.29.86     96

61.135.216.104  94

61.135.216.105  91

61.186.190.41   9

59.39.192.108   9

220.181.51.212  9
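For comparison, the same top-N count can be sketched in plain Java, this time comparing counts as integers. The class name TopIps and the input lines in main are made up for illustration and are not taken from the original log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it is not part of the original project.
final class TopIps {

    /** Counts the first space-separated token of each line and returns the n most frequent. */
    static List<Map.Entry<String, Integer>> top(Iterable<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.merge(line.substring(0, space), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count as a number (descending), then by IP text so ties are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up input lines; only the leading IP token matters here.
        List<String> lines = Arrays.asList(
            "10.0.0.1 - - first", "10.0.0.2 - - second", "10.0.0.1 - - third");
        for (Map.Entry<String, Integer> e : top(lines, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This keeps every distinct IP in memory, so like the awk one-liner it only suits data that fits on one machine.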

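Massive data volumes

When data grows by 10 GB or 100 GB per day, a single machine can no longer keep up. We then have to accept a more complex system and use computer clusters and storage arrays. Before Hadoop appeared, storing massive data and analyzing massive logs were both very hard, and only a few companies had mastered the core technologies of efficient parallel computing, distributed computing, and distributed storage.

Hadoop greatly lowered the barrier to processing massive data, so that small companies and even individuals can handle it. Hadoop is also very well suited to log analysis systems.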

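2. Requirements analysis: KPI design

Below we start from a company case study to explain how to analyze massive web logs and extract KPI data.

Case study
An e-commerce site runs an online group-buying business. It has 1,000,000 PV and 50,000 unique IPs per day. Traffic peaks on workdays between 10:00 and 12:00 and between 15:00 and 18:00. During the day most visits come from PC browsers; on rest days and at night, mobile devices are more common. Site search traffic accounts for 80% of the whole site's traffic. Fewer than 1% of PC users make a purchase, while 5% of mobile users do.

From this short description we can roughly see how the business is doing, where the paying users come from, which potential users could still be won over, and whether the site is at risk of shutting down.

KPI design

PV (PageView): page view counts
IP: unique IP counts per page
Time: PV per hour
Source: referring domain counts
Browser: visitor device counts

Note: for reasons of commercial confidentiality, the e-commerce site's logs cannot be provided.
The rest of this article extracts and analyzes data from my personal website instead.

Baidu Tongji statistics for my personal website, http://www.fens.me

Basic metrics:

Visitor device metrics:

From a business point of view, a personal website differs from an e-commerce site: it has no conversion rate, and its bounce rate is fairly high. From a technical point of view, both care about the same KPI design.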

3. Algorithm model: Hadoop parallel algorithms

Design of the parallel algorithms:
Note: these use the 8 variables defined in section 1.

PV (PageView): page view counts

Map: {key:$request,value:1}
Reduce: {key:$request,value:sum}

IP: unique IP counts per page

Map: {key:$request,value:$remote_addr}
Reduce: {key:$request,value:de-duplicate, then count (sum(unique))}

Time: PV per hour

Map: {key:$time_local (truncated to the hour),value:1}
Reduce: {key:$time_local,value:sum}

Source: referring domain counts

Map: {key:$http_referer (domain part),value:1}
Reduce: {key:$http_referer,value:sum}

Browser: visitor device counts

Map: {key:$http_user_agent,value:1}
Reduce: {key:$http_user_agent,value:sum}
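The IP metric is the only one whose reduce step is not a plain sum, so here is a small sketch of it in plain Java, without Hadoop, that imitates the map, group, and reduce phases in memory. The class name UniqueIpSketch and the sample records are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the IP metric; it does not use the Hadoop API.
final class UniqueIpSketch {

    /** Each record is {request, remote_addr}. Returns request -> number of distinct IPs. */
    static Map<String, Integer> uniqueIpsPerRequest(String[][] records) {
        // Map + shuffle: emit (request, remote_addr) and group the values by key.
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] record : records) {
            grouped.computeIfAbsent(record[0], k -> new HashSet<>()).add(record[1]);
        }
        // Reduce: the set has already removed duplicates, so its size is sum(unique).
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().size());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up records: two visits to /about come from the same address.
        String[][] records = {
            {"/about", "10.0.0.1"}, {"/about", "10.0.0.1"},
            {"/about", "10.0.0.2"}, {"/hadoop-hive-intro/", "10.0.0.1"}
        };
        System.out.println(uniqueIpsPerRequest(records));
    }
}
```

In a real job the grouping is done by the framework between the map and reduce phases; only the de-duplication and counting would live in the reducer.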

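4. Architecture design: log KPI system architecture

In the diagram above, the left side is the Application business system and the right side is Hadoop's HDFS and MapReduce.

Logs are produced by the business system. We can configure the web server to create a new directory each day, containing multiple log files of 64 MB each.
A cron job runs after midnight and imports the previous day's log files into HDFS.
Once the import finishes, a scheduled job starts the MapReduce program to extract and compute the metrics.
Once the computation finishes, a scheduled job exports the metrics from HDFS to a database, so they can be queried on demand later.

The diagram above shows more clearly how the data flows. The parts with a blue background live in Hadoop. Our next task is to implement the MapReduce program.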

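5. Development 1: building the Hadoop project with Maven

See the article: 用Maven构建Hadoop项目 (Building a Hadoop project with Maven)

The Win7 development environment and the Hadoop runtime environment are described in that article.

We need to upload the log file to the /user/hdfs/log_kpi/ directory in HDFS, using the commands below: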



~ hadoop fs -mkdir /user/hdfs/log_kpi

~ hadoop fs -copyFromLocal /home/conan/datafiles/access.log.10 /user/hdfs/log_kpi/

I have put the whole MapReduce implementation on GitHub:

https://github.com/bsspirit/maven_hadoop_template/releases/tag/kpi_v1

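6. Development 2: MapReduce implementation

Development steps:

Parse the log line
Implement the Map function
Implement the Reduce function
Implement the launcher

1). Parsing the log line
Create the file: org.conan.myhadoop.mr.kpi.KPI.java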



package org.conan.myhadoop.mr.kpi;

import java.text.ParseException;

import java.text.SimpleDateFormat;

import java.util.Date;

import java.util.Locale;

/*

 * KPI Object

 */

public class KPI {

    private String remote_addr;// client IP address

    private String remote_user;// client user name; "-" means not set and is ignored

    private String time_local;// access time and time zone

    private String request;// requested URL and HTTP protocol

    private String status;// request status; 200 means success

    private String body_bytes_sent;// size of the response body sent to the client

    private String http_referer;// the page that linked to this request

    private String http_user_agent;// information about the client browser

    private boolean valid = true;// whether the record is valid

    @Override

    public String toString() {

        StringBuilder sb = new StringBuilder();

        sb.append("valid:" + this.valid);

        sb.append("\nremote_addr:" + this.remote_addr);

        sb.append("\nremote_user:" + this.remote_user);

        sb.append("\ntime_local:" + this.time_local);

        sb.append("\nrequest:" + this.request);

        sb.append("\nstatus:" + this.status);

        sb.append("\nbody_bytes_sent:" + this.body_bytes_sent);

        sb.append("\nhttp_referer:" + this.http_referer);

        sb.append("\nhttp_user_agent:" + this.http_user_agent);

        return sb.toString();

    }

    public String getRemote_addr() {

        return remote_addr;

    }

    public void setRemote_addr(String remote_addr) {

        this.remote_addr = remote_addr;

    }

    public String getRemote_user() {

        return remote_user;

    }

    public void setRemote_user(String remote_user) {

        this.remote_user = remote_user;

    }

    public String getTime_local() {

        return time_local;

    }

    public Date getTime_local_Date() throws ParseException {

        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);

        return df.parse(this.time_local);

    }

    public String getTime_local_Date_hour() throws ParseException{

        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");

        return df.format(this.getTime_local_Date());

    }

    public void setTime_local(String time_local) {

        this.time_local = time_local;

    }

    public String getRequest() {

        return request;

    }

    public void setRequest(String request) {

        this.request = request;

    }

    public String getStatus() {

        return status;

    }

    public void setStatus(String status) {

        this.status = status;

    }

    public String getBody_bytes_sent() {

        return body_bytes_sent;

    }

    public void setBody_bytes_sent(String body_bytes_sent) {

        this.body_bytes_sent = body_bytes_sent;

    }

    public String getHttp_referer() {

        return http_referer;

    }

    public String getHttp_referer_domain(){

        if(http_referer.length()<8){

            return http_referer;

        }

        String str=this.http_referer.replace("\"", "").replace("http://", "").replace("https://", "");

        return str.indexOf("/")>0?str.substring(0, str.indexOf("/")):str;

    }

    public void setHttp_referer(String http_referer) {

        this.http_referer = http_referer;

    }

    public String getHttp_user_agent() {

        return http_user_agent;

    }

    public void setHttp_user_agent(String http_user_agent) {

        this.http_user_agent = http_user_agent;

    }

    public boolean isValid() {

        return valid;

    }

    public void setValid(boolean valid) {

        this.valid = valid;

    }

    public static void main(String args[]) {

        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";

        System.out.println(line);

        KPI kpi = new KPI();

        String[] arr = line.split(" ");

        kpi.setRemote_addr(arr[0]);

        kpi.setRemote_user(arr[1]);

        kpi.setTime_local(arr[3].substring(1));

        kpi.setRequest(arr[6]);

        kpi.setStatus(arr[8]);

        kpi.setBody_bytes_sent(arr[9]);

        kpi.setHttp_referer(arr[10]);

        kpi.setHttp_user_agent(arr[11] + " " + arr[12]);

        System.out.println(kpi);

        try {

            SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd:HH:mm:ss", Locale.US);

            System.out.println(df.format(kpi.getTime_local_Date()));

            System.out.println(kpi.getTime_local_Date_hour());

            System.out.println(kpi.getHttp_referer_domain());

        } catch (ParseException e) {

            e.printStackTrace();

        }

    }

}

The main function above is a simple parsing test that takes one line from the log file.

Console output:



222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

valid:true

remote_addr:222.68.172.190

remote_user:-

time_local:18/Sep/2013:06:49:57

request:/images/my.jpg

status:200

body_bytes_sent:19939

http_referer:"http://www.angularjs.cn/A00n"

http_user_agent:"Mozilla/5.0 (Windows

2013.09.18:06:49:57

2013091806

www.angularjs.cn

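We can see that the log line was parsed into the properties of the kpi object. Note that http_user_agent is cut off after the second space-separated token, because the line is split on spaces and only arr[11] and arr[12] are kept. Next we wrap the parsing in a method of its own.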



    private static KPI parser(String line) {

        System.out.println(line);

        KPI kpi = new KPI();

        String[] arr = line.split(" ");

        if (arr.length > 11) {

            kpi.setRemote_addr(arr[0]);

            kpi.setRemote_user(arr[1]);

            kpi.setTime_local(arr[3].substring(1));

            kpi.setRequest(arr[6]);

            kpi.setStatus(arr[8]);

            kpi.setBody_bytes_sent(arr[9]);

            kpi.setHttp_referer(arr[10]);

            if (arr.length > 12) {

                kpi.setHttp_user_agent(arr[11] + " " + arr[12]);

            } else {

                kpi.setHttp_user_agent(arr[11]);

            }

            if (Integer.parseInt(kpi.getStatus()) >= 400) {// 400 or above: HTTP error

                kpi.setValid(false);

            }

        } else {

            kpi.setValid(false);

        }

        return kpi;

    }
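One thing to watch in the parser above: Integer.parseInt throws NumberFormatException when the status field is not a number, which can happen if a malformed line shifts the fields. A minimal defensive sketch follows; the class StatusCheck and the method isSuccessful are hypothetical names, not part of the original KPI class.

```java
// Hypothetical helper for illustration; it is not part of the original KPI class.
final class StatusCheck {

    /** Returns true only when status is a number below 400; anything else counts as invalid. */
    static boolean isSuccessful(String status) {
        try {
            return Integer.parseInt(status) < 400;
        } catch (NumberFormatException e) {
            // Covers null, "-", and any other non-numeric token from a malformed line.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSuccessful("200"));
        System.out.println(isSuccessful("404"));
        System.out.println(isSuccessful("-"));
    }
}
```

With a helper like this, the parser could call kpi.setValid(false) for a bad status instead of letting the exception end the map task.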

For each metric, the map method, the reduce method, and the launcher are implemented together in a separate class.

The MapReduce implementation classes are:

PV: org.conan.myhadoop.mr.kpi.KPIPV.java
IP: org.conan.myhadoop.mr.kpi.KPIIP.java
Time: org.conan.myhadoop.mr.kpi.KPITime.java
Browser: org.conan.myhadoop.mr.kpi.KPIBrowser.java

1). PV: org.conan.myhadoop.mr.kpi.KPIPV.java



package org.conan.myhadoop.mr.kpi;

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

import org.apache.hadoop.mapred.TextInputFormat;

import org.apache.hadoop.mapred.TextOutputFormat;

public class KPIPV { 

    public static class KPIPVMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {

        private IntWritable one = new IntWritable(1);

        private Text word = new Text();

        @Override

        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

            KPI kpi = KPI.filterPVs(value.toString());

            if (kpi.isValid()) {

                word.set(kpi.getRequest());

                output.collect(word, one);

            }

        }

    }

    public static class KPIPVReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override

        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

            int sum = 0;

            while (values.hasNext()) {

                sum += values.next().get();

            }

            result.set(sum);

            output.collect(key, result);

        }

    }

    public static void main(String[] args) throws Exception {

        String input = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/";

        String output = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv";

        JobConf conf = new JobConf(KPIPV.class);

        conf.setJobName("KPIPV");

        conf.addResource("classpath:/hadoop/core-site.xml");

        conf.addResource("classpath:/hadoop/hdfs-site.xml");

        conf.addResource("classpath:/hadoop/mapred-site.xml");

        conf.setMapOutputKeyClass(Text.class);

        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);

        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(KPIPVMapper.class);

        conf.setCombinerClass(KPIPVReducer.class);

        conf.setReducerClass(KPIPVReducer.class);

        conf.setInputFormat(TextInputFormat.class);

        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));

        FileOutputFormat.setOutputPath(conf, new Path(output));

        JobClient.runJob(conf);

        System.exit(0);

    }

}

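The program calls a method of the KPI class:

KPI kpi = KPI.filterPVs(value.toString());

The filterPVs method gives us finer control over how PV is counted.

Add the filterPVs method to KPI.java (it also needs imports for java.util.Set and java.util.HashSet):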



    /**

     * Filter PV by page

     */

    public static KPI filterPVs(String line) {

        KPI kpi = parser(line);

        Set<String> pages = new HashSet<String>();

        pages.add("/about");

        pages.add("/black-ip-list/");

        pages.add("/cassandra-clustor/");

        pages.add("/finance-rhive-repurchase/");

        pages.add("/hadoop-family-roadmap/");

        pages.add("/hadoop-hive-intro/");

        pages.add("/hadoop-zookeeper-intro/");

        pages.add("/hadoop-mahout-roadmap/");

        if (!pages.contains(kpi.getRequest())) {

            kpi.setValid(false);

        }

        return kpi;

    }

In the filterPVs method we define a pages filter, so that PV is counted only for these pages.

Now let's run KPIPV.java:
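Since filterPVs runs once per log line, the page set is rebuilt for every record. A possible variation, sketched here under the assumption that the page list is fixed at compile time, builds the set once as a constant. The class name PageFilter and the method isTracked are hypothetical and do not exist in the original project.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical variation for illustration; it is not part of the original KPI class.
final class PageFilter {

    // Built once when the class loads, instead of once per log line.
    private static final Set<String> PAGES = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            "/about",
            "/black-ip-list/",
            "/cassandra-clustor/",
            "/finance-rhive-repurchase/",
            "/hadoop-family-roadmap/",
            "/hadoop-hive-intro/",
            "/hadoop-zookeeper-intro/",
            "/hadoop-mahout-roadmap/")));

    /** Returns true when the request path is one of the pages we count PV for. */
    static boolean isTracked(String request) {
        return request != null && PAGES.contains(request);
    }

    public static void main(String[] args) {
        System.out.println(isTracked("/about"));
        System.out.println(isTracked("/images/my.jpg"));
    }
}
```

The null check also covers records whose request was never set because the line was too short to parse.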



2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush

信息: Starting flush of map output

2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill

信息: Finished spill 0

2013-10-9 11:53:28 org.apache.hadoop.mapred.Task done

信息: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757

2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757

2013-10-9 11:53:30 org.apache.hadoop.mapred.Task sendDone

信息: Task 'attempt_local_0001_m_000000_0' done.

2013-10-9 11:53:30 org.apache.hadoop.mapred.Task initialize

信息:  Using ResourceCalculatorPlugin : null

2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息:

2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge

信息: Merging 1 sorted segments

2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge

信息: Down to the last merge-pass, with 1 segments left of total size: 213 bytes

2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息:

2013-10-9 11:53:30 org.apache.hadoop.mapred.Task done

信息: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息:

2013-10-9 11:53:30 org.apache.hadoop.mapred.Task commit

信息: Task attempt_local_0001_r_000000_0 is allowed to commit now

2013-10-9 11:53:30 org.apache.hadoop.mapred.FileOutputCommitter commitTask

信息: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv

2013-10-9 11:53:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob

信息:  map 100% reduce 0%

2013-10-9 11:53:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate

信息: reduce > reduce

2013-10-9 11:53:33 org.apache.hadoop.mapred.Task sendDone

信息: Task 'attempt_local_0001_r_000000_0' done.

2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob

信息:  map 100% reduce 100%

2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob

信息: Job complete: job_local_0001

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息: Counters: 20

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:   File Input Format Counters

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Bytes Read=3025757

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:   File Output Format Counters

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Bytes Written=183

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:   FileSystemCounters

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     FILE_BYTES_READ=545

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     HDFS_BYTES_READ=6051514

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     FILE_BYTES_WRITTEN=83472

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     HDFS_BYTES_WRITTEN=183

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:   Map-Reduce Framework

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Map output materialized bytes=217

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Map input records=14619

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Reduce shuffle bytes=0

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Spilled Records=16

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Map output bytes=2004

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Total committed heap usage (bytes)=376569856

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Map input bytes=3025757

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     SPLIT_RAW_BYTES=110

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Combine input records=76

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Reduce input records=8

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Reduce input groups=8

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Combine output records=8

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Reduce output records=8

2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log

信息:     Map output records=76

Use the hadoop command to view the file in HDFS:



~ hadoop fs -cat /user/hdfs/log_kpi/pv/part-00000

/about  5

/black-ip-list/ 2

/cassandra-clustor/     3

/finance-rhive-repurchase/      13

/hadoop-family-roadmap/ 13

/hadoop-hive-intro/     14

/hadoop-mahout-roadmap/ 20

/hadoop-zookeeper-intro/        6

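This gives us the PV values of the specified pages from the log file.

The list of specified pages works like a site map. Without it, every requested URL would be counted; by specifying the "site map" we can find the information we need more easily.

The other metrics are extracted in much the same way as PV. You can download the source code and run it to see the results.

I will upload my own code to GitHub later: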

https://github.com/blench/