Original article: http://www.baeldung.com/java-read-lines-large-file

1. Overview

This tutorial will show how to read all the lines from a large file in Java in an efficient manner.

This article is part of the “Java – Back to Basic” tutorial here on Baeldung.

2. Reading In Memory

The standard way of reading the lines of the file is in-memory – both Guava and Apache Commons IO provide a quick way to do just that:

// Guava
Files.readLines(new File(path), Charsets.UTF_8);

// Apache Commons IO
FileUtils.readLines(new File(path));

The problem with this approach is that all the file lines are kept in memory, which will quickly lead to an OutOfMemoryError if the file is large enough.

For example, reading a ~1 GB file:

@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
    String path = ...
    Files.readLines(new File(path), Charsets.UTF_8);
}

This starts off with only a small amount of memory being consumed (~12 MB):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb

However, after the full file has been processed, we end up with far more (~2.1 GB consumed):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb

This means that about 2.1 GB of memory is consumed by the process. The reason is simple: the lines of the file are now all stored in memory.

It should be obvious by this point that keeping the contents of the file in memory will quickly exhaust the available memory, regardless of how much of it there is.

What’s more, we usually don’t need all of the lines in the file in memory at once. Instead, we just need to be able to iterate through each one, do some processing and throw it away. So this is exactly what we’re going to do: iterate through the lines without holding them in memory.
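The logs above report total and free heap, and the consumed figure is simply the difference between the two. As a rough sketch of how such numbers could be produced with the standard Runtime API (the MemoryLog class and its method names are hypothetical and are not taken from the original test project):

```java
// Hypothetical helper for illustration; not part of the original project.
class MemoryLog {
    private static final long MB = 1024L * 1024L;

    /** Converts a byte count to whole megabytes (rounded down). */
    static long toMb(long bytes) {
        return bytes / MB;
    }

    /** Heap in use: what the JVM has currently reserved minus what is still free. */
    static long usedMb(long totalBytes, long freeBytes) {
        return toMb(totalBytes - freeBytes);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory(); // heap currently reserved by the JVM
        long free = rt.freeMemory();   // unused part of that reserved heap
        System.out.println("Total Memory: " + toMb(total) + " Mb");
        System.out.println("Free Memory: " + toMb(free) + " Mb");
        System.out.println("Used Memory: " + usedMb(total, free) + " Mb");
    }
}
```

Applied to the last log, usedMb yields 2666 - 490 = 2176 Mb, which is where the figure of roughly 2.1 GB comes from.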

3. Streaming Through the File

Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:

// try-with-resources closes the Scanner and then the stream, even on failure
try (FileInputStream inputStream = new FileInputStream(path);
     Scanner sc = new Scanner(inputStream, "UTF-8")) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
}

This solution will iterate through all the lines in the file, allowing each line to be processed without keeping a reference to it and, as a result, without keeping the lines in memory (~150 MB consumed):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb
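To make the streaming approach easy to try out, here is a minimal self-contained sketch that counts the lines of a file with a Scanner, so that only one line is referenced at a time. The ScannerLineCounter class, the countLines method and the small temporary file are assumptions made for illustration; a real run would point at the large file instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Scanner;

// Hypothetical example class; names are illustrative only.
class ScannerLineCounter {

    /** Streams the file line by line and returns how many lines were read. */
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (InputStream in = Files.newInputStream(file);
             Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.hasNextLine()) {
                sc.nextLine(); // the line becomes garbage as soon as we move on
                lines++;
            }
            // Scanner swallows I/O errors, so surface them explicitly
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // A tiny stand-in for the large file
        Path tmp = Files.createTempFile("lines", ".txt");
        try {
            Files.write(tmp, Arrays.asList("first", "second", "third"), StandardCharsets.UTF_8);
            System.out.println(countLines(tmp)); // prints 3
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because each line is discarded before the next one is read, the memory needed should depend on the longest line rather than on the size of the file.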

4. Streaming with Apache Commons IO

The same can be achieved using the Commons IO library as well, by using the custom LineIterator provided by the library:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}

Since the file is never fully loaded into memory, this also results in fairly conservative memory consumption (~190 MB consumed):

[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb

5. Conclusion

This quick article shows how to process the lines of a large file iteratively, without exhausting the available memory, which proves quite useful when working with such large files.

The implementation of all these examples and code snippets can be found in my GitHub project. This is an Eclipse-based project, so it should be easy to import and run as it is.
