How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. nginx反向代理和负载均衡的简单部署

    1.   安装 1)         从Nginx官网下载页面(http://nginx.org/en/download.html)下载Nginx最新版本(目前是1.5.13版本)安装包: 2)    ...

  2. solr 中文分词相关(转载)

    smartcn和ik的对比,来自http://www.cnblogs.com/hadoopdev/p/3465556.html 一.引言: 年的时候,就曾经有项目涉及到相关的应用(Lunce构建全文搜 ...

  3. k8s部署dashboard

    1.首先去github上找到kubernetes 2.然后找到get started 3.复制yaml文件地址,并wget到服务器上并部署即可 PS:本文把自己部署的yaml文件贴出来:recomme ...

  4. C语言之volatile

    emOsprey  鱼鹰谈单片机 2月21日 预计阅读时间: 4 分钟 和 const 不同(关于 const 可以看 const 小节),当一个变量声明为 volatile,说明这个变量会被意想不到 ...

  5. 牛客练习赛3 贝伦卡斯泰露——队列&&爆搜

    题目 链接 题意:给出一个长度为 $n$ 的数列 $A_i$,问是否能将这个数列分解为两个长度为n/2的子序列,满足: 两个子序列不互相重叠(是值不能有共同元素,但位置可以交错). 两个子序列中的数要 ...

  6. webstorm 2016.3 注册方法

    用license server 的方式吧,activation code 的方式没有找到方法. license server 里写http://idea.iteblog.com/key.php 用li ...

  7. php上传大文件的解决方案

    1.使用PHP的创始人 Rasmus Lerdorf 写的APC扩展模块来实现(http://pecl.php.net/package/apc) APC实现方法: 安装APC,参照官方文档安装,可以使 ...

  8. Non-standard serial port baud rate setting

    ////combuad_recv.cpp #include <stdio.h> /*标准输入输出定义*/ #include <stdlib.h> /*标准函数库定义*/ #in ...

  9. more/less

    more less

  10. vue-quill-editor的用法

    1. main.js引入vue-quill-editor import VueQuillEditor from 'vue-quill-editor' import 'quill/dist/quill. ...