How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. zabbix内存溢出解决方法

    1406:20180802:183248.783 __mem_malloc: skipped 0 asked 48 skip_min 4294967295 skip_max 0 1406:201808 ...

  2. shell字符串处理

    一.字符串切片: ${#var}:返回字符串变量var的长度${var:offset}:返回字符串变量var中从第offset个字符后(不包括第offset个字符)的字符开始,到最后的部分,offse ...

  3. map(callback)将一组元素转换成其他数组(不论是否是元素数组)

    map(callback) 概述 将一组元素转换成其他数组(不论是否是元素数组) 你可以用这个函数来建立一个列表,不论是值.属性还是CSS样式,或者其他特别形式.这都可以用'$.map()'来方便的建 ...

  4. 第二章 Unicode简介

    /*------------------------------------------------------------- screensize.cpp -- Displays screen si ...

  5. Codevs 1331 西行寺幽幽子(高精度)

    1331 西行寺幽幽子 时间限制: 1 s 空间限制: 128000 KB 题目等级 : 黄金 Gold 题目描述 Description 在幻想乡,西行寺幽幽子是以贪吃闻名的亡灵.不过幽幽子可不是只 ...

  6. 对AM信号FFT的matlab仿真

    普通调幅波AM的频谱,大信号包络检波频谱分析 u(t)=Ucm(1+macos t)cos ct ma称为调幅系数 它的频谱由载波,上下边频组成 , 包络检波中二极管截去负半周再用电容低通滤波,可 ...

  7. Flask-配置参数

    Flask配置 Flask 是一个非常灵活且短小精干的web框架 , 那么灵活性从什么地方体现呢? 有一个神奇的东西叫 Flask配置 , 这个东西怎么用呢? 它能给我们带来怎么样的方便呢? 首先展示 ...

  8. Java线程优先级及守护线程(二)

    简述 在操作系统中,线程是可以划分优先级的,优先级较高的线程,得到CPU优先执行的几率就较高一些.设置线程的优先级,有助于帮助线程规划期选择下一个哪一个线程优先执行,但是线程优先级高不代表一定会优先执 ...

  9. 前端工程师需要掌握的 Babel 知识

    在前端圈子里,对于 Babel,大家肯定都比较熟悉了.如果哪天少了它,对于前端工程师来说肯定是个噩梦.Babel 的工作原理是怎样的可能了解的人就不太多了.本文将主要介绍 Babel 的工作原理以及怎 ...

  10. 8.8 JQuery框架

    8.8 JQuery框架 一.JQuery是一个javascript的框架,是对javascript的一种封装. 通过JQuery可以非常方便的操作html的元素\要使用Jquery需要导入一个第三方 ...