How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. Eclipse创建Servers没有Apache选项

    help->install new software加入网址是http://download.eclipse.org/releases/Neon,最后一个是你eclipse的版本.得到一系列的插 ...

  2. Android Vitals各性能指标介绍

    Android vitals 简介 谷歌推荐使用Android vitals来帮助开发者提高App的稳定性和性能表现. 作为这个方案的一部分, Play Console提供了Android Vital ...

  3. 集合(一)-Java中Arrays.sort()自定义数组的升序和降序排序

    默认升序 package peng; import java.util.Arrays;  public class Testexample { public static void main(Stri ...

  4. hlslcc

    https://cdn2.unrealengine.com/Resources/files/UE4_OpenGL4_GDC2014-514746542.pdf ue的跨平台编译器 hlsl cross ...

  5. 单例模式(Singleton)的同步锁synchronized

    单例模式,有“懒汉式”和“饿汉式”两种. 懒汉式 单例类的实例在第一次被引用时候才被初始化. public class Singleton { private static Singleton ins ...

  6. 在cubemx中使用freertos中的注意事项

    就是使用信号量等rtos自带特性的时候,务必先初始化然后在发生信号量或接收. 而且在中断中发送信号量或队列的时候,务必把使能中断的语句放在初始化freertos之后,尤其是cubemx生成的代码,默认 ...

  7. cookie和Session是啥?

    HTTP是无状态(stateless)协议 http协议是无状态协议即不保存状态. 无状态协议的优点: 由于不需要保存记录,所以减少服务器的CPU和内存的资源的消耗.毕竟客户端一多起来保存记录的话对于 ...

  8. hybird(h5)页面自动化测试

  9. [Luogu] 宝藏

    https://www.luogu.org/problemnew/show/P3959 模拟退火解法 发现prim求最小生成树是明显错误的 因为prim每次要取出边权最小的点 然而在这道T中这样做不一 ...

  10. Codevs 2188 最长上升子序列(变式)

    2188 最长上升子序列 时间限制: 1 s 空间限制: 32000 KB 题目等级 : 钻石 Diamond 题目描述 Description LIS问题是最经典的动态规划基础问题之一.如果要求一个 ...