How would you remove duplicate lines from a file that is  much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}

  

How to remove duplicate lines in a large text file?的更多相关文章

  1. notepad++ remove duplicate line

    To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...

  2. Compare, sort, and delete duplicate lines in Notepad ++

    Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...

  3. [LeetCode] Remove Duplicate Letters 移除重复字母

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  4. 316. Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

  5. Remove Duplicate Letters I & II

    Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...

  6. LeetCode Remove Duplicate Letters

    原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...

  7. leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)

    https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...

  8. Remove Duplicate Letters

    316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...

  9. [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters

    Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...

随机推荐

  1. Ubuntu安装libssl-dev失败(依靠aptitude管理降级软件)并记录dpkg展示安装软件列表

    Ubuntu 12.04LTS下直接安装 libssl-dev 失败 提示错误: $ sudo apt-get install libssl-dev Reading package lists... ...

  2. (转)VMware虚拟机三种网络模式的区别及配置方法;

    我的一点实际经验理解桥接和NAT 桥接是虚拟机完全作为一个独立的地址接在局域网中,NAT是虚拟机依赖宿主主机地址转换的一种方式 例子我的虚拟机如果用桥接模式,连接外部网站如百度时会提示此pc没有装公司 ...

  3. axios 用 params/data 发送参数给 springboot controller,如何才能正确获取

    今天有人遇到接口调用不通的情况,粗略看了一下是axios跨域请求引起了.找到问题,处理就简单多了. 但是我看其代码,发现比较有意思 export function agentlist(query) { ...

  4. java动态代理框架

             java动态代理是一个挺有意思的东西,他有时候可以被使用的很灵活.像rpc的调用,调用方只是定义的一个接口,动态代理让他匹配上对应的不同接口:mybatis内部的实现,编码时,只是实 ...

  5. 【转载】@Component, @Repository, @Service的区别

    @Component, @Repository, @Service的区别 官网引用 引用spring的官方文档中的一段描述: 在Spring2.0之前的版本中,@Repository注解可以标记在任何 ...

  6. springboot2.0入门(一)----springboot 简介

    一.springboot解决了什么? 避免了繁杂的xml配置,框架自动帮我们完成了相关的配置,当我们需要进行相关插件集成的时候,只需要将相关的starter通过相关的maven依赖引进,并可以进行相关 ...

  7. sqlserver 删除表 外键

    Truncate table Menu --truncate不能对有外键的表 delete Menu delete RoleMenu SELECT * FROM sys.foreign_keys WH ...

  8. 1 Mybatis

    1 使用Maven导入mybatis依赖 在pom.xml中写上一下代码:这些代码的查找可在https://mvnrepository.com/open-source网站上寻找,导入mybatis时要 ...

  9. children([expr]) 取得一个包含匹配的元素集合中每一个元素的所有子元素的元素集合。

    children([expr]) 概述 取得一个包含匹配的元素集合中每一个元素的所有子元素的元素集合. 可以通过可选的表达式来过滤所匹配的子元素.注意:parents()将查找所有祖辈元素,而chil ...

  10. 「BZOJ 1698」「USACO 2007 Feb」Lilypad Pond 荷叶池塘「最短路」

    题解 从一个点P可以跳到另一个点Q,如果Q是水这条边就是1,如果Q是荷叶这条边权值是0.可以跑最短路并计数 问题是边权为0的最短路计数没有意义(只是荷叶的跳法不同),所以我们两个能通过荷叶间接连通的点 ...