How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.
A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}
How to remove duplicate lines in a large text file?的更多相关文章
- notepad++ remove duplicate line
To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...
- Compare, sort, and delete duplicate lines in Notepad ++
Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...
- [LeetCode] Remove Duplicate Letters 移除重复字母
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- 316. Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- Remove Duplicate Letters I & II
Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...
- LeetCode Remove Duplicate Letters
原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...
- leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)
https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...
- Remove Duplicate Letters
316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...
- [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
随机推荐
- zabbix内存溢出解决方法
1406:20180802:183248.783 __mem_malloc: skipped 0 asked 48 skip_min 4294967295 skip_max 0 1406:201808 ...
- shell字符串处理
一.字符串切片: ${#var}:返回字符串变量var的长度${var:offset}:返回字符串变量var中从第offset个字符后(不包括第offset个字符)的字符开始,到最后的部分,offse ...
- map(callback)将一组元素转换成其他数组(不论是否是元素数组)
map(callback) 概述 将一组元素转换成其他数组(不论是否是元素数组) 你可以用这个函数来建立一个列表,不论是值.属性还是CSS样式,或者其他特别形式.这都可以用'$.map()'来方便的建 ...
- 第二章 Unicode简介
/*------------------------------------------------------------- screensize.cpp -- Displays screen si ...
- Codevs 1331 西行寺幽幽子(高精度)
1331 西行寺幽幽子 时间限制: 1 s 空间限制: 128000 KB 题目等级 : 黄金 Gold 题目描述 Description 在幻想乡,西行寺幽幽子是以贪吃闻名的亡灵.不过幽幽子可不是只 ...
- 对AM信号FFT的matlab仿真
普通调幅波AM的频谱,大信号包络检波频谱分析 u(t)=Ucm(1+macos t)cos ct ma称为调幅系数 它的频谱由载波,上下边频组成 , 包络检波中二极管截去负半周再用电容低通滤波,可 ...
- Flask-配置参数
Flask配置 Flask 是一个非常灵活且短小精干的web框架 , 那么灵活性从什么地方体现呢? 有一个神奇的东西叫 Flask配置 , 这个东西怎么用呢? 它能给我们带来怎么样的方便呢? 首先展示 ...
- Java线程优先级及守护线程(二)
简述 在操作系统中,线程是可以划分优先级的,优先级较高的线程,得到CPU优先执行的几率就较高一些.设置线程的优先级,有助于帮助线程规划期选择下一个哪一个线程优先执行,但是线程优先级高不代表一定会优先执 ...
- 前端工程师需要掌握的 Babel 知识
在前端圈子里,对于 Babel,大家肯定都比较熟悉了.如果哪天少了它,对于前端工程师来说肯定是个噩梦.Babel 的工作原理是怎样的可能了解的人就不太多了.本文将主要介绍 Babel 的工作原理以及怎 ...
- 8.8 JQuery框架
8.8 JQuery框架 一.JQuery是一个javascript的框架,是对javascript的一种封装. 通过JQuery可以非常方便的操作html的元素\要使用Jquery需要导入一个第三方 ...