How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.
A better solution is to use HashSet to store each line of input.txt. As set ignores duplicate values, so while storing a line, check if it already present in hashset. Write it to output.txt only if not present in hashset.
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt import java.io.*;
import java.util.HashSet; public class FileOperation
{
public static void main(String[] args) throws IOException
{
// PrintWriter object for output.txt
PrintWriter pw = new PrintWriter("output.txt"); // BufferedReader object for input.txt
BufferedReader br = new BufferedReader(new FileReader("input.txt")); String line = br.readLine(); // set store unique values
HashSet<String> hs = new HashSet<String>(); // loop for each line of input.txt
while(line != null)
{
// write only if not
// present in hashset
if(hs.add(line))
pw.println(line); line = br.readLine(); } pw.flush(); // closing resources
br.close();
pw.close(); System.out.println("File operation performed successfully");
}
}
How to remove duplicate lines in a large text file?的更多相关文章
- notepad++ remove duplicate line
To remove duplicate lines just press Ctrl + F, select the “Replace” tab and in the “Find” field, pla ...
- Compare, sort, and delete duplicate lines in Notepad ++
Compare, sort, and delete duplicate lines in Notepad ++ Organize Lines: Since version 6.5.2 the app ...
- [LeetCode] Remove Duplicate Letters 移除重复字母
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- 316. Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
- Remove Duplicate Letters I & II
Remove Duplicate Letters I Given a string which contains only lowercase letters, remove duplicate le ...
- LeetCode Remove Duplicate Letters
原题链接在这里:https://leetcode.com/problems/remove-duplicate-letters/ 题目: Given a string which contains on ...
- leetcode@ [316] Remove Duplicate Letters (Stack & Greedy)
https://leetcode.com/problems/remove-duplicate-letters/ Given a string which contains only lowercase ...
- Remove Duplicate Letters
316. Remove Duplicate Letters Total Accepted: 2367 Total Submissions: 12388 Difficulty: Medium Given ...
- [Swift]LeetCode316. 去除重复字母 | Remove Duplicate Letters
Given a string which contains only lowercase letters, remove duplicate letters so that every letter ...
随机推荐
- Eclipse创建Servers没有Apache选项
help->install new software加入网址是http://download.eclipse.org/releases/Neon,最后一个是你eclipse的版本.得到一系列的插 ...
- Android Vitals各性能指标介绍
Android vitals 简介 谷歌推荐使用Android vitals来帮助开发者提高App的稳定性和性能表现. 作为这个方案的一部分, Play Console提供了Android Vital ...
- 集合(一)-Java中Arrays.sort()自定义数组的升序和降序排序
默认升序 package peng; import java.util.Arrays; public class Testexample { public static void main(Stri ...
- hlslcc
https://cdn2.unrealengine.com/Resources/files/UE4_OpenGL4_GDC2014-514746542.pdf ue的跨平台编译器 hlsl cross ...
- 单例模式(Singleton)的同步锁synchronized
单例模式,有“懒汉式”和“饿汉式”两种. 懒汉式 单例类的实例在第一次被引用时候才被初始化. public class Singleton { private static Singleton ins ...
- 在cubemx中使用freertos中的注意事项
就是使用信号量等rtos自带特性的时候,务必先初始化然后在发生信号量或接收. 而且在中断中发送信号量或队列的时候,务必把使能中断的语句放在初始化freertos之后,尤其是cubemx生成的代码,默认 ...
- cookie和Session是啥?
HTTP是无状态(stateless)协议 http协议是无状态协议即不保存状态. 无状态协议的优点: 由于不需要保存记录,所以减少服务器的CPU和内存的资源的消耗.毕竟客户端一多起来保存记录的话对于 ...
- hybird(h5)页面自动化测试
- [Luogu] 宝藏
https://www.luogu.org/problemnew/show/P3959 模拟退火解法 发现prim求最小生成树是明显错误的 因为prim每次要取出边权最小的点 然而在这道T中这样做不一 ...
- Codevs 2188 最长上升子序列(变式)
2188 最长上升子序列 时间限制: 1 s 空间限制: 32000 KB 题目等级 : 钻石 Diamond 题目描述 Description LIS问题是最经典的动态规划基础问题之一.如果要求一个 ...