How to remove duplicate lines in a large text file?
How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.
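A better solution is to use a HashSet to store each line of input.txt. Because a set ignores duplicate values, check while reading each line whether it is already present in the set, and write it to output.txt only if it is not. Note that the set holds every distinct line in memory, so this works only when the unique lines fit in RAM, even if the whole file does not.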
Java:
// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt
import java.io.*;
import java.util.HashSet;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // PrintWriter object for output.txt
        PrintWriter pw = new PrintWriter("output.txt");

        // BufferedReader object for input.txt
        BufferedReader br = new BufferedReader(new FileReader("input.txt"));

        String line = br.readLine();

        // set to store unique values
        HashSet<String> hs = new HashSet<>();

        // loop over each line of input.txt
        while (line != null)
        {
            // add() returns true only if the line was not
            // already present, so each line is written once
            if (hs.add(line))
                pw.println(line);

            line = br.readLine();
        }

        pw.flush();

        // closing resources
        br.close();
        pw.close();

        System.out.println("File operation performed successfully");
    }
}
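The HashSet program above keeps every distinct line in memory, which may not be possible for the file described in the question. One common workaround, sketched here as an assumption rather than as part of the original answer, is to first split the input into bucket files by line hash. Identical lines always have the same hash, so they land in the same bucket, and each bucket can then be de-duplicated on its own with a HashSet. The class name PartitionedDedup, the method dedup, and the buckets parameter are hypothetical. The sketch assumes UTF-8 text, assumes each bucket's distinct lines fit in memory, and does not preserve the original line order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: hash-partition the input, then de-duplicate per bucket.
public class PartitionedDedup {

    public static void dedup(Path input, Path output, int buckets) throws IOException {
        Path tempDir = Files.createTempDirectory("dedup");
        List<Path> bucketFiles = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            // One temporary file (and one open writer) per bucket.
            for (int i = 0; i < buckets; i++) {
                Path p = tempDir.resolve("bucket-" + i + ".txt");
                bucketFiles.add(p);
                writers.add(Files.newBufferedWriter(p, StandardCharsets.UTF_8));
            }
            // Pass 1: route each line to a bucket chosen by its hash.
            // Equal lines have equal hashes, so duplicates share a bucket.
            try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    int index = Math.floorMod(line.hashCode(), buckets);
                    BufferedWriter w = writers.get(index);
                    w.write(line);
                    w.newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }

        // Pass 2: de-duplicate one bucket at a time with a fresh HashSet,
        // so only one bucket's distinct lines are in memory at once.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path p : bucketFiles) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader br = Files.newBufferedReader(p, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(p);
            }
        }
        Files.delete(tempDir);
    }
}
```

A call might look like PartitionedDedup.dedup(Paths.get("input.txt"), Paths.get("output.txt"), 64). The bucket count is a tuning choice: more buckets mean less memory per bucket but more files open at the same time during the first pass.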