Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Note:

  1. No order is required for the final output.
  2. You may assume the directory name, file name and file content contain only letters and digits, and the length of file content is in the range of [1,50].
  3. The number of files given is in the range of [1,20000].
  4. You may assume no files or directories share the same name in the same directory.
  5. You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Follow-up beyond contest:

  1. Imagine you are given a real file system, how will you search files? DFS or BFS?
  2. If the file content is very large (GB level), how will you modify your solution?
  3. If you can only read the file by 1kb each time, how will you modify your solution?
  4. What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
  5. How to make sure the duplicated files you find are not false positive?

Approach:

First, parse each input string into full file paths, then use a map to group the file paths that share the same content.

#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// Split "name.txt(content)" into the file name and the content.
void parse(const string& orign, string& fileName, string& content)
{
    size_t index = orign.find_first_of('(');
    fileName = orign.substr(0, index);
    // Skip the '(' and drop the trailing ')'.
    content = orign.substr(index + 1, orign.length() - index - 2);
}

// Expand one directory info string into full paths and their contents.
void getFullPath(const string& p, vector<string>& path, vector<string>& conVec)
{
    stringstream ss(p);
    string pathPrefix;
    ss >> pathPrefix;  // the first token is the directory path
    string file;
    while (ss >> file)
    {
        string fileName, content;
        parse(file, fileName, content);
        path.push_back(pathPrefix + "/" + fileName);
        conVec.push_back(content);
    }
}

vector<vector<string>> findDuplicate(vector<string>& paths)
{
    vector<string> pathVec, conVec;
    for (const auto& p : paths)
    {
        getFullPath(p, pathVec, conVec);
    }
    // content -> set of full paths holding that content
    map<string, set<string>> mp2;
    for (size_t i = 0; i < pathVec.size(); i++)
    {
        mp2[conVec[i]].insert(pathVec[i]);
    }
    vector<vector<string>> ret;
    for (const auto& it : mp2)
    {
        if (it.second.size() == 1) continue;  // a single file is not a duplicate group
        ret.push_back(vector<string>(it.second.begin(), it.second.end()));
    }
    return ret;
}

Below is a solution from someone who used the same idea, but the expert's version is far more concise.

vector<vector<string>> findDuplicate(vector<string>& paths) {
    unordered_map<string, vector<string>> files;  // content -> full paths
    vector<vector<string>> result;
    for (const auto& path : paths) {
        stringstream ss(path);
        string root;
        string s;
        getline(ss, root, ' ');  // the first token is the directory path
        while (getline(ss, s, ' ')) {
            string fileName = root + '/' + s.substr(0, s.find('('));
            // The content is the text between '(' and ')'.
            string fileContent = s.substr(s.find('(') + 1, s.find(')') - s.find('(') - 1);
            files[fileContent].push_back(fileName);
        }
    }
    for (const auto& file : files) {
        if (file.second.size() > 1)  // keep only groups of at least two files
            result.push_back(file.second);
    }
    return result;
}
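Since the problem does not require any particular output order, checking a solution against Example 1 needs the result to be put into a canonical order first. The following is a minimal sketch of such a check: `normalize` is a hypothetical helper introduced here for illustration, and the `findDuplicate` shown is a compact restatement of the hash-map approach so that the block compiles on its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using namespace std;

// Compact restatement of the hash-map approach: group full paths by content.
vector<vector<string>> findDuplicate(const vector<string>& paths) {
    unordered_map<string, vector<string>> files;  // content -> full paths
    for (const string& path : paths) {
        stringstream ss(path);
        string root, s;
        ss >> root;  // directory path
        while (ss >> s) {
            const size_t open = s.find('(');
            // Content sits between '(' and the final ')'.
            const string content = s.substr(open + 1, s.size() - open - 2);
            files[content].push_back(root + '/' + s.substr(0, open));
        }
    }
    vector<vector<string>> result;
    for (const auto& entry : files)
        if (entry.second.size() > 1) result.push_back(entry.second);
    return result;
}

// Hypothetical helper: sort the paths inside each group, then sort the
// groups, so two results can be compared regardless of output order.
vector<vector<string>> normalize(vector<vector<string>> groups) {
    for (auto& g : groups) sort(g.begin(), g.end());
    sort(groups.begin(), groups.end());
    return groups;
}
```

Calling `normalize(findDuplicate(input))` on the input of Example 1 and comparing it with the normalized expected output gives an order-independent equality check.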

Reference:

https://discuss.leetcode.com/topic/91430/c-clean-solution-answers-to-follow-up
