Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Original problem: https://oj.leetcode.com/problems/repeated-dna-sequences/

Straightforward method (TLE)

Algorithm analysis

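Match the strings directly. Build a next array that stores, for each letter in the string, the position where that letter next occurs; during traversal, start each comparison from the position given by the next array.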

To keep the examples short, consider substrings of length 4.

case1:

src A C G T A C G T

next [4] [5] [6] [7] [-1] [-1] [-1] [-1]

To match the string ACGT, it is enough to compare the 3 characters that follow next[0].

case2:

src A C G T A A C G T

next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]

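The letter A has several later occurrences, so every one of them must be tried: when the match at next[0] fails, next[next[0]] has to be tried as well.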

case3:

src A A A A A A

next [1] [2] [3] [4] [5] [-1]

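With repeats like this, once the match at next[0] succeeds, next[next[0]] can be set to -1: the length-4 substring starting at next[0] has already been matched and does not need to be matched again. This only reduces repeated matches and does not eliminate them, so a set is still needed to store the successful matches and remove duplicates.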
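The next array from the three cases above can be computed with a few lines. This is a minimal sketch, not the author's code: `buildNext` is a hypothetical helper name, and it uses -1 for "no later occurrence" to match the notation of the examples (the implementation in this post uses `std::string::npos` instead).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or -1 when the letter does not occur again.
std::vector<int> buildNext(const std::string& s) {
    std::vector<int> next(s.size(), -1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::size_t p = s.find(s[i], i + 1);  // search strictly after i
        if (p != std::string::npos) {
            next[i] = static_cast<int>(p);
        }
    }
    return next;
}
```

For example, `buildNext("ACGTACGT")` gives 4 5 6 7 -1 -1 -1 -1, and `buildNext("AAAAAA")` gives 1 2 3 4 5 -1.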

Time complexity

Building the next array is O(n^2) and the traversal is O(n^2), so the total time complexity is O(n^2).

Implementation

#include <cstddef>
#include <string>
#include <vector>
#include <set>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;

    // a string of 10 or fewer characters cannot contain a repeat
    if (s.length() <= 10) {
        return rel;
    }

    // next[pos]: position of the next occurrence of s[pos], or npos if none
    // (a local vector replaces the raw array, so no destructor is needed)
    std::vector<std::size_t> next(s.length());
    for (std::size_t pos = 0; pos < s.length(); ++pos) {
        next[pos] = s.find_first_of(s[pos], pos + 1);
    }

    // the set removes duplicate results
    std::set<std::string> tmpRel;

    for (std::size_t pos = 0; pos < s.length(); ++pos) {
        std::size_t nextPos = next[pos];
        while (nextPos != std::string::npos) {
            std::size_t ic = pos;
            std::size_t in = nextPos;
            int count = 0;
            // the first characters already match; compare the 9 that follow
            while (in + 1 < s.length() && count < 9 && s[++ic] == s[++in]) {
                ++count;
            }
            if (count == 9) {
                tmpRel.insert(s.substr(pos, 10));
                // the substring starting at nextPos has already been matched
                next[nextPos] = std::string::npos;
            }
            nextPos = next[nextPos];
        }
    }

    rel.assign(tmpRel.begin(), tmpRel.end());
    return rel;
}

Hash table plus bit manipulation method

(See the problem's Show Tags; runtime: 10 ms!)

Algorithm analysis

First, encode A, C, G, and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

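Under this encoding, every 10-character string corresponds to a number, and a 10-character string takes 20 bits. An int is usually 4 bytes (32 bits), so one int can represent one 10-character string. For example: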

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
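The two examples above can be checked with a small encoder. This is a sketch under stated assumptions: `code` and `encode` are hypothetical helper names, and any character other than C, G, or T is treated as A only to keep the sketch short.

```cpp
#include <string>

// Hypothetical helper: 2-bit code for one nucleotide (A=00, C=01, G=10, T=11).
// Any other character is treated as 'A' here purely to keep the sketch short.
unsigned code(char c) {
    switch (c) {
        case 'C': return 1u;
        case 'G': return 2u;
        case 'T': return 3u;
        default:  return 0u;
    }
}

// Packs the characters of s from left to right, two bits per character.
// For a 10-character string the result fits in 20 bits.
unsigned encode(const std::string& s) {
    unsigned value = 0;
    for (char c : s) {
        value = (value << 2) | code(c);
    }
    return value;
}
```

With these helpers, `encode("ACGTACGTAC")` is 0x1B1B1, which is 00011011000110110001 in binary, and `encode("AAAAAAAAAA")` is 0.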

A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024. Declare it as bool hashMap[1024 * 1024];
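The table only has to answer one question per 20-bit code: has it been seen before? A minimal sketch of that lookup follows. `SeenTable` and `testAndSet` are hypothetical names, and `std::vector<bool>` is swapped in for the global bool array purely for illustration; implementations typically pack it to one bit per entry, but that is not guaranteed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper around a 2^20-entry flag table, one flag per 20-bit code.
class SeenTable {
public:
    SeenTable() : flags_(std::size_t(1) << 20, false) {}

    // Marks `code` as seen and reports whether it had been seen before.
    // `code` must be smaller than 2^20.
    bool testAndSet(std::size_t code) {
        bool before = flags_[code];
        flags_[code] = true;
        return before;
    }

private:
    std::vector<bool> flags_;
};
```

The first `testAndSet` call for a given code returns false, and every later call for the same code returns true, which is exactly the signal that a 10-letter sequence has repeated.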

Traversing the string

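Moving one character to the right corresponds to shifting the substring's int value left by 2 bits, setting its lowest 2 bits to the code of the new character, and finally clearing the 2 bits that were shifted out of the 20-bit window (bits 20 and 21). For example: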

src CAAAAAAAAAC

subStr CAAAAAAAAA

int 01000000000000000000

subStr AAAAAAAAAC

int 00000000000000000001
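The window update can be written as a single expression. This is a sketch, not the author's code: `slide` is a hypothetical helper, and it keeps the low 20 bits with a mask of 0xFFFFF, which gives the same result as clearing bits 20 and 21 with `~(0x300000)` whenever the previous value already fits in 20 bits.

```cpp
#include <cstdint>

// Hypothetical helper: slide a 20-bit window one character to the right.
// `window` holds the codes of the current 10 characters; `incoming` is the
// 2-bit code (0..3) of the character entering on the right.
std::uint32_t slide(std::uint32_t window, std::uint32_t incoming) {
    const std::uint32_t mask = (1u << 20) - 1;  // keeps the low 20 bits
    return ((window << 2) | (incoming & 3u)) & mask;
}
```

CAAAAAAAAA encodes to 0x40000 (01 followed by 18 zero bits). Calling `slide(0x40000, 1)` pushes the leading C out of the window, appends a C on the right, and returns 1, the code of AAAAAAAAAC.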

Time complexity

Traversing the string is O(n) and each hash table operation is O(1), so the total time complexity is O(n).

Implementation

#include <cstddef>
#include <cstring>
#include <string>
#include <vector>
#include <unordered_set>

// one flag per possible 20-bit code
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // map char to code
    unsigned char convert[26] = {0};
    convert[0] = 0;   // 'A' - 'A' 00
    convert[2] = 1;   // 'C' - 'A' 01
    convert[6] = 2;   // 'G' - 'A' 10
    convert[19] = 3;  // 'T' - 'A' 11

    // initial process: encode the first ten characters
    std::memset(hashMap, false, sizeof(hashMap));

    int hashValue = 0;
    for (int pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // codes that have already been added to the result
    std::unordered_set<int> strHashValue;

    // slide the 10-character window one position at a time
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= ~(0x300000);  // clear bits 20 and 21

        if (hashMap[hashValue]) {
            if (strHashValue.find(hashValue) == strHashValue.end()) {
                rel.push_back(s.substr(pos - 9, 10));
                strHashValue.insert(hashValue);
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}
