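[LeetCode] Using a whole string as a hash key, plus the window idea with O(n) complexity. With 4 example problems from LeetCode (the title has a length limit, so the problem names do not all fit T.T)

Introduction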

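String problems often involve a large number of pairwise string comparisons, for example when counting how many times a given string occurs. Comparing character by character every time is clearly expensive, so in some cases we can use the string itself as the key of a hashmap, which often saves a lot of time.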
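As a minimal sketch of the string-as-key idea, the snippet below counts how often each string occurs with a hash map. The helper name countOccurrences is made up for illustration, and std::unordered_map is just one possible container choice.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Count how many times each distinct string occurs.
// countOccurrences is a hypothetical helper name, not part of any problem API.
std::unordered_map<std::string, int> countOccurrences(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> freq;
    for (const std::string& w : words) {
        ++freq[w];  // operator[] creates a missing entry with value 0 first
    }
    return freq;
}
```

Each lookup hashes the whole string once instead of comparing it against every other string.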

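The other technique in this post is the window idea, which is not limited to string problems. In short, we maintain two ints, start and end, as the left and right edges of a window. As end and start keep moving forward, we record the maximum or minimum length the window reaches. Each of the two indices only ever moves forward, at most n steps, so the time complexity is O(n). This technique is common when looking for a longest or shortest substring or subarray.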
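To make the start/end bookkeeping concrete, here is a small sketch on an array rather than a string: it finds the shortest contiguous run whose sum reaches a target. The function name shortestRunWithSum and the assumption that all elements are positive are mine, chosen only to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Length of the shortest contiguous run with sum >= target, or 0 if none exists.
// Assumes every element of a is positive (otherwise shrinking is not safe).
std::size_t shortestRunWithSum(const std::vector<int>& a, long long target) {
    std::size_t best = 0;   // 0 means "no run found yet"
    std::size_t start = 0;  // left edge of the window
    long long sum = 0;      // sum of a[start..end]
    for (std::size_t end = 0; end < a.size(); ++end) {
        sum += a[end];  // move end right to grow the window
        while (sum >= target && start <= end) {
            std::size_t len = end - start + 1;
            if (best == 0 || len < best) best = len;
            sum -= a[start];  // move start right to shrink the window
            ++start;
        }
    }
    return best;
}
```

Both start and end only move to the right, which is where the O(n) bound comes from.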

Example 1, Anagrams (string-as-key idea)

Problem:

Given an array of strings, return all groups of strings that are anagrams.

Note: All inputs will be in lower-case.

class Solution {
public:
    vector<string> anagrams(vector<string> &strs) {
    }
};

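The problem statement is not very clear. At first my code kept getting Output Limit Exceeded; only after searching related posts did I understand the output format the problem actually requires.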

For example:

Input:　　["tea","and","ate","eat","den"]

Output: ["tea","ate","eat"]

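My first idea was to compare every string with all the others, record the ones that are anagrams of each other in a vector<string> res, and return res at the end.

Overall this needs O(n²) comparisons, where n is the number of elements in the input vector. To check a single pair I used an array dictionary[26] that records how many times each character occurs: the first string increments the counts and the second decrements them, and if every element of dictionary ends up back at 0, the two strings are anagrams. Each check takes O(m), where m is the string length.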
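The pairwise check described above can be sketched on its own as follows. The function name isAnagram is hypothetical, and the sketch assumes both inputs contain only lowercase letters, as the problem note states.

```cpp
#include <string>

// True if a and b contain exactly the same letters with the same counts.
// Assumes lowercase 'a'..'z' only; isAnagram is an illustrative name.
bool isAnagram(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    int count[26] = {0};
    for (char c : a) ++count[c - 'a'];  // first string raises the counts
    for (char c : b) {
        if (--count[c - 'a'] < 0) return false;  // b has a letter a lacks
    }
    // Equal lengths and no negative count imply every count is back at 0.
    return true;
}
```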

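When I wrote this code I still misunderstood the output: I thought that for each anagram group it was enough to put the first member of the group into the returned vector<string>. So in the code below, if a later element of res turns out to be an anagram of an earlier one, the later element is removed from res.

The first implementation: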

class Solution {
public:
    vector<string> anagrams(vector<string> &strs) {
        vector<string> res;
        if(strs.size() == 0) return res;
        dic = new int[26];
        for(vector<string>::iterator it = strs.begin(); it < strs.end(); ++it){
            res.push_back(*it);
        }
        for(int i = 0; i < res.size(); ++i){
            for(int j = i+1; j < res.size(); ++j){
                initDic(dic, 26, res[i]);
                int k = 0;
                for(; k < res[j].length(); ++k){ // check whether res[i] and res[j] are anagrams
                    dic[res[j][k] - 'a']--;
                    if(dic[res[j][k] - 'a'] < 0) break;
                }
                if(k == res[j].length() && judgeDic(dic, 26)){
                    res.erase(res.begin() + j); // remove the element that is an anagram of an earlier one in res
                    --j;
                }
            }
        }
        delete[] dic; // release the counting array
        return res;
    }
private:
    int* dic;
    // reset dic to all zeros, then count the characters of str
    void initDic(int* dic, int n, string str){
        for(int i = 0; i < n; ++i){
            dic[i] = 0;
        }
        for(int j = 0; j < str.length(); ++j){
            dic[str[j] - 'a']++;
        }
    }
    // true if every count in dic is back at 0
    bool judgeDic(int* dic, int n){
        int i = 0;
        for(; i < n; ++i){
            if(dic[i] != 0) break;
        }
        return (i == n);
    }
};

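This approach exceeds the time limit.

The cause is the overall O(n²), which leaves room for optimization. Annie Kim's Blog describes a space-for-time approach: define a map<string, int>, iterate over strs, and for each string s sort the characters of s, then use the sorted s as the key.

Sorting itself costs O(m log m) (m is the string length), but the outer level only needs one pass over the n input strings, provided each map access is O(1). Strictly speaking that holds on average for a hash map such as unordered_map; std::map is a balanced tree, so each access costs O(log n) key comparisons.

Overall the time complexity therefore drops, most visibly when the test cases have a large n.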
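A minimal sketch of the sorted-string-as-key approach is shown below. It is not the code from the referenced blog: the function name anagramsBySortedKey is made up, it uses std::unordered_map rather than map<string, int>, and it stores index lists per key instead of a single int.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Returns every string that has at least one anagram partner in strs.
// anagramsBySortedKey is an illustrative name, not the blog's code.
std::vector<std::string> anagramsBySortedKey(const std::vector<std::string>& strs) {
    std::unordered_map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < strs.size(); ++i) {
        std::string key = strs[i];
        std::sort(key.begin(), key.end());  // anagrams share the same sorted form
        groups[key].push_back(i);
    }
    std::vector<std::string> res;
    for (const auto& entry : groups) {
        if (entry.second.size() > 1) {  // keep only groups with 2+ members
            for (std::size_t idx : entry.second) res.push_back(strs[idx]);
        }
    }
    return res;
}
```

Within a group the strings keep their input order, but the order of the groups themselves depends on the hash map and is unspecified.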

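This idea has drawbacks, though. The whole sorted string is the key, so if the problem were made harder, say with punctuation and spaces that should be ignored, sorting alone could no longer tell whether two strings are anagrams. Also, a very long string is not convenient as a key.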

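I combined this with my own idea. In the modified version the key is not the sorted string; it reuses the dic[26] from my first code: scan the string from start to end, add 1 to the matching position of dic, and then use the sequence of dic values itself as the key. This way, (1) anagrams can still be detected when spaces and punctuation are present, and uppercase letters or digits only require enlarging dic; the key length is also fixed, always 26 here. (2) The O(m log m) sort is no longer needed; building the key costs O(m + 26) = O(m).

The implementation: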
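One caveat with packing each count into a single char is that very large counts no longer fit. As a hedged variation (not the code used in this post), the key can instead be written as decimal counts joined by a separator, at the price of a key that is no longer exactly 26 characters long. The name countKey and the '#' separator are arbitrary choices.

```cpp
#include <string>

// Builds a key such as "1#0#0#...#" from the per-letter counts of str.
// Characters outside 'a'..'z' are ignored. countKey is an illustrative name.
std::string countKey(const std::string& str) {
    int count[26] = {0};
    for (char c : str) {
        if (c >= 'a' && c <= 'z') ++count[c - 'a'];
    }
    std::string key;
    for (int k = 0; k < 26; ++k) {
        key += std::to_string(count[k]);  // decimal text, so any count fits
        key += '#';                       // separator keeps "1,12" and "11,2" apart
    }
    return key;
}
```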

class Solution {
public:
    vector<string> anagrams(vector<string> &strs) {
        vector<string> res;
        if(strs.size() == 0) return res;
        map<string, int> rec; // key -> index of the first string with that key, or -1 once it has been output
        dic = new int[26];
        for(int i = 0; i < strs.size(); ++i){
            string key = generateKeyByDic(dic, 26, strs[i]);
            if(rec.find(key) == rec.end()){
                rec.insert(make_pair(key, i));
            }else{
                if(rec[key] >= 0){
                    res.push_back(strs[rec[key]]); // output the first member of the group once
                    rec[key] = -1;
                }
                res.push_back(strs[i]);
            }
        }
        delete[] dic; // release the counting array
        return res;
    }
private:
    int* dic;
    string generateKeyByDic(int* dic, int n, string str){
        for(int i = 0; i < n; ++i){
            dic[i] = 0;
        }
        for(int j = 0; j < str.length(); ++j){
            if(str[j] <= 'z' && str[j] >= 'a')
                dic[str[j] - 'a']++;
        }
        string key(26, '0');
        for(int k = 0; k < 26; ++k){
            key[k] = dic[k] + '0'; // one char per letter count; very large counts do not fit in a char
        }
        return key;
    }
};

100 / 100 test cases passed. Runtime: 224 ms

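With the sorted string as the key, the numbers are: 100 / 100 test cases passed. Runtime: 228 ms

The time barely improved, probably because the strings in the test cases are not long, so O(m log m) and O(m + 26) differ little.

Both the referenced idea and mine rely on a map<string, int>. When you need to find groups of strings that contain the same characters among many strings, this space-for-time approach is worth considering.

Example 2, Longest Substring Without Repeating Characters (window idea)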

Given a string, find the length of the longest substring without repeating characters. For example, the longest substring without repeating letters for "abcabcbb" is "abc", which the length is 3. For "bbbbb" the longest substring is "b", with the length of 1.

class Solution {
public:
    int lengthOfLongestSubstring(string s) {
    }
};

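This problem calls for the window idea: define start and end as the two edges of the window, with start = end = 0 initially, and define a map used to detect whether the window contains a repeated character.

This solves the problem in O(n) time and O(n) space. If the characters are limited to the ASCII table, the space complexity is constant.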

class Solution {
public:
    int lengthOfLongestSubstring(string s) {
        int len = s.length();
        if(len == 0) return 0;
        int start = 0, end = 0, max = 0;
        int map[256] = {0}; // hand-rolled map: occurrences of each char inside the window
        while(end < len){
            unsigned char c = s[end]; // unsigned, so the index is never negative
            if(map[c] == 0){
                map[c]++; // move end right to grow the window
                if((end - start + 1) > max) max = (end - start + 1);
                ++end;
            }else{
                // move start right to shrink the window until s[end] is no longer inside it
                for(; map[c] > 0; map[(unsigned char)s[start]]--, ++start);
            }
        }
        return max;
    }
};

Example 3, Minimum Window Substring (window idea)

Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity O(n).

For example,
S = "ADOBECODEBANC"
T = "ABC"

Minimum window is "BANC".

Note:
If there is no such window in S that covers all characters in T, return the empty string "".

If there are multiple such windows, you are guaranteed that there will always be only one unique minimum window in S.

class Solution {
public:
    string minWindow(string S, string T) {
    }
};

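The window again. We can see some traits of the situations where the window idea applies: usually a substring or subarray as a whole must satisfy some condition, such as having the largest sum or containing certain characters, and the answer is always related to the length of that substring or subarray.

First keep moving end to the right until the current window contains all characters of T. Then move start to the right until moving it any further would make the window lose a character of T. At that point record the window size (end - start + 1, counting both ends) and compare it with min. At the end, return the window that achieved min.

Note: the phrase "S that covers all characters in T" is not precise enough. After submitting, I found that if a char appears twice in T, the window in S must also contain that char twice. I had this question while reading the problem and first wrote the code ignoring the counts; the judge showed that the counts do matter.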
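The counting rule can be illustrated with a small standalone check. This is only a sketch of what "covers" means with multiplicities; the name coversWithCounts is hypothetical and the function is not part of the solution below.

```cpp
#include <cstddef>
#include <string>

// True if window contains every character of T at least as many times as T does.
// coversWithCounts is an illustrative helper, not used by the actual solution.
bool coversWithCounts(const std::string& window, const std::string& T) {
    int need[256] = {0};
    for (unsigned char c : T) ++need[c];
    std::size_t missing = T.size();  // characters of T not yet matched
    for (unsigned char c : window) {
        if (need[c] > 0) --missing;  // this occurrence satisfies a pending need
        --need[c];                   // extras push the count below zero
    }
    return missing == 0;
}
```

For instance, "ABC" does not cover "AABC" under this rule, because the second 'A' has no match.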

This problem also uses a map, but here the map only uses single characters as keys, not the whole-string-as-key technique discussed in this post.

Code:

class Solution {
public:
    string minWindow(string S, string T) {
        if(T.length() == 0 || S.length() == 0) return "";
        int chT[256] = {0}; // how many of each char T needs
        int chS[256] = {0}; // how many of each char the current window holds
        int i, cntT = T.length(), cntS = 0;
        for(i = 0; i < T.length(); ++i) ++chT[(unsigned char)T[i]];
        for(i = 0; i < S.length(); ++i){
            unsigned char c = S[i];
            if(chT[c] > 0 && chS[c] < chT[c]) ++cntS;
            ++chS[c];
            if(cntS == cntT) break;
        }
        if(i == S.length()) return ""; // no window of S covers T
        // S[0..i] is now the first window that contains every character of T.
        // Move st right until the window lacks some character of T (call it toFind),
        // then move end right until toFind is found again.
        int end = i, st = 0; char toFind = 0;
        int minlen = end - st + 1, minst = st;
        while(end < S.length()){
            for(++st; st <= end; ++st){
                unsigned char c = S[st-1]; // the char that just left the window
                --chS[c];
                if(chT[c] > 0 && chS[c] < chT[c]){ toFind = S[st-1]; break; }
                else if((end - st + 1) < minlen){
                    minlen = (end - st + 1);
                    minst = st;
                }
            }
            for(++end; end < S.length(); ++end){
                ++chS[(unsigned char)S[end]];
                if(toFind == S[end]) break;
            }
        }
        return S.substr(minst, minlen);
    }
};

Example 4, Substring with Concatenation of All Words (window idea + string-as-key idea)

You are given a string, S, and a list of words, L, that are all of the same length. Find all starting indices of substring(s) in S that is a concatenation of each word in L exactly once and without any intervening characters.

For example, given:
S: "barfoothefoobarman"
L: ["foo", "bar"]

You should return the indices: [0,9].
(order does not matter).

class Solution {
public:
    vector<int> findSubstring(string S, vector<string> &L) {
    }
};

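Comparing every character of S one by one against the characters of the strings in L would be very complicated.

Fortunately the problem gives a condition: all strings in L have the same length.

That lets us reason as follows. Let unit be the length of the strings in L. We first store the strings of L as keys in a hashmap, then take substrings of length unit from S one at a time and use the hashmap to decide whether all strings of L are covered.

Based on this idea I wrote the following code: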

class Solution {
public:
    vector<int> findSubstring(string S, vector<string> &L) {
        vector<int> v;
        if(S.length() == 0 || L.size() == 0) return v;
        int unit = L[0].length();
        int len = unit * L.size(); // total length of one concatenation of all words
        if(S.length() < len) return v;
        for(int j = 0; j <= (S.length() - len); ++j){    // skipping the tail, try the substring of length len starting at each character of S and check whether it covers exactly the strings of L
            if(judge(S, j, L, unit)) v.push_back(j);
        }
        return v;
    }
private:
    map<string, int> m;
    bool judge(string S, int start, vector<string> &L, int unit){
        m.clear();
        for(vector<string>::iterator i = L.begin(); i < L.end(); ++i){
            if(m.find(*i) == m.end()){
                m.insert(pair<string, int>(*i, 1));
            }else{
                m[*i]++;
            }
        }
        for(int i = 0; i < L.size(); ++i, start += unit){
            if(m.find(S.substr(start, unit)) == m.end()) return false;
            if(m[S.substr(start, unit)] <= 0) return false;
            m[S.substr(start, unit)]--;
        }
        return true;
    }
};

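This solution is easy to follow, but it exceeds the time limit.

It calls judge for every character of S. Each call first iterates over the elements of L and then walks along S in steps of unit, for at most L.size() steps. The time complexity can reach O(S.length() * (L.size() + L.size())), not counting the cost of each substr call and map operation.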

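Let us think more carefully, using the example from the problem: S: "barfoothefoobarman", L: ["foo", "bar"]. Suppose we are checking whether "foothe" covers all elements of L. The result is of course false, because "the" is not a key in L, and for the same reason "thefoo" does not need to be checked either. So based on checks that already returned false, part of the later checking can be skipped.

This is where the window idea comes in. Suppose the window starts as "bar" and end then moves right by unit. If the part newly taken into the window belongs to L and the window size now equals the total length of L, the current start is recorded in the vector to be returned. If the newly included part does not belong to L at all, start can jump directly to end, and end is set to start + unit. If the newly included part belongs to L but the window already holds as many copies of that string as L has, start must move right until one copy of that string has left the window.

Do not forget that with this window both start and end move in steps of unit. So, to make sure every valid answer is covered, we need unit windows, whose start positions begin at 0, 1, ..., unit - 1 respectively.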
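To see why unit different starting offsets are needed, the sketch below lists the unit-sized chunks that a window starting at a given offset steps over. The helper name chunksAtOffset is made up for illustration and does not appear in the solutions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lists the consecutive chunks of length unit in S, starting at offset.
// chunksAtOffset is an illustrative helper, not part of the solutions.
std::vector<std::string> chunksAtOffset(const std::string& S, std::size_t unit, std::size_t offset) {
    std::vector<std::string> chunks;
    if (unit == 0) return chunks;  // avoid an endless loop on a zero step
    for (std::size_t pos = offset; pos + unit <= S.size(); pos += unit) {
        chunks.push_back(S.substr(pos, unit));
    }
    return chunks;
}
```

With S = "barfoothefoobarman" and unit = 3, offset 0 yields bar, foo, the, foo, bar, man, while offset 1 yields arf, oot, hef, oob, arm. A match that begins at an index not divisible by unit is only visible from one of the other offsets.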

The code written from this idea:

class Solution {
public:
    vector<int> findSubstring(string S, vector<string> &L) {
        vector<int> v;
        if(S.length() == 0 || L.size() == 0) return v;
        int unit = L[0].length();
        int len = unit * L.size();
        if(S.length() < (unit * L.size())) return v;
        map<string, int> m;  // how many copies of each word L contains
        map<string, int> m2; // how many copies of each word the window contains
        for(vector<string>::iterator i = L.begin(); i < L.end(); ++i){
            ++m[*i];
        }
        for(int i = 0; i < unit; ++i){
            int start = i, end = i + unit;
            m2.clear();
            while(end <= S.length()){
                string tmps = S.substr(end-unit, unit);
                if(m.find(tmps) != m.end()){
                    ++m2[tmps];
                    if(m2[tmps] > m[tmps]){ // the window now holds more copies of this word than L does
                        while(S.substr(start, unit) != tmps){
                            m2[S.substr(start, unit)]--;
                            start += unit;
                        }
                        m2[S.substr(start, unit)]--;
                        start += unit;
                    }else if((end - start) == len){   // the window contains every string in L
                        v.push_back(start);
                        m2[S.substr(start, unit)]--;
                        start += unit;
                    }
                    end += unit;
                }else{  // the chunk newly taken into the window is not in L
                    m2.clear();
                    start = end;
                    end = start + unit;
                }
            }
        }
        return v;
    }
};

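This solution uses two maps. The earlier solution with a single map had to re-initialize the map on every call, and each initialization iterates over L, which costs time.

The code from my second pass at this problem uses only one map; the overall idea is basically the same.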

class Solution {
public:
    vector<int> findSubstring(string S, vector<string> &L) {
        vector<int> res;
        if(L.size() == 0) return res;
        if(S.length() == 0) return res;
        int seg = L[0].length();
        if(seg > S.length() || seg == 0) return res;
        int st = 0, count = 0, i = 0;
        map<string, int> remain; // copies of each word still missing from the window (named map in the original)
        string str;
        for(i = 0; i < seg; ++i){
            remain.clear();
            count = 0;
            for(vector<string>::iterator it = L.begin(); it != L.end(); ++remain[*it], ++it);
            for(st = i; st < S.length(); st += seg){
                str = S.substr(st, seg);
                if(remain.find(str) != remain.end() && remain[str] > 0){
                    remain[str]--;
                    ++count;
                    if(count == L.size()){  // found one result
                        st -= ((count-1) * seg);
                        res.push_back(st);
                        remain[S.substr(st, seg)]++;   // drop the front unit of the matching run
                        st += ((count-1) * seg);
                        --count;
                    }
                }else if(count > 0){    // the current chunk does not match, but earlier chunks are still matched: drop the front chunk of the window and retry this chunk
                    st -= (count * seg);
                    remain[S.substr(st, seg)]++;
                    --count;
                    st += (count * seg);
                }
            }
        }
        return res;
    }
};

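The time complexity of this solution can basically be regarded as O(S.length()).

Conclusion

Both the window idea and the idea of using a whole string as a key can greatly simplify string-search problems.

The window idea fits situations like these: we are usually looking for a substring or subarray that as a whole must satisfy some condition, such as having the largest sum or containing the characters of another string, and so on; and the answer is always related to the length of that substring or subarray, for example it must return the one that satisfies

For another example of the window idea applied to an array, see this post: summing over an array.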