[LeetCode 1520] Maximum Number of Non-Overlapping Substrings
Given a string s of lowercase letters, you need to find the maximum number of non-empty substrings of s that meet the following conditions:
- The substrings do not overlap, that is, for any two substrings s[i..j] and s[k..l], either j < k or i > l is true.
- A substring that contains a certain character c must also contain all occurrences of c.
Find the maximum number of substrings that meet the above conditions. If there are multiple solutions with the same number of substrings, return the one with minimum total length. It can be shown that there exists a unique solution of minimum total length.
Notice that you can return the substrings in any order.
Example 1:
Input: s = "adefaddaccc"
Output: ["e","f","ccc"]
Explanation: The following are all the possible substrings that meet the conditions:
[
"adefaddaccc"
"adefadda",
"ef",
"e",
"f",
"ccc",
]
If we choose the first string, we cannot choose anything else and we'd get only 1. If we choose "adefadda", we are left with "ccc", which is the only one that doesn't overlap, thus obtaining 2 substrings. Notice also that it's not optimal to choose "ef" since it can be split into two. Therefore, the optimal way is to choose ["e","f","ccc"], which gives us 3 substrings. No other solution with the same number of substrings exists.
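The list of candidates in the explanation above can be reproduced by brute force. The following is only a minimal sketch for small inputs (it is far too slow for the stated constraints); the class and method names `ValidSubstringEnumerator`, `allValid`, and `isValid` are hypothetical and not part of the problem or of the solutions below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
public class ValidSubstringEnumerator {
    // Returns every substring s[i..j] that contains all occurrences of each of its letters.
    public static List<String> allValid(String s) {
        int n = s.length();
        int[] first = new int[26];
        int[] last = new int[26];
        Arrays.fill(first, n);
        Arrays.fill(last, -1);
        for (int i = 0; i < n; i++) {
            int c = s.charAt(i) - 'a';
            first[c] = Math.min(first[c], i);
            last[c] = Math.max(last[c], i);
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                if (isValid(s, first, last, i, j)) {
                    result.add(s.substring(i, j + 1));
                }
            }
        }
        return result;
    }

    // s[i..j] is valid if no letter inside it also occurs before i or after j.
    public static boolean isValid(String s, int[] first, int[] last, int i, int j) {
        for (int k = i; k <= j; k++) {
            int c = s.charAt(k) - 'a';
            if (first[c] < i || last[c] > j) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allValid("adefaddaccc"));
    }
}
```

For "adefaddaccc" this yields the same six candidates as the list above, although in a different order.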
Example 2:
Input: s = "abbaccd"
Output: ["d","bb","cc"]
Explanation: Notice that while the set of substrings ["d","abba","cc"] also has length 3, it's considered incorrect since it has larger total length.
Constraints:
1 <= s.length <= 10^5
s contains only lowercase English letters.
The key idea is a greedy strategy.
1. If two valid substrings overlap, choose the shorter one.
2. Consider two overlapping valid substrings, each being the shortest valid substring for its start index. The one that starts later is the shorter one. In fact, the one that starts later is a substring of the one that starts first.
Proof of the above claim: denote by s1 the substring that starts first and by s2 the one that starts second, so s1.start <= s2.start. Suppose s2.end > s1.end, and let s_suffix = s[s1.end + 1..s2.end] be the part of s2 that does not belong to s1. Because s1 is valid, it contains every occurrence of each of its letters, so no letter of s1 appears in s_suffix. Because s2 is valid, every occurrence of a letter of s_suffix lies inside s2, and therefore inside s_suffix itself. This means we can simply cut s_suffix off s2, and what remains of s2 is still a valid substring. So if there is an s2 such that s2.end > s1.end, then there is definitely a shorter s2 such that s2.end <= s1.end.
3. Claim 2 tells us that when we are searching for a valid substring and have already found one, there is no need to add more characters to it. We should start a new search for the next valid substring.
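To see how the three claims above turn into a selection rule, here is a small sketch that assumes the candidate windows are already known. The input format (an array of {start, end} pairs sorted by start, each the shortest valid substring for its start) and the names `GreedyPick` and `pick` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class GreedyPick {
    // windows: {start, end} pairs sorted by start, each the shortest valid
    // substring beginning at its start index (assumed precomputed).
    public static List<int[]> pick(int[][] windows) {
        List<int[]> chosen = new ArrayList<>();
        int[] prev = null;
        for (int[] w : windows) {
            if (prev != null && w[0] > prev[1]) {
                chosen.add(prev); // no overlap, so prev can no longer be replaced
            }
            // Otherwise w overlaps prev; by claim 2 it is nested in prev and
            // shorter, so by claim 1 it replaces prev.
            prev = w;
        }
        if (prev != null) {
            chosen.add(prev);
        }
        return chosen;
    }

    public static void main(String[] args) {
        String s = "abbaccd";
        // Windows for "abba", "bb", "cc", "d".
        int[][] windows = {{0, 3}, {1, 2}, {4, 5}, {6, 6}};
        for (int[] w : pick(windows)) {
            System.out.println(s.substring(w[0], w[1] + 1));
        }
    }
}
```

On Example 2, "abba" is replaced by the nested "bb", and the sketch keeps "bb", "cc", and "d".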
Solution that I came up with during the contest.
1. Preprocess all letters' indices and save them in tree sets.
2. Go through the input string s, and do the following:
a. If the current character is the first appearance of its letter, we have a possible start of a valid substring, and there are only 26 possible ending characters.
b. Check each possibility and pick the ending character that makes the substring the shortest.
c. Check against previously found valid substrings. If there is overlap, delete the previous one and keep the current one. This is achieved by using a max heap keyed on the end index.
d. After the scan, add all the remaining valid substrings to the final answer list.
To check if a substring can end with character C, we do the following.
1. Given C, the start and end indices are already determined: start is the first appearance of the current letter and end is the last appearance of C.
2. If end < start, it cannot. Otherwise, check all 26 letters to see if each one is either not in [start, end] at all or completely contained inside. If a letter has at least one appearance inside [start, end] and its first or last appearance is outside this window, we know that we cannot use C as the ending letter. TreeSet supports this check in O(logN) per letter.
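The runtime is bounded by O(N * 26 * 26 * logN) and space is O(N). The bound is loose, because only the at most 26 first appearances trigger the 26 * 26 checks.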
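The per-letter test from step 2 can be looked at in isolation. The sketch below uses `TreeSet.ceiling`, `first`, and `last` from the Java standard library; the class and method names `BoundaryCheck` and `crossesBoundary` are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
public class BoundaryCheck {
    // True if the letter occurs inside [start, end] and also somewhere outside it.
    public static boolean crossesBoundary(TreeSet<Integer> positions, int start, int end) {
        if (positions.isEmpty()) {
            return false;
        }
        Integer inside = positions.ceiling(start); // smallest index >= start, or null
        boolean occursInside = inside != null && inside <= end;
        return occursInside && (positions.first() < start || positions.last() > end);
    }

    public static void main(String[] args) {
        // Positions of 'a' in "abbaccd".
        TreeSet<Integer> a = new TreeSet<>(Arrays.asList(0, 3));
        System.out.println(crossesBoundary(a, 1, 2)); // false: 'a' does not occur in "bb"
        System.out.println(crossesBoundary(a, 0, 2)); // true: "abb" misses the 'a' at index 3
        System.out.println(crossesBoundary(a, 0, 3)); // false: "abba" contains both
    }
}
```

A window is usable only if no letter crosses its boundary, which is why the full check repeats this test for all 26 letters.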
import java.util.*;

class Solution {
    public List<String> maxNumOfSubstrings(String s) {
        List<String> ans = new ArrayList<>();
        // Max heap keyed on the end index of each kept substring.
        PriorityQueue<int[]> maxPq = new PriorityQueue<>(Comparator.comparingInt(a -> -a[1]));
        TreeSet<Integer>[] idx = new TreeSet[26];
        for (int i = 0; i < 26; i++) {
            idx[i] = new TreeSet<>();
        }
        for (int i = 0; i < s.length(); i++) {
            idx[s.charAt(i) - 'a'].add(i);
        }
        for (int i = 0; i < s.length(); i++) {
            int d = s.charAt(i) - 'a';
            // Only the first appearance of a letter can start a valid substring.
            if (idx[d].first() < i) {
                continue;
            }
            int l = idx[d].first();
            int r = s.length();
            // Check which char we should use as the ending of the current substring.
            for (int j = 0; j < 26; j++) {
                if (idx[j].size() > 0 && check(idx, l, idx[j].last())) {
                    r = Math.min(r, idx[j].last());
                }
            }
            if (r < s.length()) {
                // Drop previously kept substrings that contain the new one.
                while (maxPq.size() > 0 && maxPq.peek()[1] > r) {
                    maxPq.poll();
                }
                maxPq.add(new int[]{l, r});
            }
        }
        while (maxPq.size() > 0) {
            int[] e = maxPq.poll();
            ans.add(s.substring(e[0], e[1] + 1));
        }
        return ans;
    }

    private boolean check(TreeSet<Integer>[] idx, int start, int end) {
        if (end < start) {
            return false;
        }
        for (int i = 0; i < 26; i++) {
            if (idx[i].size() == 0) {
                continue;
            }
            // If the current char i cannot be completely contained within [start, end], return false.
            int firstIdx = idx[i].first();
            int lastIdx = idx[i].last();
            Integer inBetweenIdx = idx[i].ceiling(start);
            if (inBetweenIdx != null && inBetweenIdx <= end && (firstIdx < start || lastIdx > end)) {
                return false;
            }
        }
        return true;
    }
}
A better solution using the same ideas.
Instead of saving all indices, we can just save the first and last index of each letter. Then we iterate over the input s, and for each first-appearance character we do the following:
1. We start with an initial window spanning the first and last appearance of the current letter; call it [left, right].
2. Starting from left, keep adding characters until we find a valid substring. If a letter in the window has appeared prior to left, we know it is impossible to get a valid substring starting from left. Otherwise, we add the new letter to the current window and extend the right boundary if necessary.
3. Step 2 either tells us that it is impossible or terminates with a valid window.
4. Compare the new window with the previous valid window. If they do not overlap, add the previous window to the final answer and update the previous window to be the newly found window. If they overlap, the new window is nested inside the previous one, so replace the previous window with the new window.
The runtime is O(N * 26) and space is O(26 * 2), which is O(1), not counting the output. The window search inside the outer for loop executes at most 26 times, as there are at most 26 first-appearing letters, and each search scans at most N characters.
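Step 2 can also be sketched on its own, separate from the selection in step 4. The names `WindowExtension`, `firstLast`, and `extend` are hypothetical, and returning -1 for "impossible" is an assumed convention of this sketch only.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
public class WindowExtension {
    // Returns {first, last}, where first[c] and last[c] are the first and last index of letter c.
    public static int[][] firstLast(String s) {
        int[] first = new int[26];
        int[] last = new int[26];
        Arrays.fill(first, s.length());
        Arrays.fill(last, -1);
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i) - 'a';
            first[c] = Math.min(first[c], i);
            last[c] = i; // indices only increase, so the latest one wins
        }
        return new int[][]{first, last};
    }

    // Returns the right end of the shortest valid substring starting at left,
    // or -1 if no valid substring starts there.
    public static int extend(String s, int[][] fl, int left) {
        int[] first = fl[0];
        int[] last = fl[1];
        int right = last[s.charAt(left) - 'a'];
        for (int j = left; j <= right; j++) {
            int c = s.charAt(j) - 'a';
            if (first[c] < left) {
                return -1; // this letter also occurs before the window
            }
            right = Math.max(right, last[c]); // the window must cover its last occurrence
        }
        return right;
    }

    public static void main(String[] args) {
        String s = "adefaddaccc";
        int[][] fl = firstLast(s);
        System.out.println(extend(s, fl, 0)); // 7: "adefadda"
        System.out.println(extend(s, fl, 1)); // -1: 'a' at index 4 also occurs at index 0
        System.out.println(extend(s, fl, 2)); // 2: "e"
        System.out.println(extend(s, fl, 8)); // 10: "ccc"
    }
}
```

Because right can only grow while j advances, each call scans a character at most once, which is consistent with the O(N * 26) bound stated above.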
import java.util.*;

class Solution {
    public List<String> maxNumOfSubstrings(String s) {
        int n = s.length();
        // bound[c][0] is the first index of letter c, bound[c][1] is the last.
        int[][] bound = new int[26][2];
        for (int i = 0; i < 26; i++) {
            bound[i][0] = n;
            bound[i][1] = -1;
        }
        for (int i = 0; i < n; i++) {
            int offset = s.charAt(i) - 'a';
            bound[offset][0] = Math.min(i, bound[offset][0]);
            bound[offset][1] = Math.max(i, bound[offset][1]);
        }
        List<String> ans = new ArrayList<>();
        int[] prev = null;
        for (int i = 0; i < n; i++) {
            int offset = s.charAt(i) - 'a';
            if (bound[offset][0] == i) {
                int left = bound[offset][0];
                int right = bound[offset][1];
                boolean can = true;
                // Extend the window until it covers every letter it contains.
                for (int j = left; j <= right; j++) {
                    if (bound[s.charAt(j) - 'a'][0] < left) {
                        can = false;
                        break;
                    }
                    right = Math.max(right, bound[s.charAt(j) - 'a'][1]);
                }
                if (can) {
                    if (prev == null) {
                        prev = new int[]{left, right};
                    } else if (left > prev[1]) {
                        // No overlap: the previous window is final.
                        ans.add(s.substring(prev[0], prev[1] + 1));
                        prev = new int[]{left, right};
                    } else {
                        // Overlap: the new window is nested in the previous one and shorter.
                        prev[0] = left;
                        prev[1] = right;
                    }
                }
            }
        }
        if (prev != null) {
            ans.add(s.substring(prev[0], prev[1] + 1));
        }
        return ans;
    }
}