Find Repeated DNA Sequences

Given a string S which represents DNA sequence, the task is to find all the 10-letter long substring that are repeated more than once. Returning the sequence can be done in any order.

DNA sequence is string which consists of the 4 characters A, C, G and T.

Examples:

Input: S = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”
Output: [“AAAAACCCCC”, “CCCCCAAAAA”]
Explanation: Both the substrings “AAAAACCCCC” and “CCCCCAAAAA” occur more than once in the string s.

Input: S = “AAAAAAAAAAAAA”
Output: [“AAAAAAAAAA”]
Explanation: Substring “AAAAAAAAAA” occurs more than once in the substring.

Approach: To solve the problem, follow the below idea:

The problem can be solved using two sets, say seen and repeated. The seen set stores the strings which occurs only once. When we encounter a substring which is already present in seen, then we push the substring to the repeated set. After iterating over all the substrings, print all the strings in the repeated set.

Step-by-step algorithm:

  • Starting from the first substring, iterate over all the substrings of length 10.
  • Maintain two sets, say seen and repeated.
  • For any substring str, check if str is present in seen.
  • If str is present in seen, then insert str to repeated.
  • Else if str is not present in seen, then insert str to seen.
  • After iterating over all the substrings, print all the strings in repeated.

Below is the implementation of the algorithm:

Python
def findRepeatedDnaSequences(s):
    seen = set()
    repeated = set()
    
    for i in range(len(s) - 9):
        sequence = s[i:i + 10]
        if sequence in seen:
            repeated.add(sequence)
        else:
            seen.add(sequence)
    
    return list(repeated)

# Example usage
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
print(findRepeatedDnaSequences(s))
JavaScript
function findRepeatedDnaSequences(s) {
    const seen = new Set();
    const repeated = new Set();

    for (let i = 0; i < s.length - 9; i++) {
        const sequence = s.slice(i, i + 10);
        if (seen.has(sequence)) {
            repeated.add(sequence);
        } else {
            seen.add(sequence);
        }
    }

    return Array.from(repeated);
}

// Example usage
const s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT";
console.log(findRepeatedDnaSequences(s));

// This code is contributed by Shivam

Output
['CCCCCAAAAA', 'AAAAACCCCC']

Time Complexity: O(10 * N), where N is the length of string.
Auxiliary Space: O(10 * N)