Find Sets That Contain At Least One Element From Other Sets
Solution 1:
Since your space is so constrained -- only 20 values from which to choose -- beat this thing to death with a blunt instrument:
- Convert each of your target sets (the ones to be covered) to a bit-map. In your given case, this will correspond to an integer of 20 bits, one bit position for each of the 20 values.
- Create a list of candidate covering bitmaps, the integers 0 through (2^20-1)
- Take the integers in order. Use bit operations to determine whether each target set has a
1
bit in common with the candidate. If all satisfy the basic condition, the candidate is validated. - When you validate a candidate, remove all super-set integers from the list of candidates.
- When you run out of candidates, your validates candidates are the desired collection. In the code below, I simply print each as it is identified.
Code:
from time import time
start = time()
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
# Convert each set to its bit-map
point_set = [7, 28, 48]
# make list of all possible covering bitmaps
cover = list(range(2**20))
while cover:
# Pop any item from remaining covering sets
candidate = cover.pop(0)
# Does this bitmap have a bit in common with each target set?if all((candidate & point) for point in point_set):
print(candidate)
# Remove all candidates that are supersets of the successful covering one.
superset = set([other for other in cover if (candidate & ~other) == 0])
cover = [item for item in cover if item not in superset]
print(time() - start, "lag time")
print(time() - start, "seconds")
Output -- I have not converted the candidate integers back to their constituent elements. This is a straightforward task.
Note that most of the time in this example is spent in exhausting the list of integers that were not supersets of a validated cover set, such as all multiples of 32 (the lower 6 bits are all zero, and thus are disjoint from any cover set).
This 33 seconds is on my aging desktop computer; your laptop or other platform is almost certainly faster. I trust that any improvement from a more efficient algorithm is easily offset in that this algorithm is quick to implement and easier to understand.
170.4029195308685303 lag time180.6517734527587891 lag time200.8456630706787109 lag time361.0555419921875 lag time411.2604553699493408 lag time421.381387710571289 lag time33.005757570266724 seconds
Solution 2:
I have come up with a solution based on the trie data structure as described here. Tries make it relatively fast to determine whether one of the stored sets is a subset of another given set (Savnik, 2013).
The solution then looks as follows:
- Create a trie
- Iterate through the given sets
- In each iteration, go through the sets in the trie and check if they are disjoint with the new set.
- If they are, continue; if not, add corresponding new sets to the trie unless they are supersets of sets in the trie.
The worst-case runtime is O(n m c), whereby m is the maximal number of solutions if we consider only n' <= n of the input sets, and c is the time factor from the subset lookups.
The code is below. I have implemented the algorithm based on the python package datrie, which is a wrapper around an efficent C implementation of a trie. The code below is in cython but can be converted to pure python easily by removing/exchangin cython specific commands.
The extended trie implementation:
from datrie cimport BaseTrie, BaseState, BaseIterator
cdefbint has_subset_c(BaseTrie trie, BaseState trieState, str setarr,
int index, int size):
cdefBaseState trieState2 = BaseState(trie)
cdefint i
trieState.copy_to(trieState2)
for i inrange(index, size):
if trieState2.walk(setarr[i]):
if trieState2.is_terminal() or has_subset_c(trie, trieState2, setarr,
i, size):
returnTrue
trieState.copy_to(trieState2)
returnFalse
cdefclass SetTrie():
def__init__(self, alphabet, initSet=[]):
ifnothasattr(alphabet, "__iter__"):
alphabet = range(alphabet)
self.trie = BaseTrie("".join(chr(i) for i in alphabet))
self.touched = Falsefor i in initSet:
self.trie[chr(i)] = 0ifnot self.touched:
self.touched = Truedefhas_subset(self, superset):
cdefBaseState trieState = BaseState(self.trie)
setarr = "".join(chr(i) for i in superset)
returnbool(has_subset_c(self.trie, trieState, setarr, 0, len(setarr)))
defextend(self, sets):
for s in sets:
self.trie["".join(chr(i) for i in s)] = 0ifnot self.touched:
self.touched = Truedefdelete_supersets(self):
cdefstr elem
cdefBaseState trieState = BaseState(self.trie)
cdefBaseIterator trieIter = BaseIterator(BaseState(self.trie))
if trieIter.next():
elem = trieIter.key()
while trieIter.next():
self.trie._delitem(elem)
ifnot has_subset_c(self.trie, trieState, elem, 0, len(elem)):
self.trie._setitem(elem, 0)
elem = trieIter.key()
if has_subset_c(self.trie, trieState, elem, 0, len(elem)):
val = self.trie.pop(elem)
ifnot has_subset_c(self.trie, trieState, elem, 0, len(elem)):
self.trie._setitem(elem, val)
defupdate_by_settrie(self, SetTrie setTrie, maxSize=inf, initialize=True):
cdefBaseIterator trieIter = BaseIterator(BaseState(setTrie.trie))
cdefstr s
if initialize andnot self.touched and trieIter.next():
for s in trieIter.key():
self.trie._setitem(s, 0)
self.touched = Truewhile trieIter.next():
self.update(set(trieIter.key()), maxSize, True)
defupdate(self, otherSet, maxSize=inf, isStrSet=False):
ifnot isStrSet:
otherSet = set(chr(i) for i in otherSet)
cdefstr subset, newSubset, elem
cdeflist disjointList = []
cdefBaseTrie trie = self.trie
cdefint l
cdefBaseIterator trieIter = BaseIterator(BaseState(self.trie))
if trieIter.next():
subset = trieIter.key()
while trieIter.next():
if otherSet.isdisjoint(subset):
disjointList.append(subset)
trie._delitem(subset)
subset = trieIter.key()
if otherSet.isdisjoint(subset):
disjointList.append(subset)
trie._delitem(subset)
cdefBaseState trieState = BaseState(self.trie)
for subset in disjointList:
l = len(subset)
if l < maxSize:
if l+1 > self.maxSizeBound:
self.maxSizeBound = l+1for elem in otherSet:
newSubset = subset + elem
trieState.rewind()
ifnot has_subset_c(self.trie, trieState, newSubset, 0,
len(newSubset)):
trie[newSubset] = 0defget_frozensets(self):
return (frozenset(ord(t) for t in subset) for subset in self.trie)
defclear(self):
self.touched = False
self.trie.clear()
defprune(self, maxSize):
cdefbint changed = False
cdefBaseIterator trieIter
cdefstr k
if self.maxSizeBound > maxSize:
self.maxSizeBound = maxSize
trieIter = BaseIterator(BaseState(self.trie))
k = ''while trieIter.next():
iflen(k) > maxSize:
self.trie._delitem(k)
changed = True
k = trieIter.key()
iflen(k) > maxSize:
self.trie._delitem(k)
changed = Truereturn changed
def__nonzero__(self):
return self.touched
def__repr__(self):
returnstr([set(ord(t) for t in subset) for subset in self.trie])
This can be used as follows:
def cover_sets(sets):
strie = SetTrie(range(10), *([i] for i in sets[0]))
for s in sets[1:]:
strie.update(s)
return strie.get_frozensets()
Timing:
from timeit importtimeits1= {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
%timeit cover_sets([s1, s2, s3])
Result:
37.8 µs ± 2.97 µs per loop (mean ± std. dev. of7 runs, 10000 loops each)
Note that the trie implementation above works only with keys larger than (and not equal to) 0
. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
Post a Comment for "Find Sets That Contain At Least One Element From Other Sets"