

An efficient way to find the longest repeated string in Python (from Programming Pearls)

Published: 2023/12/13

My solution is based on a suffix array, constructed by prefix doubling of the longest common prefix. The worst-case complexity is O(n (log n)^2). The task file "iliad.mb.txt" took 4 seconds on my laptop. The code is well documented in the functions suffix_array and longest_common_substring. The latter function is short and can easily be modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question (copy here) if repeated strings are longer than 10000 characters.

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can easily be modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if substring not in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. The results
    are then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value
               is faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]

    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The longest common string is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
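As an independent sanity check of the suffix-array code, a brute-force reference implementation can be useful. The helper below is hypothetical (not part of the original answer): it binary-searches the answer length and tests each candidate length with a set of substrings, so it is O(n^2 log n) and only suitable for short inputs:

```python
def longest_repeated_bruteforce(text):
    """Brute-force reference: a longest substring occurring at least twice.
    Only for short inputs / cross-checking the suffix-array result."""
    def find_repeat(k):
        # Return any substring of length k that occurs twice, else None.
        seen = set()
        for i in range(len(text) - k + 1):
            s = text[i:i + k]
            if s in seen:
                return s
            seen.add(s)
        return None

    # Binary search on the length: if a repeat of length k exists,
    # a repeat of length k - 1 exists too (take its prefix).
    lo, hi, best = 0, len(text), ''
    while lo < hi:
        mid = (lo + hi + 1) // 2
        s = find_repeat(mid)
        if s is not None:
            best, lo = s, mid
        else:
            hi = mid - 1
    return best

print(longest_repeated_bruteforce('banana'))  # -> ana
```

On the same text, the length of its result should always equal max(lcp) from suffix_array.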

I prefer this solution over more complicated O(n log n) ones because Python has very fast list sorting (list.sort), probably faster than the linear-time operations required by the method in that article, which should be O(n) only under very special assumptions: random strings over a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.

The code in another answer, based on grow_chains, is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they appear frequently, e.g. in collections of "independent" school homework. A program should not freeze on them.
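A quick sketch of such a stress input (the exact 8 kB figure and layout are illustrative, not taken from the benchmark itself): build a text in which one long random block occurs twice, and verify that the implementation under test still terminates promptly on it:

```python
import random
import string

random.seed(0)  # reproducible sketch
# An ~8000-character random block that occurs twice, separated by filler.
block = ''.join(random.choice(string.ascii_lowercase) for _ in range(8000))
filler = ''.join(random.choice(string.ascii_lowercase) for _ in range(1000))
text = block + filler + block  # 17000 characters, longest repeat >= 8000
```

Feeding this text to any longest-repeated-string routine should take time comparable to its cost on ordinary literary text of the same size.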
