I Want To Extract A Certain Number Of Words Surrounding A Given Word In A Long String(paragraph) In Python 2.7

May 08, 2024

I am trying to extract a selected number of words surrounding a given word. I will give an example to make it clear: string = 'Education shall be directed to the full development of t

Solution 1:

This will extract all occurrences of the target word in your text, with context:

import re

text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")

def search(target, text, context=6):
    # It's easier to use re.findall to split the string, 
    # as we get rid of the punctuation
    words = re.findall(r'\w+', text)

    matches = (i for (i, w) in enumerate(words) if w.lower() == target)
    for index in matches:
        if index < context // 2:
            yield words[0:context + 1]
        elif index > len(words) - context // 2 - 1:
            yield words[-(context + 1):]
        else:
            yield words[index - context // 2:index + context // 2 + 1]

print(list(search('the', text)))
# [['be', 'directed', 'to', 'the', 'full', 'development', 'of'], 
#  ['full', 'development', 'of', 'the', 'human', 'personality', 'and'], 
#  ['personality', 'and', 'to', 'the', 'strengthening', 'of', 'respect']]

print(list(search('shall', text)))
# [['Education', 'shall', 'be', 'directed', 'to', 'the', 'full']]

print(list(search('freedoms', text)))
# [['respect', 'for', 'human', 'rights', 'and', 'fundamental', 'freedoms']]
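
The comment in `search` about `re.findall` getting rid of punctuation can be checked directly. Here is a minimal sketch comparing it with a plain `str.split()` on a shortened sample sentence (the sample and the variable names are illustrative, not part of the answer):

```python
import re

sample = "human rights and fundamental freedoms."

# Splitting on whitespace keeps the trailing period attached to the last word.
split_words = sample.split()

# r'\w+' matches runs of letters, digits and underscores, so punctuation is dropped.
regex_words = re.findall(r'\w+', sample)

print(split_words[-1])   # freedoms.
print(regex_words[-1])   # freedoms
```

One caveat of this tokenization: `\w+` also splits contractions, so "don't" becomes two tokens, `don` and `t`.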

Solution 2:

This is tricky, with potential for off-by-one errors, but I think it meets your spec. I have left out the removal of punctuation; it is probably best to strip it before sending the string for analysis. I assumed case was not important.

test_str = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."

def get_surrounding_words(search_word, s, n_words):
    words = s.lower().split(' ')
    try:
        i = words.index(search_word)
    except ValueError:
        return []
    # Word is near start
    if i < n_words // 2:
        words.pop(i)
        return words[:n_words]
    # Word is near end
    elif i >= len(words) - n_words // 2:
        words.pop(i)
        return words[-n_words:]
    # Word is in middle
    else:
        words.pop(i)
        return words[i - n_words // 2:i + n_words // 2]

def test(word):
    print('{}: {}'.format(word, get_surrounding_words(word, test_str, 6)))

test('notfound')
test('development')
test('shall')
test('education')
test('fundamental')
test('for')
test('freedoms')
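
Because punctuation is left in place, the last token of `test_str` is `'freedoms.'`, so `test('freedoms')` finds nothing. One possible way to strip punctuation before the lookup is sketched below. `strip_punctuation` is a hypothetical helper, not part of the answer, and it only handles ASCII punctuation; a shortened test string is used to keep the example small.

```python
import string

def strip_punctuation(s):
    # Hypothetical helper (not from the answer): drop ASCII punctuation characters.
    return ''.join(ch for ch in s if ch not in string.punctuation)

test_str = "respect for human rights and fundamental freedoms."

# Same tokenization as the answer: lowercase, then split on single spaces.
raw_words = test_str.lower().split(' ')
clean_words = strip_punctuation(test_str).lower().split(' ')

print('freedoms' in raw_words)    # False: the token is 'freedoms.'
print('freedoms' in clean_words)  # True
```

Passing `strip_punctuation(test_str)` instead of `test_str` to `get_surrounding_words` would then let the last word be found.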

Solution 3:

import sys

args = sys.argv[1:]
if len(args) != 2:
    sys.exit("Use with <string> <query>")
text = args[0]
query = args[1]
words = text.split()
op = []
left = 3
right = 3
try:
    index = words.index(query)
    if index <= left:
        start = 0
    else:
        start = index - left

    if start + left + right + 1 > len(words):
        start = len(words) - left - right - 1
        if start < 0:
            start = 0

    while len(op) < left + right and start < len(words):
        if start != index:
            op.append(words[start])
        start += 1
except ValueError:
    pass
print(op)

How does this work?
1. Find the word in the string.
2. See if we can make left+right words from that index.
3. Take left+right words and save them in op.
4. Print op.
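
The steps above can also be sketched as a reusable function rather than a command-line script. The name `surrounding_words` and its default arguments are assumptions made for illustration; the logic is intended to mirror the script, not replace it.

```python
def surrounding_words(text, query, left=3, right=3):
    # Hypothetical wrapper around the steps above; names are illustrative.
    words = text.split()
    try:
        index = words.index(query)      # step 1: find the word
    except ValueError:
        return []
    # Step 2: pick a start so that left+right neighbours fit if possible.
    start = max(0, index - left)
    if start + left + right + 1 > len(words):
        start = max(0, len(words) - left - right - 1)
    # Step 3: collect up to left+right words, skipping the query itself.
    op = []
    while len(op) < left + right and start < len(words):
        if start != index:
            op.append(words[start])
        start += 1
    return op

sentence = "Education shall be directed to the full development of the human personality"
print(surrounding_words(sentence, "full"))
# ['directed', 'to', 'the', 'development', 'of', 'the']
print(surrounding_words(sentence, "shall"))
# ['Education', 'be', 'directed', 'to', 'the', 'full']
```

Returning an empty list for a missing word matches what the script prints in that case.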

Solution 4:

A simple approach to your problem. It first separates all the words and then selects words from the left and right of the given word.

def custom_search(sentence, word, n):
    given_string = sentence
    given_word = word
    total_required = n
    word_list = given_string.strip().split(" ")
    length_of_words = len(word_list)

    output_list = []
    given_word_position = word_list.index(given_word)
    word_from_left = 0
    word_from_right = 0
    # Use // so the counts stay integers (same result as / on ints in Python 2.7)
    if given_word_position + 1 > total_required // 2:
        word_from_left = total_required // 2
        if given_word_position + 1 + (total_required // 2) <= length_of_words:
            word_from_right = total_required // 2
        else:
            word_from_right = length_of_words - (given_word_position + 1)
            remaining_words = (total_required // 2) - word_from_right
            word_from_left += remaining_words

    else:
        word_from_right = total_required // 2
        word_from_left = given_word_position
        if word_from_left + word_from_right < total_required:
            remaining_words = (total_required // 2) - word_from_left
            word_from_right += remaining_words
    required_words = []
    for i in range(given_word_position - word_from_left,
                   word_from_right + given_word_position + 1):
        if i != given_word_position:
            required_words.append(word_list[i])
    return required_words

sentence = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
custom_search(sentence, "shall", 6)

>>['Education', 'be', 'directed', 'to', 'the', 'full']


custom_search(sentence, "development", 6)

>>['to', 'the', 'full', 'of', 'the', 'human']

Solution 5:

I don't think regular expressions are necessary here. Assuming the text is well-constructed, just split it up into an array of words, and write a couple if-else statements to make sure it retrieves the necessary amount of surrounding words:

def search(text, word, n):
    # text is the string you are searching
    # word is the word you are looking for
    # n is the TOTAL number of words you want surrounding the word

    words    = text.split(" ")  # Create an array of words from the string
    position = words.index(word)   # Find the position of the desired word

    distance_from_end = len(words) - position - 1  # How many words are after the word in the text

    if position < n // 2 + n % 2:  # If there aren't enough words before...
        return words[:position], words[position + 1:n + 1]

    elif distance_from_end < n // 2 + n % 2:  # If there aren't enough words after...
        return words[position - n + distance_from_end:position], words[position + 1:]

    else:  # Otherwise, split the words between both sides (the extra word comes from the left if n is odd)
        return words[position - n // 2 - n % 2:position], words[position + 1:position + 1 + n // 2]

string = "Education shall be directed to the full development of the human personality and to the \
strengthening of respect for human rights and fundamental freedoms."

print(search(string, "shall", 6))
# >> (['Education'], ['be', 'directed', 'to', 'the', 'full'])

print(search(string, "human", 5))
# >> (['development', 'of', 'the'], ['personality', 'and'])

In your example you didn't have the target word included in the output, so I kept it out as well. If you'd like the target word included simply combine the two arrays the function returns (join them at position).
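
Combining the two returned lists might look like the sketch below. The values are copied from the `search(string, "human", 5)` example output; the variable names are illustrative.

```python
# Values taken from the search(string, "human", 5) example output.
before, after = (['development', 'of', 'the'], ['personality', 'and'])
word = "human"

# Re-insert the target word between the two halves.
with_target = before + [word] + after
print(with_target)
# ['development', 'of', 'the', 'human', 'personality', 'and']

# Or keep only the context words, as in the question's example.
context_only = before + after
print(context_only)
# ['development', 'of', 'the', 'personality', 'and']
```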

Hope this helped!
