A First Small Python Project

Edited: 2024-11-28

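Keyword in context (KWIC) is one of the most common formats for displaying concordance lines. This small project is specified as follows: the input is a series of sentences and a given word, and every sentence contains the given word at least once. The goal is to output each occurrence of the word in KWIC format. The basic requirements are: the query word is centered; the preceding context (pre) is right-aligned; the following context (post) is left-aligned; the pre and post contexts have equal length; and if a sentence is too short to fill one side, that side is padded with spaces.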
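
The alignment rule above can be illustrated with Python's built-in string padding. This is only a minimal sketch: the helper name format_kwic_line and the side_width parameter are hypothetical and are not part of the original project.

```python
def format_kwic_line(pre: str, keyword: str, post: str, side_width: int) -> str:
    """Lay out one KWIC line (hypothetical helper, for illustration only)."""
    # Right-align the left context and left-align the right context.
    # rjust/ljust pad with spaces but never truncate, so this sketch
    # assumes both contexts are already at most side_width characters long.
    return f"{pre.rjust(side_width)} {keyword} {post.ljust(side_width)}"


line = format_kwic_line("welfare , and", "secure", "the blessings of", 20)
print(repr(line))
```

Because both sides are padded to the same width, the keyword starts in the same column on every line, whatever the length of the surrounding context.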

Input parameters: sentences (the input sentences), selword (the query word), and window_len (the sliding-window length).

For example, given six input sentences and the word secure, the output is the following string:

               pre keyword    post

     welfare , and secure  the blessings of
     nations , and secured immortal glory with
       , and shall secure  to you the
    cherished . To secure  us against these
     defense as to secure  our cities and
          I can to secure  economy and fidelity

Complete the following function:

def kwic(sentences: List[str], selword: str, window_len: int) -> str:
    """
    :param sentences: input sentences
    :param selword: selected word
    :param window_len: window length
    :return: the KWIC display as a single string
    """
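
One possible way to fill in this stub is sketched below. It is not the author's solution, and it rests on several assumptions that the task statement leaves open: sentences are tokenized on whitespace, a token matches when it starts with selword (so that "secured" matches "secure", as in the example output), only the first match per sentence is shown, window_len is the total window size in tokens so that window_len // 2 tokens are kept on each side, and no header row is printed.

```python
from typing import List


def kwic(sentences: List[str], selword: str, window_len: int) -> str:
    """Return one KWIC line per sentence that contains selword (sketch)."""
    half = window_len // 2  # assumption: context tokens kept on each side
    rows = []
    for sentence in sentences:
        tokens = sentence.split()
        # Assumption: prefix match, case-insensitive, first hit only.
        hits = [i for i, tok in enumerate(tokens)
                if tok.lower().startswith(selword.lower())]
        if not hits:
            continue
        i = hits[0]
        pre = " ".join(tokens[max(0, i - half):i])
        post = " ".join(tokens[i + 1:i + 1 + half])
        rows.append((pre, tokens[i], post))

    if not rows:
        return ""
    # Pad both sides to the same width so pre and post have equal length.
    side_width = max(len(s) for pre, _, post in rows for s in (pre, post))
    key_width = max(len(key) for _, key, _ in rows)
    lines = [
        f"{pre.rjust(side_width)} {key.ljust(key_width)} {post.ljust(side_width)}"
        for pre, key, post in rows
    ]
    return "\n".join(lines)


demo = kwic(
    ["the welfare , and secure the blessings of liberty",
     "nations , and secured immortal glory with them"],
    "secure",
    7,
)
print(demo)
```

Padding the keyword column to the longest matched token keeps the post contexts aligned even when the matched forms differ in length.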

More examples of KWIC displays can be found here:

http://dep.chs.nihon-u.ac.jp/english_lang/tukamoto/kwic_e.html

The complete code and analysis for this project are published on Python中文网.

The code below has been tested and runs as-is. Mistakes are still possible, so corrections are welcome; please submit them at: https://github.com/jackzhenguo/python-small-examples/issues

"""
@file: kwic_service.py
@desc: providing functions about KWIC presentation
@author: group3
@time: 5/9/2021
"""

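import logging
import re
from typing import List

# Module-level logger; kwic_show uses it to report warnings
log = logging.getLogger(__name__)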

Get the window of words around the keyword sel_word; the default window length is 5.

def get_keyword_window(sel_word: str, words_of_sentence: List, length=5) -> List[str]:
    """
    Find the index of sel_word in the sentence, then take a window of
    @length words around it, looking backward and forward.
    For example, for "I am very happy to this course of psd" with sel_word
    "happy" and length 5, this returns: [am, very, happy, to, this]

    If length is even (here 4), this returns: [very, happy, to, this]

    Note: sel_word is treated as a word root, so it matches any word containing it.
    """
    if length <= 0 or len(words_of_sentence) <= length:
        return words_of_sentence
    index = -1
    for iw, word in enumerate(words_of_sentence):
        word = word.lower()
        # Escape sel_word so that regex metacharacters in it are matched literally
        if re.search(re.escape(sel_word.lower()), word):
            index = iw
            break

    if index == -1:
        return words_of_sentence
    if index < length // 2:
        back_slice = words_of_sentence[:index]
        if (length - index) >= len(words_of_sentence):
            return words_of_sentence
        else:
            return back_slice + words_of_sentence[index: index + length - len(back_slice)]
    if (index + length // 2) >= len(words_of_sentence):
        forward_slice = words_of_sentence[index:len(words_of_sentence)]
        if index - length <= 0:
            return words_of_sentence
        else:
            return words_of_sentence[index - (length - len(forward_slice)):index] + forward_slice

    return words_of_sentence[index - length // 2: index + length // 2 + 1] if length % 2 \
        else words_of_sentence[index - length // 2 + 1: index + length // 2 + 1]
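
The final return statement above encodes the odd/even slicing rule. The sketch below isolates just that rule, assuming the keyword index is already known and lies far enough from both ends of the sentence; the function name centered_window is hypothetical, and the boundary handling of the original function is deliberately left out.

```python
from typing import List


def centered_window(words: List[str], index: int, length: int) -> List[str]:
    """Slice `length` words around words[index] (no boundary handling)."""
    half = length // 2
    if length % 2:
        # Odd length: `half` words on each side of the keyword.
        return words[index - half: index + half + 1]
    # Even length: one word fewer on the left than on the right.
    return words[index - half + 1: index + half + 1]


words = ['I', 'am', 'very', 'happy', 'to', 'this', 'course', 'of', 'psd']
print(centered_window(words, 3, 5))  # ['am', 'very', 'happy', 'to', 'this']
print(centered_window(words, 3, 4))  # ['very', 'happy', 'to', 'this']
```

Both calls reproduce the examples given in the docstring of get_keyword_window.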

KWIC display logic:

def kwic_show(sel_language, words_of_sentence, sel_word, window_size=9, align_param=70, token_space_param=1):
    """Return the KWIC string for words_of_sentence, with sel_word as the key token.
    :param sel_language: selected language
    :param words_of_sentence: all words in one sentence
    :param sel_word: key token
    :param window_size: size of the KWIC window, in words
    :param align_param: overall line width used to align the display
    :param token_space_param: number of spaces on each side of the keyword
    :return: tuple (KWIC string, padded pre-context); (None, None) if the keyword is not found

    Changing the default values of window_size and align_param is not recommended.
    """
    if window_size < 1:
        return None, None  # keep the return type consistent: callers index the result
    if window_size >= len(words_of_sentence):
        window_size = len(words_of_sentence)

    words_in_window = get_keyword_window(sel_word, words_of_sentence, window_size)

    sent = ' '.join(words_in_window)
    try:
        key_index = sent.lower().index(sel_word.lower())
    except ValueError:
        key_index = -1
    if key_index == -1:
        return None, None

    align_param = align_param - len(sel_word) - 2 * token_space_param
    if align_param < 0:
        log.warning('align_param must be at least the length of the input word plus its padding')
        return None, None
    pre_part = sent[:key_index].rstrip()
    i, n_pre_words = 1, len(pre_part.split(' '))
    # Drop one leading word per pass until the left context fits its half of the line
    while i < n_pre_words and len(pre_part) > align_param // 2:
        pre_words = pre_part.split(' ')
        pre_words = pre_words[1:]
        pre_part = " ".join(pre_words)
        i += 1

    pre_kwic = pre_part.rjust(align_param // 2)
    key_kwic = token_space_param * ' ' + sent[key_index: key_index + len(sel_word)].lstrip() + token_space_param * ' '

    post_kwic = sent[key_index + len(sel_word):].lstrip()
    n_post_words = len(post_kwic.split(' '))
    i = n_post_words - 1
    while i > 0 and len(post_kwic) > align_param // 2:
        post_kwic_words = post_kwic.split(' ')
        post_kwic_words = post_kwic_words[:i]
        post_kwic = " ".join(post_kwic_words)
        i -= 1

    sel_word_kwic = pre_kwic + key_kwic + post_kwic
    return sel_word_kwic, pre_kwic
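
The two while loops in kwic_show shorten each context by whole words until it fits its half of the line. The same idea can be sketched in isolation as below; the function names trim_left_context and trim_right_context are hypothetical, and the sketch assumes contexts are separated by single spaces and that at least one word is always kept.

```python
def trim_left_context(text: str, max_width: int) -> str:
    """Drop leading words until text fits in max_width (keeps >= 1 word)."""
    words = text.split()
    while len(words) > 1 and len(" ".join(words)) > max_width:
        words = words[1:]  # discard the word farthest from the keyword
    return " ".join(words)


def trim_right_context(text: str, max_width: int) -> str:
    """Drop trailing words until text fits in max_width (keeps >= 1 word)."""
    words = text.split()
    while len(words) > 1 and len(" ".join(words)) > max_width:
        words = words[:-1]  # discard the word farthest from the keyword
    return " ".join(words)


sentence = "we the people of the united states"
print(trim_left_context(sentence, 13))   # united states
print(trim_right_context(sentence, 13))  # we the people
```

Trimming by whole words rather than by characters keeps the words nearest the keyword intact, at the cost of sometimes leaving a side shorter than the available width.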

Test code:

"""
@file: test_kwic_show.py
@desc:
@author: group3
@time: 5/3/2021
"""
from src.feature.kwic import kwic_show

if __name__ == '__main__':
    words = ['I', 'am', 'very', 'happy', 'to', 'this', 'course', 'of', 'psd']

    print(kwic_show('English', words, 'I', window_size=1)[0])
    print(kwic_show('English', words, 'I', window_size=5)[0])

    print(kwic_show('English', words, 'very', token_space_param=5)[0])
    print(kwic_show('English', words, 'very', window_size=6, token_space_param=5)[0])
    print(kwic_show('English', words, 'very', window_size=1, token_space_param=5)[0])

    print(kwic_show('English', words, 'stem', align_param=20)[0])
    print(kwic_show('English', words, 'stem', align_param=100)[0])
    print(kwic_show('English', words, 'II', window_size=1)[0])
    print(kwic_show('English', words, 'related', window_size=10000)[0])

Printed output:

                                  I 
                                  I am very happy to
                        I am     very     happy to this course of psd
                        I am     very     happy to this
                                 very     
None
None
None
None
