find_similar package

Subpackages

Submodules

find_similar.calc_functions module

Calculation functions for computing the similarity percentage

class find_similar.calc_functions.TokenText(text, tokens=None, dictionary=None, language='russian', remove_stopwords=True, **kwargs)[source]

Bases: object

The main class for working with text tokens.
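
A hedged usage sketch; the text and tokens attributes read below are assumptions based on the constructor arguments and the rest of this page, not documented attributes:

    from find_similar.calc_functions import TokenText

    # Requires NLTK stop-word data and pymorphy2 to be available.
    token_text = TokenText("красный кирпич", language="russian")
    print(token_text.text)    # assumed: the original text
    print(token_text.tokens)  # assumed: the computed token set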

find_similar.calc_functions.calc_cosine_similarity_opt(x_set: set, y_set: set) → float[source]

Get the cosine similarity between two sets of words.

Parameters:
  • x_set – One set

  • y_set – Another set

Returns:

Cosine similarity
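
For reference, the usual set-based cosine treats each set as a binary vector over the vocabulary, giving |X ∩ Y| / sqrt(|X| · |Y|). A minimal standalone sketch of that formula (not necessarily the exact implementation used here):

    import math

    def set_cosine(x_set: set, y_set: set) -> float:
        """Cosine similarity of two sets viewed as binary vectors."""
        if not x_set or not y_set:
            return 0.0
        return len(x_set & y_set) / math.sqrt(len(x_set) * len(y_set))

    print(set_cosine({"red", "brick", "wall"}, {"red", "brick"}))  # ~0.816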

find_similar.calc_functions.calc_keywords_rating(text, keywords)[source]

Calculate the keywords rating.

Parameters:
  • text – Input text

  • keywords – Keywords

find_similar.calc_functions.get_tokens(text, dictionary=None, language='russian', remove_stopwords=True) → set[source]

Get tokens from str text.

Parameters:
  • text – str text

  • dictionary – default=None. If you want to replace some words with others, you can pass the dictionary

  • language – Language, default='russian'

  • remove_stopwords – default=True. Whether to remove stopwords

Returns:

Tokens for the text
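
A hedged usage sketch (the output in the comment is illustrative; the actual tokens depend on the stop-word list and pymorphy2 normalization):

    from find_similar.calc_functions import get_tokens

    tokens = get_tokens("Красный кирпич 250 штук", language="russian")
    print(tokens)  # a set of normalized tokens, e.g. {"красный", "кирпич", "250", "штука"}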

find_similar.calc_functions.sort_search_list(token_texts, keywords=None)[source]

Sort the search list.

Parameters:
  • token_texts – Texts with tokens

  • keywords – Keywords, default=None

find_similar.calc_models module

Models for calculation and comparison

exception find_similar.calc_models.LanguageNotFoundException(language)[source]

Bases: TokenizeException

Language not found error

exception find_similar.calc_models.TokenizeException[source]

Bases: Exception

Base Exception class for Tokenize exceptions

find_similar.core module

Core module with search functions

find_similar.core.find_similar(text_to_check, texts, language='russian', count=5, dictionary=None, remove_stopwords=True, keywords=None) → list[TokenText][source]

The main function to search for similar texts.

Parameters:
  • text_to_check – Text for which to find similar texts

  • texts – List of str or TokenText objects in which to search for similar texts

  • language – Language, default='russian'

  • count – Results count

  • dictionary – default=None. If you want to replace some words with others, you can pass the dictionary

  • keywords – Keywords, default=None

  • remove_stopwords – default=True. Whether to remove stopwords

Returns:

Result list sorted by similarity percentage (cosine)
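
A hedged end-to-end sketch of the documented signature (the .text attribute on the returned TokenText objects is an assumption; NLTK stop-word data and pymorphy2 must be installed):

    from find_similar.core import find_similar

    texts = [
        "кирпич красный полнотелый",
        "кирпич белый",
        "плитка тротуарная",
    ]
    results = find_similar("красный кирпич", texts, language="russian", count=2)
    for token_text in results:
        print(token_text.text)  # assumed attribute; most similar texts come first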

find_similar.package module

Package info

find_similar.tokenize module

Module with tokenize functions

class find_similar.tokenize.HashebleSet[source]

Bases: set

A set subclass with a hash, used to compare and sort sets.
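
A hedged sketch of why a hashable set is useful (a plain set cannot be hashed or used as a dictionary key); the exact hashing scheme is not documented on this page:

    from find_similar.tokenize import HashebleSet

    key = HashebleSet({"красный", "кирпич"})
    print(hash(key))            # works; hash(set(...)) would raise TypeError
    mapping = {key: "кирпич"}   # usable as a dict key
    print(mapping[key])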

find_similar.tokenize.add_nltk_stopwords(language: str, stop_words=None)[source]

Add stopwords to STOP_WORDS_NO_LANGUAGE.

Parameters:
  • language – current text language

  • stop_words – existing stop words

find_similar.tokenize.get_normal_form(part_parse)[source]

Get the normal form.

Parameters:
  • part_parse – pymorphy2 parse object

Returns:

The normal form of the object

find_similar.tokenize.get_parsed_text(word: str)[source]

Get parsed text.

Parameters:
  • word – str word

Returns:

pymorphy2 object

find_similar.tokenize.get_stopwords_from_nltk(language: str)[source]

Get stopwords for a specific language.

Parameters:
  • language – current text language
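
A hedged sketch (the NLTK stopwords corpus must already be downloaded, e.g. with nltk.download("stopwords"); the exact return type is not documented here):

    from find_similar.tokenize import get_stopwords_from_nltk

    stop_words = get_stopwords_from_nltk("russian")
    print(len(list(stop_words)))  # number of Russian stop words provided by NLTK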

find_similar.tokenize.prepare_dictionary(dictionary)[source]

Get a special object from a simple Python dict.

Parameters:
  • dictionary – default=None. If you want to replace some words with others, you can pass the dictionary.

Returns:

Dictionary of HashebleSet with data

find_similar.tokenize.remove_part_speech(part_parse, parts=None, dictionary=None)[source]

Remove the variable part of speech from a word.

Parameters:
  • part_parse – pymorphy2 object

  • parts – set of parts of speech: NOUN (noun), ADJF (adjective, full form), VERB (verb, personal form), INFN (verb, infinitive), NUMR (numeral), PREP (preposition), CONJ (conjunction), PRCL (particle)

  • dictionary – default=None. If you want to replace some words with others, you can pass the dictionary.

Returns:

Text without the variable part of speech, or None

find_similar.tokenize.replace_yio(text)[source]

Replace the Russian letter 'ё' with 'е'.

Parameters:
  • text – Text to change

Returns:

New text with 'ё' replaced by 'е'
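
A minimal sketch of the documented behaviour:

    from find_similar.tokenize import replace_yio

    print(replace_yio("зелёная ёлка"))  # "зеленая елка"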

find_similar.tokenize.replacing(text: str, chars: list)[source]

Replace chars with an empty string.

Parameters:
  • text – Text to process

  • chars – Chars to remove

Returns:

New text without the chars

find_similar.tokenize.spacing(text: str, chars: list)[source]

Replace chars with spaces.

Parameters:
  • text – Text to process

  • chars – Chars to replace

Returns:

New text with the chars replaced by spaces
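
A short sketch of replacing and spacing above (the character lists are illustrative):

    from find_similar.tokenize import replacing, spacing

    print(replacing("п.1.2", ["."]))          # "п12"   - listed chars removed
    print(spacing("красный-кирпич", ["-"]))   # "красный кирпич" - listed chars become spaces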

find_similar.tokenize.split_text_and_digits(text)[source]

Split words and digits.

Parameters:
  • text – input text

Returns:

List of separated texts
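
A hedged sketch (the exact return format shown in the comment is an assumption based on the docstring):

    from find_similar.tokenize import split_text_and_digits

    print(split_text_and_digits("кирпич250"))  # e.g. ["кирпич", "250"]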

find_similar.tokenize.tokenize(text: str, language: str, dictionary=None, remove_stopwords=True)[source]

Main function to tokenize text.

Parameters:
  • text – Text to tokenize

  • language – language for selecting stop-words

  • dictionary – default=None. If you want to replace some words with others, you can pass the dictionary.

  • remove_stopwords – default=True. Remove stopwords if True

Returns:

Tokens
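
A hedged usage sketch (the replacement dictionary is omitted because its expected format, a raw dict versus one prepared with prepare_dictionary, is not stated on this page):

    from find_similar.tokenize import tokenize

    tokens = tokenize("Красный кирпич, 250 шт.", language="russian")
    print(tokens)  # a set of normalized tokens with stop words removed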

find_similar.tokenize.use_dictionary_multiple(tokens, dictionary)[source]

Use the dictionary with multiple correspondences

Module contents

find-similar package