find_similar package
Subpackages
Submodules
find_similar.calc_functions module
Calculation functions to find the similarity percentage
- class find_similar.calc_functions.TokenText(text, tokens=None, dictionary=None, language='russian', remove_stopwords=True, **kwargs)[source]
Bases:
object
The main type to work with text tokens
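A minimal usage sketch based only on the documented constructor signature; the .text and .tokens attribute names shown here are assumptions, not confirmed by this page:

    from find_similar.calc_functions import TokenText

    # Wrap a raw string; tokens are derived from the text when not passed explicitly.
    token_text = TokenText("красная машина", language="russian")

    # Assumed attributes (check the class source for the exact names).
    print(token_text.text)
    print(token_text.tokens)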
- find_similar.calc_functions.calc_cosine_similarity_opt(x_set: set, y_set: set) float [source]
Get the cosine similarity between two sets of words.
- Parameters:
x_set – One set
y_set – Another set
- Returns:
Cosine similarity as a float
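A hedged example of calling the function with two small token sets; the exact formula is not given on this page, so the comment describes the usual set-based cosine similarity:

    from find_similar.calc_functions import calc_cosine_similarity_opt

    x_set = {"красная", "машина"}
    y_set = {"красная", "машина", "быстрая"}

    # For binary set vectors, cosine similarity is typically
    # |X intersect Y| / sqrt(|X| * |Y|); here that would be 2 / sqrt(6).
    similarity = calc_cosine_similarity_opt(x_set, y_set)
    print(similarity)  # a float between 0 and 1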
- find_similar.calc_functions.calc_keywords_rating(text, keywords)[source]
Calculate the keywords rating.
- Parameters:
text – Text to rate
keywords – Keywords
- find_similar.calc_functions.get_tokens(text, dictionary=None, language='russian', remove_stopwords=True) set [source]
Get tokens from a str text.
- Parameters:
text – str text
dictionary – default = None. Pass a dictionary if you want to replace some words with others
language – Language, default = 'russian'
remove_stopwords – default = True. Whether to remove stopwords
- Returns:
Set of tokens for the text
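A short sketch of the call; the exact tokens returned depend on the normalization and stopword handling, so the printed result is not shown:

    from find_similar.calc_functions import get_tokens

    tokens = get_tokens("Красная машина 2021", language="russian")
    print(tokens)  # a set of normalized tokens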
find_similar.calc_models module
Models for calculation and comparison
- exception find_similar.calc_models.LanguageNotFoundException(language)[source]
Bases:
TokenizeException
Language not found error
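A hedged sketch of catching the exception; the assumption that an unsupported language name raises it during tokenization is based on the class description, not on documented behavior:

    from find_similar.calc_models import LanguageNotFoundException
    from find_similar.tokenize import tokenize

    try:
        tokenize("some text", language="klingon")  # assumed unsupported language
    except LanguageNotFoundException as error:
        print(f"Language is not supported: {error}")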
find_similar.core module
Core module with search functions
- find_similar.core.find_similar(text_to_check, texts, language='russian', count=5, dictionary=None, remove_stopwords=True, keywords=None) list[TokenText] [source]
The main function to search for similar texts.
- Parameters:
text_to_check – Text for which to find similar texts
texts – List of str or TokenText to search for similar texts in
language – Language, default = 'russian'
count – Number of results to return
dictionary – default = None. Pass a dictionary if you want to replace some words with others
remove_stopwords – default = True. Whether to remove stopwords
keywords – default = None
- Returns:
Result list sorted by similarity (cosine)
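A usage sketch built from the documented signature; the .text attribute used to print results is an assumption carried over from TokenText above:

    from find_similar.core import find_similar

    texts = [
        "красная машина",
        "синяя машина",
        "красный трактор",
        "зеленое яблоко",
    ]

    results = find_similar("красная машина", texts, language="russian", count=3)

    # Results are TokenText objects ordered from most to least similar.
    for token_text in results:
        print(token_text.text)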
find_similar.package module
Package info
find_similar.tokenize module
Module with tokenization functions
- class find_similar.tokenize.HashebleSet[source]
Bases:
set
Special set subclass with a hash, used to compare and sort two sets
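A brief sketch of why a hashable set helps here: plain Python sets cannot be dictionary keys, while prepare_dictionary below is documented to return a dictionary keyed by HashebleSet. The content-based hashing shown is an assumption:

    from find_similar.tokenize import HashebleSet

    a = HashebleSet({"красная", "машина"})
    b = HashebleSet({"машина", "красная"})

    # Assumes equal contents hash equally, like frozenset.
    replacements = {a: "красное авто"}
    print(replacements[b])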
- find_similar.tokenize.add_nltk_stopwords(language: str, stop_words=None)[source]
Add stopwords to STOP_WORDS_NO_LANGUAGE.
- Parameters:
language – current text language
stop_words – existing stop words
- find_similar.tokenize.get_normal_form(part_parse)[source]
Get the normal form of a word.
- Parameters:
part_parse – pymorphy2 parse object
- Returns:
The object's normal form
- find_similar.tokenize.get_parsed_text(word: str)[source]
Get parsed text.
- Parameters:
word – str word
- Returns:
pymorphy2 parse object
- find_similar.tokenize.get_stopwords_from_nltk(language: str)[source]
Get stopwords for a specific language.
- Parameters:
language – current text language
- find_similar.tokenize.prepare_dictionary(dictionary)[source]
Get a special object from a simple Python dict.
- Parameters:
dictionary – default = None. Pass a dictionary if you want to replace some words with others
- Returns:
Dictionary keyed by HashebleSet with the data
- find_similar.tokenize.remove_part_speech(part_parse, parts=None, dictionary=None)[source]
Remove the variable part of speech from a word.
- Parameters:
part_parse – pymorphy2 parse object
parts – set of part-of-speech tags: NOUN (noun), ADJF (adjective, full form), VERB (verb, personal form), INFN (verb, infinitive), NUMR (numeral), PREP (preposition), CONJ (conjunction), PRCL (particle)
dictionary – default = None. Pass a dictionary if you want to replace some words with others
- Returns:
Text without the variable part of speech, or None
- find_similar.tokenize.replace_yio(text)[source]
Replace the Russian letter ё with е.
- Parameters:
text – Text to change
- Returns:
New text with ё replaced by е
- find_similar.tokenize.replacing(text: str, chars: list)[source]
Replace chars with an empty string.
- Parameters:
text – Text to process
chars – Chars to remove
- Returns:
New text without the chars
- find_similar.tokenize.spacing(text: str, chars: list)[source]
Replace chars with spaces.
- Parameters:
text – Text to process
chars – Chars to replace
- Returns:
New text with the chars replaced by spaces
- find_similar.tokenize.split_text_and_digits(text)[source]
Split words and digits.
- Parameters:
text – input text
- Returns:
List of separated texts
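A hedged illustration of the low-level string helpers above; the commented outputs are expectations based on the docstrings, not verified results:

    from find_similar.tokenize import replace_yio, replacing, spacing, split_text_and_digits

    print(replace_yio("ёлка"))              # expected: "елка"
    print(replacing("a-b.c", ["-", "."]))   # expected: "abc"
    print(spacing("a-b.c", ["-", "."]))     # expected: "a b c"
    print(split_text_and_digits("abc123"))  # expected: the text split from the digits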
- find_similar.tokenize.tokenize(text: str, language: str, dictionary=None, remove_stopwords=True)[source]
Main function to tokenize text.
- Parameters:
text – Text to tokenize
language – Language used to select stop-words
dictionary – default = None. Pass a dictionary if you want to replace some words with others
remove_stopwords – default = True. Remove stopwords if True
- Returns:
Tokens
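A final end-to-end sketch of the tokenizer; the return value is assumed to be a set of tokens, matching get_tokens above:

    from find_similar.tokenize import tokenize

    tokens = tokenize("Красная машина едет быстро", language="russian")
    print(tokens)  # assumed: a set of normalized tokens without stopwords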
Module contents
find-similar package