FlashText, a data cleaning tool that boosts efficiency by dozens of times

2022-06-28 02:39:00 【Python concentration camp】

In ordinary small-scale data filtering and cleaning, regular expressions are the most commonly used tool. But as the data scale grows, regular expressions start to fall short.

With regular expressions, searching for 15k keywords in a 10k-word vocabulary takes about 0.165 seconds. FlashText needs only 0.002 seconds. On this task, therefore, FlashText is about 82 times faster than regular expressions.

[Figure: performance comparison between regular expressions and FlashText]

From the performance comparison in the figure above, you can see that as the number of keywords to process grows, the processing time of regular expressions increases almost linearly, while FlashText's stays almost constant.
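To make the comparison more concrete, here is a minimal sketch of the two approaches using only the standard library. FlashText is built around a trie (a nested dictionary of characters), but the code below is a simplified illustration, not the library's actual implementation: the names `END`, `build_trie`, and `extract` are hypothetical, and only single-word keywords are handled. The point is that the trie scan does one dictionary lookup per character no matter how many keywords are stored, whereas a regular expression built from an alternation has more alternatives to try as the keyword list grows.

```python
import re

END = "_end_"  # hypothetical marker key meaning "a keyword ends at this node"


def build_trie(keywords):
    """Store each keyword (lowercased) in a nested-dict trie, with its alias at the end node."""
    trie = {}
    for keyword, alias in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = alias
    return trie


def extract(trie, text):
    """Scan the text once, walking the trie one character at a time."""
    found = []
    node = trie  # current trie position; None means the current word cannot match
    for ch in text.lower() + " ":  # the trailing space flushes the last word
        if ch.isalnum() or ch == "_":
            if node is not None:
                node = node.get(ch)
        else:
            # Word boundary: report a keyword only if one ends exactly here
            if node is not None and END in node:
                found.append(node[END])
            node = trie
    return found


keywords = {"python": "Python", "scala": "Java"}
text = "I like Python and Scala."

# Trie-based lookup
trie_found = extract(build_trie(keywords), text)

# Regular expression baseline: one alternation over all keywords
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
)
regex_found = [keywords[m.group().lower()] for m in pattern.finditer(text)]

print(trie_found)   # ['Python', 'Java']
print(regex_found)  # ['Python', 'Java']
```

Both approaches return the same result here. The difference only shows up in running time once the keyword list becomes large.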

1、 Prepare the flashtext environment

Install flashtext with pip (other installation methods work too). The command below uses the Tsinghua University mirror.

pip install flashtext -i https://pypi.tuna.tsinghua.edu.cn/simple

With the flashtext environment ready, let's walk through the main ways of using flashtext, which will help us carry out data cleaning more effectively.

2、 Add keywords

Here keywords are added to the processor's keyword dictionary one at a time with the add_keyword function. The first parameter is the keyword to add. The optional second parameter is an alias for that keyword: when the keyword is found, the alias is returned in its place, and if no alias is given, the keyword itself is returned.

from flashtext import KeywordProcessor

# Initialize the keyword processor
processor = KeywordProcessor()

# Add a keyword the normal way
processor.add_keyword('Python')

# Add a keyword with an alias: 'Scala' will be reported as 'Java'
processor.add_keyword('Scala', 'Java')

The required keywords have now been added to the keyword processor in both ways.

3、 Extract keywords

After the previous step, the keywords are stored in the keyword processor. Now simply call extract_keywords to extract them from a string.

# Extract keywords from a string
found = processor.extract_keywords('I like Python and Scala.')

# Result
print(found)
# ['Python', 'Java']

The result is as expected: Python is returned as is, and Scala is reported under its alias Java.

4、 Replace keywords

Keywords are replaced with the replace_keywords function. Each matched keyword is replaced by its alias, so only keywords registered with an alias visibly change, just as Scala was reported as Java above.

Below, the keyword Scala in a string is replaced. Because the alias of Scala is Java, Scala in the string should become Java.

replaced = processor.replace_keywords('I like Scala.')

# Result
print(replaced)
# I like Java.
# Scala was found in the string, so it was replaced by Java.

5、 Get all keywords

Sometimes you may not remember which keywords have been added to a KeywordProcessor. In that case, use the get_all_keywords function to retrieve all current keywords. The result maps each keyword (lowercased, since matching is case-insensitive by default) to its alias.

all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'python': 'Python', 'scala': 'Java'}

6、 Add keywords in batch

When many keywords need to be added, you can add them in batches from a list or a dictionary, using add_keywords_from_list and add_keywords_from_dict respectively. In the dictionary form, each key is the alias and its value is the list of keywords that map to it.

# Initialize a dictionary for batch addition: alias -> list of keywords
dict_ = {
    'java': ['java_ee', 'java_se', 'java_me'],
    'python': ['pandas', 'all']
}

# Add keywords in batch from the dictionary
processor.add_keywords_from_dict(dict_)

# Match against the keywords that were just added in batch
result = processor.extract_keywords('looking for java_ee and pandas.')

# Result
print(result)
# ['java', 'python']

# Add keywords in batch from a list
processor.add_keywords_from_list(['scala', 'python', 'scala', 'go'])

# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'python': 'python', 'pandas': 'python', 'scala': 'scala', 'java_ee': 'java', 'java_se': 'java', 'java_me': 'java', 'all': 'python', 'go': 'go'}

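All the keywords have been added to the keyword processor, and the duplicate entry was not added a second time. Note that re-adding python and scala through the list replaced their earlier aliases: they now map to themselves rather than to Python and Java.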

7、 Delete keywords in batch

Keywords can likewise be removed from the keyword processor in batches in two ways, from a list or from a dictionary, using remove_keywords_from_list and remove_keywords_from_dict respectively.

# Remove keywords in batch from a list
processor.remove_keywords_from_list(['python', 'java_ee', 'java_me'])

# Remove keywords in batch from a dictionary
processor.remove_keywords_from_dict({'python': ['pandas', 'all']})

# Use get_all_keywords to look at all the keywords
all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'scala': 'scala', 'java_se': 'java', 'go': 'go'}

All the keywords that were supposed to be removed are gone.

8、 Comparison of execution efficiency

For a more striking illustration, here are two charts comparing the efficiency of flashtext and regular expressions when searching for and replacing keywords. The difference is clear at a glance.

[Figure: search efficiency comparison between flashtext and regular expressions]