Extract the language you need from the multilingual data set and save it as CSV
2022-07-16 08:46:00 【LAN Qilin】
If a dataset contains several languages, how do you identify each one? Here I use the detection functions from the langdetect package. Without further ado, here is the code.
# -*- coding: utf-8 -*-
# langdetect supports 55 languages: af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
# Sample input: 'Otec matka syn.'
from langdetect import detect
from langdetect import detect_langs
# Detection is non-deterministic: when the text is too short or ambiguous,
# the result can differ from run to run.
# To get the same result every time, add the following two lines:
from langdetect import DetectorFactory
DetectorFactory.seed = 0


def getLangs(text):
    """Return the most likely language code for text, e.g. 'de' or 'zh-cn'."""
    return detect(text)


if __name__ == '__main__':
    # Text taken directly from a web page after cleaning
    text = ' Chinese corpus '
    # Detect the language
    print(detect(text))
    # Candidate languages with their probabilities
    print(detect_langs(text))
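Short or ambiguous strings are where detection is least reliable, so it can help to accept a guess only when its probability is high. The sketch below is not part of the original script: `pick_language`, the `Candidate` stand-in and the `min_prob` threshold of 0.9 are hypothetical names and values chosen for illustration. It only assumes that each candidate exposes `lang` and `prob` attributes, as the items returned by `detect_langs` do.

```python
from collections import namedtuple


def pick_language(candidates, min_prob=0.9):
    """Return the most probable language code, or None if it is below min_prob.

    `candidates` is any iterable of objects with `.lang` and `.prob`
    attributes, which is the shape of the list returned by
    langdetect.detect_langs.
    """
    best = max(candidates, key=lambda c: c.prob, default=None)
    if best is None or best.prob < min_prob:
        return None
    return best.lang


# Stand-in candidates so the sketch runs without langdetect installed;
# in real use, pass detect_langs(text) instead.
Candidate = namedtuple('Candidate', ['lang', 'prob'])

print(pick_language([Candidate('de', 0.97), Candidate('nl', 0.03)]))  # de
print(pick_language([Candidate('fr', 0.57), Candidate('it', 0.43)]))  # None
```

In the script above this would be called as `pick_language(detect_langs(text))`; rows that come back as `None` can be skipped or set aside for manual checking.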
Dataset format
3,In Ordnung other,"Die Hülle ist schön, jedoch nicht ganz mein Geschmack."
3, Try the battery without しでした. drugstore,すでにこのタイプの Compensator をもってらっしゃる Fang は、 Battery も Hand yuan にあるのでしょうが、 first めて Starting with した body としては、 Try the battery がついてなかったことは、ちょっとがっかりさせられました. today までニコンイヤーファッションなどの Ear point tonic device を Buy った Age には、たいてい Try the battery が1 One ついてましたから. Mender body の send い The winner などは、これからぼちぼちみていきたいと thinking います.
5,Super kit kitchen,"J'ai changé toutes les pièces facilement, c'est top merci"
1,toujours pas recu home,toujour s pas reçu ou est mon colis
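Some reviews contain commas inside quotation marks, so the rows should be read with a CSV parser rather than split on commas. As a small illustration using two of the sample rows above, Python's standard `csv` module yields three fields per row. The names `score`, `title` and `review` are my own labels for the three columns, not names defined by the dataset.

```python
import csv
import io

sample = '''3,In Ordnung other,"Die Hülle ist schön, jedoch nicht ganz mein Geschmack."
5,Super kit kitchen,"J'ai changé toutes les pièces facilement, c'est top merci"
'''

# csv.reader keeps the quoted review text together even though it contains commas
rows = list(csv.reader(io.StringIO(sample)))
for score, title, review in rows:
    print(score, '|', title, '|', review)
```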
The extraction script below is written for this format; for other formats it needs only minor adjustments.
'''Extract the required languages from the dataset.'''
import os
import pandas as pd
# The import assumes the first script is saved as extractLanguage.py
from extractLanguage import getLangs

# data_path = "NAO/NAO-WS/rnn/data/amazon_test_data"
data_path = "data"  # Directory containing the dataset
# Required languages; langdetect reports Simplified Chinese as 'zh-cn', not 'zh_cn'
need_langs = ['zh-cn', 'en', 'de']
file_names = ['eval.csv', 'test.csv', 'train.csv']  # Files in the dataset directory

for need_lang in need_langs:
    store_path = data_path + "/" + need_lang  # Output directory for this language
    if not os.path.exists(store_path):
        os.makedirs(store_path)
    for name in file_names:
        df = pd.read_csv(data_path + "/" + name, sep=",", header=None)
        data_lang = []
        for i in df.itertuples():
            # itertuples() puts the row index at position 0, so the CSV columns start at 1
            row_data = list(i)
            text_data = str(row_data[2]) + str(row_data[3])
            lang = getLangs(text_data)
            if lang == need_lang:
                temp = [row_data[1], row_data[2], row_data[3]]
                data_lang.append(temp)
        df_lang = pd.DataFrame(data_lang)
        df_lang.to_csv(store_path + "/" + name, index=False, header=False)
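The loop above detects the language of every row once per requested language. A possible variant detects each row once and sorts it into the matching bucket. This is only a sketch under my own assumptions: `split_by_language` and its parameters are hypothetical names, it uses the standard `csv` module instead of pandas, and the detector is passed in as `detect_fn` so that `getLangs` or any other function can be used. Rows whose detection fails are skipped, since langdetect raises `LangDetectException` when the text has no usable features (for example, digits only).

```python
import csv
import os


def split_by_language(src_path, out_dir, wanted, detect_fn):
    """Copy rows of src_path into out_dir/<lang>/<file name> for each wanted language.

    The language of each row is detected once, from the second and third
    columns joined together. Returns a dict of language code -> rows written.
    """
    name = os.path.basename(src_path)
    buckets = {lang: [] for lang in wanted}
    with open(src_path, newline='', encoding='utf-8') as src:
        for row in csv.reader(src):
            if len(row) < 3:
                continue  # skip malformed rows
            try:
                lang = detect_fn(row[1] + row[2])
            except Exception:
                # Broad catch is an assumption of this sketch: any detector
                # failure (e.g. LangDetectException) just skips the row.
                continue
            if lang in buckets:
                buckets[lang].append(row[:3])
    for lang, rows in buckets.items():
        lang_dir = os.path.join(out_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        out_path = os.path.join(lang_dir, name)
        with open(out_path, 'w', newline='', encoding='utf-8') as dst:
            csv.writer(dst).writerows(rows)
    return {lang: len(rows) for lang, rows in buckets.items()}
```

With the helper from the first script, a call might look like `split_by_language('data/train.csv', 'data', ['zh-cn', 'en', 'de'], getLangs)`, which writes to the same `data/<language>/train.csv` layout as the script above.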
If you have other questions, please contact me