Go crawler framework colly in practice (4) -- crawling Zhihu answers (2) -- word cloud visualization
2022-06-25 00:16:00 【You're like an ironclad treasure】
Original link: Hzy Blog
Today I want to do some simple processing on the data and then visualize it. So I decided to roughly count the anime titles that appear in the answers and then output a word cloud based on word frequency!
Let's look at the result first.
The code is on my GitHub, along with some small projects from my process of learning Go.
Continuing from yesterday: yesterday I crawled the answers on Zhihu and saved them to a file.
- First, read the file line by line (each line is one answer).
- After reading a line, do some simple segmentation, for example extracting only the anime titles inside book-title marks 《》. (P.S. Other languages have libraries for this kind of analysis, such as jieba in Python, but I did not find a similar library in Go, so I just wrote a simple one myself.)
- After extracting and counting the anime titles, visualize them. For this I found go-echarts on GitHub.
2. A brief introduction to go-echarts
Install:
go get -u github.com/go-echarts/go-echarts
Documentation: https://go-echarts.github.io/go-echarts/
go-echarts is built on ECharts, the charting library open-sourced by Baidu, and provides a concise API.
3. Everything is ready, so it is time to write the code
3.1 First open the file, read each line, split it to find the anime titles, and then count them.
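/*
Word count
*/
package wordCount // package name as referenced by the handler (wordCount.WordCount)

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)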
// Pair and PairList implement sort.Interface, because a map cannot easily be sorted by value.
type Pair struct {
	Key   string
	Value int
}

type PairList []Pair

func (p PairList) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p PairList) Len() int           { return len(p) }
func (p PairList) Less(i, j int) bool { return p[j].Value < p[i].Value } // descending order

// WordCount maps a word to its count; each count is an int stored as interface{}.
type WordCount map[string]interface{}
// SplitByMoreStr reports whether r is one of the symbols at which a sentence is split.
func SplitByMoreStr(r rune) bool {
	splitSymbol := []rune("《》<>")
	for _, v := range splitSymbol {
		if r == v {
			return true
		}
	}
	return false
}
// SplitAndStatistics splits the line that was read and does a simple count.
func (wc WordCount) SplitAndStatistics(s string) {
	for _, v := range strings.FieldsFunc(s, SplitByMoreStr) {
		// Drop spaces and the trailing line break, and skip empty fragments:
		// an empty key would match every later fragment in the loop below.
		v = strings.TrimSpace(strings.Replace(v, " ", "", -1))
		if v == "" {
			continue
		}
		found := false
		for key := range wc {
			if strings.Contains(v, key) { // the fragment contains a key already in the map: just add 1
				wc[key] = wc[key].(int) + 1
				found = true
			}
		}
		if !found {
			wc[v] = 1 // v cannot already be a key, otherwise the loop above would have matched it
		}
	}
}
// ReadFile reads the file line by line and counts each line.
func (wc WordCount) ReadFile(f *os.File) {
	rd := bufio.NewReader(f)
	for {
		line, err := rd.ReadString('\n') // read up to and including the next '\n'
		if line != "" {
			wc.SplitAndStatistics(line) // split and count; this also covers a last line without '\n'
		}
		if err != nil {
			if err != io.EOF {
				fmt.Println("read error:", err)
			}
			break
		}
	}
}
// AnalysisResut sorts the counts and prints the result. It is not actually used.
func (wc WordCount) AnalysisResut() {
	// Copy the map into a PairList, which implements sort.Interface, so that it can be sorted
	pl := make(PairList, len(wc))
	i := 0
	for k, v := range wc {
		pl[i] = Pair{k, v.(int)}
		i++
	}
	sort.Sort(pl)
	for _, pair := range pl {
		fmt.Println(pair.Value, pair.Key)
	}
}
3.2 After splitting, all that is left is to output the word cloud.
With the library above installed, that's all it takes.
// Route handler: outputs the word cloud
func handler(w http.ResponseWriter, _ *http.Request) {
	nwc := charts.NewWordCloud()
	nwc.SetGlobalOptions(charts.TitleOpts{Title: "Zhihu question:"})
	wc := make(wordCount.WordCount)
	f, err := os.Open(wordCount.Path + "answer.txt")
	if err != nil {
		// Report the error to the client instead of panicking inside the handler
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer f.Close()
	wc.ReadFile(f)
	nwc.Add("wordcloud", wc, charts.WordCloudOpts{SizeRange: []float32{14, 250}})
	nwc.Render(w)
}
// Exists reports whether the file exists
func Exists(path string) bool {
	_, err := os.Stat(path) // os.Stat returns the file's metadata
	return err == nil || os.IsExist(err)
}
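func main() {
	// Crawl the answers first if the answer file does not exist yet
	if !Exists(wordCount.Path + "answer.txt") {
		wordCount.QuestionAnswer()
	}
	http.HandleFunc("/", handler)
	// ListenAndServe only returns on failure, e.g. when the port is already in use
	if err := http.ListenAndServe(":8081", nil); err != nil {
		panic(err)
	}
}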
Summary: this was pretty fun. Next time I'll try a better, more accurate way of counting. That is really a natural language processing problem, ha ha. I know of it, but I haven't played with it yet…