Statistical analysis of catering data --- Teddy cloud course homework
2022-07-24 06:08:00 【Probably gouqing】
Statistical analysis of catering data
Diet is the foundation of daily life. Working people often order takeout for dinner, which means shopping on a catering platform. Businesses on such platforms must ensure food safety. According to Article 99 (Chapter X, Supplementary Provisions) of the Food Safety Law of the People's Republic of China, food safety means that food is non-toxic and harmless, meets the appropriate nutritional requirements, and causes no acute, subacute, or chronic harm to human health. Businesses should operate in good faith and not break the law for the sake of small profits.
A catering takeout platform provides online ordering to a large user base, and its market share has been growing in recent years. When users order on the platform, it prompts them to rate and review the dishes they have tried, which produces a dish rating data file, mealrating.parquet. The data contains 5 fields, described in Table 1-1. The platform also provides menu data, meal_list.txt, which contains 3 fields, described in Table 1-2.
Based on these two data sets, the tasks are as follows.
(1) Create the database and tables and import the data.
(2) Clean the data (missing values, outliers, and so on).
(3) Explore the data (relationships between fields).
(4) Run statistical analysis (for example, popular dishes and user ratings).
By processing the data provided by the catering platform and analyzing user and dish information, provide marketing strategies for the platform.
Getting started
Reference: the "Hive big homework" post
My setup differs from that reference in two ways:
- One of my tables is in .parquet format; both of his are in txt format
- The field names are slightly different
The two tables:

1. Create two tables
First create the Hive database:
hive> create database meal;
hive> use meal;
1) Create the mealrating table:
hive> create table mealrating (
> userid string,
> mealid string,
> rating double,
> review string,
> reviewtime string)
> stored as parquet;
Import the data (mealrating.parquet is in /data):
hive> load data local inpath '/data/mealrating.parquet' overwrite into table mealrating;
A quick check:
hive> select * from mealrating;

2) Create the meal_list table (I didn't check this one by hand):
hive>create table meal_list(
>id int,
>mealid string,
>mealname string)
>row format delimited
>fields terminated by ',';
hive>load data local inpath '/data/meal_list.txt' overwrite into table meal_list;
hive>select * from meal_list;

Data cleaning
Starting from row 1686 of meal_list, the mealid field is missing, so those rows are not needed and can be dropped (that is, deleting part of a Hive table's data).
Reference for deleting in Hive: "Deleting some data from a Hive table"
Idea: because the table is not partitioned, it can be cleaned by overwriting it with only the rows to keep.
Code:
hive> insert overwrite table meal_list
> select * from meal_list limit 1685;
Result (screenshot): the rows come back in reverse order, but that does not affect later queries on meal_list.
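One caveat with the overwrite above: limit 1685 without an order by keeps whichever 1685 rows the query happens to return, which matches the first 1685 lines of the file only if the scan preserves file order. Filtering on the missing field itself is less fragile. The sketch below illustrates that filter using Python's built-in sqlite3 purely as a stand-in for Hive; the sample rows and dish names are made up, and whether the missing values load as NULL or as empty strings is an assumption, so the filter checks both.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meal_list (id INTEGER, mealid TEXT, mealname TEXT)")

# Made-up sample rows; the last two mimic rows whose mealid is missing.
rows = [
    (1, "m001", "dish A"),
    (2, "m002", "dish B"),
    (3, "", "dish C"),    # empty string, as a delimited text file may yield
    (4, None, "dish D"),  # NULL
]
conn.executemany("INSERT INTO meal_list VALUES (?, ?, ?)", rows)

# Keep only rows whose mealid is present. The same WHERE clause could be
# placed in the Hive statement: insert overwrite table meal_list select ...
KEEP_VALID = """
    SELECT id, mealid, mealname
    FROM meal_list
    WHERE mealid IS NOT NULL AND mealid <> ''
    ORDER BY id
"""
cleaned = conn.execute(KEEP_VALID).fetchall()
print(cleaned)
```

With this approach the result does not depend on how many bad rows there are or where they sit in the file.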
2. Analysis
1. Count the daily sales volume from the user rating data:
hive>select count(1) from mealrating where reviewtime between 1496100000 and 1496200000;
–538
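The bounds in the query above look like Unix epoch seconds. Assuming that is what reviewtime holds, and assuming UTC (the time zone of the data is not stated), the short Python sketch below shows what the window covers and how exact calendar-day bounds could be computed instead. The helper names to_utc and day_bounds are made up for illustration.

```python
from datetime import datetime, timezone

START, END = 1496100000, 1496200000  # bounds used in the query above


def to_utc(ts):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def day_bounds(year, month, day):
    """Return (start, end) epoch seconds for one UTC calendar day; end is exclusive."""
    start = int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())
    return start, start + 86400


window_seconds = END - START
print(to_utc(START).isoformat(), to_utc(END).isoformat())
print(window_seconds / 3600, "hours")
print(day_bounds(2017, 5, 30))
```

Under these assumptions the window runs from 2017-05-29 23:20:00 to 2017-05-31 03:06:40 UTC, which is 100000 seconds (about 27.8 hours) rather than exactly one day, so the count is an approximation of daily volume.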
2. Number of users who ordered in a day
select count(distinct userid) from mealrating where reviewtime between 1497100000 and 1497200000;
–408
3. Count the records that have both a rating and review text
select count(1) from mealrating where rating is not null and review is not null;
–38383
4. Analyze the distribution of user ratings
select *, cast(rating / (sum(rating) over()) as decimal(8,2)) as rat_percent
> from (
> select rating,
> count(1) rat_num,
> cast(sum(rating) / count(1) as decimal(8,2)) avg_rat
> from mealrating group by rating
> ) as p order by rat_percent desc;
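Note that in the query above rat_percent divides each rating value by the sum of the distinct rating values, and avg_rat inside a group by rating is just the rating itself. If the intent is the share of records at each score, one option is to divide the per-score count by the total count. The sketch below shows that variant with Python's built-in sqlite3 as a stand-in for Hive; the sample rows and the column name rat_share are made up.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mealrating (userid TEXT, mealid TEXT, rating REAL)")

# Made-up sample ratings.
sample = [
    ("u1", "m1", 5.0),
    ("u2", "m1", 5.0),
    ("u3", "m2", 4.0),
    ("u1", "m3", 3.0),
]
conn.executemany("INSERT INTO mealrating VALUES (?, ?, ?)", sample)

# Share of records per score = count for that score / total number of records.
SHARE_BY_RATING = """
    SELECT rating,
           COUNT(*) AS rat_num,
           ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM mealrating), 2) AS rat_share
    FROM mealrating
    GROUP BY rating
    ORDER BY rat_share DESC, rating DESC
"""
distribution = conn.execute(SHARE_BY_RATING).fetchall()
for rating, rat_num, rat_share in distribution:
    print(rating, rat_num, rat_share)
```

The shares computed this way add up to 1 across all scores, which makes the result easier to read as a distribution.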

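5. Top 10 popular dishes
The first attempt misspelled join as jon and failed:
select name ,count(name)as frequency from mealrating jon meal_list on mealrating.mealid=meal_list.mealid group by name order by frequency desc limit 10;
FAILED: ParseException line 1:57 missing EOF at 'meal_list' near 'jon'
Corrected query (the column in meal_list is mealname, not name):
select mealname, count(mealname) as frequency from mealrating join meal_list on mealrating.mealid = meal_list.mealid group by mealname order by frequency desc limit 10;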

6. Top ten dishes by number of 5.0 ratings
select mealname, rating, count(mealname) as frequency from mealrating join meal_list on mealrating.mealid = meal_list.mealid where rating = 5 group by mealname, rating order by frequency desc limit 10;

7. Number of users who rated more than twice in a day
select count(*) from (select reviewtime,userid,count(*) from mealrating group by reviewtime,userid having count(*)>2)as tmp;
–2038
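The query above groups by the raw reviewtime value. If reviewtime holds epoch seconds (an assumption based on the ranges used earlier), grouping by the calendar day derived from it may be closer to "per day". The sketch below shows the idea with Python's built-in sqlite3 as a stand-in for Hive; the sample rows are made up, reviewtime is stored as an integer here for simplicity, and days are taken in UTC. In Hive, from_unixtime combined with to_date could play the role of SQLite's date(..., 'unixepoch').

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mealrating "
    "(userid TEXT, mealid TEXT, rating REAL, reviewtime INTEGER)"
)

# Made-up sample rows; reviewtime is epoch seconds.
sample = [
    ("u1", "m1", 5.0, 1496102400),  # 2017-05-30 (UTC)
    ("u1", "m2", 4.0, 1496110000),  # 2017-05-30
    ("u1", "m3", 3.0, 1496150000),  # 2017-05-30
    ("u2", "m1", 5.0, 1496110000),  # 2017-05-30
    ("u2", "m2", 4.0, 1496200000),  # 2017-05-31
    ("u2", "m3", 4.0, 1496210000),  # 2017-05-31
]
conn.executemany("INSERT INTO mealrating VALUES (?, ?, ?, ?)", sample)

# Group by (calendar day, user) and keep groups with more than two ratings.
PER_DAY = """
    SELECT date(reviewtime, 'unixepoch') AS review_day,
           userid,
           COUNT(*) AS n
    FROM mealrating
    GROUP BY review_day, userid
    HAVING COUNT(*) > 2
"""
frequent = conn.execute(PER_DAY).fetchall()
print(frequent)
```

In the sample data only u1 has more than two ratings on a single day, even though u2 also has three ratings overall.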
8. For users who have rated more than twice, find each user's highest rating
select UserID,max(Rating) from mealrating group by UserID having UserID in (select UserID from mealrating group by UserID HAVING count(MealID)>2);