Statistical analysis of catering data --- Teddy cloud course homework

2022-07-24 06:08:00 【Probably gouqing】

Statistical analysis of catering data

Diet is the foundation of daily life. Working people often order takeout for dinner, which means ordering through a catering platform. Merchants on such a platform must ensure food safety. According to Article 99 in Chapter X (Supplementary Provisions) of the Food Safety Law of the People's Republic of China, food safety means that food is non-toxic and harmless, meets the appropriate nutritional requirements, and causes no acute, subacute, or chronic harm to human health. Merchants should operate in good faith and not break the law for the sake of petty profits.

A catering takeout platform provides online ordering services to a large user base, and its market share has grown in recent years. After a user orders on the platform, the platform prompts them to review and rate the dishes they tried, which produces the dish rating data file mealrating.parquet. This data contains 5 fields, described in Table 1-1. The platform also provides the menu data file meal_list.txt, which contains 3 fields, described in Table 1-2.

Based on these two data sets, the tasks are as follows.
(1) Create the database and tables and import the data.
(2) Clean the data (missing values, outliers, etc.).
(3) Explore the data (relationships between fields).
(4) Analyze the data statistically (for example, popular dishes and user ratings).

The goal is to process the data provided by the catering platform, analyze the user and dish information, and suggest marketing strategies for the platform.

Getting started

Reference: Hive major assignment
The differences between that post and mine:

One of my tables is in .parquet format; both of his are in txt format.
The field names are slightly different.

The two tables:
[Image of the two tables not preserved]

1. Create two tables

First, create the Hive database:

hive>create database meal;

hive>use meal;

1) Create the mealrating table:

hive> create table mealrating (
    > userid string,
    > mealid string,
    > rating double,
    > review string,
    > reviewtime string)
    > stored as parquet;

Import the data (the file mealrating.parquet is in /data):

hive>load data local inpath '/data/mealrating.parquet' overwrite into table mealrating;

A quick check:

hive>select * from mealrating;
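Selecting every row prints the whole table, which can be slow on a large file. A lighter check, sketched here on the assumption that the load above succeeded, is to look at the schema, the row count, and only a handful of rows:

```sql
-- Show the column names and types Hive recorded for the table.
describe mealrating;

-- Count the loaded rows, then look at a few of them instead of the whole table.
select count(*) as n_rows from mealrating;
select * from mealrating limit 10;
```

If the Parquet column names or types do not match the table definition, the affected columns usually show up as NULL in this preview.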

2) Create the meal_list table (typed by hand here, not re-verified):

hive>create table meal_list(
	>id int,
	>mealid string,
	>mealname string)
	>row format delimited
	>fields terminated by ',';

hive>load data local inpath '/data/meal_list.txt' overwrite into table meal_list;

hive>select * from meal_list;

Data cleaning

In meal_list, the mealid field is missing from row 1686 onward. Those rows are not needed, so they have to be deleted from the Hive table.
Reference for deleting in Hive: deleting part of the data in a Hive table
Idea: because the table has no partitions, it can be cleaned by overwriting it with only the rows to keep.
Code:

hive> insert overwrite table meal_list
	> select * from meal_list limit 1685;

Result screenshot: the rows come back in reverse order, but that does not affect later queries on meal_list.
[Screenshot not preserved]
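Hive does not guarantee which rows a LIMIT without ORDER BY returns, so keeping "the first 1685 rows" depends on how the file happens to be read. A hedged alternative is to filter on the missing field itself. The sketch below assumes a missing mealid is stored as NULL or as a blank string; check which case applies before overwriting.

```sql
-- How many rows would be dropped? (Assumes a missing mealid is NULL or blank.)
select count(*) as n_missing
from meal_list
where mealid is null or trim(mealid) = '';

-- Rewrite the table, keeping only the rows that have a mealid.
insert overwrite table meal_list
select id, mealid, mealname
from meal_list
where mealid is not null and trim(mealid) <> '';
```

Running the count first gives a number to compare against the expected total before any data is overwritten.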

2. Analysis

1. Daily sales volume, counted from the user rating data:

hive>select count(1) from mealrating where reviewtime between 1496100000 and 1496200000;

-- 538

2. Daily number of customers

hive>select count(distinct userid) from mealrating where reviewtime between 1497100000 and 1497200000;

-- 408
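The two windows above each span 100,000 seconds, slightly more than one 86,400-second day, and each covers a single hand-picked range. As a rough sketch, the same two counts can be produced for every calendar day at once. This assumes reviewtime holds Unix epoch seconds stored as text, which the values in the queries above suggest but the source does not state; day boundaries follow the time zone Hive is running in.

```sql
-- Assumes reviewtime holds Unix epoch seconds stored as text.
select from_unixtime(cast(reviewtime as bigint), 'yyyy-MM-dd') as review_day,
       count(1)               as n_ratings,  -- rating records on that day
       count(distinct userid) as n_users     -- distinct users on that day
from mealrating
group by from_unixtime(cast(reviewtime as bigint), 'yyyy-MM-dd')
order by review_day;
```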
3. Number of records that have both a rating and review text

select count(1) from mealrating where rating is not null and review is not null;

-- 38383
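Task (2) also mentions missing values and outliers in the rating data. One possible profiling query, offered only as a sketch, counts the NULLs per column and reports the smallest and largest rating so that out-of-range scores stand out. Treating a blank review as missing is an assumption.

```sql
select count(*)                                          as n_rows,
       sum(case when userid is null then 1 else 0 end)   as null_userid,
       sum(case when mealid is null then 1 else 0 end)   as null_mealid,
       sum(case when rating is null then 1 else 0 end)   as null_rating,
       -- Assumption: a blank review counts as missing.
       sum(case when review is null or trim(review) = '' then 1 else 0 end) as empty_review,
       min(rating)                                       as min_rating,
       max(rating)                                       as max_rating
from mealrating;
```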
4. Distribution of user ratings

hive> select *, cast(rating / (sum(rating) over ()) as decimal(8,2)) as rat_percent
    > from (
    > select rating,
    > count(1) rat_num,
    > cast(sum(rating) / count(1) as decimal(8,2)) avg_rat
    > from mealrating group by rating
    > ) as p order by rat_percent desc;
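Note that rat_percent in the query above divides each rating value by the sum of the distinct rating values, and avg_rat within a group of identical ratings is just that rating. If the intent is the share of records at each score, a sketch along the following lines may be closer. The column name share is introduced here for illustration.

```sql
select rating,
       rat_num,
       -- Share of all rating records that carry this score.
       cast(rat_num / sum(rat_num) over () as decimal(8,4)) as share
from (
  select rating, count(1) as rat_num
  from mealrating
  group by rating
) p
order by rating;
```

The shares across all rows should add up to 1, apart from rounding.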

5. Top 10 most popular dishes

select mealname, count(mealname) as frequency from mealrating join meal_list on mealrating.mealid=meal_list.mealid group by mealname order by frequency desc limit 10;

6. Top ten dishes by number of 5.0 ratings

select mealname,rating,count(mealname) as frequency from mealrating join meal_list on mealrating.mealid=meal_list.mealid where rating=5 group by mealname ,rating order by frequency desc limit 10 ;
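Counting 5.0 ratings favors dishes that are simply ordered often. A complementary view, sketched here, ranks dishes by average rating while requiring a minimum number of ratings. The threshold of 20 is a hypothetical value chosen for illustration, not something from the assignment.

```sql
select l.mealname,
       count(*)                            as n_ratings,
       cast(avg(r.rating) as decimal(8,2)) as avg_rating
from mealrating r
join meal_list l on r.mealid = l.mealid
group by l.mealname
having count(*) >= 20   -- hypothetical minimum number of ratings
order by avg_rating desc, n_ratings desc
limit 10;
```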

7. Number of users who rated more than twice in a day

select count(*) from (select reviewtime,userid,count(*) from mealrating group by reviewtime,userid having count(*)>2)as tmp;

-- 2038
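The query above groups by the raw reviewtime value, so it counts (reviewtime, userid) pairs rather than users per calendar day, and a user who qualifies on several days is counted several times. A hedged variant, again assuming reviewtime is Unix epoch seconds stored as text, groups by the derived date and counts each user once:

```sql
select count(distinct userid) as n_users
from (
  select userid,
         from_unixtime(cast(reviewtime as bigint), 'yyyy-MM-dd') as review_day,
         count(*) as n_ratings
  from mealrating
  group by userid, from_unixtime(cast(reviewtime as bigint), 'yyyy-MM-dd')
  having count(*) > 2   -- more than two ratings on the same day
) t;
```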
8. For users who have rated more than twice, find each user's highest-rated record

select UserID,max(Rating) from mealrating group by UserID having UserID in (select UserID from mealrating group by UserID HAVING count(MealID)>2);
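The query above returns each qualifying user's highest score, not the full record. If the whole row is wanted, one possible approach is to use window functions, as in this sketch. When a user has several records tied for the top score, row_number() keeps just one of them arbitrarily.

```sql
select userid, mealid, rating, review, reviewtime
from (
  select m.*,
         -- Number of ratings this user has made.
         count(mealid) over (partition by userid) as n_ratings,
         -- Rank this user's records from highest to lowest rating.
         row_number() over (partition by userid order by rating desc) as rn
  from mealrating m
) t
where n_ratings > 2 and rn = 1;
```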

Copyright notice
This article was written by [Probably gouqing]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/205/202207240517309549.html
