Plotly + seaborn + folium: a visual exploration of Airbnb rental data
2022-06-23 06:36:00 【PIDA】
Author: Peter  Editor: Peter
Hello everyone, I'm Peter~
Airbnb is short for AirBed and Breakfast ("Air-b-n-b"); its Chinese name translates as "air accommodation". It is a service website that connects travelers with hosts who have vacant rooms to rent, and it provides users with a wide range of accommodation information.
This article explores and analyzes a Kaggle dataset of Airbnb listings in Singapore. The original notebook is here: https://www.kaggle.com/bavalpreet26/singapore-airbnb/notebook
Airbnb rental data from around the world has been collected and published online for reference. Data address: http://insideairbnb.com/get-the-data.html
The site covers many cities, including Beijing and Shanghai in China, and everything is free to download. Interested readers can play with these datasets.
This article picks the Garden City, also known as the Lion City: Singapore, a great destination for traveling abroad!
Import libraries
Import the libraries required for the analysis:
import pandas as pd
import numpy as np
# 2D plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
plt.style.use('fivethirtyeight')
%matplotlib inline
# Interactive charts
import plotly as plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)
# Map making
import folium
import folium.plugins
# NLP: word clouds
import wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Machine learning modeling related
import sklearn
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
Basic data information
Import the data we obtained :
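The loading step itself is not shown in the text. A minimal sketch, assuming the Kaggle CSV has been saved locally under the hypothetical name `listings.csv`; a two-row in-memory sample with made-up values stands in for the file so the snippet runs on its own:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the CSV (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "id,name,neighbourhood_group,room_type,price\n"
    "1,Cozy room near MRT,Central Region,Private room,80\n"
    "2,2BR apartment,East Region,Entire home/apt,150\n"
)

# With the real file this would be: df = pd.read_csv("listings.csv")
# ("listings.csv" is an assumed file name, not taken from the article).
sample_df = pd.read_csv(sample_csv)
print(sample_df.head())
```

With the real file, the resulting DataFrame is what the rest of the article calls `df`.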
View basic information about the data: shape, fields, missing values, etc.
# Data shape
df.shape
(7907, 16)
# Field information
columns = df.columns
columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
'minimum_nights', 'number_of_reviews', 'last_review',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365'],
dtype='object')
The meaning of each field is:
- id: record ID
- name: listing name
- host_id: host ID
- host_name: host name
- neighbourhood: area
- latitude: latitude
- longitude: longitude
- room_type: room type
- price: price
- minimum_nights: minimum number of nights per booking
- number_of_reviews: number of reviews
- last_review: date of the last review
- reviews_per_month: reviews per month
- calculated_host_listings_count: number of listings the host has available
- availability_365: number of days per year the listing is available
With the DataFrame's info() method we can view several pieces of information about the data at once:
Missing values per field:
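The outputs of these two steps are not reproduced here. As a rough sketch of the calls involved, using a small made-up DataFrame (`demo` is a hypothetical stand-in for `df`) that mimics the missing-value pattern described in this article:

```python
import numpy as np
import pandas as pd

# Made-up rows: name has one gap, the two review fields have two each.
demo = pd.DataFrame({
    "name": ["Cozy room", None, "2BR apartment"],
    "price": [80, 120, 150],
    "last_review": ["2019-08-01", None, None],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

demo.info()                      # dtypes and non-null counts per column
missing = demo.isnull().sum()    # number of missing values per column
print(missing[missing > 0])      # keep only the columns that have gaps
```

On the real data the same two calls would be `df.info()` and `df.isnull().sum()`.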
Handling missing values
1. First, check how the missing values are distributed: the figure below shows that the last_review and reviews_per_month fields have missing values.
sns.set(rc={'figure.figsize':(19.7, 8.27)})
sns.heatmap(
df.isnull(),
yticklabels=False,
cbar=False,
cmap='viridis'
)
plt.show()
2. Drop the fields with missing values (the two above), and delete the two rows where name is missing.
The final data has 7905 rows and 14 fields; the raw data has 7907 rows and 16 fields.
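The cleaning code is not shown; one way the two deletions might be written, sketched on a small made-up DataFrame (`demo` and `cleaned` are hypothetical names, and this is not necessarily how the original notebook does it):

```python
import numpy as np
import pandas as pd

# Made-up rows with the same kind of gaps as the real data.
demo = pd.DataFrame({
    "name": ["Cozy room", None, "2BR apartment"],
    "price": [80, 120, 150],
    "last_review": ["2019-08-01", None, None],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

cleaned = (
    demo.drop(columns=["last_review", "reviews_per_month"])  # drop the two sparse fields
        .dropna(subset=["name"])                             # drop rows with no name
)
print(demo.shape, "->", cleaned.shape)
```

Applied to the full dataset, the same pattern removes 2 columns and 2 rows, which matches the 7907 x 16 to 7905 x 14 change reported above.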
EDA
EDA stands for Exploratory Data Analysis; its main goal is to explore how the data is distributed.
Price (price)
Overall, prices are mostly below 1000.
# Histogram; histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df["price"], kde=True)
plt.show()
Let's look at the relationship between price and the minimum number of nights:
sns.scatterplot(
    x="price",
    y="minimum_nights",  # minimum number of nights per booking
    data=df)
plt.show()
The scatter plot also shows that most listings sit where the minimum number of nights is below 200.
Area
View how listings are distributed by area (geographic location): most listings are in the Central Region.
sns.countplot(x="neighbourhood_group", data=df)
plt.show()
The above compares the number of listings in each area; next comes a comparison of prices:
df1 = df[df.price < 250]  # most listings are priced below 250
plt.figure(figsize=(10,6))
sns.boxplot(x = 'neighbourhood_group',
y = 'price',
data=df1
)
plt.title("neighbourhood_group < 250")
plt.show()
From the box plot, for listings in the Central Region:
- Prices are more widely spread
- The average price is also higher than in other areas
- The price distribution has no obviously abnormal values and looks reasonable
The above compares listings by area; their specific longitude and latitude can be plotted as follows:
plt.figure(figsize=(12,8))
sns.scatterplot(x=df.longitude,
                y=df.latitude,
                hue=df.neighbourhood_group)
plt.show()
Heat map of listing distribution
To draw a heat map on top of a geographic map, the folium library is worth learning:
import folium
from folium.plugins import HeatMap
m = folium.Map([1.44255,103.79580],zoom_start=11)
HeatMap(df[['latitude','longitude']].dropna(),
radius=10,
gradient={0.2:'blue',
0.4:'purple',
0.6:'orange',
1.0:'red'}).add_to(m)
display(m)
Room type (room_type)
Proportion of different room types
Count the total number of listings for each of the 3 room types and the corresponding percentages:
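The counting code is not shown. A possible sketch, assuming `room_df` is a Series produced by `value_counts()` (an assumption that fits the later use of `room_df.index` and `room_df.values`); the rows below are made up:

```python
import pandas as pd

# Made-up listings, one row per listing.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
})

room_df = demo["room_type"].value_counts()            # listings per room type
room_pct = (room_df / room_df.sum() * 100).round(2)   # share of each type in percent

summary = pd.DataFrame({"count": room_df, "percent": room_pct})
print(summary)
```

On the real data, `demo` would simply be replaced by `df`.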
Visual comparison of the proportions of the 3 types:
labels = room_df.index
values = room_df.values
fig = go.Figure(data=[go.Pie(labels=labels,
values=values,
hole=0.5
)])
fig.show()
Conclusion: entire homes or apartments account for the largest share of listings, perhaps because they are more popular.
Room types in different areas
plt.figure(figsize=(12,6))
sns.countplot(
data = df,
x="room_type",
hue="neighbourhood_group"
)
plt.title("room types occupied by the neighbourhood_group")
plt.show()
Comparing room types across areas leads to the same conclusion: for every room_type, the Central Region has the most listings.
Personal addition: how can the grouped chart above be drawn with Plotly?
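The Plotly call that follows relies on a `type_group` table whose construction is not shown. One plausible way to build it, assuming it holds one row per (room_type, neighbourhood_group) pair with the listing count in a `number` column; the rows here are made up:

```python
import pandas as pd

# Made-up listings standing in for df.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "neighbourhood_group": ["Central Region", "Central Region", "Central Region", "East Region"],
})

type_group = (
    demo.groupby(["room_type", "neighbourhood_group"])
        .size()                        # listings per (room type, area) pair
        .reset_index(name="number")    # back to a flat table with a count column
)
print(type_group)
```

The column names `room_type`, `neighbourhood_group` and `number` line up with the `x`, `color` and `y` arguments passed to `px.bar`.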
px.bar(type_group,
x="room_type",
y="number",
color="neighbourhood_group",
barmode="group")
Room type and price relationship
plt.figure(figsize=(12,6))
sns.catplot(data=df,x="room_type",y="price")
plt.show()
Personal addition: a version drawn with Plotly
Room name
Overall word cloud
Draw a word cloud from the listing names (name):
from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.name)
wordcloud = WordCloud(
max_words=200,
background_color="white").generate(text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
- 2BR: 2 Bedroom Apartments, i.e. two bedrooms
- MRT: Mass Rapid Transit, Singapore's subway; listings near the subway are probably more common
Keywords in names
The keywords in the names after splitting:
# Put all the listing names into the list names
names = []
for name in df.name:
    names.append(name)

def split_name(name):
    """Split a name into its words."""
    spl = str(name).split()
    return spl

names_count = []
for each in names:  # loop over the list of names
    for word in split_name(each):  # split each name
        word = word.lower()  # normalize to lowercase
        names_count.append(word)  # collect every word in one list

# Counting library
from collections import Counter
result = Counter(names_count).most_common()
result[:5]
top_20 = result[0:20]  # the 20 most frequent words
top_20_words = pd.DataFrame(top_20, columns=["words","count"])
top_20_words
plt.figure(figsize=(10,6))
fig = sns.barplot(data=top_20_words,x="words",y="count")
fig.set_title("Counts of the top 20 used words for listing names")
fig.set_ylabel("Count of words")
fig.set_xlabel("Words")
fig.set_xticklabels(fig.get_xticklabels(), rotation=80)
Review statistics
Check which listings have a high number of reviews:
df1 = df.sort_values(by="number_of_reviews",ascending=False).head(1000)
df1.head()
import folium
from folium.plugins import MarkerCluster
from folium import plugins
print("Rooms with the most number of reviews")
Long=103.91492
Lat=1.32122
mapdf1 = folium.Map([Lat, Long], zoom_start=10)
mapdf1_rooms_map = plugins.MarkerCluster().add_to(mapdf1)
for lat, lon, label in zip(df1.latitude,df1.longitude,df1.name):
    folium.Marker(location=[lat, lon],icon=folium.Icon(icon="home"),
                  popup=label).add_to(mapdf1_rooms_map)
mapdf1.add_child(mapdf1_rooms_map)
Rentable days
Comparison of the number of rentable days per year at different longitudes and latitudes:
plt.figure(figsize=(10,6))
plt.scatter(df.longitude,
df.latitude,
c=df.availability_365,
cmap="spring",
edgecolors="black",
linewidths=1,
alpha=1
)
cbar=plt.colorbar()
cbar.set_label("availability_365")
Personal addition: how can this be drawn with Plotly?
# Plotly version
px.scatter(df,x="longitude",y="latitude",color="availability_365")
Distribution of listings priced below 500:
# Listings with price below 500
plt.figure(figsize=(10,6))
low_500 = df[df.price < 500]
viz1 = low_500.plot(
kind="scatter",
x='longitude',
y='latitude',
label='availability_365',
c='price',
cmap=plt.get_cmap('jet'),
colorbar=True,
alpha=0.4)
viz1.legend()
plt.show()
One more addition: a more concise Plotly version
# Plotly version
px.scatter(low_500,
x='longitude',
y='latitude',
color='price'
)
Linear regression modeling
Preprocessing
For the linear regression model, first drop the fields that are not useful:
df.drop(["name","id","host_name"],inplace=True,axis=1)
Encode the categorical fields:
cols = ["neighbourhood_group","neighbourhood","room_type"]
for col in cols:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    df[col] = le.transform(df[col])
df.head()
Modeling
# Model instantiation
lm = LinearRegression()
# Data sets
X = df.drop("price",axis=1)
y = df["price"]
# Training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm.fit(X_train, y_train)
Test set validation
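The steps between fitting the model and plotting the results are not shown. A hedged sketch of how the predictions, two common error metrics, and the `error_airbnb` table used by the plotting code might be produced; the features and prices below are synthetic, and the exact construction in the original notebook may differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature table (made-up numbers).
rng = np.random.default_rng(101)
X = pd.DataFrame({
    "room_type": rng.integers(0, 3, size=200),
    "minimum_nights": rng.integers(1, 30, size=200),
})
y = pd.Series(50 + 40 * X["room_type"] + rng.normal(0, 10, size=200), name="price")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm = LinearRegression().fit(X_train, y_train)

# Predict on the held-out set and score the predictions.
predictions = lm.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:.2f}  R2: {r2:.3f}")

# Side-by-side table of actual and predicted prices; the column names
# match the ones the plotting code reads from error_airbnb.
error_airbnb = pd.DataFrame({
    "Actual": np.array(y_test).flatten(),
    "Predict": predictions.flatten(),
})
print(error_airbnb.head())
```

For a grouped bar chart it may help to plot only the first few rows, for example `error_airbnb.head(20)`, so that individual bars stay readable.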
title=['Pred vs Actual']
fig = go.Figure(data=[
go.Bar(name='Predicted',
x=error_airbnb.index,
y=error_airbnb['Predict']),
go.Bar(name='Actual',
x=error_airbnb.index,
y=error_airbnb['Actual'])
])
fig.update_layout(barmode='group')
fig.show()
Personal addition: compare the predicted values with the actual values and store the difference between the two in a new field, diff
error_airbnb["diff"] = error_airbnb["Predict"] - error_airbnb["Actual"]
px.box(error_airbnb,y="diff")
The diff values show that the actual and predicted values differ greatly for some records.
The describe() summary of diff shows the same thing: some differences are as large as 6820 (in absolute value), which are outliers; the median is -19, a gap of 19, so on the whole the two are relatively close.
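The describe() output itself is not reproduced here. A small sketch of the call on made-up numbers (`error_demo` is a hypothetical table, not the article's data), showing how one extreme row dominates the minimum while the median stays small:

```python
import pandas as pd

# Made-up actual and predicted prices; the last row is a deliberate outlier.
error_demo = pd.DataFrame({
    "Actual": [100, 80, 150, 60, 900],
    "Predict": [110, 95, 120, 70, 200],
})
error_demo["diff"] = error_demo["Predict"] - error_demo["Actual"]

summary = error_demo["diff"].describe()      # count, mean, std, min, quartiles, max
largest_gap = error_demo["diff"].abs().max() # biggest miss regardless of sign
print(summary)
print("largest absolute difference:", largest_gap)
```

Reading the 50% row alongside min and max is what separates the typical error from the few extreme ones.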