
Plotly+seaborn+folium: a visual exploration of Airbnb rental housing data

2022-06-23 06:36:00 【PIDA】

Author: Peter  Editor: Peter

Hello everyone, I am Peter~

Airbnb is short for AirBed and Breakfast ("Air-b-n-b"); its Chinese name translates roughly as "air accommodation". It is a service website that connects travelers with hosts renting out vacant homes, and it provides users with a wide range of accommodation information.

This article explores and analyzes a Kaggle dataset of Airbnb listings in Singapore. The original notebook is at: https://www.kaggle.com/bavalpreet26/singapore-airbnb/notebook

Global Airbnb rental data is also published for reference on the Inside Airbnb website. Data address: http://insideairbnb.com/get-the-data.html

The site covers many cities, including Beijing and Shanghai in China, and the data is free to download. Interested readers can experiment with it.

This article focuses on Singapore, the Garden City (also known as the Lion City), which is a great destination for travel abroad!

Import library

Import the libraries required for data analysis:

import pandas as pd
import numpy as np

# 2D plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
plt.style.use('fivethirtyeight')
%matplotlib inline

# Interactive plotting
import plotly as plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)

#  Map making 
import folium
import folium.plugins

# NLP: word clouds
import wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


#  Machine learning modeling related 
import sklearn
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

Basic data information

Import the data we obtained ：
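The loading code is not shown in the article. A minimal sketch of this step, assuming the Kaggle file is a plain CSV: the file name `listings.csv` and the tiny in-memory sample are assumptions for illustration, not the author's actual data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file. In practice you would pass
# the path of the Kaggle CSV, e.g. pd.read_csv("listings.csv") (assumed name).
csv_text = io.StringIO(
    "id,name,neighbourhood_group,room_type,price\n"
    "1,Cozy room near MRT,Central Region,Private room,80\n"
    "2,2BR apartment,East Region,Entire home/apt,150\n"
)

df_demo = pd.read_csv(csv_text)  # parse the CSV into a DataFrame
print(df_demo.head())            # quick look at the first rows
```

With the real file, the resulting DataFrame is what the rest of the article calls `df`.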

Check the basic information of the data: shape, fields, missing values, etc.

#  Data shape 
df.shape

(7907, 16)

#  Field information 
columns = df.columns 
columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

The meaning of each field is:

id: Record ID
name: Listing name
host_id: Host ID
host_name: Host name
neighbourhood_group: Region (for example, Central Region)
neighbourhood: Area
latitude: Latitude
longitude: Longitude
room_type: Room type
price: Price
minimum_nights: Minimum number of nights per booking
number_of_reviews: Number of reviews
last_review: Date of the last review
reviews_per_month: Reviews per month
calculated_host_listings_count: Number of listings owned by the host
availability_365: Number of days per year the listing is available

Using the DataFrame's info() method, we can view several pieces of information about the data:

Missing values per field:
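The outputs for these two steps are not reproduced here. As a rough sketch of how they can be obtained, the snippet below runs `info()` and `isnull().sum()` on a small made-up frame (`demo` and its values are assumptions standing in for `df`).

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "name": ["Room A", None, "Room C"],
    "price": [80, 120, 95],
    "last_review": ["2019-08-01", None, None],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

demo.info()                    # dtypes and non-null counts per column
missing = demo.isnull().sum()  # number of missing values per column
print(missing[missing > 0])    # show only the columns that have gaps
```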

Missing value processing

1. First, check how missing values are distributed across the fields. The heatmap produced by the code below shows that the last_review and reviews_per_month fields have missing values.

sns.set(rc={'figure.figsize':(19.7, 8.27)})
sns.heatmap(
  df.isnull(),
  yticklabels=False,
  cbar=False,
  cmap='viridis'
)

plt.show()

2. Drop the two fields with missing values (the two named above), and delete the two records whose name field is missing.

The final data has 7905 rows and 14 fields. The raw data had 7907 rows and 16 fields.
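The cleaning code itself is not shown. One way to perform the step described, sketched on a made-up frame (`demo` is an assumption standing in for `df`), is to drop the two sparse columns and then drop rows with a missing `name`:

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "name": ["Room A", None, "Room C"],
    "price": [80, 120, 95],
    "last_review": ["2019-08-01", None, None],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

cleaned = (
    demo.drop(columns=["last_review", "reviews_per_month"])  # drop the two sparse fields
        .dropna(subset=["name"])                             # drop rows without a name
)
print(demo.shape, "->", cleaned.shape)
```

On the real data the same two calls would take the frame from 16 fields to 14 and remove the rows with a missing name.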

Data EDA

EDA stands for Exploratory Data Analysis; its main purpose is to explore how the data is distributed.

Price (price)

Overall, most prices are below 1000.

sns.histplot(df["price"], kde=True)  # histogram with a density curve (distplot is deprecated in recent seaborn)
plt.show()

Let's look at the relationship between price and the minimum number of nights:

sns.scatterplot(
    x="price",
    y="minimum_nights",  # minimum number of nights per booking
    data=df)

plt.show()

The scatter plot also shows that most listings, across the price range, require a minimum stay of fewer than 200 nights.
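Reading shares off a plot is imprecise, so it can help to compute them directly. The sketch below uses made-up numbers (`demo` is an assumption, not the real data) to show how the share of listings under a price or minimum-nights threshold can be checked.

```python
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "price": [60, 85, 120, 150, 2000],
    "minimum_nights": [1, 2, 3, 90, 365],
})

# A boolean mask averaged gives the fraction of rows that satisfy it
share_price = (demo["price"] < 1000).mean()
share_nights = (demo["minimum_nights"] < 200).mean()

print(f"price < 1000: {share_price:.0%}")
print(f"minimum_nights < 200: {share_nights:.0%}")
```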

Area

Look at how listings are distributed across regions (geographic location): most listings are in the Central Region.

sns.countplot(x="neighbourhood_group", data=df)  # keyword arguments work across seaborn versions
plt.show()

The above compares the number of listings in each region; next, compare prices across regions.

df1 = df[df.price < 250]   # most listings are priced below 250
plt.figure(figsize=(10,6))

sns.boxplot(x = 'neighbourhood_group',
            y = 'price',
            data=df1
           )

plt.title("neighbourhood_group < 250")

plt.show()

Observations from the box plot for listings in the Central Region:

Prices are more widely spread
The typical price is also higher than in other regions
The price distribution shows no unreasonable values compared with the other regions
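The box plot comparison can be backed by a numeric summary. A minimal sketch, using made-up rows (`demo` is an assumption standing in for `df`), that groups by region and reports count, median and mean price for listings under 250:

```python
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "neighbourhood_group": [
        "Central Region", "Central Region", "Central Region",
        "East Region", "East Region", "North Region",
    ],
    "price": [90, 150, 240, 60, 100, 70],
})

subset = demo[demo["price"] < 250]  # same filter as the box plot
summary = (
    subset.groupby("neighbourhood_group")["price"]
          .agg(["count", "median", "mean"])  # a box plot's center line is the median
)
print(summary.sort_values("median", ascending=False))
```

Note that a box plot shows the median rather than the mean, so both are reported here.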

The above compares listings by region; below, plot them by their actual longitude and latitude:

plt.figure(figsize=(12,8))

sns.scatterplot(x=df.longitude,  # keyword arguments: newer seaborn rejects positional x/y
               y=df.latitude,
               hue=df.neighbourhood_group)

plt.show()

Heat map of house supply distribution

To draw a heat map of the geographic locations, you can use the folium library:

import folium
from folium.plugins import HeatMap

m = folium.Map([1.44255,103.79580],zoom_start=11)

HeatMap(df[['latitude','longitude']].dropna(),
        radius=10,
        gradient={0.2:'blue',
                  0.4:'purple',
                  0.6:'orange',
                  1.0:'red'}).add_to(m)
display(m)

Room type (room_type)

Proportion of different room types

Count the total number of listings for each of the 3 room types and the corresponding percentage:
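The counting code is not shown in the article. A plausible sketch, assuming `room_df` is simply the result of `value_counts()` (which matches how `room_df.index` and `room_df.values` are used in the pie chart), on made-up data:

```python
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Private room", "Entire home/apt",
        "Shared room", "Private room", "Entire home/apt",
    ]
})

room_df = demo["room_type"].value_counts()  # count per room type
room_pct = (demo["room_type"].value_counts(normalize=True) * 100).round(1)  # share in percent

print(room_df)
print(room_pct)
```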

Compare the proportions of these 3 types visually:

labels = room_df.index
values = room_df.values

fig = go.Figure(data=[go.Pie(labels=labels,
                             values=values,
                             hole=0.5
                            )])

fig.show()

Conclusion: entire homes/apartments make up the largest share of listings, perhaps because they are more popular.

Room types in different areas

plt.figure(figsize=(12,6))

sns.countplot(
    data = df,
    x="room_type",
    hue="neighbourhood_group"
)

plt.title("room types occupied by the neighbourhood_group")
plt.show()

Comparing room types across regions leads to the same conclusion: for every room_type, the Central Region has the most listings.

Personal addition: how can the grouped chart above be drawn with Plotly?

px.bar(type_group,
       x="room_type",
       y="number",
       color="neighbourhood_group",
       barmode="group")
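The `px.bar` call above relies on a `type_group` table whose construction is not shown. One way it could be built, assuming it holds one row per (room_type, neighbourhood_group) pair with the count in a `number` column, is sketched here on made-up data:

```python
import pandas as pd

# Made-up sample standing in for the real df
demo = pd.DataFrame({
    "room_type": [
        "Private room", "Private room", "Entire home/apt",
        "Entire home/apt", "Shared room",
    ],
    "neighbourhood_group": [
        "Central Region", "Central Region", "Central Region",
        "East Region", "Central Region",
    ],
})

type_group = (
    demo.groupby(["room_type", "neighbourhood_group"])
        .size()                        # listings per (room type, region) pair
        .reset_index(name="number")    # column name expected by the px.bar call
)
print(type_group)
```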

Room type and price relationship

# catplot is figure-level and creates its own figure, so set the size via height/aspect
sns.catplot(data=df,x="room_type",y="price",height=6,aspect=2)

plt.show()

Personal addition: a version drawn with Plotly

Room name

Overall word cloud

Draw a word cloud based on the name field of the listings:

from wordcloud import WordCloud, ImageColorGenerator
text = " ".join(str(each) for each in df.name)

wordcloud = WordCloud(
    max_words=200,
    background_color="white").generate(text)

plt.figure(figsize=(15,10))  # a single figure is enough; the duplicate call created an empty one
plt.imshow(wordcloud, interpolation="bilinear")

plt.axis("off")
plt.show()

2BR: 2-bedroom apartment
MRT: Mass Rapid Transit, Singapore's subway system; listings near the MRT may be more numerous

Keywords in name

Keywords in the names after splitting them into words:

# Put all listing names into the list `names`
names = []
for name in df.name:
    names.append(name)

def split_name(name):
    """Split a listing name into words."""
    spl = str(name).split()
    return spl


names_count = []
for each in names:  # loop over the list of names
    for word in split_name(each):  # split each name into words
        word = word.lower()  # normalize to lowercase
        names_count.append(word)  # collect every word in one list

# Count word frequencies
from collections import Counter
result = Counter(names_count).most_common()
result[:5]

top_20 = result[0:20]  # the 20 most frequent words

top_20_words = pd.DataFrame(top_20, columns=["words","count"])
top_20_words
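The same word count can also be written with pandas string methods instead of explicit loops. This is an alternative sketch, not the author's code, shown on a made-up `names` Series; unlike the loop version, it drops missing names rather than counting the string "nan".

```python
import pandas as pd

# Made-up sample standing in for df.name
names = pd.Series(["Cozy Room near MRT", "Spacious 2BR near MRT", None])

word_counts = (
    names.dropna()        # skip missing names
         .str.lower()     # normalize to lowercase
         .str.split()     # split each name on whitespace
         .explode()       # one row per word
         .value_counts()  # frequency of each word, most common first
)

# Same two-column layout as top_20_words
top_words = word_counts.head(20).rename_axis("words").reset_index(name="count")
print(top_words)
```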

plt.figure(figsize=(10,6))

fig = sns.barplot(data=top_20_words,x="words",y="count")
fig.set_title("Counts of the top 20 used words for listing names")
fig.set_ylabel("Count of words")
fig.set_xlabel("Words")
fig.set_xticklabels(fig.get_xticklabels(), rotation=80)

Review statistics

Check which listings have the highest number of reviews:

df1 = df.sort_values(by="number_of_reviews",ascending=False).head(1000)

df1.head()

import folium
from folium.plugins import MarkerCluster
from folium import plugins

print("Rooms with the most number of reviews")

Long=103.91492
Lat=1.32122

mapdf1 = folium.Map([Lat, Long], zoom_start=10)

mapdf1_rooms_map = plugins.MarkerCluster().add_to(mapdf1)

for lat, lon, label in zip(df1.latitude,df1.longitude,df1.name):
    folium.Marker(location=[lat, lon],icon=folium.Icon(icon="home"),
                 popup=label).add_to(mapdf1_rooms_map)

mapdf1.add_child(mapdf1_rooms_map)

Available days

Compare the number of available days per year across different longitudes and latitudes:

plt.figure(figsize=(10,6))

plt.scatter(df.longitude,
            df.latitude,
            c=df.availability_365,
            cmap="spring",
            edgecolors="black",
            linewidths=1,
            alpha=1
           )

cbar=plt.colorbar()
cbar.set_label("availability_365")

Personal addition: how can this be drawn with Plotly?

# Plotly version
px.scatter(df,x="longitude",y="latitude",color="availability_365")

Distribution of listings priced below 500:

# Listings with price below 500
low_500 = df[df.price < 500]

viz1 = low_500.plot(
  kind="scatter",
  x='longitude',
  y='latitude',
  label='availability_365',
  c='price',
  cmap=plt.get_cmap('jet'),
  colorbar=True,
  alpha=0.4,
  figsize=(10,6))  # set the size here; a separate plt.figure() would only create an empty figure
viz1.legend()
plt.show()

Addition: a more concise Plotly version

# Plotly version
px.scatter(low_500,
           x='longitude',
           y='latitude',
           color='price'
          )

Linear regression modeling

Preprocessing

The model is based on linear regression. First, delete the fields that are not useful for modeling:

df.drop(["name","id","host_name"],inplace=True,axis=1)

Encode the categorical fields as integers:

cols = ["neighbourhood_group","neighbourhood","room_type"]

for col in cols:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    df[col] = le.transform(df[col])
    
df.head()

Modeling

#  Model instantiation 
lm = LinearRegression()

# Features and target
X = df.drop("price",axis=1)
y = df["price"]

#  Training set and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

lm.fit(X_train, y_train)

Test set validation
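The code that produces `error_airbnb` is not shown. A minimal sketch of how such a table could be built, assuming it simply pairs the test-set targets with the model's predictions in `Actual` and `Predict` columns; the synthetic features and target below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Airbnb features and price
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=200),
    "availability_365": rng.integers(0, 366, size=200),
})
y_demo = (50 + 2.0 * X_demo["minimum_nights"]
          + 0.1 * X_demo["availability_365"]
          + rng.normal(0, 5, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)

lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)  # predicted prices for the test set

# Side-by-side table of actual and predicted values
error_airbnb = pd.DataFrame({
    "Actual": np.array(y_test).flatten(),
    "Predict": predictions.flatten(),
})

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("MAE:", mae)
print("R2:", r2)
```

The two metrics were already imported at the top of the article; the scores printed here describe the synthetic data, not the Airbnb model.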


title=['Pred vs Actual']

fig = go.Figure(data=[
    go.Bar(name='Predicted',
           x=error_airbnb.index,
           y=error_airbnb['Predict']),
    
    go.Bar(name='Actual',
           x=error_airbnb.index,
           y=error_airbnb['Actual'])
])

fig.update_layout(barmode='group')
fig.show()

Personal addition: compare the predicted values with the actual values by computing their difference, diff, as a new field

error_airbnb["diff"] = error_airbnb["Predict"] - error_airbnb["Actual"]
px.box(error_airbnb,y="diff")

The diff values show that the actual and predicted values differ greatly for some records.

The output of describe() shows the same thing: some predictions are off by as much as 6820 (in absolute value), which is an outlier; the median difference is -19, an absolute gap of 19, so on the whole the two are relatively close.
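The describe() output is not reproduced in the article. The sketch below shows how such a summary is obtained; the five rows are made-up values chosen only to echo the pattern described (one extreme error and a small median), not the model's real results.

```python
import pandas as pd

# Made-up actual/predicted pairs, for illustration only
error_demo = pd.DataFrame({
    "Actual": [100, 150, 80, 7000, 120],
    "Predict": [110, 130, 95, 180, 101],
})
error_demo["diff"] = error_demo["Predict"] - error_demo["Actual"]

stats = error_demo["diff"].describe()  # count, mean, std, min, quartiles, max
print(stats)
print("largest absolute error:", error_demo["diff"].abs().max())
```

The median appears in the `50%` row of the summary, and a single extreme value mostly affects `min`/`max`, `mean` and `std` rather than the quartiles.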

Original site

Copyright notice
This article was created by [PIDA]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/01/202201130839131384.html