[PySpark basics] Rows to columns and columns to rows (with multiple columns)
2022-07-24 21:36:00 【Evening scenery at the top of the mountain】
1. The problem
We have a PySpark DataFrame with a user_id field and k item_id columns. The goal is the classic SQL-style pivot (rows to columns) and unpivot (columns to rows), i.e. ending up with one row per (user_id, item_id) pair. The current schema can be inspected with df.printSchema():
root
|-- user_id: double (nullable = true)
|-- beat_id[0]: double (nullable = true)
|-- beat_id[1]: double (nullable = true)
|-- beat_id[2]: double (nullable = true)
|-- beat_id[3]: double (nullable = true)
.......
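For concreteness, here is a minimal sketch (hypothetical toy data, not the original pipeline) of one way such a schema arises: selecting array elements by index makes Spark auto-name the result columns beat_id[0], beat_id[1], and so on.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('demo').getOrCreate()

# Hypothetical toy data: one user with an array of four item ids
arr = spark.createDataFrame(
    [(19079423.0, [1018216.0, 886351.0, 1051107.0, 1018226.0])],
    ['user_id', 'beat_id'])

# Indexing the array column yields columns named beat_id[0] ... beat_id[3]
df = arr.select('user_id', *[arr['beat_id'][i] for i in range(4)])
df.printSchema()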
2. Method 1
Let's start with an example. The part that may be confusing is the stack inside selectExpr: think of it as "stacking" the chosen source columns on top of each other, and the as (...) clause that follows renames the resulting label/value columns (here project and income):
# test_example
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('JupyterPySpark').enableHiveSupport().getOrCreate()

# Raw data
test = spark.createDataFrame(
    [('2018-01', 'project1', 100), ('2018-01', 'project2', 200), ('2018-01', 'project3', 300),
     ('2018-02', 'project1', 1000), ('2018-02', 'project2', 2000), ('2018-03', 'project4', 999),
     ('2018-05', 'project1', 6000), ('2018-05', 'project2', 4000), ('2018-05', 'project4', 1999)],
    ['month', 'project', 'income'])
# test.show()

# 1. Pivot: rows to columns
test_pivot = test.groupBy('month') \
    .pivot('project', ['project1', 'project2', 'project3', 'project4']) \
    .agg(F.sum('income')) \
    .fillna(0)
test_pivot.show()

# 2. Unpivot: columns back to rows (inverse pivot)
unpivot_test = test_pivot.selectExpr(
        "month",
        "stack(4, 'project1', project1, 'project2', project2, "
        "'project3', project3, 'project4', project4) as (project, income)") \
    .filter("income > 0") \
    .orderBy(["month", "project"])
unpivot_test.show()
+-------+--------+--------+--------+--------+
|  month|project1|project2|project3|project4|
+-------+--------+--------+--------+--------+
|2018-03|       0|       0|       0|     999|
|2018-02|    1000|    2000|       0|       0|
|2018-05|    6000|    4000|       0|    1999|
|2018-01|     100|     200|     300|       0|
+-------+--------+--------+--------+--------+
+-------+--------+------+
|  month| project|income|
+-------+--------+------+
|2018-01|project1|   100|
|2018-01|project2|   200|
|2018-01|project3|   300|
|2018-02|project1|  1000|
|2018-02|project2|  2000|
|2018-03|project4|   999|
|2018-05|project1|  6000|
|2018-05|project2|  4000|
|2018-05|project4|  1999|
+-------+--------+------+
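A note on stack itself: stack(n, label1, value1, ..., labeln, valuen) emits n output rows per input row, pairing each label with the value expression next to it. A tiny standalone illustration (toy data, reusing the spark session created above):
demo = spark.createDataFrame([(1, 10, 20)], ['id', 'a', 'b'])
demo.selectExpr("id", "stack(2, 'a', a, 'b', b) as (name, val)").show()
# +---+----+---+
# | id|name|val|
# +---+----+---+
# |  1|   a| 10|
# |  1|   b| 20|
# +---+----+---+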
3. Solution
- Same idea, but if the column names are Chinese (or contain special characters), they must be wrapped in backtick (``) symbols inside the stack expression. Those backticks then leak into the label values, which hurts later processing, so it is best to strip them afterwards with a regular expression.
- If many columns need to be transposed, a generic implementation is all the more necessary; it is defined below as an unpivot function.
from pyspark.sql.functions import col, regexp_replace

def unpivot(df, keys, feature, value):
    '''
    df:      the DataFrame to unpivot
    keys:    primary-key columns to keep, passed as a list
    feature: name of the output label column (customizable)
    value:   name of the output value column (customizable)
    '''
    # Cast everything to double so the stacked value columns share one type
    # (string works too); if the types are already consistent, this line can be omitted
    df = df.select(*[col(x).astype("double") for x in df.columns])
    cols = [x for x in df.columns if x not in keys]
    # join with ',' to chain all the ('`x`', `x`) label/value pairs
    stack_str = ','.join(map(lambda x: "'`%s`', `%s`" % (x, x), cols))
    df = (df.selectExpr(*keys,
                        "stack(%s, %s) as (%s, %s)" % (len(cols), stack_str, feature, value))
            .withColumn(feature, regexp_replace(feature, '`', '')))
    return df
keys = ['user_id']
feature, value = 'features', 'beat_id'
df_result3 = unpivot(df_result2, keys, feature, value)
df_result3.show()
+-----------+-----------+---------+
|    user_id|   features|  beat_id|
+-----------+-----------+---------+
|1.9079423E7| beat_id[0]|1018216.0|
|1.9079423E7| beat_id[1]| 886351.0|
|1.9079423E7| beat_id[2]|1051107.0|
|1.9079423E7| beat_id[3]|1018226.0|
+-----------+-----------+---------+
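As an aside, not part of the original approach: Spark 3.4+ ships a built-in DataFrame.unpivot that covers the same need without hand-building a stack expression, and no backtick stripping is required since it takes column names directly. A minimal sketch, assuming Spark >= 3.4 and the same df_result2 as above:
df_long = df_result2.unpivot(
    ids=['user_id'],
    values=[c for c in df_result2.columns if c != 'user_id'],
    variableColumnName='features',
    valueColumnName='beat_id',
)
df_long.show()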