pyspark on hpc
2022-06-23 22:50:00 【flavorfan】
Our local in-house cluster has limited resources, and even a simple data-processing job took 3 days. The HPC system has plenty of computing resources, so, in the spirit of emptying the shared pot before touching my own bowl, I decided to make full use of the shared resources first. A quick investigation showed that this is not very complicated.
1 Approach
Run Spark in local mode
Spark standalone mode involves multi-node communication and is fairly complex. Instead, the data can be split into shards and processed by several parallel tasks, each handled by its own independent Spark local-mode instance; this avoids building a complex cluster. It is achieved by requesting multiple nodes, multiple CPUs, and more memory.
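The sharding idea above can be sketched as follows. This is a minimal illustration, not the author's actual job script: the helper names `select_shard` and `shard_from_env` are hypothetical, and reading the task index from `SLURM_ARRAY_TASK_ID` / `SLURM_ARRAY_TASK_COUNT` assumes a Slurm job array numbered from 0, which the post does not state.

```python
import os


def select_shard(paths, num_shards, shard_index):
    """Return the input paths assigned to one shard.

    Paths are sorted first so that every task sees the same order,
    then dealt out round-robin: shard i gets items i, i+n, i+2n, ...
    """
    if num_shards < 1 or not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return sorted(paths)[shard_index::num_shards]


def shard_from_env(env=None):
    """Read (shard_index, num_shards) from the scheduler environment.

    ASSUMPTION: a Slurm job array submitted as --array=0-(N-1).
    Falls back to a single shard when the variables are absent.
    """
    env = os.environ if env is None else env
    index = int(env.get("SLURM_ARRAY_TASK_ID", 0))
    count = int(env.get("SLURM_ARRAY_TASK_COUNT", 1))
    return index, count
```

Each task would then start its own local-mode SparkContext and process only the paths returned by `select_shard`, so no communication between tasks is needed.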
Make the Python environment able to find pyspark
This is essentially done through environment variables. They can be set in one of two ways: in the Python code itself, or in .bashrc / a shell script.
2 Steps
1) Install Spark (just unpack it)
Unpack spark-3.1.2-bin-hadoop3.2.tgz into a directory under your home, for example /users/[username]/tools/
I used a symbolic link so that I can switch between Spark versions later
cd /users/[username]/tools/
tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz
ln -s spark-3.1.2-bin-hadoop3.2 spark
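Switching versions later means repointing that link. A minimal Python sketch of one way to do it, assuming a POSIX file system; the function name `repoint_symlink` is hypothetical and not part of Spark:

```python
import os


def repoint_symlink(link_path, new_target):
    """Make link_path point at new_target, replacing any existing link.

    A temporary link is created next to link_path and then renamed over
    it, so there is no moment at which link_path does not exist.
    """
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)          # clear a leftover from an earlier run
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # renames the link itself, not its target


# Example call (the second path is a placeholder for another unpacked release):
# repoint_symlink("/users/[username]/tools/spark", "<other unpacked Spark directory>")
```

Because SPARK_HOME refers to the link rather than to a versioned directory, nothing else in the configuration has to change after a switch.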
2) Configure the environment in Python code in order to use pyspark
The following environment setup and test code ran successfully both in a .py file and in Jupyter.
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/users/[username]/miniconda3/bin/python"
os.environ["SPARK_HOME"] = "/users/[username]/tools/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")  # py4j version bundled with Spark 3.1.2
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
# test code
import random
from pyspark import SparkContext
sc = SparkContext(appName="myAppName")
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
3) Configure pyspark through .bashrc or a script
Create myspark.sh:
#!/bin/sh
export SPARK_HOME='/users/[username]/tools/spark'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON="/users/[username]/miniconda3/bin/python"
Put this in .bashrc and the Python configuration above is no longer needed; pyspark can then be used transparently.
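The Python configuration in step 2 hard-codes the py4j archive name, which changes between Spark releases. If you keep the in-code approach and switch versions through the symbolic link, one option is to discover the archive instead. The sketch below is an illustration under that assumption; `add_pyspark_to_path` is a hypothetical helper, not a Spark API.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put the pyspark and py4j archives shipped with Spark on sys.path.

    Looks in <spark_home>/python/lib, the same directory used in step 2,
    and returns the py4j archive that was selected.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips or not os.path.isfile(pyspark_zip):
        raise FileNotFoundError("no pyspark/py4j archives under " + lib_dir)

    py4j_zip = py4j_zips[-1]  # a release normally ships exactly one
    for entry in (py4j_zip, pyspark_zip):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    os.environ["SPARK_HOME"] = spark_home
    return py4j_zip


# Example call, using the path from this post:
# add_pyspark_to_path("/users/[username]/tools/spark")
```

After the call, `from pyspark import SparkContext` should resolve against whichever Spark release the link currently points to.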