Iteratively Loading Datasets with XGBoost
2022-06-27 06:35:00 【Datawhale】
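Datawhale Essentials
Source: Coggle数据科学
When training on a large dataset, reading the data iteratively, batch by batch, is a natural choice. PyTorch supports this style of loading, and XGBoost does too. This post walks through the two ways XGBoost can load data iteratively.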
Reading data from memory
import numpy as np
import pandas as pd
import xgboost as xgb

class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.batch_size = batch_size
        # Number of batches, counting a possibly shorter final batch
        self.batches = int(np.ceil(len(df) / self.batch_size))
        self.it = 0  # set iterator to 0
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0  # Return 0 when there's no more batch.
        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = pd.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target])  #, weight=dt['weight'])
        self.it += 1
        return 1

Usage (this approach is better suited to GPU training):
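# train, train_idx and FEATURES are assumed to be defined earlier (the source does not show them).
Xy_train = IterLoadForDMatrix(train.loc[train_idx], FEATURES, 'target')
# Note: since XGBoost 1.7, DeviceQuantileDMatrix is deprecated in favor of xgb.QuantileDMatrix.
dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)

Reference: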
https://xgboost.readthedocs.io/en/latest/python/examples/quantile_data_iterator.html
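The slicing arithmetic inside `next` is easy to get wrong at the final, shorter batch. As a minimal sketch, the helper below (`batch_bounds` is a hypothetical name, not part of XGBoost or the original post) reproduces the same `a`/`b` computation with only the standard library, so the boundaries can be checked without loading any data.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) row offsets of every batch.

    Mirrors the a/b computation in IterLoadForDMatrix.next: the number of
    batches is ceil(n_rows / batch_size) and the last stop is clamped to n_rows.
    """
    n_batches = math.ceil(n_rows / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_rows))
        for i in range(n_batches)
    ]


# 10 rows in batches of 4: two full batches and one batch of 2 rows.
print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each `(start, stop)` pair corresponds to one `self.df.iloc[a:b]` slice, so the batches cover every row exactly once.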
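Iterating over external data
import os
from typing import Callable, List

import xgboost
from sklearn.datasets import load_svmlight_file

class Iterator(xgboost.DataIter):
    def __init__(self, svm_file_paths: List[str]):
        self._file_paths = svm_file_paths
        self._it = 0
        # cache_prefix tells XGBoost where to write its on-disk cache
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data: Callable):
        if self._it == len(self._file_paths):
            # return 0 to let XGBoost know this is the end of iteration
            return 0
        X, y = load_svmlight_file(self._file_paths[self._it])
        # Pass the batch by keyword, matching the in-memory example above
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        """Reset the iterator to its beginning"""
        self._it = 0

Usage (this approach is better suited to CPU training):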
it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
Xy = xgboost.DMatrix(it)
# Other tree methods, including ``hist`` and ``gpu_hist``, also work but have some caveats,
# as noted in the XGBoost external-memory tutorial.
booster = xgboost.train({"tree_method": "approx"}, Xy)

Reference:
https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html
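Both iterators above follow the same contract: `next` hands one batch to the `input_data` callback and returns 1, returns 0 once the data is exhausted, and `reset` rewinds so the data can be read again. The sketch below imitates that contract with the standard library only. `ListBatchIter` and `drain` are hypothetical stand-ins written for illustration, not XGBoost classes, and the two-pass loop is only an assumption about how a consumer might drive the iterator.

```python
class ListBatchIter:
    """Stand-in with the same reset()/next() shape as the iterators above."""

    def __init__(self, batches):
        self._batches = list(batches)  # each item is an (X, y) pair
        self._it = 0

    def reset(self):
        """Rewind to the first batch."""
        self._it = 0

    def next(self, input_data):
        if self._it == len(self._batches):
            return 0  # no more batches
        X, y = self._batches[self._it]
        input_data(data=X, label=y)
        self._it += 1
        return 1


def drain(it, passes=2):
    """Hypothetical consumer: run full passes and count the rows seen in each."""
    rows_per_pass = []
    for _ in range(passes):
        it.reset()
        seen = 0

        def input_data(data, label):
            nonlocal seen
            assert len(data) == len(label)
            seen += len(data)

        while it.next(input_data) == 1:
            pass
        rows_per_pass.append(seen)
    return rows_per_pass


batches = [([[1, 2], [3, 4]], [0, 1]), ([[5, 6]], [1])]
print(drain(ListBatchIter(batches)))  # [3, 3]
```

If `reset` forgot to rewind the counter, the second pass would see zero rows, which is an easy way to check a custom iterator before handing it to XGBoost.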
