当前位置:网站首页>Kaggle 专利匹配比赛赛后总结
Kaggle 专利匹配比赛赛后总结
2022-06-24 22:23:00 【致Great】
比赛简介
在专利匹配数据集中,选手需要判断两个短语的相似度,一个是anchor ,一个是target
,然后输出两者在不同语义(context)的相似度,范围是0-1,我们队伍id为xlyhq,a榜rank 13,b榜ran12,非常感谢@heng zheng、@pythonlan,@leolu1998,@syzong四位队友的努力和付出,最后比较幸运的狗到金牌。
和其他前排核心思路差不多,我们在这里主要分享下我们的比赛历程以及相关实验的具体结果,以及有意思的尝试
文本处理
数据集主要有anchor、target和context字段,另外有额外的文本拼接信息,在比赛过程中我们主要是尝试了以下拼接的尝试:
- v1:test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
- v2:test['anchor'] + '[SEP]' + test['target'] + '[SEP]' +test['context']+ '[SEP]' + test['context_text'],相当于直接把A47类似编码拼接上去
- v3:test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'] 获取更多的文本进行拼接,相当于把A47下面的子类别拼接上去,比如A47B,A47C
context_mapping = {
"A": "Human Necessities",
"B": "Operations and Transport",
"C": "Chemistry and Metallurgy",
"D": "Textiles",
"E": "Fixed Constructions",
"F": "Mechanical Engineering",
"G": "Physics",
"H": "Electricity",
"Y": "Emerging Cross-Sectional Technologies",
}
titles = pd.read_csv('./input/cpc-codes/titles.csv')
def process(text):
return re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]", "", text)
def get_context(cpc_code):
cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code))]
texts = cpc_data['title'].values.tolist()
texts = [process(text) for text in texts]
return ";".join([context_mapping[cpc_code[0]]] + texts)
def get_cpc_texts():
cpc_texts = dict()
for code in tqdm(train['context'].unique()):
cpc_texts[code] = get_context(code)
return cpc_texts
cpc_texts = get_cpc_texts()这个拼接方式可以得到不小的提升,但是文本长度变得更长,最大长度设置为300,导致训练更慢
- v4:核心的拼接方式:test['text'] = test['text'] + '[SEP]' + test['target_info']
# 拼接target info
test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()
del target_info['target']
test=test.merge(target_info,on=['anchor','context'],how='left')
test['text'] = test['text'] + '[SEP]' + test['target_info']
test.head()这种拼接方式可以让模型cv和lb分数得到较大提升,通过v3和v4两种不同拼接方式的对比,我们可以发现选取质量更高的文本进行拼接对模型更有提升作用,v3方式中有很多冗余信息,而v4方式中有很多实体级别的关键信息。
数据划分
在比赛过程中,我们尝试了不同的数据划分方式,其中包括:
- StratifiedGroupKFold,这种拼接方式cv与lb线差比较小,分数稍微好一点
- StratifiedKFold:线下cv比较高
- 其他Kfold和GrouFold效果不好
损失函数
主要可以参考的损失函数有:
- BCE: nn.BCEWithLogitsLoss(reduction="mean")
- MSE:nn.MSELoss()
- Mixture Loss:MseCorrloss
class CorrLoss(nn.Module):
"""
use 1 - correlational coefficience between the output of the network and the target as the loss
input (o, t):
o: Variable of size (batch_size, 1) output of the network
t: Variable of size (batch_size, 1) target value
output (corr):
corr: Variable of size (1)
"""
def __init__(self):
super(CorrLoss, self).__init__()
def forward(self, o, t):
assert(o.size() == t.size())
# calcu z-score for o and t
o_m = o.mean(dim = 0)
o_s = o.std(dim = 0)
o_z = (o - o_m)/o_s
t_m = t.mean(dim =0)
t_s = t.std(dim = 0)
t_z = (t - t_m)/t_s
# calcu corr between o and t
tmp = o_z * t_z
corr = tmp.mean(dim = 0)
return 1 - corr
class MSECorrLoss(nn.Module):
def __init__(self, p = 1.5):
super(MSECorrLoss, self).__init__()
self.p = p
self.mseLoss = nn.MSELoss()
self.corrLoss = CorrLoss()
def forward(self, o, t):
mse = self.mseLoss(o, t)
corr = self.corrLoss(o, t)
loss = mse + self.p * corr
return loss我们实验采用的这个损失函数,效果稍微比BCE好一点
模型设计
为了提高模型的差异度,我们主要选择了不同模型的变体,其中包括以下五个模型:
- Deberta-v3-large
- Bert-for-patents
- Roberta-large
- Ernie-en-2.0-Large
- Electra-large-discriminator
具体cv分数如下:
deberta-v3-large:[0.8494,0.8455,0.8523,0.8458,0.8658] cv 0.85176
bertforpatents [0.8393, 0.8403, 0.8457, 0.8402, 0.8564] cv 0.8444
roberta-large [0.8183,0.8172,0.8203,0.8193,0.8398] cv 0.8233
ernie-large [0.8276,0.8277,0.8251,0.8296,0.8466] cv 0.8310
electra-large [0.8429,0.8309,0.8259,0.8416,0.846] cv 0.8376训练优化
根据以往比赛经验,我们主要采用了以下模型训练优化方式:
- 对抗训练:尝试了FGM 对模型训练有提升效果
class FGM():
def __init__(self, model):
self.model = model
self.backup = {}
def attack(self, epsilon=1., emb_name='word_embeddings'):
# emb_name这个参数要换成你模型中embedding的参数名
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
self.backup[name] = param.data.clone()
norm = torch.norm(param.grad)
if norm != 0 and not torch.isnan(norm):
r_at = epsilon * param.grad / norm
param.data.add_(r_at)
def restore(self, emb_name='emb.'):
# emb_name这个参数要换成你模型中embedding的参数名
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
assert name in self.backup
param.data = self.backup[name]
self.backup = {}- 模型泛化:加入了multidroout
- ema对模型训练有提升效果
class EMA():
def __init__(self, model, decay):
self.model = model
self.decay = decay
self.shadow = {}
self.backup = {}
def register(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
self.shadow[name] = param.data.clone()
def update(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.shadow
new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
self.shadow[name] = new_average.clone()
def apply_shadow(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.shadow
self.backup[name] = param.data
param.data = self.shadow[name]
def restore(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.backup
param.data = self.backup[name]
self.backup = {}
# 初始化
ema = EMA(model, 0.999)
ema.register()
# 训练过程中,更新完参数后,同步update shadow weights
def train():
optimizer.step()
ema.update()
# eval前,apply shadow weights;eval之后,恢复原来模型的参数
def evaluate():
ema.apply_shadow()
# evaluate
ema.restore()没有用的尝试:
- AWP
- PGD
模型融合
根据线下交叉验证分数以及线上分数反馈,我们通过加权融合的方式进行平均融合
from sklearn.preprocessing import MinMaxScaler
MMscaler = MinMaxScaler()
predictions1 = MMscaler.fit_transform(submission['predictions1'].values.reshape(-1,1)).reshape(-1)
predictions2 = MMscaler.fit_transform(submission['predictions2'].values.reshape(-1,1)).reshape(-1)
predictions3 = MMscaler.fit_transform(submission['predictions3'].values.reshape(-1,1)).reshape(-1)
predictions4 = MMscaler.fit_transform(submission['predictions4'].values.reshape(-1,1)).reshape(-1)
predictions5 = MMscaler.fit_transform(submission['predictions5'].values.reshape(-1,1)).reshape(-1)
# final_predictions=(predictions1+predictions2)/2
# final_predictions=(predictions1+predictions2+predictions3+predictions4+predictions5)/5
# 5:2:1:1:1
final_predictions=0.5*predictions1+0.2*predictions2+0.1*predictions3+0.1*predictions4+0.1*predictions5其他尝试
- two stage
前期我们做了不同预训练模型的微调,所以特征数量相对较多,我们尝试基于树模型的对文本统计特征以及模型预测做stacking尝试,当时模型是有比较不错的融合效果,下面含有部分代码
# ====================================================
# predictions1
# ====================================================
def get_fold_pred(CFG, path, model):
CFG.path = path
CFG.model = model
CFG.config_path = CFG.path + "config.pth"
CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path)
test_dataset = TestDataset(CFG, test)
test_loader = DataLoader(test_dataset,
batch_size=CFG.batch_size,
shuffle=False,
num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
predictions = []
for fold in CFG.trn_fold:
model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
state = torch.load(CFG.path + f"{CFG.model.split('/')[-1]}_fold{fold}_best.pth",
map_location=torch.device('cpu'))
model.load_state_dict(state['model'])
prediction = inference_fn(test_loader, model, device)
predictions.append(prediction.flatten())
del model, state, prediction
gc.collect()
torch.cuda.empty_cache()
# predictions1 = np.mean(predictions, axis=0)
# fea_df = pd.DataFrame(predictions).T
# fea_df.columns = [f"{CFG.model.split('/')[-1]}_fold{fold}" for fold in CFG.trn_fold]
# del test_dataset, test_loader
return predictions
model_paths = [
"../input/albert-xxlarge-v2/albert-xxlarge-v2/",
"../input/bert-large-cased-cv5/bert-large-cased/",
"../input/deberta-base-cv5/deberta-base/",
"../input/deberta-v3-base-cv5/deberta-v3-base/",
"../input/deberta-v3-small/deberta-v3-small/",
"../input/distilroberta-base/distilroberta-base/",
"../input/roberta-large/roberta-large/",
"../input/xlm-roberta-base/xlm-roberta-base/",
"../input/xlmrobertalarge-cv5/xlm-roberta-large/",
]
print("train.shape, test.shape", train.shape, test.shape)
print("titles.shape", titles.shape)
# for model_path in model_paths:
# with open(f'{model_path}/oof_df.pkl', "rb") as fh:
# oof = pickle.load(fh)[['id', 'fold', 'pred']]
# # oof = pd.read_pickle(f'{model_path}/oof_df.pkl')[['id', 'fold', 'pred']]
# oof[f"{model_path.split('/')[1]}"] = oof['pred']
# train = train.merge(oof[['id', f"{model_path.split('/')[1]}"]], how='left', on='id')
oof_res=pd.read_csv('../input/train-res/train_oof.csv')
train = train.merge(oof_res, how='left', on='id')
model_infos = {
'albert-xxlarge-v2': ['../input/albert-xxlarge-v2/albert-xxlarge-v2/', "albert-xxlarge-v2"],
'bert-large-cased': ['../input/bert-large-cased-cv5/bert-large-cased/', "bert-large-cased"],
'deberta-base': ['../input/deberta-base-cv5/deberta-base/', "deberta-base"],
'deberta-v3-base': ['../input/deberta-v3-base-cv5/deberta-v3-base/', "deberta-v3-base"],
'deberta-v3-small': ['../input/deberta-v3-small/deberta-v3-small/', "deberta-v3-small"],
'distilroberta-base': ['../input/distilroberta-base/distilroberta-base/', "distilroberta-base"],
'roberta-large': ['../input/roberta-large/roberta-large/', "roberta-large"],
'xlm-roberta-base': ['../input/xlm-roberta-base/xlm-roberta-base/', "xlm-roberta-base"],
'xlm-roberta-large': ['../input/xlmrobertalarge-cv5/xlm-roberta-large/', "xlm-roberta-large"],
}
for model, path_info in model_infos.items():
print(model)
model_path, model_name = path_info[0], path_info[1]
fea_df = get_fold_pred(CFG, model_path, model_name)
model_infos[model].append(fea_df)
del model_path, model_name
del oof_res训练代码:
for fold_ in range(5):
print("Fold:", fold_)
trn_ = train[train['fold'] != fold_].index
val_ = train[train['fold'] == fold_].index
# print(train.iloc[val_].sort_values('id'))
trn_x, trn_y = train[train_features].iloc[trn_], train['score'].iloc[trn_]
val_x, val_y = train[train_features].iloc[val_], train['score'].iloc[val_]
# train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
# valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
reg = lgb.LGBMRegressor(**params,n_estimators=1100)
xgb = XGBRegressor(**xgb_params, n_estimators=1000)
cat = CatBoostRegressor(iterations=1000,learning_rate=0.03,
depth=10,
eval_metric='RMSE',
random_seed = 42,
bagging_temperature = 0.2,
od_type='Iter',
metric_period = 50,
od_wait=20)
print("-"* 20 + "LightGBM Training" + "-"* 20)
reg.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,verbose=100,eval_metric='rmse')
print("-"* 20 + "XGboost Training" + "-"* 20)
xgb.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,eval_metric='rmse',verbose=100)
print("-"* 20 + "Catboost Training" + "-"* 20)
cat.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,use_best_model=True,verbose=100)
imp_df = pd.DataFrame()
imp_df['feature'] = train_features
imp_df['gain_reg'] = reg.booster_.feature_importance(importance_type='gain')
imp_df['fold'] = fold_ + 1
importances = pd.concat([importances, imp_df], axis=0, sort=False)
for model, values in model_infos.items():
test[model] = values[2][fold_]
for model, values in uspppm_model_infos.items():
test[f"uspppm_{model}"] = values[2][fold_]
# for f in tqdm(amount_feas, desc="amount_feas 基本聚合特征"):
# for cate in category_fea:
# if f != cate:
# test['{}_{}_medi'.format(cate, f)] = test.groupby(cate)[f].transform('median')
# test['{}_{}_mean'.format(cate, f)] = test.groupby(cate)[f].transform('mean')
# test['{}_{}_max'.format(cate, f)] = test.groupby(cate)[f].transform('max')
# test['{}_{}_min'.format(cate, f)] = test.groupby(cate)[f].transform('min')
# test['{}_{}_std'.format(cate, f)] = test.groupby(cate)[f].transform('std')
# LightGBM
oof_reg_preds[val_] = reg.predict(val_x, num_iteration=reg.best_iteration_)
# oof_reg_preds[oof_reg_preds < 0] = 0
lgb_preds = reg.predict(test[train_features], num_iteration=reg.best_iteration_)
# lgb_preds[lgb_preds < 0] = 0
# Xgboost
oof_reg_preds1[val_] = xgb.predict(val_x)
oof_reg_preds1[oof_reg_preds1 < 0] = 0
xgb_preds = xgb.predict(test[train_features])
# xgb_preds[xgb_preds < 0] = 0
# catboost
oof_reg_preds2[val_] = cat.predict(val_x)
oof_reg_preds2[oof_reg_preds2 < 0] = 0
cat_preds = cat.predict(test[train_features])
cat_preds[xgb_preds < 0] = 0
# merge all prediction
merge_pred[val_] = oof_reg_preds[val_] * 0.4 + oof_reg_preds1[val_] * 0.3 +oof_reg_preds2[val_] * 0.3
# sub_reg_preds += np.expm1(_preds) / len(folds)
# sub_reg_preds += np.expm1(_preds) / len(folds)
sub_preds += (lgb_preds / 5) * 0.6 + (xgb_preds / 5) * 0.2 + (cat_preds / 5) * 0.2 #三个模型五折测试集预测结果
sub_reg_preds+=lgb_preds / 5 # lgb五折测试集预测结果
print("lgb",pearsonr(train['score'], np.expm1(oof_reg_preds))[0]) # lgb
print("xgb",pearsonr(train['score'], np.expm1(oof_reg_preds1))[0]) # xgb
print("cat",pearsonr(train['score'], np.expm1(oof_reg_preds2))[0]) # cat
print("xgb lgb cat",pearsonr(train['score'], np.expm1(merge_pred))[0]) # xgb lgb cat边栏推荐
- 当一个接口出现异常时候,你是如何分析异常的?
- EasyCVR国标协议接入的通道,在线通道部分播放异常是什么原因?
- String common methods
- Unity C# 网络学习(六)——FTP(一)
- Fatigue liée à l'examen du marché secondaire des médicaments innovants: succès clinique de la phase III et approbation du produit
- 字符串常用方法
- 2个NPN三极管组成的恒流电路
- 菊花链(寒假每日一题 39)
- TSDB在民机行业中的应用
- write a number of lines to a new file in vim
猜你喜欢

Day 04 - file IO

谈谈飞书对开发工作的优势 | 社区征文

如何通过EasyCVR接口监测日志观察平台拉流情况?

Experience of epidemic prevention and control, home office and online teaching | community essay solicitation
![Experiment 5 8254 timing / counter application experiment [microcomputer principle] [experiment]](/img/e2/7da59a566e4ccb8e43f2a64c0420e7.png)
Experiment 5 8254 timing / counter application experiment [microcomputer principle] [experiment]

内网学习笔记(5)

元宇宙的生态圈

Numerical scheme simulation of forward stochastic differential equations with Markov Switching

带马尔科夫切换的正向随机微分方程数值格式模拟

Abnova a4gnt polyclonal antibody
随机推荐
2个NPN三极管组成的恒流电路
保险APP适老化服务评测分析2022第06期
MOS管相关知识
Poj3669 meteor shower (BFS pretreatment)
你知道你的ABC吗(春季每日一题 1)
‘distutils‘ has no attribute ‘version
AssertionError: CUDA unavailable, invalid device 0 requested
Day 04 - file IO
【LeetCode】11、盛最多水的容器
Listen to the markdown file and hot update next JS page
JS array object to object
Multimodal emotion recognition_ Research on emotion recognition based on multimodal fusion
Chinese and English instructions of Papain
‘distutils‘ has no attribute ‘version
Build and train your own dataset for pig face recognition
[mobile terminal] design size of mobile phone interface
股票开账户如何优惠开户?手机开户是安全么?
Icml2022 | establishing a continuous time model of counterfactual results using neural control differential equations
Baidu voice synthesizes voice files and displays them on the website
Constant current circuit composed of 2 NPN triodes