当前位置:网站首页>Gold medal scheme of kaggle patent matching competition post competition summary
Gold medal scheme of kaggle patent matching competition post competition summary
2022-06-25 02:56:00 【To great】
Brief introduction to the game
In the patent matching dataset , The contestant needs to judge the similarity of the two phrases , One is anchor , One is target
, Then output the two in different semantics (context) The similarity , The scope is 0-1, Our team id by xlyhq,a A list of rank 13,b A list of ran12, Thank you very much @heng zheng、@pythonlan,@leolu1998,@syzong The efforts and efforts of the four teammates , Last comparison lucky Dog to gold medal .
It is similar to other front row core ideas , Here we mainly share the course of our competition and the specific results of relevant experiments , And interesting attempts
Text processing
The main data sets are anchor、target and context Field , In addition, there is additional text splicing information , During the competition, we mainly tried the following splicing attempts :
- v1:test[‘anchor’] + ‘[SEP]’ + test[‘target’] + ‘[SEP]’ + test[‘context_text’]
- v2:test[‘anchor’] + ‘[SEP]’ + test[‘target’] + ‘[SEP]’ +test[‘context’]+ ‘[SEP]’ + test[‘context_text’], It's equivalent to putting A47 Similar codes are spliced together
- v3:test[‘text’] = test[‘anchor’] + ‘[SEP]’ + test[‘target’] + ‘[SEP]’ + test[‘context’] + ‘[SEP]’ + test[‘context_text’] Get more text to splice , Is equivalent to A47 The following subcategories are spliced together , such as A47B,A47C
context_mapping = {
"A": "Human Necessities",
"B": "Operations and Transport",
"C": "Chemistry and Metallurgy",
"D": "Textiles",
"E": "Fixed Constructions",
"F": "Mechanical Engineering",
"G": "Physics",
"H": "Electricity",
"Y": "Emerging Cross-Sectional Technologies",
}
titles = pd.read_csv('./input/cpc-codes/titles.csv')
def process(text):
return re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]", "", text)
def get_context(cpc_code):
cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code))]
texts = cpc_data['title'].values.tolist()
texts = [process(text) for text in texts]
return ";".join([context_mapping[cpc_code[0]]] + texts)
def get_cpc_texts():
cpc_texts = dict()
for code in tqdm(train['context'].unique()):
cpc_texts[code] = get_context(code)
return cpc_texts
cpc_texts = get_cpc_texts()
This splicing method can be greatly improved , But the text becomes longer , The maximum length is set to 300, This leads to slower training
- v4: The splicing method of the core :test[‘text’] = test[‘text’] + ‘[SEP]’ + test[‘target_info’]
# Splicing target info
test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()
del target_info['target']
test=test.merge(target_info,on=['anchor','context'],how='left')
test['text'] = test['text'] + '[SEP]' + test['target_info']
test.head()
This splicing method can make the model cv and lb Scores have been greatly improved , adopt v3 and v4 Comparison of two different splicing methods , We can find that selecting higher quality text for splicing can improve the model ,v3 There is a lot of redundant information in the mode , and v4 There are many key information at the entity level .
Data partitioning
In the course of the game , We tried different data partitioning methods , These include :
- StratifiedGroupKFold, This splicing method cv And lb The line difference is small , The score is a little better
- StratifiedKFold: Offline cv Relatively high
- other Kfold and GrouFold The result is bad
Loss function
The main loss functions that can be referred to are :
- BCE: nn.BCEWithLogitsLoss(reduction=“mean”)
- MSE:nn.MSELoss()
- Mixture Loss:MseCorrloss
class CorrLoss(nn.Module):
"""
use 1 - correlational coefficience between the output of the network and the target as the loss
input (o, t):
o: Variable of size (batch_size, 1) output of the network
t: Variable of size (batch_size, 1) target value
output (corr):
corr: Variable of size (1)
"""
def __init__(self):
super(CorrLoss, self).__init__()
def forward(self, o, t):
assert(o.size() == t.size())
# calcu z-score for o and t
o_m = o.mean(dim = 0)
o_s = o.std(dim = 0)
o_z = (o - o_m)/o_s
t_m = t.mean(dim =0)
t_s = t.std(dim = 0)
t_z = (t - t_m)/t_s
# calcu corr between o and t
tmp = o_z * t_z
corr = tmp.mean(dim = 0)
return 1 - corr
class MSECorrLoss(nn.Module):
def __init__(self, p = 1.5):
super(MSECorrLoss, self).__init__()
self.p = p
self.mseLoss = nn.MSELoss()
self.corrLoss = CorrLoss()
def forward(self, o, t):
mse = self.mseLoss(o, t)
corr = self.corrLoss(o, t)
loss = mse + self.p * corr
return loss
The loss function used in our experiment , The effect is slightly better than BCE A little bit better.
Model design
In order to improve the difference of the model , We mainly selected the variants of different models , It includes the following five models :
- Deberta-v3-large
- Bert-for-patents
- Roberta-large
- Ernie-en-2.0-Large
- Electra-large-discriminator
Specifically cv The scores are as follows :
deberta-v3-large:[0.8494,0.8455,0.8523,0.8458,0.8658] cv 0.85176
bertforpatents [0.8393, 0.8403, 0.8457, 0.8402, 0.8564] cv 0.8444
roberta-large [0.8183,0.8172,0.8203,0.8193,0.8398] cv 0.8233
ernie-large [0.8276,0.8277,0.8251,0.8296,0.8466] cv 0.8310
electra-large [0.8429,0.8309,0.8259,0.8416,0.846] cv 0.8376
Training optimization
Based on previous competition experience , We mainly adopt the following model training optimization methods :
- Confrontation training : tried FGM It can improve the model training
class FGM():
def __init__(self, model):
self.model = model
self.backup = {}
def attack(self, epsilon=1., emb_name='word_embeddings'):
# emb_name This parameter needs to be changed to your model embedding Parameter name of
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
self.backup[name] = param.data.clone()
norm = torch.norm(param.grad)
if norm != 0 and not torch.isnan(norm):
r_at = epsilon * param.grad / norm
param.data.add_(r_at)
def restore(self, emb_name='emb.'):
# emb_name This parameter needs to be changed to your model embedding Parameter name of
for name, param in self.model.named_parameters():
if param.requires_grad and emb_name in name:
assert name in self.backup
param.data = self.backup[name]
self.backup = {}
- Model generalization : Joined the multidroout
- ema It can improve the model training
class EMA():
def __init__(self, model, decay):
self.model = model
self.decay = decay
self.shadow = {}
self.backup = {}
def register(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
self.shadow[name] = param.data.clone()
def update(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.shadow
new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
self.shadow[name] = new_average.clone()
def apply_shadow(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.shadow
self.backup[name] = param.data
param.data = self.shadow[name]
def restore(self):
for name, param in self.model.named_parameters():
if param.requires_grad:
assert name in self.backup
param.data = self.backup[name]
self.backup = {}
# initialization
ema = EMA(model, 0.999)
ema.register()
# During training , After updating the parameters , Sync update shadow weights
def train():
optimizer.step()
ema.update()
# eval front ,apply shadow weights;eval after , Restore the parameters of the original model
def evaluate():
ema.apply_shadow()
# evaluate
ema.restore()
A useless attempt :
- AWP
- PGD
Model fusion
Feedback according to offline cross verification scores and online scores , We use weighted fusion to average fusion
from sklearn.preprocessing import MinMaxScaler
MMscaler = MinMaxScaler()
predictions1 = MMscaler.fit_transform(submission['predictions1'].values.reshape(-1,1)).reshape(-1)
predictions2 = MMscaler.fit_transform(submission['predictions2'].values.reshape(-1,1)).reshape(-1)
predictions3 = MMscaler.fit_transform(submission['predictions3'].values.reshape(-1,1)).reshape(-1)
predictions4 = MMscaler.fit_transform(submission['predictions4'].values.reshape(-1,1)).reshape(-1)
predictions5 = MMscaler.fit_transform(submission['predictions5'].values.reshape(-1,1)).reshape(-1)
# final_predictions=(predictions1+predictions2)/2
# final_predictions=(predictions1+predictions2+predictions3+predictions4+predictions5)/5
# 5:2:1:1:1
final_predictions=0.5*predictions1+0.2*predictions2+0.1*predictions3+0.1*predictions4+0.1*predictions5
Other attempts
- two stage
In the early stage, we made fine adjustments to different pre training models , Therefore, the number of features is relatively large , We try to predict text statistical features and models based on tree models stacking Try , At that time, the model had a good fusion effect , The following contains some code
# ====================================================
# predictions1
# ====================================================
def get_fold_pred(CFG, path, model):
CFG.path = path
CFG.model = model
CFG.config_path = CFG.path + "config.pth"
CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path)
test_dataset = TestDataset(CFG, test)
test_loader = DataLoader(test_dataset,
batch_size=CFG.batch_size,
shuffle=False,
num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
predictions = []
for fold in CFG.trn_fold:
model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
state = torch.load(CFG.path + f"{CFG.model.split('/')[-1]}_fold{fold}_best.pth",
map_location=torch.device('cpu'))
model.load_state_dict(state['model'])
prediction = inference_fn(test_loader, model, device)
predictions.append(prediction.flatten())
del model, state, prediction
gc.collect()
torch.cuda.empty_cache()
# predictions1 = np.mean(predictions, axis=0)
# fea_df = pd.DataFrame(predictions).T
# fea_df.columns = [f"{CFG.model.split('/')[-1]}_fold{fold}" for fold in CFG.trn_fold]
# del test_dataset, test_loader
return predictions
model_paths = [
"../input/albert-xxlarge-v2/albert-xxlarge-v2/",
"../input/bert-large-cased-cv5/bert-large-cased/",
"../input/deberta-base-cv5/deberta-base/",
"../input/deberta-v3-base-cv5/deberta-v3-base/",
"../input/deberta-v3-small/deberta-v3-small/",
"../input/distilroberta-base/distilroberta-base/",
"../input/roberta-large/roberta-large/",
"../input/xlm-roberta-base/xlm-roberta-base/",
"../input/xlmrobertalarge-cv5/xlm-roberta-large/",
]
print("train.shape, test.shape", train.shape, test.shape)
print("titles.shape", titles.shape)
# for model_path in model_paths:
# with open(f'{model_path}/oof_df.pkl', "rb") as fh:
# oof = pickle.load(fh)[['id', 'fold', 'pred']]
# # oof = pd.read_pickle(f'{model_path}/oof_df.pkl')[['id', 'fold', 'pred']]
# oof[f"{model_path.split('/')[1]}"] = oof['pred']
# train = train.merge(oof[['id', f"{model_path.split('/')[1]}"]], how='left', on='id')
oof_res=pd.read_csv('../input/train-res/train_oof.csv')
train = train.merge(oof_res, how='left', on='id')
model_infos = {
'albert-xxlarge-v2': ['../input/albert-xxlarge-v2/albert-xxlarge-v2/', "albert-xxlarge-v2"],
'bert-large-cased': ['../input/bert-large-cased-cv5/bert-large-cased/', "bert-large-cased"],
'deberta-base': ['../input/deberta-base-cv5/deberta-base/', "deberta-base"],
'deberta-v3-base': ['../input/deberta-v3-base-cv5/deberta-v3-base/', "deberta-v3-base"],
'deberta-v3-small': ['../input/deberta-v3-small/deberta-v3-small/', "deberta-v3-small"],
'distilroberta-base': ['../input/distilroberta-base/distilroberta-base/', "distilroberta-base"],
'roberta-large': ['../input/roberta-large/roberta-large/', "roberta-large"],
'xlm-roberta-base': ['../input/xlm-roberta-base/xlm-roberta-base/', "xlm-roberta-base"],
'xlm-roberta-large': ['../input/xlmrobertalarge-cv5/xlm-roberta-large/', "xlm-roberta-large"],
}
for model, path_info in model_infos.items():
print(model)
model_path, model_name = path_info[0], path_info[1]
fea_df = get_fold_pred(CFG, model_path, model_name)
model_infos[model].append(fea_df)
del model_path, model_name
del oof_res
Training code :
for fold_ in range(5):
print("Fold:", fold_)
trn_ = train[train['fold'] != fold_].index
val_ = train[train['fold'] == fold_].index
# print(train.iloc[val_].sort_values('id'))
trn_x, trn_y = train[train_features].iloc[trn_], train['score'].iloc[trn_]
val_x, val_y = train[train_features].iloc[val_], train['score'].iloc[val_]
# train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
# valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
reg = lgb.LGBMRegressor(**params,n_estimators=1100)
xgb = XGBRegressor(**xgb_params, n_estimators=1000)
cat = CatBoostRegressor(iterations=1000,learning_rate=0.03,
depth=10,
eval_metric='RMSE',
random_seed = 42,
bagging_temperature = 0.2,
od_type='Iter',
metric_period = 50,
od_wait=20)
print("-"* 20 + "LightGBM Training" + "-"* 20)
reg.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,verbose=100,eval_metric='rmse')
print("-"* 20 + "XGboost Training" + "-"* 20)
xgb.fit(trn_x, np.log1p(trn_y),eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,eval_metric='rmse',verbose=100)
print("-"* 20 + "Catboost Training" + "-"* 20)
cat.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))],early_stopping_rounds=50,use_best_model=True,verbose=100)
imp_df = pd.DataFrame()
imp_df['feature'] = train_features
imp_df['gain_reg'] = reg.booster_.feature_importance(importance_type='gain')
imp_df['fold'] = fold_ + 1
importances = pd.concat([importances, imp_df], axis=0, sort=False)
for model, values in model_infos.items():
test[model] = values[2][fold_]
for model, values in uspppm_model_infos.items():
test[f"uspppm_{model}"] = values[2][fold_]
# for f in tqdm(amount_feas, desc="amount_feas Basic aggregation features "):
# for cate in category_fea:
# if f != cate:
# test['{}_{}_medi'.format(cate, f)] = test.groupby(cate)[f].transform('median')
# test['{}_{}_mean'.format(cate, f)] = test.groupby(cate)[f].transform('mean')
# test['{}_{}_max'.format(cate, f)] = test.groupby(cate)[f].transform('max')
# test['{}_{}_min'.format(cate, f)] = test.groupby(cate)[f].transform('min')
# test['{}_{}_std'.format(cate, f)] = test.groupby(cate)[f].transform('std')
# LightGBM
oof_reg_preds[val_] = reg.predict(val_x, num_iteration=reg.best_iteration_)
# oof_reg_preds[oof_reg_preds < 0] = 0
lgb_preds = reg.predict(test[train_features], num_iteration=reg.best_iteration_)
# lgb_preds[lgb_preds < 0] = 0
# Xgboost
oof_reg_preds1[val_] = xgb.predict(val_x)
oof_reg_preds1[oof_reg_preds1 < 0] = 0
xgb_preds = xgb.predict(test[train_features])
# xgb_preds[xgb_preds < 0] = 0
# catboost
oof_reg_preds2[val_] = cat.predict(val_x)
oof_reg_preds2[oof_reg_preds2 < 0] = 0
cat_preds = cat.predict(test[train_features])
cat_preds[xgb_preds < 0] = 0
# merge all prediction
merge_pred[val_] = oof_reg_preds[val_] * 0.4 + oof_reg_preds1[val_] * 0.3 +oof_reg_preds2[val_] * 0.3
# sub_reg_preds += np.expm1(_preds) / len(folds)
# sub_reg_preds += np.expm1(_preds) / len(folds)
sub_preds += (lgb_preds / 5) * 0.6 + (xgb_preds / 5) * 0.2 + (cat_preds / 5) * 0.2 # The prediction results of the three models' 50% discount test set
sub_reg_preds+=lgb_preds / 5 # lgb 50% discount test set prediction results
print("lgb",pearsonr(train['score'], np.expm1(oof_reg_preds))[0]) # lgb
print("xgb",pearsonr(train['score'], np.expm1(oof_reg_preds1))[0]) # xgb
print("cat",pearsonr(train['score'], np.expm1(oof_reg_preds2))[0]) # cat
print("xgb lgb cat",pearsonr(train['score'], np.expm1(merge_pred))[0]) # xgb lgb cat
边栏推荐
- Practice and Thinking on process memory
- 给你讲懂 MVCC 续篇
- Solution of separating matlab main window and editor window into two interfaces
- 保险也能拼购?个人可以凑够人数组团购买医疗保险的4大风险
- How transformers Roberta adds tokens
- 运行时修改Universal Render Data
- left join on和 join on的区别
- 怎么开户打新债 开户是安全的吗
- 计网 | 【四 网络层】知识点及例题
- 消息称一加将很快更新TWS耳塞、智能手表和手环产品线
猜你喜欢

好用的字典-defaultdict

小米路由R4A千兆版安装breed+OpenWRT教程(全脚本无需硬改)

高速缓存Cache详解(西电考研向)

分布式事务解决方案和代码落地

Performance rendering of dSPACE

Enlightenment of using shadergraph to make edge fusion particle shader

Can automate - 10k, can automate - 20K, do you understand automated testing?

36岁前亚马逊变性黑客,窃取超1亿人数据被判20年监禁!

Software testing weekly (issue 77): giving up once will breed the habit of giving up, and the problems that could have been solved will become insoluble.

Insurance can also be bought together? Four risks that individuals can pool enough people to buy Medical Insurance in groups
随机推荐
小米路由R4A千兆版安装breed+OpenWRT教程(全脚本无需硬改)
PSQL column to row
Migrate Oracle database from windows system to Linux Oracle RAC cluster environment (2) -- convert database to cluster mode
Centos7.3 modifying MySQL default password_ Explain centos7 modifying the password of the specified user in MySQL
[I.MX6UL] U-Boot移植(六) 网络驱动修改 LAN8720A
Software testing salary in first tier cities - are you dragging your feet
VSCode中如何实现点击DOM自动定位到相应代码行
PE file infrastructure sorting
DSPACE set zebra crossings and road arrows
Practice and Thinking on process memory
AOSP ~ WIFI架构总览
Unity存档系统——Json格式的文件
yarn : 无法加载文件 C:\Users\xxx\AppData\Roaming\npm\yarn.ps1,因为在此系统上禁止运行脚本
Add in cmakelists_ Definitions() function
ACM. HJ70 矩阵乘法计算量估算 ●●
Call system function security scheme
F - Spices(线性基)
Difference between left join on and join on
Computer wechat user picture decoded into picture in DAT format (TK version)
[analysis of STL source code] functions and applications of six STL components (directory)