Get the LiepU data
Request the VeA Moodle data from: Estere Vitola esterev@venta.lv
Remind Amjad Abu Saa about the dataset
In the literature section, put special emphasis on the publications you have found, organized into tables in this way - make progress on the text: http://share.yellowrobot.xyz/quick/2023-2-21-9E9F6B63-2048-44C3-A098-E7CC2D4CB39A.pdf
Add cpu, devices=1 or devices=0 so that debugging works, because debugging does not work with multiprocessing
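As a sketch, the single-process setup looks like this (the `accelerator`/`devices` arguments are the standard `pl.Trainer` ones; which exact values to use here is the assumption):

```python
import pytorch_lightning as pl

# Run on CPU in a single process so breakpoints fire in the main process;
# worker processes spawned by multi-device training bypass the debugger.
trainer = pl.Trainer(accelerator='cpu', devices=1)

# Likewise keep DataLoader(num_workers=0) while debugging, since
# dataloader worker processes also run outside the debugger.
```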
Prepare the dataset in a new CSV (put the script in GIT) - when the CSV is ready, send it over for review - do one-hot encoding only at the very end, so the CSV stores indices
Dataset - Daugules:
CSV
Order ASC by datetime
User ID -> group into sequences
Input
Course ID -> index 0…N -> one-hot encoded (pd.get_dummies) => concatenated at each time step
Delta time -> normalized minutes between records
Type -> index 0…N -> one-hot encoded (pd.get_dummies)
Is Correct -> index 0…N -> one-hot encoded, or a -1/+1 scalar might work well
Output
Percentage of total correct answers => discretize into grades 0-10 and run classification (one grade at the end of the whole sequence)
Before training, analyze with a histogram how many students fall into each class
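The preparation steps above can be sketched as follows; the column names (`user_id`, `datetime`, `course_id`, `type`, `is_correct`) and the toy rows are assumptions, since the real Daugules CSV layout is not shown here:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Daugules CSV; real column names may differ.
df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'datetime':   pd.to_datetime(['2023-01-01 10:00', '2023-01-01 10:05',
                                  '2023-01-01 10:30', '2023-01-02 09:00',
                                  '2023-01-02 09:45']),
    'course_id':  [0, 1, 0, 1, 1],
    'type':       [0, 0, 1, 0, 1],
    'is_correct': [1, 0, 1, 1, 1],
})

df = df.sort_values('datetime')  # order ASC by datetime

# Delta time: minutes between consecutive records of the same user, standardized.
df['delta_min'] = (df.groupby('user_id')['datetime'].diff()
                     .dt.total_seconds().div(60).fillna(0))
df['delta_min'] = (df['delta_min'] - df['delta_min'].mean()) / (df['delta_min'].std() + 1e-8)

# One-hot encode the categorical index columns only at this final step.
df = pd.get_dummies(df, columns=['course_id', 'type'])

# Target: percentage of correct answers per user, discretized into grades 0-10.
pct = df.groupby('user_id')['is_correct'].mean()
grade = (pct * 10).round().astype(int)

# Class-balance check before training (histogram of students per grade).
print(grade.value_counts().sort_index())

# Group rows into one sequence per user for the model input.
sequences = {uid: g.drop(columns=['user_id', 'datetime']).to_numpy(dtype=np.float32)
             for uid, g in df.groupby('user_id')}
```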
Research questions:
How many records are needed to reach a given accuracy?
During training, use a weighted sampler to counteract the fact that poor/good students occur less often than average ones
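A minimal sketch of class-balanced sampling with PyTorch's `WeightedRandomSampler`, assuming grade labels 0-10 (the toy labels and features below are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative grades: most students sit in the middle, few at the extremes.
labels = torch.tensor([5, 5, 5, 5, 6, 6, 6, 2, 9])
features = torch.randn(len(labels), 4)

# Weight each sample inversely to its class frequency so the rare
# low/high grades are drawn about as often as the common middle grades.
class_counts = torch.bincount(labels, minlength=11).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=4, sampler=sampler)
```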
Once the dataset is ready, search through and train the model; use temporal pooling over the time dimension, e.g. (B, seq, Features) => mean(dim=1) => (B, Features); you can also take the max or the last time step - all of these are standard approaches
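The pooling variants above can be sketched as:

```python
import torch

B, seq, features = 8, 20, 32
x = torch.randn(B, seq, features)  # (batch, time, features)

x_mean = x.mean(dim=1)          # average over the time dimension
x_max  = x.max(dim=1).values    # strongest activation per feature
x_last = x[:, -1, :]            # final time step only

# All three collapse the time axis to a fixed-size (B, features) vector.
assert x_mean.shape == x_max.shape == x_last.shape == (B, features)
```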
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split
from torchmetrics.classification import BinaryAccuracy
import pytorch_lightning as pl
from sklearn.preprocessing import StandardScaler
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from deltapy import transform, interact, mapper, extract  # only used by the commented-out robust_scaler below

df_full = pd.read_csv('/Users/evalds/Downloads/lochsmith_conversations.csv')
print(df_full.shape)

print(df_full.describe())

# fill missing numeric values with the column means
df_full = df_full.fillna(df_full.mean(numeric_only=True))

#print(df_full.describe())

# save modified dataset
#df_full.to_csv('/Users/evalds/Downloads/lochsmith_conversations_filled.csv', index=False)
# select the categorical columns and one-hot encode them separately
df_categorical_inputs = df_full[['user_id', 'hour_in_day']]
df_categorical_inputs = pd.get_dummies(df_categorical_inputs, columns=['user_id', 'hour_in_day'])

# drop identifier and categorical columns from the numeric frame
df_full = df_full.drop(['conversation_id', 'user_id', 'lang', 'hour_in_day', 'metric_sales_amount'], axis=1)
# extract is_yes, is_no target columns
df_yes = df_full[['is_yes']]
df_no = df_full[['is_no']]

# sanity check: count records where is_yes and is_no are both true
# (compare the series, not the frames - `df_yes & df_no` would align on
# column labels and produce all-NaN columns instead of an intersection)
df_yes_no = df_yes['is_yes'].astype(bool) & df_no['is_no'].astype(bool)
print('df_yes_no', df_yes_no.sum())
print('df_yes', df_yes.sum())
print('df_no', df_no.sum())

# drop is_yes, is_no columns from the inputs
df_full = df_full.drop(['is_yes', 'is_no'], axis=1)

# describe all columns, include all
# with pd.option_context('display.max_columns', 40):
#     print(df_full.describe(include='all'))

# show histograms for all columns
# df_full.hist(bins=50, figsize=(50, 50))
# plt.show()

# standardize the numeric data
scaler = StandardScaler()
df_full[:] = scaler.fit_transform(df_full)
#df_full[:] = transform.robust_scaler(df_full)

# df_full.hist(bins=50, figsize=(50, 50))
# plt.show()

# append the one-hot encoded categorical columns back
df_full = pd.concat([df_full, df_categorical_inputs], axis=1)

BATCH_SIZE = 64
LEARNING_RATE = 1e-4

metric_acc = BinaryAccuracy()
class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(df_full.shape[1], 128),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(),
            nn.Linear(128, 128),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        y_hat = self.encoder(x)
        return y_hat

    def configure_optimizers(self):
        optimizer = torch.optim.RAdam(self.parameters(), lr=LEARNING_RATE)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y)
        acc = metric_acc(y_hat, y)
        self.log('train_loss', loss)
        self.log('train_acc', acc, prog_bar=True)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y)
        acc = metric_acc(y_hat, y)
        self.log('val_loss', loss)
        self.log('val_acc', acc, prog_bar=True)


# data
# to_numpy(dtype=...) avoids an object-dtype array once the bool dummy
# columns are concatenated with the float columns
x_tensor = torch.from_numpy(df_full.to_numpy(dtype=np.float32))
y_tensor = torch.from_numpy(df_yes.to_numpy(dtype=np.float32))
dataset = TensorDataset(x_tensor, y_tensor)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_set, val_set = random_split(dataset, [train_size, val_size])

# shuffle only the training loader
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE)
# model
model = Model()

# training
# weights and biases logging
wandb_logger = WandbLogger(project='small-training')

early_stop_callback = EarlyStopping(monitor="val_acc", min_delta=0.00, patience=3, verbose=False, mode="max")

# for debugging add accelerator='cpu', devices=1 - multi-process training breaks the debugger
trainer = pl.Trainer(logger=wandb_logger, callbacks=[early_stop_callback])
trainer.fit(model, train_loader, val_loader)