AI_in_GIST

Table of Contents

Preprocessing

Handle Missing Values

| Classification | Regression | Time Series | Clustering |
| --- | --- | --- | --- |
| Deletion | Deletion | Deletion | Deletion |
| Conditional imputation with mode | Conditional imputation with mean/median | Forward/backward fill | K-means conditional imputation |
| - | Interpolation (linear, spline, etc.) | Interpolation (linear/spline) | - |
| Treat as a special category like UNK | - | - | Treat as a separate cluster |
| - | - | Coarsen binning to reduce effect (monthly -> quarterly) | - |
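A minimal pandas/scikit-learn sketch of these imputation options (column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, None, 40.0, 35.0],
                   "city": ["NY", "LA", None, "NY"]})

# Deletion
dropped = df.dropna()

# Conditional imputation: median for numerical, mode for categorical
age_imputed = SimpleImputer(strategy="median").fit_transform(df[["age"]])
city_mode = df["city"].fillna(df["city"].mode()[0])

# Treat missing values as their own category (e.g. "UNK")
city_unk = df["city"].fillna("UNK")

# Time series: forward/backward fill or interpolation
ts = pd.Series([1.0, None, None, 4.0])
ts_filled = ts.ffill().bfill()
ts_interp = ts.interpolate(method="linear")
```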

Detect Outliers

Method Description
Inter-quartile range (IQR) Statistical method to detect outliers using quantiles
Isolation forest Unsupervised, random-forest based; outliers need fewer cuts/branches to isolate, so they separate early. Needs a prior estimate of the outlier fraction (contamination).
One-Class SVM Unsupervised; the normal data forms one class and the origin acts as the only example of the other class. Maximize the margin between them. An RBF kernel usually works better.
DBSCAN Clustering algorithm; detects the anomaly cluster (the cluster with the fewest points, or the noise points).
Auto-encoder Deep-learning method; outliers lie far from the learned data embedding in the bottleneck layer and therefore have a larger reconstruction loss.
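A rough sketch of the IQR rule and Isolation Forest with scikit-learn; the 1.5 multiplier and the contamination fraction are conventional but arbitrary choices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -9.0]])  # data with two injected outliers

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Isolation Forest: needs a rough prior on the outlier fraction (contamination)
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(x.reshape(-1, 1))   # -1 = outlier, 1 = inlier
```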

Handle Outliers

Method Description
Deletion Do only if you know the reason
Transformation Can lead to loss of info present in the shape of input variable
Truncate Replace values beyond a threshold with the threshold value (winsorizing)
Binning Binning numerical vars
Robust model Like ensemble methods
Robust loss Like Huber loss
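A small illustration of truncation (clipping at percentile thresholds) and a robust Huber loss via scikit-learn's HuberRegressor; the percentiles are placeholders:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = 3 * x.ravel() + rng.normal(size=200)
y[:5] += 30  # inject a few outliers

# Truncate: clip values beyond the 1st/99th percentile to the threshold
lo, hi = np.percentile(y, [1, 99])
y_clipped = np.clip(y, lo, hi)

# Robust loss: Huber loss is quadratic near zero, linear for large residuals
model = HuberRegressor().fit(x, y)
print(model.coef_)
```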

Dimension Reduction Techniques

Method Description
Feature Selection Filter (remove correlated features; drop unimportant features based on feature importance)
  Subsets (RFE: fit a model such as a DT, drop the least important feature, and repeat until the desired number of features remains) & wrapper techniques like forward, backward, bidirectional selection (same idea as RFE but you choose the search direction and the eliminator model)
  Train loop (LASSO)
Feature Extraction PCA (linear), kernel PCA (e.g. RBF kernel), t-SNE (nonlinear), UMAP (better for nonlinear, large datasets)
  LDA (supervised ML, maximizes class-separability directions)
  Auto-encoder (DL) as in clustering
Clustering Group categorical and numerical variables; use cluster membership as a compact feature
Binning Binning for numerical and categorical variables
Aggregation & transformation E.g. polynomial features, interaction terms
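A sketch of feature selection (RFE) and feature extraction (PCA) with scikit-learn; the dataset and the number of kept features are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: recursively drop the least important feature
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# Feature extraction: project onto the top principal components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(X_selected.shape, X_pca.shape, pca.explained_variance_ratio_.sum())
```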

Handle Skewed Data

| Function Transformation | Power Transformation | Quantile Transformation |
| --- | --- | --- |
| Logarithmic (>0, right-skewed input) | Box-Cox (input must be >0) | Quantile normalization |
| Squared (left-skewed input) | Yeo-Johnson | Rank transformation |
| Square root (>0, right-skewed input) | | |
| Reciprocal (>0, strongly right-skewed input) | | |
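All three transformation families are available in scikit-learn; a short sketch (Box-Cox requires strictly positive input, Yeo-Johnson does not):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed, > 0

x_log = np.log(x)                                               # function transform
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)  # power transform, > 0 only
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)
x_quantile = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=100).fit_transform(x)
```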

Techniques to handle imbalanced datasets

Method Description
Oversample, undersample Oversample the minority class with replacement; undersample/remove entries from the majority class
BalancedML Balanced ensemble estimators like BalancedBaggingClassifier, which resample each bootstrap sample to balance the classes
SMOTE Synthetically generate new minority-class samples by interpolating between a sample and its k nearest neighbours
Appropriate ML Algorithms more robust to imbalance, like tree ensemble methods
Better metric Precision, Recall, F-beta score instead of accuracy
CV Stratified cross-validation scores, bootstrapping, etc
Loss Weighted loss, Focal loss and its variant (example: Dicefocal loss)
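A sketch combining a weighted loss (class_weight), SMOTE oversampling (this assumes the third-party imbalanced-learn package), and F1 instead of accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: weighted loss via class_weight
clf_w = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Option 2: SMOTE oversampling of the minority class
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_s = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Evaluate with F1 instead of accuracy
print(f1_score(y_te, clf_w.predict(X_te)), f1_score(y_te, clf_s.predict(X_te)))
```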

Variable Encoding

| Encoding Type | Categorical Ordinal Var | Categorical Nominal Var | Description |
| --- | --- | --- | --- |
| Classical | One Hot | One Hot | 1 when true, otherwise 0; use when n(unique) is small/reasonable |
| | Label/Ordinal | - | Denotes hierarchical levels |
| | Hashing | Hashing | Hash-function mapping for each unique category; can handle large cardinality/n(unique) |
| | Binary | - | One Hot + Hashing |
| | | Frequency | Replace each category with its count/frequency of occurrence |
| Bayesian | Target | Target | Encode target info in the encoding using the conditional target value for each unique category value. Can lead to overfitting, so smoothed versions exist. |
| | LOO Target | LOO Target | Exclude the current row while calculating the target encoding for that row. |
| Deep Learning | ~ | Embedding | Entity embedding for high-cardinality variables like cities, etc. to capture semantic info by training a DNN/embedding layer (categorical to continuous values) |

| Numerical Encoding | Description |
| --- | --- |
| Binning (equal width) | Bin equally, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
| Binning (unequal width) | Bin unequally, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
| Binning (quantile) | Bin by quantiles of the variable, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
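A small pandas sketch of one-hot, frequency, smoothed target encoding, and quantile binning; the smoothing constant m and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA", "NY"],
                   "price": [10.0, 20.0, 12.0, 50.0, 22.0, 11.0],
                   "target": [1, 0, 1, 0, 1, 1]})

one_hot = pd.get_dummies(df["city"], prefix="city")          # one-hot
freq = df["city"].map(df["city"].value_counts())             # frequency encoding

# Smoothed target encoding: blend per-category mean with the global mean
m, global_mean = 5.0, df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)

# Quantile binning of a numerical variable
df["price_bin"] = pd.qcut(df["price"], q=3, labels=False)
```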

Statistics

Distributions

Distribution Question it addresses Parameters
Gaussian Models a continuous variable symmetric about its mean Params: mean, standard deviation
Bernoulli Probability of success (p) in a single trial Mean=p, var=p(1-p)
Binomial Probability of k successes in n trials if p is the probability of success in 1 trial. Models number of successes in fixed number of trials. Mean=np, var=np(1-p)
Negative Binomial Probability of k trials required for fixed r successes, if p is the probability of success in 1 trial. Models number of trials required for fixed no. of successes. Mean=r/p, var=r(1-p)/p^2
Poisson Probability of x events to occur in unit time interval if lambda events occur on average in unit time interval. Here, events (discrete var is x axis) Mean=lambda, var=lambda
Geometric Probability that the first success occurs on trial x (discrete var), if p is the Bernoulli probability of success in one trial (so a success occurs on average once every 1/p trials) Mean=1/p, var=(1-p)/p^2
Exponential Probability of an event to occur after x time interval (or any continuous variable like price/distance) if one event occurs on average after 1/lambda time interval. Here time (continuous var is x axis). Continuous equivalent of geometric distribution Mean=1/lambda, var=1/lambda^2
Gamma Probability of alpha events to occur in x time interval if one event occurs in 1/beta time interval. Here time (continuous var) is x axis. Generalised case of Exponential distribution. Mean=alpha/beta, var=alpha/beta^2
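The means/variances above can be checked quickly with scipy.stats; a sketch with arbitrary parameter values:

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)          # successes in n trials
print(binom.mean(), binom.var())   # np, np(1-p)
print(binom.pmf(3))                # P(k = 3 successes)

pois = stats.poisson(lam)          # events per unit time
print(pois.mean(), pois.var())     # lambda, lambda

geom = stats.geom(p)               # trials until first success
print(geom.mean(), geom.var())     # 1/p, (1-p)/p^2

expo = stats.expon(scale=1 / lam)  # waiting time until next event
print(expo.mean(), expo.var())     # 1/lambda, 1/lambda^2
```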

Types of Bias

Bias
Inherent bias in data
Sampling/data collection bias
Preprocessing bias
ML algo bias, like no L1/L2 regularizer, etc
ML algo evaluation metric bias

Goodness of fit test

Test Description
Chi-squared Between two categorical variables like compare unbiased or biased coin toss distribution
Kolmogorov-Smirnov Continuous variables, >2000 data points, a non-parametric test (no assumption about the underlying distribution)
Anderson Darling Similar to KS, gives more importance to tails
Shapiro-Wilk Checks if data is Gaussian (compares sample quantiles to Gaussian quantiles), <2000 data points
AIC Regression tasks, continuous variables, Combines GOF+model complexity (DOF)
BIC Like AIC but Bayesian in approach; takes the sample size into account, giving a stronger penalty for model complexity
R squared Continuous variables, MSE compared with Mean/Intercept only model MSE
Adjusted R squared Similar to R squared but takes model complexity into account (DOF)
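A sketch of a few of these tests and criteria using scipy and statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)

print(stats.shapiro(x[:200]))                              # Shapiro-Wilk (small samples)
print(stats.kstest(x, "norm", args=(x.mean(), x.std())))   # Kolmogorov-Smirnov
print(stats.anderson(x, dist="norm").statistic)            # Anderson-Darling

# AIC / BIC / (adjusted) R^2 from a simple regression fit
X = sm.add_constant(rng.normal(size=(500, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=500)
fit = sm.OLS(y, X).fit()
print(fit.aic, fit.bic, fit.rsquared, fit.rsquared_adj)
```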

Correlation

Independent Dependent Test
Categorical Categorical Chi-square test
Categorical Numerical t-test, z-test, ANOVA
Numerical Categorical Logistic regression
Numerical Numerical Pearson corr (linear), Spearman corr (rank, monotonic, categorical ordinal data also works)
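These association tests are one-liners in scipy; a sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 2 * x + rng.normal(size=300)

print(stats.pearsonr(x, y))    # numerical vs numerical (linear)
print(stats.spearmanr(x, y))   # numerical vs numerical (rank / monotonic)

# Categorical vs numerical: ANOVA on group means
group = rng.integers(0, 3, size=300)
print(stats.f_oneway(y[group == 0], y[group == 1], y[group == 2]))

# Categorical vs categorical: chi-square test on a contingency table
a = rng.integers(0, 2, size=300)
b = rng.integers(0, 2, size=300)
table = np.array([[np.sum((a == i) & (b == j)) for j in range(2)] for i in range(2)])
print(stats.chi2_contingency(table))
```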

Hypothesis Tests

Test Types Description
Shapiro-Wilk   H0: Distribution is normal, H1: Distribution is not normal; <2000 dataset
Levene Test Parametric H0: var(data1)=var(data2), H1: var(data1)!=var(data2), Homogeneity of variances
t-test Parametric, One-sided/Two-sided, One-sample/two-sample H0: mu0=mu1; H1: mu0!=mu1 (two-sided) or mu0>mu1 / mu0<mu1 (one-sided)
z-test Parametric, One-sided/Two-sided, One-sample/two-sample  
ANOVA Parametric, One-way (1 dependent and 1 independent var) / Two-way (2 independent, 1 dependent var)  
F-test Parametric  
Chi square Non-parametric  
Mann-Whitney Non-parametric  
Kruskal-Wallis Non-parametric  
Granger causality    
     

Classical ML

Regression & classification

Method Description
Linear Regression/OLS When y varies linearly with x and the residuals are independent and homoscedastic (constant variance)
Linear Regression extension Adding non-linear and interaction terms
Generalised linear models For GLM family, check y
Count data-> Poisson, Negative Binomial
Continuous->Normal
Continuous right skew: Gamma
Continuous left skew: Inverse Gauss
Probability distribution: Binomial, Multinomial
  For GLM link function, check y ~ x
y~x -> Identity
ln(y)~x->Log link
logit(y)~x: Logit link
Naive Bayes Mainly for classification problems; computes the Bayes probability of each target class for the given test example and picks the class with the highest probability. If the input variables are Gaussian distributed, use Gaussian NB; similarly Multinomial NB for count-like features.
Logistic Regression Classification algorithm where you do linear regression with target as logit(p) where p is the probability of being class1 in binary classification
Decision trees Choose a variable and split point using SSE or Gini index w.r.t. the target:
Classification: Split variable having minimum Gini, max information gain, or minimum entropy
Regression: Split variable at a boundary giving minimum SSR in its final leaf nodes
Random forest Bagging + Feature Selection i.e. training several DT each with randomly selected (default) sqrt(features) for classification and f/3 for regression.
XGBoost Gradient-boosted tree ensemble. Instead of choosing splits by minimum Gini score or SSE, it uses a similarity score for each variable and splitting point. The similarity score takes into account not only the residuals but also the number of data points in the leaf (in a plain DT the number of points in a leaf is not considered): roughly SS = (sum of residuals)^2 / (number of residuals + lambda), where lambda is a regularisation parameter. Information gain is then defined as sum(SS_children) - SS(parent), and a split is kept only if gain > gamma, where gamma is a regularisation (pruning) threshold.
It is also fast because it evaluates candidate splits on quantile sketches of each feature (a quantisation idea, loosely like qLoRA), evaluates splits in parallel, and uses caching
Adaboost Make a stump (DT with depth=1) for each variable and choose the one with the lowest Gini. Give it a weight based on its accuracy (amount of say). Then rebuild the dataset, giving more weight to (or duplicating) the misclassified examples, and repeat the stump building and weighting.
SVM Generally used for classification. Maximize the margin such that examples are classified correctly (classification) or target y values deviate less than epsilon from the regression line/curve (regression)
Hard SVM: Maximize the margin such that the classes are strictly on either side of the decision boundary.
Soft SVM: Maximize the margin while penalising misclassified points in proportion to their distance from the decision boundary (weighted by a penalty parameter).
Kernel SVM Use the kernel trick to go to a higher dimension, i.e. use kernel functions which make the low-to-high-dimensional computation cheap
LDA Supervised classification problem, find axes which maximize separation between means of classes and minimize the variance within the classes.
kNN Supervised Classification and regression algo, average/mode of k closest data points.
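A hedged statsmodels sketch of a GLM for count data (Poisson family with its default log link); the coefficients and data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))
rate = np.exp(X @ np.array([0.3, 0.5, -0.2]))      # log link: ln(mu) = X @ beta
y = rng.poisson(rate)                              # count target -> Poisson family

glm = sm.GLM(y, X, family=sm.families.Poisson())   # Poisson family, log link by default
result = glm.fit()
print(result.params, result.aic)
```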

Time Series

Method Description
Moving Average Smoothing method; gives equal importance to all data in the context window
Simple exponential smoothing Smoothing method; gives exponentially decaying weights to neighbouring data in the context window.
Holt's exponential smoothing Double exponential smoothing; can forecast level + trend
Holt-Winters exponential smoothing Triple exponential smoothing; can forecast level + trend + season, and thus the next cycle
AR Forecast using lags
MA Forecast using error/residual on previous lags
ARMA Forecast using both lags (AR -> PACF) and residuals of previous lags (MA -> ACF, Ljung-Box test). Grid-search the AR and MA orders that minimise AIC during training
ARIMA For non-stationary time series. The I (integrated) term is the order of differencing, i.e. how many times previous lags are subtracted to remove the trend and make the series stationary.
SARIMA Same as ARIMA, but additionally takes seasonality into account via seasonal AR, I and MA orders.
VAR AR used to forecast a TS when the forecast also depends on lags of other correlated time series. Apply Granger causality to check which series are predictive
Hybrid models Use of feature transformers like linear regression to predict trend and target transformers like tree-based, NN to predict seasonality.
Prophet Curve-fitting algorithm by Meta; can model weekly and yearly seasonality and holidays, and allows changepoints in the trend.
biLSTM Bi-directional LSTM module. Can use multiple input time series to predict one target TS. Needs normalisation but not stationarity of the TS.
ARCH/GARCH ARCH forecasts the variance from past squared residuals; GARCH additionally uses past variances. Example: apply to the %change (returns) of a stock price series, or to the residuals of a time series that exhibits trend/seasonality
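A statsmodels sketch of Holt-Winters smoothing and a small AIC-based grid search over ARIMA orders; the seasonal period and the order grid are placeholders:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=120)

# Triple (Holt-Winters) exponential smoothing: level + trend + season
hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(hw.forecast(12))

# Small grid search over ARIMA orders, keeping the fit with the lowest AIC
best = min(((p, d, q) for p in range(3) for d in range(2) for q in range(3)),
           key=lambda order: ARIMA(y, order=order).fit().aic)
print("best (p, d, q):", best)
```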

Clustering

Method Type Description
K-means Hard, Centroid Requires number of clusters prior (can be found via WCSS or Silhouette score), updates the centroid of each cluster iteratively.
Generalised K means Hard, Centroid Can have clusters of different sizes and shapes, resistant to outliers
K-medoids Hard, Centroid Requires the number of clusters a priori. Uses actual data points as centroids (medoids)
Agglomerative hierarchical Hard, Hierarchical Merge the closest data points/clusters iteratively up to some distance threshold.
DBSCAN Hard, Density Does not require the number of clusters a priori, but needs a neighbourhood radius eps (found via a kNN-distance plot) and a minimum number of points (default = 2 * features)
Spectral Hard First step: move to the lower dimension by using PCA or graph-based dimension reduction (node as points, edge as distance). Then apply a clustering algorithm like Kmeans
Fuzzy C means Soft, Centroid Update probability a point belongs to a centroid of a cluster iteratively. Needs number of clusters prior.
Gaussian Mixture models Soft, Distribution Use Gaussian distributions to define clusters; the optimal parameters of the Gaussians are found with the Expectation-Maximization (EM) method
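A scikit-learn sketch of picking k for K-means via the silhouette score and running DBSCAN; eps and min_samples are placeholder values:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Pick k with the silhouette score (higher is better)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 7)}
print(max(scores, key=scores.get), scores)

# DBSCAN: needs eps (kNN-distance plot) and min_samples instead of k
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # -1 marks noise/outliers
```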

DL Papers

Fully Convolution Neural Network

Paper Date Description
LeNet-5 December 1998 CNN + AvgPool network with a softmax MNIST classifier; first usage of CNNs and the weight-sharing idea; activations: tanh, softmax
AlexNet September 2012 1000 class classifier, Similar to LeNet but deeper, 1st usage of Dropout, ReLU
VGGNet-16/19 September 2014 Very deep (16/19 layers) and narrow CNN, large number of small filters 3X3 to capture more complex and granular features, stacking of many layers
InceptionNet September 2014 Stacking of parallel modules, 1X1 followed by either 3X3 or 5X5, or 1X1 modules. 1X1 reduces channels dim & time (bottleneck layers) and diff nXn extracts low & high-level features. Global avg pooling counters overfitting & no. of params, Vanishing gradient countered by 2 auxiliary classifiers using same labels (0.3 weighted in final loss).
InceptionNetv2 & v3 December 2015 Factorize nXn conv to 1Xn and nX1, thus improve speed. BN in auxiliary classifier, label smoothing for overfitting, RMSProp optimizer
ResNet December 2015 Residual connection in CNN blocks of deep CNN for vanishing grads.
InceptionNetV4 and InceptionResNet February 2016 More uniform inception modules with new dim reduction blocks, skip connect module o/p to i/p (same channel dim achieved by 1X1) & replaced pooling operation. Scale residual activation to 0.3 to decrease vanishing grads.
DenseNet August 2016 Each layer in block receives input from all previous layers counter vanishing grads, transition layers between dense blocks reduce spatial dim to balance computational cost and accuracy.
Xception October 2016 Based on Inception v3, but instead of Inception modules it uses depthwise separable convolutions (depthwise + pointwise): each input channel is convolved with its own spatial filter (e.g. 10 filters for 10 input channels), the outputs are concatenated, and a 1X1 pointwise convolution then mixes information across channels (grouped convolution: per-channel conv, concatenate, apply pointwise 1X1 conv)
ResNeXt November 2016 CNN with several Depthwise separable convolutions with ResNet.
MobileNetv1 April 2017 Depthwise separable convolutions network for mobile and embedded devices.
MobileNetv2 January 2018 Inverted residual connections (skip connections across narrow-wide-narrow blocks, with the wide part obtained by depthwise separable convolutions). Because ReLU destroys information in the narrow layers, a linear activation is used in the final layer of each block (linear bottleneck)
MobileNetv3 May 2019 AutoML tools, MnasNet to select coarse architecture using reinforcement learning and NetAdapt to fine tune. Use squeeze-and-excitation block (Squeeze: HXWXC -> 1X1XC, getting per channel overview, Excitation: 1X1XC -> 1X1XC using NN, use this now as per channel weight for input and continue with model), remove 3 expensive layers from v2.
EfficientNet May 2019 AutoML automated neural architecture search (NAS) to select balanced dim of width, depth and resolution (compound scaling method). Usage of efficient building blocks like inverted residuals, linear bottleneck
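A minimal PyTorch sketch of the depthwise separable convolution used by Xception/MobileNet (channel counts and input size are arbitrary):

```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) conv followed by a 1x1 pointwise conv to mix channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```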

One Stage Object detectors

Two Stage Object detectors

Paper Date Description
RCNN November 2013 Selective search for region proposals (~2000); warp each RoI and pass it through a CNN for feature extraction; linear SVM for classification and offset bbox regression. Non-max suppression to select among multiple bboxes of the same object
Fast RCNN April 2015 Pass the entire image through a large backbone CNN, run selective search on the output, crop/warp each RoI, do RoI pooling, and pass it through a small detection head NN for offset bbox regression, classification, and objectness score. Faster inference due to truncated SVD on the head weights to retain the important weights/nodes. About 25X faster than RCNN.
Faster RCNN June 2015 Replace selective search in Fast RCNN with a Region Proposal Network (RPN); 9 anchor boxes of different shapes to select the best bbox per object. About 10X faster than Fast RCNN
Mask RCNN March 2017 Add extra output to Faster RCNN to perform instance segmentation to classify each RoI.
Cascade RCNN December 2017 In Faster RCNN the detection head's bbox output depends on the RPN proposals, and a single IoU threshold gives poor inference quality. To counter this, a cascade of detection stages is used, each trained on the preceding stage's bbox predictions with an increasing IoU threshold.
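Non-max suppression, which the RCNN family uses to prune overlapping boxes of the same object; a plain-NumPy sketch with a typical (but arbitrary) IoU threshold:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept box too much.
    boxes: (N, 4) array of [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]
```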

Vision Transformers

Paper Date Description
Vision Transformer October 2020 Encoder block of the original transformer. Segments the image into patches, flattens each patch and projects it through a trainable embedding layer (patch embeddings), and treats the patches as tokens. Standard learnable 1D position embeddings (no gains with 2D-aware position embeddings). MLP layers with GELU non-linearity. CLS token for image classification. SOTA results when trained on large data (14M-300M images). Major issue: it is supervised pre-trained on ImageNet, unlike BERT. BEiT was developed to overcome this.
Data-efficient Image Transformer (DeiT) December 2020 Similar architecture like ViT, but distilled version->distillation token learns from teacher. Input: CLS, Patch1, Patch2.., Distil tokens. Distil token minimizes distillation loss (soft: KL-div with teacher, hard: CE with teacher). Cosine sim between distill and CLS embedding 0.93 (expected <1.0 by construction). Found ConvNet teacher better than Transformer. Distil token is better than CLS token for classification. Joint CLS+distil token gives a middle performance.
Swin transformer March 2021 Hierarchical transformer with smaller input patches (16X16 in ViT -> 4X4). RGB img -> Patch partition (4X4) -> Stage1 (Linear embedding (NN) -> Swin transformer block) -> Stage2 (Patch merging -> Swin transformer block) -> Stage3 (Patch merging -> Swin transformer block) -> Classification. Swin transformer block: the 1st block does window attention (attention within a window only, quick); the 2nd block does shifted-window attention (like sliding windows in a ConvNet: cyclic rotation of patches so that objects spanning window boundaries are captured within a new window). Patch merging / the hierarchical structure moves from local to global information capture. The output of each hierarchical level can be used as a backbone for object detection/segmentation algorithms (e.g. with Mask R-CNN).
BEiT June 2021 BERT-like self-supervised MIM pre-training for images. There is no vocabulary or tokenizer as in BERT for NLP tasks, so every image patch is associated with a visual token obtained from a separately trained dVAE (Tokenizer -> visual tokens (bottleneck) -> Decoder(img)). Pre-training: ViT backbone; the MIM (masked image modelling) task is to predict the discrete visual tokens of masked patches. RGB img -> patches -> patch-wise masking (40% max) -> flatten patches for patch embedding + position embedding -> ViT (BEiT encoder) -> MIM task to learn the visual tokens corresponding to masked patches. Fine-tuning: image classification (global pooling + linear softmax classifier); semantic segmentation (pre-trained BEiT as backbone encoder, several deconvolution layers as decoder to produce the segmentation); object detection.
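The patch-embedding step shared by ViT/DeiT/BEiT can be written as a strided convolution; a minimal PyTorch sketch with illustrative sizes:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each patch, prepend CLS, add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        num_patches = (img_size // patch_size) ** 2
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # prepend CLS token
        return torch.cat([cls, tokens], dim=1) + self.pos_emb

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 197, 768])
```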

NLP Encoder-Decoder models

Paper Date Description
Transformer June 2017 Encoder-decoder model; multi-head self-attention (Query, Key, Value) in the encoder, masked multi-head self-attention and multi-head cross-attention (K, V from the encoder output, Q from the decoder) in the decoder; skip connections; layer normalisation (across all features of a token); BPE tokenizer; trainable 512-dimensional text embedding; fixed non-trainable absolute position encoding (sin/cos); trained on the WMT English-German and English-French datasets.
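The scaled dot-product attention at the core of these blocks, as a small single-head NumPy sketch (no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 64))
print(scaled_dot_product_attention(Q, K, V).shape)        # (5, 64)
```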

NLP Encoder models

Paper Date Description
BERT October 2018 Encoder-only transformer, max input 512 tokens, WordPiece tokenizer, trainable text embedding of size 768, learned absolute position embeddings, multi-head self-attention (Query, Key, Value), final linear + softmax prediction layer, trained on left and right context (bidirectional embeddings). Pre-training: MLM task (MASK token) and Next Sentence Prediction (NSP) using CLS-sent1-SEP-sent2-SEP tokens. Fine-tuning: text classification via the CLS token; question-answer tasks via CLS-question-SEP-context-SEP.
RoBERTa July 2019 Optimized BERT: dynamic masking of tokens for MLM during pre-training, no NSP (low value), byte-level BPE tokenizer, 10X training data (160GB), larger batch size
ELECTRA March 2020 Replaced Token Detection instead of MLM (Generator BERT predicts MSK tokens, Discriminator BERT predicts isOriginal or isReplaced) thus focus on all tokens in sequence and not just MSK token in BERT, no NSP
ALBERT September 2019 Cross-layer parameter sharing across BERT blocks and embedding-matrix factorization (AXB = (AXN)*(NXB)), leading to ~1/10 the size of BERT. Pre-training with MLM and Sentence Order Prediction (SOP) instead of NSP, since SOP better captures inter-sentence coherence.
DistilBERT October 2019 60% faster, 40% smaller, retains ~97% of BERT's performance. Teacher-student knowledge distillation. Training: train the teacher BERT on the MLM task; then train the student with a KL-divergence loss between the student's and teacher's soft predictions, a CE loss between the student's hard predictions and the hard targets, and a cosine-similarity loss between the student and teacher embeddings.
TinyBERT September 2019 Reduce the student BERT to 4 encoder blocks with embedding size 312 (the teacher BERT has 12 blocks, embedding size 768). Knowledge distillation is applied not only at the prediction layer (as in DistilBERT), but also at the embedding layer and the transformer layers by minimizing the MSE between them. The dimension mismatch between student and teacher is solved by a learned projection matrix: S(NX312) x W(312X768) = T(NX768).
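The soft/hard distillation objective used by DistilBERT-style training can be sketched in a few lines of PyTorch; the temperature, weighting, and vocabulary size are placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence (teacher) with hard-target cross-entropy (labels)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 30522, requires_grad=True)   # e.g. vocabulary-sized logits
teacher = torch.randn(8, 30522)
labels = torch.randint(0, 30522, (8,))
print(distillation_loss(student, teacher, labels))
```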

NLP Decoder models

Paper Date Description
GPT June 2018 Generative pre-training and discriminative fine-tuning. Architecture similar to the decoder of the original transformer. Unsupervised pre-training (thus able to use a large corpus) on long book texts (thus capturing long-range info) with the standard language-model objective of maximizing sum(log P(x_i | x_1..x_{i-1})). The pre-trained language model is then supervised fine-tuned for text classification; text entailment x=(premise$hypothesis)->y=class; similarity x=(sent1$sent2) and x=(sent2$sent1)->y=1/0; multiple-choice question answering x=(context$ans1), x=(context$ans2)->softmax(2).
GPT-2 February 2019 Demonstrated that zero-shot learning (learning language-processing tasks without explicit supervision) is possible with a large LM. Used the new WebText dataset, a modified BPE tokenizer, minor modifications to the decoder, context size 512->1024, 1.5B params. SOTA perplexity on language modelling (predict P(x_t | x_1..x_{t-1})), accuracy on NER, perplexity/accuracy on the LAMBADA dataset (predict the last word of a long-range sentence), accuracy on common-sense reasoning; not SOTA on ROUGE F1 for summarisation or BLEU for translation, comparable accuracy on question answering. These evaluations were done without explicit training (fine-tuning).
GPT-3 May 2020 175B params, Auto-regressive language model. Similar model like GPT-2. SOTA on Language model, 35->20 on perplexity, accuracy score on 0 shot, one-shot, few shot; in question answer in 1/3 dataset over T5, not best for translation since 93% training in English, mixed for common sense reasoning, arithmetic task performance improves with shots
BLOOM November 2022 176B-parameter, largest multilingual open-source LLM. Decoder-only transformer similar to GPT-3, trained on a 1.6TB Hugging Face dataset, 46 languages, 13 programming languages. Major architectural changes: ALiBi positional embeddings (attenuate the attention score based on the distance between key and query positions, instead of adding position info to the embedding layer) and a LayerNorm after the embedding layer, which improves training stability (though it affects zero-shot generalisation). Pre-trained, then fine-tuned for multitask and contrastive objectives, yielding a multilingual information-retrieval model and a multilingual semantic textual similarity (STS) model. Performance on English-only tasks is not affected despite the multilingual training.
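A simplified NumPy sketch of the ALiBi idea used in BLOOM: a head-specific linear penalty proportional to the query-key distance is added to the attention scores; the 2^(-8h/n) slope schedule below is an assumed common choice, not taken from the summary above:

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head additive attention bias: -slope * (query_pos - key_pos), causal positions only."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)       # geometric slope per head
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # i - j
    distance = np.tril(distance)                     # only past positions matter (causal)
    return -slopes[:, None, None] * distance         # shape (heads, seq, seq), added to QK^T/sqrt(d)

print(alibi_bias(seq_len=4, num_heads=2)[0])
```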

Document AI

Paper Date Description
LayoutLM December 2019 Uses BERT as the backbone model; takes trainable text embeddings plus 4 new positional embeddings (the bbox of each token) to form the LayoutLM embedding. Text and bboxes are extracted with a pre-built OCR reader. In parallel, the RoI/bbox text in image form is passed through Faster-RCNN to get image embeddings, which are then used together with the LayoutLM embeddings for fine-tuning tasks. Pre-trained with Masked Visual-Language Modelling (MVLM), learning a masked text token from its position embedding and the other text+position embeddings, and with multi-label document classification to learn a document representation in the CLS token. Fine-tuned for form understanding, receipt understanding, and document image classification, i.e. text and layout info in pre-training and visual info in fine-tuning. WordPiece tokenizer, BIESO tags for token labeling.
TableNet January 2020 Extracts tables from a scanned document. A segmentation model similar to UNet, but with strided convolutions for upsampling. VGG-19 as encoder and two separate decoders for the table segmentation and column segmentation targets. The segmentation outputs are used to keep only the Tesseract-OCR bboxes/text lying inside the table and its columns. Rows are detected by looking for demarcation lines between vertically placed words via the Radon transform. For multi-line rows, a row is marked where all columns have horizontal entries; otherwise every horizontal line defaults to a row.
LayoutLMv2 December 2020 Multi-modal embeddings with a spatial-aware self-attention mechanism (self-attention weights with a bias vector encoding relative spatial info). Pre-training uses text, layout, and image embeddings. Text emb = token emb + 1D position emb for the token index + segment emb for different text segments. Visual emb = flattened features from ResNeXt-FPN + 1D position emb for the token index + segment emb. Layout emb = 6 bbox coordinates. Pre-training: MVLM; Text-Image Alignment (TIA): cover image regions of token lines and predict covered or not; Text-Image Matching: the CLS token predicts whether the image belongs to the same text. Fine-tuning: document image classification, token-level classification, visual question answering on document images.
Donut November 2021 OCR-free Document Understanding Transformer (avoids OCR inefficiencies such as undetected text and lost document structure). Swin Transformer as encoder and BART as decoder. Trained with a teacher-forcing strategy. The input image goes to the encoder; the input prompt (…, (info extract)) and the encoder output go to the decoder. Pre-training: train the model to read all text from top to bottom, minimizing the CE of next-token prediction given the previous tokens and the image. Fine-tuning: document classification (CDIP dataset), document info extraction (receipt data, CORD, Ticket, business card), document visual question answering (DocVQA).
LayoutLMv3 April 2022
UDOP December 2022 Unifies Vision-Text-Layout through a VTL transformer: one VTL encoder and 2 decoders, text-layout (TL) and vision (V). The encoder + TL decoder follows the T5 architecture (generates text and layout tokens in a seq-to-seq manner). The V decoder is a Masked Auto-Encoder (MAE) decoder (generates image pixels). T-V embedding: divide the image into patches and add the patch tokens and corresponding text (if present) as a unified TV embedding. For the VTL embedding, discretise the bboxes and add them to the TV embedding. Generative pre-training (input is a prompt): self-supervised (layout modelling: bbox/text, img; visual text recognition: text/bbox; text-layout recognition: text, bbox/img; MIM: img/text, bbox) and supervised pre-training (classification; layout analysis: give the bbox for a prompted paragraph; info extraction; question answering).
DocLLM January 2024