AI_in_GIST

Table of Contents

Preprocessing

Handle Missing Values

| Classification | Regression | Time Series | Clustering |
| --- | --- | --- | --- |
| Deletion | Deletion | Deletion | Deletion |
| Conditional imputation with mode | Conditional imputation with mean/median | Forward/backward fill | K-means conditional imputation |
| - | Interpolation (linear, spline, etc.) | Interpolation (linear/spline) | - |
| Treat as a special category like UNK | - | - | Treat as a separate cluster |
| - | - | Coarsen binning to reduce effect (monthly -> quarterly) | - |
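A minimal pandas/scikit-learn sketch of these imputation options (column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, None, 40.0, 35.0],
                   "city": ["NY", "LA", None, "NY"]})

# Deletion
dropped = df.dropna()

# Conditional imputation: median for numerical, mode for categorical
age_imputed = SimpleImputer(strategy="median").fit_transform(df[["age"]])
city_mode = df["city"].fillna(df["city"].mode()[0])

# Treat missing values as their own category (e.g. "UNK")
city_unk = df["city"].fillna("UNK")

# Time series: forward/backward fill or interpolation
ts = pd.Series([1.0, None, None, 4.0])
ts_filled = ts.ffill().bfill()
ts_interp = ts.interpolate(method="linear")
```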

Detect Outliers

Method Description
Inter-quartile range (IQR) Statistical method to detect outliers using quantiles
Isolation forest Unsupervised, random-forest based; outliers need fewer cuts/branches to isolate, so they separate early. Needs a prior estimate of the outlier fraction (contamination).
One-Class SVM Unsupervised; the normal data forms one class and the origin acts as the only example of the other class. Maximize the margin between them. An RBF kernel usually works better.
DBSCAN Clustering algorithm; detects the anomaly cluster (the cluster with the fewest points, or the noise points).
Auto-encoder Deep-learning method; outliers lie far from the learned data embedding in the bottleneck layer and therefore have a larger reconstruction loss.
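A rough sketch of the IQR rule and Isolation Forest with scikit-learn; the 1.5 multiplier and the contamination fraction are conventional but arbitrary choices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -9.0]])  # data with two injected outliers

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Isolation Forest: needs a rough prior on the outlier fraction (contamination)
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(x.reshape(-1, 1))   # -1 = outlier, 1 = inlier
```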

Handle Outliers

Method Description
Deletion Do only if you know the reason
Transformation Can lead to loss of info present in the shape of input variable
Truncate Replace values beyond a threshold with the threshold value (winsorizing)
Binning Binning numerical vars
Robust model Like ensemble methods
Robust loss Like Huber loss
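A small illustration of truncation (clipping at percentile thresholds) and a robust Huber loss via scikit-learn's HuberRegressor; the percentiles are placeholders:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = 3 * x.ravel() + rng.normal(size=200)
y[:5] += 30  # inject a few outliers

# Truncate: clip values beyond the 1st/99th percentile to the threshold
lo, hi = np.percentile(y, [1, 99])
y_clipped = np.clip(y, lo, hi)

# Robust loss: Huber loss is quadratic near zero, linear for large residuals
model = HuberRegressor().fit(x, y)
print(model.coef_)
```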

Dimension Reduction Techniques

Method Description
Feature Selection Filter (remove correlated features; drop unimportant features based on feature importance)
  Subsets (RFE: fit a model such as a DT, drop the least important feature, and repeat until the desired number of features remains) & wrapper techniques like forward, backward, bidirectional selection (same idea as RFE but you choose the search direction and the eliminator model)
  Train loop (LASSO)
Feature Extraction PCA (linear), kernel PCA (e.g. RBF kernel), t-SNE (nonlinear), UMAP (better for nonlinear, large datasets)
  LDA (supervised ML, maximizes class-separability directions)
  Auto-encoder (DL) as in clustering
Clustering Group categorical and numerical variables; use cluster membership as a compact feature
Binning Binning for numerical and categorical variables
Aggregation & transformation E.g. polynomial features, interaction terms
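A sketch of feature selection (RFE) and feature extraction (PCA) with scikit-learn; the dataset and the number of kept features are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: recursively drop the least important feature
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# Feature extraction: project onto the top principal components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(X_selected.shape, X_pca.shape, pca.explained_variance_ratio_.sum())
```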

Handle Skewed Data

| Function Transformation | Power Transformation | Quantile Transformation |
| --- | --- | --- |
| Logarithmic (>0, right-skewed input) | Box-Cox (input must be >0) | Quantile normalization |
| Squared (left-skewed input) | Yeo-Johnson | Rank transformation |
| Square root (>0, right-skewed input) | | |
| Reciprocal (>0, strongly right-skewed input) | | |
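All three transformation families are available in scikit-learn; a short sketch (Box-Cox requires strictly positive input, Yeo-Johnson does not):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed, > 0

x_log = np.log(x)                                               # function transform
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)  # power transform, > 0 only
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)
x_quantile = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=100).fit_transform(x)
```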

Techniques to handle imbalanced datasets

Method Description
Oversample, undersample Oversample the minority class with replacement; undersample/remove entries from the majority class
BalancedML Balanced ensemble estimators like BalancedBaggingClassifier, which resample each bootstrap sample to balance the classes
SMOTE Synthetically generate new minority-class samples by interpolating between a sample and its k nearest neighbours
Appropriate ML Algorithms more robust to imbalance, like tree ensemble methods
Better metric Precision, Recall, F-beta score instead of accuracy
CV Stratified cross-validation scores, bootstrapping, etc
Loss Weighted loss, Focal loss and its variant (example: Dicefocal loss)
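A sketch combining a weighted loss (class_weight), SMOTE oversampling (this assumes the third-party imbalanced-learn package), and F1 instead of accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: weighted loss via class_weight
clf_w = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Option 2: SMOTE oversampling of the minority class
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_s = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Evaluate with F1 instead of accuracy
print(f1_score(y_te, clf_w.predict(X_te)), f1_score(y_te, clf_s.predict(X_te)))
```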

Variable Encoding

| Encoding Type | Categorical Ordinal Var | Categorical Nominal Var | Description |
| --- | --- | --- | --- |
| Classical | One Hot | One Hot | 1 when true, otherwise 0; use when n(unique) is small/reasonable |
| | Label/Ordinal | - | Denotes hierarchical levels |
| | Hashing | Hashing | Hash-function mapping for each unique category; can handle large cardinality/n(unique) |
| | Binary | - | One Hot + Hashing |
| | | Frequency | Replace each category with its count/frequency of occurrence |
| Bayesian | Target | Target | Encode target info in the encoding using the conditional target value for each unique category value. Can lead to overfitting, so smoothed versions exist. |
| | LOO Target | LOO Target | Exclude the current row while calculating the target encoding for that row. |
| Deep Learning | ~ | Embedding | Entity embedding for high-cardinality variables like cities, etc. to capture semantic info by training a DNN/embedding layer (categorical to continuous values) |

| Numerical Encoding | Description |
| --- | --- |
| Binning (equal width) | Bin equally, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
| Binning (unequal width) | Bin unequally, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
| Binning (quantile) | Bin by quantiles of the variable, replace the value with the bin number. Depends on the numerical variable and the problem at hand |
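A small pandas sketch of one-hot, frequency, smoothed target encoding, and quantile binning; the smoothing constant m and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA", "NY"],
                   "price": [10.0, 20.0, 12.0, 50.0, 22.0, 11.0],
                   "target": [1, 0, 1, 0, 1, 1]})

one_hot = pd.get_dummies(df["city"], prefix="city")          # one-hot
freq = df["city"].map(df["city"].value_counts())             # frequency encoding

# Smoothed target encoding: blend per-category mean with the global mean
m, global_mean = 5.0, df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)

# Quantile binning of a numerical variable
df["price_bin"] = pd.qcut(df["price"], q=3, labels=False)
```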

Statistics

Distributions

Distribution Question it addresses Parameters
Gaussian Models a continuous variable symmetric about its mean Params: mean, standard deviation
Bernoulli Probability of success (p) in a single trial Mean=p, var=p(1-p)
Binomial Probability of k successes in n trials if p is the probability of success in 1 trial. Models number of successes in fixed number of trials. Mean=np, var=np(1-p)
Negative Binomial Probability of k trials required for fixed r successes, if p is the probability of success in 1 trial. Models number of trials required for fixed no. of successes. Mean=r/p, var=r(1-p)/p^2
Poisson Probability of x events to occur in unit time interval if lambda events occur on average in unit time interval. Here, events (discrete var is x axis) Mean=lambda, var=lambda
Geometric Probability that the first success occurs on trial x (discrete var), if p is the Bernoulli probability of success in one trial (so a success occurs on average once every 1/p trials) Mean=1/p, var=(1-p)/p^2
Exponential Probability of an event to occur after x time interval (or any continuous variable like price/distance) if one event occurs on average after 1/lambda time interval. Here time (continuous var is x axis). Continuous equivalent of geometric distribution Mean=1/lambda, var=1/lambda^2
Gamma Probability of alpha events to occur in x time interval if one event occurs in 1/beta time interval. Here time (continuous var) is x axis. Generalised case of Exponential distribution. Mean=alpha/beta, var=alpha/beta^2
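The means/variances above can be checked quickly with scipy.stats; a sketch with arbitrary parameter values:

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)          # successes in n trials
print(binom.mean(), binom.var())   # np, np(1-p)
print(binom.pmf(3))                # P(k = 3 successes)

pois = stats.poisson(lam)          # events per unit time
print(pois.mean(), pois.var())     # lambda, lambda

geom = stats.geom(p)               # trials until first success
print(geom.mean(), geom.var())     # 1/p, (1-p)/p^2

expo = stats.expon(scale=1 / lam)  # waiting time until next event
print(expo.mean(), expo.var())     # 1/lambda, 1/lambda^2
```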

Types of Bias

Bias
Inherent bias in data
Sampling/data collection bias
Preprocessing bias
ML algo bias, like no L1/L2 regularizer, etc
ML algo evaluation metric bias

Goodness of fit test

Test Description
Chi-squared Between two categorical variables like compare unbiased or biased coin toss distribution
Kolmogorov-Smirnov Continuous variables, >2000 data points, a non-parametric test (no assumption about the underlying distribution)
Anderson Darling Similar to KS, gives more importance to tails
Shapiro-Wilk Checks if data is Gaussian (compares sample quantiles to Gaussian quantiles), <2000 data points
AIC Regression tasks, continuous variables, Combines GOF+model complexity (DOF)
BIC Like AIC but Bayesian in approach; takes the sample size into account, giving a stronger penalty for model complexity
R squared Continuous variables, MSE compared with Mean/Intercept only model MSE
Adjusted R squared Similar to R squared but takes model complexity into account (DOF)
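A sketch of a few of these tests and criteria using scipy and statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)

print(stats.shapiro(x[:200]))                              # Shapiro-Wilk (small samples)
print(stats.kstest(x, "norm", args=(x.mean(), x.std())))   # Kolmogorov-Smirnov
print(stats.anderson(x, dist="norm").statistic)            # Anderson-Darling

# AIC / BIC / (adjusted) R^2 from a simple regression fit
X = sm.add_constant(rng.normal(size=(500, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=500)
fit = sm.OLS(y, X).fit()
print(fit.aic, fit.bic, fit.rsquared, fit.rsquared_adj)
```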

Correlation

Independent Dependent Test
Categorical Categorical Chi-square test
Categorical Numerical t-test, z-test, ANOVA
Numerical Categorical Logistic regression
Numerical Numerical Pearson corr (linear), Spearman corr (rank, monotonic, categorical ordinal data also works)
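These association tests are one-liners in scipy; a sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 2 * x + rng.normal(size=300)

print(stats.pearsonr(x, y))    # numerical vs numerical (linear)
print(stats.spearmanr(x, y))   # numerical vs numerical (rank / monotonic)

# Categorical vs numerical: ANOVA on group means
group = rng.integers(0, 3, size=300)
print(stats.f_oneway(y[group == 0], y[group == 1], y[group == 2]))

# Categorical vs categorical: chi-square test on a contingency table
a = rng.integers(0, 2, size=300)
b = rng.integers(0, 2, size=300)
table = np.array([[np.sum((a == i) & (b == j)) for j in range(2)] for i in range(2)])
print(stats.chi2_contingency(table))
```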

Hypothesis Tests

Test Types Description
Shapiro-Wilk   H0: Distribution is normal, H1: Distribution is not normal; <2000 dataset
Levene Test Parametric H0: var(data1)=var(data2), H1: var(data1)!=var(data2), Homogeneity of variances
t-test Parametric, One-sided/Two-sided, One-sample/two-sample H0: mu0=mu1; H1: mu0!=mu1 (two-sided) or mu0>mu1 / mu0<mu1 (one-sided)
z-test Parametric, One-sided/Two-sided, One-sample/two-sample  
ANOVA Parametric, One-way (1 dependent and 1 independent var) / Two-way (2 independent, 1 dependent var)  
F-test Parametric  
Chi square Non-parametric  
Mann-Whitney Non-parametric  
Kruskal-Wallis Non-parametric  
Granger causality    
     

Classical ML

Regression & classification

Method Description
Linear Regression/OLS When y varies linearly with x and the residuals are independent and homoscedastic (constant variance)
Linear Regression extension Adding non-linear and interaction terms
Generalised linear models For GLM family, check y
Count data-> Poisson, Negative Binomial
Continuous->Normal
Continuous right skew: Gamma
Continuous left skew: Inverse Gauss
Probability distribution: Binomial, Multinomial
  For GLM link function, check y ~ x
y~x -> Identity
ln(y)~x->Log link
logit(y)~x: Logit link
Naive Bayes Mainly for classification problems; computes the Bayes probability of each target class for the given test example and picks the class with the highest probability. If the input variables are Gaussian distributed, use Gaussian NB; similarly Multinomial NB for count-like features.
Logistic Regression Classification algorithm where you do linear regression with target as logit(p) where p is the probability of being class1 in binary classification
Decision trees Choose a variable and split point using SSE or Gini index w.r.t. the target:
Classification: Split variable having minimum Gini, max information gain, or minimum entropy
Regression: Split variable at a boundary giving minimum SSR in its final leaf nodes
Random forest Bagging + Feature Selection i.e. training several DT each with randomly selected (default) sqrt(features) for classification and f/3 for regression.
XGBoost Gradient-boosted tree ensemble. Instead of choosing splits by minimum Gini score or SSE, it uses a similarity score for each variable and splitting point. The similarity score takes into account not only the residuals but also the number of data points in the leaf (in a plain DT the number of points in a leaf is not considered): roughly SS = (sum of residuals)^2 / (number of residuals + lambda), where lambda is a regularisation parameter. Information gain is then defined as sum(SS_children) - SS(parent), and a split is kept only if gain > gamma, where gamma is a regularisation (pruning) threshold.
It is also fast because it evaluates candidate splits on quantile sketches of each feature (a quantisation idea, loosely like qLoRA), evaluates splits in parallel, and uses caching
Adaboost Make a stump (DT with depth=1) for each variable and choose the one with the lowest Gini. Give it a weight based on its accuracy (amount of say). Then rebuild the dataset, giving more weight to (or duplicating) the misclassified examples, and repeat the stump building and weighting.
SVM Generally used for classification. Maximize the margin such that examples are classified correctly (classification) or target y values deviate less than epsilon from the regression line/curve (regression)
Hard SVM: Maximize the margin such that the classes are strictly on either side of the decision boundary.
Soft SVM: Maximize the margin while penalising misclassified points in proportion to their distance from the decision boundary (weighted by a penalty parameter).
Kernel SVM Use the kernel trick to go to a higher dimension, i.e. use kernel functions which make the low-to-high-dimensional computation cheap
LDA Supervised classification problem, find axes which maximize separation between means of classes and minimize the variance within the classes.
kNN Supervised Classification and regression algo, average/mode of k closest data points.
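A hedged statsmodels sketch of a GLM for count data (Poisson family with its default log link); the coefficients and data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))
rate = np.exp(X @ np.array([0.3, 0.5, -0.2]))      # log link: ln(mu) = X @ beta
y = rng.poisson(rate)                              # count target -> Poisson family

glm = sm.GLM(y, X, family=sm.families.Poisson())   # Poisson family, log link by default
result = glm.fit()
print(result.params, result.aic)
```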

Time Series

Method Description
Moving Average Smoothing method; gives equal importance to all data in the context window
Simple exponential smoothing Smoothing method; gives exponentially decaying weights to neighbouring data in the context window.
Holt's exponential smoothing Double exponential smoothing; can forecast level + trend
Holt-Winters exponential smoothing Triple exponential smoothing; can forecast level + trend + season, and thus the next cycle
AR Forecast using lags
MA Forecast using error/residual on previous lags
ARMA Forecast using both lags (AR -> PACF) and residuals of previous lags (MA -> ACF, Ljung-Box test). Grid-search the AR and MA orders that minimise AIC during training
ARIMA For non-stationary time series. The I (integrated) term is the order of differencing, i.e. how many times previous lags are subtracted to remove the trend and make the series stationary.
SARIMA Same as ARIMA, but additionally takes seasonality into account via seasonal AR, I and MA orders.
VAR AR used to forecast a TS when the forecast also depends on lags of other correlated time series. Apply Granger causality to check which series are predictive
Hybrid models Use of feature transformers like linear regression to predict trend and target transformers like tree-based, NN to predict seasonality.
Prophet Curve-fitting algorithm by Meta; can model weekly and yearly seasonality and holidays, and allows changepoints in the trend.
biLSTM Bi-directional LSTM module. Can use multiple input time series to predict one target TS. Needs normalisation but not stationarity of the TS.
ARCH/GARCH ARCH forecasts the variance from past squared residuals; GARCH additionally uses past variances. Example: apply to the %change (returns) of a stock price series, or to the residuals of a time series that exhibits trend/seasonality
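A statsmodels sketch of Holt-Winters smoothing and a small AIC-based grid search over ARIMA orders; the seasonal period and the order grid are placeholders:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=120)

# Triple (Holt-Winters) exponential smoothing: level + trend + season
hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(hw.forecast(12))

# Small grid search over ARIMA orders, keeping the fit with the lowest AIC
best = min(((p, d, q) for p in range(3) for d in range(2) for q in range(3)),
           key=lambda order: ARIMA(y, order=order).fit().aic)
print("best (p, d, q):", best)
```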

Clustering

Method Type Description
K-means Hard, Centroid Requires number of clusters prior (can be found via WCSS or Silhouette score), updates the centroid of each cluster iteratively.
Generalised K means Hard, Centroid Can have clusters of different sizes and shapes, resistant to outliers
K-medoids Hard, Centroid Requires the number of clusters a priori. Uses actual data points as centroids (medoids)
Agglomerative hierarchical Hard, Hierarchical Merge the closest data points/clusters iteratively up to some distance threshold.
DBSCAN Hard, Density Does not require the number of clusters a priori, but needs a neighbourhood radius eps (found via a kNN-distance plot) and a minimum number of points (default = 2 * features)
Spectral Hard First step: move to the lower dimension by using PCA or graph-based dimension reduction (node as points, edge as distance). Then apply a clustering algorithm like Kmeans
Fuzzy C means Soft, Centroid Update probability a point belongs to a centroid of a cluster iteratively. Needs number of clusters prior.
Gaussian Mixture models Soft, Distribution Use Gaussian distributions to define clusters; the optimal parameters of the Gaussians are found with the Expectation-Maximization (EM) method
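A scikit-learn sketch of picking k for K-means via the silhouette score and running DBSCAN; eps and min_samples are placeholder values:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Pick k with the silhouette score (higher is better)
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 7)}
print(max(scores, key=scores.get), scores)

# DBSCAN: needs eps (kNN-distance plot) and min_samples instead of k
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # -1 marks noise/outliers
```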

DL Papers

Fully Convolution Neural Network

Paper Date Description
LeNet-5 December 1998 CNN + AvgPool network with a softmax MNIST classifier; first usage of CNNs and the weight-sharing idea; activations: tanh, softmax
AlexNet September 2012 1000 class classifier, Similar to LeNet but deeper, 1st usage of Dropout, ReLU
VGGNet-16/19 September 2014 Very deep (16/19 layers) and narrow CNN, large number of small filters 3X3 to capture more complex and granular features, stacking of many layers
InceptionNet September 2014 Stacking of parallel modules, 1X1 followed by either 3X3 or 5X5, or 1X1 modules. 1X1 reduces channels dim & time (bottleneck layers) and diff nXn extracts low & high-level features. Global avg pooling counters overfitting & no. of params, Vanishing gradient countered by 2 auxiliary classifiers using same labels (0.3 weighted in final loss).
InceptionNetv2 & v3 December 2015 Factorize nXn conv to 1Xn and nX1, thus improve speed. BN in auxiliary classifier, label smoothing for overfitting, RMSProp optimizer
ResNet December 2015 Residual connection in CNN blocks of deep CNN for vanishing grads.
InceptionNetV4 and InceptionResNet February 2016 More uniform inception modules with new dim reduction blocks, skip connect module o/p to i/p (same channel dim achieved by 1X1) & replaced pooling operation. Scale residual activation to 0.3 to decrease vanishing grads.
DenseNet August 2016 Each layer in block receives input from all previous layers counter vanishing grads, transition layers between dense blocks reduce spatial dim to balance computational cost and accuracy.
Xception October 2016 Based on Inception v3, but instead of Inception modules it uses depthwise separable convolutions (depthwise + pointwise): each input channel is convolved with its own spatial filter (e.g. 10 filters for 10 input channels), the outputs are concatenated, and a 1X1 pointwise convolution then mixes information across channels (grouped convolution: per-channel conv, concatenate, apply pointwise 1X1 conv)
ResNeXt November 2016 CNN with several Depthwise separable convolutions with ResNet.
MobileNetv1 April 2017 Depthwise separable convolutions network for mobile and embedded devices.
MobileNetv2 January 2018 Inverted residual connections (skip connections across narrow-wide-narrow blocks, with the wide part obtained by depthwise separable convolutions). Because ReLU destroys information in the narrow layers, a linear activation is used in the final layer of each block (linear bottleneck)
MobileNetv3 May 2019 AutoML tools, MnasNet to select coarse architecture using reinforcement learning and NetAdapt to fine tune. Use squeeze-and-excitation block (Squeeze: HXWXC -> 1X1XC, getting per channel overview, Excitation: 1X1XC -> 1X1XC using NN, use this now as per channel weight for input and continue with model), remove 3 expensive layers from v2.
EfficientNet May 2019 AutoML automated neural architecture search (NAS) to select balanced dim of width, depth and resolution (compound scaling method). Usage of efficient building blocks like inverted residuals, linear bottleneck
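A minimal PyTorch sketch of the depthwise separable convolution used by Xception/MobileNet (channel counts and input size are arbitrary):

```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) conv followed by a 1x1 pointwise conv to mix channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```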

One Stage Object detectors

Two Stage Object detectors

Paper Date Description
RCNN November 2013 Selective search for region proposals (~2000); warp each RoI and pass it through a CNN for feature extraction; linear SVM for classification and offset bbox regression. Non-max suppression to select among multiple bboxes of the same object
Fast RCNN April 2015 Pass the entire image through a large backbone CNN, run selective search on the output, crop/warp each RoI, do RoI pooling, and pass it through a small detection head NN for offset bbox regression, classification, and objectness score. Faster inference due to truncated SVD on the head weights to retain the important weights/nodes. About 25X faster than RCNN.
Faster RCNN June 2015 Replace selective search in Fast RCNN with a Region Proposal Network (RPN); 9 anchor boxes of different shapes to select the best bbox per object. About 10X faster than Fast RCNN
Mask RCNN March 2017 Add extra output to Faster RCNN to perform instance segmentation to classify each RoI.
Cascade RCNN December 2017 In Faster RCNN the detection head's bbox output depends on the RPN proposals, and a single IoU threshold gives poor inference quality. To counter this, a cascade of detection stages is used, each trained on the preceding stage's bbox predictions with an increasing IoU threshold.
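Non-max suppression, which the RCNN family uses to prune overlapping boxes of the same object; a plain-NumPy sketch with a typical (but arbitrary) IoU threshold:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept box too much.
    boxes: (N, 4) array of [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]
```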

Vision Transformers

Paper Date Description
Vision Transformer October 2020 Encoder block of the original transformer. Segments the image into patches, flattens each patch and projects it through a trainable embedding layer (patch embeddings), and treats the patches as tokens. Standard learnable 1D position embeddings (no gains with 2D-aware position embeddings). MLP layers with GELU non-linearity. CLS token for image classification. SOTA results when trained on large data (14M-300M images). Major issue: it is supervised pre-trained on ImageNet, unlike BERT. BEiT was developed to overcome this.
Data-efficient Image Transformer (DeiT) December 2020 Similar architecture like ViT, but distilled version->distillation token learns from teacher. Input: CLS, Patch1, Patch2.., Distil tokens. Distil token minimizes distillation loss (soft: KL-div with teacher, hard: CE with teacher). Cosine sim between distill and CLS embedding 0.93 (expected <1.0 by construction). Found ConvNet teacher better than Transformer. Distil token is better than CLS token for classification. Joint CLS+distil token gives a middle performance.
Swin transformer March 2021 Hierarchical transformer with smaller input patches (16X16 in ViT -> 4X4). RGB img -> Patch partition (4X4) -> Stage1 (Linear embedding (NN) -> Swin transformer block) -> Stage2 (Patch merging -> Swin transformer block) -> Stage3 (Patch merging -> Swin transformer block) -> Classification. Swin transformer block: the 1st block does window attention (attention within a window only, quick); the 2nd block does shifted-window attention (like sliding windows in a ConvNet: cyclic rotation of patches so that objects spanning window boundaries are captured within a new window). Patch merging / the hierarchical structure moves from local to global information capture. The output of each hierarchical level can be used as a backbone for object detection/segmentation algorithms (e.g. with Mask R-CNN).
BEiT June 2021 BERT-like self-supervised MIM pre-training for images. There is no vocabulary or tokenizer as in BERT for NLP tasks, so every image patch is associated with a visual token obtained from a separately trained dVAE (Tokenizer -> visual tokens (bottleneck) -> Decoder(img)). Pre-training: ViT backbone; the MIM (masked image modelling) task is to predict the discrete visual tokens of masked patches. RGB img -> patches -> patch-wise masking (40% max) -> flatten patches for patch embedding + position embedding -> ViT (BEiT encoder) -> MIM task to learn the visual tokens corresponding to masked patches. Fine-tuning: image classification (global pooling + linear softmax classifier); semantic segmentation (pre-trained BEiT as backbone encoder, several deconvolution layers as decoder to produce the segmentation); object detection.
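The patch-embedding step shared by ViT/DeiT/BEiT can be written as a strided convolution; a minimal PyTorch sketch with illustrative sizes:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each patch, prepend CLS, add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        num_patches = (img_size // patch_size) ** 2
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # prepend CLS token
        return torch.cat([cls, tokens], dim=1) + self.pos_emb

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 197, 768])
```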

NLP Encoder-Decoder models

Paper Date Description
Transformer June 2017 Encoder-decoder model; multi-head self-attention (Query, Key, Value) in the encoder, masked multi-head self-attention and multi-head cross-attention (K, V from the encoder output, Q from the decoder) in the decoder; skip connections; layer normalisation (across all features of a token); BPE tokenizer; trainable 512-dimensional text embedding; fixed non-trainable absolute position encoding (sin/cos); trained on the WMT English-German and English-French datasets.
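The scaled dot-product attention at the core of these blocks, as a small single-head NumPy sketch (no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 64))
print(scaled_dot_product_attention(Q, K, V).shape)        # (5, 64)
```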

NLP Encoder models

Paper Date Description
BERT October 2018 Encoder-only transformer, max input 512 tokens, WordPiece tokenizer, trainable text embedding of size 768, learned absolute position embeddings, multi-head self-attention (Query, Key, Value), final linear + softmax prediction layer, trained on left and right context (bidirectional embeddings). Pre-training: MLM task (MASK token) and Next Sentence Prediction (NSP) using CLS-sent1-SEP-sent2-SEP tokens. Fine-tuning: text classification via the CLS token; question-answer tasks via CLS-question-SEP-context-SEP.
RoBERTa July 2019 Optimized BERT: dynamic masking of tokens for MLM during pre-training, no NSP (low value), byte-level BPE tokenizer, 10X training data (160GB), larger batch size
ELECTRA March 2020 Replaced Token Detection instead of MLM (Generator BERT predicts MSK tokens, Discriminator BERT predicts isOriginal or isReplaced) thus focus on all tokens in sequence and not just MSK token in BERT, no NSP
ALBERT September 2019 Cross-layer parameter sharing across BERT blocks and embedding-matrix factorization (AXB = (AXN)*(NXB)), leading to ~1/10 the size of BERT. Pre-training with MLM and Sentence Order Prediction (SOP) instead of NSP, since SOP better captures inter-sentence coherence.
DistilBERT October 2019 60% faster, 40% smaller, retains ~97% of BERT's performance. Teacher-student knowledge distillation. Training: train the teacher BERT on the MLM task; then train the student with a KL-divergence loss between the student's and teacher's soft predictions, a CE loss between the student's hard predictions and the hard targets, and a cosine-similarity loss between the student and teacher embeddings.
TinyBERT September 2019 Reduce the student BERT to 4 encoder blocks with embedding size 312 (the teacher BERT has 12 blocks, embedding size 768). Knowledge distillation is applied not only at the prediction layer (as in DistilBERT), but also at the embedding layer and the transformer layers by minimizing the MSE between them. The dimension mismatch between student and teacher is solved by a learned projection matrix: S(NX312) x W(312X768) = T(NX768).
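The soft/hard distillation objective used by DistilBERT-style training can be sketched in a few lines of PyTorch; the temperature, weighting, and vocabulary size are placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence (teacher) with hard-target cross-entropy (labels)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 30522, requires_grad=True)   # e.g. vocabulary-sized logits
teacher = torch.randn(8, 30522)
labels = torch.randint(0, 30522, (8,))
print(distillation_loss(student, teacher, labels))
```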

NLP Decoder models

Paper Date Description
GPT June 2018 Generative pre-training and discriminative fine-tuning. Architecture similar to the decoder of the original transformer. Unsupervised pre-training (thus able to use a large corpus) on long book texts (thus capturing long-range info) with the standard language-model objective of maximizing sum(log P(x_i | x_1..x_{i-1})). The pre-trained language model is then supervised fine-tuned for text classification; text entailment x=(premise$hypothesis)->y=class; similarity x=(sent1$sent2) and x=(sent2$sent1)->y=1/0; multiple-choice question answering x=(context$ans1), x=(context$ans2)->softmax(2).
GPT-2 February 2019 Demonstrated that zero-shot learning (learning language-processing tasks without explicit supervision) is possible with a large LM. Used the new WebText dataset, a modified BPE tokenizer, minor modifications to the decoder, context size 512->1024, 1.5B params. SOTA perplexity on language modelling (predict P(x_t | x_1..x_{t-1})), accuracy on NER, perplexity/accuracy on the LAMBADA dataset (predict the last word of a long-range sentence), accuracy on common-sense reasoning; not SOTA on ROUGE F1 for summarisation or BLEU for translation, comparable accuracy on question answering. These evaluations were done without explicit training (fine-tuning).
GPT-3 May 2020 175B params, Auto-regressive language model. Similar model like GPT-2. SOTA on Language model, 35->20 on perplexity, accuracy score on 0 shot, one-shot, few shot; in question answer in 1/3 dataset over T5, not best for translation since 93% training in English, mixed for common sense reasoning, arithmetic task performance improves with shots
BLOOM November 2022 176B-parameter, largest multilingual open-source LLM. Decoder-only transformer similar to GPT-3, trained on a 1.6TB Hugging Face dataset, 46 languages, 13 programming languages. Major architectural changes: ALiBi positional embeddings (attenuate the attention score based on the distance between key and query positions, instead of adding position info to the embedding layer) and a LayerNorm after the embedding layer, which improves training stability (though it affects zero-shot generalisation). Pre-trained, then fine-tuned for multitask and contrastive objectives, yielding a multilingual information-retrieval model and a multilingual semantic textual similarity (STS) model. Performance on English-only tasks is not affected despite the multilingual training.
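A simplified NumPy sketch of the ALiBi idea used in BLOOM: a head-specific linear penalty proportional to the query-key distance is added to the attention scores; the 2^(-8h/n) slope schedule below is an assumed common choice, not taken from the summary above:

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head additive attention bias: -slope * (query_pos - key_pos), causal positions only."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)       # geometric slope per head
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # i - j
    distance = np.tril(distance)                     # only past positions matter (causal)
    return -slopes[:, None, None] * distance         # shape (heads, seq, seq), added to QK^T/sqrt(d)

print(alibi_bias(seq_len=4, num_heads=2)[0])
```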

Document AI

Paper Date Description
LayoutLM December 2019 Uses BERT as the backbone model; takes trainable text embeddings plus 4 new positional embeddings (the bbox of each token) to form the LayoutLM embedding. Text and bboxes are extracted with a pre-built OCR reader. In parallel, the RoI/bbox text in image form is passed through Faster-RCNN to get image embeddings, which are then used together with the LayoutLM embeddings for fine-tuning tasks. Pre-trained with Masked Visual-Language Modelling (MVLM), learning a masked text token from its position embedding and the other text+position embeddings, and with multi-label document classification to learn a document representation in the CLS token. Fine-tuned for form understanding, receipt understanding, and document image classification, i.e. text and layout info in pre-training and visual info in fine-tuning. WordPiece tokenizer, BIESO tags for token labeling.
TableNet January 2020 Extracts tables from a scanned document. A segmentation model similar to UNet, but with strided convolutions for upsampling. VGG-19 as encoder and two separate decoders for the table segmentation and column segmentation targets. The segmentation outputs are used to keep only the Tesseract-OCR bboxes/text lying inside the table and its columns. Rows are detected by looking for demarcation lines between vertically placed words via the Radon transform. For multi-line rows, a row is marked where all columns have horizontal entries; otherwise every horizontal line defaults to a row.
LayoutLMv2 December 2020 Multi-modal embeddings with a spatial-aware self-attention mechanism (self-attention weights with a bias vector encoding relative spatial info). Pre-training uses text, layout, and image embeddings. Text emb = token emb + 1D position emb for the token index + segment emb for different text segments. Visual emb = flattened features from ResNeXt-FPN + 1D position emb for the token index + segment emb. Layout emb = 6 bbox coordinates. Pre-training: MVLM; Text-Image Alignment (TIA): cover image regions of token lines and predict covered or not; Text-Image Matching: the CLS token predicts whether the image belongs to the same text. Fine-tuning: document image classification, token-level classification, visual question answering on document images.
Donut November 2021 OCR-free Document Understanding Transformer (avoids OCR inefficiencies such as undetected text and lost document structure). Swin Transformer as encoder and BART as decoder. Trained with a teacher-forcing strategy. The input image goes to the encoder; the input prompt (…, (info extract)) and the encoder output go to the decoder. Pre-training: train the model to read all text from top to bottom, minimizing the CE of next-token prediction given the previous tokens and the image. Fine-tuning: document classification (CDIP dataset), document info extraction (receipt data, CORD, Ticket, business card), document visual question answering (DocVQA).
LayoutLMv3 April 2022
UDOP December 2022 Unifies Vision-Text-Layout through a VTL transformer: one VTL encoder and 2 decoders, text-layout (TL) and vision (V). The encoder + TL decoder follows the T5 architecture (generates text and layout tokens in a seq-to-seq manner). The V decoder is a Masked Auto-Encoder (MAE) decoder (generates image pixels). T-V embedding: divide the image into patches and add the patch tokens and corresponding text (if present) as a unified TV embedding. For the VTL embedding, discretise the bboxes and add them to the TV embedding. Generative pre-training (input is a prompt): self-supervised (layout modelling: bbox/text, img; visual text recognition: text/bbox; text-layout recognition: text, bbox/img; MIM: img/text, bbox) and supervised pre-training (classification; layout analysis: give the bbox for a prompted paragraph; info extraction; question answering).
DocLLM January 2024