Classification | Regression | Time Series | Clustering |
---|---|---|---|
Deletion | Deletion | Deletion | Deletion |
Conditional Imputation with mode | Conditional Imputation with mean/median | Forward/Backward fill | K-means conditional imputation |
- | Interpolation (linear, spline, etc) | Interpolation (linear/spline) | - |
Treat as special category like UNK | - | - | Treat as separate cluster |
- | - | Change binning to reduce effect (monthly -> quarterly) | - |
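
A minimal sketch of these options, assuming a pandas DataFrame `df` with a hypothetical numeric column "x" and categorical column "c":

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0], "c": ["a", np.nan, "b", "a"]})

# Deletion
dropped = df.dropna()

# Conditional imputation: mean/median for numeric, mode for categorical
df["x_mean"] = SimpleImputer(strategy="mean").fit_transform(df[["x"]]).ravel()
df["c_mode"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["c"]]).ravel()

# Time series: forward/backward fill and interpolation
df["x_ffill"] = df["x"].ffill().bfill()
df["x_interp"] = df["x"].interpolate(method="linear")

# Treat missing category as its own level (UNK)
df["c_unk"] = df["c"].fillna("UNK")
```
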
Method | Description |
---|---|
Inter-quartile range (IQR) | Statistical method to detect outliers using quartiles: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] |
Isolation forest | Unsupervised, random-forest style; outliers are easy to detect since they need fewer cuts/branches to be isolated. Needs prior knowledge of the expected % of outliers (contamination). |
One Class SVM | Unsupervised; the normal data forms one class and only the origin represents the other. Maximize the margin between them. Better to use the RBF kernel |
DBSCAN | Clustering algo; works to detect an anomaly cluster (points left in the smallest/sparsest cluster or flagged as noise) |
Auto-encoder | Deep learning method; outliers lie far from the data embedding in the bottleneck layer, so they have a larger reconstruction loss |
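
A small sketch of two of the detection methods above on toy data; the contamination fraction is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -9.0]])  # toy data with two outliers

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Isolation Forest: needs an assumed outlier fraction (contamination)
iso = IsolationForest(contamination=0.01, random_state=0)
iso_mask = iso.fit_predict(x.reshape(-1, 1)) == -1  # -1 marks outliers
```
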
Method | Description |
---|---|
Deletion | Delete only if you know the reason the outlier occurred (e.g. a data-entry error) |
Transformation | Can lead to loss of info present in the shape of the input variable |
Truncate | Replace values beyond a threshold with the threshold value (winsorizing) |
Binning | Binning numerical vars |
Robust model | Like ensemble methods |
Robust loss | Like Huber loss |
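
A sketch of truncation and a robust loss, assuming quantile-based thresholds and a toy regression dataset:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X.ravel() + rng.normal(size=200)
y[:5] += 50  # inject outliers

# Truncate/winsorize: replace values beyond the 1st/99th percentile by the threshold
lo, hi = np.percentile(y, [1, 99])
y_trunc = np.clip(y, lo, hi)

# Robust loss: Huber regression down-weights large residuals
model = HuberRegressor().fit(X, y)
```
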
Method | Description |
---|---|
Feature Selection | Filter methods (remove correlated features, drop unimportant features based on feature importance); Subset/Wrapper methods: RFE (use a DT and recursively remove the least important feature, e.g. by Gini importance, until the desired number of features is reached) and wrapper techniques like forward, backward, bidirectional selection (same idea as RFE but you choose the eliminator model); Train-loop/embedded methods (e.g. LASSO) |
Feature Extraction | PCA (linear), kernel PCA (e.g. RBF kernel), t-SNE (nonlinear), UMAP (better for nonlinear, large datasets); LDA (supervised, maximizes class separability); Auto-encoder (DL) as in clustering |
Clustering | Making groups for categorical and numerical variables |
Binning | Binning for numerical and categorical variables |
Aggregation features and transforms | e.g. polynomial features, interaction terms |
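
A sketch of one selection and one extraction method from the table, on a toy classification dataset (sizes are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# RFE: recursively drop the least important feature until 5 remain
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# PCA: linear feature extraction keeping 5 components
X_pca = PCA(n_components=5).fit_transform(X)
```
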
Function Transformation | Power Transformation | Quantile Transformation |
---|---|---|
Logarithmic (>0 Right skewed input) | Box-Cox (Input should be >0) | Quantile Normalization |
Squared (Left skewed input) | Yeo-Johnson (input can be any real value) | Rank Transformation |
Square Root (>0 right skewed input) | | |
Reciprocal (>0 strong right skewed input) | | |
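
A sketch of the three transformation families, assuming a strictly positive, right-skewed input:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed, > 0

x_log = np.log(x)                                                       # function transform
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)          # requires x > 0
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)  # allows any real value
x_quantile = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=100).fit_transform(x)      # quantile transform
```
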
Method | Description |
---|---|
Oversample, undersample | oversample minority class with replacement, undersample/remove entries from majority class |
BalancedML | Balanced ensemble estimators (e.g. imbalanced-learn's BalancedBaggingClassifier) that resample each bootstrap sample to balance classes |
SMOTE | Synthetically generates new minority-class samples by interpolating between a sample and its k nearest minority neighbours (kNN) |
Appropriate ML | like tree ensemble methods |
Better metric | Precision, Recall, F-beta score instead of accuracy |
CV | Stratified cross-validation scores, bootstrapping, etc |
Loss | Weighted loss, Focal loss and its variants (example: DiceFocal loss) |
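
A sketch of resampling vs. weighting on an imbalanced toy dataset; SMOTE comes from the separate imbalanced-learn package:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# SMOTE: synthesize minority samples by interpolating between kNN neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Alternative: keep the data as-is and use a weighted loss
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```
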
Encoding Type | Categorical Ordinal Var | Categorical Nominal Var | Description |
---|---|---|---|
Classical | One Hot | One Hot | 1 when True otherwise 0, use when n(unique) is small/reasonable |
Classical | Label/Ordinal | - | Denotes hierarchical levels |
Classical | Hashing | Hashing | Hash function mapping for each unique category; can handle large cardinality/n(unique) |
Classical | Binary | - | One hot + Hashing |
Classical | Frequency | Frequency | Replace each category with its occurrence count/frequency |
Bayesian | Target | Target | Encode target info by using the conditional mean of the target for each unique category value. Can lead to overfitting, so smoothed versions exist. |
Bayesian | LOO Target | LOO Target | Exclude the current row while calculating the target encoding for that row. |
Deep Learning | ~ | Embedding | Do entity embedding for high-cardinality variables (cities, etc.) to capture semantic info by training a DNN/embedding layer (categorical to continuous values) |
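
A minimal pandas sketch of smoothed target encoding; the column names and smoothing constant `m` are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b", "b", "b", "c"],
                   "target": [1, 0, 1, 1, 0, 1]})

m = 5.0  # smoothing strength: shrink rare categories towards the global mean
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)  # leak-prone: use LOO/CV folds in practice
```
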
Numerical Encoding | Description |
---|---|
Binning equal | Equal-width bins; replace the value with the bin number. Choice depends on the numerical variable and the problem at hand |
Binning unequal | Unequal (hand-chosen) bin edges; replace the value with the bin number. Choice depends on the numerical variable and the problem at hand |
Binning quantile | Equal-frequency (quantile) bins; replace the value with the bin number. Choice depends on the numerical variable and the problem at hand |
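
A sketch of the three binning variants with pandas, on a toy skewed series (bin edges are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).exponential(size=1000))

equal_width = pd.cut(x, bins=5, labels=False)                    # equal-width bins
unequal = pd.cut(x, bins=[0, 0.5, 1, 2, np.inf], labels=False)   # hand-chosen edges
quantile = pd.qcut(x, q=5, labels=False)                         # equal-frequency (quantile) bins
```
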
Distribution | Question it addresses | Parameters |
---|---|---|
Gaussian | - | Parameters: mean mu and standard deviation sigma; Mean=mu, var=sigma^2 |
Bernoulli | Probability of success (p) in a single trial | Mean=p, var=p(1-p) |
Binomial | Probability of k successes in n trials if p is the probability of success in 1 trial. Models number of successes in fixed number of trials. | Mean=np, var=np(1-p) |
Negative Binomial | Probability of k trials required for fixed r successes, if p is the probability of success in 1 trial. Models number of trials required for fixed no. of successes. | Mean=r/p, var=r(1-p)/p^2 |
Poisson | Probability of x events occurring in a unit time interval if lambda events occur on average per unit time. Here the event count x is the discrete variable | Mean=lambda, var=lambda |
Geometric | Probability that the first event/success occurs on the x-th trial, if the Bernoulli probability of success per trial is p (so one success occurs on average after 1/p trials). Here the trial count x is the discrete variable | Mean=1/p, var=(1-p)/p^2 |
Exponential | Probability of an event occurring after a time interval x (or any continuous variable like price/distance) if one event occurs on average after 1/lambda time. Here time is the continuous variable. Continuous equivalent of the geometric distribution | Mean=1/lambda, var=1/lambda^2 |
Gamma | Probability (waiting time) for alpha events to occur within time x if one event occurs on average every 1/beta time (rate beta). Here time is the continuous variable. Generalised case of the exponential distribution | Mean=alpha/beta, var=alpha/beta^2 |
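
The mean/variance formulas above can be checked with scipy.stats; parameter values below are arbitrary:

```python
from scipy import stats

n, p, lam, alpha, beta = 10, 0.3, 4.0, 2.0, 0.5

print(stats.binom.stats(n, p, moments="mv"))                   # np, np(1-p)
print(stats.poisson.stats(lam, moments="mv"))                  # lambda, lambda
print(stats.geom.stats(p, moments="mv"))                       # 1/p, (1-p)/p^2
print(stats.expon.stats(scale=1 / lam, moments="mv"))          # 1/lambda, 1/lambda^2
print(stats.gamma.stats(alpha, scale=1 / beta, moments="mv"))  # alpha/beta, alpha/beta^2
```
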
Bias |
---|
Inherent bias in data |
Sampling/data collection bias |
Preprocessing bias |
ML algo bias, like no L1/L2 regularizer, etc |
ML algo evaluation metric bias |
Test | Description |
---|---|
Chi-squared | Between two categorical variables, e.g. comparing an observed coin-toss distribution against a fair-coin expectation |
Kolmogorov-Smirnov | Continuous variables, suited to larger samples (>2000 points), non-parametric (no assumption about the underlying distribution) |
Anderson Darling | Similar to KS, but gives more weight to the tails |
Shapiro Wilk | Checks if data is Gaussian (compares sample quantiles to Gaussian quantiles), suited to <2000 data points |
AIC | Regression tasks, continuous variables, combines goodness of fit + model complexity (degrees of freedom) |
BIC | Like AIC but Bayesian in approach; takes the number of data points into account, with a heavier penalty for model complexity |
R squared | Continuous variables; compares model MSE with the MSE of a mean/intercept-only model |
Adjusted R squared | Similar to R squared but takes model complexity into account (DOF) |
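
A sketch of a few of these goodness-of-fit checks with scipy, on a toy Gaussian sample:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(size=500)

print(stats.shapiro(x))                # Shapiro-Wilk (best for n < ~2000)
print(stats.kstest(x, "norm"))         # Kolmogorov-Smirnov against N(0, 1)
print(stats.anderson(x, dist="norm"))  # Anderson-Darling, more weight on the tails
```
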
Independent | Dependent | Test |
---|---|---|
Categorical | Categorical | Chi-square test |
Categorical | Numerical | t-test, z-test, ANOVA |
Numerical | Categorical | Logistic regression |
Numerical | Numerical | Pearson corr (linear), Spearman corr (rank, monotonic, categorical ordinal data also works) |
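
A sketch of the categorical-categorical and numerical-numerical cases with scipy, on toy data:

```python
import numpy as np
from scipy import stats

# Categorical vs. categorical: chi-square test on a contingency table
table = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)

# Numerical vs. numerical: Pearson (linear) and Spearman (monotonic/rank) correlation
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200)
print(stats.pearsonr(a, b), stats.spearmanr(a, b))
```
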
Test | Types | Description |
---|---|---|
Shapiro-Wilk | Normality test | H0: distribution is normal, H1: distribution is not normal; suited to <2000 data points |
Levene Test | Parametric | H0: var(data1)=var(data2), H1: var(data1)!=var(data2); homogeneity of variances |
t-test | Parametric, One-sided/Two-sided, One-sample/two-sample | H0: mu0=mu1, H1: mu0!=mu1 (two-sided) or mu0>mu1 / mu0<mu1 (one-sided) |
z-test | Parametric, One-sided/Two-sided, One-sample/two-sample | Like the t-test but assumes known population variance / large samples |
ANOVA | Parametric, One-way (1 dependent and 1 independent var) / Two-way (2 independent, 1 dependent var) | H0: all group means are equal |
F-test | Parametric | Compares two variances (ratio of variances); also used for overall significance in regression/ANOVA |
Chi square | Non-parametric | Tests independence of two categorical variables or goodness of fit to an expected distribution |
Mann-Whitney | Non-parametric | Rank-based alternative to the two-sample t-test |
Kruskal-Wallis | Non-parametric | Rank-based alternative to one-way ANOVA (compares >2 groups) |
Granger causality | | Tests whether lagged values of one time series improve the prediction of another |
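
A sketch comparing two toy groups with a few of the tests above (scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, 100)
g2 = rng.normal(0.5, 1.0, 100)

print(stats.levene(g1, g2))                     # H0: equal variances
print(stats.ttest_ind(g1, g2, equal_var=True))  # parametric two-sample t-test
print(stats.mannwhitneyu(g1, g2))               # non-parametric (rank-based) alternative
print(stats.f_oneway(g1, g2))                   # one-way ANOVA (>= 2 groups)
```
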
Method | Description |
---|---|
Linear Regression/OLS | When y depends on x linearly; residuals are independent, normally distributed, and homoscedastic (constant variance) |
Linear Regression extension | Adding non-linear and interaction terms |
Generalised linear models | Choose the GLM family from the distribution of y: count data -> Poisson / Negative Binomial; continuous -> Normal; continuous right-skewed -> Gamma; continuous left-skewed -> Inverse Gaussian; probability distribution -> Binomial / Multinomial. Choose the link function from how y relates to x: y~x -> identity link; ln(y)~x -> log link; logit(y)~x -> logit link |
Naive Bayes | Mainly for classification problems, calculates Bayes probability for each of the target classes for the given test example and checks which target has the highest probability. If the input variable is Gaussian distributed, can use Gaussian NB, similarly Multi-nomial NB. |
Logistic Regression | Classification algorithm where you do linear regression with target as logit(p) where p is the probability of being class1 in binary classification |
Decision trees | Choose the variable and split point by Gini/entropy (classification) or SSE (regression) w.r.t. the target. Classification: split on the variable/threshold giving minimum Gini impurity, maximum information gain, or minimum entropy. Regression: split at the boundary giving minimum SSR in the resulting leaf nodes |
Random forest | Bagging + Feature Selection i.e. training several DT each with randomly selected (default) sqrt(features) for classification and f/3 for regression. |
XGBoost | Gradient-boosted trees, similar in spirit to random forest ensembles. Instead of the minimum Gini score or SSE of a variable and split point, it uses a similarity score, which accounts not only for the residuals but also for the number of data points in each leaf (in a plain DT, the number of points in a leaf is not taken into account). Information gain is defined as sum(SS_children) - SS(parent), and a split is kept only if gain > gamma, where gamma is a regularisation (pruning) factor; the similarity score itself has a regularisation parameter lambda. i.e. DT -> XGBoost: SSR/Gini -> similarity score with lambda and leaf counts. It is also fast because it searches quantiles for split points (approximate splitting), evaluates candidate splits in parallel, and uses caching |
Adaboost | Make a stump (DT with depth=1) for each variable and choose the one with the least Gini. Give it a weight based on its accuracy (% correct). Then re-weight the dataset (or duplicate samples) to emphasize incorrectly classified examples, and repeat the stump building and weighting |
SVM | Generally used for classification. Maximize the margin such that examples are classified correctly (classification) or target y values deviate less than epsilon from the regression line/curve (SVR). Hard-margin SVM: maximize the margin with the classes strictly on either side of the decision boundary. Soft-margin SVM: maximize the margin while penalizing misclassified points in proportion to their distance from the boundary |
Kernel SVM | Use the kernel trick to implicitly work in a higher-dimensional space, i.e. kernel functions make the low-to-high dimensional computation cheap |
LDA | Supervised classification problem, find axes which maximize separation between means of classes and minimize the variance within the classes. |
kNN | Supervised Classification and regression algo, average/mode of k closest data points. |
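
A sketch fitting a few of the models above on one toy dataset with scikit-learn; hyperparameters are left at their defaults, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
    "rbf_svm": SVC(kernel="rbf"),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```
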
Method | Description |
---|---|
Moving Average | Smoothing method; gives equal importance to all data in the context window |
Simple exponential smoothing | Smoothing method; gives exponentially decaying weights to neighboring data in the context window |
Holt's exponential smoothing | Double exponential smoothing; can forecast level + trend |
Holt-Winters exponential smoothing | Triple exponential smoothing; can forecast level + trend + season, and thus the next cycle |
AR | Forecast using lags |
MA | Forecast using error/residual on previous lags |
ARMA | Forecast using both lags (AR order -> PACF) and residuals of previous lags (MA order -> ACF, Ljung-Box test). Grid-search the AR and MA orders that minimize AIC during training |
ARIMA | For non-stationary time series. The I (integrated) in ARIMA is the order of differencing, i.e. how many times previous lags are subtracted to remove the trend and make the series stationary |
SARIMA | Same as ARIMA, but also takes seasonality into account; additionally has seasonal AR, I, and MA orders |
VAR | Vector autoregression: AR forecasting when the forecast depends on lags of other correlated time series as well. Apply Granger causality tests to check which series are informative |
Hybrid models | Use of feature transformers like linear regression to predict trend and target transformers like tree-based, NN to predict seasonality. |
Prophet | Curve-fitting algo by Meta; can take weekly and yearly seasonality and holidays into account, and can have changepoints (inflection points) in the trend |
biLSTM | Bi-directional LSTM module. Can handle multiple input time series to predict one TS. Needs normalization but not stationarity of the TS |
ARCH/GARCH | ARCH forecasts the variance (volatility) from past squared residuals; GARCH additionally uses past variances of the residuals. Example: apply to the % change of a stock price series, or to the residuals of a time series after its trend/seasonality have been modelled |
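
A sketch of Holt-Winters and ARIMA with statsmodels on a toy seasonal series; the (p, d, q) order and seasonal period are illustrative, not tuned:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(size=120)

hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
arima = ARIMA(y, order=(1, 1, 1)).fit()

print(hw.forecast(12))     # Holt-Winters: level + trend + season
print(arima.forecast(12))  # ARIMA(1, 1, 1)
```
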
Method | Type | Description |
---|---|---|
K-means | Hard, Centroid | Requires number of clusters prior (can be found via WCSS or Silhouette score), updates the centroid of each cluster iteratively. |
Generalised K means | Hard, Centroid | Can have clusters of different sizes and shapes, resistant to outliers |
K-medoids | Hard, Centroid | Requires the number of clusters a priori. Uses actual data points as cluster centers (medoids) |
Agglomerative hierarchical | Hard, Hierarchical | Merge data points/clusters by closest distance iteratively up to some distance threshold. |
DBSCAN | Hard, Density | Does not require number of clusters prior, but min (radial) distance of a cluster (found by kNN distance plot) and number of points (default=2*features) |
Spectral | Hard | First step: move to the lower dimension by using PCA or graph-based dimension reduction (node as points, edge as distance). Then apply a clustering algorithm like Kmeans |
Fuzzy C means | Soft, Centroid | Update probability a point belongs to a centroid of a cluster iteratively. Needs number of clusters prior. |
Gaussian Mixture models | Soft, Distribution | Use Gaussian distributions to define clusters; the optimal parameters of each Gaussian are found using the Expectation-Maximization (EM) method |
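
A sketch of hard clustering with cluster-count selection via the silhouette score, on toy blobs (eps/min_samples are illustrative):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Choose k by silhouette score
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)

labels_kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
labels_dbscan = DBSCAN(eps=0.8, min_samples=4).fit_predict(X)  # -1 marks noise points
```
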
Paper | Date | Description |
---|---|---|
LeNet-5 | December 1998 | CNN-AvgPool NN for softmax MNIST classifier, 1st usage of CNNs & weight sharing idea, activations: tanh, softmax |
AlexNet | September 2012 | 1000 class classifier, Similar to LeNet but deeper, 1st usage of Dropout, ReLU |
VGGNet-16/19 | September 2014 | Very deep (16/19 layers) and narrow CNN, large number of small filters 3X3 to capture more complex and granular features, stacking of many layers |
InceptionNet | September 2014 | Stacking of parallel modules, 1X1 followed by either 3X3 or 5X5, or 1X1 modules. 1X1 reduces channels dim & time (bottleneck layers) and diff nXn extracts low & high-level features. Global avg pooling counters overfitting & no. of params, Vanishing gradient countered by 2 auxiliary classifiers using same labels (0.3 weighted in final loss). |
InceptionNetv2 & v3 | December 2015 | Factorize nXn conv to 1Xn and nX1, thus improve speed. BN in auxiliary classifier, label smoothing for overfitting, RMSProp optimizer |
ResNet | December 2015 | Residual connection in CNN blocks of deep CNN for vanishing grads. |
InceptionNetV4 and InceptionResNet | February 2016 | More uniform inception modules with new dim reduction blocks, skip connect module o/p to i/p (same channel dim achieved by 1X1) & replaced pooling operation. Scale residual activation to 0.3 to decrease vanishing grads. |
DenseNet | August 2016 | Each layer in block receives input from all previous layers counter vanishing grads, transition layers between dense blocks reduce spatial dim to balance computational cost and accuracy. |
Xception | October 2016 | Based on Inception v3; instead of inception modules it uses depthwise separable convolutions (depthwise + pointwise) over the input: split the input channels and run one filter per channel (e.g. 10 filters for 10 input channels), concatenate the results, and follow with a 1X1 pointwise conv so the concatenated per-channel outputs get mixed across channels (grouped convolutions: per-channel conv, concatenate, pointwise 1X1 conv) |
ResNeXt | November 2016 | CNN with several Depthwise separable convolutions with ResNet. |
MobileNetv1 | April 2017 | Depthwise separable convolutions network for mobile and embedded devices. |
MobileNetv2 | January 2018 | Inverted residual connections (skip connections across narrow-wide-narrow blocks, with the wide part obtained by depthwise separable convolutions). To avoid the information loss ReLU causes in narrow layers, the final layer of each block uses a linear activation (linear bottleneck) |
MobileNetv3 | May 2019 | AutoML tools, MnasNet to select coarse architecture using reinforcement learning and NetAdapt to fine tune. Use squeeze-and-excitation block (Squeeze: HXWXC -> 1X1XC, getting per channel overview, Excitation: 1X1XC -> 1X1XC using NN, use this now as per channel weight for input and continue with model), remove 3 expensive layers from v2. |
EfficientNet | May 2019 | AutoML automated neural architecture search (NAS) to select balanced dim of width, depth and resolution (compound scaling method). Usage of efficient building blocks like inverted residuals, linear bottleneck |
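
Several of the later architectures above (Xception, MobileNet, ResNeXt) rely on depthwise separable convolutions; a minimal PyTorch sketch with arbitrary channel sizes:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        # Pointwise: 1x1 conv mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```
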
Paper | Date | Description |
---|---|---|
RCNN | November 2013 | Selective search for region proposals (~2000), warp each RoI and pass it through a CNN for feature extraction, then a linear SVM for classification and offset bbox regression. Non-max suppression to select among >1 bbox for the same object |
Fast RCNN | April 2015 | Pass the entire image through a large backbone CNN, run selective search on the output, crop and warp each RoI, do RoI pooling, and pass it through a small detection head NN for offset bbox regression, classification, and objectness score. Faster inference due to truncated SVD on the head weights to retain the important weights/nodes. About 25X faster than RCNN |
Faster RCNN | June 2015 | Replace selective search in Fast RCNN with a Region Proposal Network (RPN); 9 anchor boxes of different shapes per location to select the best bbox per object. About 10X faster than Fast RCNN |
Mask RCNN | March 2017 | Add an extra mask branch to Faster RCNN to perform instance segmentation of each RoI |
Cascade RCNN | December 2017 | In Faster RCNN the detection head's bbox output depends on RPN proposal quality, which leads to poor inference at higher IoU thresholds. To counter this, a cascade of detection stages is used, each trained on the preceding stage's refined bbox predictions at an increasing IoU threshold |
Paper | Date | Description |
---|---|---|
Vision Transformer | October 2020 | Encoder block of the original transformer. Segment images into patches, flatten each patch and pass it through a trainable embedding layer (patch embeddings), and treat the patch embeddings as tokens. Standard learnable 1D position embeddings (no gains with 2D-aware position embeddings). MLP layers with GELU non-linearity. CLS token for image classification. SOTA results when trained on large data (14M-300M images). Major issue: it is supervised pre-trained on ImageNet, unlike BERT's self-supervised pre-training; BEiT was developed to overcome this |
Data-efficient Image Transformer (DeiT) | December 2020 | Similar architecture to ViT, but a distilled version: a distillation token learns from a teacher. Input: CLS, Patch1, Patch2, ..., Distil tokens. The distil token minimizes the distillation loss (soft: KL-div with the teacher, hard: CE with the teacher). Cosine similarity between distil and CLS embeddings is 0.93 (expected <1.0 by construction). Found a ConvNet teacher better than a Transformer teacher. The distil token is better than the CLS token for classification; joint CLS+distil gives a middle performance |
Swin transformer | March 2021 | Hierarchical transformer, smaller input patches (16X16 in ViT -> 4X4). RGB img -> Patch partition (4X4) -> Stage1 (Linear embedding (NN) -> Swin transformer block) -> Stage2 (Patch merging -> Swin transformer block) -> Stage3 (Patch merging -> Swin transformer block) -> Classification. Swin transformer block: the 1st block does window attention (attention within each window only, fast); the 2nd block does shifted window attention (like sliding windows in a convnet: cyclically shift the patches and attend again, so objects lying across window boundaries are captured in a new window). Patch merging / the hierarchical structure moves from capturing local info to capturing global info. The output of each hierarchical level can be used as a backbone for object detection/segmentation (e.g. with Mask R-CNN) |
BEiT | June 2021 | BERT-like self-supervised MIM (masked image modelling) pre-training. Images have no vocabulary or tokenizer like BERT has for NLP, so every image patch is associated with a visual token obtained from a separately trained dVAE (Tokenizer -> visual tokens (bottleneck) -> Decoder reconstructing the image). Pre-training with a ViT backbone: RGB img -> patches -> patch-wise masking (40% max) -> flattened patch embedding + position embedding -> ViT (BEiT encoder) -> MIM task to predict the visual tokens of the masked patches. Fine-tuning: image classification (global pooling + softmax classifier), semantic segmentation (pre-trained BEiT as backbone encoder with several deconvolution layers as decoder), object detection |
Paper | Date | Description |
---|---|---|
Transformer | June 2017 | Encoder-decoder model. Multi-head self-attention (Query, Key, Value) in the encoder; masked (causal) multi-head self-attention and multi-head cross-attention (K, V from the encoder output, Q from the decoder) in the decoder; skip connections; layer normalisation (across all features of a token); BPE tokenizer; trainable token embeddings of dimension 512; fixed non-trainable absolute position encoding (sin/cos); trained on the WMT English-German and English-French datasets |
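
A minimal PyTorch sketch of the scaled dot-product attention at the core of multi-head (self/cross) attention; shapes and the causal mask are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask: True where attention is allowed
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
causal = torch.tril(torch.ones(10, 10, dtype=torch.bool))  # decoder-style causal masking
out = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape)  # torch.Size([2, 10, 64])
```
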
Paper | Date | Description |
---|---|---|
BERT | October 2018 | Encoder-only transformer model, max input 512 tokens, WordPiece tokenizer, trainable token embeddings of size 768, learned absolute position embeddings, multi-head self-attention (Query, Key, Value), final linear + softmax output layer, trained on left and right context (bidirectional representations). Pre-training: MLM task (MASK token) and Next Sentence Prediction (NSP) using CLS-sent1-SEP-sent2-SEP tokens. Fine-tuning: text classification via the CLS token, question answering via CLS-question-SEP-context-SEP |
RoBERTa | July 2019 | Optimized BERT: dynamic masking of tokens for MLM during pre-training, no NSP (low value), byte-level BPE tokenizer, 10X training data (160GB), larger batch size |
ELECTRA | March 2020 | Replaced Token Detection instead of MLM (Generator BERT predicts MSK tokens, Discriminator BERT predicts isOriginal or isReplaced) thus focus on all tokens in sequence and not just MSK token in BERT, no NSP |
ALBERT | September 2019 | Cross-layer parameter sharing across BERT blocks and embedding matrix factorization (AXB = (AXN)*(NXB)), leading to ~1/10 the size of BERT. Pre-training with MLM and Sentence Order Prediction (SOP) instead of NSP, since SOP better captures inter-sentence coherence |
DistilBERT | October 2019 | 60% faster, 40% smaller, retains ~97% of BERT's performance. Teacher-student knowledge distillation. Training: train the teacher BERT on the MLM task; the student is trained with a KL-divergence loss between student and teacher soft predictions, a CE loss between the student's hard predictions and the hard targets, and a cosine similarity loss between student and teacher embeddings |
TinyBERT | September 2019 | Reduce the student BERT to 4 encoder blocks and embedding size 312 (the teacher BERT has 12 blocks, 768 embedding size). Knowledge distillation not only at the prediction layer (as in DistilBERT) but also at the embedding and transformer layers, by minimizing the MSE between them. The dimension mismatch between student and teacher is solved by a learned projection matrix: S(NX312) x W(312X768) = T(NX768) |
Paper | Date | Description |
---|---|---|
GPT | June 2018 | Generative pre-training and discriminative fine-tuning. Architecture similar to the decoder of the original transformer. Unsupervised pre-training (thus able to use a large corpus) on long book texts (thus capturing long-range info) with the standard language-model objective of maximizing sum_i log P(x_i | x_1..x_{i-1}). The pre-trained language model is then supervised fine-tuned for text classification, text entailment (x = premise$hypothesis -> y = class), similarity (x = sent1$sent2 and x = sent2$sent1 -> y = 1/0), and multiple-choice question answering (x = context$ans1, x = context$ans2 -> softmax over choices) |
GPT-2 | February 2019 | Demonstrated that zero-shot learning (learning language-processing tasks without explicit supervision) is possible with a large LM. Used the new WebText dataset, a modified BPE tokenizer, minor modifications to the decoder, context size 512 -> 1024, 1.5B params. SOTA perplexity on language modelling (predict x_t given x_1..x_{t-1}), accuracy on NER, perplexity/accuracy on the LAMBADA dataset (predict the last word of a long-range sentence), accuracy on common-sense reasoning; not SOTA on ROUGE F1 for summarisation or BLEU for translation; comparable accuracy on question answering. These evaluations were done without explicit fine-tuning |
GPT-3 | May 2020 | 175B params, auto-regressive language model, architecture similar to GPT-2. SOTA on language modelling (perplexity 35 -> 20); reports zero-shot, one-shot, and few-shot accuracy; beats T5 on question answering in 1 of 3 datasets; not best for translation since 93% of the training data is English; mixed results on common-sense reasoning; arithmetic task performance improves with the number of shots |
BLOOM | November 2022 | 176B-parameter largest multilingual open-source LLM. Decoder-only transformer similar to GPT-3, trained on a 1.6TB Hugging Face corpus covering 46 natural languages and 13 programming languages. Major architecture changes: ALiBi positional embeddings (attenuate the attention score by the distance between key and query, instead of adding position info to the embedding layer) and a LayerNorm after the embedding layer, giving better training stability (though it affects zero-shot generalisation). Pre-trained, then fine-tuned for multitask and contrastive objectives, giving a multilingual information retrieval model and a multilingual semantic textual similarity (STS) model. Performance on English-only tasks is not affected despite the multilingual training |
Paper | Date | Description |
---|---|---|
LayoutLM | December 2019 | Uses BERT as the backbone model; takes trainable text embeddings plus 4 new positional embeddings (the bbox of each token) to get the LayoutLM embedding. Text and bboxes are extracted using a pre-built OCR reader. In parallel, each RoI/bbox of text in image form is passed through Faster-RCNN to get image embeddings, which are then used with the LayoutLM embeddings for fine-tuning tasks. Pre-trained with Masked Visual-Language Modelling (MVLM), learning masked text from its position embedding and the other text+position embeddings, and with multi-label document classification to learn a document representation in the CLS token. Fine-tuned for form understanding, receipt understanding, and document image classification, i.e. text and layout info in pre-training and visual info in fine-tuning. WordPiece tokenizer, BIESO tags for token labeling |
TableNet | January 2020 | Extracts tables from a scanned document. A segmentation model similar to UNet, but with strided conv for upsampling. VGG-19 as encoder, two separate decoders for the table segmentation and column segmentation targets. The table segmentation output helps select only those bboxes and text from Tesseract OCR lying inside the table and column segmentation masks. Rows are detected by looking for demarcation lines between vertically placed words via the Radon transform. For multi-row structures, a row is taken where all horizontal entries are present; otherwise it defaults to every horizontal line being a row |
LayoutLMv2 | December 2020 | Multi-modal embeddings with a spatial-aware self-attention mechanism (self-attention weights with bias vectors capturing relative spatial info). Pre-training uses text, layout, and image embeddings. Text emb = token emb + 1D position emb for token index + segment emb for different text segments. Visual emb = flattened features from ResNeXt-FPN + 1D position emb for token index + segment emb. Layout emb = 6 bbox coordinates. Pre-training: MVLM; Text-Image Alignment (TIA): cover the image regions of token lines and predict covered or not; Text-Image Matching (TIM): CLS token predicts whether the image belongs to the same text. Fine-tuning: document image classification, token-level classification, visual question answering on document images |
Donut | November 2021 | OCR-free Document Understanding Transformer; avoids OCR inefficiencies (e.g. undetected text, lost document structure). Swin Transformer as encoder and BART as decoder. Trained using a teacher-forcing strategy. Input image goes to the encoder, input prompt ( |
LayoutLMv3 | April 2022 | … |
UDOP | December 2022 | Unifies Vision-Text-Layout through a VLT transformer: one VLT encoder and 2 decoders, TL (text-layout) and V (vision). The encoder + TL decoder follow the T5 architecture (generate text and layout tokens in a seq-to-seq manner); the V decoder is a Masked Auto-Encoder (MAE) decoder (generates image pixels). T-V embedding: divide the image into patches and add the tokens of each patch and its corresponding text (if present) as a unified TV embedding. For the TV-L embedding, discretise the bboxes and add them to the TV embedding to create the VLT embedding. Generative pre-training (input is a prompt): self-supervised (layout modelling: bbox/text, img; visual text recognition: text/bbox; text-layout recognition: text, bbox/img; MIM: img/text, bbox) and supervised pre-training (classification; layout analysis: give bbox if prompted with a paragraph; info extraction; question answering) |
DocLLM | January 2024 |