The Transformer is an architecture designed to solve sequence-to-sequence tasks while handling long-range dependencies. Capturing the relationships between words and their order in a sentence is vital for a machine to understand natural language, and this is exactly where transformers shine in Natural Language Processing (NLP). In recent years the transformer has gained popularity through its wide use on sequential-data problems in NLP, such as machine translation and language modeling.
To bring this transformer revolution to computer vision, Facebook AI released "Detection Transformer (DETR)", an innovative new method for object detection built on a completely different architecture.
DETR combines a set-based global loss, which forces unique predictions via bipartite matching, with a transformer encoder-decoder architecture. Conventional deep learning detectors perform object detection in multiple steps (region proposals, anchor generation, non-maximum suppression), which can produce duplicate and false-positive detections. DETR aims to simplify this pipeline in an innovative and efficient way.
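To make the bipartite-matching idea concrete, here is a minimal sketch of finding the minimum-cost one-to-one assignment between predictions and ground-truth objects. The cost values below are made up for illustration (in DETR the cost combines class probability and bounding-box distance), and this brute-force search over permutations only works for tiny N; the actual DETR code uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations

def match_cost(cost, perm):
    # Total cost of assigning prediction i to ground-truth object perm[i].
    return sum(cost[i][j] for i, j in enumerate(perm))

def bipartite_match(cost):
    """Brute-force minimum-cost one-to-one matching.

    cost[i][j] is the matching cost between prediction i and
    ground-truth object j. Each prediction is matched to exactly
    one ground truth, which is what forces unique predictions.
    """
    n = len(cost)
    return min(permutations(range(n)), key=lambda p: match_cost(cost, p))

# Hypothetical costs for 3 predictions vs. 3 ground-truth objects:
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.1],
]
print(bipartite_match(cost))  # (1, 0, 2): pred 0 -> gt 1, pred 1 -> gt 0, pred 2 -> gt 2
```

Because the matching is one-to-one, the loss never rewards two queries for predicting the same object, which is why DETR can drop non-maximum suppression entirely.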
The original paper is available at https://arxiv.org/pdf/2005.12872.pdf.
This new model is simple yet powerful. Using a transformer-based encoder-decoder architecture, DETR treats object detection as a direct set prediction problem, and it demonstrates notably better results on large objects.
Below is the architecture of DETR:
Its key components are a CNN backbone, a transformer encoder-decoder, and feed-forward prediction heads. The CNN backbone extracts a feature map from the input image; this feature map is flattened into a one-dimensional sequence and passed to the transformer encoder. The decoder takes N learned object queries, where N is a fixed upper bound on the number of objects the model can detect in an image, and uses self-attention and encoder-decoder attention to decode them into N fixed-length output embeddings. For each embedding, a feed-forward network predicts the normalized center coordinates, height, and width of a bounding box, and the class label is predicted with a softmax function.
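The prediction heads at the end of this pipeline can be sketched in a few lines of NumPy. The weights below are random placeholders, not DETR's trained parameters, and a single linear layer stands in for the small MLP DETR uses as its box head; the point is only to show the shapes and activations involved.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 100, 256          # N object queries, embedding size (DETR's defaults)
num_classes = 91         # COCO classes; index num_classes means "no object"

# Stand-in for the N output embeddings produced by the transformer decoder.
decoder_out = rng.standard_normal((N, d))

# Class head: a linear layer followed by softmax over (num_classes + 1) labels.
W_cls = rng.standard_normal((d, num_classes + 1)) * 0.01
logits = decoder_out @ W_cls
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Box head: a sigmoid keeps the (cx, cy, h, w) outputs normalized to [0, 1].
W_box = rng.standard_normal((d, 4)) * 0.01
boxes = 1.0 / (1.0 + np.exp(-(decoder_out @ W_box)))

print(probs.shape, boxes.shape)  # (100, 92) (100, 4)
```

Note that every one of the N queries produces a prediction; queries that do not correspond to a real object are expected to predict the extra "no object" class.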
The model is conceptually simple and, unlike many other modern detectors, does not require a specialized library; however, more test cases and performance evaluations are needed to fully assess it.
Have you tried DETR yet? Let me know!
If you found this article interesting, why not check out the other articles in our archive?