Transformers v4.15.0: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.

Latest Release: v4.15.0

New Model additions

WavLM

WavLM was proposed in WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.

WavLM sets a new SOTA on the SUPERB benchmark.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=wavlm

  • Add WavLM by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14354
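
A minimal usage sketch (assuming the microsoft/wavlm-base checkpoint from the hub link above and a 16 kHz mono waveform; an illustration, not an official example):

import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Assumed checkpoint name; any WavLM checkpoint from the hub should work the same way.
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
model = WavLMModel.from_pretrained("microsoft/wavlm-base")

speech = torch.randn(16000).numpy()  # one second of random audio as a stand-in for real speech
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)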

Wav2Vec2Phoneme

Wav2Vec2Phoneme was proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition by Qiantong Xu, Alexei Baevski, Michael Auli. Wav2Vec2Phoneme enables phoneme classification as part of automatic speech recognition.

  • [Wav2Vec2 Phoneme] Let phonemizer lang default to tokenizer's settings by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14829

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=phoneme-recognition
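
A minimal phoneme-transcription sketch, assuming the facebook/wav2vec2-lv-60-espeak-cv-ft checkpoint (an illustration, not an official example):

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC, Wav2Vec2PhonemeCTCTokenizer

# Assumed checkpoint name; the tokenizer decodes CTC predictions back into phoneme strings.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
tokenizer = Wav2Vec2PhonemeCTCTokenizer.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")

speech = torch.randn(16000).numpy()  # stand-in for a real 16 kHz waveform
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(predicted_ids))  # list of phoneme transcriptions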

UniSpeech-SAT

Unispeech-SAT was proposed in UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

UniSpeech-SAT is especially good at speaker-related tasks.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=unispeech-sat

UniSpeech

UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=unispeech

New Tasks

Speaker Diarization and Verification

Wav2Vec2-like architectures now have speaker diarization and speaker verification heads added to them. You can try out the new task here: https://huggingface.co/spaces/microsoft/wavlm-speaker-verification

  • Add Speaker Diarization and Verification heads by @anton-l in https://github.com/huggingface/transformers/pull/14723
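
A minimal speaker-verification sketch, assuming the microsoft/wavlm-base-plus-sv checkpoint (an illustration, not an official example):

import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

# Assumed checkpoint name; the model returns one x-vector embedding per utterance.
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

# Two random 16 kHz "utterances" standing in for real audio.
utterances = [torch.randn(16000).numpy(), torch.randn(16000).numpy()]
inputs = feature_extractor(utterances, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    embeddings = model(**inputs).embeddings

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print(float(similarity))  # higher values suggest the two utterances come from the same speaker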

What's Changed

  • Move import to avoid circular import by @sgugger in https://github.com/huggingface/transformers/pull/14787
  • PoC for conserving old links by @sgugger in https://github.com/huggingface/transformers/pull/14754
  • Removes images to put them in a dataset by @LysandreJik in https://github.com/huggingface/transformers/pull/14781
  • Post sphinx-clean up and contributing guide updates by @sgugger in https://github.com/huggingface/transformers/pull/14790
  • Fix the build documentation job by @sgugger in https://github.com/huggingface/transformers/pull/14788
  • Update CONTRIBUTING.md by @kamalkraj in https://github.com/huggingface/transformers/pull/14799
  • Update CONTRIBUTING.md by @kamalkraj in https://github.com/huggingface/transformers/pull/14800
  • Train step fix by @Rocketknight1 in https://github.com/huggingface/transformers/pull/14796
  • [Generate] Make generate multi-modal by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14784
  • Remove require_datasets testing utility by @LysandreJik in https://github.com/huggingface/transformers/pull/14795
  • [WavLM] Correct position bias computation by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14805
  • Fix Perceiver multi GPU test by @NielsRogge in https://github.com/huggingface/transformers/pull/14810
  • [WavLM] Layerdrop is not allowed for first layer by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14811
  • [Generate] Correct input_ids detection by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14815
  • Implement head_mask for Flax BERT and other models copied from BERT by @stancld in https://github.com/huggingface/transformers/pull/14620
  • Convert rst to mdx bert by @LysandreJik in https://github.com/huggingface/transformers/pull/14806
  • Wav2Vec2 meets phonemes by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14353
  • [ImageGPT] Deprecate pixel_values input name to input_ids by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14801
  • [Seq2SeqTrainer] Remove model input name hack by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14802
  • [WavLM] Fix slow tests by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14845
  • Add SD and SV heads for WavLM by @anton-l in https://github.com/huggingface/transformers/pull/14847
  • Add an argument to set bucket_cap_mb for PyTorch DDP by @changlan in https://github.com/huggingface/transformers/pull/14756
  • Update CONTRIBUTING.md by @kamalkraj in https://github.com/huggingface/transformers/pull/14835
  • Fix dead link to benchmarks.ipynb by @DerekChia in https://github.com/huggingface/transformers/pull/14842
  • [Perceiver] Skip multi-gpu tests for now by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14813
  • Add 'with torch.no_grad()' to DeBERTa integration test forward pass by @henholm in https://github.com/huggingface/transformers/pull/14821
  • Add 'with torch.no_grad()' to BERT integration test forward pass by @henholm in https://github.com/huggingface/transformers/pull/14820
  • Add a main_input_name attribute to all models by @sgugger in https://github.com/huggingface/transformers/pull/14803
  • [doc] typo by @stas00 in https://github.com/huggingface/transformers/pull/14849
  • [logging] implement warning_advice / TRANSFORMERS_NO_ADVISORY_WARNINGS by @stas00 in https://github.com/huggingface/transformers/pull/14669
  • Make the onnx submodule init lazy by @sgugger in https://github.com/huggingface/transformers/pull/14855
  • Convert docstrings of modeling files by @sgugger in https://github.com/huggingface/transformers/pull/14850
  • [Bart] better error message by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14854
  • Only create the model card on process 0 by @sgugger in https://github.com/huggingface/transformers/pull/14857
  • [ASR example] Improve example + add more examples by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14848
  • Fix the value error typo of AdamW's betas' valid values checking by @dourgey in https://github.com/huggingface/transformers/pull/14780
  • Add custom stopping_criteria and logits_processor to generate by @lvwerra in https://github.com/huggingface/transformers/pull/14779
  • Replace commit sha by commit url for update jobs by @sgugger in https://github.com/huggingface/transformers/pull/14852
  • [examples/summarization] deal with None in data records by @stas00 in https://github.com/huggingface/transformers/pull/14816
  • [doc porting] several docs by @stas00 in https://github.com/huggingface/transformers/pull/14858
  • Mass conversion of documentation from rst to Markdown by @sgugger in https://github.com/huggingface/transformers/pull/14866
  • Fix FLAX_MULTIPLE_CHOICE_SAMPLE typo by @mishig25 in https://github.com/huggingface/transformers/pull/14871
  • Fixes in marian doc by @sgugger in https://github.com/huggingface/transformers/pull/14872
  • Fix FlaxMarianMTModel return block. by @sgugger in https://github.com/huggingface/transformers/pull/14873
  • Fix doc mistakes by @sgugger in https://github.com/huggingface/transformers/pull/14874
  • Convert model files from rst to mdx by @LysandreJik in https://github.com/huggingface/transformers/pull/14865
  • update the arguments add_prefix_space and trim_offsets in backend_tokenizer.post_processor of RobertaTokenizerFast by @SaulLu in https://github.com/huggingface/transformers/pull/14752
  • Feature/fix slow test in mluke by @Ryou0634 in https://github.com/huggingface/transformers/pull/14749
  • Updated deberta attention by @guillaume-be in https://github.com/huggingface/transformers/pull/14625
  • IterableDatasetShard should use per device batch size instead of real… by @SysuCharon in https://github.com/huggingface/transformers/pull/14714
  • Fix Perceiver code example by @NielsRogge in https://github.com/huggingface/transformers/pull/14879
  • Fix pytorch image classification example by @mariosasko in https://github.com/huggingface/transformers/pull/14883
  • Onnx enable tasks for supported models (part 2) by @michaelbenayoun in https://github.com/huggingface/transformers/pull/14700
  • Properly indent return block by @sgugger in https://github.com/huggingface/transformers/pull/14887

New Contributors

  • @changlan made their first contribution in https://github.com/huggingface/transformers/pull/14756
  • @DerekChia made their first contribution in https://github.com/huggingface/transformers/pull/14842
  • @henholm made their first contribution in https://github.com/huggingface/transformers/pull/14821
  • @dourgey made their first contribution in https://github.com/huggingface/transformers/pull/14780
  • @SysuCharon made their first contribution in https://github.com/huggingface/transformers/pull/14714

Full Changelog: https://github.com/huggingface/transformers/compare/v4.14.0...v4.15.0


State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0

🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with thousands of pretrained models in 100+ languages and deep interoperability between PyTorch & TensorFlow 2.0.

Features

  • High performance on NLU and NLG tasks
  • Low barrier to entry for educators and practitioners

State-of-the-art NLP for everyone

  • Deep learning researchers
  • Hands-on practitioners
  • AI/ML/NLP teachers and educators

Lower compute costs, smaller carbon footprint

  • Researchers can share trained models instead of always retraining
  • Practitioners can reduce compute time and production costs
  • Dozens of architectures with over 1,000 pretrained models, some in more than 100 languages

Choose the right framework for every part of a model's lifetime

  • Train state-of-the-art models in 3 lines of code
  • Deep interoperability between TensorFlow 2.0 and PyTorch models
  • Move a single model between TF2.0/PyTorch frameworks at will
  • Seamlessly pick the right framework for training, evaluation, production
| Section | Description |
| --- | --- |
| Installation | How to install the package |
| Model architectures | Architectures (with pretrained weights) |
| Online demo | Experimenting with this repo’s text generation capabilities |
| Quick tour: Usage | Tokenizers & models usage: Bert and GPT-2 |
| Quick tour: TF 2.0 and PyTorch | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
| Quick tour: pipelines | Using Pipelines: Wrapper around tokenizer and models to use finetuned models |
| Quick tour: Fine-tuning/usage scripts | Using provided scripts: GLUE, SQuAD and text generation |
| Quick tour: Share your models | Upload and share your fine-tuned models with the community |
| Migrating from pytorch-transformers to transformers | Migrating your code from pytorch-transformers to transformers |
| Migrating from pytorch-pretrained-bert to transformers | Migrating your code from pytorch-pretrained-bert to transformers |
| Documentation (master, v2.5.0, v2.4.0/v2.4.1, v2.3.0, v2.2.0/v2.2.1/v2.2.2, v2.1.1, v2.0.0, v1.2.0, v1.1.0, v1.0.0) | Full API documentation and more |

Installation

This repo is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.

You should install 🤗 Transformers in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

Create a virtual environment with the version of Python you're going to use and activate it.

Now, if you want to use 🤗 Transformers, you can install it with pip. If you'd like to play with the examples, you must install it from source.

With pip

First you need to install one of, or both, TensorFlow 2.0 and PyTorch. Please refer to TensorFlow installation page and/or PyTorch installation page regarding the specific install command for your platform.

When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:

pip install transformers

From source

Here also, you first need to install one of, or both, TensorFlow 2.0 and PyTorch. Please refer to TensorFlow installation page and/or PyTorch installation page regarding the specific install command for your platform.

When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and running:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .

When you update the repository, you should upgrade the transformers installation and its dependencies as follows:

git pull
pip install --upgrade .

Run the examples

Examples are included in the repository but are not shipped with the library.

Therefore, in order to run the latest versions of the examples, you need to install from source, as described above.

Look at the README for how to run examples.

Tests

A series of tests are included for the library and for some example scripts. Library tests can be found in the tests folder and examples tests in the examples folder.

Depending on which framework is installed (TensorFlow 2.0 and/or PyTorch), the irrelevant tests will be skipped. Ensure that both frameworks are installed if you want to execute all tests.

Here's the easiest way to run tests for the library:

pip install -e ".[testing]"
make test

and for the examples:

pip install -e ".[testing]"
pip install -r examples/requirements.txt
make test-examples

For details, refer to the contributing guide.

Do you want to run a Transformer model on a mobile device?

You should check out our swift-coreml-transformers repo.

It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains GPT-2, DistilGPT-2, BERT, and DistilBERT) to CoreML models that run on iOS devices.

At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!

Model architectures

🤗 Transformers currently provides the following NLU/NLG architectures:

  1. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  2. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  3. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  4. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
  5. XLNet (from Google/CMU) released with the paper ​XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
  6. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  7. RoBERTa (from Facebook), released together with the paper a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  8. DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
  9. CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
  10. CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  11. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  12. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  13. XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
  14. MMBT (from Facebook), released together with the paper a Supervised Multimodal Bitransformers for Classifying Images and Text by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
  15. FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
  16. BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
  17. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  18. DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
  19. Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
  20. MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
  21. Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  22. Other community models, contributed by the community.
  23. Want to contribute a new model? We have added a detailed guide and templates to guide you in the process of adding a new model. You can find them in the templates folder of the repository. Be sure to check the contributing guidelines and contact the maintainers or open an issue to collect feedback before starting your PR.

These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the documentation.

Online demo

Write With Transformer, built by the Hugging Face team at transformer.huggingface.co, is the official demo of this repo’s text generation capabilities. You can use it to experiment with completions generated by GPT2Model, TransfoXLModel, and XLNetModel.

"Write With Transformer is to writing what calculators are to calculus."


Quick tour

Let's do a very quick overview of the model architectures in 🤗 Transformers. Detailed examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the full documentation.

import torch
from transformers import *

# Transformers has a unified API
# for 10 transformer architectures and 30 pretrained weights.
#          Model          | Tokenizer          | Pretrained weights shortcut
MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (CTRLModel,       CTRLTokenizer,       'ctrl'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
         ]

# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF', e.g. `TFRobertaModel` is the TF 2.0 counterpart of the PyTorch model `RobertaModel`

# Let's encode some text in a sequence of hidden-states using each model:
for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Encode text
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # Models outputs are now tuples

# Each architecture is provided with several classes for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]

# All the classes for an architecture can be instantiated from pretrained weights for this architecture
# Note that additional weights added for fine-tuning are only initialized
# and need to be trained on the down-stream task
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model/tokenizer
    model = model_class.from_pretrained(pretrained_weights)

    # Models can return full list of hidden-states & attentions weights at each layer
    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    all_hidden_states, all_attentions = model(input_ids)[-2:]

    # Models are compatible with Torchscript
    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
    traced_model = torch.jit.trace(model, (input_ids,))

    # Simple serialization for models and tokenizers
    model.save_pretrained('./directory/to/save/')  # save
    model = model_class.from_pretrained('./directory/to/save/')  # re-load
    tokenizer.save_pretrained('./directory/to/save/')  # save
    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load

    # SOTA examples for GLUE, SQUAD, text generation...

Quick tour TF 2.0 training and PyTorch interoperability

Let's do a quick example of how a TensorFlow 2.0 model can be trained in 12 lines of code with 🤗 Transformers and then loaded in PyTorch for fast inspection/tests.

import tensorflow as tf
import tensorflow_datasets
from transformers import *

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

# Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)

# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()

print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")

Quick tour of the fine-tuning/usage scripts

Important Before running the fine-tuning scripts, please read the instructions on how to set up your environment to run the examples.

The library comprises several example scripts with SOTA performances for NLU and NLG tasks:

  • run_glue.py: an example fine-tuning sequence classification models on nine different GLUE tasks (sequence-level classification)
  • run_squad.py: an example fine-tuning question answering models on the question answering dataset SQuAD 2.0 (token-level classification)
  • run_ner.py: an example fine-tuning token classification models on named entity recognition (token-level classification)
  • run_generation.py: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
  • other model-specific examples (see the documentation).

Here are three quick usage examples for these scripts:

run_glue.py: Fine-tuning on GLUE tasks for sequence classification

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

Before running any of these GLUE tasks you should download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

You should also install the additional packages required by the examples:

pip install -r ./examples/requirements.txt
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/text-classification/run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_device_eval_batch_size=8   \
    --per_device_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/

where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.

Fine-tuning XLNet model on the STS-B regression task

This example code fine-tunes XLNet on the STS-B corpus using parallel training on a server with 4 V100 GPUs. Parallel training is a simple way to use several GPUs (but is slower and less flexible than distributed training, see below).

export GLUE_DIR=/path/to/glue

python ./examples/text-classification/run_glue.py \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --do_eval   \
    --task_name=sts-b     \
    --data_dir=${GLUE_DIR}/STS-B  \
    --output_dir=./proc_data/sts-b-110   \
    --max_seq_length=128   \
    --per_device_eval_batch_size=8   \
    --per_device_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=1200  \
    --model_name=xlnet-large-cased   \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120

On this machine we thus have a batch size of 32; if you have a smaller machine, please increase gradient_accumulation_steps to reach the same batch size. These hyper-parameters should result in a Pearson correlation coefficient of +0.917 on the development set.

Fine-tuning Bert model on the MRPC classification task

This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) corpus using distributed training on 8 V100 GPUs to reach an F1 > 92.

python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classification/run_glue.py   \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --task_name MRPC \
    --do_train   \
    --do_eval   \
    --data_dir $GLUE_DIR/MRPC/   \
    --max_seq_length 128   \
    --per_device_eval_batch_size=8   \
    --per_device_train_batch_size=8   \
    --learning_rate 2e-5   \
    --num_train_epochs 3.0  \
    --output_dir /tmp/mrpc_output/ \
    --overwrite_output_dir   \
    --overwrite_cache \

Training with these hyper-parameters gave us the following results:

  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  eval_loss = 0.3418912578906332
  f1 = 0.9210526315789473
  global_step = 174
  loss = 0.07231863956341798

run_squad.py: Fine-tuning on SQuAD for question-answering

This example code fine-tunes BERT on the SQuAD dataset using distributed training on 8 V100 GPUs and the Bert Whole Word Masking uncased model to reach an F1 > 93 on SQuAD:

python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_device_eval_batch_size=3   \
    --per_device_train_batch_size=3   \

Training with these hyper-parameters gave us the following results:

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 86.91579943235573, "f1": 93.1532499015869}

This is the model provided as bert-large-uncased-whole-word-masking-finetuned-squad.

run_generation.py: Text generation with GPT, GPT-2, CTRL, Transformer-XL and XLNet

A conditional generation script is also included to generate text from a prompt. The generation script includes the tricks proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).

Here is how to run the script with the small version of OpenAI GPT-2 model:

python ./examples/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2 \

and from the Salesforce CTRL model:

python ./examples/text-generation/run_generation.py \
    --model_type=ctrl \
    --length=20 \
    --model_name_or_path=ctrl \
    --temperature=0 \
    --repetition_penalty=1.2 \

Quick tour of model sharing

Starting with v2.2.2, you can now upload and share your fine-tuned models with the community, using the CLI that's built-in to the library.

First, create an account on https://huggingface.co/join. Optionally, join an existing organization or create a new one. Then:

transformers-cli login
# log in using the same credentials as on huggingface.co

Upload your model:

transformers-cli upload ./path/to/pretrained_model/

# ^^ Upload folder containing weights/tokenizer/config
# saved via `.save_pretrained()`

transformers-cli upload ./config.json [--filename folder/foobar.json]

# ^^ Upload a single file
# (you can optionally override its filename, which can be nested inside a folder)

If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:

--organization organization_name

Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:

"username/pretrained_model"
# or if an org:
"organization_name/pretrained_model"

Please add a README.md model card to the repo under model_cards/ with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.

Your model now has a page on huggingface.co/models.

Anyone can load it from code:

tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model")
model = AutoModel.from_pretrained("namespace/pretrained_model")

List all your files on S3:

transformers-cli s3 ls

You can also delete unneeded files:

transformers-cli s3 rm …

Quick tour of pipelines

New in version v2.3: Pipelines are high-level objects that automatically handle tokenization, run your data through a transformers model and output the result in a structured object.

You can create Pipeline objects for the following down-stream tasks:

  • feature-extraction: Generates a tensor representation for the input sequence
  • ner: Generates named entity mapping for each word in the input sequence.
  • sentiment-analysis: Gives the polarity (positive / negative) of the whole input sequence.
  • text-classification: Initialize a TextClassificationPipeline directly, or see sentiment-analysis for an example.
  • question-answering: Given some context and a question referring to the context, it will extract the answer to the question from the context.
  • fill-mask: Takes an input sequence containing a masked token (e.g. <mask>) and returns a list of the most probable filled sequences, with their probabilities.
  • summarization
  • translation_xx_to_yy
>>> from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
>>> nlp = pipeline('sentiment-analysis')
>>> nlp('We are very happy to include pipeline into the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]

# Allocate a pipeline for question-answering
>>> nlp = pipeline('question-answering')
>>> nlp({
...     'question': 'What is the name of the repository ?',
...     'context': 'Pipeline have been included in the huggingface/transformers repository'
... })
{'score': 0.5135612454720828, 'start': 35, 'end': 59, 'answer': 'huggingface/transformers'}
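
Continuing the snippet above, a fill-mask pipeline can be allocated the same way (the default checkpoint and the exact completions and scores may differ):

# Allocate a pipeline for fill-mask
>>> nlp = pipeline('fill-mask')
>>> nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses.")
# Returns the most probable fill-ins for the masked position, with their probabilities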

Migrating from pytorch-transformers to transformers

Here is a quick summary of what you should take care of when migrating from pytorch-transformers to transformers.

Positional order of some models' keyword inputs (attention_mask, token_type_ids...) changed

To be able to use Torchscript (see #1010, #1204 and #1195), the specific order of some models' keyword inputs (attention_mask, token_type_ids...) has been changed.

If you used to call the models with keyword names for keyword arguments, e.g. model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids), this should not cause any change.

If you used to call the models with positional inputs for keyword arguments, e.g. model(input_ids, attention_mask, token_type_ids), you may have to double check the exact order of input arguments, as sketched below.
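
Here is a minimal sketch with BERT; the keyword form is unaffected by the reordering, while the positional form depends on the signature of the installed version:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoded = tokenizer.encode_plus("Here is some text", return_tensors='pt')
input_ids = encoded['input_ids']
attention_mask = encoded['attention_mask']
token_type_ids = encoded['token_type_ids']

# Safe: keyword arguments are matched by name, so the reordering has no effect.
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

# Fragile: purely positional arguments rely on the exact order in the model's signature;
# double check it before calling the model this way.
outputs = model(input_ids, attention_mask, token_type_ids)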

Migrating from pytorch-pretrained-bert to transformers

Here is a quick summary of what you should take care of when migrating from pytorch-pretrained-bert to transformers.

Models always output tuples

The main breaking change when migrating from pytorch-pretrained-bert to transformers is that every model's forward method always outputs a tuple with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model is detailed in the models' docstrings and the documentation.

In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in pytorch-pretrained-bert.

Here is a pytorch-pretrained-bert to transformers conversion example for a BertForSequenceClassification classification model:

# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

# In transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs

Using hidden states

By enabling the configuration option output_hidden_states, it was possible to retrieve the last hidden states of the encoder. In pytorch-transformers as well as transformers the return value has changed slightly: all_hidden_states now also includes the hidden state of the embeddings in addition to those of the encoding layers. This allows users to easily access the embeddings' final state.
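
A minimal sketch of this behavior, following the tuple-style outputs used in the examples above:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)

all_hidden_states = outputs[-1]            # also available as outputs.hidden_states in recent versions
embedding_output = all_hidden_states[0]    # hidden state of the embeddings
last_layer_output = all_hidden_states[-1]  # same tensor as outputs[0]
print(len(all_hidden_states))              # 13 for bert-base: 12 encoder layers + the embeddings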

Serialization

Breaking change in the from_pretrained() method:

  1. Models are now set in evaluation mode by default when instantiated with the from_pretrained() method. To train them, don't forget to set them back in training mode (model.train()) to activate the dropout modules.

  2. The additional *input and **kwargs arguments supplied to the from_pretrained() method used to be directly passed to the underlying model's class __init__() method. They are now used to update the model configuration attribute instead, which can break derived model classes built based on the previous BertForSequenceClassification examples. We are working on a way to mitigate this breaking change in #866 by forwarding to the model's __init__() method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.

Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method save_pretrained(save_directory) if you were using any other serialization method before.

Here is an example:

### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
# Train our model
train(model)

### Now let's save our model and tokenizer to a directory
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')

Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, BertAdam and OpenAIAdam, have been replaced by a single AdamW optimizer which has a few differences:

  • it only implements weight decay correction,
  • schedules are now externals (see below),
  • gradient clipping is now also external (see below).

The new optimizer AdamW matches the PyTorch Adam optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.

The schedules are now standard PyTorch learning rate schedulers and not part of the optimizer anymore.

Here is a conversion example from BertAdam with a linear warmup and decay schedule to AdamW and the same schedule:

# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_training_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

### Previously BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_training_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    model.train()
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

Citation

We now have a paper you can cite for the 🤗 Transformers library:

@article{Wolf2019HuggingFacesTS,
  title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
  author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Jamie Brew},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.03771}
}

Comments

  • Fix loss calculation in TFXXXForTokenClassification models
    Fix loss calculation in TFXXXForTokenClassification models

    Jan 22, 2022

    What does this PR do?

    The current loss calculation in TFFunnelForTokenClassification (and other TF token classification models) doesn't align with the loss in FunnelForTokenClassification:

    TF https://github.com/huggingface/transformers/blob/6ac77534bfe97c00e0127bb4fc846ae0faf1c9c5/src/transformers/models/funnel/modeling_tf_funnel.py#L1709

    PT https://github.com/huggingface/transformers/blob/6ac77534bfe97c00e0127bb4fc846ae0faf1c9c5/src/transformers/models/funnel/modeling_funnel.py#L1470-L1481

    (which further uses the attention_mask).

    This PR aims to fix this. Currently only TFFunnelForTokenClassification is fixed. I would like to have some feedback first before fixing all involved models.
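
    For context, here is a rough paraphrase (not the library code) of the masking that the PyTorch token-classification heads apply, which the TF side currently lacks:

    import torch
    from torch.nn import CrossEntropyLoss

    def token_classification_loss(logits, labels, attention_mask, num_labels):
        # Positions where attention_mask == 0 are mapped to the loss's ignore_index,
        # so padded tokens do not contribute to the loss.
        loss_fct = CrossEntropyLoss()
        if attention_mask is None:
            return loss_fct(logits.view(-1, num_labels), labels.view(-1))
        active = attention_mask.view(-1) == 1
        active_labels = torch.where(
            active, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
        )
        return loss_fct(logits.view(-1, num_labels), active_labels)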

    More information 1

    Check the loss difference between PT/TF version 300 times,

    For (TF)BertForTokenClassification

    without this PR:
    
      max_diff:   0.022978603839874268
      mean_diff: 0.006492746472358704
    
    with this PR (applied locally):
    
      max_diff:   1.7881393432617188e-07
      mean_diff: 3.3974647521972653e-08
    

    For (TF)FunnelForTokenClassification

    without this PR:
    
      max_diff:   0.2821081280708313
      mean_diff: 0.05486685564120611
    
    with this PR:
    
      max_diff:   3.5762786865234375e-07
      mean_diff: 6.020069122314453e-08
    

    More information 2

    The current transformers version doesn't have a PT/TF test that checks equivalence when labels are passed to the models. PR #15256 introduces such a test (to ensure closer PT/TF equivalence). The inconsistent loss calculation in the PT/TF XXXForTokenClassification models requires the error tolerance to be no smaller than 0.2 in order to pass the test (and even that doesn't guarantee a pass).

    https://github.com/huggingface/transformers/blob/6134cc69527aa9df9a4f2bc9e545222456a34524/tests/test_modeling_tf_common.py#L413-L441

    It would be better to address this inconsistency, and lower that threshold once this PR is merged.

    Reply
  • feat(flax): leave restored weights on CPU
    feat(flax): leave restored weights on CPU

    Jan 22, 2022

    What does this PR do?

    When restoring a flax model with .from_pretrained(), leave the weights on CPU.

    The removed section was linked to issue https://github.com/google/flax/issues/1261, which is now closed. When calling jnp.array(), the tensors are converted from numpy arrays and placed on the default device (typically GPU or TPU), which can cause issues when loading very large models that don't fit on a single instance.
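
    For illustration only (not the PR's code), this is roughly the difference between letting JAX place an array on the default backend and pinning it to the host CPU:

    import jax
    import jax.numpy as jnp
    import numpy as np

    weights = np.random.rand(4096, 4096).astype(np.float32)    # stand-in for restored weights

    on_default_device = jnp.array(weights)                     # lands on GPU/TPU when one is available
    on_cpu = jax.device_put(weights, jax.devices("cpu")[0])    # stays in host memory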

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [X] Did you read the contributor guideline, Pull Request section?
    • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
    • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [ ] Did you write any new necessary tests?

    Who can review?

    Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

    @patil-suraj @patrickvonplaten

    Reply
  • Question: how to evaluate word similarity with context?
    Question: how to evaluate word similarity with context?

    Jan 23, 2022

    ❓ Questions & Help

    Hello, I am trying to evaluate word similarity (particularly named entities).

    Example: I flew to Paris last month. I met a famous celebrity, Paris Hilton.

    Most named entity recognition models (e.g. the default pipeline('ner')) will return Paris and Paris Hilton as named entities. They look similar, but they represent different entities. My question is:

    How can I determine that two hyponymic words/phrases are different (from context)?

    I tried this solution: https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2 and then compared with cosine similarity, but as mentioned in this issue https://github.com/huggingface/transformers/issues/2298, cosine similarity does not work well with BERT (and similar like models) embeddings.

    I also looked into the topic of WSD (Word Sense Disambiguation), but those solutions work mainly for "standard" hyponyms (like baseball bat and cave bat), whereas named entities are usually special and NOT present in a dictionary.

    Any help or suggestion will be more than welcome! 🤗

    Reply
  • Prevent `block_emb` from sending to cuda in `RealmForOpenQA` model
    Prevent `block_emb` from sending to cuda in `RealmForOpenQA` model

    Jan 23, 2022

    What does this PR do?

    This PR overrides the torch.nn.Module.to function to prevent self.block_emb, which would otherwise consume a large amount of GPU memory, from being unnecessarily sent to CUDA. As a result, the fine-tuning process can be run on a GPU with 12 GB of VRAM.

    The original TF code also limits block_emb to cpu: https://github.com/google-research/language/blob/61fa7260ac7d690d11ef72ca863e45a37c0bdc80/language/orqa/utils/scann_utils.py#L39-L64
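
    As a rough illustration of the idea (not the RealmForOpenQA implementation), overriding to() so that one large registered buffer is excluded from device moves can look like this:

    import torch
    from torch import nn

    class KeepsBufferOnCpu(nn.Module):
        def __init__(self):
            super().__init__()
            # Hypothetical large buffer standing in for `block_emb`.
            self.register_buffer("block_emb", torch.zeros(100_000, 128))
            self.proj = nn.Linear(128, 128)

        def to(self, *args, **kwargs):
            # Temporarily swap out the big buffer, move everything else, then reattach it.
            block_emb = self.block_emb
            self.block_emb = torch.empty(0)
            super().to(*args, **kwargs)
            self.block_emb = block_emb  # keeps its original (CPU) placement
            return self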

    Fixes # (issue)

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [ ] Did you read the contributor guideline, Pull Request section?
    • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
    • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [ ] Did you write any new necessary tests?

    Who can review?

    Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

    @patrickvonplaten

    Reply
  • Fix the inconsistency of loss calculation between PT/TF XLNetLMHeadModel
    Fix the inconsistency of loss calculation between PT/TF XLNetLMHeadModel

    Jan 23, 2022

    What does this PR do?

    The loss calculation in XLNetLMHeadModel doesn't cut the logits/labels (unlike other models, say BertLMHeadModel). However, TFXLNetLMHeadModel works like the other TF causal LM models and cuts the logits. This sometimes causes a loss difference higher than 4e-2.

    I believe XLNet works somewhat differently from the usual causal LM models, and the provided labels and computed logits shouldn't be cut, as is the case in XLNetLMHeadModel.

    This PR fixes this inconsistency.
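
    As a toy illustration (not library code) of the two conventions being compared:

    import torch
    from torch.nn import CrossEntropyLoss

    batch, seq_len, vocab = 2, 8, 32
    logits = torch.randn(batch, seq_len, vocab)
    labels = torch.randint(0, vocab, (batch, seq_len))
    loss_fct = CrossEntropyLoss()

    # Most causal LM heads cut/shift: the logits at position t predict the label at t + 1.
    shifted_loss = loss_fct(logits[:, :-1].reshape(-1, vocab), labels[:, 1:].reshape(-1))

    # XLNetLMHeadModel expects logits and labels that are already aligned and does not shift them.
    unshifted_loss = loss_fct(logits.reshape(-1, vocab), labels.reshape(-1))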

    Reply
  • Positive Constraint Decoding PR #1
    Positive Constraint Decoding PR #1

    Jan 23, 2022

    Disjunctive Positive Constraint Decoding

    @patrickvonplaten @LysandreJik @sgugger @patil-suraj @yjernite @thomwolf

    Fixes #14081.

    I apologize if this isn't the proper way to handle feature contributions, but this is an incomplete PR. I simply thought this was a good place to check in on the progress & direction of the implementation. We can keep adding commits to this PR until it's ready for the final merge, right?

    Steps left:

    • [ ] Applying positive constraints disjunctively.
    • [ ] Writing tests

    Here is an example of how one could use this functionality:

    from transformers import GPT2Tokenizer, GPT2LMHeadModel
    from transformers.generation_beam_constraints import (
        PhrasalConstraint
    )
    device = "cuda"
    
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    force_text = " big monsters"
    force_text_2 = " crazy"
    force_tokens = tokenizer.encode(force_text, return_tensors="pt").to(device)[0]
    force_tokens_2 = tokenizer.encode(force_text_2, return_tensors="pt").to(device)[0]
    
    constraints = [
        PhrasalConstraint(force_tokens),
        PhrasalConstraint(force_tokens_2)
    ]
    
    input_text = ["The baby is crying because"] 
    
    model_inputs = tokenizer(input_text, return_tensors="pt")
    
    for key, value in model_inputs.items():
        model_inputs[key] = value.to(device)
    
    k = model.generate(
        **model_inputs,
        constraints=constraints,
        num_beams=7,
        num_return_sequences=7,
        no_repeat_ngram_size=2
    )
    
    for out in k:
        print(tokenizer.decode(out))
    

    For some example outputs:

    The baby is crying because she's been told crazy big monsters are going to come and kill her.
    The baby is crying because she's been told crazy big monsters are coming for her.
    The baby is crying because she's been told crazy big monsters are going to come after her.
    

    1. General Constraint Framework

    Users can define their own constraints by inheriting from the Constraint interface class, and the framework is guaranteed to work as desired because the Constraint class is quite strictly defined. If an implementation passes the self.test() function of this interface, then it necessarily works as desired. An incorrect implementation will lead to an error.

    # https://github.com/cwkeam/transformers/blob/master/src/transformers/generation_beam_constraints.py#L16
    class Constraint(ABC):
        r"""Abstract base class for all constraints that can be applied during generation.
        It must define how the constraint can be satisfied.
    
        All classes that inherit Constraint must follow the requirement that
        
        ```
        completed = False
        while(not completed):
            _, completed = constraint.update(constraint.advance())
        ```
        
        will always terminate (halt). 
        """
        def __init__(self):
            # test for the above condition
            self.test()
    
        def test(self):
            '''
            Tests whether this constraint has been properly defined.
            '''
            counter = 0
            completed = False
            while not completed:
                if counter == 1:
                    self.reset()
                advance = self.advance()
                assert self.does_advance(advance)
                stepped, completed, reset = self.update(advance)
                counter += 1
    
                if counter > 10000:
                    raise Exception("update() does not fulfill the constraint.")
    
            assert self.remaining() == 0        
            
        def advance(self):
            '''
            When called, returns the token that would take this constraint
            one step closer to being fulfilled.
    
            returns:
                token_ids(`torch.tensor`): Must be a tensor of a list of indexable tokens, not some integer.
            '''
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
    
        def does_advance(self, token_id: int):
            """
            Reads in a token and returns whether it creates progress.
            """
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
    
        def update(self, token_id: int):
            """
            Reads in a token and returns booleans that indicate the progress made by it.
            This function will update the state of this object, unlike `does_advance(self, token_id: int)`.
    
            This isn't to test whether a certain token will advance the progress; it's to update its state
            as if it has been generated. This becomes important if token_id != desired token 
            (refer to else statement in PhrasalConstraint)
    
            Args:
                token_id(`int`):
                    The id of a newly generated token in the beam search.
            returns:
                stepped(`boolean`):
                    Whether this constraint has become one step closer to being fulfilled.
                completed(`boolean`):
                    Whether this constraint has been completely fulfilled by this token being generated.
                reset (`boolean`):
                    Whether this constraint has reset its progress by this token being generated.
            """
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
        
        def reset(self):
            """
            Resets the state of this constraint to its initialization.
            We would call this in cases where the fulfillment of a constraint is abrupted by an unwanted token.
            """
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
    
        def remaining(self):
            '''
            Returns the number of remaining steps of `advance()` in order to complete this constraint.
            '''
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
    
        def copy(self, stateful=False):
            '''
            Creates a new instance of this constraint.
    
            Args:
                stateful(`boolean`): Whether to copy not only the constraint, but also its current state.
            Returns:
                constraint(`Constraint`): A copy of the constraint this method was called on.
            '''
            raise NotImplementedError(
                f"{self.__class__} is an abstract class. Only classes inheriting this class can be called."
            )
    

    For now, I've defined TokenConstraint for forcing the generation of a specific token and PhrasalConstraint for forcing the generation of a sequence of tokens that must appear unbroken in the output. An example use of the latter is in the example code above.
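    To make the interface concrete, here is a minimal, illustrative sketch (my own, not part of the PR; class name and details are assumptions) of how a single-token constraint could satisfy the Constraint contract above:

    class SingleTokenConstraint(Constraint):
        """Illustrative only: forces a single token_id to appear in the output."""

        def __init__(self, token_id: int):
            self.token_id = token_id
            self.completed = False
            super().__init__()  # runs the ABC's self.test() sanity check

        def advance(self):
            # The only token that makes progress is the forced token itself.
            return self.token_id

        def does_advance(self, token_id: int):
            return token_id == self.token_id

        def update(self, token_id: int):
            if self.does_advance(token_id):
                self.completed = True
                return True, True, False   # stepped, completed, reset
            # Any other token resets progress on this constraint.
            self.reset()
            return False, False, True

        def reset(self):
            self.completed = False

        def remaining(self):
            return 0 if self.completed else 1

        def copy(self, stateful=False):
            new = SingleTokenConstraint(self.token_id)
            if stateful:
                new.completed = self.completed
            return new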

    2. model.generate() Mixin

    # https://github.com/cwkeam/transformers/blob/master/src/transformers/generation_utils.py#L780
     def generate(
            self,
            inputs: Optional[torch.Tensor] = None,
            max_length: Optional[int] = None,
            ...
            stopping_criteria: Optional[StoppingCriteriaList] = StoppingCriteriaList(),
            constraints: Optional[List[Constraint]] = None,
            output_attentions: Optional[bool] = None,
            ...
            **model_kwargs,
        ) 
    

    Leads to:

    #https://github.com/cwkeam/transformers/blob/master/src/transformers/generation_utils.py#L1077
    
    # 6. determine generation mode
    is_constraint_gen_mode = constraints is not None
    is_greedy_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is False and constraints is None
    is_sample_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is True and constraints is None
    is_beam_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is False and constraints is None
    is_beam_sample_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is True and constraints is None
    is_group_beam_gen_mode = (num_beams > 1) and (num_beam_groups > 1) and constraints is None
    

    This ends up defining a ConstrainedBeamSearchScorer and initiating the beam search:

    elif is_constraint_gen_mode:
      if num_return_sequences > num_beams:
          raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
    
      if stopping_criteria.max_length is None:
          raise ValueError("`max_length` needs to be a stopping_criteria for now.")
    
      # 10. prepare beam search scorer
      constrained_beam_scorer = ConstrainedBeamSearchScorer(
          constraints=constraints,
          batch_size=batch_size,
          ...,
      )
      # 11. interleave input_ids with `num_beams` additional sequences per batch
      input_ids, model_kwargs = self._expand_inputs_for_generation(
          input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs
      )
      # 12. run beam search
      return self.constrained_beam_search(
          input_ids,
          constrained_beam_scorer=constrained_beam_scorer,
          ...
      )
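
    Putting the pieces together, a hypothetical end-to-end call of the proposed API might look like the following (class and parameter names follow this PR and may differ from what is eventually merged):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Force the word "schnell" to appear, unbroken, somewhere in the generated output.
    force_ids = tokenizer("schnell", add_special_tokens=False).input_ids
    constraint = PhrasalConstraint(force_ids)  # Constraint subclass proposed in this PR

    inputs = tokenizer("translate English to German: HuggingFace is fast", return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        constraints=[constraint],  # triggers is_constraint_gen_mode above
        num_beams=4,
        max_length=32,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))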
    
    

    3. Future Steps

    1. Disjunctive Constraints

    This doesn't yet implement the disjunctive decoding explained in issue #14081, but it can easily be added by defining a new Constraint sub-class (a rough sketch follows below). I will follow up on this in another commit.
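    For illustration only, here is a rough sketch (my own, not from the PR) of what such a disjunctive sub-class might look like: it is fulfilled as soon as any one of several candidate token sequences has been generated contiguously (sequences are assumed to have distinct first tokens to keep the sketch short):

    class SimpleDisjunctiveConstraint(Constraint):
        """Illustrative sketch: fulfilled when ANY one of several token sequences is generated."""

        def __init__(self, nested_token_ids):
            self.seqs = nested_token_ids      # e.g. [[5, 9], [3]]
            self.current = None               # sequence currently being matched
            self.fulfilled = 0                # tokens of `current` matched so far
            self.completed = False
            super().__init__()                # runs the ABC's self.test()

        def _candidates(self):
            if self.current is not None:
                return [self.current[self.fulfilled]]
            return [seq[0] for seq in self.seqs]

        def advance(self):
            return None if self.completed else self._candidates()[0]

        def does_advance(self, token_id):
            return (not self.completed) and token_id in self._candidates()

        def update(self, token_id):
            if self.does_advance(token_id):
                if self.current is None:
                    # Lock onto the first sequence that starts with this token.
                    self.current = next(s for s in self.seqs if s[0] == token_id)
                self.fulfilled += 1
                self.completed = self.fulfilled == len(self.current)
                return True, self.completed, False
            self.reset()
            return False, False, True

        def reset(self):
            self.current, self.fulfilled, self.completed = None, 0, False

        def remaining(self):
            if self.completed:
                return 0
            if self.current is not None:
                return len(self.current) - self.fulfilled
            return min(len(s) for s in self.seqs)

        def copy(self, stateful=False):
            new = SimpleDisjunctiveConstraint(self.seqs)
            if stateful:
                new.current, new.fulfilled, new.completed = self.current, self.fulfilled, self.completed
            return new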

    2. Tests

    I was unsure how to approach testing this generation function, especially since it's almost identical to the existing approaches, with just another step included that guides the generation.

  • Added missing code in exemplary notebook - custom datasets fine-tuning

    Jan 23, 2022

    What does this PR do?

    Added the missing code in the tokenize_and_align_labels function in the example notebook on custom datasets - token classification. The missing code concerns assigning labels to all but the first token in a single word. The added code was taken directly from the Hugging Face official example - this colab notebook.
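    For reference, a condensed sketch of that pattern (based on the official token classification example; the exact notebook code may differ slightly):

    def tokenize_and_align_labels(examples, tokenizer, label_all_tokens=True):
        # Requires a fast tokenizer so that word_ids() is available.
        tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                if word_idx is None:
                    # Special tokens get -100 so the loss ignores them.
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:
                    # First token of a word keeps the word's label.
                    label_ids.append(label[word_idx])
                else:
                    # Remaining sub-tokens of the word: repeat the label or ignore them.
                    label_ids.append(label[word_idx] if label_all_tokens else -100)
                previous_word_idx = word_idx
            labels.append(label_ids)
        tokenized_inputs["labels"] = labels
        return tokenized_inputs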

    Fixes # (issue)

    Before submitting

    • [X] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [ ] Did you read the contributor guideline, Pull Request section?
    • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
    • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [ ] Did you write any new necessary tests?

    Who can review?

    • maintained examples (not research project or legacy): @sgugger, @patil-suraj
  • Getting error while saving model

    Jan 23, 2022

    Environment info

    • transformers version: 4.5.1
    • Platform: linux
    • Python version : 3.6
    • PyTorch version (GPU?): gpu
    • Tensorflow version (GPU?):
    • Using GPU in script?: yes
    • Using distributed or parallel set-up in script?: yes
    • @sgugger Need help with trainer module

    Models:

    • BERT

    • I am using the BERT model. The problem arises in the Trainer module.

    #15300
    File "~/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
        ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
    TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
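    For reference, the TypeError itself is standard Python behaviour when an Enum member is used as the left operand of `in` against a plain string; a minimal, standalone illustration (not the library's actual class):

    from enum import Enum

    class ShardedDDPOption(Enum):  # stand-in for transformers' enum
        ZERO_DP_2 = "zero_dp_2"
        ZERO_DP_3 = "zero_dp_3"

    sharded_ddp = ""  # a plain string instead of a list of options
    ShardedDDPOption.ZERO_DP_2 in sharded_ddp
    # TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption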

    I am training a Bert Model for a multi-class classification task

    To reproduce

    Steps to reproduce the behavior:

    1. Code
       import logging
      

    import os
    from statistics import mean, stdev
    import sys
    from typing import Callable, Dict

    import pandas as pd
    import numpy as np
    from pprint import pformat
    from scipy.special import softmax
    import tensorboard
    import torch

    from transformers import (
        AutoTokenizer,
        AutoConfig,
        HfArgumentParser,
        Trainer,
        EvalPrediction,
        set_seed
    )

    from multimodal_args import ModelArguments, MultiModalDataArguments, MultiModalTrainingArguments
    from evaluation import calc_classification_metrics, calc_regression_metrics
    from load_dataset import load_datadir
    from config import TabularConfig
    from auto_fusion_model import AutoModelFusion
    from utils import create_dir_if_not_exists

    os.environ['COMET_MODE'] = 'DISABLED'
    logger = logging.getLogger(__name__)

    def main():

    #Define text and tabular features
    text_cols = ['keywords',"browse_node_name","pod","ORDERING_GL_PRODUCT_GROUP","gl_product_group_desc"]
    label_col = 'label'
    cat_features = []
    non_num_col = text_cols + ["shipping_address_id","postal_code","browse_node_id","label","asin","customer_id","order_day"]
    #features = pd.read_csv("/efs/avimodi/static_model/feature_importance_static_rhm.csv")
    #features_list = features.head(50)["Feature"].to_list()
    logger.info("Reading sample File")
    sample = pd.read_csv("/efs/avimodi/.MultiModal_Model/input_sample/val.csv")
    features_list = sample.columns.to_list()
    num_features = [col for col in features_list if col not in non_num_col]
    logger.info(len(num_features))
    label_list = ["0","1","2"] # what each label class represents
    column_info_dict = {
    'text_cols': text_cols,
    'num_cols': num_features,
    'cat_cols': cat_features,
    'label_col': 'label',
    'label_list': ["0","1","2"]
    }
    
    
    model_args = ModelArguments(
        model_name_or_path='bert-base-uncased'
    )
    
    data_args = MultiModalDataArguments(
        data_path='/efs/avimodi/.MultiModal_Model/input_sample',
        fusion_method='attention',
        features_info=column_info_dict,
        task='classification',
        numerical_encoding='min-max',
        categorical_encoding = 'none'
    )
    
    training_args = MultiModalTrainingArguments(
        output_dir="/efs/avimodi/unified_model/run_sample/output",
        logging_dir="/efs/avimodi/unified_model/run_sample/logs",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        per_device_train_batch_size=256,
        per_device_eval_batch_size=256,
        num_train_epochs=10,
        evaluate_during_training=True,
        logging_steps=25,
        eval_steps=500,
        save_steps=500,
        debug_dataset=True,
        report_to = ["tensorboard"],
    )
    
    set_seed(training_args.seed)
    
    # Setup logging
    create_dir_if_not_exists(training_args.output_dir)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
        datefmt="%m/%d/%Y %H:%M:%S",
        filename = os.path.join(training_args.output_dir,'train_log.txt'),
        filemode = 'w+'
    )
    
    
    
    
    logger.info(f"======== Model Args ========\n{(model_args)}\n")
    logger.info(f"======== Data Args ========\n{(data_args)}\n")
    logger.info(f"======== Training Args ========\n{(training_args)}\n")
    
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )
    
    
    train_dataset, val_dataset, test_dataset = load_datadir(
        data_args.data_path,
        data_args.features_info['text_cols'],
        tokenizer,
        label_col=data_args.features_info['label_col'],
        label_list=data_args.features_info['label_list'],
        categorical_cols=data_args.features_info['cat_cols'],
        numerical_cols=data_args.features_info['num_cols'],
        categorical_encoding=data_args.categorical_encoding,
        numerical_encoding=data_args.numerical_encoding,
        sep_text_token_str=tokenizer.sep_token,
        max_token_length=training_args.max_token_length,
        debug=training_args.debug_dataset
    )
    train_datasets = [train_dataset]
    val_datasets = [val_dataset]
    test_datasets = [test_dataset]
    
    train_dataset = train_datasets[0]
    
    num_labels = len(np.unique(train_dataset.labels)) if data_args.num_classes == -1 else data_args.num_classes
    
    def compute_metrics_fn(p: EvalPrediction):
        if data_args.task == "classification":
            preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
            preds_labels = np.argmax(preds, axis=1)
            if p.predictions.shape[-1] == 2:
                pred_scores = softmax(preds, axis=1)[:, 1]
            else:
                pred_scores = softmax(preds, axis=1)
            return calc_classification_metrics(pred_scores, preds_labels,
                                                p.label_ids)
        elif data_args.task == "regression":
            preds = np.squeeze(p.predictions)
            return calc_regression_metrics(preds, p.label_ids)
        else:
            return {}
    
    total_results = []
    for i, (train_dataset, val_dataset, test_dataset) in enumerate(zip(train_datasets, val_datasets, test_datasets)):
        logger.info(f'======== Fold {i+1} ========')
        config = AutoConfig.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            cache_dir=model_args.cache_dir,
        )
        tabular_config = TabularConfig(
                               num_labels=num_labels,
                               cat_feat_dim=train_dataset.cat_feats.shape[1] if train_dataset.cat_feats is not None else 0,
                               numerical_feat_dim=train_dataset.numerical_feats.shape[1] if train_dataset.numerical_feats is not None else 0,
                               **vars(data_args)
                               )
        config.tabular_config = tabular_config
    
        model = AutoModelFusion.from_pretrained(
            model_args.config_name if model_args.config_name else model_args.model_name_or_path,
            config=config,
            cache_dir=model_args.cache_dir
        )
        if i == 0:
            logger.info(tabular_config)
            logger.info(model)
    
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics_fn
        )
        if training_args.do_train:
            train_result = trainer.train(
                resume_from_checkpoint=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
            )
            metrics = train_result.metrics
            # max_train_samples = (
            #     data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
            # )
            metrics["train_samples"] = 500 if training_args.debug_dataset else len(train_dataset)
            trainer.save_model()  # Saves the tokenizer too for easy upload
            trainer.log_metrics("train", metrics)
            trainer.save_metrics("train", metrics)
            trainer.save_state()
    
    
        # Evaluation
        eval_results = {}
        if training_args.do_eval:
            logger.info("*** Evaluate ***")
            eval_result = trainer.evaluate(eval_dataset=val_dataset)
            logger.info(pformat(eval_result, indent=4))
    
            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_metric_results_{task}_fold_{i+1}.txt"
            )
            if trainer.is_world_master():
                with open(output_eval_file, "w") as writer:
                    logger.info("***** Eval results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))
    
            eval_results.update(eval_result)
    
        if training_args.do_predict:
            logging.info("*** Test ***")
    
            predictions = trainer.predict(test_dataset=test_dataset).predictions
            output_test_file = os.path.join(
                training_args.output_dir, f"test_results_{task}_fold_{i+1}.txt"
            )
            eval_result = trainer.evaluate(eval_dataset=test_dataset)
            logger.info(pformat(eval_result, indent=4))
            if trainer.is_world_master():
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    writer.write("index\tprediction\n")
                    if task == "classification":
                        predictions = np.argmax(predictions, axis=1)
                    for index, item in enumerate(predictions):
                        if task == "regression":
                            writer.write("%d\t%3.3f\t%d\n" % (index, item, test_dataset.labels[index]))
                        else:
                            item = test_dataset.get_labels()[item]
                            writer.write("%d\t%s\n" % (index, item))
                output_test_file = os.path.join(
                    training_args.output_dir, f"test_metric_results_{task}_fold_{i+1}.txt"
                )
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(task))
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))
                eval_results.update(eval_result)
        del model
        del config
        del tabular_config
        del trainer
        torch.cuda.empty_cache()
        total_results.append(eval_results)
    aggr_res = aggregate_results(total_results)
    logger.info('========= Aggr Results ========')
    logger.info(pformat(aggr_res, indent=4))
    
    output_aggre_test_file = os.path.join(
        training_args.output_dir, f"all_test_metric_results_{task}.txt"
    )
    with open(output_aggre_test_file, "w") as writer:
        logger.info("***** Aggr results {} *****".format(task))
        for key, value in aggr_res.items():
            logger.info("  %s = %s", key, value)
            writer.write("%s = %s\n" % (key, value))
    

    def aggregate_results(total_test_results):
    metric_keys = list(total_test_results[0].keys())
    aggr_results = dict()

    for metric_name in metric_keys:
        if type(total_test_results[0][metric_name]) is str:
            continue
        res_list = []
        for results in total_test_results:
            res_list.append(results[metric_name])
        if len(res_list) == 1:
            metric_avg = res_list[0]
            metric_stdev = 0
        else:
            metric_avg = mean(res_list)
            metric_stdev = stdev(res_list)
    
        aggr_results[metric_name + '_mean'] = metric_avg
        aggr_results[metric_name + '_stdev'] = metric_stdev
    return aggr_results
    

    if __name__ == '__main__':
        main()

    2. Error
    Traceback (most recent call last):
     File "run.py", line 289, in <module>
       main()
     File "run.py", line 191, in main
       trainer.save_model()  # Saves the tokenizer too for easy upload
     File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/transformers/trainer.py", line 1608, in save_model
       ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp
    TypeError: 'in <string>' requires string as left operand, not ShardedDDPOption
    Killing subprocess 122966
    Traceback (most recent call last):
     File "/python3.6/runpy.py", line 193, in _run_module_as_main
       "__main__", mod_spec)
     File /lib/python3.6/runpy.py", line 85, in _run_code
       exec(code, run_globals)
     File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
       main()
     File "/home/avimodi/anaconda3/envs/chakanik_transformer/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
       sigkill_handler(signal.SIGTERM, None)  # not coming back
     File "/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
       raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    
    
  • RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

    Jan 23, 2022

    Environment info

    • transformers version: 4.12.5
    • Platform: Linux-5.10.90+-x86_64-with-debian-bullseye-sid
    • Python version: 3.7.12
    • PyTorch version (GPU?): 1.9.1 (True)
    • Tensorflow version (GPU?): 2.6.2 (True)
    • Using GPU in script?: Yes
    • Using distributed or parallel set-up in script?: No

    Information

    I am using Captum to interpret the attributions of the tokens in each layer using Layer Conductance: lc = LayerConductance(squad_pos_forward_func, model.bert.encoder.layer[i])

    Now, the line layer_attributions = lc.attribute(inputs=input_ids, baselines=ref_input_ids, additional_forward_args=(attention_mask,)) raises a RuntimeError.

    A helper function to perform a forward pass of the model and make predictions:

    def squad_pos_forward_func(input_ids, attention_mask=None):
        outputs, attention_weights = model(input_ids=input_ids, attention_mask=attention_mask)
        preds = torch.softmax(outputs, dim=1)[0][1].unsqueeze(0)
        return preds

    Model I am using: "google/muril-base-cased"

    #2952


    Any help would be greatly appreciated.

  • Nan when training LayoutLM_V2 Model

    Jan 23, 2022

    Environment info

    • transformers version: 4.13.0
    • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
    • Python version: 3.7.12
    • PyTorch version (GPU) : 1.10.0+cu111
    • Tensorflow version (GPU): 2.7.0
    • Flax version: not installed
    • Jax version: not installed
    • JaxLib version: not installed

    Who can help

    @NielsRogge

    Information

    The model used is LayoutLMv2:

    The problem arises when using:

    • [x] my own modified scripts:

    The tasks I am working on is:

    • [x] Document stream segmentation: in my script, I try to determine where a new document starts, with the objective of dividing a stream of folders into segments, so that each segment can be interpreted as an independent document.

    To reproduce

    Steps to reproduce the behavior:

    1. Access to the colab notebook created to train LayoutLM_V2 (https://colab.research.google.com/drive/1MsEkj_WlGYDOs3vFcm1JxmMNLWj_Se78?usp=sharing)
    2. Execute every cell in order
    3. In the training loop, the accuracy, loss, and output are printed; at some point the output, accuracy, and loss all become NaN.

    Expected behavior

    The model trains and, regardless of whether it accomplishes its task, the training loop finishes without any NaN values.

  • What to do about this warning message: "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification"

    Jun 30, 2020

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    

    returns this warning message:

    Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
    - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    

    This just started popping up with v3, so I'm not sure what the recommended action is here. Please advise if you can. Basically, any of my code using AutoModelFor<X> now produces this warning.

    Thanks.

  • Sharded DDP training fails with seq2seq models

    Dec 16, 2020

    Information

    Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian

    The problem arises when using:

    • [x] the official example scripts: (give details below)
    • [ ] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [x] an official GLUE/SQUaD task: seq2seq
    • [ ] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    Run

    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
    --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
    ~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
    --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
    --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
    --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
    --n_train 500 --sharded_ddp
    

    will fail with

    Traceback (most recent call last):
    File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
    File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
    File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
    KeyError: Parameter containing:
    tensor([[-0.0296,  0.0038],
    [ 0.0000,  0.0000],
    [ 0.0298,  0.0385],
    ...,
    [-0.0161, -0.0024],
    [ 0.0022, -0.0576],
    [ 0.0053,  0.0256]], device='cuda:1')
    0%|   
    

    Using FP16 also fails.

    Expected behavior

    The script should run to completion.

  • Feature extraction for sequential labelling

    Nov 28, 2018

    Hi, I have a question about using BERT for a sequence labeling task. Please correct me if I'm wrong. My understanding is:

    1. Use BertModel loaded with pretrained weights instead of MaskedBertModel.
    2. In that case, given a sequence of tokens as input, BertModel outputs a list of hidden states; I only use the top-layer hidden states as the embedding for that sequence.
    3. Then, to fine-tune the model, add a fully connected linear layer and a softmax to make the final decision.

    Is this entire process correct? I followed this procedure but could not get any results.
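    For reference, a minimal sketch of the process described in the list above (using the modern transformers API rather than the 2018 one; names and the number of labels are illustrative):

    from torch import nn
    from transformers import BertModel, BertTokenizerFast

    class BertTagger(nn.Module):
        def __init__(self, num_labels: int):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-cased")
            self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask=None):
            # Top-layer hidden states, one vector per token.
            hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            return self.classifier(hidden)  # (batch, seq_len, num_labels) logits

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertTagger(num_labels=9)
    enc = tokenizer("HuggingFace is based in NYC", return_tensors="pt")
    logits = model(enc.input_ids, attention_mask=enc.attention_mask)
    # Fine-tune with nn.CrossEntropyLoss over logits.view(-1, num_labels) and per-token labels.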

    Thank you!

    Discussion wontfix 
  • Pegasus finetuning: OOM

    Aug 25, 2020

    Epoch 0: 91% 5747/6331 [39:52<04:03, 2.40it/s, loss=75.765, v_num=2]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler. warnings.warn(SAVE_STATE_WARNING, UserWarning) tcmalloc: large alloc 1083260928 bytes == 0x1aece0000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1354080256 bytes == 0x21e5c000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1692606464 bytes == 0x7f10651ce000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2115764224 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2644705280 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 3305881600 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 4132356096 bytes == 0x7f0e530f2000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 5165449216 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 ./finetune_pegasus_xsum.sh: line 15: 876 Killed

    I appreciate any help. Thank you.

    Reply
  • How to use fine-tuned BART for prediction?
    How to use fine-tuned BART for prediction?

    Apr 18, 2020

    ❓ Questions & Help

    Details

    I fine-tuned the BART model on a custom summarization dataset using the transformers/examples/summarization/bart/finetune.py and transformers/examples/summarization/bart/run_train.sh files in the repository for training (which generated three checkpointepoch=*.ckpt files) and prediction (which generated a .txt file with the test loss scores).

    I have two questions on using this model for prediction:

    • How can I modify finetune.py to generate predictions for the test set, in addition to the loss scores? I see some test functions in finetune.py, but I'm not sure how to use these for generating a .txt file with the predictions.

    • How can I load the generated .ckpt files into BartForConditionalGeneration()? A config.json file was not generated along with the checkpoint files; there doesn't seem to be a TFBartForConditionalGeneration; and the convert_tf_checkpoint_to_pytorch.py script in the repo doesn't seem to support BART yet.

    Thank you for your time!

    Discussion wontfix 
  • How to use fine-tuned BART for prediction?

    Apr 5, 2021

    Environment info

    • transformers version: 4.5.0.dev0
    • Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
    • Python version: 3.8.5
    • PyTorch version (GPU?): 1.8.0+cu111
    • Tensorflow version (GPU?): N/A
    • Using GPU in script?: Yes
    • Using distributed or parallel set-up in script?: No

    Who can help

    @stas00

    Models:

    • GPT-Neo 1.3b

    Library:

    • deepspeed: @stas00

    Information

    Model I am using (Bert, XLNet ...):

    The problem arises when using:

    • [ ] the official example scripts: (give details below)
    • [x] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [ ] an official GLUE/SQUaD task: (give the name)
    • [x] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    1. Use GPT-Neo 1.3b with The Pile dataset and the built-in Trainer. Artificial data also suffices. It does not matter what the data is, as long as the attention mask spans all 2048 tokens.
    2. Enable FP16 and set max_length to 2048
    3. Observe that all losses reported are NaN

    This is also reproducible using AMP or DeepSpeed. There seems to be code intended to circumvent this in the GPT-Neo implementation, where q, k, and v are cast to fp32 in the attention block.

    When the max_length is shorter (512) this overflow does not occur.
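    For context, a simplified, assumed sketch of that kind of workaround (not GPT-Neo's actual code): compute the attention scores and softmax in float32, then cast back to the working dtype.

    import torch

    def attention_in_fp32(q, k, v, mask=None):
        dtype = q.dtype
        q, k, v = q.float(), k.float(), v.float()
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(~mask, torch.finfo(torch.float32).min)
        probs = torch.softmax(scores, dim=-1)
        return torch.matmul(probs, v).to(dtype)  # back to fp16/bf16 for the rest of the block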

    Expected behavior

    I expected no overflows.

    Aside

    I'm reaching out on behalf of EleutherAI, Lysandre told us to create an issue about this.

  • [DeepSpeed] [success] trained t5-11b on 1x 40GB gpu

    Feb 4, 2021

    Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)

    Thank you, @PeterAJansen for letting me use your hardware!

    Thank you, @jeffra and @samyam, for believing that it is possible to train t5-11b on 1x 40GB gpu w/ Deepspeed, and for the support that led me to find a few bugs in the integration.

    Sharing details for those who need.

    If you want to try this at home, please make sure you use transformers master, as some bug fixes were just merged in.

    Well, it's similar to the t5-3b on 24GB success reported here and here. But this time t5-11b on 1x 40GB gpu (or 4x if you wanted things faster)

    As someone asked me before you need a huge amount of general RAM to use ZeRO-Offload for a huge model:

    • for t5-3b on 1x 24GB gpu: ~71GB RAM
    • for t5-11b on 1x 40GB gpu: ~234GB RAM

    I was using the /usr/bin/time -v program to get the peak memory measurement - it's the Maximum resident set size entry in the final report.

    Question: I don't think /usr/bin/time does the right thing for multi-process runs - I think it only measures the parent process. E.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes, I'm all ears.

    Batch sizes on one gpu:

    • with buffers of 5e8 I was able to run BS=2, which might be too small for training,
    • but with 2e8 I managed to squeeze in BS=10 for training, but OOMed on prediction

    I'm referring to these buffer sizes in ds_config.json:

            "allgather_bucket_size": 2e8,
            "reduce_bucket_size": 2e8,
    

    And I tested for 2x and 4x DDP as well, BS=16 OOMed, BS=8 was good so I used that - but could probably squeeze some more.
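
    For reference, a minimal ds_config.json along those lines might look like the following (an assumed sketch using DeepSpeed's ZeRO-2 + CPU-offload options of that era, written out from Python for convenience):

    import json

    ds_config = {
        "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 16},
        "zero_optimization": {
            "stage": 2,
            "cpu_offload": True,             # ZeRO-Offload: optimizer state lives in CPU RAM
            "allgather_bucket_size": 2e8,    # the buffer sizes discussed above
            "reduce_bucket_size": 2e8,
        },
    }

    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=4)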

    edit1: later tests show that my earlier test was too short and wasn't letting the CPU Adam optimizer kick in, as it skips the first 20 or so steps because of the overflow. Once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.

    edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent, home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location or simply copy it over to where the old one used to be.

    here is the full benchmark:

    # 1 gpu: 
    # only training fits with this BS, eval needs a smaller BS
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 31.0897, 'train_samples_per_second': 0.257, 'epoch': 1.0}
    
    # 2 gpus:
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=2 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 17.9026, 'train_samples_per_second': 0.223, 'epoch': 1.0}
    
    # 4 gpus
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 10.4404, 'train_samples_per_second': 0.192, 'epoch': 1.0}
    

    Checkpointing should allow making even bigger batch sizes.

    DeepSpeed 
  • GPT-J-6B

    Aug 6, 2021

    What does this PR do?

    Introduces the long awaited GPT J model class to HuggingFace! Concurrently with this PR being merged I will make a GPT J 6B checkpoint public on the EleutherAI HF page for people to use. The model has been evaluated as being within error tolerances of the GPT J 6B model we released in Jax two months ago.

    @patil-suraj was very helpful in assisting me to understand HF philosophy and how to make this PR most in line with the rest of the codebase. Other than that, the major design consideration was to make the configs compatible with GPT-2 rather than GPT-Neo. GPT-Neo has some usability limitations due to its configs having names unrelated to GPT-2's (see #12183 for details). Given those problems and my hope that GPT-Neo will have its configs updated in the future, it seemed like a clear choice to align GPT J with GPT-2.

    Shout-outs to @finetuneanon, whose implementation this one is based on, as well as @kumuruz for assistance with optimizing and debugging.

    Supersedes #12243 #13010 #13022

    Closes #12098

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [X] Did you read the contributor guideline, Pull Request section?
    • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. It was discussed in Slack with @patil-suraj
    • [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [X] Did you write any new necessary tests?

    Who can review?

    • gpt2: @patrickvonplaten, @LysandreJik, @patil-suraj
  • Summarization Fine Tuning

    May 16, 2020

    ❓ Questions & Help

    Details

    I tried using T5 and BART, but abstractive summarization on scientific texts does not seem to give the results I want, since I think they are both trained on news corpora. I have scraped all of the free PMC articles and am thinking about fine-tuning a seq2seq model between the articles and their abstracts to make an abstractive summarizer for scientific texts. This Medium article (https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8) provides a bit of an introduction to how to approach this but does not go into much detail, so I am wondering how to approach the problem.

    I'm not really asking for help being stuck but I just don't really know how to approach this problem.

    A link to original question on Stack Overflow: https://stackoverflow.com/questions/61826443/train-custom-seq2seq-transformers-model
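
    For what it's worth, a minimal sketch of the fine-tuning setup described above, using today's Seq2SeqTrainer API (which post-dates this question; field names and hyperparameters are placeholders, not recommendations):

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    # Your (article, abstract) pairs scraped from PMC.
    raw = Dataset.from_dict({"article": ["full text ..."], "abstract": ["summary ..."]})

    def preprocess(batch):
        model_inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(batch["abstract"], max_length=256, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized = raw.map(preprocess, batched=True, remove_columns=["article", "abstract"])

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="pmc-summarizer", num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        tokenizer=tokenizer,
    )
    trainer.train()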

    Discussion wontfix 
  • How to use BERT for finding similar sentences or similar news?

    Jul 23, 2019

    I have used BERT NextSentencePredictor to find similar sentences or similar news; however, it's super slow, even on a Tesla V100, which is the fastest GPU available right now. It takes around 10 seconds for a query title against around 3,000 articles. Is there a better way to use BERT for finding similar sentences or similar news, given a corpus of news articles?
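    For context, a commonly used alternative is to pre-compute one embedding per article once and compare embeddings with cosine similarity, rather than running a BERT forward pass per (query, article) pair; a minimal, assumed sketch with mean pooling:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    @torch.no_grad()
    def embed(texts):
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state              # (batch, seq, dim)
        mask = enc.attention_mask.unsqueeze(-1)              # ignore padding tokens
        emb = (hidden * mask).sum(1) / mask.sum(1)           # mean pooling
        return torch.nn.functional.normalize(emb, dim=-1)

    corpus_emb = embed(["article one ...", "article two ..."])  # computed once, offline
    query_emb = embed(["query title"])
    scores = query_emb @ corpus_emb.T                           # cosine similarities
    best = scores.argmax(dim=-1)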
