Pretraining of a Swiss Long Legal BERT Model

We will scrape legal text in German, French and Italian to pretrain a Swiss Long Legal BERT model capable of better performing NLP tasks in the Swiss legal domain.

Factsheet

Situation

We see a clear research gap: BERT models capable of handling long multilingual text are currently underexplored (gap 1). Additionally, to the best of our knowledge, there is no multilingual legal BERT model available yet (gap 2). Tay et al. [2020b] present a benchmark for evaluating BERT-like models capable of handling long input and conclude preliminarily that BigBird [Zaheer et al., 2020] is currently the best-performing variant.

Course of action

We thus propose to pretrain a BERT-like model (likely BigBird) on multilingual long text to fill the first research gap. To fill the second gap, we propose to further pretrain this model [Gururangan et al., 2020] on multilingual legal text, as sketched below.
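
The following is a minimal sketch of the second stage (domain-adaptive further pretraining) using the Hugging Face Transformers library. The starting checkpoint, corpus file name and hyperparameters shown here are illustrative assumptions, not final project choices.

```python
# Sketch: further pretraining of a long-input BigBird model on Swiss legal text
# with masked language modelling. Checkpoint, corpus path and hyperparameters
# are placeholders/assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "google/bigbird-roberta-base"  # assumed general-domain starting point
MAX_LEN = 4096                              # BigBird's long input window

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BigBirdForMaskedLM.from_pretrained(MODEL_NAME)

# Hypothetical scraped corpus: one document per line, German/French/Italian mixed.
corpus = load_dataset("text", data_files={"train": "swiss_legal_de_fr_it.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="swiss-long-legal-bert",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        max_steps=100_000,
        save_steps=10_000,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The same script, pointed at a large general multilingual long-text corpus instead of the legal one, would cover the first pretraining stage.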

This project contributes to the following SDGs

  • 9: Industry, innovation and infrastructure
  • 16: Peace, justice and strong institutions