Pretraining of a Swiss Long Legal BERT Model

We will scrape legal text in German, French, and Italian to pretrain a Swiss Long Legal BERT model that performs better on NLP tasks in the Swiss legal domain.

Factsheet

  • Lead school Business School
  • Institute Institute for Public Sector Transformation
  • Research unit Digital Sustainability Lab
  • Funding organisation Others
  • Duration (planned) 15.12.2021 - 31.07.2022
  • Project management Prof. Dr. Matthias Stürmer
  • Head of project Joël Niklaus
  • Partner Schweizerisches Bundesgericht

Situation

We see a clear research gap: BERT models capable of handling long multilingual text are currently underexplored (gap 1). Additionally, to the best of our knowledge, no multilingual legal BERT model is available yet (gap 2). Tay et al. [2020b] present a benchmark for evaluating BERT-like models on long inputs and conclude preliminarily that BigBird [Zaheer et al., 2020] is currently the best-performing variant.

Course of action

We thus propose to pretrain a BERT-like model (likely BigBird) on multilingual long text to fill the first research gap. To fill the second gap, we propose to further pretrain this model on multilingual legal text, following the domain-adaptive pretraining approach of Gururangan et al. [2020]. A sketch of this further pretraining step is given below.
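
The following is a minimal sketch of what the further pretraining step could look like, assuming the Hugging Face transformers and datasets libraries. The English BigBird checkpoint google/bigbird-roberta-base stands in for the multilingual long-input model the project would produce, the corpus file swiss_legal_corpus.txt is hypothetical, and the hyperparameters are illustrative rather than the project's actual settings.

```python
# Sketch of domain-adaptive further pretraining (masked language modeling)
# on long legal text, in the spirit of Gururangan et al. [2020].
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; the project would start from a multilingual long-input model.
model_name = "google/bigbird-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdForMaskedLM.from_pretrained(model_name)

# Hypothetical raw legal corpus, one document per line.
raw = load_dataset("text", data_files={"train": "swiss_legal_corpus.txt"})

def tokenize(batch):
    # BigBird's sparse attention allows sequences of up to 4096 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard masked language modeling objective with 15% token masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="swiss-legal-bigbird",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=1e-4,
    num_train_epochs=1,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint could then be fine-tuned and evaluated on downstream Swiss legal NLP tasks in German, French, and Italian.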