Pretraining of a Swiss Long Legal BERT Model

We will scrape legal text in German, French and Italian to pretrain a Swiss Long Legal BERT model that performs better on NLP tasks in the Swiss legal domain.

Factsheet

Schools involved: Business School
Institute(s): Institute for Public Sector Transformation
Research unit(s): Digital Sustainability Lab
Funding organisation: Others
Duration (planned): 15.12.2021 - 31.12.2022
Head of project: Joel Niklaus
Project staff: Alperen Bektas, Veton Matoshi
Partner: Schweizerisches Bundesgericht

Situation

We see two clear research gaps. First, BERT models capable of handling long multilingual text are currently underexplored (gap 1). Second, to the best of our knowledge, no multilingual legal BERT model is available yet (gap 2). Tay et al. (2020b) present a benchmark for evaluating BERT-like models capable of handling long input and preliminarily conclude that BigBird (Zaheer et al., 2020) is currently the best-performing variant.

Course of action

We therefore propose to pretrain a BERT-like model (likely BigBird) on multilingual long text to fill the first research gap. To fill the second gap, we propose to further pretrain this model on multilingual legal text (Gururangan et al., 2020).
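
The two pretraining stages can be illustrated with a minimal sketch. It assumes the standard BERT masked-language-modeling objective (mask about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged), which the project description does not spell out. The function `mask_tokens`, the `STAGES` list and all corpus names are hypothetical placeholders, not the project's actual code or data.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-language-modeling example (illustrative only).

    Returns (inputs, labels). labels[i] is the original token at positions
    the model must predict and None at all other positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)          # this position is a prediction target
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)       # 80%: replace with the mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(token)      # 10%: keep the original token
        else:
            inputs.append(token)          # untouched position, no target
            labels.append(None)
    return inputs, labels


# Hypothetical schedule: the same objective is applied in both stages,
# only the corpus changes. Corpus names are placeholders.
STAGES = [
    ("stage 1: multilingual long text", ["general_de", "general_fr", "general_it"]),
    ("stage 2: multilingual legal text", ["legal_de", "legal_fr", "legal_it"]),
]

if __name__ == "__main__":
    sentence = "Das Bundesgericht weist die Beschwerde ab".split()
    inputs, labels = mask_tokens(sentence, vocab=sentence, mask_prob=0.5, seed=1)
    print(inputs)
    print(labels)
    for name, corpora in STAGES:
        print(name, corpora)
```

In this sketch, stage 2 would start from the weights produced by stage 1 and continue training on legal text only, which is the sense in which the model is "further" pretrained.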