Pretraining of a Swiss Long Legal BERT Model
We will scrape legal text in German, French and Italian to pretrain a Swiss Long Legal BERT model that improves performance on NLP tasks in the Swiss legal domain.
- Lead school: Business School
- Institute: Institute for Public Sector Transformation
- Research unit: Digital Sustainability Lab
- Funding organisation: Others
- Duration (planned): 15.12.2021 - 31.12.2022
- Project management: Prof. Dr. Matthias Stürmer
- Head of project: Joël Niklaus
Dr. Alperen Bektas
Adrian Joel Jörg
- Partner: Schweizerisches Bundesgericht
We see a clear research gap: BERT models capable of handling long multilingual text are currently underexplored (gap 1). Additionally, to the best of our knowledge, no multilingual legal BERT model is available yet (gap 2). Tay et al. [2020b] present a benchmark for evaluating BERT-like models capable of handling long input and preliminarily conclude that BigBird [Zaheer et al., 2020] is currently the best-performing variant.
Course of action
We therefore propose to pretrain a BERT-like model (likely BigBird) on long multilingual text to fill the first research gap. To fill the second gap, we propose to further pretrain this model on multilingual legal text, following Gururangan et al. [2020].
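The two-stage plan above can be illustrated with a minimal, standard-library sketch. It assumes the pretraining objective is BERT-style masked language modeling; the 15% masking rate and the 80/10/10 replacement split are the defaults from the original BERT recipe, not values stated in this project description. The names `mask_tokens` and `pretraining_stages` are hypothetical, tokens are plain whitespace-split words rather than subword units, and the example sentence is purely illustrative.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for BERT-style masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            # Position selected for prediction: keep the original as label.
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(token)              # 10%: unchanged
        else:
            corrupted.append(token)
            labels.append(None)
    return corrupted, labels


def pretraining_stages(general_corpus, legal_corpus):
    """Yield (stage, document) pairs: general long text first, legal text second.

    Mirrors the proposed order: pretrain on long multilingual text, then
    continue with the same objective on multilingual legal text.
    """
    for document in general_corpus:
        yield "general", document
    for document in legal_corpus:
        yield "legal", document


if __name__ == "__main__":
    # Illustrative sentence, not taken from any project corpus.
    tokens = "Das Bundesgericht weist die Beschwerde ab".split()
    vocab = sorted(set(tokens))
    corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

In a real setup the same masking objective would be applied in both stages, with only the corpus changing between them; a long-input model such as BigBird would replace the toy token lists with subword sequences of several thousand tokens.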