The challenge we want to address here is automatically identifying what sections of textual documents refer to. Specifically, given a document that consists of pre-cut paragraphs, we want to classify these paragraphs into 6 predefined classes (see the section below for a description of each class). Typical use cases enabled after this classification are: navigating more easily through documents, selecting only specific sections for summarization purposes, or weighting sections differently as we index these documents in a search engine.
This dataset is intended for researchers working on natural language processing in French, especially on text classification that relies on sequential information.
The dataset contains French court decisions from the Appeal Court and the High Court. A French court decision is generally structured as follows:
This dataset contains 12k decisions from:
The decisions have been anonymized by the original sources.
The dataset has been built automatically by using decisions that contained explicit section titles. We then removed those titles in order to simulate real cases where sections are to be automatically detected given their contents.
In the example below, where the label associated with each section is visible, we deleted the second line, which explicitly announces the new section.
I-MOTIFS Les arguments du juge ...
B-DISPOSITIF Par ces motifs ,
I-DISPOSITIF Condamne M. X aux dépens d' appel .
Note that paragraphs related to the pleas in law are sometimes mixed with the facts. We have thus decided to gather Facts and Pleas in Law under the same tag.
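To make the tagging scheme concrete, here is a minimal sketch of how labelled paragraphs could be regrouped into sections. It assumes that the `B-` prefix marks the first paragraph of a section and `I-` a continuation; this is an interpretation of the example above, not something the dataset description states. Because some examples start with `I-` tags only, the sketch also opens a new section whenever the class name changes. The function name `group_sections` is hypothetical.

```python
from typing import List, Tuple

def group_sections(pairs: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (tag, paragraph) pairs into (class, paragraphs) sections.

    Assumption: a tag looks like "B-DISPOSITIF" or "I-DISPOSITIF".
    A new section starts on a "B-" tag or when the class name changes.
    """
    sections: List[Tuple[str, List[str]]] = []
    for tag, paragraph in pairs:
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or not sections or sections[-1][0] != label
        if starts_new:
            sections.append((label, [paragraph]))
        else:
            sections[-1][1].append(paragraph)
    return sections

# Paragraphs taken from the example above.
pairs = [
    ("I-MOTIFS", "Les arguments du juge ..."),
    ("B-DISPOSITIF", "Par ces motifs ,"),
    ("I-DISPOSITIF", "Condamne M. X aux dépens d' appel ."),
]
sections = group_sections(pairs)
```

With this grouping, a use case such as selecting only specific sections for summarization reduces to filtering `sections` on the class name.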
dataset_with_titles.zip contains the data with explicit titles
dataset_without_titles.zip contains the data without explicit titles
The dataset uses a JSON format. Each court decision corresponds to a row in the JSON file.
Each row contains:
id: string
labelled_text: string
Example:
I-ENTETE La cour d' appel de Paris
I-COMPOSITION Madame Dupont , Présidente
I-COMPOSITION Monsieur Durand , Greffier
I-PARTIES EDF
I-FAITS ...
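The row format can be read with a few lines of Python. The sketch below rests on two assumptions that the description does not confirm: that the file stores one JSON object per line, and that `labelled_text` holds one paragraph per line with the tag as the first space-separated token. The helper names `iter_decisions` and `parse_labelled_text` are hypothetical.

```python
import json
from typing import Iterator, List, Tuple

def parse_labelled_text(labelled_text: str) -> List[Tuple[str, str]]:
    """Split a labelled_text string into (tag, paragraph) pairs.

    Assumption: one paragraph per line, tag first, then a space.
    """
    pairs = []
    for line in labelled_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tag, _, paragraph = line.partition(" ")
        pairs.append((tag, paragraph))
    return pairs

def iter_decisions(path: str) -> Iterator[dict]:
    """Yield one decision per row.

    Assumption: the file is in JSON Lines layout (one object per line).
    """
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            if raw.strip():
                yield json.loads(raw)

# Lines taken from the example above.
example = (
    "I-ENTETE La cour d' appel de Paris\n"
    "I-COMPOSITION Madame Dupont , Présidente\n"
    "I-PARTIES EDF"
)
pairs = parse_labelled_text(example)
```

If the file turns out to be a single JSON array rather than one object per line, `json.load` on the whole file would replace `iter_decisions`; the parsing of `labelled_text` stays the same.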
This dataset is under the LUDO v1.0 license, available at this address.
Personal data contained in this dataset are handled according to Doctrine's privacy policy.