The French Court Decision Structure dataset — FCD12K

Download the latest version of the dataset WITH titles here!
Download the latest version of the dataset WITHOUT titles here!

Task details

The challenge addressed here is automatically identifying which section of a document each paragraph belongs to. Specifically, given a document that has already been split into paragraphs, we want to classify each paragraph into one of 6 predefined classes (described in the sections below). Typical use cases enabled by this classification are: navigating documents more easily, selecting only specific sections for summarization, or weighting sections differently when indexing the documents in a search engine.

Dataset overview

This dataset is intended for researchers working on natural language processing in French, especially on text classification that relies on sequential information.

The dataset contains French court decisions from the Appeal Court and the High Court. A French court decision is generally structured as follows:

Metadata ('En-tête' in French): court, number, date, etc., of the trial.
Tag: ENTETE
Parties ('Parties' in French): information about the claimants and defendants
Tag: PARTIES
Composition of the court ('Composition de la cour' in French): name of the President of the court, the Clerk, etc.
Tag: COMPOSITION
Facts ('Faits' in French): what happened.
Tag: FAITS
Pleas in law and main arguments ('Moyens' in French): arguments presented by the claimant and defendant.
Tag: FAITS (the Methodology section explains why this tag is the same as the one used for Facts)
Grounds ('Motifs' in French): reasons and arguments used by the court for its final judgment.
Tag: MOTIFS
Operative part of the judgment ('Dispositif' in French): final decision of the court.
Tag: DISPOSITIF
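
As an illustration, the correspondence between sections and tags listed above can be written as a small lookup table. This is only a sketch: the names SECTION_TAGS and CLASSES are hypothetical and are not part of the dataset.

```python
# Hypothetical lookup table: section name -> dataset tag, as listed above.
SECTION_TAGS = {
    "Metadata": "ENTETE",
    "Parties": "PARTIES",
    "Composition of the court": "COMPOSITION",
    "Facts": "FAITS",
    "Pleas in law and main arguments": "FAITS",  # shares the Facts tag
    "Grounds": "MOTIFS",
    "Operative part of the judgment": "DISPOSITIF",
}

# The distinct tags are the classes a model has to predict.
CLASSES = sorted(set(SECTION_TAGS.values()))
```

Seven section types map to six distinct tags, because Facts and Pleas in law share the FAITS tag.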

Dataset specifications

This dataset contains 12k decisions from:

Direction de l'information légale et administrative (DILA), see documentation here
Institut national de la propriété industrielle (INPI)

The decisions have been anonymized by the original sources.

Methodology

The dataset has been built automatically by using decisions that contained explicit section titles. We then removed those titles in order to simulate real cases where sections are to be automatically detected given their contents.

In the example below, where the label associated with each paragraph is visible, we deleted the second line, which explicitly announces the new section.


          I-MOTIFS Les arguments du juge ...

          B-DISPOSITIF Par ces motifs ,

          I-DISPOSITIF Condamne M. X aux dépens d' appel .

Note that paragraphs related to the pleas in law are sometimes mixed with the facts. We have therefore decided to group Facts and Pleas in Law under the same tag.

dataset_with_titles.zip contains the data with explicit titles
dataset_without_titles.zip contains the data without explicit titles
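
The title-removal step described above could look roughly like the sketch below. It assumes that a title paragraph is the one whose label starts with B-, which is inferred from the example in this section and is not confirmed by the dataset authors. The function name remove_title_paragraphs is hypothetical.

```python
def remove_title_paragraphs(paragraphs):
    """Drop paragraphs labelled 'B-...' (ASSUMED to be explicit section titles)."""
    kept = []
    for paragraph in paragraphs:
        parts = paragraph.split(maxsplit=1)
        # Assumption: the first whitespace-separated token is the label.
        if parts and parts[0].startswith("B-"):
            continue  # skip the line that announces a new section
        kept.append(paragraph)
    return kept


# Paragraphs taken from the example above.
with_titles = [
    "I-MOTIFS Les arguments du juge ...",
    "B-DISPOSITIF Par ces motifs ,",
    "I-DISPOSITIF Condamne M. X aux dépens d' appel .",
]
without_titles = remove_title_paragraphs(with_titles)
```

With this input, only the two I- paragraphs remain, which mirrors the difference between the two archives.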

Format

The dataset uses a JSON format. Each court decision corresponds to a row in the JSON file.

Each row contains:

the identifier of the decision: id (string)
the labelled text: labelled_text (string)
each block of the labelled text is one paragraph of the decision
each paragraph block starts with the label of the paragraph, followed by all the tokens of the paragraph

Example:


          I-ENTETE La cour d' appel de Paris

          I-COMPOSITION Madame Dupont , Présidente

          I-COMPOSITION Monsieur Durand , Greffier

          I-PARTIES EDF

          I-FAITS ...
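
A minimal sketch of how one row might be parsed into (label, tokens) pairs is given below. It rests on assumptions that the description above does not confirm: paragraphs inside labelled_text are separated by line breaks, the label is the first whitespace-separated token, and labels have the form PREFIX-TAG. The row content and the function name parse_labelled_text are hypothetical.

```python
import json


def parse_labelled_text(labelled_text):
    """Split a labelled_text string into (label, tokens) pairs, one per paragraph."""
    paragraphs = []
    for line in labelled_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines between paragraphs
        # Assumption: first token is the label, the rest are the paragraph tokens.
        paragraphs.append((parts[0], parts[1:]))
    return paragraphs


# Hypothetical row built from the example above, serialized and read back as JSON.
raw_row = json.dumps({
    "id": "example-1",
    "labelled_text": (
        "I-ENTETE La cour d' appel de Paris\n\n"
        "I-COMPOSITION Madame Dupont , Présidente\n\n"
        "I-PARTIES EDF"
    ),
}, ensure_ascii=False)

row = json.loads(raw_row)
parsed = parse_labelled_text(row["labelled_text"])

# Strip the prefix (assumed form PREFIX-TAG) to get the class of each paragraph.
tags = [label.split("-", 1)[1] for label, _ in parsed]
```

Keeping the paragraphs in order matters here, since the task relies on sequential information across a decision.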

License

This dataset is under the LUDO v1.0 license, available at this address.
Personal data contained in this dataset are handled according to Doctrine's privacy policy.