On Very Large Corpora of French

Etienne Brunet

Abstract

For French, it would be natural to turn to the French National Library (BNF), which holds 14 million documents, including 11 million books at the Tolbiac site. This would be comparable to what Google Books offers, if access were similarly electronic. Unfortunately, the number of documents accessible on the Internet, mainly in the Gallica database, falls far short of that figure. In reality, the most reliable texts in Gallica, aside from newer ones supplied by publishers in digital form, are those inherited from Frantext. These owe nothing to scanning, which Ray Kurzweil invented in 1974, after the initial capture had been carried out by keyboard operators on perforated tape. This manual input, duly revised and corrected over fifty years, has survived every change of system or storage medium. To the reliability of its texts, even when they come from older editions, Frantext adds many other virtues: a balance between eras, which allows comparisons and provides a solid basis for analysing the evolution of the language; a wide chronological span covering five centuries of publication; a deliberate homogeneity of texts, chosen according to specific criteria of genre and language level; consistency in the services offered to the scientific community, the same software having been kept unchanged on the Internet for twenty years; and moderate growth with controlled enrichment of the data, ensuring compatibility with earlier processing. The Frantext catalogue is now expanding to include more recent production: it currently has 4000 references and 270 million words. The BNF is ten times larger; Google Books is a thousand times larger, and its pace of growth is much faster. But other institutional corpora (we study Encarta, Wikipedia and some others) are like huge tanks that dispense their content word by word, as a dictionary would. They can only be consulted one lookup at a time.
They do not allow any statistical overview or overall analysis, as can be seen from three gigantic corpora of the French language built respectively in Germany (Wortschatz), in the UK (Sketchengine) and in the USA (Google Books). Wortschatz was built at the University of Leipzig (with collaborators from the University of Neuchâtel). It is a corpus of the French language with 700 million words and 36 million sentences drawn from newspapers (19 million), the web (11 million) and Wikipedia (6 million). Sketchengine is a UK-based website which offers a corpus of the French language alongside corpora of other languages. Like many web-based corpora, Sketchengine harvests the web in order to build a large, representative corpus of a language rather than corpora targeted at analyzing lexical innovations. Culturomics (or Google Books) is the biggest corpus of the French language, with a size 100 times greater than that of Sketchengine (89 billion words in 2012). The huge size of these corpora may inspire enthusiasm, but doubt remains as to the validity of the statistical results. The doubt grows all the more because the composition of the corpora remains a “black box”. If the choices underlying the construction of the corpus under scrutiny are unknown, the size of the data does not prevent the results from being very difficult to interpret.
