About the project

Aim of the project: building the Corpus of Academic Lithuanian (CorAlit)

The aim of building CorAlit is to accumulate a large database of authentic academic Lithuanian texts. Such a database could be used for objective studies of the quantitative and qualitative parameters of academic discourse. It would make it possible to reveal interdisciplinary peculiarities and the characteristics of genre variety, would give information about existing or possible trends in the development or loss of Lithuanian academic identity, and would provide evidence concerning the factors that influence these processes. In practical terms, the corpus is indispensable for teaching and learning academic writing. The factual material accumulated in the corpus is not created by linguists or by just a few researchers; it represents the true linguistic situation, demonstrates collective intuition and shows real usage.

To ensure that the corpus of academic Lithuanian is representative, it should include texts written in Lithuanian from all the areas, fields and branches of science developed in Lithuania. A detailed analysis of data sources was carried out: the websites of all Lithuanian publishing houses were investigated, and publications were analysed qualitatively and selected according to their research topics and nature. All factual material necessary for building the corpus will be collected in accordance with Order No. 30 of the Ministry of Education and Science, “Concerning the Classification of the Areas, Fields and Branches of Science”, approved on 9 January 1998, and will be classified in accordance with the description provided in the Appendix:

  • H – Humanities (architecture, fine art studies, ethnology, folklore studies, philosophy, linguistics, literary theory, librarianship, history, theology)
  • S – Social sciences (law, political science, economics, psychology, education, management)
  • P – Physical sciences (mathematics, astronomy, physics, chemistry, geography, geology and mineralogy, informatics)
  • B – Biomedical sciences (medicine, dental surgery, biology, botany, agronomy, animal husbandry, pharmacy, veterinary science, forestry studies)
  • T – Technological sciences (energy studies, chemical technology, materials science, mechanics, metrology, building construction, transport technology, agricultural and environmental sciences, management and informatics)

After a detailed analysis of various types of publications and the different academic texts published in them, a list of text types to be included in the corpus was compiled: monograph, textbook, study book; research article, review article, book review, abstract, chronicle, foreword (in a journal or a collection of articles); professional advertisement; research report; course description; study programme description; summary of doctoral dissertation; doctoral dissertation; master's thesis.
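As an illustration only, the five area codes and the list of text types could be recorded as metadata for each text along the following lines. This is a minimal sketch, not the project's actual encoding scheme: the names `SCIENCE_AREAS`, `TEXT_TYPES`, `CorpusText` and `count_by_area` are hypothetical, and the real corpus may store this information differently.

```python
from dataclasses import dataclass

# Area codes as listed in the project description.
SCIENCE_AREAS = {
    "H": "Humanities",
    "S": "Social sciences",
    "P": "Physical sciences",
    "B": "Biomedical sciences",
    "T": "Technological sciences",
}

# Text types named in the project description.
TEXT_TYPES = frozenset({
    "monograph", "textbook", "study book",
    "research article", "review article", "book review",
    "abstract", "chronicle", "foreword",
    "professional advertisement", "research report",
    "course description", "study programme description",
    "summary of doctoral dissertation", "doctoral dissertation",
    "master's thesis",
})


@dataclass(frozen=True)
class CorpusText:
    """Hypothetical metadata record for one text in the corpus."""
    title: str
    area: str           # one of the keys of SCIENCE_AREAS
    science_field: str  # e.g. "linguistics"; not validated in this sketch
    text_type: str      # one of TEXT_TYPES

    def __post_init__(self):
        # Reject records whose codes fall outside the two lists above.
        if self.area not in SCIENCE_AREAS:
            raise ValueError(f"unknown science area code: {self.area!r}")
        if self.text_type not in TEXT_TYPES:
            raise ValueError(f"unknown text type: {self.text_type!r}")


def count_by_area(texts):
    """Count texts per area code, including areas with no texts yet."""
    counts = {code: 0 for code in SCIENCE_AREAS}
    for text in texts:
        counts[text.area] += 1
    return counts
```

A per-area count of this kind is one simple way to check how evenly the collected texts cover the five areas, which bears on the representativeness discussed above.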

Objectives of the project:

  • Creation of a standardised work methodology (definition of criteria, detailed analysis of corpus structure, texts, genres, sources)
  • Adaptation of software, data processing, encoding
  • Morphological and grammatical annotation, lemmatisation
  • Creation of a search system