Language Technology Research Group
Research area

The predecessor of the Language Technology Research Group, the Corpus Linguistics Department, was established in 1997 as a formal recognition of several years of ongoing research and development work in the field of language technology. Since then, the research group has accumulated extensive experience in several areas: building linguistic resources, developing language technology tools, and, more recently, training large language models (LLMs).

The significant scientific paradigm shifts of the 2010s profoundly impacted the activity of our research group. By focusing on the most influential international advancements, we developed Hungarian versions of neural language models initially designed for English. Starting with static word embeddings, our scope has continuously expanded to include numerous transformer-based and generative contextual language models for Hungarian. Notably, we have developed HILBERT (a BERT-Large language model), PULI-GPT-3SX (the Hungarian version of GPT-3), and most recently, PULI LlumiX 32K (a Hungarian fine-tuned Llama-2 model). One of our most important recent initiatives is developing instruction-following models, resulting in the creation of the ParancsPULI and PULI LlumiX 32K Instruct models. Specific applications related to these language models can be tested on our demo page.

The development of high-quality LLMs requires multi-faceted test datasets in Hungarian, offering comprehensive information on the models’ accuracy. Therefore, the creation of Hungarian test datasets, so-called benchmark corpora, is a key focus of our research. These datasets, integrated into a web service, simplify the intricate evaluation of neural network-based technologies and enable easy comparison and publication of results. To this end, we’ve developed the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU), modeled after the infrastructure of the GLUE and SuperGLUE test databases for English. Additionally, we’re currently in the process of developing benchmark datasets tailored for generative language models.

In recent years it has become clear that a vast amount of linguistic data is essential for LLMs to grasp the fundamental patterns of language. Consequently, data is becoming increasingly valuable in the digital realm, as it empowers machine learning algorithms to learn, predict, and make informed decisions. A balanced corpus, encompassing a diverse array of linguistic phenomena, equips language models to comprehend texts across different subjects and styles. Thus, the quantity and quality of accessible linguistic data directly influence the effectiveness and adaptability of LLMs.

The Language Technology Research Group has nearly two decades of experience in constructing corpora. The first major textual database for Hungarian, the Hungarian National Corpus (MNSZ), was finalized in 2005. Consisting of 187.6 million words, MNSZ includes varieties of Hungarian from beyond the borders. An enhanced version of the Hungarian National Corpus, MNSZ2, was released in 2014. MNSZ2 not only contains about eight times the amount of text (1.5 billion words) but also covers new and important text types, such as social media. Furthermore, the quality of linguistic analysis has significantly improved compared to its predecessor.

The importance of large amounts of high-quality data motivates our ongoing corpus construction efforts: as part of the Science for the Hungarian Language National Program (Tudomány a magyar nyelvért nemzeti program), our goal is to create MNSZ3, the extended version of MNSZ2, which will include 10 billion words while preserving the variety of genres and dialects in MNSZ2.

Another key objective is to collect textual data directly for pretraining LLMs. To this end, the Hungarian-language textual content of Common Crawl is downloaded and preprocessed. Common Crawl is a nonprofit organization that provides access to large amounts of textual content by regularly crawling websites and making the data available via Amazon Web Services.
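As an illustration only, the kind of preprocessing applied to crawled text can be sketched as follows. This is a minimal sketch under stated assumptions, not the group's actual pipeline: the stopword list, the `min_ratio` threshold, and the function names `looks_hungarian`, `normalize`, and `preprocess` are all hypothetical, and a production system would rely on a trained language identifier and near-duplicate detection rather than these simple heuristics.

```python
import hashlib
import re

# Hypothetical, deliberately tiny stopword list used as a crude language signal.
HU_STOPWORDS = {"a", "az", "és", "hogy", "nem", "is", "egy", "van", "de", "meg"}


def looks_hungarian(text, min_ratio=0.15):
    """Return True if enough tokens are common Hungarian function words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in HU_STOPWORDS)
    return hits / len(tokens) >= min_ratio


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def preprocess(docs):
    """Keep documents that look Hungarian and drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        clean = normalize(doc)
        if not looks_hungarian(clean):
            continue
        digest = hashlib.sha1(clean.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(clean)
    return kept


if __name__ == "__main__":
    sample = [
        "Ez egy  példa, és nem is az utolsó.",
        "This is an English sentence about the weather.",
        "Ez egy példa, és nem is az utolsó.",
    ]
    for line in preprocess(sample):
        print(line)
```

Hashing the normalized text only catches exact repeats; boilerplate-heavy web pages usually also need near-duplicate filtering, which is outside the scope of this sketch.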

However, we also focus on more normative, curated texts. To this end, in collaboration with the Library and Information Centre of the Hungarian Academy of Sciences, and also as part of the Science for the Hungarian Language National Program, the textual content of the REAL repository is being processed with NLP tools. Our main objective is to make a massive volume of scientific publications in PDF format more searchable by processing the content of the PDF files and providing automatically extracted metadata, such as authors, affiliations, named entities, and terminology. We hope that processing the REAL repository’s content will assist not only researchers in various fields who use the collection but also any interested reader.

Over the years, our research group has developed numerous tools. One of the most significant is the Spelling Advisory Portal (helyesiras.mta.hu), created to automatically assist with the normative spelling of Hungarian. Supported by the Hungarian Academy of Sciences, the portal was launched in 2013. While it was cutting-edge at the time, it has since become outdated and requires renovation both in terms of its software platform and user experience. This renovation work is currently underway.

Another important tool, developed in collaboration with numerous partner institutions, is the e-magyar Digital Language Processing Toolchain and its enhanced, modularized successor, emtsv, which enable comprehensive analysis of natural language texts in Hungarian.

The research group also contributed to HuWordNet, the Hungarian version of the Princeton WordNet lexical database. HuWordNet, the result of three years of work, maps the Hungarian vocabulary onto a hierarchical structure according to the meaning of lexical items. First, words are organized into synonym sets, then the synonym sets are ordered based on various semantic relations.
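The two-step organization described above (words grouped into synonym sets, synonym sets linked by semantic relations) can be illustrated with a small data structure. This is a sketch under assumptions, not the HuWordNet format: the `Synset` class, the identifiers such as `kutya.n`, and the `hypernym_chain` helper are hypothetical, and only one relation type (hypernymy) is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous lemmas plus links to more general synsets."""
    ident: str                      # hypothetical identifier, e.g. "kutya.n"
    lemmas: list
    hypernyms: list = field(default_factory=list)


def hypernym_chain(synsets, start):
    """Follow the first hypernym link upward until the top of the hierarchy."""
    chain = []
    visited = set()
    current = start
    while current in synsets and current not in visited:
        visited.add(current)        # guards against accidental cycles
        chain.append(current)
        parents = synsets[current].hypernyms
        if not parents:
            break
        current = parents[0]
    return chain


# Toy example: "kutya" and "eb" are synonyms for "dog".
synsets = {
    "kutya.n": Synset("kutya.n", ["kutya", "eb"], ["emlős.n"]),
    "emlős.n": Synset("emlős.n", ["emlős"], ["állat.n"]),
    "állat.n": Synset("állat.n", ["állat"]),
}

if __name__ == "__main__":
    print(" > ".join(hypernym_chain(synsets, "kutya.n")))
```

Keeping lemmas inside the synset rather than as separate nodes mirrors the idea that relations hold between meanings, not between individual words.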

The research group was also involved in machine translation. Our basic objective was to extend the transformer-based machine translation system created for the English-Hungarian language pair in a multilingual direction, enabling translation not only between two languages but from multiple input languages to one or more target languages. Improving translation quality in existing systems was also among our top priorities, especially with Hungarian as the target language.

Research Group Leader:

Enikő Héja, PhD
Email: urwn.ravxb@alghq.uha-era.uh
Phone: +36 (1) 3429372 / 6043
Current international project proposals | Start – end
Alliance for Language Technologies European Digital Infrastructure Consortium | 2024.05.27. –
Current national project proposals | Start – end
Supporting the digital sustainability of the Hungarian language | 2020.12.01. – 2026.11.30.
Digital support for the Hungarian language in the service of Hungarian science | 2020.12.01. – 2026.11.30.
Major closed international project proposals | Start – end
CURLICAT: Curated Multilingual Language Resources for CEF.AT | 2020.06.01. – 2022.11.30.
MARCELL: Multilingual Resources for CEF.AT in the Legal Domain | 2018.10.01. – 2021.03.31.
Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams (TrendMiner) | 2013 – 2014
Innovative Networking in Infrastructure for Endangered Languages (INNET) | 2011 – 2013
European Media Monitor – Hungarian module | 2012
Central and South-East European Resources (CESAR) | 2011 – 2013
Internet Translators for all European Languages (iTranslate4) | 2010 – 2012



Major closed national project proposals | Start – end
e-magyar.hu: Building an open, integrated Hungarian language technology research infrastructure | 2015.01.01. – 2016.06.30.
Hungarian Generative Diachronic Syntax 2 | 2014 – 2018
helyesiras.mta.hu – Spelling Advisory Portal | 2008 – 2013
Disclosure of BSI-2 | 2008 – 2012
Dictionary of Hungarian Verb Phrase Constructions | 2008 – 2010
Building of the Hungarian WordNet ontology and its applications in information extraction systems (Hungarian WordNet) | 2005 – 2007



*A detailed list of the closed tenders can be found here.

Staff

Ágnes BÁNFI
software developer

Institute for Language Technologies and Applied Linguistics

Alexandra KIS
research assistant

Institute for Language Technologies and Applied Linguistics

Bence SÁROSSY
junior research fellow

Institute for Language Technologies and Applied Linguistics

Enikő HÉJA
research group leader, research fellow

Institute for Language Technologies and Applied Linguistics

Flóra FÖLDESI
software developer

Institute for Language Technologies and Applied Linguistics

Gábor MADARÁSZ
junior research fellow

Institute for Language Technologies and Applied Linguistics

Gábor PRÓSZÉKY
director general, research professor

Institute for Language Technologies and Applied Linguistics

Gergő FERENCZI
IT director

Institute for Language Technologies and Applied Linguistics

István FEKETE
IT specialist (Linux / Unix supervisor, DevOps architect)

Institute for Language Technologies and Applied Linguistics

Kristóf VARGA
software developer

Institute for Language Technologies and Applied Linguistics

Mariann LENGYEL
junior research fellow

Institute for Language Technologies and Applied Linguistics

Mátyás OSVÁTH
development engineer

Institute for Language Technologies and Applied Linguistics

Noémi LIGETI-NAGY
research fellow

Institute for Language Technologies and Applied Linguistics

Péter HATVANI
software developer

Institute for Language Technologies and Applied Linguistics

Réka DODÉ
junior research fellow

Institute for Language Technologies and Applied Linguistics

Tamás VÁRADI
deputy director-general, director, senior research fellow

Institute for Language Technologies and Applied Linguistics

Zijian Győző YANG
research fellow

Institute for Language Technologies and Applied Linguistics

Zsófia SZANISZLÓ
software developer

Institute for Language Technologies and Applied Linguistics

Research

Building a data infrastructure by correcting OCR errors in curated texts

The production of language models requires a corpus of billions of words, the most obvious source of which is the Internet. However, most of the texts available here are of uncertain origin and quality, often with little metadata. As part of the cooperation with the Arcanum Database Publisher, we have a collection of curated texts of approximately nine billion words at our disposal. This collection is the result of the publisher’s many years of OCR scanning (Optical Character Recognition). Yet, ...

Building and publishing benchmark corpora

One of the prerequisites for following cutting-edge NLP is the standardized measurement of development results in the Hungarian language. This requires a whole series of test databases, so-called benchmark corpora, created according to a strict methodology, which serve as a reference for measuring the level of development of new technologies and devices. However, benchmark databases serve more than just the purpose of comparing the performance of different language models. Their important new rol ...

Development of language-centered artificial intelligence (language models)

The neural language models that have become dominant over the last decade have brought about a paradigm shift in language technology as a whole. The creation of these general-purpose language models requires extraordinary computing capacity and enormous amounts of data. Our main task is to adapt the world-class language models for the Hungarian language and make them available to the Hungarian language technology sector. The latest type of large-scale language models have already taken a significant step t ...

Contacts

Partner institutions

Alliance for Language Technologies European Digital Infrastructure Consortium

On May 27, 2024, Hungary was elected as a member of the European Digital Infrastructure Consortium Alliance for Language Technologies (ALT-EDIC). The representation of Hungary will be handled by the Research Institute for Linguistics of HUN-REN, commissioned by the Ministry of Culture and Innovation.

European Federation of National Institutions for Language

Tamás Váradi has been the secretary of EFNIL since 2010, and the institute has handled EFNIL's secretarial tasks since then.

European Language Resource Coordination (ELRC)

The European Language Resource Coordination (ELRC) workshop in Hungary was organized by the Research Center. Within this framework, we engage in dialogue with industry and government stakeholders about the state and prospects of Hungarian language technology. Developers and users of language technology share their experiences, needs, and ideas on how language technology solutions can support the digital interactions of a multilingual Europe.

Indamedia Sales Kft.

In an ongoing collaboration between NYTK and Indamedia Sales Kft., NYTK has received and processed the entire content of the news portal index.hu. Active negotiations are underway to expand the collaboration to apply Hungarian-language artificial intelligence in publishing work.

Library and Information Centre of the Hungarian Academy of Sciences

With the help of language technology tools, the material of the MTA Library REAL repository can be made searchable more efficiently than at present. Processing the content of PDF texts is already underway: we are making the content of a massive volume of scientific publications easily searchable by automatically extracting metadata (such as authors, affiliations, named entities, and terminology) from the PDF files.

National Archives of Hungary

In a successful collaboration, NYTK and MNL processed the more than 600,000 personal records of Hungarian prisoners of war deported to Soviet Union camps. In an ongoing joint project, they are processing a database containing approximately 5 million index cards collected for the Comprehensive Dictionary of the Hungarian Language, about 50% of which are handwritten. The optical character recognition of Hungarian handwriting has opened a new dimension in the development of artificial intelligence.

National Széchényi Library

The Research Group provides language technology assistance for processing the materials of the National Széchényi Library (OSZK) in exchange for access to the OSZK's web harvesting and other digital collections.

Telekom System Integration Ltd.

Advisory services for the development of T-COM's artificial intelligence-based applications.