OWS.EU Partner in Focus: CERN

CERN, the European Organization for Nuclear Research, is one of the world’s largest and most respected centres for scientific research. Its business is fundamental physics, finding out what the Universe is made of and how it works. Within the OpenWebSearch.EU (OWS) project CERN plays a crucial role not only with regards to supercomputing infrastructure, but also via its contributions to ethical and legal assessments, as well as project management and communications support. 

The CERN project team is led by IT Solutions Architect Andreas Wagner and complemented by Noor Afshan Fathima, Data Infrastructure Engineer, with whom we spoke about the project's progress so far.

Noor Afshan Fathima, CERN, Section: IT-PW-WA, Research Fellow

Please describe your organisation’s tasks in the project. What is your field of expertise that you bring to the project?

CERN contributes across six work packages – WP1 (Crawlers), WP4 (Search Applications), WP5 (Federated Data Infrastructures), WP6 (Ethical, Legal, and Societal Aspects), WP7 (Dissemination and Communication), and WP8 (Project Management) – bringing expertise in infrastructure engineering, the development of science search applications, ELSA considerations, and governance of the federated open search infrastructure.

In WP1, we developed two purpose-built authenticated web crawlers for CERN’s internal web estate: cern-owler (Java/Playwright for HTML, producing 125 WARC archives totalling 3.1 GB) and owler_auth_pdf (Python/Tika for PDFs, extracting 2,211 documents from 90+ domains). Together they delivered 3.3 GB of content across 287 files to AccGPT (Accelerator GPT), CERN’s experimental AI-powered chatbot for the accelerator complex, covering 25,292 seed URLs across 182 CERN domains. We also participate in project-level coordination and contribute to the governance of the federated open search infrastructure, drawing on CERN’s institutional experience in managing large-scale, multi-partner scientific collaborations.
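The crawlers above package fetched pages as WARC archives, the standard container format for web captures. For readers unfamiliar with the format, a record can be written with the Python standard library alone – an illustration of the record layout, not the actual cern-owler implementation (which is Java/Playwright-based; production crawlers typically use a library such as warcio rather than hand-rolling records):

```python
import io
import uuid
from datetime import datetime, timezone

def write_warc_response(stream, target_uri, http_payload):
    """Append one WARC/1.0 'response' record to a binary stream."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    stream.write(b"WARC/1.0\r\n")
    for name, value in headers:
        stream.write(f"{name}: {value}\r\n".encode("utf-8"))
    stream.write(b"\r\n")        # blank line ends the record headers
    stream.write(http_payload)   # the captured HTTP response, verbatim
    stream.write(b"\r\n\r\n")    # record separator required by the spec

buf = io.BytesIO()
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>ok</html>"
write_warc_response(buf, "https://home.cern/", payload)
record = buf.getvalue()
```

Thousands of such records, usually gzip-compressed, make up the .warc.gz files delivered to the preprocessing pipeline.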

In WP4, we developed two complementary POC search applications demonstrating MOSAIC’s flexibility. The first is an institutional search engine built from custom-crawled WARC archives of CERN’s public web content, fed through the full OWS preprocessing pipeline (resilipipe, open-web-indexer, lucene-ciff) into MOSAIC, indexing 4,352 documents from 6,738 crawled pages. The second is Nooon, a vertical search engine for disability-related knowledge, built from a 2.59-million-document OWI (Open Web Index) slice extracted via the command line interface OWILIX. Nooon is designed to support HR and Diversity & Inclusion offices in fair hiring and inclusive policy development.

In WP5, we operate and document the production server fleet that underpins the Open Web Index at CERN. This includes the URL Frontier coordination service – where we drove the migration from OpenSearch to ScyllaDB to handle 94.7 million operations per day across 6.68 billion URL records – the web crawling infrastructure processing up to 3 TB of content per day, an iRODS data federation spanning five European data centres (CERN, LRZ, DLR, IT4I, CSC Finland), load balancers, metrics collection, and the application hosting servers. Our systematic documentation methodology, developed specifically for this project, covers discovery, deep-dive analysis, checklist completion, and academic chapter creation for each server.

In WP6, we contribute to the ethical, legal, and societal dimensions of the project. This includes work on ELSA (Ethical, Legal, and Societal Aspects) as they apply to open web search — particularly around privacy-preserving information retrieval for vulnerable populations, knowledge sovereignty, and the responsible handling of disability-related data. Our OSSYM 2025 publication and CERN preprint on empirical ethics in disability information retrieval directly address these concerns. We also contributed to the governance of the federated data infrastructure.

In WP7 (Dissemination and Communication), we have contributed to raising the visibility of the OpenWebSearch.EU project through major CERN communication channels. Three feature articles were published on home.cern and in the CERN Courier: “A European project to make web search more open and ethical” and “Ethical, open and non-commercial: Open Web Search project designed to provide Europe with an alternative” on the CERN news site, and “Towards an unbiased digital world” in the CERN Courier. These articles reached CERN’s global audience of researchers, engineers, and policy-makers, highlighting both the technical infrastructure and the ethical dimensions of building a European open web index. Beyond written dissemination, we have presented the project at multiple international venues including OSSYM 2024 and 2025, CS3 2025, EGI 2024, and the Cambridge Forum on AI, contributing to community building around open search infrastructure.

How is the project progressing? Which major milestones did you achieve?

The project is progressing well, with all CERN-side deliverables on track. Our major milestones include:

URL Frontier evolution: We completed the migration of the URL coordination service from OpenSearch to ScyllaDB, resolving critical performance bottlenecks caused by JVM garbage collection pauses and write-heavy workloads (99.88% writes). The production ScyllaDB deployment now handles 24.3 billion total operations with zero failures and continuous uptime, storing 5.07 TB across the crawl state database.

Authenticated crawling and AccGPT delivery: We developed two purpose-built crawlers — cern-owler (Java/Playwright for HTML) and owler_auth_pdf (Python/Tika for PDFs) — capable of navigating CERN’s Keycloak SSO. Together they delivered 287 files totalling 3.3 GB to AccGPT’s S3-based knowledge base, covering 25,292 seed URLs across 182 CERN domains.

Search application deployments: The institutional search engine indexes 4,352 documents from 6,738 crawled pages through the complete OWS pipeline. Nooon serves 2.59 million disability-focused documents through MOSAIC, demonstrating the OWI-to-vertical-search workflow at scale.

iRODS data federation: We established the CERN node in a five-site iRODS federation (CERN ↔ LRZ ↔ DLR ↔ IT4I ↔ CSC Finland), which will in future enable cross-institutional data sharing for the Open Web Index across three European countries.

Infrastructure documentation: We completed comprehensive documentation for 8+ production servers using our systematic five-phase methodology, producing deliverable-ready chapters for D5.3 covering the full infrastructure stack from load balancers to database clusters.

Publications: CERN's work in the project has produced a substantial publication record. Six first-authored papers span disability information retrieval and infrastructure architecture: two at OSSYM 2025 (Knowledge Sovereignty in Disability IR; Architecting the URL Frontier datastore), two at OSSYM 2024 (Federated Data Infrastructure for the Open Web Search; Architecting the OpenSearch service at CERN), one accepted at the Cambridge Forum Journal on AI: Culture and Society (empirical ethics, article in progress), and one submitted to SEASON, the Search Engines and Society Network (Ethical Privacy in Disability Data Retrieval). As co-author, contributions include the Springer book chapter on the Open Web Index (2024), the JASIST journal article on Open Web Index impact (2023), federated infrastructure papers at CS3 2025 and EGI 2024/2025, plus two Zenodo deliverables (Pilot Infrastructure Launch; Training Material for Partners). A companion preprint is deposited at CERN's document server (CERN-OPEN-2025-004). Three feature articles were published on home.cern and in the CERN Courier as part of WP7 dissemination.

What are the challenges you have been facing (regarding your tasks)?

Authenticated crawling at institutional scale. CERN's web estate sits behind a Keycloak Single Sign-On layer that conventional crawlers cannot penetrate. Building browser-based authentication into the crawling pipeline – using Playwright to handle OAuth2 flows, session tokens, and cookie management at scale – required significant engineering effort and careful coordination with various CERN teams.

Network access complexity. CERN's network security model requires two-hop SSH access (desktop → lxplus gateway → target server) with different authentication patterns per server type. Communication between servers and workstations requires staging through intermediate nodes, which was an interesting engineering challenge in its own right.
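For routine access, the two-hop pattern can be automated with OpenSSH's ProxyJump directive. The sketch below is illustrative: the target hostname and username are placeholders, and lxplus.cern.ch is CERN's public login gateway.

```
# ~/.ssh/config
Host ows-server                # placeholder alias for a target server
    HostName <target-host>     # internal hostname, placeholder
    User <cern-username>       # placeholder
    ProxyJump lxplus.cern.ch   # first hop through the lxplus gateway
```

With this in place, a single `ssh ows-server` stages the connection through the gateway transparently.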

Which milestones do you plan to achieve in the remaining months?

In the remaining project period, we are focusing on completing and polishing our deliverable contributions and extending the search applications:

D5.3 completion: Finalise the remaining server documentation chapters and integrate all CERN infrastructure sections into the consolidated deliverable, including the URL Frontier evolution narrative, crawler infrastructure, and federation topology.

D4.4 integration: Complete the CERN search applications section with final evaluation results, upload the Zenodo reproducibility artifact package, and integrate figures and cross-references into the consolidated document.

Full estate crawling: Extend the institutional search from the current 6,738-page public subset to CERN’s full 25,000+ page web estate, integrating authenticated content into the MOSAIC index with appropriate access controls.

Nooon enhancements: Implement topic-level Curlie filtering for semantic corpus construction beyond keyword matching, and explore cross-corpus comparison capabilities (e.g., Disability in Employment vs. Disability in Education) tailored to HR and D&I workflows.

Frontier integration: Connect the authenticated crawlers to CERN’s URL Frontier infrastructure for continuous, scheduled crawling rather than the current manual campaign-based approach.

What makes the OWS project special to you?

The OWS project represents something genuinely rare: the attempt to build a public, European alternative to the commercial search infrastructure that shapes how billions of people access information. Working on this at CERN feels especially fitting — the web was born here, and now we are contributing to ensuring it remains open and searchable by everyone, not just by those who can afford to build their own index.

What makes it personally meaningful is the Nooon component. Building a search engine specifically for disability-related knowledge – one that surfaces voices and resources that mainstream search systematically underrepresents – connects the project’s technical ambitions to real human outcomes. When an HR professional can discover evidence-based accommodation guidelines or a disability advocate can find peer-reviewed employment research through an open, privacy-preserving infrastructure, that is the kind of impact that motivates the work.

The project also demonstrates that European research institutions can collaborate on infrastructure at scale. The iRODS federation across five sites in three countries, the shared URL Frontier coordinating billions of URLs, the OWILIX tooling that lets anyone extract a thematic slice of the web – these are building blocks for digital sovereignty that go beyond any single institution’s capability.

Do you already have plans for the time after the project ends?

Yes, several strands of work are designed to continue beyond the project timeline:

Nooon and fair hiring: Nooon is supported through the Open Search Foundation’s Ethics working group and CERN’s Disability Network within the Diversity and Inclusion programme. There is active interest from HR departments in exploring fair hiring tools built on open search infrastructure. We plan to extend Nooon to multilingual and multimodal corpora incorporating lived-experience contributions from disabled people and caregivers, with client-side preprocessing to protect sensitive employment data.

AccGPT integration: The authenticated crawling infrastructure continues to support AccGPT’s knowledge base requirements independently of the OWS project. The 3.3 GB already delivered serves as the foundation, with plans to extend coverage to CERN’s full web estate and establish continuous crawl schedules.

Infrastructure sustainability: The MOSAIC deployments on open-science-search serve as reference implementations for institutional search at CERN. The documented infrastructure and the systematic methodology we developed for server documentation provide templates that other institutions can adapt for their own open search deployments.

Open science artifacts: All reproducibility artifacts – seed inventories, crawler source code, pipeline outputs, and evaluation data – will be publicly available on Zenodo, ensuring that our contributions remain accessible and reproducible for the broader research community working on open web search.

Thank you for the interview!

Read more about CERN: https://home.cern/

Watch our interview with Noor about the search engine Nooon:

From shop counter to online catalogue: Inside the DTCommerce project

A Slovenian team set out to build open-source tools that help small retailers go digital easily: importing a product list from a spreadsheet into an online shop – with AI-enhanced descriptions and images – in just a few clicks

For small and medium-sized brick-and-mortar retailers, the move from physical shops to e-commerce is a long and cost-intensive process. These businesses typically have an accounting system with a list of products, perhaps a supplier's website with technical specifications, and neither the time nor the budget to manually write product descriptions, source images, and populate an online shop for hundreds or thousands of items. The result is that many small retailers either delay their digital transition or end up with online catalogues that are sparse, poorly described, and unappealing to customers.

The main challenge is not a lack of products but a lack of digital product content. A physical shop's inventory usually exists as a list of names, SKU (stock-keeping unit) codes, and prices in accounting software. An online shop needs much more: well-crafted product descriptions, high-quality images, metadata, and an engaging presentation. Creating this content quickly becomes a substantial undertaking.

The DTCommerce project, carried out by the Slovenian company ZenLab with funding from the European OpenWebSearch.eu project, set out to solve this problem with an automated solution. The idea is simple: take the product list a shop already has, find the corresponding product information on the web, enhance the descriptions using AI, and deliver the result as a ready-to-use online shop – with minimal manual effort.

The Approach: Automated Extraction and AI Enhancement

The DTCommerce system operates in two stages. The first stage is a web crawling process that, given a list of product URLs from supplier or manufacturer websites, automatically extracts the key product information: title, description, imagery, price, and technical specifications. The crawler is built on Scrapy, a well-established open-source web scraping framework, and includes support for structured data formats (JSON-LD) as well as domain-specific extractors for particular target sites.
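Independently of the Scrapy machinery, the JSON-LD path can be sketched with Python's standard library alone. This is an illustration of the general technique, not DTCommerce's actual extractor; the field names and sample HTML are invented:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect JSON-LD objects from <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            text = "".join(self._buf).strip()
            if text:
                self.items.append(json.loads(text))
            self._buf = []
            self._in_jsonld = False

def extract_product(html):
    """Return key fields for the first schema.org Product found, else {}."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for item in parser.items:
        if item.get("@type") == "Product":
            offer = item.get("offers", {})
            return {
                "title": item.get("name"),
                "description": item.get("description"),
                "image": item.get("image"),
                "price": offer.get("price"),
            }
    return {}

html_page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Cordless Drill X200", "description": "18 V cordless drill.",
 "image": "https://example.com/x200.jpg",
 "offers": {"@type": "Offer", "price": "89.90", "priceCurrency": "EUR"}}
</script></head><body></body></html>"""
product = extract_product(html_page)
```

Sites that publish schema.org markup can be handled generically like this; the domain-specific extractors mentioned above cover sites that do not.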

The second stage is where AI comes in. The raw product descriptions extracted from supplier websites are often technical, dry, and written for a trade audience rather than end consumers. DTCommerce feeds these descriptions to an AI language model (Perplexity AI’s sonar-pro), which rephrases them into clearer, more engaging copy while preserving every technical detail – dimensions, model numbers, and specifications. The original description is retained alongside the enhanced version, so nothing is lost. The result is a set of enriched product records in a standardised format, ready to be imported into an e-commerce system.
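The enrichment call follows the now-common chat-completions request shape. A minimal sketch of assembling such a request is below; the prompt wording is invented for illustration, and DTCommerce's actual prompt and request parameters are not published:

```python
def build_enrichment_request(product, model="sonar-pro"):
    """Assemble a chat-completion payload asking the model to rewrite a raw
    supplier description for end consumers while preserving every technical
    detail. Prompt text is illustrative, not DTCommerce's actual prompt."""
    system_prompt = (
        "Rewrite the following product description for an online shop: "
        "clear, engaging, and consumer-friendly. Preserve all dimensions, "
        "model numbers, and technical specifications exactly as given."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user",
             "content": f"{product['title']}\n\n{product['description']}"},
        ],
    }

request_payload = build_enrichment_request(
    {"title": "Cordless Drill X200",
     "description": "18 V, 2 Ah battery, 13 mm keyless chuck."})
```

Keeping the raw description in the user message, and the style instructions in the system message, makes it easy to verify afterwards that every specification survived the rewrite.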

From Pipeline to Plugin: A Few Clicks to a Full Shop

To make the pipeline usable for non-technical shop owners, the ZenLab team built a WordPress/WooCommerce plugin that wraps the entire workflow into a simple administrative interface. The process works as follows: the shop owner exports a product list from their accounting software as an Excel file and uploads it to the plugin. The plugin creates basic product entries in WooCommerce, sends them to the enrichment service, and automatically populates each product page with enhanced descriptions and images – all without requiring the shop owner to edit a single product manually.

An Honest Detour: When the Open Web Index Didn’t Have What Was Needed

DTCommerce was originally designed to use the Open Web Index (OWI) as its primary data source for finding product information across the web. The vision was that a shop owner could provide a product name or SKU code, and the system would search the OWI to find matching products on supplier and manufacturer websites, automatically retrieving descriptions and images.

In practice, the specific e-commerce sites that the project’s use cases required were not present in the OWI at the time of development. This is not surprising: the OWI is still being built, and its coverage of niche commercial sites – particularly smaller B2B suppliers – is not yet comprehensive. The team adapted by switching to direct web scraping of predefined supplier URLs, which allowed the project to deliver its core functionality on schedule.

Why the project matters

DTCommerce addresses a real and widespread problem. Across Europe, millions of small retailers face pressure to establish an online presence but lack the resources to do so effectively. By automating the most labour-intensive part of the process – creating digital product content – the project lowers the barrier to entry in a meaningful way. The fact that the tools are open source and built on widely used platforms (WordPress, WooCommerce, Scrapy) means they are accessible to a broad audience and can be adapted to different markets and product domains.

The project also illustrates a type of application that open web search infrastructure is well suited to support. The ability to search an open web index for product information – matching a local shop’s inventory against the broader web – is precisely the kind of use case that depends on open, non-proprietary access to web data. As the OWI matures, tools like DTCommerce stand to benefit directly. The project overall also demonstrates both the potential of the OWI-based approach and its current practical limits.

Final Outlook

Work on DTCommerce will continue with further development of tools compatible with other e-commerce platforms. The tools will remain open source, and the company will develop an automated portal for data exchange and enrichment, available on demand for various e-commerce integrations.

Find the full project report here: https://zenodo.org/records/18300935

The DT Commerce project was funded under the OpenWebSearch.eu initiative (Horizon Europe, Grant Agreement 101070014, Call #2).

 

The case for Neural Crawling: Inside the FUN project

A research team from Pisa and Glasgow proposes that AI language models should decide which web pages to download – and shows why this matters for the future of search

Before a search engine can find anything, it must first build a collection of web pages to search through. This collection is assembled by a crawler – a piece of software that systematically visits web pages, follows links, and downloads content. The decisions the crawler makes about which pages to prioritise determine, in a very direct way, what the search engine will eventually be able to find.

For over two decades, the dominant approach to crawling prioritisation has been PageRank and related link-analysis methods: pages that are linked to by many other important pages are assumed to be important themselves. This was a reasonable assumption in the era of keyword search. But search is changing. Users increasingly ask questions in natural language rather than typing keywords, and automated systems like retrieval-augmented generation (RAG) pipelines issue their own queries to search engines. These new kinds of queries demand pages with rich, coherent, meaningful content – and there is no guarantee that such pages are also the most popular or the most linked-to.

The FUN project – Focused Neural Crawling – funded under the European OpenWebSearch.EU project and carried out by researchers at the University of Pisa and the University of Glasgow, tackles this problem head-on. It proposes a new paradigm: instead of using link popularity to decide what to crawl, use AI language models to estimate the semantic quality of web pages and prioritise accordingly.

Why crawling matters more than you might think

It is easy to focus on the visible parts of a search engine – the ranking algorithms, the interface, the speed of results – and overlook the crawler. But the crawler is the primary content filter in the entire search pipeline. It decides what gets downloaded, stored, and indexed. Everything that happens downstream – indexing, ranking, retrieval – operates only on the content the crawler has already collected. A sophisticated ranking algorithm cannot compensate for a poor crawling strategy: if valuable pages were never downloaded, they simply do not exist as far as the search engine is concerned.

The web is vast, and no crawler can download everything. Choices must be made, and the heuristics that guide those choices shape the quality of the entire search corpus. Traditional heuristics like PageRank assume that a page’s importance can be inferred from its position in the web’s link structure. This works well when search queries are short keyword strings and when the most popular pages tend to be the most useful. But the FUN team argues that this assumption is increasingly outdated.

The Shift: From link popularity to semantic quality

The core idea behind neural crawling is straightforward: instead of asking “How popular is this page?”, the crawler asks “How likely is this page to contain content that would be useful for answering a search query?” To answer this question, the system uses a neural quality estimator – a small language model that has been trained to predict, from the text of a document alone, whether that document is likely to be relevant to any query. The model does not need to know what the query will be; it assesses the intrinsic quality of the text itself: its coherence, informativeness, and semantic richness.

There is an obvious practical problem: the crawler needs to decide whether to prioritise a page before it has downloaded that page. It cannot read the text of a page it has not yet fetched. The FUN team addresses this with two quality propagation strategies. The first is based on the observation that web pages tend to link to other pages of similar quality. If a high-quality page links to an unknown page, there is a reasonable probability that the unknown page is also of decent quality. The crawler can therefore use the quality of already-downloaded pages as a proxy for the likely quality of the pages they link to.
The second strategy works at the domain level: pages within the same domain tend to have similar quality. Once the crawler has downloaded a few pages from a domain, it can estimate the quality of the domain as a whole and use that estimate to prioritise other pages from the same domain.
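The two propagation strategies can be sketched as a priority-queue frontier. This is a toy illustration, not the FUN implementation: the real quality estimator is a trained language model, whereas the lexical-variety scorer below merely stands in for it, and all class and function names are invented:

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

def estimate_quality(text):
    """Stand-in for the neural quality estimator: any model mapping
    document text to a score in [0, 1] fits this slot."""
    return min(len(set(text.split())) / 50.0, 1.0)  # toy proxy: lexical variety

class NeuralFrontier:
    """Frontier prioritised by propagated semantic quality."""
    def __init__(self, use_domain_quality=True):
        self.use_domain_quality = use_domain_quality
        self.domain_scores = defaultdict(list)  # domain -> qualities seen so far
        self._heap = []                         # (-priority, url)
        self._seen = set()

    def record_download(self, url, text):
        """Score a fetched page and remember its domain's quality."""
        quality = estimate_quality(text)
        self.domain_scores[urlparse(url).netloc].append(quality)
        return quality

    def add_links(self, parent_quality, links):
        """Enqueue outlinks, prioritised by domain quality if available,
        otherwise by the quality of the linking page."""
        for url in links:
            if url in self._seen:
                continue
            self._seen.add(url)
            scores = self.domain_scores[urlparse(url).netloc]
            if self.use_domain_quality and scores:
                priority = sum(scores) / len(scores)   # domain-level (DomQ-style)
            else:
                priority = parent_quality              # link-level propagation
            heapq.heappush(self._heap, (-priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = NeuralFrontier()
q = frontier.record_download("https://good.example/a",
                             "rich varied informative prose about physics " * 20)
frontier.add_links(q, ["https://good.example/b", "https://other.example/c"])
```

The known domain's pages are prioritised by its running average quality, while pages on unseen domains inherit the quality of whichever page linked to them.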

What the experiments show

The team tested their approach through large-scale simulations on ClueWeb22-B, a web corpus of 87 million pages, using two different sets of test queries. One set consisted of traditional keyword queries; the other consisted of natural language questions.

The results are striking. On natural language queries, the neural crawling strategies consistently outperformed PageRank in both the quality of the crawled corpus and the effectiveness of downstream search results. The domain-level strategy (DomQ) was particularly strong, building corpora that led to substantially better retrieval performance. On traditional keyword queries, the neural strategies performed comparably to PageRank – they did not lose ground on the type of search that PageRank was designed for.

The efficiency results were also notable. Neural crawlers collected relevant pages faster than PageRank in the early stages of the crawl, meaning they built useful search corpora more quickly and with less wasted bandwidth downloading low-value pages. This matters in practice, because crawling the web is expensive in terms of network resources, storage, and computing time.

A key finding underpinning the domain-level approach is that the semantic quality of a web page is strongly correlated with the average quality of other pages on the same domain (Pearson correlation of 0.649). By contrast, the equivalent correlation for PageRank scores is much weaker (0.272). In other words, knowing that a domain tends to host high-quality content is a much better predictor of individual page quality than knowing that a domain is well-linked.

Why it matters

The FUN project is significant for the OpenWebSearch.EU project in a very direct way. Running an open European web index requires crawling decisions – and the quality of those decisions determines the quality of the index. If an open web index is built using traditional crawling heuristics, it inherits the biases of those heuristics: a preference for popular, well-linked content at the expense of semantically rich but less connected pages. Neural crawling offers a way to build corpora that are better suited to the modern demands of natural language search and AI-powered information retrieval.

To make this practical, the FUN team produced not just research findings but usable software. Their quality scoring tools are compatible with the OWS parquet file format and integrated into Resilipipe, the open-source content analysis framework used by OpenWebSearch.eu.

The FUN project demonstrates that the way we crawl the web should evolve alongside the way we search it. As search queries become more conversational and AI systems become major consumers of search infrastructure, the assumption that link popularity is the best guide for crawling priorities is no longer sufficient. Neural quality estimation offers a complementary – and in many cases superior – signal.

What’s next

Future work could explore combining neural and link-based signals in hybrid strategies, using ensembles of quality estimators that assess different dimensions of page quality (spam, machine-generated content, factual accuracy), and evaluating neural crawling with more advanced retrieval models beyond BM25. The approach could also be adapted for other tasks that depend on corpus quality, such as building high-quality training data for large language models.

To read the full technical report, go here: https://zenodo.org/records/17359141

The FUN project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

How Dutch municipalities are sharing Search Intelligence to serve citizens better: Inside the CIFFIL Service project

The CIFFIL Service project shows that open web index standards can help small municipalities improve their search quality by accessing results from larger ones

Search engines work best when they have a lot of data to learn from. The more documents in a collection, the better the system can distinguish between common words and genuinely informative ones – and therefore the better it can identify what is relevant to a query. This is a well-known principle in information retrieval, and it creates an obvious problem for anyone who needs to search a small collection of documents: the search results are simply not as good as they could be.

The CIFFIL Service project, funded under the European OpenWebSearch.EU project, tackled exactly this problem – in a setting with direct consequences for citizens. Spinque, a Dutch search technology company, builds search systems for municipalities that allow council members and residents to search through publicly available government documents. Some of these municipal collections are small, containing fewer than 10,000 documents, and are full of domain-specific jargon. The result is that search quality suffers. The CIFFIL project set out to fix this by allowing municipalities to share their search index data with one another through an open standard.

The problem: Small collections, unreliable statistics

Most search engines use some variant of a ranking algorithm called BM25. At its core, BM25 judges the relevance of a document to a query by looking at how often the query terms appear in the document and how rare those terms are across the collection as a whole. Terms that appear in only a few documents are treated as stronger signals of relevance than terms that appear everywhere.
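The term-level computation BM25 performs fits in a few lines. The sketch below uses the Lucene-style IDF variant with conventional parameter defaults; it illustrates the general formula, not necessarily the exact variant Spinque Desk implements:

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to a document's score.

    tf: term frequency in the document; df: number of documents containing
    the term; N: collection size; dl/avgdl: document length vs. the average.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms get high IDF
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * norm_tf

# A rare term (in 5 of 10,000 docs) vs. a ubiquitous one (in 9,000 of them):
rare_score = bm25_term_score(tf=3, df=5, N=10_000, dl=100, avgdl=120)
common_score = bm25_term_score(tf=3, df=9_000, N=10_000, dl=100, avgdl=120)
```

Everything in the IDF factor depends on collection-wide statistics (df and N), which is exactly why the size and representativeness of the collection matter so much.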

This is where small collections fall short. When a collection has only a few thousand documents, the estimates of how common or rare a term is become unreliable. The ranking algorithm, relying on these skewed statistics, makes poor decisions about what is relevant. The result for the user is a search experience that feels hit-or-miss.

The solution: Sharing index data through an open standard

The CIFFIL team’s approach is simple. If a small municipality’s search system suffers from unreliable statistics because its collection is too small, why not supplement those statistics with data from a larger municipality that deals with similar types of documents? After all, Dutch municipal documents share a common vocabulary of administrative, legal, and policy language.

The technical mechanism for this sharing is the Common Index File Format, or CIFF – an open standard developed in the information retrieval research community for exchanging inverted index data between systems. An inverted index is the core data structure behind a search engine: it maps every term in a collection to the documents in which that term appears, along with statistics such as how often it appears and in how many documents.

Spinque integrated CIFF support into its search platform, Spinque Desk. This involved building a CIFF reader (to import index data), a CIFF writer (to export index data), and – critically – a modified BM25 ranking component that can combine the statistics from a local collection with those from an external CIFF index. When a small municipality’s search system uses this combined approach, it effectively “borrows” the larger municipality’s understanding of which terms are common and which are rare, while still searching its own documents.
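The "borrowing" step can be illustrated by computing IDF over the union of local and background statistics. This is a sketch of the idea under a simple additive combination; the exact scheme Spinque implemented may differ:

```python
import math

def idf(df, N):
    """Lucene-style BM25 inverse document frequency."""
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def combined_idf(df_local, n_local, df_background, n_background):
    """IDF over the union of local and background collection statistics,
    a simple additive sketch of the CIFFIL 'borrowed statistics' idea."""
    return idf(df_local + df_background, n_local + n_background)

# A jargon term that looks common in a tiny municipal collection (40 of 400
# docs) but is genuinely rare across a large background one (50 of 500,000):
local_only = idf(40, 400)
borrowed = combined_idf(40, 400, 50, 500_000)
```

With only local statistics the term is down-weighted as if it were common; with the borrowed background statistics it regains the high weight its actual rarity deserves, while retrieval still runs over the local documents only.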

Proof of concept

The team validated the new functionality in two ways. They ran manual experiments using CIFF exports to check that published effectiveness results on open datasets could be replicated, and they wrote unit tests to ensure the parser and writer produce indexes that conform to the CIFF specification.

The results were clear. The small collection performed substantially worse than the baseline, confirming that skewed statistics degrade search quality. But when the small collection borrowed statistics from the larger one, performance not only recovered but actually slightly exceeded the baseline – because the small collection, now ranked with accurate statistics, contained a higher concentration of relevant documents.

In practice

The project created CIFF indices for four major Dutch municipalities: Amsterdam, Utrecht, Nijmegen, and Almere. A live deployment was initiated for the municipality of Nieuwegein, a smaller city near Utrecht, using the Utrecht index as the background collection. Evaluation of the real-world impact on user experience is ongoing.

All of the CIFF tools developed during the project have been released as open-source software, and the export service ensures that published indices are automatically updated when the underlying data changes.

Why it matters

The CIFFIL project illustrates a principle at the heart of OpenWebSearch.eu: that open, interoperable standards can enable forms of cooperation that proprietary systems cannot. By sharing index statistics through CIFF, municipalities can improve their search quality without sharing their actual documents, without depending on a single commercial provider, and without each needing to build a large collection of their own. It is a form of search infrastructure as a public good.

The approach is also notable for its simplicity. It does not require neural models, large language models, or expensive computational resources. It works by making better use of data that already exists, through a well-understood ranking algorithm and an open file format.

What’s next

The immediate priorities are completing the open publication of all four municipal indices, conducting user-experience evaluations in the live deployments, and publishing the experimental findings as a research paper. Longer-term, the approach could be extended to additional municipalities and to other domains where small document collections need better search – such as cultural heritage institutions, local archives, or specialised libraries. The underlying principle – that sharing standardised index data can improve search quality without centralising control – has broad applicability wherever open, cooperative search infrastructure is valued.

Find the full project report here: https://zenodo.org/records/17750643

The CIFFIL Service project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

OWS.EU Partner in Focus: IT4I

The next partner we are introducing is the IT4Innovations National Supercomputing Center. IT4Innovations is a university research institute of VSB – Technical University of Ostrava, providing technical support and supercomputing power for the Open Web Index. The research team is guided and supported by the managing director of IT4Innovations, Vít Vondrák. The team includes Jan Martinovič, head of the Advanced Data Analysis and Simulations Lab, Kateřina Slaninová, deputy head of the lab, Martin Golasowski, senior researcher, and Markéta Dobiašová, research outreach and exploitation specialist. The IT4Innovations team has two main functions: it contributes to the infrastructure work package and actively participates in dissemination and communication of the OpenWebSearch.eu project.

Martin Golasowski is leading technical activities related to establishing a federated data infrastructure, and Kateřina Slaninová is the task leader of the dissemination activities. We asked them about crucial milestones for the last project period and beyond.

Please describe your organisation’s tasks in the project. What is your field of expertise that you bring to the project?

Martin: Within the OpenWebSearch.eu project I focus mainly on creating tools that allow efficient data processing, movement, and publishing at major infrastructure providers across Europe. We have been able to establish a distributed data infrastructure, and our tools are used to orchestrate complex computing workflows across the connected data and computing infrastructure. Here we apply our expertise in data processing and analysis on HPC infrastructure to enable its efficient use for web data processing and index generation. Our contribution has been important because it provides the basic infrastructure for a transparent Open Web Index that can easily be extended by new infrastructure providers.

Kateřina: As the dissemination task leader, IT4Innovations has been responsible for planning, coordinating, and executing project dissemination activities, mainly the actions aimed at spreading project results to relevant stakeholders (researchers, policymakers, industry, or SMEs).

How is the project progressing? Which major milestones did you achieve?

Martin: During the project, we were able to leverage European computing centres for web data processing and publishing, such as IT4Innovations in the Czech Republic, LRZ in Germany, and CSC in Finland. Our tooling based on the LEXIS Platform allowed the project to efficiently utilise these powerful infrastructures and to provide the general public with access to the project's data products through a web interface.

Kateřina: The project is progressing very well. One of our most significant achievements was the successful dissemination of the European Open Web Index (OWI), the first federated, pan‑European web index designed to support fair, transparent, and unbiased web search. The index is now publicly available for research and development use.

What are the challenges you have been facing?

Martin: Establishing a common data and computing infrastructure across providers located in different countries always means dealing with specific technical and policy challenges. We have been able to achieve our goals thanks to fruitful collaboration with the individual computing centres and their teams.

Kateřina: The Open Web Search team operates a complex, large‑scale technical system involving crawling, data processing, and indexing on a distributed computing infrastructure. Translating these technical milestones into messages understandable to general stakeholders takes ongoing effort. Aligning communication across 14 diverse partner institutions is also challenging: with many partners from universities, supercomputing centres, and research institutions, coordination requires consistent messaging and clear dissemination workflows.

Which milestones do you plan to achieve in the remaining months?

Martin: Towards the end of the project, we are focusing on contribution to the final deliverables and transition to the sustainability phase, which includes preparation of the infrastructure for operation after the end of the project and contribution to its documentation, which will be publicly available.

Kateřina: We aim to increase engagement with researchers, developers, and innovators to encourage them to use the public OWI datasets. In the last year of the project, we organised and joined various events where we promoted the OWI, namely ISC High Performance 2025, NGI Forum 2025, #OSSYM 2025, EBDVF 2025, and a parliamentarian breakfast, and most recently we promoted the OWI at SCA/HPCAsia 2026.

What makes the OWS project special to you?

Martin: The project is a unique undertaking aiming to provide a transparent way to access indexed data from the public web. We were able to contribute thanks to our extensive experience with high-performance computing infrastructure and technologies like iRODS, the LEXIS Platform and our collaboration with European initiatives like EUDAT. Being a part of this project has also given us an opportunity to validate our tools in this specific domain of large-scale web data processing.

Kateřina: For me, what makes OpenWebSearch.eu truly special is its mission: restoring open, unbiased, transparent access to information in Europe, and reducing dependence on large global tech companies that control search infrastructure. The project’s collaborative nature creates a unique community working towards a shared vision of a more democratic web. It is the first EU‑funded initiative to build a public web index that anyone can reuse – researchers, innovators, SMEs, and even future search engine developers. This aligns with strong European values: openness, ethics, legal clarity, and digital sovereignty.

Do you already have plans for the time after the project ends?

Kateřina: IT4Innovations will organise its future exploitation activities into a programme covering follow‑up European projects, cooperation with the public sector, service deployment, and growth of its federated ecosystem. It will work with the Czech EOSC initiative on aligned services and training activities, prepare and implement Horizon Europe continuation projects, and continue pilots with public authorities, focused on citizen‑oriented and web‑intelligence services. The Czech national supercomputing centre will also contribute, together with CSC, to sovereign AI services using Open Web Data in the LUMI AI Factory, and create reusable workflow templates for public‑sector, industrial, SME, and startup use cases.

Thank you for the interview!

Read more about IT4I: https://openwebsearch.eu/partners/vsb-technical-university-of-ostrava-it4innovations/

Building trustworthy access to medical information: Inside the TILDE project

The TILDE project builds a health search system that doesn’t just find answers – it checks them for bias, explains the underlying reasoning, and lets users explore the evidence visually

Search for a health question online and you will get plenty of results. But how much can you trust what you find? Are the top results there because they are the most accurate, or because they happen to be the most popular? Are they showing you the full picture, or a skewed one – biased toward certain demographics, viewpoints, or types of sources?

The TILDE project – Trustworthy Access to Knowledge from the Indexed Web – funded under the European OpenWebSearch.eu project and carried out by Know Center Research GmbH in Austria, tackles this issue directly. It builds a health domain search system on top of the Open Web Index that goes beyond finding relevant documents to actively examining search results for bias, ensuring viewpoint diversity, and providing visual tools to help users explore the evidence for themselves.

The Problem: Bias in health search

Health information is one of the most searched-for categories on the web, and also one of the most consequential. A search for COVID-19 treatment options, for example, should ideally return results that are medically accurate, drawn from credible sources, and representative of different perspectives – official health guidance, clinical research, patient experiences. In practice, standard search systems optimise for another kind of relevance, which is often approximated by popularity and click behaviour. This can systematically favour certain types of content while marginalising others.
The problem is compounded when large language models are involved. LLM-based systems, including RAG (retrieval-augmented generation) pipelines, inherit and can amplify biases present in both their training data and the documents they retrieve.
A search result list that is geographically skewed, lacks viewpoint diversity, or reinforces stereotypes about particular demographic groups is not just an academic concern – it can directly affect how people understand their health options.

The Approach: Three modules for trustworthy search

TILDE addresses this through three integrated modules, each tackling a different dimension of the problem.

The first module extracts medical knowledge from the Open Web Index. Starting from approximately 200,000 health-related websites identified in the OWI, the team extracted medical entities – diseases, symptoms, drugs, procedures – using a named entity recognition model, then standardised these entities against the UMLS clinical ontology (a comprehensive medical terminology system). This creates a structured knowledge layer on top of the raw web content. The extracted entities and their relationships form a medical knowledge graph that links websites to each other and to clinical concepts. A hybrid search engine combines entity-based retrieval (finding pages that mention specific medical concepts) with semantic similarity search (finding pages whose content is meaningfully related to the query), fusing the results to balance precision and recall.
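The fusion of two ranked lists can be sketched with reciprocal rank fusion, one common way to combine an entity-based ranking with a semantic one. Both the fusion method and the page names here are illustrative assumptions, not TILDE's actual implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one.

    Each list contributes 1 / (k + rank) per document; the constant k
    dampens the dominance of top-ranked items."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

entity_hits = ["page_diabetes", "page_insulin", "page_diet"]       # entity-based retrieval
semantic_hits = ["page_insulin", "page_glucose", "page_diabetes"]  # similarity search
fused = reciprocal_rank_fusion([entity_hits, semantic_hits])
print(fused[:2])  # ['page_insulin', 'page_diabetes'] -- pages found by both methods rise
```

Documents retrieved highly by both strategies end up at the top, which is how a hybrid engine balances the precision of entity matching with the recall of semantic search.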

The second module checks search results for fairness and trustworthiness. This is TILDE’s most distinctive contribution. Built on DSPy, a Stanford framework for programmatic LLM pipelines, the trustworthiness module processes search results through three stages. First, each candidate document is enriched with fairness-related attributes: its viewpoint (official guidance, patient narrative, investigative journalism), its source credibility (from high-authority institutional sources down to user-generated content), whether its content is factual or anecdotal, and a gender neutrality score. Second, an intelligent re-ranker uses these attributes to reorder results according to a strict hierarchy: maximise fairness first, then filter for credibility, then ensure viewpoint diversity. The system uses chain-of-thought reasoning, meaning it explains its re-ranking decisions step by step. Third, a stereotype audit inspired by established bias benchmarks checks both the system’s internal reasoning and its user-facing output for harmful stereotypes – a safety net against the system itself introducing bias.
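A minimal sketch of that re-ranking hierarchy – credibility filter, fairness ordering, then viewpoint diversity – using illustrative field names. These names, thresholds, and the greedy diversity pass are assumptions, not TILDE's actual schema or DSPy pipeline:

```python
def rerank(results, min_credibility=0.5):
    """Hedged sketch: drop low-credibility results, order by fairness,
    then greedily promote viewpoints not yet shown."""
    credible = [r for r in results if r["credibility"] >= min_credibility]
    ranked = sorted(credible, key=lambda r: r["fairness"], reverse=True)
    final, seen = [], set()
    while ranked:
        # Prefer the best remaining result with an unseen viewpoint.
        pick = next((r for r in ranked if r["viewpoint"] not in seen), ranked[0])
        final.append(pick)
        seen.add(pick["viewpoint"])
        ranked.remove(pick)
    return final

results = [
    {"id": "blog",  "fairness": 0.9, "credibility": 0.3, "viewpoint": "anecdotal"},
    {"id": "who",   "fairness": 0.8, "credibility": 0.9, "viewpoint": "official"},
    {"id": "gov",   "fairness": 0.7, "credibility": 0.9, "viewpoint": "official"},
    {"id": "story", "fairness": 0.6, "credibility": 0.7, "viewpoint": "patient"},
]
print([r["id"] for r in rerank(results)])  # ['who', 'story', 'gov'] -- blog filtered out
```

Note how the patient narrative is promoted above the second official source to diversify viewpoints, while the low-credibility blog never enters the ranking at all.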

The third module provides visual aids to help users understand the evidence. Rather than presenting search results as a flat list of links, the visual web interface allows users to explore medical information through multiple lenses: highlighted medical concepts within document text, faceted search by entity type, tag clouds and bar charts showing the frequency of different symptoms or drugs across results, co-occurrence matrices revealing relationships between medical concepts, and an interactive knowledge graph that can be expanded and filtered.

Why It Matters: Making fairness operational

There is no shortage of academic research on bias in search systems. What is less common is work that takes established fairness metrics – like the NFaiRR measure of retrieval fairness – and turns them into actionable components within a working search pipeline. TILDE does exactly this. The re-ranking module does not merely measure bias after the fact; it actively uses fairness criteria in real time to reorder results, while maintaining credibility and diversity as additional constraints. The chain-of-thought reasoning makes the process transparent: users and auditors can see why results were ranked the way they were.

The health domain is the testbed, but the approach is not limited to it. The same pipeline – entity extraction, hybrid retrieval, fairness-aware re-ranking with transparent reasoning, and visual analytics – could be applied to any domain where search results carry real-world consequences: legal information, financial advice, educational content, public policy. The fact that it is built on the Open Web Index, rather than on a proprietary search engine, means the underlying data is open and the approach is reproducible.

What’s Next

The immediate next step is completing the integration of the hybrid search across all components of the visual interface. Longer-term priorities include optimising the trustworthiness pipeline for real-time performance, extending the approach to additional health sub-domains, and conducting user studies to understand how fairness-aware re-ranking and visual analytics actually affect the way people seek and evaluate health information.

To read the full technical report, go here: https://zenodo.org/records/17542369

The TILDE project was funded under the OpenWebSearch.eu initiative (Horizon Europe, Grant Agreement 101070014, Call #2).

Teaching Search Engines to understand arguments: Inside the AKASE Project

A European research project is building a knowledge graph of public argumentation – and using it to make web search smarter

When you search the web for a contentious topic – e.g. how AI should be regulated – you get a list of links ranked by relevance to your keywords. What you do not get is any indication of whether the arguments in those documents are well-structured, logically sound, or represent a balanced range of perspectives. The search engine has no understanding of argumentation. It cannot tell you which pages contain strong reasoning and which are riddled with logical fallacies.

The AKASE project – Argumentation Knowledge-Graphs for Advanced Search Engines – set out to change this. Funded under the European OpenWebSearch.EU project and carried out at the University of Groningen, AKASE has built a large-scale computational map of public argumentation, extracted from tens of thousands of web documents, and used it to power two new kinds of tools: a search engine that ranks results by argumentative quality, and a multi-agent deliberation platform where humans and AI reason together.

The Problem: Arguments are everywhere, but remain unleveraged in web search

Public debate on the internet is vast. People argue about climate policy, healthcare, technology regulation, and countless other topics across news articles, opinion pieces, forums, and dedicated debating platforms. But the argumentation threads are scattered, unstructured, and variable in quality. Some arguments are carefully reasoned and well-supported; others rely on logical fallacies or present only one side of an issue.

Project AKASE addresses these challenges by developing a computational framework for extracting, organizing, and presenting argumentative content from the web in a coherent, scalable, and actionable way.

The Approach: Mapping the Structure of Public Debate

The AKASE team’s approach begins with a simple question: what are people actually arguing about? To answer it, they collected nearly 30,000 arguments from five online debating platforms and used a combination of advanced text embeddings and clustering algorithms to identify the distinct “issues” – the specific questions or sub-problems – that these arguments revolve around. After removing duplicates and merging near-identical formulations using large language models, they arrived at a set of roughly 16,000 unique issues, organised into 16 thematic domains ranging from politics and technology to health and ethics.
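The clustering step can be sketched with a toy greedy procedure over argument embeddings. The two-dimensional vectors and the similarity threshold below are made up for illustration; the project used far richer text embeddings and more advanced clustering algorithms:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each embedding to the first cluster whose representative
    (first member) is similar enough; otherwise start a new 'issue'."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy embeddings: arguments 0 and 1 are near-duplicate formulations,
# argument 2 raises a distinct issue.
embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(greedy_cluster(embs))  # [[0, 1], [2]]
```

Collapsing near-identical formulations this way is what reduces tens of thousands of raw arguments to a smaller set of distinct issues.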

But structured debating platforms represent only a fraction of online argumentation. Most arguments exist in ordinary web pages – news articles, opinion columns, policy documents – where they are expressed in natural language without explicit labels. To capture this unstructured content, the AKASE team developed an automated pipeline that reads web documents, identifies which sentences are argumentative, classifies them as claims or supporting premises, and determines the relationships between them.

The team went further by enriching these arguments with two additional layers of analysis. First, they annotated arguments with the human values they express – freedom, equality, security, and so on – capturing not just what people argue but the moral commitments that underpin their reasoning. Second, they developed methods for assessing argument quality: a system that generates probing critical questions to test an argument’s assumptions, and a multi-agent framework where multiple AI models deliberate with each other to detect logical fallacies.

The result: A Knowledge Graph of Argumentation

All of this analysis feeds into the project’s central artefact: the Argumentation Knowledge Graph, or AKG. This is a large, interconnected data structure that links topics, issues, claims, and premises across thousands of documents. It captures the logical and rhetorical relationships between argumentative units – which claims support each other, which ones conflict, and which essentially make the same point in different words.

Starting from an initial set of around 50,000 documents retrieved from the Open Web Index, the team extracted nearly half a million argumentative units and identified millions of relationships between them. A second processing phase expanded the data source to over 105 million documents. The resulting graph contains tens of thousands of interconnected nodes, with over 90 per cent belonging to a single connected component – meaning that you can navigate from virtually any argument to any other through a chain of related reasoning.
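The "single connected component" property is straightforward to measure with a breadth-first search. A toy sketch on a made-up graph of ten argumentative units:

```python
from collections import defaultdict, deque

def largest_component_share(edges, n_nodes):
    """Fraction of nodes in the largest connected component (BFS)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, best = set(), 0
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best / n_nodes

# Nodes 0-8 chained by support/attack relations; node 9 is isolated.
edges = [(i, i + 1) for i in range(8)]
print(largest_component_share(edges, 10))  # 0.9
```

A share above 0.9 means almost any argument can be reached from almost any other by following chains of related reasoning, as the AKG exhibits at scale.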

Two Applications: Smarter Search and Structured Deliberation

The AKASE team translated this knowledge graph into two practical tools. The first is an argument-aware search engine. When you submit a query, the system retrieves relevant documents as a conventional search engine would – but then it reranks them based on the argumentative quality of each document. Three criteria were used:

  • how well claims are justified
  • how coherently the argument is structured
  • whether the document presents a balanced range of perspectives rather than a one-sided view
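A minimal sketch of how such criteria might be blended into a re-ranking score. The field names, scores, and equal weighting are illustrative assumptions, not AKASE's actual scoring function:

```python
def argument_aware_rerank(docs, weight=0.5):
    """Blend retrieval relevance with the mean of the three quality
    criteria (justification, coherence, balance); weight controls how
    strongly argumentative quality overrides plain relevance."""
    def quality(d):
        return (d["justification"] + d["coherence"] + d["balance"]) / 3
    return sorted(
        docs,
        key=lambda d: (1 - weight) * d["relevance"] + weight * quality(d),
        reverse=True,
    )

docs = [
    {"id": "one_sided", "relevance": 0.9,
     "justification": 0.4, "coherence": 0.5, "balance": 0.1},
    {"id": "balanced", "relevance": 0.8,
     "justification": 0.8, "coherence": 0.9, "balance": 0.9},
]
print([d["id"] for d in argument_aware_rerank(docs)])  # ['balanced', 'one_sided']
```

The slightly less relevant but well-argued, balanced document outranks the keyword-heavy one-sided page, which is the behaviour the criteria are designed to produce.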

The system also generates a concise summary of the top results and suggests related issues from the knowledge graph, helping users explore the broader landscape of debate around their query.

The second tool is ArgsBase, a multi-agent deliberation platform. ArgsBase creates a structured discussion involving multiple AI agents, a human user, and a moderator. The AI agents contribute arguments, counterarguments, and refinements; the moderator manages the flow of discussion; and a real-time analyser tracks the evolving state of the debate, producing summaries and visual argument maps. In an initial user study, participants found this multi-agent format more useful than interacting with a single AI, precisely because the diversity of perspectives and the structured format encouraged deeper thinking.

Why it matters

Today’s information environment does not lack arguments – it lacks tools for navigating them. By building a computational infrastructure that can extract, organise, evaluate, and present argumentative content from the open web, AKASE offers a different model of information access: one where the quality of reasoning is a first-class signal, not an afterthought.

The ArgsBase platform, in particular, points toward an intriguing future for human–AI interaction. Rather than using AI as an oracle that delivers answers, it positions AI models as participants in a structured reasoning process – one where disagreement is productive, perspectives are made explicit, and the human user remains an active agent rather than a passive recipient. This is a model of AI-assisted thinking that takes critical reasoning seriously.

What’s Next

The AKASE team has identified several directions for future work: expanding the knowledge graph dynamically as new arguments emerge on the web, incorporating multi-modal content (not just text), and refining the deliberation platform through more extensive user studies focused on practical decision-making scenarios. The argument-aware search engine will also benefit from reduced latency and broader domain coverage.

To read the full technical report, go here: https://zenodo.org/records/17674255

The AKASE project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

Fighting Misinformation with the Open Web Index: Inside the VERITAS project

How a European research team built a browser-based fact-checking assistant powered by the #OpenWebIndex

In an information environment where misleading claims can spread in real time, the ability to verify what you read online is a necessity. The VERITAS project, funded under the European OpenWebSearch.EU project and conducted by DEXAI, set out to build a practical tool for exactly this purpose: an AI-powered assistant that sits in your browser, answers your questions with sourced evidence, and draws its knowledge not from a proprietary index controlled by a single corporation, but from an open, European web search infrastructure.

The Problem: Misinformation and the Limits of Conventional Search

The War in Ukraine has been accompanied by an unprecedented volume of online misinformation – from fabricated reports and manipulated imagery to subtly misleading narratives. For journalists trying to verify claims, researchers analysing media coverage, and ordinary citizens attempting to understand what is actually happening, conventional search engines offer limited help. They return ranked lists of links, but they do not assess the credibility of sources, provide citations for specific claims, or explain the basis for their answers. The burden of verification falls entirely on the user.

At the same time, the emergence of AI chatbots has introduced a new set of problems. Large language models can produce fluent, confident-sounding answers that are entirely fabricated – a phenomenon known as hallucination. Without mechanisms to ground their outputs in verifiable evidence, these systems risk becoming part of the misinformation problem rather than the solution.

The Approach: Retrieval-Augmented Generation via the Open Web Index

The DEXAI/VERITAS team adopted an approach known as RAG (Retrieval-Augmented Generation). The core idea: instead of prompting an AI model to generate answers from whatever it absorbed during training, you first retrieve relevant documents from a trusted knowledge base and then ask the model to compose its answer based specifically on those documents. Every claim in the response can thus be traced back to an identifiable source.

Rather than relying on a commercial search engine or a static dataset, the system draws its evidence from the Open Web Index (OWI).

In practice, the system works as follows. The VERITAS pipeline fetches the latest crawled web pages from the OWI (pulling the latest 30 days of crawled content). The dataset is then indexed using a semantic embedding model, which converts text passages into numerical vectors that capture their meaning. When a user poses a question, the system converts it into a similar vector, finds the most relevant passages in the index, and passes them – together with the question – to a language model (LLaMA 3.1), which generates a grounded response.
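The retrieval step can be sketched as nearest-neighbour search over passage vectors. Toy two-dimensional vectors stand in for real embedding-model output, and the passage texts are invented:

```python
import math

def top_k(query_vec, passages, k=2):
    """Return the k passages whose embedding is most similar (cosine)
    to the query vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return sorted(passages, key=lambda p: cos(query_vec, p["vec"]), reverse=True)[:k]

passages = [
    {"text": "Report on grain exports", "vec": [0.9, 0.1]},
    {"text": "Weather forecast",        "vec": [0.1, 0.9]},
    {"text": "Grain corridor update",   "vec": [0.8, 0.2]},
]
hits = top_k([1.0, 0.0], passages)
# The retrieved passages are then placed in the prompt as grounding context:
prompt = "Answer using only these sources:\n" + "\n".join(p["text"] for p in hits)
print([p["text"] for p in hits])  # ['Report on grain exports', 'Grain corridor update']
```

Because the language model is instructed to answer only from the retrieved passages, each claim in its response can be traced back to a concrete source.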


What They Built: A Fact-Checking Assistant in Your Browser

The finished product is a Chrome browser extension. Once installed, it provides a small popup where users can type questions in natural language and receive answers accompanied by source references.

The system is designed to serve different types of users in different ways. Journalists receive background information with explicit citations. Researchers get metadata-rich summaries. Members of the general public are given concise, jargon-free explanations. In the current prototype, the system is focused specifically on the War in Ukraine – a deliberate scoping decision that allowed the team to develop and validate the approach within a well-defined domain.

Why It Matters: Open Infrastructure for Trustworthy Information

VERITAS is significant not only for what it does, but for how it does it. By building on the Open Web Index rather than a proprietary data source, the project demonstrates that open search infrastructure can serve as the foundation for practical applications.

The RAG approach itself addresses one of the most persistent criticisms of AI-generated text: the lack of verifiability. By requiring the model to base its answers on retrieved documents and by presenting those documents to the user, VERITAS moves away from the “trust me” paradigm of conventional chatbots towards a “check for yourself” model of AI-assisted information access.

What’s next?

Future development could also introduce user feedback mechanisms, allowing the quality of responses to be improved over time, as well as streaming responses for a more interactive user experience. Perhaps most importantly, the VERITAS approach could be applied to other domains where information verification is critical – from public health to climate science to electoral integrity.
An outstanding challenge overall lies in the growing complexity of multi-model ecosystems. As retrieval systems, ranking components, embedding models, and large language models are increasingly combined – often across organisational and infrastructural boundaries – the integrity of the final answer depends on the entire chain. Outputs generated by one model may be ingested, summarised, or re-ranked by another, creating feedback loops that are difficult to detect and audit. In such environments, disinformation cascades can emerge when misleading or low-quality content propagates across interconnected systems, gaining credibility through repetition and algorithmic reinforcement. Ensuring traceability, cross-model accountability, and robust provenance mechanisms will be essential to prevent systemic amplification of false or manipulated claims.

To read the full technical report, go here: https://zenodo.org/records/17588890

The VERITAS project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

#ossym26 – Call for papers for 8th International Symposium on Open Search

8th International Symposium on Open Search #ossym2026: Call for papers and demos is open until 1 March

From 7 to 9 October 2026, the Open Search community will meet in Berlin for the 8th International Open Search Symposium.
Until 1 March 2026, researchers, experts and practitioners can submit scientific papers, practical experience reports and software demonstrations for #ossym2026. The hybrid conference, which is organised by OpenWebSearch.eu partners Open Search Foundation and German Aerospace Center (DLR), will take place online and at CODE Berlin University of Applied Sciences.

The call for papers and demos is aimed at a wide range of experts – invited are, among others, researchers and speakers from research and informatics, data centres, libraries, technology companies, politics, education as well as legal, ethical and societal thought leaders.

Full papers and abstracts presented at #ossym2026 will be published in open access in online proceedings (including DOIs and ISBN) following the event.

Submit your ideas now: https://opensearchfoundation.org/events-osf/ossym26/

Read more

Reset Digital for Good | Fighting the Search Monopoly With an Open Source Index: An Interview With Michael Granitzer From OpenWebSearch

In a recent interview with Reset Digital for Good, our OpenWebSearch.eu project leader Michael Granitzer gave insights into why the already available pilot of the Open Web Index (OWI) is an essential cornerstone of European digital sovereignty. 

“Our mission is to break up the silo of a single search engine. We’re doing this by crawling the web, collecting web pages and preparing them to be consumed by search engines. Preparing them involves cleaning advertisements and navigation links, then extracting the main content. This index can be used by individuals or organisations to build their own search engines,” he states.

But Granitzer’s argument extends beyond traditional web search, which is itself shifting more and more towards generative and agentic AI solutions.

“…In an ideal scenario, the AI model is running on my machine and controlling my data. It’s a tool that helps me, conducts searches on my behalf and understands what I want to do. I’m talking about small language models rather than large language models. A model that helps me search, aggregate and synthesise information based on search endpoints that I choose,” he argues.

A powerful web index as part of a larger web data infrastructure is a non-negotiable in that sense. 

Read the full interview here:

https://en.reset.org/fighting-the-search-monopoly-with-an-open-source-index-an-interview-with-michael-granitzer-from-openwebsearch/