From shop counter to online catalogue: Inside the DTCommerce project

A Slovenian team set out to build open-source tools that help small retailers go digital by turning a product list from a spreadsheet into an online shop – with AI-enhanced descriptions and images, in just a few clicks

For small and medium-sized brick-and-mortar retailers, the move from physical shops to e-commerce is a long and cost-intensive process. These businesses typically have an accounting system with a list of products, perhaps a supplier’s website with technical specifications, and neither the time nor the budget to manually write product descriptions, source images, and populate an online shop for hundreds or thousands of items. The result is that many small retailers either delay their digital transition or end up with online catalogues that are sparse, poorly described, and unappealing to customers.

The main challenge is not a lack of products but a lack of digital product content. A physical shop’s inventory usually exists as a list of names, SKU (stock-keeping unit) codes, and prices in accounting software. An online shop needs much more: well-crafted product descriptions, high-quality images, metadata, and engaging presentation. Creating this content quickly becomes a substantial undertaking.

The DTCommerce project, carried out by the Slovenian company ZenLab with funding from the European OpenWebSearch.eu project, set out to solve this problem with an automated solution. The idea is simple: take the product list a shop already has, find the corresponding product information on the web, enhance the descriptions using AI, and deliver the result as a ready-to-use online shop – with minimal manual effort.

The Approach: Automated Extraction and AI Enhancement

The DTCommerce system operates in two stages. The first stage is a web crawling process that, given a list of product URLs from supplier or manufacturer websites, automatically extracts the key product information, such as title, description, imagery, price, and technical specifications. The crawler is built on Scrapy, a well-established open-source web scraping framework, and includes support for structured data formats (JSON-LD) as well as domain-specific extractors for particular target sites.
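
As a rough illustration of what such an extractor looks like, the sketch below is a minimal Scrapy spider that reads a list of product URLs and pulls schema.org Product data out of embedded JSON-LD blocks. The input file, field names and output format are assumptions for illustration, not DTCommerce’s actual code.

```python
# Minimal sketch of a Scrapy spider that extracts schema.org Product data
# embedded as JSON-LD. The input file and field names are illustrative only.
import json
import scrapy


class ProductSpider(scrapy.Spider):
    name = "product_jsonld"

    def __init__(self, url_file="product_urls.txt", **kwargs):
        super().__init__(**kwargs)
        with open(url_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Look for <script type="application/ld+json"> blocks on the page.
        for block in response.xpath('//script[@type="application/ld+json"]/text()').getall():
            try:
                data = json.loads(block)
            except json.JSONDecodeError:
                continue
            items = data if isinstance(data, list) else [data]
            for item in items:
                if item.get("@type") == "Product":
                    offer = item.get("offers") or {}
                    if isinstance(offer, list):
                        offer = offer[0] if offer else {}
                    yield {
                        "url": response.url,
                        "title": item.get("name"),
                        "description": item.get("description"),
                        "image": item.get("image"),
                        "price": offer.get("price"),
                        "currency": offer.get("priceCurrency"),
                    }
```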

The second stage is where AI comes in. The raw product descriptions extracted from supplier websites are often technical, dry, and written for a trade audience rather than end consumers. DTCommerce feeds these descriptions to an AI language model (Perplexity AI’s sonar-pro), which rephrases them into clearer, more engaging copy while preserving every technical detail – dimensions, model numbers, and specifications. The original description is retained alongside the enhanced version, so nothing is lost. The result is a set of enriched product records in a standardised format, ready to be imported into an e-commerce system.
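
The enrichment step can be sketched along the following lines: send the raw description to Perplexity’s OpenAI-compatible chat API and ask for a consumer-friendly rewrite that keeps every technical detail. The prompt wording and error handling are assumptions; the project’s actual prompts are not reproduced here.

```python
# Hedged sketch of the AI enhancement step using Perplexity's OpenAI-compatible
# API. The system prompt below is an illustrative assumption.
from openai import OpenAI

client = OpenAI(api_key="YOUR_PERPLEXITY_API_KEY",
                base_url="https://api.perplexity.ai")

def enhance_description(raw_description: str) -> str:
    response = client.chat.completions.create(
        model="sonar-pro",
        messages=[
            {"role": "system",
             "content": ("Rewrite the product description below so it is clear and "
                         "engaging for end consumers. Preserve every technical "
                         "detail exactly: dimensions, model numbers, specifications.")},
            {"role": "user", "content": raw_description},
        ],
    )
    return response.choices[0].message.content

# The original description is kept alongside the enhanced version.
raw = "Cordless drill, 18 V, 2.0 Ah Li-ion battery, 13 mm chuck, model XDR-18."
record = {"description_original": raw,
          "description_enhanced": enhance_description(raw)}
```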

From Pipeline to Plugin: A Few Clicks to a Full Shop

To make the pipeline usable for non-technical shop owners, the ZenLab team built a WordPress/WooCommerce plugin that wraps the entire workflow into a simple administrative interface. The process works as follows: the shop owner exports a product list from their accounting software as an Excel file and uploads it to the plugin. The plugin creates basic product entries in WooCommerce, sends them to the enrichment service, and automatically populates each product page with enhanced descriptions and images – all without requiring the shop owner to edit a single product manually.
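
The plugin itself is a WordPress/PHP component, but the import step it automates can be illustrated against the public WooCommerce REST API. In the sketch below, each spreadsheet row becomes a bare draft product that the enrichment service later fills in; the shop URL, credentials, file name and column names are all placeholders.

```python
# Illustrative sketch only: create bare WooCommerce products from an Excel
# export. Credentials, file name and column names are assumptions.
import pandas as pd
from woocommerce import API  # pip install woocommerce

wcapi = API(
    url="https://example-shop.com",
    consumer_key="ck_...",
    consumer_secret="cs_...",
    version="wc/v3",
)

products = pd.read_excel("accounting_export.xlsx")  # e.g. columns: name, sku, price

for _, row in products.iterrows():
    # Create a draft product entry; description and images are added later
    # by the enrichment service.
    wcapi.post("products", {
        "name": row["name"],
        "sku": str(row["sku"]),
        "regular_price": str(row["price"]),
        "status": "draft",
    })
```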

An Honest Detour: When the Open Web Index Didn’t Have What Was Needed

DTCommerce was originally designed to use the Open Web Index (OWI) as its primary data source for finding product information across the web. The vision was that a shop owner could provide a product name or SKU code, and the system would search the OWI to find matching products on supplier and manufacturer websites, automatically retrieving descriptions and images.

In practice, the specific e-commerce sites that the project’s use cases required were not present in the OWI at the time of development. This is not surprising: the OWI is still being built, and its coverage of niche commercial sites – particularly smaller B2B suppliers – is not yet comprehensive. The team adapted by switching to direct web scraping of predefined supplier URLs, which allowed the project to deliver its core functionality on schedule.

Why the project matters

DTCommerce addresses a real and widespread problem. Across Europe, millions of small retailers face pressure to establish an online presence but lack the resources to do so effectively. By automating the most labour-intensive part of the process – creating digital product content – the project lowers the barrier to entry in a meaningful way. The fact that the tools are open source and built on widely used platforms (WordPress, WooCommerce, Scrapy) means they are accessible to a broad audience and can be adapted to different markets and product domains.

The project also illustrates a type of application that open web search infrastructure is well suited to support. The ability to search an open web index for product information – matching a local shop’s inventory against the broader web – is precisely the kind of use case that depends on open, non-proprietary access to web data. As the OWI matures, tools like DTCommerce stand to benefit directly. The project overall also demonstrates both the potential of the OWI-based approach and its current practical limits.

Final Outlook

The DTCommerce activities will continue with further development of tools compatible with other e-commerce platforms as well. The tool will remain open source, and the company will develop an automated portal for data exchange and enrichment, available on demand for various e-commerce integrations.

Find the full project report here: https://zenodo.org/records/18300935

The DT Commerce project was funded under the OpenWebSearch.eu initiative (Horizon Europe, Grant Agreement 101070014, Call #2).

 

The case for Neural Crawling: Inside the FUN project

A research team from Pisa and Glasgow proposes that AI language models should decide which web pages to download – and shows why this matters for the future of search

Before a search engine can find anything, it must first build a collection of web pages to search through. This collection is assembled by a crawler – a piece of software that systematically visits web pages, follows links, and downloads content. The decisions the crawler makes about which pages to prioritise determine, in a very direct way, what the search engine will eventually be able to find.

For over two decades, the dominant approach to crawling prioritisation has been PageRank and related link-analysis methods: pages that are linked to by many other important pages are assumed to be important themselves. This was a reasonable assumption in the era of keyword search. But search is changing. Users increasingly ask questions in natural language rather than typing keywords, and automated systems like retrieval-augmented generation (RAG) pipelines issue their own queries to search engines. These new kinds of queries demand pages with rich, coherent, meaningful content – and there is no guarantee that such pages are also the most popular or the most linked-to.

The FUN project – Focused Neural Crawling – funded under the European OpenWebSearch.EU project and carried out by researchers at the University of Pisa and the University of Glasgow, tackles this problem head-on. It proposes a new paradigm: instead of using link popularity to decide what to crawl, use AI language models to estimate the semantic quality of web pages and prioritise accordingly.

Why crawling matters more than you might think

It is easy to focus on the visible parts of a search engine – the ranking algorithms, the interface, the speed of results – and overlook the crawler. But the crawler is the primary content filter in the entire search pipeline. It decides what gets downloaded, stored, and indexed. Everything that happens downstream – indexing, ranking, retrieval – operates only on the content the crawler has already collected. A sophisticated ranking algorithm cannot compensate for a poor crawling strategy: if valuable pages were never downloaded, they simply do not exist as far as the search engine is concerned.

The web is vast, and no crawler can download everything. Choices must be made, and the heuristics that guide those choices shape the quality of the entire search corpus. Traditional heuristics like PageRank assume that a page’s importance can be inferred from its position in the web’s link structure. This works well when search queries are short keyword strings and when the most popular pages tend to be the most useful. But the FUN team argues that this assumption is increasingly outdated.

The Shift: From link popularity to semantic quality

The core idea behind neural crawling is straightforward: instead of asking “How popular is this page?”, the crawler asks “How likely is this page to contain content that would be useful for answering a search query?” To answer this question, the system uses a neural quality estimator – a small language model that has been trained to predict, from the text of a document alone, whether that document is likely to be relevant to any query. The model does not need to know what the query will be; it assesses the intrinsic quality of the text itself: its coherence, informativeness, and semantic richness.

There is an obvious practical problem: the crawler needs to decide whether to prioritise a page before it has downloaded that page. It cannot read the text of a page it has not yet fetched. The FUN team addresses this with two quality propagation strategies. The first is based on the observation that web pages tend to link to other pages of similar quality. If a high-quality page links to an unknown page, there is a reasonable probability that the unknown page is also of decent quality. The crawler can therefore use the quality of already-downloaded pages as a proxy for the likely quality of the pages they link to.
The second strategy works at the domain level: pages within the same domain tend to have similar quality. Once the crawler has downloaded a few pages from a domain, it can estimate the quality of the domain as a whole and use that estimate to prioritise other pages from the same domain.
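
A minimal sketch of how these two signals could drive a crawl frontier is shown below: an unseen URL is scored by blending the quality of the page that linked to it with the running average quality of its domain. The blend weight, data structures and quality scores are placeholders, not the FUN project’s implementation.

```python
# Sketch of quality-propagation scoring for a crawl frontier. Assumes a neural
# quality score in [0, 1] is available for every downloaded page.
import heapq
from collections import defaultdict
from urllib.parse import urlparse

domain_quality = defaultdict(lambda: (0.0, 0))  # domain -> (sum of scores, count)
frontier = []  # max-heap, implemented by pushing negated priorities

def record_page(url: str, quality: float):
    """Update domain statistics once a page has been downloaded and scored."""
    domain = urlparse(url).netloc
    total, n = domain_quality[domain]
    domain_quality[domain] = (total + quality, n + 1)

def enqueue(url: str, parent_quality: float, alpha: float = 0.5):
    """Priority blends the linking page's quality with the domain average."""
    domain = urlparse(url).netloc
    total, n = domain_quality[domain]
    dom_avg = total / n if n else parent_quality  # no domain data yet: use link signal
    priority = alpha * parent_quality + (1 - alpha) * dom_avg
    heapq.heappush(frontier, (-priority, url))

def next_url():
    """Pop the most promising URL to crawl next."""
    return heapq.heappop(frontier)[1] if frontier else None
```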

What the experiments show

The team tested their approach through large-scale simulations on ClueWeb22-B, a web corpus of 87 million pages, using two different sets of test queries. One set consisted of traditional keyword queries; the other consisted of natural language questions.

The results are striking. On natural language queries, the neural crawling strategies consistently outperformed PageRank in both the quality of the crawled corpus and the effectiveness of downstream search results. The domain-level strategy (DomQ) was particularly strong, building corpora that led to substantially better retrieval performance. On traditional keyword queries, the neural strategies performed comparably to PageRank – they did not lose ground on the type of search that PageRank was designed for.

The efficiency results were also notable. Neural crawlers collected relevant pages faster than PageRank in the early stages of the crawl, meaning they built useful search corpora more quickly and with less wasted bandwidth downloading low-value pages. This matters in practice, because crawling the web is expensive in terms of network resources, storage, and computing time.

A key finding underpinning the domain-level approach is that the semantic quality of a web page is strongly correlated with the average quality of other pages on the same domain (Pearson correlation of 0.649). By contrast, the equivalent correlation for PageRank scores is much weaker (0.272). In other words, knowing that a domain tends to host high-quality content is a much better predictor of individual page quality than knowing that a domain is well-linked.

Why it matters

The FUN project is significant for the OpenWebSearch.EU project in a very direct way. Running an open European web index requires crawling decisions – and the quality of those decisions determines the quality of the index. If an open web index is built using traditional crawling heuristics, it inherits the biases of those heuristics: a preference for popular, well-linked content at the expense of semantically rich but less connected pages. Neural crawling offers a way to build corpora that are better suited to the modern demands of natural language search and AI-powered information retrieval.

To make this practical, the FUN team produced not just research findings but usable software. Their quality scoring tools are compatible with the OWS parquet file format and integrated into Resilipipe, the open-source content analysis framework used by OpenWebSearch.eu.

The FUN project demonstrates that the way we crawl the web should evolve alongside the way we search it. As search queries become more conversational and AI systems become major consumers of search infrastructure, the assumption that link popularity is the best guide for crawling priorities is no longer sufficient. Neural quality estimation offers a complementary – and in many cases superior – signal.

What’s next

Future work could explore combining neural and link-based signals in hybrid strategies, using ensembles of quality estimators that assess different dimensions of page quality (spam, machine-generated content, factual accuracy), and evaluating neural crawling with more advanced retrieval models beyond BM25. The approach could also be adapted for other tasks that depend on corpus quality, such as building high-quality training data for large language models.

To read the full technical report, go here: https://zenodo.org/records/17359141

The FUN project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

How Dutch municipalities are sharing Search Intelligence to serve citizens better: Inside the CIFFIL Service project

The CIFFIL Service project shows that open web index standards can help small municipalities improve their search quality by accessing results from larger ones

Search engines work best when they have a lot of data to learn from. The more documents in a collection, the better the system can distinguish between common words and genuinely informative ones – and therefore the better it can identify what is relevant to a query. This is a well-known principle in information retrieval, and it creates an obvious problem for anyone who needs to search a small collection of documents: the search results are simply not as good as they could be.

The CIFFIL Service project, funded under the European OpenWebSearch.EU project, tackled exactly this problem – in a setting with direct consequences for citizens. Spinque, a Dutch search technology company, builds search systems for municipalities that allow council members and residents to search through publicly available government documents. Some of these municipal collections are small, containing fewer than 10,000 documents, and are full of domain-specific jargon. The result is that search quality suffers. The CIFFIL project set out to fix this by allowing municipalities to share their search index data with one another through an open standard.

The problem: Small collections, unreliable statistics

Most search engines use some variant of a ranking algorithm called BM25. At its core, BM25 judges the relevance of a document to a query by looking at how often the query terms appear in the document and how rare those terms are across the collection as a whole. Terms that appear in only a few documents are strong signals of relevance.

This is where small collections often fall short. When a collection has only a few thousand documents, the estimates of how common or rare a term is become very unreliable. The ranking algorithm, relying on these skewed statistics, makes poor decisions about what is relevant. The result for the user is a search experience that feels hit-or-miss.

The solution: Sharing index data through an open standard

The CIFFIL team’s approach is simple. If a small municipality’s search system suffers from unreliable statistics because its collection is too small, why not supplement those statistics with data from a larger municipality that deals with similar types of documents? After all, Dutch municipal documents share a common vocabulary of administrative, legal, and policy language.

The technical mechanism for this sharing is the Common Index File Format, or CIFF – an open standard developed in the information retrieval research community for exchanging inverted index data between systems. An inverted index is the core data structure behind a search engine: it maps every term in a collection to the documents in which that term appears, along with statistics such as how often it appears and in how many documents.

Spinque integrated CIFF support into its search platform, Spinque Desk. This involved building a CIFF reader (to import index data), a CIFF writer (to export index data), and – critically – a modified BM25 ranking component that can combine the statistics from a local collection with those from an external CIFF index. When a small municipality’s search system uses this combined approach, it effectively “borrows” the larger municipality’s understanding of which terms are common and which are rare, while still searching its own documents.
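
The core idea can be sketched with a few lines of standard BM25 in which the document-frequency statistics are pooled from the local collection and an external index such as one imported from a CIFF file. The parameter values and the exact way Spinque Desk combines the statistics are assumptions for illustration.

```python
# Sketch: BM25 where IDF is computed from local + external (borrowed) statistics.
import math

K1, B = 1.2, 0.75  # common default BM25 parameters

def idf(term, local_df, local_n, external_df, external_n):
    # Pool document frequencies so that rare/common terms are estimated from a
    # much larger sample than the small local collection alone.
    df = local_df.get(term, 0) + external_df.get(term, 0)
    n = local_n + external_n
    return math.log((n - df + 0.5) / (df + 0.5) + 1)

def bm25(query_terms, doc_tf, doc_len, avg_doc_len,
         local_df, local_n, external_df, external_n):
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_doc_len))
        score += idf(term, local_df, local_n, external_df, external_n) * norm
    return score
```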

Proof of concept

The team tested the functionality developed for this project in two ways. First, they ran manual experiments with CIFF exports to check whether they could replicate effectiveness results on open datasets. Second, they wrote unit tests to ensure that the parser and writer produce indexes conforming to the CIFF specification.

The results were clear. The small collection performed substantially worse than the baseline, confirming that skewed statistics degrade search quality. But when the small collection borrowed statistics from the larger one, performance not only recovered but actually slightly exceeded the baseline – because the small collection, now ranked with accurate statistics, contained a higher concentration of relevant documents.

In practice

The project created CIFF indices for four major Dutch municipalities: Amsterdam, Utrecht, Nijmegen, and Almere. A live deployment was initiated for the municipality of Nieuwegein, a smaller city near Utrecht, using the Utrecht index as the background collection. Evaluation of the real-world impact on user experience is ongoing.

All of the CIFF tools developed during the project have been released as open-source software, and the export service ensures that published indices are automatically updated when the underlying data changes.

Why it matters

The CIFFIL project illustrates a principle at the core of the OpenWebSearch.eu idea: that open, interoperable standards can enable forms of cooperation that proprietary systems cannot. By sharing index statistics through CIFF, municipalities can improve their search quality without sharing their actual documents, without depending on a single commercial provider, and without each needing to build a large collection of their own. It is a form of search infrastructure as a public good.

The approach is also notable for its simplicity. It does not require neural models, large language models, or expensive computational resources. It works by making better use of data that already exists, through a well-understood ranking algorithm and an open file format.

What’s next

The immediate priorities are completing the open publication of all four municipal indices, conducting user-experience evaluations in the live deployments, and publishing the experimental findings as a research paper. Longer-term, the approach could be extended to additional municipalities and to other domains where small document collections need better search – such as cultural heritage institutions, local archives, or specialised libraries. The underlying principle – that sharing standardised index data can improve search quality without centralising control – has broad applicability wherever open, cooperative search infrastructure is valued.

Find the full project report here: https://zenodo.org/records/17750643

The CIFFIL Service project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

Building trustworthy access to medical information: Inside the TILDE project

The TILDE project builds a health search system that doesn’t just find answers – it checks them for bias, explains the underlying reasoning, and lets users explore the evidence visually

Search for a health question online and you will get plenty of results. But how much can you trust what you find? Are the top results there because they are the most accurate, or because they happen to be the most popular? Are they showing you the full picture, or a skewed one – biased toward certain demographics, viewpoints, or types of sources?

The TILDE project – Trustworthy Access to Knowledge from the Indexed Web – funded under the European OpenWebSearch.eu project and carried out by Know Center Research GmbH in Austria, tackles this issue directly. It builds a health domain search system on top of the Open Web Index that goes beyond finding relevant documents to actively examining search results for bias, ensuring viewpoint diversity, and providing visual tools to help users explore the evidence for themselves.

The Problem: Bias in health search

Health information is one of the most searched-for categories on the web, and also one of the most consequential. A search for COVID-19 treatment options, for example, should ideally return results that are medically accurate, drawn from credible sources, and representative of different perspectives – official health guidance, clinical research, patient experiences. In practice, standard search systems optimise for another kind of relevance, which is often approximated by popularity and click behaviour. This can systematically favour certain types of content while marginalising others.
The problem is compounded when large language models are involved. LLM-based systems, including RAG (retrieval-augmented generation) pipelines, inherit and can amplify biases present in both their training data and the documents they retrieve.
A search result list that is geographically skewed, lacks viewpoint diversity, or reinforces stereotypes about particular demographic groups is not just an academic concern – it can directly affect how people understand their health options.

The Approach: Three modules for trustworthy search

TILDE addresses this through three integrated modules, each tackling a different dimension of the problem.

The first module extracts medical knowledge from the Open Web Index. Starting from approximately 200,000 health-related websites identified in the OWI, the team extracted medical entities – diseases, symptoms, drugs, procedures – using a named entity recognition model, then standardised these entities against the UMLS clinical ontology (a comprehensive medical terminology system). This creates a structured knowledge layer on top of the raw web content. The extracted entities and their relationships form a medical knowledge graph that links websites to each other and to clinical concepts. A hybrid search engine combines entity-based retrieval (finding pages that mention specific medical concepts) with semantic similarity search (finding pages whose content is meaningfully related to the query), fusing the results to balance precision and recall.
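
The report summary above does not spell out the fusion method, so the sketch below uses reciprocal rank fusion – a common, simple way to merge two rankings – purely as an illustration; the document ids are made up.

```python
# Illustrative fusion of an entity-based ranking and a semantic ranking using
# reciprocal rank fusion (RRF). This stands in for whatever fusion TILDE uses.
def reciprocal_rank_fusion(entity_hits, semantic_hits, k=60):
    """Each argument is a list of document ids, best first."""
    scores = {}
    for ranking in (entity_hits, semantic_hits):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(
    entity_hits=["doc3", "doc7", "doc1"],    # pages mentioning the UMLS concepts
    semantic_hits=["doc7", "doc2", "doc3"],  # pages whose embeddings match the query
)
print(fused)
```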

The second module checks search results for fairness and trustworthiness. This is TILDE’s most distinctive contribution. Built on DSPy, a Stanford framework for programmatic LLM pipelines, the trustworthiness module processes search results through three stages. First, each candidate document is enriched with fairness-related attributes: its viewpoint (official guidance, patient narrative, investigative journalism), its source credibility (from high-authority institutional sources down to user-generated content), whether its content is factual or anecdotal, and a gender neutrality score. Second, an intelligent re-ranker uses these attributes to reorder results according to a strict hierarchy: maximise fairness first, then filter for credibility, then ensure viewpoint diversity. The system uses chain-of-thought reasoning, meaning it explains its re-ranking decisions step by step. Third, a stereotype audit inspired by established bias benchmarks checks both the system’s internal reasoning and its user-facing output for harmful stereotypes – a safety net against the system itself introducing bias.
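
As a hedged illustration of what such a DSPy step might look like, the sketch below defines a chain-of-thought re-ranking signature whose fields mirror the attributes described above. The actual TILDE prompts, field names and hierarchy logic are not reproduced here.

```python
# Hedged sketch of a DSPy chain-of-thought re-ranker; field names are assumptions.
import dspy

class FairnessRerank(dspy.Signature):
    """Reorder health search results: maximise fairness first, then filter for
    source credibility, then ensure viewpoint diversity."""
    query = dspy.InputField()
    documents = dspy.InputField(desc="candidate results annotated with viewpoint, "
                                     "credibility, factuality and gender-neutrality")
    reranked_ids = dspy.OutputField(desc="document ids in their new order")
    rationale = dspy.OutputField(desc="step-by-step explanation of the reordering")

reranker = dspy.ChainOfThought(FairnessRerank)
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported LLM must be set up
# result = reranker(query="covid-19 treatment options", documents=enriched_docs)
```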

The third module provides visual aids to help users understand the evidence. Rather than presenting search results as a flat list of links, the visual web interface allows users to explore medical information through multiple lenses: highlighted medical concepts within document text, faceted search by entity type, tag clouds and bar charts showing the frequency of different symptoms or drugs across results, co-occurrence matrices revealing relationships between medical concepts, and an interactive knowledge graph that can be expanded and filtered.

Why It Matters: Making fairness operational

There is no shortage of academic research on bias in search systems. What is less common is work that takes established fairness metrics – like the NFaiRR measure of retrieval fairness – and turns them into actionable components within a working search pipeline. TILDE does exactly this. The re-ranking module does not merely measure bias after the fact; it actively uses fairness criteria in real time to reorder results, while maintaining credibility and diversity as additional constraints. The chain-of-thought reasoning makes the process transparent: users and auditors can see why results were ranked the way they were.

The health domain is the testbed, but the approach is not limited to it. The same pipeline – entity extraction, hybrid retrieval, fairness-aware re-ranking with transparent reasoning, and visual analytics – could be applied to any domain where search results carry real-world consequences: legal information, financial advice, educational content, public policy. The fact that it is built on the Open Web Index, rather than on a proprietary search engine, means the underlying data is open and the approach is reproducible.

What’s Next

The immediate next step is completing the integration of the hybrid search across all components of the visual interface. Longer-term priorities include optimising the trustworthiness pipeline for real-time performance, extending the approach to additional health sub-domains, and conducting user studies to understand how fairness-aware re-ranking and visual analytics actually affect the way people seek and evaluate health information.

To read the full technical report, go here: https://zenodo.org/records/17542369

The TILDE project was funded under the OpenWebSearch.eu initiative (Horizon Europe, Grant Agreement 101070014, Call #2).

Teaching Search Engines to understand arguments: Inside the AKASE Project

A European research project is building a knowledge graph of public argumentation – and using it to make web search smarter

When you search the web for a contentious topic – e.g. how AI should be regulated – you get a list of links ranked by relevance to your keywords. What you do not get is any indication of whether the arguments in those documents are well-structured, logically sound, or represent a balanced range of perspectives. The search engine has no understanding of argumentation. It cannot tell you which pages contain strong reasoning and which are riddled with logical fallacies.

The AKASE project – Argumentation Knowledge-Graphs for Advanced Search Engines – set out to change this. Funded under the European OpenWebSearch.EU project and carried out at the University of Groningen, AKASE has built a large-scale computational map of public argumentation, extracted from tens of thousands of web documents, and used it to power two new kinds of tools: a search engine that ranks results by argumentative quality, and a multi-agent deliberation platform where humans and AI reason together.

The Problem: Arguments are everywhere, but remain unleveraged in web search

Public debate on the internet is vast. People argue about climate policy, healthcare, technology regulation, and countless other topics across news articles, opinion pieces, forums, and dedicated debating platforms. But the argumentation threads are scattered, unstructured, and variable in quality. Some arguments are carefully reasoned and well-supported; others rely on logical fallacies or present only one side of an issue.

Project AKASE addresses these challenges by developing a computational framework for extracting, organizing, and presenting argumentative content from the web in a coherent, scalable, and actionable way.

The Approach: Mapping the Structure of Public Debate

The AKASE team’s approach begins with a simple question: what are people actually arguing about? To answer it, they collected nearly 30,000 arguments from five online debating platforms and used a combination of advanced text embeddings and clustering algorithms to identify the distinct “issues” – the specific questions or sub-problems – that these arguments revolve around. After removing duplicates and merging near-identical formulations using large language models, they arrived at a set of roughly 16,000 unique issues, organised into 16 thematic domains ranging from politics and technology to health and ethics.
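
A minimal sketch of this issue-identification step is shown below, with a small sentence-embedding model and an off-the-shelf clustering algorithm standing in for whatever the AKASE team actually used; the example arguments and threshold are made up.

```python
# Sketch: embed argument texts and cluster them so that each cluster
# corresponds to one "issue". Model and threshold are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

arguments = [
    "AI systems should be audited by independent regulators.",
    "Independent audits of AI models are necessary for accountability.",
    "Carbon taxes are the most efficient way to cut emissions.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(arguments, normalize_embeddings=True)

clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
)
labels = clustering.fit_predict(embeddings)

# Arguments sharing a label address the same issue; near-identical formulations
# within a cluster can then be merged with a large language model.
for text, label in zip(arguments, labels):
    print(label, text)
```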

But structured debating platforms represent only a fraction of online argumentation. Most arguments exist in ordinary web pages – news articles, opinion columns, policy documents – where they are expressed in natural language without explicit labels. To capture this unstructured content, the AKASE team developed an automated pipeline that reads web documents, identifies which sentences are argumentative, classifies them as claims or supporting premises, and determines the relationships between them.

The team went further by enriching these arguments with two additional layers of analysis. First, they annotated arguments with the human values they express – freedom, equality, security, and so on – capturing not just what people argue but the moral commitments that underpin their reasoning. Second, they developed methods for assessing argument quality: a system that generates probing critical questions to test an argument’s assumptions, and a multi-agent framework where multiple AI models deliberate with each other to detect logical fallacies.

The result: A Knowledge Graph of Argumentation

All of this analysis feeds into the project’s central artefact: the Argumentation Knowledge Graph, or AKG. This is a large, interconnected data structure that links topics, issues, claims, and premises across thousands of documents. It captures the logical and rhetorical relationships between argumentative units – which claims support each other, which ones conflict, and which essentially make the same point in different words.

Starting from an initial set of around 50,000 documents retrieved from the Open Web Index, the team extracted nearly half a million argumentative units and identified millions of relationships between them. A second processing phase expanded the data source to over 105 million documents. The resulting graph contains tens of thousands of interconnected nodes, with over 90 per cent belonging to a single connected component – meaning that you can navigate from virtually any argument to any other through a chain of related reasoning.

Two Applications: Smarter Search and Structured Deliberation

The AKASE team translated this knowledge graph into two practical tools. The first is an argument-aware search engine. When you submit a query, the system retrieves relevant documents as a conventional search engine would – but then it reranks them based on the argumentative quality of each document. Three criteria were used, as sketched below:

  • how well claims are justified
  • how coherently the argument is structured
  • whether the document presents a balanced range of perspectives rather than a one-sided view
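
In simplified form, the reranking can be pictured as each retrieved document carrying a score for the three criteria and being reordered by a weighted combination. The weights below, and the way the individual scores would be produced, are assumptions for illustration rather than the AKASE implementation.

```python
# Simplified sketch of argument-aware reranking by weighted criteria scores.
from dataclasses import dataclass

@dataclass
class ScoredDocument:
    doc_id: str
    relevance: float       # conventional retrieval score
    justification: float   # how well claims are backed by premises
    coherence: float       # how well-structured the argumentation is
    balance: float         # whether multiple perspectives are represented

def argument_aware_rerank(docs, w_rel=0.4, w_just=0.2, w_coh=0.2, w_bal=0.2):
    def score(d):
        return (w_rel * d.relevance + w_just * d.justification +
                w_coh * d.coherence + w_bal * d.balance)
    return sorted(docs, key=score, reverse=True)
```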

The system also generates a concise summary of the top results and suggests related issues from the knowledge graph, helping users explore the broader landscape of debate around their query.

The second tool is ArgsBase, a multi-agent deliberation platform. ArgsBase creates a structured discussion involving multiple AI agents, a human user, and a moderator. The AI agents contribute arguments, counterarguments, and refinements; the moderator manages the flow of discussion; and a real-time analyser tracks the evolving state of the debate, producing summaries and visual argument maps. In an initial user study, participants found this multi-agent format more useful than interacting with a single AI, precisely because the diversity of perspectives and the structured format encouraged deeper thinking.

Why it matters

Today’s information environment does not lack arguments – it lacks tools for navigating them. By building a computational infrastructure that can extract, organise, evaluate, and present argumentative content from the open web, AKASE offers a different model of information access: one where the quality of reasoning is a first-class signal, not an afterthought.

The ArgsBase platform, in particular, points toward an intriguing future for human–AI interaction. Rather than using AI as an oracle that delivers answers, it positions AI models as participants in a structured reasoning process – one where disagreement is productive, perspectives are made explicit, and the human user remains an active agent rather than a passive recipient. This is a model of AI-assisted thinking that takes critical reasoning seriously.

What’s Next

The AKASE team has identified several directions for future work: expanding the knowledge graph dynamically as new arguments emerge on the web, incorporating multi-modal content (not just text), and refining the deliberation platform through more extensive user studies focused on practical decision-making scenarios. The argument-aware search engine will also benefit from reduced latency and broader domain coverage.

To read the full technical report, go here: https://zenodo.org/records/17674255

The AKASE project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

Fighting Misinformation with the Open Web Index: Inside the VERITAS project

How a European research team built a browser-based fact-checking assistant powered by the Open Web Index

In an information environment where misleading claims can spread in real time, the ability to verify what you read online is a necessity. The VERITAS project, funded under the European OpenWebSearch.EU project and conducted by DEXAI, set out to build a practical tool for exactly this purpose: an AI-powered assistant that sits in your browser, answers your questions with sourced evidence, and draws its knowledge not from a proprietary index controlled by a single corporation, but from an open, European web search infrastructure.

The Problem: Misinformation and the Limits of Conventional Search

The War in Ukraine has been accompanied by an unprecedented volume of online misinformation – from fabricated reports and manipulated imagery to subtly misleading narratives. For journalists trying to verify claims, researchers analysing media coverage, and ordinary citizens attempting to understand what is actually happening, conventional search engines offer limited help. They return ranked lists of links, but they do not assess the credibility of sources, provide citations for specific claims, or explain the basis for their answers. The burden of verification falls entirely on the user.

At the same time, the emergence of AI chatbots has introduced a new set of problems. Large language models can produce fluent, confident-sounding answers that are entirely fabricated – a phenomenon known as hallucination. Without mechanisms to ground their outputs in verifiable evidence, these systems risk becoming part of the misinformation problem rather than the solution.

The Approach: Retrieval-Augmented Generation via the Open Web Index

The DEXAI/VERITAS team adopted an approach known as RAG (Retrieval-Augmented Generation). The core idea: instead of prompting an AI model to generate answers from whatever it absorbed during training, you first retrieve relevant documents from a trusted knowledge base and then ask the model to compose its answer based specifically on those documents. Every claim in the response can thus be traced back to an identifiable source.

Rather than relying on a commercial search engine or a static dataset, the system draws its evidence from the Open Web Index (OWI).

In practice, the system works as follows. The VERITAS pipeline fetches the latest crawled web pages from the OWI (pulling the latest 30 days of crawled content). The dataset is then indexed using a semantic embedding model, which converts text passages into numerical vectors that capture their meaning. When a user poses a question, the system converts it into a similar vector, finds the most relevant passages in the index, and passes them – together with the question – to a language model (LLaMA 3.1), which generates a grounded response.
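
In condensed form, and with a generic embedding model and a stub standing in for the hosted LLaMA 3.1 call, the retrieve-then-generate flow looks roughly like this; the model names and prompt are assumptions, not the VERITAS configuration.

```python
# Condensed sketch of the RAG flow: embed passages, retrieve by similarity,
# generate a grounded answer. Embedding model and LLaMA hosting are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passages = ["...text passages extracted from the last 30 days of OWI crawls..."]
passage_vecs = embedder.encode(passages, normalize_embeddings=True)

def call_llama(prompt: str) -> str:
    # Placeholder: in the real system this would call a hosted LLaMA 3.1 model.
    raise NotImplementedError

def answer(question: str, top_k: int = 5) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(passage_vecs @ q_vec)[::-1][:top_k]
    context = "\n\n".join(f"[{i}] {passages[i]}" for i in best)
    prompt = ("Answer the question using only the numbered sources below and "
              f"cite them by number.\n\nSources:\n{context}\n\nQuestion: {question}")
    return call_llama(prompt)
```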

What They Built: A Fact-Checking Assistant in Your Browser

The finished product is a Chrome browser extension. Once installed, it provides a small popup where users can type questions in natural language and receive answers accompanied by source references.

The system is designed to serve different types of users in different ways. Journalists receive background information with explicit citations. Researchers get metadata-rich summaries. Members of the general public are given concise, jargon-free explanations. In the current prototype, the system is focused specifically on the War in Ukraine – a deliberate scoping decision that allowed the team to develop and validate the approach within a well-defined domain.

Why It Matters: Open Infrastructure for Trustworthy Information

VERITAS is significant not only for what it does, but for how it does it. By building on the Open Web Index rather than a proprietary data source, the project demonstrates that open search infrastructure can serve as the foundation for practical applications.

The RAG approach itself addresses one of the most persistent criticisms of AI-generated text: the lack of verifiability. By requiring the model to base its answers on retrieved documents and by presenting those documents to the user, VERITAS moves away from the “trust me” paradigm of conventional chatbots towards a “check for yourself” model of AI-assisted information access.

What’s next?

Future development could also introduce user feedback mechanisms, allowing the quality of responses to be improved over time, as well as streaming responses for a more interactive user experience. Perhaps most importantly, the VERITAS approach could be applied to other domains where information verification is critical – from public health to climate science to electoral integrity.
An outstanding challenge overall lies in the growing complexity of multi-model ecosystems. As retrieval systems, ranking components, embedding models, and large language models are increasingly combined—often across organisational and infrastructural boundaries—the integrity of the final answer depends on the entire chain. Outputs generated by one model may be ingested, summarised, or re-ranked by another, creating feedback loops that are difficult to detect and audit. In such environments, disinformation cascades can emerge when misleading or low-quality content propagates across interconnected systems, gaining credibility through repetition and algorithmic reinforcement. Ensuring traceability, cross-model accountability, and robust provenance mechanisms will be essential to prevent systemic amplification of false or manipulated claims.

To read the full technical report, go here: https://zenodo.org/records/17588890

The VERITAS project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).

Making Open Maps Richer: Inside the OMMS project

Project OMMS (Open Mobile Maps Search) was conducted by E Foundation with the aim of enriching OpenStreetMap data with web data from the Open Web Index and feeding the combined data into a competitive open-source maps app.

OpenStreetMap data is comprehensive in many areas of the world, but for the purposes of a maps app that aims to compete with Google or Apple’s offerings, the data freshness, data accuracy, and data richness all leave much to be desired. Additionally, OpenStreetMap almost without exception lacks authoritative information from business owners about their points of interest (POIs).

However, most larger businesses have invested some amount of energy into Search Engine Optimization (SEO), which involves surfacing this information online to be crawled and indexed by search engines. For the purposes of the project, E Foundation used the OpenWebSearch tools to crawl web-based POI information about businesses and provide it to users in the foundation’s open-source, open-data mobile maps application.
The goal was to create a compelling mobile maps experience that allows users to confidently explore, learn about, and navigate to points of interest nearby.

As a concrete solution, E Foundation set out to:
  • send the OpenWebSearch.eu team a list of URLs that are of interest to them so that OpenWebSearch can provide fresh crawl data as parquet files. 
  • create a web API that accepts a point of interest’s URL and returns information about that point of interest in a format that a maps app can easily ingest.
  • additionally, create a proxy that augments API responses from Pelias with metadata parsed from structured data provided by OpenWebSearch.

“We will start with opening hours and contact information, and from there expand to images, services offered, FAQs and anything that we feel may enrich the user’s experience in a mobile UI,” stated the project team at the start of the project.

Results

At the end of the project time, E Foundation has developed the following pieces of software: 

  • URL list generator to iterate through an OpenStreetMap extract and create a list of URLs to be crawled for structured data.
  • Batch processing program to transform the resulting Parquet files into .osc (OpenStreetMap changeset) files which amend OSM features to include fresh opening hours. This is option 1 for consuming crawl data.
  • Batch processing program to ingest the resulting Parquet files into a PostgreSQL database for use in a Point of Interest information server. This is option 2 for consuming crawl data.
  • POI Server. This connects to the PostgreSQL database and serves freshly updated opening hours, contact information, and FAQs for clients via an HTTP API.

Difficulties

The team initially started by crawling websites associated with points of interest in the metropolitan area surrounding Seattle, Washington, USA. Once they had validated that the software worked, they moved on to crawling POI websites for the entire planet. They found that approximately 12% of the websites attached to OpenStreetMap points of interest contain structured data, though the proportion varies by the type of POI: websites for department stores and fast-food restaurants contain structured data more often than other POIs.

The team ran into some trouble parsing opening hours. Some points of interest have opening hours that don’t conform to the standard format, but more often points of interest have opening hours listed for different parts of the same store. For example, grocery stores with pharmacies attached may have hours listed separately on the website, e.g. Mo-Fr 08:00-20:00, Mo-Fr 09:00-17:00. To avoid updating hours incorrectly, ambiguous data such as these were discarded. 
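
The discard rule can be pictured as a small validation step: keep the opening hours only if every entry matches the common schema.org pattern and no day range is listed twice. The regex and helper below are illustrative, not the OMMS code.

```python
# Sketch: accept opening hours only when they describe a single unambiguous schedule.
import re

SPEC = re.compile(r"^(Mo|Tu|We|Th|Fr|Sa|Su)(-(Mo|Tu|We|Th|Fr|Sa|Su))? "
                  r"\d{2}:\d{2}-\d{2}:\d{2}$")

def usable_opening_hours(entries):
    """Return a cleaned list of specifications, or None if the data is ambiguous."""
    entries = [e.strip() for e in entries if e.strip()]
    if not entries or not all(SPEC.match(e) for e in entries):
        return None  # empty or non-standard format
    day_ranges = [e.split(" ")[0] for e in entries]
    if len(day_ranges) != len(set(day_ranges)):
        # The same day range appears twice (e.g. store hours vs. in-store
        # pharmacy hours): impossible to tell which applies, so discard.
        return None
    return entries

print(usable_opening_hours(["Mo-Fr 08:00-20:00", "Sa 09:00-17:00"]))     # kept
print(usable_opening_hours(["Mo-Fr 08:00-20:00", "Mo-Fr 09:00-17:00"]))  # None
```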

What’s next?

Being pleased with the achieved results, the team acknowledges that there are still many opportunities to improve the completeness and accuracy of POI data. The main recommendations concern further services around ranking and backlinks.

Open-data geocoders often struggle to rank points of interest in textually ambiguous queries because they don’t have context on which POIs are more commonly searched for. Traditional ranking functions like BM25 go some way towards resolving this, but they’re far from perfect.
Search giants use past user behavior to help rank results, but open-data geocoders don’t have this luxury. However, OpenWebSearch is well-positioned to publish POI rankings based on PageRank or a similar algorithm. Open-data geocoders could ingest this and use it to augment their existing ranking algorithms. 

Another problem with open-data maps apps is they generally lack the richness that comes from years of collecting user-generated data like reviews and photos. Fortunately, much of this information is published to the public internet on e.g. food blogs and travel information websites. 

“We would like to explore using backlinks as an imperfect substitute for user-generated reviews and ratings. For example, a point of interest information page for a restaurant could contain links to a few food blog posts about that particular restaurant. Discovery of these backlinks is something that OpenWebSearch is uniquely positioned to do, and we think this is a very promising line of work to explore,” summarizes the OMMS team.

To read the full report, go here: https://zenodo.org/records/17815218

The OMMS project was funded under the OpenWebSearch.EU initiative (Horizon Europe, Grant Agreement 101070014, Call #2).

Plugging a university supercomputer into Europe’s Open Search infrastructure: The NordLink project

The University of Oldenburg connects its data centre and HPC cluster to OpenWebSearch.eu, demonstrating how academic infrastructure can contribute to a distributed European web index

A European open web search infrastructure should not depend on a single data centre or a single organisation. Instead, the idea is to connect different institutions, in different countries, with different kinds of computing resources, all contributing to the same shared infrastructure. The NordLink project, carried out by the University of Oldenburg following the OpenWebSearch.EU third-party funding call, is a concrete step in that direction: it connects a university’s high-performance computing resources to the already existing OpenWebSearch network.

What NordLink brings to the table

The University of Oldenburg’s contribution is not trivial. The resources committed to the project include 50 terabytes of S3-compatible cloud storage, two dedicated physical servers with 200 terabytes of combined storage, three virtual machines for testing and deployment, and – most notably – access to the university’s HPC cluster. This is a serious piece of computing infrastructure: 161 nodes, over 20,000 CPU cores, 145 terabytes of RAM, and 36 high-end NVIDIA GPUs (including A100 and H100 models) with a combined peak GPU performance exceeding 2 PFlop/s. The storage subsystem provides more than 4 petabytes of capacity.

For context, this is the kind of computing power typically used for large-scale scientific simulations, machine learning training runs, and data-intensive research. Making it available to the OpenWebSearch.eu project beautifully demonstrates that European academic HPC centres can play a meaningful role in search infrastructure – a domain traditionally dominated by commercial tech companies.

The Integration Challenge

The primary technical challenge for NordLink was integrating the university’s resources with the OWS infrastructure through HEAppE, a middleware system designed to provide HPC-as-a-Service capabilities. This middleware allows remote users and automated systems to submit computing jobs to the university’s cluster without needing direct access to the local systems.

The NordLink team deployed HEAppE on both virtual machines and physical servers, set up comprehensive monitoring using Prometheus and Grafana, configured the university’s S3 storage as a data staging area for the project, and provided the IP addresses of both physical servers and virtual machines for whitelisting to enable web crawling. A functional account was created to link the physical infrastructure to the HPC cluster, enabling job submission from the OWS network.

Challenges to consider

The team reported that the documentation for the HEAppE middleware was incomplete and difficult to follow, making deployment more laborious. Notably, the infrastructure provider EXAION reported the same issue in their final report as well.

Why university infrastructure matters for Open Search

European universities collectively operate enormous computing resources. HPC clusters, large-scale storage systems, high-bandwidth network connections, and skilled technical teams exist across hundreds of institutions. Most of this capacity is used for scientific research – climate modelling, genomics, particle physics, engineering simulations. But much of it also has periods of underutilisation, and the skills required to operate it overlap significantly with those needed for a web search infrastructure.

NordLink demonstrates that these resources can be connected to a shared infrastructure with reasonable effort.

What’s Next

Beyond the formal project period, the University of Oldenburg plans to maintain all committed infrastructure for a while – the S3 storage, physical servers, and virtual machines – and to complete the HEAppE integration with the HPC cluster. The team is also considering provisioning additional VMs running search index software such as OpenSearch or Vespa.ai, which would allow the university to host a searchable subset of the Open Web Index locally.

In conjunction with EXAION’s contribution from France, NordLink underpins the kind of infrastructure network that OpenWebSearch.eu is building: a distributed system where European organisations of different types – universities, data centre operators, research institutions – contribute to a shared, sovereign search infrastructure.

To read the full technical report, go here: https://zenodo.org/records/18259771

The project was funded under the OpenWebSearch.EU initiative (Horizon Europe, Grant Agreement 101070014, Call #3).

Building sovereign infrastructure for Open Web Search: Inside the EEI project

The French eco-responsible infrastructure provider EXAION (hence the project name EEI) provides GPU-powered computing to the OpenWebSearch.eu project

The OpenWebSearch.eu research project aims to create and maintain an independent, open web search infrastructure based in Europe. In order to establish powerful, sustainable and reliable Open Web Search services, a robust physical infrastructure is a basic requirement. The servers needed for crawling the web and for processing and indexing billions of pages have to exist somewhere. And where they exist, and who controls them, matters. In this context, the EEI project, funded under the OpenWebSearch.eu project, contributes high-performance computing infrastructure hosted in France, managed by European teams, and operated under European regulatory frameworks.

What EXAION provides

Exaion committed to providing GPU-accelerated bare-metal servers and virtual machines in its data centres. The hardware includes servers equipped with NVIDIA RTX A6000 GPUs – powerful graphics processing units increasingly used not just for rendering but for the computationally intensive tasks that modern search infrastructure demands, from training machine learning models to running web crawlers at scale.

The company commits to using circular-economy IT equipment – refurbished or second-life hardware – wherever possible, and to deploying open-source solutions in line with the broader ethos of the OpenWebSearch.eu project. All operations comply with GDPR and relevant European regulations.

What was done

The project unfolded in two phases over the course of a year. The first phase (September 2024 to March 2025) focused on setting up the infrastructure: deploying virtual machines, establishing Grafana monitoring systems, and assessing the feasibility of various integration options with the OWS technology stack. Some planned deployments, such as HPC middleware, were deferred because the matching use cases had not yet materialised.

The second phase (April to August 2025) delivered the project’s core objective: deploying and running the MASTODON crawler – one of the crawling components of the OpenWebSearch.eu infrastructure – on Exaion’s GPU servers. The experiment was tested and validated by Prof. Michael Granitzer from the University of Passau, who coordinates the overall OpenWebSearch.eu project. The crawler ran on five virtual machine instances, demonstrating that real OWS workloads can be effectively executed on sovereign European infrastructure.

Why it matters

Data sovereignty is not just about where data is stored but about who controls the infrastructure that processes it. The OpenWebSearch.eu project is designed as a distributed, cooperative infrastructure from the ground up. Having computing resources available in multiple European locations, operated by different organisations, reduces single points of failure and concentration risk. Moreover, EXAION’s commitment to circular-economy hardware and direct management without subcontractors demonstrates that sovereign infrastructure can also be sustainable infrastructure.

What’s next?

Current ideas involve an extension of the partnership to include high-performance computing use cases with SIMVIA.

Find the full project report here: https://zenodo.org/records/17777285

The project was funded under the OpenWebSearch.eu initiative (Horizon Europe, Grant Agreement 101070014, Call #3).

Update from OWS.EU partner projects: Part 3

Building an Open Web Index does not only include technical challenges, but also legal and societal ones. To extend our R&D activities around Open Web Search, we initiated the OWS.EU Community Programme. In our first Third-party call we asked for contributions on legally compliant data gathering and identifying legal or economic aspects that enable or block the development and maintenance of an Open Web Index. The call opened in March 2023 and ended with the onboarding of six new partner projects in November 2023. This blogpost includes updates from two projects that address legal challenges of providing an Open Web Index: ALMASTIC and LOREN.

ALMASTIC: Legal Evaluation of Technical Aspects of the Open Web Index

The ALMASTIC project aims to legally secure the Open Web Index by subjecting its technical aspects to legal evaluation. Its goal is to identify obstacles and mitigate legal risks in the process of successful global dissemination.

After helping to draft the first version of the Open Web Index License (OWIL 1), the team performed a comprehensive analysis of relevant legislation, case law, applicable guidelines and academic literature, forming a solid basis for the future legal compliance of OpenWebSearch.EU. The examination focused on five key areas:

  1. liability for third-party content,
  2. copyright,
  3. data protection,
  4. cybersecurity, and
  5. data governance.

The team around Prof. Kai Erenli from the University of Applied Sciences BFI Vienna will use the remaining time of the project to finalise their analysis, while keeping in mind that a final assessment is not always possible, as the legal situation in many relevant areas is currently highly dynamic and relevant legal acts have yet to be finalised or relevant case law identified.

More information about the ALMASTIC project.

LOREN: Legal Open European Web Index

The LOREN project seeks to provide a comprehensive analysis of the legal constraints and requirements for building and operating an Open Web Index. The project will specifically look into the legal implications of crawling, data storage and sharing as well as provide recommendations for building and operating an Open Web Index that complies with the European laws and regulations.

The team around the two lawyers Paul C. Johannes and Dr. Maxi Nebel compiled and analysed the laws and norms relevant to building and maintaining an Open Web Index. The results are currently being compiled into a legal opinion with actionable advice regarding crawling, searching, indexing, sharing of the index, and disclosure of data for scientific purposes.

Additionally, the LOREN team has started to work on the implications of the right to de-referencing. They are also analysing existing open-source and open-data licenses with regard to their suitability for use in an Open Web Index. In the coming months the team will concentrate on providing their legal opinion with advice concerning the selection and/or adaptation of open data licenses for the Open Web Index. In order to present a workable license, the LOREN team has worked together with other projects from call #1 of the OWS.EU Community Programme.

More information about the LOREN project.