OWS.EU Partner in Focus: CERN
CERN, the European Organization for Nuclear Research, is one of the world’s largest and most respected centres for scientific research. Its business is fundamental physics: finding out what the Universe is made of and how it works. Within the OpenWebSearch.EU (OWS) project, CERN plays a crucial role not only with regard to supercomputing infrastructure, but also through its contributions to ethical and legal assessments, project management, and communications support.
The CERN project team is led by Andreas Wagner, IT Solutions Architect, and complemented by Noor Afshan Fathima, Data Infrastructure Engineer, with whom we spoke about the project’s progress thus far.

Please describe your organisation’s tasks in the project. What is your field of expertise that you bring to the project?
CERN contributes across six work packages. In WP1 (Fill in Crawlers), WP4 (Search Applications), WP5 (Federated Data Infrastructures), and WP6 (Ethical, Legal, and Societal Aspects), it brings expertise in infrastructure engineering, development of science search applications, ELSA considerations, and governance of the federated open search infrastructure. It also contributes to WP7 (Dissemination and Communication) and WP8 (Project Management).
In WP1, we developed two purpose-built authenticated web crawlers for CERN’s internal web estate: cern-owler (Java/Playwright for HTML, producing 125 WARC archives totalling 3.1 GB) and owler_auth_pdf (Python/Tika for PDFs, extracting 2,211 documents from 90+ domains). Together they delivered 3.3 GB of content across 287 files to AccGPT (Accelerator GPT), CERN’s experimental AI-powered chatbot for the accelerator complex, covering 25,292 seed URLs across 182 CERN domains. We also participate in project-level coordination and contribute to the governance of the federated open search infrastructure, drawing on CERN’s institutional experience in managing large-scale, multi-partner scientific collaborations.
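To illustrate what a seed inventory spread across many domains looks like in practice, the following minimal Python sketch groups a seed list by host name. It is not taken from cern-owler or owler_auth_pdf: the function name `group_seeds_by_domain` and the example URLs are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_seeds_by_domain(seed_urls):
    """Group seed URLs by lower-cased host name, skipping entries without a host."""
    groups = defaultdict(list)
    for raw in seed_urls:
        url = raw.strip()
        host = urlsplit(url).hostname  # None when the entry has no network location
        if host:
            groups[host].append(url)
    return dict(groups)


# Hypothetical seeds, for illustration only.
example_seeds = [
    "https://home.cern/about",
    "https://home.cern/science",
    "https://example-site.web.cern.ch/docs/index.html",
    "not a url",
]
domains = group_seeds_by_domain(example_seeds)
print(len(domains), "domains,", sum(len(v) for v in domains.values()), "usable seeds")
```

Grouping by host in this way makes it easy to report how many domains a seed list covers and to spot malformed entries before a crawl campaign starts.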
In WP4, we developed two complementary POC search applications demonstrating MOSAIC’s flexibility. The first is an institutional search engine built from custom-crawled WARC archives of CERN’s public web content, fed through the full OWS preprocessing pipeline (resilipipe, open-web-indexer, lucene-ciff) into MOSAIC, indexing 4,352 documents from 6,738 crawled pages. The second is Nooon, a vertical search engine for disability-related knowledge, built from a 2.59-million-document OWI (Open Web Index) slice extracted via the command line interface OWILIX. Nooon is designed to support HR and Diversity & Inclusion offices in fair hiring and inclusive policy development.
In WP5, we operate and document the production server fleet that underpins the Open Web Index at CERN. This includes the URL Frontier coordination service, where we drove the migration from OpenSearch to ScyllaDB to handle 94.7 million operations per day across 6.68 billion URL records; the web crawling infrastructure, which processes up to 3 TB of content per day; an iRODS data federation spanning five European sites (CERN, LRZ, DLR, IT4I, CSC Finland); load balancers; metrics collection; and the application hosting servers. Our systematic documentation methodology, developed specifically for this project, covers discovery, deep-dive analysis, checklist completion, and academic chapter creation for each server.
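To put the daily figures into perspective, they can be converted into average per-second rates. The short Python calculation below is a back-of-the-envelope sketch: it assumes decimal terabytes and an even spread over 24 hours, and it treats the "up to 3 TB" figure as if it were sustained for a full day. Real traffic is likely to be burstier.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

ops_per_day = 94.7e6    # URL Frontier operations per day (from the interview)
bytes_per_day = 3e12    # 3 TB of crawled content per day, assuming decimal terabytes

avg_ops_per_second = ops_per_day / SECONDS_PER_DAY
avg_megabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 1e6
avg_megabits_per_second = avg_megabytes_per_second * 8

print(f"{avg_ops_per_second:,.0f} operations/s on average")
print(f"{avg_megabytes_per_second:.1f} MB/s, i.e. about {avg_megabits_per_second:.0f} Mbit/s")
```

Under these assumptions, the URL Frontier averages roughly 1,100 operations per second, and the crawlers ingest about 35 MB of content per second at the upper bound.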
In WP6, we contribute to the ethical, legal, and societal dimensions of the project. This includes work on ELSA (Ethical, Legal, and Societal Aspects) as they apply to open web search — particularly around privacy-preserving information retrieval for vulnerable populations, knowledge sovereignty, and the responsible handling of disability-related data. Our OSSYM 2025 publication and CERN preprint on empirical ethics in disability information retrieval directly address these concerns. We also contributed to the governance of the federated data infrastructure.
In WP7 (Dissemination and Communication), we have contributed to raising the visibility of the OpenWebSearch.EU project through major CERN communication channels. Three feature articles were published on home.cern and in the CERN Courier: “A European project to make web search more open and ethical” and “Ethical, open and non-commercial: Open Web Search project designed to provide Europe with an alternative” on the CERN news site, and “Towards an unbiased digital world” in the CERN Courier. These articles reached CERN’s global audience of researchers, engineers, and policy-makers, highlighting both the technical infrastructure and the ethical dimensions of building a European open web index. Beyond written dissemination, we have presented the project at multiple international venues including OSSYM 2024 and 2025, CS3 2025, EGI 2024, and the Cambridge Forum on AI, contributing to community building around open search infrastructure.
How is the project progressing? Which major milestones did you achieve?
The project is progressing well, with all CERN-side deliverables on track. Our major milestones include:
URL Frontier evolution: We completed the migration of the URL coordination service from OpenSearch to ScyllaDB, resolving critical performance bottlenecks caused by JVM garbage collection pauses and a write-heavy workload (99.88% writes). The production ScyllaDB deployment has now handled 24.3 billion operations in total with zero failures and continuous uptime, storing 5.07 TB in the crawl state database.
Authenticated crawling and AccGPT delivery: We developed two purpose-built crawlers — cern-owler (Java/Playwright for HTML) and owler_auth_pdf (Python/Tika for PDFs) — capable of navigating CERN’s Keycloak SSO. Together they delivered 287 files totalling 3.3 GB to AccGPT’s S3-based knowledge base, covering 25,292 seed URLs across 182 CERN domains.
Search application deployments: The institutional search engine indexes 4,352 documents from 6,738 crawled pages through the complete OWS pipeline. Nooon serves 2.59 million disability-focused documents through MOSAIC, demonstrating the OWI-to-vertical-search workflow at scale.
iRODS data federation: We established the CERN node in a five-site iRODS federation (CERN ↔ LRZ ↔ DLR ↔ IT4I ↔ CSC Finland), which will enable cross-institutional data sharing for the Open Web Index across three European countries.
Infrastructure documentation: We completed comprehensive documentation for 8+ production servers using our systematic five-phase methodology, producing deliverable-ready chapters for D5.3 covering the full infrastructure stack from load balancers to database clusters.
Publications: CERN’s work in the project has produced a substantial publication record. Six first-author papers span disability information retrieval and infrastructure architecture: two at OSSYM 2025 (Knowledge Sovereignty in Disability IR; Architecting the URL Frontier datastore), two at OSSYM 2024 (Federated Data Infrastructure for the Open Web Search; Architecting the OpenSearch service at CERN), one accepted at the Cambridge Forum Journal on AI: Culture and Society (empirical ethics, article in progress), and one submitted to SEASON, the Search Engines and Society Network (Ethical Privacy in Disability Data Retrieval). Co-authored contributions include the Springer book chapter on the Open Web Index (2024), the JASIST journal article on Open Web Index impact (2023), federated infrastructure papers at CS3 2025 and EGI 2024/2025, and two Zenodo deliverables (Pilot Infrastructure Launch; Training Material for Partners). A companion preprint is deposited at CERN’s document server (CERN-OPEN-2025-004). Three feature articles were published on home.cern and in the CERN Courier as part of WP7 dissemination.
What are the challenges you have been facing (regarding your tasks)?
Authenticated crawling at institutional scale. CERN’s web estate sits behind a Keycloak Single Sign-On layer that conventional crawlers cannot penetrate. Building browser-based authentication into the crawling pipeline (using Playwright to handle OAuth2 flows, session tokens, and cookie management at scale) required significant engineering effort and careful coordination with various CERN teams.
Network access complexity. CERN’s network security model requires two-hop SSH access (desktop → lxplus gateway → target server) with different authentication patterns per server type. Communication between servers and workstations requires staging through intermediate nodes, which was a very interesting challenge to work on.
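The two-hop access pattern described above can be sketched with OpenSSH's ProxyJump option (`-J`), which tunnels a connection through a gateway host. The Python helper below only builds the command line and does not connect to anything. The function name, the user name, and the target host name are hypothetical, the gateway host name `lxplus.cern.ch` is an assumption, and the per-server authentication differences mentioned in the interview are not modelled.

```python
import shlex


def two_hop_ssh_command(user, gateway, target, remote_command=None):
    """Build an OpenSSH argv that reaches `target` through `gateway` via ProxyJump (-J)."""
    argv = ["ssh", "-J", f"{user}@{gateway}", f"{user}@{target}"]
    if remote_command:
        argv.append(remote_command)  # command to run on the target host
    return argv


# Hypothetical user and target host, for illustration only.
argv = two_hop_ssh_command("alice", "lxplus.cern.ch", "target-server.example", "hostname")
print(shlex.join(argv))
```

Returning the command as a list rather than a single string means it can be passed straight to `subprocess.run` without shell quoting issues.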
Which milestones do you plan to achieve in the remaining months?
In the remaining project period, we are focusing on completing and polishing our deliverable contributions and extending the search applications:
D5.3 completion: Finalise the remaining server documentation chapters and integrate all CERN infrastructure sections into the consolidated deliverable, including the URL Frontier evolution narrative, crawler infrastructure, and federation topology.
D4.4 integration: Complete the CERN search applications section with final evaluation results, upload the Zenodo reproducibility artifact package, and integrate figures and cross-references into the consolidated document.
Full estate crawling: Extend the institutional search from the current 6,738-page public subset to CERN’s full 25,000+ page web estate, integrating authenticated content into the MOSAIC index with appropriate access controls.
Nooon enhancements: Implement topic-level Curlie filtering for semantic corpus construction beyond keyword matching, and explore cross-corpus comparison capabilities (e.g., Disability in Employment vs. Disability in Education) tailored to HR and D&I workflows.
Frontier integration: Connect the authenticated crawlers to CERN’s URL Frontier infrastructure for continuous, scheduled crawling rather than the current manual campaign-based approach.
What makes the OWS project special to you?
The OWS project represents something genuinely rare: the attempt to build a public, European alternative to the commercial search infrastructure that shapes how billions of people access information. Working on this at CERN feels especially fitting — the web was born here, and now we are contributing to ensuring it remains open and searchable by everyone, not just by those who can afford to build their own index.
What makes it personally meaningful is the Nooon component. Building a search engine specifically for disability-related knowledge – one that surfaces voices and resources that mainstream search systematically underrepresents – connects the project’s technical ambitions to real human outcomes. When an HR professional can discover evidence-based accommodation guidelines or a disability advocate can find peer-reviewed employment research through an open, privacy-preserving infrastructure, that is the kind of impact that motivates the work.
The project also demonstrates that European research institutions can collaborate on infrastructure at scale. The iRODS federation across five sites in three countries, the shared URL Frontier coordinating billions of URLs, the OWILIX tooling that lets anyone extract a thematic slice of the web – these are building blocks for digital sovereignty that go beyond any single institution’s capability.
Do you already have plans for the time after the project ends?
Yes, several strands of work are designed to continue beyond the project timeline:
Nooon and fair hiring: Nooon is supported through the Open Search Foundation’s Ethics working group and CERN’s Disability Network within the Diversity and Inclusion programme. There is active interest from HR departments in exploring fair hiring tools built on open search infrastructure. We plan to extend Nooon to multilingual and multimodal corpora incorporating lived-experience contributions from disabled people and caregivers, with client-side preprocessing to protect sensitive employment data.
AccGPT integration: The authenticated crawling infrastructure continues to support AccGPT’s knowledge base requirements independently of the OWS project. The 3.3 GB already delivered serves as the foundation, with plans to extend coverage to CERN’s full web estate and establish continuous crawl schedules.
Infrastructure sustainability: The MOSAIC deployments on open-science-search serve as reference implementations for institutional search at CERN. The documented infrastructure and the systematic methodology we developed for server documentation provide templates that other institutions can adapt for their own open search deployments.
Open science artifacts: All reproducibility artifacts – seed inventories, crawler source code, pipeline outputs, and evaluation data – will be publicly available on Zenodo, ensuring that our contributions remain accessible and reproducible for the broader research community working on open web search.
Thank you for the interview!
Read more about CERN: https://home.cern/
Watch our interview with Noor about the search engine Nooon:





