The case for Neural Crawling: Inside the FUN project
A research team from Pisa and Glasgow proposes that AI language models should decide which web pages to download – and shows why this matters for the future of search
Before a search engine can find anything, it must first build a collection of web pages to search through. This collection is assembled by a crawler – a piece of software that systematically visits web pages, follows links, and downloads content. The decisions the crawler makes about which pages to prioritise determine, in a very direct way, what the search engine will eventually be able to find.
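The article stays at the conceptual level, but the loop it describes can be pictured as a best-first crawl over a priority queue. The sketch below (Python) is a generic illustration, not the FUN implementation: fetch, extract_links and score are placeholder callables, and the policy plugged into score is exactly the decision the rest of this article is about.

```python
import heapq

def crawl(seed_urls, fetch, extract_links, score, budget=1000):
    """Generic best-first crawl loop: repeatedly download the highest-priority
    URL in the frontier and enqueue its outlinks. fetch, extract_links and
    score are placeholders for the fetcher, link extractor and prioritisation
    policy."""
    frontier = [(-1.0, url) for url in seed_urls]  # heapq is a min-heap, so priorities are negated
    heapq.heapify(frontier)
    seen = set(seed_urls)
    corpus = {}

    while frontier and len(corpus) < budget:
        _, url = heapq.heappop(frontier)
        html = fetch(url)                    # download the page
        corpus[url] = html
        priority = score(url, html)          # how promising are pages reachable from here?
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))
    return corpus
```

Everything that follows in this article is, in effect, an argument about what should go inside score().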
For over two decades, the dominant approach to crawling prioritisation has been PageRank and related link-analysis methods: pages that are linked to by many other important pages are assumed to be important themselves. This was a reasonable assumption in the era of keyword search. But search is changing. Users increasingly ask questions in natural language rather than typing keywords, and automated systems like retrieval-augmented generation (RAG) pipelines issue their own queries to search engines. These new kinds of queries demand pages with rich, coherent, meaningful content – and there is no guarantee that such pages are also the most popular or the most linked-to.
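For context, the link-analysis baseline can be written in a few lines as a power iteration: each page's score is redistributed along its outlinks, so pages that many well-scored pages point to accumulate high scores. This is the textbook formulation on a toy graph, not the configuration used in the FUN experiments.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict mapping each page to the
    pages it links to: importance flows along links from important pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page) or list(pages)  # dangling pages spread their rank uniformly
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Toy graph: 'a' is linked to by both other pages, so it ends up ranked highest.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))
```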
The FUN project – Focused Neural Crawling – was funded under the European OpenWebSearch.EU project and carried out by researchers at the University of Pisa and the University of Glasgow. It tackles this problem head-on, proposing a new paradigm: instead of using link popularity to decide what to crawl, use AI language models to estimate the semantic quality of web pages and prioritise accordingly.
Why crawling matters more than you might think
It is easy to focus on the visible parts of a search engine – the ranking algorithms, the interface, the speed of results – and overlook the crawler. But the crawler is the primary content filter in the entire search pipeline. It decides what gets downloaded, stored, and indexed. Everything that happens downstream – indexing, ranking, retrieval – operates only on the content the crawler has already collected. A sophisticated ranking algorithm cannot compensate for a poor crawling strategy: if valuable pages were never downloaded, they simply do not exist as far as the search engine is concerned.
The web is vast, and no crawler can download everything. Choices must be made, and the heuristics that guide those choices shape the quality of the entire search corpus. Traditional heuristics like PageRank assume that a page’s importance can be inferred from its position in the web’s link structure. This works well when search queries are short keyword strings and when the most popular pages tend to be the most useful. But the FUN team argues that this assumption is increasingly outdated.
The shift: from link popularity to semantic quality
The core idea behind neural crawling is straightforward: instead of asking “How popular is this page?”, the crawler asks “How likely is this page to contain content that would be useful for answering a search query?” To answer this question, the system uses a neural quality estimator – a small language model that has been trained to predict, from the text of a document alone, whether that document is likely to be relevant to any query. The model does not need to know what the query will be; it assesses the intrinsic quality of the text itself: its coherence, informativeness, and semantic richness.
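The technical report linked at the end of this article describes the estimator in detail; the sketch below only illustrates the interface such a model exposes. It assumes a hypothetical fine-tuned Hugging Face checkpoint (the path is a placeholder, not a published model) that maps a document's text to a single relevance-likelihood score.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint name: assume a small model fine-tuned to predict
# whether a document would be relevant to *some* future query.
MODEL_NAME = "path/to/fun-quality-estimator"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def quality_score(text: str) -> float:
    """Score a document's text in [0, 1]: higher means more coherent,
    informative, query-worthy content. No query is needed."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()
```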
There is an obvious practical problem: the crawler needs to decide whether to prioritise a page before it has downloaded that page. It cannot read the text of a page it has not yet fetched. The FUN team addresses this with two quality propagation strategies. The first is based on the observation that web pages tend to link to other pages of similar quality. If a high-quality page links to an unknown page, there is a reasonable probability that the unknown page is also of decent quality. The crawler can therefore use the quality of already-downloaded pages as a proxy for the likely quality of the pages they link to.
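In code, the first strategy amounts to letting each downloaded page vote for its outlinks with its own quality score. The helper below is illustrative (the function and variable names are mine, not the FUN codebase's); a production frontier would also reconcile URLs that are linked from several pages, for example by keeping the highest inherited score.

```python
import heapq

def enqueue_outlinks(frontier, seen, source_quality, outlinks):
    """First propagation strategy: an undownloaded URL is prioritised by the
    neural quality score of the already-downloaded page that links to it."""
    for url in outlinks:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-source_quality, url))  # min-heap, so negate
```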
The second strategy works at the domain level: pages within the same domain tend to have similar quality. Once the crawler has downloaded a few pages from a domain, it can estimate the quality of the domain as a whole and use that estimate to prioritise other pages from the same domain.
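A sketch of the domain-level idea, in the spirit of the DomQ strategy discussed in the results below (the exact aggregation and smoothing used by FUN may differ): keep a running mean of the quality scores observed for each domain and use it as the priority for not-yet-downloaded URLs from that domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

class DomainQuality:
    """Running mean of neural quality scores per domain; used to prioritise
    URLs from domains whose downloaded pages have scored well so far."""

    def __init__(self, prior=0.5):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)
        self.prior = prior  # default priority for domains not yet sampled

    def update(self, url: str, quality: float) -> None:
        domain = urlparse(url).netloc
        self.totals[domain] += quality
        self.counts[domain] += 1

    def priority(self, url: str) -> float:
        domain = urlparse(url).netloc
        if self.counts[domain] == 0:
            return self.prior
        return self.totals[domain] / self.counts[domain]
```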
What the experiments show
The team tested their approach through large-scale simulations on ClueWeb22-B, a web corpus of 87 million pages, using two different sets of test queries. One set consisted of traditional keyword queries; the other consisted of natural language questions.
The results are striking. On natural language queries, the neural crawling strategies consistently outperformed PageRank in both the quality of the crawled corpus and the effectiveness of downstream search results. The domain-level strategy (DomQ) was particularly strong, building corpora that led to substantially better retrieval performance. On traditional keyword queries, the neural strategies performed comparably to PageRank – they did not lose ground on the type of search that PageRank was designed for.
The efficiency results were also notable. Neural crawlers collected relevant pages faster than PageRank in the early stages of the crawl, meaning they built useful search corpora more quickly and with less wasted bandwidth downloading low-value pages. This matters in practice, because crawling the web is expensive in terms of network resources, storage, and computing time.
A key finding underpinning the domain-level approach is that the semantic quality of a web page is strongly correlated with the average quality of other pages on the same domain (Pearson correlation of 0.649). By contrast, the equivalent correlation for PageRank scores is much weaker (0.272). In other words, knowing that a domain tends to host high-quality content is a much better predictor of individual page quality than knowing that a domain is well-linked.
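To make the statistic concrete: the correlation compares each page's quality score with the mean score of the other pages on its domain. A pandas sketch of that computation follows; the domain and score column names are assumptions for illustration, not the schema used in the report.

```python
import pandas as pd

def page_vs_domain_correlation(pages: pd.DataFrame) -> float:
    """Pearson correlation between each page's quality score and the mean
    score of the *other* pages on its domain (leave-one-out mean)."""
    g = pages.groupby("domain")["score"]
    total, count = g.transform("sum"), g.transform("count")
    pages = pages.assign(others_mean=(total - pages["score"]) / (count - 1))
    pages = pages[count > 1]  # single-page domains have no "other" pages
    return pages["score"].corr(pages["others_mean"], method="pearson")
```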
Why it matters
The FUN project is significant for the OpenWebSearch.EU project in a very direct way. Running an open European web index requires crawling decisions – and the quality of those decisions determines the quality of the index. If an open web index is built using traditional crawling heuristics, it inherits the biases of those heuristics: a preference for popular, well-linked content at the expense of semantically rich but less connected pages. Neural crawling offers a way to build corpora that are better suited to the modern demands of natural language search and AI-powered information retrieval.
To make this practical, the FUN team produced not just research findings but usable software. Their quality scoring tools are compatible with the OWS parquet file format and integrated into Resilipipe, the open-source content analysis framework used by OpenWebSearch.EU.
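As a rough picture of what that integration enables (the file name, column names and schema below are hypothetical, not the actual OWS parquet layout or the Resilipipe API), scored records could be written back to parquet using the quality_score function sketched earlier:

```python
import pandas as pd  # requires a parquet engine such as pyarrow

# Hypothetical file and column names, for illustration only.
records = pd.read_parquet("crawl-batch.parquet")
records["quality_score"] = records["plain_text"].map(quality_score)
records.to_parquet("crawl-batch-scored.parquet", index=False)
```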
The FUN project demonstrates that the way we crawl the web should evolve alongside the way we search it. As search queries become more conversational and AI systems become major consumers of search infrastructure, the assumption that link popularity is the best guide for crawling priorities is no longer sufficient. Neural quality estimation offers a complementary – and in many cases superior – signal.
What’s next
Future work could explore combining neural and link-based signals in hybrid strategies, using ensembles of quality estimators that assess different dimensions of page quality (spam, machine-generated content, factual accuracy), and evaluating neural crawling with more advanced retrieval models beyond BM25. The approach could also be adapted for other tasks that depend on corpus quality, such as building high-quality training data for large language models.
The full technical report is available at https://zenodo.org/records/17359141
The FUN project was funded under the OpenWebSearch.EU project (Horizon Europe, Grant Agreement 101070014, Call #2).