Welcome to the Open Web Search Project!

We’re excited to have you here. Most likely, our web crawler recently visited your site as part of our research project. We crawl the Web to create an open web index and to bootstrap a more open internet search ecosystem, one in which webmasters and content producers have more control over how, and for what purpose, their content is used. Below, we outline what we are doing and why we are crawling the Web (and thus your site), including technical information on how to exclude your site from our crawl. We would be glad if you allowed us to crawl your site, but the choice is yours!

Our Mission

Our goal is to build a comprehensive open web index that benefits all its users. A web index forms the core of every search engine and can be seen as a map of the internet that relates words and search terms to the websites on which they occur. It allows users to search for websites and rank them according to their queries. Our index will be openly available, so that anyone can build their own web search engine.

We hope that enabling more web search engines gives users a choice of where to search. Just as you choose your newspaper, you should be able to choose your search engine.

OWLer

OWLer – our web crawler – is a friendly explorer that strictly follows the robots.txt protocol, ensuring legal and respectful online crawling. As we’re in the pioneering stages, there may be a hiccup or two along the way, and we apologize in advance for any potential inconvenience. We appreciate your understanding and are always open to feedback.

OWLer comes in two main versions: Experimental and Stable. The latter is currently built on top of the robust Apache Storm framework and the StormCrawler technologies. Beyond that, we are always trying out new technologies and functionality, which we test in the Experimental version of OWLer. Here’s a brief comparison:

Experimental Version

This version is our playground for innovation. We use it primarily for testing various tools and configurations before implementing them in our main crawler versions.

Tools: Our toolkit is ever-evolving, as we are incorporating different technologies and experimenting with decentralized web crawling techniques to reduce central points of failure.
Configurations: This version also allows us to experiment with different settings to maximize our crawler’s efficiency and effectiveness. For example, we might test different politeness policies, crawling speeds, or ways to handle various data types.

Stable Version

This is the current main version of our web crawler. It includes all the stable, tested features from the Experimental version that have proven to improve the crawler’s performance.

Reliable: After extensive testing in the Experimental version, the features and configurations that pass our strict quality and performance standards make their way to this version.
Focused on Performance: Unlike the Experimental version, which is designed for testing, the Stable version is optimized for performance. It is geared toward effectively indexing the web and providing useful, up-to-date data for the Open Web Index project.

You can always stay updated about our progress and learn more about our crawler versions at https://opencode.it4i.eu/openwebsearcheu-public/owler. If you have any further questions, feel free to contact us anytime.

Technical info

Open Source Code

In our endeavor to keep the internet open and accessible, we believe in transparency and collaboration. We have therefore made our web crawler’s source code available to the public. You can access, review, and even contribute to the code in our GitLab repository (https://opencode.it4i.eu/openwebsearcheu-public/owler).

Data Protection

We prioritize your privacy. While the data we collect relates primarily to your organization, we may also process certain personal data. Rest assured that this data is always treated with the highest level of confidentiality. We only process such data when it is publicly available on your website and necessary for our project. Moreover, we do not keep it indefinitely: it is deleted at most 90 days after it has been removed from your site.

Justification

In line with the GDPR, our data processing rests on one of the six legal grounds provided in the regulation, namely the legitimate interests of our project and its users. We have ensured these interests do not override your fundamental rights and freedoms.

Your Rights

We’re dedicated to protecting your rights. We are currently developing a platform that lets users request access to, rectification of, or deletion of their personal data, restrict or object to its processing, and request data portability.

Opting-Out

Your control over your online presence is paramount. If you prefer to keep our web crawler from accessing your site, you can do so by updating your website’s robots.txt file. Add a rule for our user-agent identifier Owler, which represents both our main and our experimental crawler. This single identifier prevents all current and future versions of the crawler from accessing your site.

Following recent developments regarding web publishers’ control over their content, we also support the user-agent identifier GenAI, which represents any use of data for training generative AI models. GenAI is conceptually similar to Google’s proposal of the Google-Extended user-agent. Whereas Google-Extended, GPTBot, Anthropic-AI, etc. each apply only to the respective company’s AI products, GenAI is meant to provide a more general way to opt out of data use related to the development of generative AI applications. OpenWebSearch.EU forwards publishers’ usage preferences to the users of our web index and of all additional data products we publish, through an INDEX and a GENAI metadata field, both represented as boolean values.

Please follow the step-by-step guidelines below:

Guidelines for Updating Your robots.txt File

Adding our user-agent identifiers to your robots.txt file is a simple and effective way to control the access of our web crawler to your site. Here’s a step-by-step guide on how to do it:

1. Access Your Website’s robots.txt File

This file is usually located in the root directory of your site. For example, if your website is www.example.com, you can find the robots.txt file at www.example.com/robots.txt.
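As a small illustration of where the file lives, the following Python sketch derives the robots.txt address from any page URL on a site. The function name robots_txt_url is hypothetical and only serves this example; it assumes the URL includes a scheme such as https://.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Assumes page_url carries a scheme (e.g. "https://"); without one,
    urlsplit cannot tell the host name apart from the path.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://www.example.com/blog/post?id=1"))
```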

2. Edit Your robots.txt File

Open the file with a text editor. This could be any program that lets you view and edit text files – Notepad on Windows, TextEdit on macOS, or a dedicated code editor like Sublime Text or Visual Studio Code.

3. Add Our User-Agent Identifiers

To block our current web crawlers, add the following lines to your robots.txt file. User-agent matching is case-insensitive, so owler (all lowercase) works as well.

User-agent: Owler
Disallow: /
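If you would like to check such a rule before publishing it, the following minimal sketch uses the robots.txt parser from Python’s standard library. It is only an approximation: OWLer uses its own parser, so this shows how a typical parser reads the rule rather than OWLer’s exact behavior. The helper name is_allowed and the example URLs are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from this guide: block Owler from the whole site.
ROBOTS_TXT = """\
User-agent: Owler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/some/page.html"
    # "owler" is included to show that matching ignores case.
    for agent in ("Owler", "owler", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Crawlers that are not named in the file, and for which no wildcard group exists, remain allowed.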

If you want your web pages to be indexed in search applications built on top of the Open Web Index, but still want to protect your web data against use in the training of generative AI models, add the following lines to your robots.txt file.

User-agent: GenAI
Disallow: /
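The sketch below illustrates how the two identifiers could translate into the INDEX and GENAI boolean fields mentioned above. It is a hedged illustration only: the function usage_preferences is hypothetical, and the assumption that True means "use permitted" is ours, since this page does not specify how the fields are computed or encoded.

```python
from urllib.robotparser import RobotFileParser

# Allow indexing, but opt out of generative AI training.
ROBOTS_TXT = """\
User-agent: GenAI
Disallow: /
"""


def usage_preferences(robots_txt: str, url: str) -> dict[str, bool]:
    """Derive illustrative INDEX / GENAI flags for url.

    Assumption: True means the respective use is permitted.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "INDEX": parser.can_fetch("Owler", url),
        "GENAI": parser.can_fetch("GenAI", url),
    }


if __name__ == "__main__":
    print(usage_preferences(ROBOTS_TXT, "https://www.example.com/article"))
```

With only the GenAI group present, the Owler identifier matches no rule, so crawling and indexing stay permitted while the generative AI flag is switched off.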

4. Save Your Changes

After you’ve added these lines, save your robots.txt file and upload it back to your website’s root directory, if necessary.

Remember: the “Disallow: /” line tells the specified user-agent not to crawl any pages on your site. If you want to block only certain pages, you can specify a path instead of “/”. For example, “Disallow: /private” would prevent the crawler from accessing any page whose path begins with /private, such as www.example.com/private.
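To see how such a partial rule behaves, here is a short sketch that again relies on Python’s standard-library parser as a stand-in for a typical robots.txt implementation. The paths and the helper name blocked_paths are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Block Owler only from paths that start with /private.
ROBOTS_TXT = """\
User-agent: Owler
Disallow: /private
"""


def blocked_paths(robots_txt: str, user_agent: str, paths: list[str]) -> list[str]:
    """Return the subset of paths that user_agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    base = "https://www.example.com"
    return [p for p in paths if not parser.can_fetch(user_agent, base + p)]


if __name__ == "__main__":
    paths = ["/private", "/private/report.html", "/public/index.html", "/"]
    print(blocked_paths(ROBOTS_TXT, "Owler", paths))
```

Because the rule is matched as a path prefix, it also covers everything below /private, while the rest of the site stays crawlable.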

Feel free to refer to our GitLab repository for any further clarification. If you have additional questions or need assistance, don’t hesitate to reach out.