The University of Passau coordinates the OpenWebSearch.EU project and is also responsible for providing the Open Web Index (OWI), which includes developing the technology for coordinating crawlers, building the OWI, and enabling its download. Building the OWI is one of the key milestones in the OWS.EU project, since it will accelerate further use of, and research towards, open web search.
Prof. Michael Granitzer leads the OWS.EU project and holds the Chair of Data Science at the University of Passau. Together with Jelena Mitrović, Professor of Legal Informatics and Natural Language Processing and leader of the Junior Research Group CAROLL, he supervises the research team working on the Open Web Index. We talked to three researchers from their team about their work in the OWS.EU project: Saber Zerhoudi, Mahmoud Istaiti and Mohammed Al-Maamari.
How is the project progressing so far?
Saber Zerhoudi, University of Passau
Saber: Very well. We have made considerable progress over the past months. Our team has developed scalable, distributed crawling software that is currently deployed across three datacenters. To keep users informed about the content being crawled and provide them with filtering options, we have also created a monitoring dashboard that can be accessed at https://dashboard.ows.eu/.
Can you explain what the dashboard does?
Saber: One of the key features of the dashboard is its ability to display near real-time information about the crawling process. Users can easily track the progress of the crawling tasks and view statistics on the number of pages crawled. This transparency ensures that users are always informed about the status of our crawling pipeline.
Furthermore, the dashboard lets users filter the crawled content by various criteria, such as domain, keyword, or date range. This allows users to focus on the subsets of data that are relevant to their needs, saving time and effort when analyzing the collected information.
In addition to monitoring and filtering capabilities, the dashboard provides users with the ability to actively contribute to the crawling process. Users can submit lists of URLs they wish to have crawled, expanding the scope of our data collection efforts. This feature enables users to tailor the crawling process to their specific requirements, ensuring that the most relevant and valuable data is collected.
But how does this look from the perspective of a website owner? Will they have the option to manage their data?
Saber: Yes, to address the important aspects of data privacy and intellectual property rights, we have integrated takedown request and website ownership verification functionalities into the dashboard. Through our third-party partners, users can easily submit takedown requests for content they believe infringes upon their rights. Similarly, website owners can verify their ownership, establishing a clear line of communication and ensuring that any concerns or requests are promptly addressed.
By combining scalable, distributed crawling software with a user-friendly monitoring dashboard, we have created a powerful tool for data collection and management. The ability to monitor, filter, and contribute to the crawling process, along with the integration of takedown request and website ownership verification features, positions our system as a comprehensive solution for users seeking to gather and analyze web data efficiently and responsibly.
What other milestones have you achieved in the project so far?
Mahmoud Istaiti, University of Passau
Mahmoud: My role involves enhancing the crawler process by implementing various filters and features, as well as integrating different data sources into our process. Additionally, I am working on developing machine learning models to extract information from privacy policies.
One major accomplishment is that we can now label crawled websites as either spam or high-quality content by checking whether they appear in datasets such as Wikipedia external links or CURLIE.
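The idea of labeling by dataset membership can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the `TRUSTED_DOMAINS` set, the helper names, and the `"unverified"` fallback label are all assumptions made for illustration, and a real pipeline would load the domains from sources such as Wikipedia external links or CURLIE.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for domains harvested from curated sources
# (e.g. Wikipedia external links or CURLIE); not real project data.
TRUSTED_DOMAINS = {"example.org", "uni-passau.de"}


def normalized_host(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def label_url(url: str, trusted: set[str] = TRUSTED_DOMAINS) -> str:
    """Label a URL 'high-quality' if its host, or a parent domain of it,
    appears in the trusted set; otherwise return 'unverified'."""
    parts = normalized_host(url).split(".")
    # Try the full host first, then each parent domain (never the bare TLD).
    for i in range(len(parts) - 1):
        if ".".join(parts[i:]) in trusted:
            return "high-quality"
    return "unverified"
```

In this sketch a URL that is missing from the curated lists is only marked as unverified rather than spam, since absence from a list alone says little about a page; how the project itself treats such cases is not described here.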
Mohammed Al-Maamari, University of Passau
Mohammed: I specialize in Machine Learning and Data Science. My responsibilities include building and processing datasets, training machine learning models (such as URL classification models), and enhancing model modularity.
Key milestones we have achieved so far include developing and comparing various URL classification models, and building and open-sourcing several datasets.
What are the challenges you face in your work?
Saber: Navigating the diverse infrastructure setups, guidelines, and technology stacks unique to each of the three data centers that currently host the OWI can be a significant challenge. Each data center has its own distinct configuration of hardware, software, and networking components, so managing and maintaining our deployment there requires a deep understanding of the specific environment.
Moreover, data centers often have their own set of best practices, policies, and procedures that must be followed to ensure smooth operations and compliance with industry standards and regulations. These guidelines cover various aspects, from physical security and access control to data backup and disaster recovery protocols.
Mahmoud: The same goes for me. I often encounter infrastructure-related issues, which can be a challenge at times.
Mohammed: For me it’s often challenging to effectively test the trained machine learning models.
What are the next steps from here?
Saber: In the coming months, our goal is to streamline the crawling process across various datacenters using a centralized control center. This automation will enhance efficiency and consistency in data collection. Additionally, we are exploring methods to integrate embeddings seamlessly into our crawling-preprocessing-indexing pipeline.
Mohammed: In the coming months, I aim to optimize and improve the machine learning models, particularly the URL classification model.
Mahmoud: I plan to finish the integration of data from the Mastodon platform into our process.
Thank you for the interview!
Read more about University of Passau: https://openwebsearch.eu/partners/university-of-passau/