The open web index and its creation are the core of the project. The web index contains data and metadata in structured form, derived from the analysis of web documents (6).
To create such an index, web documents are systematically retrieved from the web (1) and stored on a server (2). A list of initial web documents serves as the starting point; the links found in these documents, and in every document retrieved subsequently, are then followed to traverse the web and to find and store further documents. Blacklists guide the capture process by excluding websites according to certain characteristics.
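The traversal described above can be illustrated with a minimal sketch. It assumes a breadth-first strategy and a blacklist keyed by host name; neither is specified in the text. The names `crawl`, `toy_fetch_links` and `TOY_WEB` are hypothetical, and a small in-memory link graph stands in for real HTTP retrieval.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for the real web.
TOY_WEB = {
    "https://a.example/": ["https://b.example/", "https://spam.example/"],
    "https://b.example/": ["https://a.example/", "https://c.example/"],
    "https://spam.example/": ["https://d.example/"],
    "https://c.example/": [],
}


def toy_fetch_links(url):
    """Return the outgoing links of a document (no network access)."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, blacklisted_hosts, max_documents=100):
    """Breadth-first traversal from the seed URLs, skipping blacklisted hosts.

    Returns the URLs that would be stored, in the order they were visited.
    """
    queue = deque(seeds)
    seen = set(seeds)
    stored = []
    while queue and len(stored) < max_documents:
        url = queue.popleft()
        # Assumption: the blacklist excludes whole hosts.
        if urlparse(url).hostname in blacklisted_hosts:
            continue
        stored.append(url)  # stands in for "stored on a server" (2)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


stored_urls = crawl(["https://a.example/"], toy_fetch_links, {"spam.example"})
```

Because the blacklisted host is never stored or expanded, documents reachable only through it (here `https://d.example/`) are not captured either. A production crawler would additionally need politeness rules, URL normalisation and persistent storage, none of which are shown here.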
The content analysis of the web documents consists of several steps. The text content is extracted (3), the metadata is processed (4), and an analysis is performed to obtain certain features, including qualitative, ethical and legal aspects (5). All this derived information is stored in the index together with the respective web addresses of the analysed web documents (6).
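As a rough illustration of steps (3) to (6), the sketch below extracts the text of an HTML document, reads its title as one piece of metadata, derives a single toy feature (a word count), and returns a record keyed by the web address. The record layout and the names `analyse` and `_TextAndTitle` are assumptions made for this example; the qualitative, ethical and legal analyses mentioned above are not modelled.

```python
from html.parser import HTMLParser


class _TextAndTitle(HTMLParser):
    """Collects visible text and the <title>, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def analyse(url, html):
    """Build one index record for a web document (hypothetical layout)."""
    parser = _TextAndTitle()
    parser.feed(html)
    parser.close()
    # (3) text extraction, with whitespace normalised
    text = " ".join(" ".join(parser.text_parts).split())
    return {
        "url": url,                                         # web address (6)
        "text": text,                                       # extracted text (3)
        "metadata": {"title": parser.title.strip()},        # metadata (4)
        "features": {"word_count": len(text.split())},      # toy feature (5)
    }


record = analyse(
    "https://a.example/",
    "<html><head><title>Demo</title><script>var x=1;</script></head>"
    "<body><p>Hello open web</p></body></html>",
)
```

Keeping text, metadata and features in separate fields mirrors the separation of the analysis steps, so that further feature extractors could be added without touching the text extraction.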
To enable the search for information, the index is made accessible in two different ways (7): the entire index is deployed in a computing centre, and a part of the index can be downloaded and hosted on a personal server.
Data products and search applications are built on the open web index and demonstrate its use.
Search applications enable a user to search for general information or domain-specific information for a particular purpose, e.g. for restaurants in the vicinity. For this purpose, the index is queried and a selection of matching web documents is retrieved (a). The documents found are then ranked based on purpose-specific characteristics, user preferences and usage data (b).
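The two stages of a search application, retrieval (a) and ranking (b), can be sketched as follows. This is a deliberately simple illustration, not the project's method: retrieval requires every query term to occur in the document text, and ranking uses term frequency plus a fixed boost for hosts the user prefers. The function `search`, the parameter `preferred_hosts` and the toy index are hypothetical.

```python
def search(index, query, preferred_hosts=()):
    """Retrieve matching documents (a) and rank them (b).

    `index` is a list of records with "url" and "text" keys.
    """
    terms = query.lower().split()
    matches = []
    for entry in index:
        words = entry["text"].lower().split()
        # (a) retrieval: keep documents that contain every query term
        if all(term in words for term in terms):
            # (b) ranking: term frequency as a stand-in for
            # purpose-specific characteristics
            score = sum(words.count(term) for term in terms)
            # Assumption: user preferences are modelled as a fixed boost.
            if any(host in entry["url"] for host in preferred_hosts):
                score += 2
            matches.append((score, entry["url"]))
    # Highest score first; ties are broken alphabetically by URL.
    matches.sort(key=lambda match: (-match[0], match[1]))
    return [url for _, url in matches]


toy_index = [
    {"url": "https://a.example/", "text": "pizza restaurant near the station"},
    {"url": "https://b.example/", "text": "restaurant guide restaurant reviews"},
    {"url": "https://c.example/", "text": "train timetable"},
]

plain_results = search(toy_index, "restaurant")
personalised_results = search(toy_index, "restaurant", preferred_hosts=("a.example",))
```

Without preferences, the document that mentions the term more often is ranked first; with the preference boost, the order of the two matching documents is reversed. Usage data, the third ranking signal named above, is left out of the sketch.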
The search application offers the search functionality through a user interface in which the results are presented according to the purpose of the application (c). The user is taken into account by specifically addressing cognitive aspects, so that users can better carry out and understand the search process (d).
The data products are knowledge representation models derived from the index, e.g. knowledge graphs containing structured conceptual knowledge or AI language models capable of text generation. Applications can use these components for a variety of purposes.
Openness is one of the key concepts of this technology. Not only can the web index be used by external applications, but any of the modules described above can be enriched. In particular, new data products or search applications can be created by other parties using the open web index.