Sunday, December 19, 2010

Profile Generation using Linked Data

As readers of my previous blog already know, this is not only a personal blog, so I thought I would share the research I did in my final year at university. The research is about using Linked Data (data published as RDF) to generate a profile (a collection of information about an entity). The following is the abstract of the research paper I wrote.


The Internet has very rapidly become the world’s largest data source because of the openness of its data publishing methods. This rapid growth has yielded many benefits but has caused some problems as well. The main problem Internet users face is searching the large amount of data it holds. Because anyone can publish data in any form, data are scattered across different data sources and a great deal of it is repeated. Products such as search engines and Web crawlers have emerged for this reason.

Even though there are search engines like “Google”, “Yahoo”, “Bing” and “Ask”, and Web crawlers like “Googlebot” and “Yahoo! Slurp”, it is still difficult to retrieve information from the Internet. The result from most of these search engines is a list of Web sites and pages that contain the keyword the user entered. The problem is that the user has to search further within these Web pages to find the information needed. This is mainly because the Internet has no specific method of publishing data and information, and because there are no relationships between similar data from different data sources. The variety of publishing formats, such as text, graphics, tables, charts, audio and video, does not help with retrieving information and, more importantly, with summarizing it. One reason is that most of these formats are not machine readable, so it is very difficult to build a system that summarizes information.

On the other hand, there have been some attempts to address the issue of information being scattered over the Internet. The most significant of these is the set of linked data principles. In simple terms, linked data is the term used to describe data that are interlinked because they are related. The most significant thing about linked data is that data are interlinked regardless of which data source they come from. Since linked data is a relatively new technology, research on linked data browsing and searching is still ongoing.

The arrival of linked data has made a huge difference in the way data and information are published, mainly because linked data are both human and machine readable. It has also opened up more possibilities for summarizing information, since it creates relationships between similar data. This research is about using those relationships to get a summary of information about an entity that the user gives; in other words, generating a profile using linked data.

There are several browsers like “Tabulator”, “Humboldt” and search engines like “Falcons”, “Semantic Web Client Library” that are developed to explore the Web of linked data. Most of these are still under research and some of these browsers are domain specific. The most significant thing about these browsers is that almost all of them use a similar algorithm to retrieve information from RDF (Resource Description Framework) data. These algorithms use SPARQL (SPARQL Protocol and RDF Query Language) queries to retrieve information from RDF files.

In this research, we have developed a methodology that uses the technology of linked data and generates a profile. This is a novel methodology and it uses a point awarding system that awards values for attributes or for the concepts in the RDF data file, based on the relevance of that concept to the profile we are generating.

The methodology we have developed and the considerations we took are as follows. To generate a profile, the first task would be to identify the attributes of the physical entity or to identify the concepts that should be included in the profile. Since this is one of the main difficulties faced when generating the profile, we have developed a methodology that uses linked data itself to identify the attributes of a physical entity. In this methodology, we use queries to gather the concepts. The query language we have used is SPARQL and we have created several types of queries. These queries are used for different purposes such as concept gathering and data gathering.

Concept gathering is done using a general query that returns all the concepts that are related to the entity we query. The types of queries we use cover the cases where the keyword appears as the subject or as the object of an RDF triple. If the concepts gathered in this way include ones that point to similar URIs or redirect URIs, we use two special types of queries to gather the values those concepts hold. These values are used to create more queries, and the procedure is repeated until no new similar or redirect URIs are found.
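The two general query shapes and the "repeat until nothing new is found" step can be sketched roughly as follows. This is only an illustration and not the code from the research: the function names, the exact SPARQL text and the `find_linked` callback (which stands in for running the similar-URI and redirect-URI queries against an endpoint) are all assumptions.

```python
from collections import deque


def subject_query(uri):
    """General query for concepts where the entity is the subject of a triple."""
    return f"SELECT DISTINCT ?concept WHERE {{ <{uri}> ?concept ?value . }}"


def object_query(uri):
    """General query for concepts where the entity is the object of a triple."""
    return f"SELECT DISTINCT ?concept WHERE {{ ?value ?concept <{uri}> . }}"


def linked_value_query(uri, predicate):
    """Special query that fetches the values held by one linking concept
    (for example a 'same as' or a 'redirect' predicate)."""
    return f"SELECT DISTINCT ?value WHERE {{ <{uri}> <{predicate}> ?value . }}"


def expand_uris(start_uri, find_linked):
    """Follow similar/redirect URIs until no new URI turns up.

    find_linked(uri) is a hypothetical callback that returns the URIs
    linked from `uri`; in a real system it would run linked_value_query
    against an endpoint.
    """
    seen = {start_uri}
    queue = deque([start_uri])
    while queue:
        uri = queue.popleft()
        for linked in find_linked(uri):
            if linked not in seen:  # only new URIs create more queries
                seen.add(linked)
                queue.append(linked)
    return seen
```

Each URI returned by `expand_uris` would then be passed to `subject_query` and `object_query` to gather its concepts. The `seen` set is what guarantees the loop stops even when two URIs point at each other.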

The query execution happens in the following manner. The concept identification process has two query execution steps. For the first step, we use an initial endpoint and an initial query to retrieve a single concept list. For that, we have selected an initial endpoint prior to the execution of the system. The execution of the other endpoints halts until the first result set is returned and these results are used to create the queries that are needed to retrieve concepts from other endpoints.
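A rough sketch of this two-step execution is shown below. It is an assumption-heavy illustration rather than the actual system: `fetch(endpoint, uri)` is a hypothetical function that returns (predicate, value) pairs for a URI, and using `owl:sameAs` values from the first result to build the follow-up queries is one plausible reading of the description, not something stated in the abstract.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def identify_concepts(entity_uri, initial_endpoint, other_endpoints, fetch):
    """Two-step concept identification.

    fetch(endpoint, uri) is a hypothetical callable returning a list of
    (predicate, value) pairs for `uri` at `endpoint`.
    """
    # Step 1: only the initial endpoint is queried; the rest wait for it.
    first = fetch(initial_endpoint, entity_uri)
    concepts = {initial_endpoint: {predicate for predicate, _ in first}}

    # The first result set decides which URIs the other endpoints are asked
    # about (assumption: the entity itself plus its sameAs values).
    uris = [entity_uri] + [value for predicate, value in first if predicate == SAME_AS]

    # Step 2: query every other endpoint with the URIs found in step 1.
    for endpoint in other_endpoints:
        found = set()
        for uri in uris:
            found.update(predicate for predicate, _ in fetch(endpoint, uri))
        concepts[endpoint] = found
    return concepts
```

The function returns one concept set per endpoint, which keeps the source of each concept available for the scoring that follows.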

The endpoints we have used range from generic endpoints to domain-specific endpoints, so that the profile we generate can include very common information as well as some domain-specific information. Each endpoint is assigned a value in such a way that generic endpoints have a lower value and domain-specific ones have a higher value. The initial endpoint is assigned the lowest value. Each concept gathered from an endpoint is assigned that endpoint's value, and at the end of the concept identification process all the concepts are merged and the values of matching concepts are summed. Finally, a threshold value is chosen, and the concepts whose total value is greater than the threshold are taken as the final list of concepts. These concepts are used to generate new queries, and the corresponding data are gathered by querying all the endpoints.
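The point awarding step can be illustrated with a small sketch. The endpoint names, weights, concept names and threshold below are made up for the example, and the function name is hypothetical; only the merge, sum and threshold logic follows the description above.

```python
from collections import defaultdict


def select_concepts(concepts_by_endpoint, endpoint_values, threshold):
    """Merge concepts from all endpoints, sum their values, apply the threshold.

    concepts_by_endpoint: endpoint name -> iterable of concept names
    endpoint_values:      endpoint name -> value awarded per concept
    """
    totals = defaultdict(int)
    for endpoint, concepts in concepts_by_endpoint.items():
        for concept in set(concepts):  # count a concept once per endpoint
            totals[concept] += endpoint_values[endpoint]
    # Keep only concepts whose total is strictly greater than the threshold.
    return sorted(c for c, total in totals.items() if total > threshold)


# Illustrative values only: the generic endpoint gets the lowest value.
values = {"generic": 1, "domain": 3}
gathered = {
    "generic": ["name", "birthDate", "wikiPageID"],
    "domain": ["name", "genre"],
}
print(select_concepts(gathered, values, threshold=2))  # ['genre', 'name']
```

Here "name" scores 4 because both endpoints report it, "genre" scores 3 from the domain-specific endpoint alone, and the two concepts seen only at the generic endpoint score 1 and are dropped.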

The main beneficiaries of this system will be day-to-day Internet users who are trying to get a summary of information about a particular real-world entity. However, the system has one main limitation in generating profiles: the amount of data published as linked data is still limited, so a profile will sometimes contain less information than an ordinary Web site. Since our methodology depends on the amount of data published as linked data, we expect the quality of the generated profiles to improve as linked data grows.
