Semantic Web – Basic Terms and Principles
The flutter of the tiny wings of the unleashed Hummingbird seems to be acting as a harbinger of the dawning of a new era in the way we communicate with search engines. Even though the idea of the semantic web has been around for longer than most people would think, with each year giving birth to another patent or approach that edged us closer to endowing search engines with at least some kind of understanding of the concepts they have to deal with, it was Hummingbird that created an environment in which those patents and approaches can show their full potential. The semantic web and the principles behind it are no longer just esoteric notions that you can choose to ignore, at least not if you still want to be able to claim that you understand how search engines work.
However, some of the ideas behind the semantic web have been lurking in the shadows of keywords, mass-produced content, guest blogging and other idols of the past for so long that their obscurity became more of a character trait than a temporary state. They have been growing steadily, nurtured by the enthusiasm and efforts of people with enough foresight and drive to seek out ways to improve the existing approach to search, so if you have been ignoring their development for some time, you might be a bit overwhelmed by how much they’ve progressed in the meantime.
The basic drive behind the semantic web that we hope to achieve some day is the shift from simply indexing data to allowing search engines to understand it as completely as possible. The approach is summed up in the battle cry of semantic web apostles, ‘things, not strings’, where ‘things’ denotes search entities (more on them later), while ‘strings’ refers to the traditional way of handling a query: matching strings of characters, i.e. keywords, without any grasp of what they stand for.
The urgency of enabling search engines to better understand the notions they are returning as results may have been intensified by the increase in the use of voice search on mobile devices, and by the type of queries used in such circumstances (resembling questions that you would pose to a human being more than old school queries with a couple of keywords modified for location or other factors). Still, this issue is only a drop in the ocean of splendor that is the semantic web. This approach to structuring and retrieving data provides for a better, more comprehensive search that includes much more than simply looking for the right keyword in the index.
In order to better understand how this works, you have to understand the difference between explicit and implicit signals. These are the signals that a search engine analyzes when trying to answer your query. Keywords and other direct input from the searcher are explicit signals, while the term implicit signals refers to the context of the search. This includes everything from the time when the search was made, your current location and your previous searches, to everything else that the search engine is able to learn about your search habits and current situation.
Over time, search engines were given access to an increasing number of these implicit signals, which resulted in them being able to offer more relevant suggestions for your queries. The desire to provide them with even more of these signals led to the development of structured data and semantic markup. However, before we delve into that, we should first take a look at the basic unit of the semantic web, the search entity.
Once upon a time you had keywords and the domains they could be found on. The more authoritative and respected the domain, the better the chances it will come up as a result for your query. While this system was the best that we could do at the time, anyone with at least a casual interest in SEO was more than aware of its numerous flaws. Semantic web tries to do away with keywords as basic units of a search, and replace them with search entities.
A search entity can be just about anything, from a location or a person to a school of thought. Hummingbird, the Knowledge Graph before it, and numerous other factors in the current landscape of the world wide web rely on gathering and structuring knowledge about those entities, viewing them as a whole, and not only as something that might have something to do with the keywords you entered into the search bar. The base of search entities is constantly expanding, with each individual entity holding a unique identifier and being a unique semantic object.
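To make the idea of a unique identifier a bit more tangible, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Entity` and `EntityRegistry` names, the `ent:001` style identifiers and the sample entries are assumptions, not how any search engine actually stores its entities.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A search entity: one unique identifier, a type, and a readable name."""
    entity_id: str    # hypothetical identifier scheme, for illustration only
    entity_type: str  # e.g. "Person", "City"
    name: str


class EntityRegistry:
    """Maps each unique identifier to exactly one entity."""

    def __init__(self) -> None:
        self._by_id: dict[str, Entity] = {}

    def add(self, entity: Entity) -> None:
        # Two entities may share a name, but never an identifier.
        if entity.entity_id in self._by_id:
            raise ValueError(f"duplicate identifier: {entity.entity_id}")
        self._by_id[entity.entity_id] = entity

    def find_by_name(self, name: str) -> list[Entity]:
        """Return every entity that shares the given name."""
        wanted = name.casefold()
        return [e for e in self._by_id.values() if e.name.casefold() == wanted]


registry = EntityRegistry()
registry.add(Entity("ent:001", "City", "Paris"))
registry.add(Entity("ent:002", "Person", "Paris"))
registry.add(Entity("ent:003", "City", "London"))

# The keyword "paris" alone matches two different entities.
matches = registry.find_by_name("paris")
print([(e.entity_id, e.entity_type) for e in matches])
```

The point of the sketch is that the keyword ‘paris’ is ambiguous, while each identifier points to exactly one thing.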
This enables search engines to get a better idea of the context of your search, and to return better, more relevant and comprehensive results. This is why, if you want to increase your visibility in the new system, you need to be recognized as an entity. Forget about mentioning keywords or synonyms and hoping that this alone will make you more relevant; you’ll need to think a bit bigger than that.
There are numerous ways to establish yourself as an entity, the best of them being to excel at what you do, i.e. to be recognized by people before you are acknowledged by the search engines. With all the signals that their algorithms consider, they can get a rather clear picture of what it is that you are all about, with your social activities, for instance, being at least as indicative as the keywords on your homepage. However, instead of trying to manipulate the search engines into seeing you from a certain angle and in a certain light, you can simply tell them what it is that you are doing, by structuring your data through semantic markup.
In the dark ages of the internet, pieces of unstructured data swirled around in the void of indifference, only occasionally being pulled from the chaos and oblivion by someone typing their simulacrum into the search bar. Once examined, they were returned to nonexistence to wander until their next summoning. But then the light of order came: what were keywords became entities, and instead of being simply the sum of their letters, they became notions and ideas, each developing character, personal traits, and a host of modifications that made it special. Data was classified, elaborated upon, and given essence.
Structuring of the data allowed search engines to understand that they shouldn’t return the same result for the word ‘cats’ to someone whose search history shows an interest in theater as they would to someone who is frequently looking up ‘cheapest cat food’. Data structuring allows us to elaborate on the role and the type of data, so it is easily recognized by a search engine, and returned when it’s relevant. This is achieved through semantic markup.
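The ‘cats’ example can be sketched in a few lines of Python. This is a toy under stated assumptions: the two candidate entities, their topic words and the overlap scoring are all invented for illustration, and real search engines weigh far more signals than this.

```python
# Hypothetical candidate entities that all match the keyword "cats",
# each described by a handful of made-up topic words.
CANDIDATES = {
    "Cats (musical)": {"theater", "musical", "broadway", "tickets"},
    "Cat (animal)": {"pet", "food", "kitten", "vet"},
}


def pick_entity(history: list[str]) -> str:
    """Choose the candidate whose topic words overlap most with past searches."""
    seen = {word for query in history for word in query.lower().split()}
    # max() keeps the first candidate on a tie, so ordering acts as the default.
    return max(CANDIDATES, key=lambda name: len(CANDIDATES[name] & seen))


theater_fan = ["broadway tickets this weekend", "best musical 2013"]
pet_owner = ["cheapest cat food", "vet near me"]

print(pick_entity(theater_fan))  # Cats (musical)
print(pick_entity(pet_owner))    # Cat (animal)
```

The same keyword leads to two different entities because the search history, an implicit signal, tips the scale one way or the other.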
In the simplest of terms, semantic markup allows you to annotate your data with metadata (information about information) so that its properties and the meaning behind them are made apparent to search engines. The language used is easily understood by machines, and easily implemented by web developers who want to structure their data.
This kind of structuring relies on three things: triples, vocabulary, and markup format or syntax. Let’s try explaining each of them.
Triples are sets of data organized in accordance with their role in a particular statement. They are based on the semantic notions of subject, predicate and object. Subject and object are typically different search entities (the object can also be a plain value, such as a date or a number), while the predicate, which plays the role of a verb, is the link between them: their interaction and, basically, an explanation of their relation to each other. Enormous sets of triples are stored in triplestores, and can be easily retrieved when a query comes along.
The importance of triples is found in the fact that they allow search engines to glean context from our search, to connect already formed search entities in a sensible way, to understand their interaction and return the relevant result.
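A triple, and a store that holds triples, can be sketched in a few lines of Python. The `Triple` and `TripleStore` names, the `is_a` and `composed_by` labels and the sample facts are assumptions chosen for illustration; real triplestores use standardized identifiers and query languages and index far larger data sets.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str  # the linking part between subject and object
    obj: str


class TripleStore:
    """A tiny in-memory store, meant only to show the shape of the idea."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = TripleStore()
store.add("Cats", "is_a", "Musical")
store.add("Cats", "composed_by", "Andrew Lloyd Webber")
store.add("Andrew Lloyd Webber", "is_a", "Person")

# "Who composed Cats?" becomes a pattern with the object left open.
answer = store.match(subject="Cats", predicate="composed_by")
print(answer[0].obj)  # Andrew Lloyd Webber
```

Answering a question then comes down to filling in the part of the pattern that was left open.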
Just like in language, vocabulary in semantic search denotes a set of arbitrary symbols meant to represent certain notions. Schema.org provides us with a huge vocabulary for structuring data, and since it has been adopted by the major search engines, it has become the default go-to for structuring your data.
While this adoption of a single vocabulary did impose some limits, it also allowed for the standardization of data structuring, which is a must if the work on the semantic web is to be continued. Naturally, there are still other, more specific vocabularies, used for structuring different kinds of data, but your best bet is going with Schema.org, if you have that option.
Syntax refers to the language used to mark up our data. The options include microdata, microformats and RDF (Resource Description Framework, embedded in pages as RDFa). Each of these approaches has its upsides and its faults: RDF is the most extensible, and relies on attributes to elaborate on an entity; microformats are used for the topical attribution of HTML/XHTML; and microdata is a set of specifications that allows for the addition of semantic modifiers to the code of a page.
Schema.org decided to focus on microdata, because it offers decent breadth and extensibility and, unlike microformats, which lean on class attributes, it doesn’t tend to interfere with the CSS of a page. Naturally, this focus on a single syntax didn’t go down well with those who used a different one to mark up their data, but luckily, other syntaxes are still supported, although it is highly recommended that you only use microdata in the future. Even if a number of your pages already contain some of the other syntaxes, as long as you don’t use two or more types of syntax on a single page, you should be able to avoid any conflicts.
Now that you are (hopefully) a bit more familiar with some of the basic terminology, let’s try to take a look at how all of this works.
When creating a web page, you do your best to give as many clues about your data as you can through the use of microdata or an alternative syntax. If you are publishing a recipe, you’ll designate it as such, allowing for its appearance in rich snippets; if you are writing about music, you’ll look up the vocabulary provided by Schema.org for the appropriate terms and apply them where they fit, and so on. By doing this you are supplying the search engines with additional context that they can use to better understand what your data is about, and what entities they are dealing with.
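Here is a rough idea of what such a recipe could look like, and of what a machine can read out of it. The page fragment is a made-up example that uses the `Recipe` type with the `name` and `recipeIngredient` properties from Schema.org, and the `MicrodataReader` class is a deliberately simplified sketch built on Python’s standard `html.parser`; it ignores nested items and is not how a search engine actually processes a page.

```python
from html.parser import HTMLParser

# A hypothetical recipe page fragment marked up with Schema.org microdata.
PAGE = """
<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">Simple Pancakes</h1>
  <span itemprop="recipeIngredient">2 cups flour</span>
  <span itemprop="recipeIngredient">1 egg</span>
</div>
"""


class MicrodataReader(HTMLParser):
    """Collects the item type and the text of every itemprop element."""

    def __init__(self) -> None:
        super().__init__()
        self.item_type = None
        self.properties: dict[str, list[str]] = {}
        self._current = None  # itemprop name of the element being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        self._current = attributes.get("itemprop")

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.properties.setdefault(self._current, []).append(text)

    def handle_endtag(self, tag):
        self._current = None


reader = MicrodataReader()
reader.feed(PAGE)
reader.close()

print(reader.item_type)
print(reader.properties)
```

Without the `itemtype` and `itemprop` attributes, the same page would be nothing more than a heading and two pieces of text; with them, it states that this is a recipe, what it is called and what goes into it.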
Once someone enters a query, apart from looking for other contextual clues, search engines consider the triples and the metadata you have provided, and based on that input they are able to narrow down the scope of the entities they should be interested in and return a relevant result. This is not only a matter of convenience, but an approach that allows for much more comprehensive, searchable, and effective gathering, storing and retrieval of data.
Naturally, as far as we have come with the development of the semantic web, there are still a lot of challenges that await us.
For instance, while defining entities as such does make them more universal than simple keywords ever were, trying to expand a particular vocabulary to other languages (spoken languages, not web syntax) is one of the issues that we are facing. That is to say, while an entity in one language is the same thing in another language, helping the search engines recognize it as such is not really that simple. This issue is pronounced enough on the most basic, lexical level, when you consider the fact that one word of a language could be translated by several different words of another language, based on the context, style, format and a number of other influences, but when you bring idioms and set phrases into the equation, things turn to chaos once again.
However, even if we are faced with some obstacles that seem insurmountable (explaining the intricacies of style to machines, for instance), we should be encouraged by the fact that we have already come further than anyone would have thought possible just a couple of years ago, and that research into the semantic web is gaining momentum with each passing day. As more data is collected, new ways of structuring it are explored, and our habits and knowledge improve, this field of research seems to be promising an entirely different experience when it comes to the way we search and the results we are given.