Search engine technology
Modern web search engines are highly intricate software systems that
employ technology that has evolved over the years. There are a number of
sub-categories of search engine software that are applicable to
specific searching needs. These include web search engines (e.g.
Google), database or structured data search engines (e.g. Dieselpoint),
and mixed search engines or enterprise search. The more prevalent search
engines, such as Google and Yahoo!, utilize hundreds of thousands of
computers to process trillions of web pages in order to return fairly
well-aimed results, while handling many thousands of queries per second.
Due to this high volume of queries and text processing, the software is
required to run in a highly distributed environment with a high degree
of redundancy.
Modern search engines have the following main components.
Web Search Engines
Search engines that are expressly designed for searching web pages,
documents, and images were developed to facilitate searching through
large, unstructured collections of resources. They are engineered to
follow a multi-stage process: crawling a vast store of pages and
documents to extract the salient terms from their contents, indexing
those terms in a semi-structured form (typically a purpose-built
database), and finally resolving user queries to return relevant results
and links to the indexed documents or pages.
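The following is a minimal sketch of those three stages, using a small in-memory stand-in for the web instead of real HTTP fetches; the page contents, URLs, and function names are illustrative assumptions, not any particular engine's implementation.

```python
from collections import defaultdict

PAGES = {  # stand-in "web": URL -> page text (illustrative data)
    "http://a.example": "search engines crawl and index pages",
    "http://b.example": "an index maps terms to pages",
}

def crawl(seed_urls):
    """Stage 1: fetch page contents (here, a simple dictionary lookup)."""
    return {url: PAGES[url] for url in seed_urls if url in PAGES}

def build_index(corpus):
    """Stage 2: build a term -> set-of-URLs inverted index."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Stage 3: return the URLs that contain every term of the query."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

corpus = crawl(PAGES)
index = build_index(corpus)
print(search(index, "index pages"))  # both sample pages contain "index" and "pages"
```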
Crawl
In the case of a wholly textual search, the first step in classifying
web pages is to find an ‘index item’ that might relate expressly to the
‘search term.’ In the past, search engines began with a small list of
URLs as a so-called seed list, fetched the content, and parsed the links
on those pages for relevant information, which subsequently provided
new links. The process was highly cyclical and continued until enough
pages were found for the searcher’s use. These days, a continuous crawl
method is employed rather than incidental discovery based on a seed
list. The crawl method is an extension of that discovery method, except
that there is no fixed seed list, because the system never stops
crawling.
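A hedged sketch of the seed-list style discovery crawl described above, using only the Python standard library: start from a handful of URLs, fetch each page, extract its links, and queue newly discovered URLs until a page budget is exhausted. The seed URL and the page limit are illustrative assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    seen, queue, fetched = set(seed_urls), deque(seed_urls), {}
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text pages are simply skipped
        fetched[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:        # visit each URL at most once
                seen.add(absolute)
                queue.append(absolute)
    return fetched

# Example (requires network access):
# pages = crawl(["https://example.com/"])
```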
Most search engines use sophisticated scheduling algorithms to decide
when to revisit a particular page, based on its relevance. These
algorithms range from a constant visit interval with higher priority for
more frequently changing pages, to an adaptive visit interval based on
several criteria such as frequency of change, popularity, and overall
quality of the site. The speed of the web server hosting the page, as
well as resource constraints such as the amount of hardware and
bandwidth, also figure in.
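The following sketch illustrates one possible adaptive revisit policy of the kind described above: a page that has changed since the last visit is rescheduled sooner, an unchanged page progressively later, with popularity shortening the interval. The interval bounds and the halving/doubling rule are arbitrary illustrative choices, not a documented production policy.

```python
import hashlib
import time

MIN_INTERVAL = 3600           # revisit no more often than hourly (assumption)
MAX_INTERVAL = 30 * 86400     # and no less often than monthly (assumption)

def schedule_revisit(prev_interval, old_content, new_content, popularity=1.0):
    """Return (new_interval_seconds, next_visit_timestamp) for one page."""
    changed = (hashlib.sha256(old_content.encode()).digest()
               != hashlib.sha256(new_content.encode()).digest())
    if changed:
        interval = prev_interval / 2      # page is changing: come back sooner
    else:
        interval = prev_interval * 2      # page is stable: back off
    interval /= max(popularity, 0.1)      # popular pages are refreshed more often
    interval = min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
    return interval, time.time() + interval

# Example: a popular page whose content changed since the last daily visit.
print(schedule_revisit(86400, "old text", "new text", popularity=2.0))
```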
Search engines generally crawl more pages than they make readily
available for searching, because content across web pages is so often
irrelevant, duplicative, or simply wrong. In fact, more than half of the
web pages that are available for indexing are made up of duplicative or
otherwise useless content.
The pages that are discovered by web crawls are often distributed and
fed into another computer that creates a map of the resources uncovered.
This map resembles a graph, in which the different pages are represented
as nodes connected by the links between them. The data is stored in
multiple data structures that permit quick access by algorithms that
compute a popularity score for pages on the web based on how many links
point to a given page, which determines how readily any particular
resource can be surfaced. For a broad query such as ‘Egypt,’ for
example, link-based popularity helps decide whether pages about Egyptian
politics or pages about attractions in Cairo appear first. One such
algorithm, PageRank, proposed by Google founders Larry Page and Sergey
Brin, is well known and has attracted a great deal of attention. The
idea of doing link analysis to compute a popularity rank is older than
PageRank, and other variants of the same idea are currently in use.
These ideas can be grouped into three main categories: rank of
individual pages, rank of web sites, and nature of web site content.
Search engines often differentiate between internal links and external
links, because site owners are prone to self-promotion. Link map data
structures typically store the anchor text embedded in the links as
well, because anchor text can often provide a good-quality summary of a
web page’s content.
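The following is a minimal sketch of the PageRank idea over a toy link graph: rank is spread iteratively from each page to the pages it links to, with a damping factor of 0.85. The graph below is invented for illustration, and dangling pages are handled in a simplified way.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # simplification: dangling pages do not redistribute rank
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {  # invented link structure: node -> outgoing links
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}
print(sorted(pagerank(toy_graph).items(), key=lambda kv: -kv[1]))
```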
Searching for text-based content in databases presents a few special
challenges, from which a number of specialized search engines have
flourished. Databases can be slow when solving complex queries (with
multiple logical or string-matching arguments). Databases allow
pseudo-logical queries, which full-text searches do not use. No crawling
is necessary for a database, since the data is already structured.
However, it is often necessary to index the data in a more compact form
designed to allow faster search.
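As a sketch of that kind of indexing, the example below tokenizes a text column of a small SQLite table into a term-to-row-id inverted index, so that text lookups no longer require scanning the table. The table, columns, and data are invented for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [("red leather ball",), ("formal ball gown",), ("tennis racket",)],
)

# Index pass: tokenize the text column once into term -> set of row ids.
index = defaultdict(set)
for row_id, description in conn.execute("SELECT id, description FROM products"):
    for term in description.lower().split():
        index[term].add(row_id)

def lookup(query):
    """Return the ids of rows whose description contains every query term."""
    sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(lookup("ball"))          # rows 1 and 2 both mention "ball"
print(lookup("leather ball"))  # only row 1 matches both terms
```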
Sometimes the data being searched contains both database content and web
pages or documents. Search engine technology has developed to respond to
both sets of requirements. Most mixed search engines are large web
search engines, like Google, that search through both structured and
unstructured data sources, which compounds the problem of query
ambiguity. Take, for example, the word ‘ball’: in its simplest terms, it
returns more than 40 variations on Wikipedia alone. Did you mean a ball,
as in the social gathering or dance? A soccer ball? The ball of the
foot? Pages and documents are crawled and indexed in a separate index.
Databases are also indexed from various sources. Search results are then
generated for users by querying these multiple indices in parallel and
combining the results according to defined rules.
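The sketch below illustrates that pattern: the same query is issued to a web-page index and a database index in parallel, and the two result lists are merged under a simple rule (normalizing each source's scores before interleaving). Both index backends are stand-in functions; a real engine would call separate services, and the merging rule shown is only one of many possibilities.

```python
from concurrent.futures import ThreadPoolExecutor

def query_web_index(query):
    # Stand-in for the web-page index: (result, backend-specific score) pairs.
    return [("http://a.example/ball-sports", 12.0), ("http://b.example/ball-dance", 7.5)]

def query_db_index(query):
    # Stand-in for the structured-data index.
    return [("products: soccer ball", 3.0), ("products: ball gown", 1.0)]

def mixed_search(query):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(backend, query)
                   for backend in (query_web_index, query_db_index)]
        result_lists = [future.result() for future in futures]
    merged = []
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results) or 1.0
        # Rule: normalize each source's scores to [0, 1] so they are comparable.
        merged.extend((item, score / top) for item, score in results)
    return sorted(merged, key=lambda pair: -pair[1])

print(mixed_search("ball"))
```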