Oracle Contextual Intelligence

Oracle Contextual Intelligence (Context) analyzes textual content on web pages at the massive scale and speeds required by automated advertising technology to determine the context (the central meaning) of the content on a page. (Oracle Contextual Intelligence was previously called Grapeshot.)

You can use Context to have pages included in or excluded from consideration for your automated advertising buys, based on how appropriate the content on those pages is to the brand being advertised.

Context is part of the Oracle Data Cloud, which also includes Oracle Measurement — providing real-time analytics, fraud detection, and accreditation of more than 50 metrics for ad impression viewability — and a proprietary DMP (Data Management Platform) that has unparalleled access to industry data.

Primary use cases

The primary uses for Context are brand safety, contextual targeting in advertising, and trend prediction.

Primary users

Context users include:

Language support

Context is available for 31 of the most-used languages in the world and can be implemented for dozens more. It can distinguish among important variations of languages, such as between British and American English.

How Context works

Context has two main processes: crawling and contextually analyzing pages, then matching them against carefully curated sets of keywords. These keyword sets are called keyword segments.

Note: Context works in the pre-bid environment: it notifies ad systems to include or exclude pages before an advertising bid has been placed. This feature contrasts with systems that block an ad from appearing after a bid has been placed and won for a spot on a page. By using pre-bid technology, advertisers avoid being billed for placements on which their ads are never shown.

Crawling and analyzing pages

Context crawls and contextually analyzes text in response to requests. The technology and architecture support loads exceeding 3 million queries per second (QPS), a level found in massive programmatic advertising implementations, and also offer sub-millisecond response times.

The request for a page at a given URL can originate from a number of sources but typically comes from a partner’s technology platform or server that handles ad serving, measurement, and/or verification. Pages are crawled at a rate of well over 100,000 per minute and are re-crawled to check for changes, at a frequency that depends on how often the epicenter of the page at that URL tends to change.

The crawl information for any individual publisher's website is used for all partner implementations. To account for web pages that are identical but have subtly different URLs (for example, because web analytics software appends parameters toward the end of the URL), Context strips out those parameters.
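As a rough illustration of this kind of URL normalization, the following sketch removes a few common analytics parameters so that two variants of the same page resolve to one URL. The parameter names (`utm_*`, `gclid`, `fbclid`) and the `normalize_url` helper are assumptions for the example; the actual set of parameters that Context strips is not listed in this topic.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters, for illustration only.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Return the URL without tracking parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


a = normalize_url("https://example.com/story?id=42&utm_source=newsletter&utm_medium=email")
b = normalize_url("https://example.com/story?id=42&gclid=abc123")
print(a)       # both variants reduce to the same URL
print(a == b)
```

Parameters that identify the content itself (here, `id`) are kept, so only the analytics noise is dropped.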

Sites that Context is attempting to crawl can exclude or block the crawlers by various means, such as by using a robots.txt file or by specifically excluding Context’s crawler. If Context is unable to crawl a page, it reports that to its partner or customer via a signal prefixed “gx_”.
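To show how a robots.txt exclusion of a single crawler behaves, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent name `ExampleContextBot` is a placeholder, not the documented name of Context's crawler, and the robots.txt content is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: blocks one named crawler entirely, and blocks
# /private/ for everyone else. "ExampleContextBot" is a placeholder name.
ROBOTS_TXT = """\
User-agent: ExampleContextBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/news/story.html"
named_crawler_allowed = parser.can_fetch("ExampleContextBot", url)
other_crawler_allowed = parser.can_fetch("SomeOtherBot", url)
print(named_crawler_allowed, other_crawler_allowed)
```

A crawler that honors these rules would skip the page and report that it could not be crawled.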

Concentrating on the epicenter

After Context’s crawler has received a request to scan a web page, it finds and crawls that page. It then downloads the page's core textual content (the epicenter) from the page's HTML. We do not download or analyze the CSS, JavaScript, images, navigation, footer, sidebars, and other areas tangential to the main textual content on the page. For example, on a typical news web page, Context’s technology downloads and analyzes the central text of that page. It does not download or analyze the embedded or side elements, which may include related stories, additional linked headlines, images, videos, and so on.
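The idea of keeping only the central text can be sketched with Python's standard `html.parser`. This is a simplified stand-in, not Context's epicenter detection: it assumes the tangential areas are marked up with the tags listed in `SKIPPED_TAGS`, which real pages often do not do so cleanly.

```python
from html.parser import HTMLParser

# Tags treated as outside the main text; an assumption for this sketch.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "aside", "header"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = """
<html><head><style>p { color: red; }</style><script>track();</script></head>
<body>
  <nav><a href="/">Home</a> <a href="/sport">Sport</a></nav>
  <article><h1>Cup final goes to penalties</h1><p>The match ended level after extra time.</p></article>
  <aside><h2>Related stories</h2></aside>
  <footer>Contact us</footer>
</body></html>
"""
print(extract_main_text(PAGE))
```

Only the headline and body paragraph of the article survive; the navigation, related stories, footer, styles, and scripts are ignored.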

Re-crawling pages

Pages change, of course, and need to be re-crawled. The crawler maintains an estimate of how frequently a page changes. If a page has been modified since the last time it was crawled, the interval between crawls is halved, down to a minimum of four hours. If it hasn’t been modified, the interval is doubled, up to a maximum of 30 days. In this way, the rate of re-crawl soon matches the modification rate, providing efficiency in apportioning resources.
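The halving and doubling rule can be written in a few lines, reading it as acting on the time between crawls. The four-hour and 30-day limits come from the description above; the function name and the one-day starting interval are assumptions for the sketch.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(hours=4)   # crawl at most once every four hours
MAX_INTERVAL = timedelta(days=30)   # crawl at least once every 30 days


def next_interval(current: timedelta, page_changed: bool) -> timedelta:
    """Halve the wait after a change, double it otherwise, within the limits."""
    if page_changed:
        return max(current / 2, MIN_INTERVAL)
    return min(current * 2, MAX_INTERVAL)


# Assumed starting point of one day; three changed crawls, then one unchanged.
interval = timedelta(days=1)
for changed in (True, True, True, False):
    interval = next_interval(interval, changed)
    print(interval)
```

A frequently changing page quickly settles at the four-hour floor, while a static page drifts out to the 30-day ceiling.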

Categorizing pages

Once a web page has been crawled, its information is kept in a document store, a centralized area for information from all crawled pages. Context’s central data store holds information from more than 5 billion documents at a time, and is an ever-growing, frequently updated record of all pages that have been crawled.

From this document store, a web page’s record can be run against our WordRank™ algorithm to determine the weighted value of the language on the page and determine if there is a match to the keyword segments being used by a partner. If a web page is requested but not found in the document store, the URL is sent to the crawler layer to be crawled, processed, and stored in the document store.
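The lookup-then-crawl flow described above resembles a simple cache-miss pattern. In this sketch, a dictionary stands in for the document store and a queue stands in for the crawler layer; both, along with the `fetch_record` name, are assumptions made for illustration.

```python
from collections import deque

# Stand-ins for the document store and the crawler layer's work queue.
document_store = {"https://example.com/a": {"text": "stored page text"}}
crawl_queue = deque()


def fetch_record(url: str):
    """Return the stored record, or queue the URL for crawling and return None."""
    record = document_store.get(url)
    if record is None and url not in crawl_queue:
        crawl_queue.append(url)
    return record


hit = fetch_record("https://example.com/a")
miss = fetch_record("https://example.com/b")
print(hit)
print(miss, list(crawl_queue))
```

A second request for the same missing URL does not queue it twice, which keeps the crawl workload proportional to the number of distinct unknown pages.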

Keyword segments, matches, and signals

When a page is requested for analysis, its categorized data is then matched against the relevant keyword segments. Keyword segments are another elemental factor at the heart of Context’s technological processes. Keyword segments (sometimes referred to simply as "segments") are collections of keywords and phrases that, when matched against our categorization of the text of a web page, indicate whether that page's epicenter is contextually relevant to that keyword segment.

Segments and sub-segments build upon each other in a tree structure. So, for example, the “Sports” segment may include “gs-sports-football” and “gs-sports-basketball” and each of those may include sub-segments relevant to their individual sports.

For example, a keyword segment concerning "sports" would contain multiple words and phrases that would indicate – if they appear with enough weight on a page – that the page is about sports.

Each language we cover has hundreds of standard segments covering an array of topics. The segments are constructed by our teams of editors, trained in linguistics and in our processes. Partners can also create, or ask us to create, custom segments to cover topics or niches not sufficiently covered for their purposes by our standard segments.

Once a page has been crawled, indexed and categorized, that information is then compared to the relevant keyword segments to determine if there is a contextual match. Matches are scored, to indicate how strong the match is.

Once a match is found and scored, a response is then sent via a signal to our partners so they can determine how the page should be treated for advertising purposes.

Partners set a threshold for the level of match appropriate to their purposes. They can adjust that threshold over time as they wish, for example to surface more pages for bidding to increase their reach for an advertising campaign by including more impressions, or to increase the level of brand safety, thereby excluding more pages.
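To make the relationship between weighted keywords, scores, and thresholds concrete, here is a deliberately simple sketch. It does not reproduce the WordRank algorithm, which is not described in this topic; the scoring formula (summed keyword weights divided by word count), the keyword weights, and the threshold values are all assumptions. It also handles single words only, whereas real keyword segments include phrases.

```python
import re

# Hypothetical keyword segment: term -> weight.
SPORTS_SEGMENT = {"match": 1.0, "goal": 2.0, "penalty": 2.0, "league": 1.5}


def score_page(text: str, segment: dict) -> float:
    """Sum the weights of matching words, normalized by the page's word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(segment.get(word, 0.0) for word in words) / len(words)


def is_match(text: str, segment: dict, threshold: float) -> bool:
    return score_page(text, segment) >= threshold


TEXT = "The league match was settled by a late penalty goal"
print(score_page(TEXT, SPORTS_SEGMENT))
print(is_match(TEXT, SPORTS_SEGMENT, threshold=0.5))  # looser threshold
print(is_match(TEXT, SPORTS_SEGMENT, threshold=0.8))  # stricter threshold
```

Lowering the threshold lets more pages count as matches, which increases reach; raising it admits fewer pages, which suits stricter targeting or brand safety settings.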

The cache

To make responses as fast as possible, you can elect to situate a cache close to the systems you are using, so that the information on categorized pages can be retrieved and matched with minimal delay. The cache is updated every time a page is rescanned and categorized.

Signal nomenclature

Context provides signals in standardized alphanumeric formats that include a set prefix, descriptive word(s), and the “_” or "-" symbols. The nomenclature is constructed to be easily understandable and differentiated. As one example, a page found to match our "sports" segment sends a response of “gs_sports,” as well as a “score” to indicate how strong the match is for that page to the segment. The "gs" prefix indicates "Grapeshot Standard," as described below. There may be sub-segments as well, such as "sports-football" or "sports-tennis" which can be used separately for matching or rolled up into an umbrella segment.
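The following sketch shows one way a consumer of these signals might split a signal name into its prefix and segment path, and roll sub-segment scores up into an umbrella segment. The exact signal syntax is an assumption here (prefix, then "_", then levels joined by "-"), as are the score values and the choice to roll up by taking the highest score.

```python
def parse_signal(signal: str):
    """Split 'gs_sports-football' into ('gs', ['sports', 'football'])."""
    prefix, _, path = signal.partition("_")
    return prefix, path.split("-")


def roll_up(scores: dict) -> dict:
    """Map each umbrella segment to the highest score among its sub-segments."""
    rolled = {}
    for signal, score in scores.items():
        prefix, levels = parse_signal(signal)
        umbrella = f"{prefix}_{levels[0]}"
        rolled[umbrella] = max(rolled.get(umbrella, 0.0), score)
    return rolled


# Hypothetical scored signals for one page.
page_scores = {"gs_sports-football": 0.8, "gs_sports-tennis": 0.3, "gs_travel": 0.5}
print(parse_signal("gs_sports-football"))
print(roll_up(page_scores))
```

Keeping the sub-segment scores alongside the rolled-up values lets a partner target either the narrow topic or the umbrella segment.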

We have seven overarching signal response types:

Brand safety options

Context provides two standard levels for identifying risks associated with the textual content of web pages:

A page that has been successfully processed and does not match any of the standard unsafe (gv_) segments is identified as safe (gv_safe) and offered for targeting.

Custom safe-from segments can also be added. If a page matches a safe-from segment, the page is negatively targeted.
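A pre-bid eligibility check built on these rules might look like the sketch below. Only the `gv_` prefix and the `gv_safe` signal come from this topic; the other signal names (`gv_arms`, `gs_sports`, `gs_news`, `custom_airline_incidents`) and the function names are hypothetical.

```python
def unsafe_matches(matched: set) -> set:
    """Return the standard unsafe (gv_) segments a page matched, if any."""
    return {s for s in matched if s.startswith("gv_") and s != "gv_safe"}


def is_eligible(matched: set, safe_from: set) -> bool:
    """Eligible only if no standard unsafe and no custom safe-from segment matched."""
    return not unsafe_matches(matched) and not (matched & safe_from)


# Hypothetical custom safe-from segment chosen by an advertiser.
SAFE_FROM = {"custom_airline_incidents"}

print(is_eligible({"gv_safe", "gs_sports"}, SAFE_FROM))
print(is_eligible({"gv_arms", "gs_news"}, SAFE_FROM))
print(is_eligible({"gv_safe", "custom_airline_incidents"}, SAFE_FROM))
```

The third case shows why safe-from segments are useful: a page can pass the standard checks and still be excluded for a specific brand.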

Standard unsafe segments map to brand safety parameters that have been agreed upon by industry trade groups such as the 4A's. In September 2018, the American Association of Advertising Agencies (4A's) Advertiser Protection Bureau (APB) introduced its Brand Safety framework. The framework lists 13 content categories that, in the words of a 4A's news release, "pose risk to advertisers, whereby advertisers might choose to adopt a 'never appropriate' position for their ad buys." These 13 categories and the 4A's definition of them are identified in the table below, along with corresponding Context avoidance categories where available and the ways in which such content is otherwise addressed for web page textual content.

| # | 4A Framework: Category | 4A Framework: Definition | Context Avoidance: Category | Context Avoidance: Definition |
|---|---|---|---|---|
| 1 | Adult & Explicit Sexual Content | Illegal sale, distribution, and consumption of child pornography. Explicit or gratuitous depiction of sexual acts, and/or display of genitals, real or animated. | Adult | Avoids mature and sexual web page textual content. |
| 2 | Arms & Ammunition | Promotion and advocacy of sales of illegal arms, rifles, and handguns. Instructive content on how to obtain, make, distribute, or use illegal arms. Glamorization of illegal arms for the purpose of harm to others. Use of illegal arms in unregulated environments. | Arms | Avoids web page textual content around guns and weapons. |
| 3 | Crime & Harmful acts to individuals and Society and Human Right Violations | Graphic promotion, advocacy, and depiction of willful harm and actual unlawful criminal activity including murder, manslaughter and harm to others. Explicit violations of human rights, such as trafficking and slavery. | Crime | Segments include serious, sex and violent. |
| 4 | Death or Injury | Promotion or advocacy of death or injury. Murder or willful bodily harm to others. Graphic depictions of willful harm to others. | Death or Injury | Segments include air, fire, rail, road and sea. |
| 5 | Online piracy | Pirating, copyright infringement, and counterfeiting. | Download | Relates to online piracy and spam. |
| 6 | Hate speech & acts of aggression | Unlawful acts of aggression based on race, nationality, ethnicity, religious affiliation, gender, or sexual preference. Behavior or commentary that incites such hateful acts, including bullying. | Hate speech | Avoids derogatory terms including racism, homophobia, and political terms. |
| 7 | Military conflict | Incendiary content provoking, enticing, or evoking military aggression. Live action footage/photos of military actions and genocide or other war crimes. | Military | Avoids conflict, war and negative foreign policy web page textual content. |
| 8 | Obscenity and Profanity | Excessive use of profane language or gestures and other repulsive actions with the intent to shock, offend, or insult. | Obscenity | Avoids web page textual content that includes offensive terms. |
| 9 | Illegal Drugs | Promotion or sale of illegal drug use, including abuse of prescription drugs. Federal jurisdiction applies, but allowable where legal local jurisdiction can be effectively managed. | Drugs | Avoids web page textual content related to consumption of drugs, including recreational and performance-enhancing use. |
| 10 | Spam or Harmful Content | Malware/Phishing | None | Does not map directly to a specific Context avoidance category, although certain URLs related to this definition may be covered in some part by Context's Download category. Additionally, Context, as part of our monitoring of keyword segments, manually adds Spam or Harmful Content sites to our internal block-list of pages not to be crawled. If requested by Context clients, these sites return a gv_spam_or_harmful_site categorization. |
| 11 | Terrorism | Promotion and advocacy of graphic terrorist activity involving defamation, physical and/or emotional harm of individuals, communities, and society. | Terrorism | Avoids web page textual content around terrorist attacks. |
| 12 | Tobacco/eCigarettes/Vaping | Promotion and advocacy of tobacco and e-cigarette (vaping) and alcohol use to minors. | Tobacco | Avoids all web page textual smoking content, including vaping and e-cigarettes. |
| 13 | Sensitive Social Issue/ Violations of Human Rights | Disrespectful and harmful treatment of sensitive social topics such as abortion, extreme political positions, and so on. Acts, language, and gestures deemed illegal, not otherwise outlined in this framework. Examples include harm to self or others and animal cruelty. Targeted harassment of individuals and groups. | None | Does not map directly to a specific Context avoidance category, although some text related to this definition may be covered in some part by Context's Hate Speech and Obscenity categories. Sites deemed inappropriate as noted in category 10 above may be removed from our crawl list and be designated with a harmful_site categorization. |

Video

For video, Context ingests and processes the audio matched to the video, converting it to text. That text is then analyzed and matched in similar fashion to the processes described for epicenter text noted above.

Apps

Context is able to evaluate applications built for mobile device environments through a different methodology. In general, such mobile applications do not allow the types of page scanning available for web pages. We do, however, look at descriptions in the Google Play and Apple app stores, and we categorize apps with a PEGI rating of 3 (for Google) or an age rating of 4+ (for iOS) as “safe for mobile.”