Last week was yet another digital advertising event week in New York City – ad:tech and surrounding events. Everyone involved was talking about all the great advances in advertising technology happening right now. Nobody was talking about the advances in publisher technology. Frankly, there isn’t much to talk about. Except that there is.
A certain group of publishers have advanced light years ahead of the current discourse. Advanced beyond the idea of programmatic efficiency. Advanced beyond the idea of creating segments by managing cookies. Advanced far beyond unactionable analytic reports related to viewability. These publishers, many of whom are the most revered names in media, have deployed machine learning and artificial intelligence in their media. They have moved beyond optimizing delivery of impressions to predictive algorithms optimizing ad server decisions in real-time against the performance of their media.
The movement towards publishers understanding and optimizing their media is one of retaining the value of ownership. It may seem obvious that publishers own their media but they don’t really. In reality when the buyer is the one that is optimizing the performance of that media and keeping the learning – even using that learning as leverage in what they are buying – the value in ownership, in fact the entire value of a publishing business, is called into question.
The smartest and most forward thinking publishers understand the stakes. We are entering a world where all media is performance media. Brands are getting smarter about attribution and measurement. They are getting smarter about marketing mix models and seeing the value in digital over other channels. They are gaining insight into how digital influences purchase. In this new world the imperative of publishers to be as knowledgeable about their media as brands is not about competitive intelligence. It is about better performance for these very brands – their customers!
Publishers will not survive in a world where they do not know when, why, where and how someone is walking into their store. They will not survive if they do not know what customers are buying and how much they are paying. No business could survive that lack of data and intelligence. In fact, no customer really wants that type of store either. Customers want to buy products that serve their intended purpose. Customers want sellers to understand, even predict, what they will be interested in. Customer experience extends to buying media the same as buying anything else. This means marketer performance.
Publishers are ultimately responsible for the performance of their media and the happiness of their customers. Yet, many never think about it or feel helpless to do anything about it. They are the ones that will not make the transition to the performance media economy.
Like all intelligent systems, at the core this is about data. Vint Cerf famously said Google didn’t have better algorithms, they just had more data. The reality is no one has more data than publishers. Each landing on the site, each pageview by a consumer has well over a hundred dimensions of data. That data is fundamental to the structure of the web – it is linked data – and the serialization of it through each site session creates exponentially more of it. Unlike cookie data, which does not scale, this data is web scale. And it can be captured, organized and activated in real-time.
The data is a window into the consumer mindset and journey. Importantly for publishers it is an explicit first-party value exchange between the publisher and the consumer. It is an exchange of intent for content, of mindset for media. It is what brands want more than anything else. The moment, the zeitgeist, the exact time and place a consumer is considering, researching, comparing, evaluating, learning, and discovering what and where to buy. It is the single most valuable moment in media. It is right time, right place, right message. It is uniquely digital, uniquely first-party and owned solely by the publisher. It is a gold mine that Facebook and Google recognize and they have focused their recent publisher-side initiatives on capturing from publishers either unsuspecting or incapable of extracting the value themselves.
As publishers begin to understand these moments themselves, activate them for marketers and optimize the performance of the media against them, an amazing thing happens. The overall value of the media increases. It increases because of intelligence. It increases because of performance. Most important, it increases because of value being delivered to consumers. It also opens up new budgets. Over time, these systems will get smarter. With more data, publishers can even begin to sell based on performance thereby eliminating a host of issues around impression based buying and increasing overall RPM by orders of magnitude with higher effective CPMs and smarter, more efficient allocations of impressions.
Unfortunately for the hungry advertising trade publications you will not hear these most advanced publishers talking on panels about this or writing blog posts. Does Google talk about Quality Score? Does Facebook talk about News Feed? You will not hear industry trade groups arguing for publishers to sell performance to their customers either. In fact not one panel all of last week discussed first-party data created by the publisher.
Thankfully there is no hype related to this. Only performance. The people who need to know, know. As much as I’d like to share the names of every one of those publishers and agencies with you, more so I want to honor their competitive advantage in the marketplace.
The best publishers have the best content. The best content delivers the best consumers. The best consumers deliver the best performance. This is not new. The rise of the intelligent publisher in collecting, organizing, machine learning, activating and algorithmically optimizing this first-party data stream in real-time is. It’s the most groundbreaking thing in media happening right now and will be for some time because it has swung the data advantage pendulum to the publishers for the first time since data has mattered in digital media.
flambo is a Clojure DSL for Spark created by the data team at Yieldbot. It allows you to create and manipulate Spark data structures using idiomatic Clojure. The following tutorial demonstrates typical flambo API usage and facilities by implementing the classic tf-idf algorithm.
The complete runnable file of the code presented in this tutorial is located under the flambo.example.tfidf namespace, under the flambo /test/flambo/example directory. We recommend you download flambo and follow along in your REPL.
What is tf-idf?
TF-IDF (term frequency-inverse document frequency) is a way to score the importance of terms in a document based on how frequently they appear across a collection of documents (corpus). The tf-idf weight of a term in a document is the product of its tf weight and its idf weight:
tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
idf(t) = ln((total number of documents in corpus) / (1 + (number of documents with term t)))
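As a quick sanity check of the two formulas above, here is a minimal plain-Clojure sketch that needs no Spark. The helper names (tf, idf, tf-idf) and the toy corpus are assumptions made for illustration; they are not part of flambo.

```clojure
;; Hypothetical helpers mirroring the tf and idf definitions above.
(defn tf
  "Occurrences of term in doc-terms divided by the total number of terms."
  [term doc-terms]
  (double (/ (count (filter #(= term %) doc-terms))
             (count doc-terms))))

(defn idf
  "ln(total documents / (1 + documents containing term))."
  [term corpus]
  (let [df (count (filter #(some #{term} %) corpus))]
    (Math/log (/ (count corpus) (+ 1.0 df)))))

(defn tf-idf [term doc-terms corpus]
  (* (tf term doc-terms) (idf term corpus)))

;; Toy corpus, assumed for illustration: each document is a vector of terms.
(def toy-corpus
  [["nation" "war" "nation"]
   ["liberty" "equal"]
   ["war" "great"]
   ["great" "battlefield"]])

(tf "nation" (first toy-corpus))                 ; 2 of 3 terms
(idf "nation" toy-corpus)                        ; ln(4 / (1 + 1))
(tf-idf "nation" (first toy-corpus) toy-corpus)  ; product of the two
```

Note that with the 1 + df term in the denominator, a term that appears in all but one document gets an idf of zero, and a term that appears in every document gets a slightly negative one.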
Example Application Walkthrough
First, let's start the REPL and load the namespaces we'll need to implement our app:
lein repl
user=> (require '[flambo.api :as f])
user=> (require '[flambo.conf :as conf])
The flambo api and conf namespaces contain functions to access the Spark API and to create and modify Spark configuration objects, respectively.
flambo applications require a SparkContext object which tells Spark how to access a cluster. The SparkContext object requires a SparkConf object that encapsulates information about the application. We first build a Spark configuration, c, then pass it to the flambo spark-context function, which returns the requisite context object, sc:
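user=> (def master "local")
user=> (def conf {})
user=> (def env {})   ; assumed empty here for illustration
user=> (def c (-> (conf/spark-conf)
                  (conf/master master)
                  (conf/app-name "tfidf")
                  (conf/set "spark.akka.timeout" "300")
                  (conf/set conf)
                  (conf/set-executor-env env)))
user=> (def sc (f/spark-context c))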
master is a special "local" string that tells Spark to run our app in local mode. master can be a Spark, Mesos or YARN cluster URL, or any one of the special strings to run in local mode (see README.md for formatting details).
The app-name flambo function is used to set the name of our application.
As with most distributed computing systems, Spark has a myriad of properties that control most application settings. With flambo you can either set these properties directly on a SparkConf object, e.g., (conf/set "spark.akka.timeout" "300"), or via a Clojure map, (conf/set conf). We set an empty map, conf, for illustration.
Similarly, we set the executor runtime environment properties either directly via key/value strings or by passing a Clojure map of key/value strings. set-executor-env handles both.
Our example will use the following corpus:
user=> (def documents
         [["doc1" "Four score and seven years ago our fathers brought forth on this continent a new nation"]
          ["doc2" "conceived in Liberty and dedicated to the proposition that all men are created equal"]
          ["doc3" "Now we are engaged in a great civil war testing whether that nation or any nation so"]
          ["doc4" "conceived and so dedicated can long endure We are met on a great battlefield of that war"]])
where the first element of each tuple (e.g., "doc1") is a unique document id.
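We use the corpus and Spark context to create a Spark resilient distributed dataset (RDD). There are two ways to create RDDs in flambo: parallelizing an existing Clojure collection, as we do here, or reading a dataset from an external storage system.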
user=> (def doc-data (f/parallelize sc documents))
We are now ready to start applying actions and transformations to our RDD; this is where flambo truly shines (or rather burns bright). It utilizes the powerful abstractions available in Clojure to reason about data. You can use Clojure constructs such as the threading macro -> to chain sequences of operations and transformations.
To compute the term frequencies, we need a dictionary of the terms in each document filtered by a set of stopwords. We pass the RDD, doc-data, of [doc-id content] tuples to the flambo flat-map transformation to get a new, stopword-filtered RDD of [doc-id term term-frequency doc-terms-count] tuples. This is the dictionary for our corpus.
flat-map transforms the source RDD by passing each tuple through a function. It is similar to map, but the output is a collection of 0 or more items which is then flattened. We use the flambo named function macro defsparkfn to define our Clojure function gen-docid-term-tuples:
user=> (f/defsparkfn gen-docid-term-tuples [doc-tuple]
         ;; stopwords is assumed to be a previously defined set of stopword strings
         (let [[doc-id content] doc-tuple
               terms (filter #(not (contains? stopwords %))
                             (clojure.string/split content #" "))
               doc-terms-count (count terms)
               term-frequencies (frequencies terms)]
           (map (fn [term] [doc-id term (term-frequencies term) doc-terms-count])
                (distinct terms))))
user=> (def doc-term-seq (-> doc-data
                             (f/flat-map gen-docid-term-tuples)
                             f/cache))
Notice how we use pure Clojure in our Spark function definition to operate on and transform input parameters. We're able to filter stopwords, determine the number of terms per document and the term-frequencies for each document, all from within Clojure. Once the Spark function returns, flat-map serializes the results back to an RDD for the next action/transformation.
This is flambo's raison d'être. It handles all of the underlying serializations to/from the various Spark Java types, so you only need to define the sequence of operations you would like to perform on your data. That's powerful.
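Because the body of the Spark function is ordinary Clojure, its logic can be exercised at the REPL without a cluster. The sketch below is a plain-Clojure equivalent under stated assumptions: docid-term-tuples is a hypothetical stand-in that takes the stopword set as an argument, and example-stopwords is an invented set for illustration, not the one used in the tutorial.

```clojure
(require '[clojure.string :as string])

;; Invented stopword set, for illustration only.
(def example-stopwords #{"a" "and" "on" "our" "this"})

;; Plain-Clojure stand-in for the body of gen-docid-term-tuples (no Spark involved).
(defn docid-term-tuples [stopwords [doc-id content]]
  (let [terms (remove stopwords (string/split content #" "))
        doc-terms-count (count terms)
        term-frequencies (frequencies terms)]
    ;; one [doc-id term raw-frequency doc-terms-count] tuple per distinct term
    (map (fn [term] [doc-id term (term-frequencies term) doc-terms-count])
         (distinct terms))))

(docid-term-tuples example-stopwords ["docX" "a new nation and a great nation"])
;; => (["docX" "new" 1 4] ["docX" "nation" 2 4] ["docX" "great" 1 4])
```

Each tuple carries both the raw count and the per-document total, so the normalized term frequency can be derived later without revisiting the document.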
Having constructed our dictionary we f/cache (or persist) the dataset in memory for future actions.
Recall term-frequency is defined as a function of the term and document, tf(t, d). At this point we have an RDD of raw term frequencies, but we need normalized term frequencies. We use the flambo inline anonymous function macro, f/fn, to define an anonymous Clojure function to normalize the frequencies and map our doc-term-seq RDD of [doc-id term term-freq doc-terms-count] tuples to an RDD of key/value, [term [doc-id tf]], tuples. This new tuple format of the term-frequency RDD will later be used to join the inverse-document-frequency RDD and compute the final tf-idf weights.
user=> (def tf-by-doc (-> doc-term-seq
                          (f/map (f/fn [[doc-id term term-freq doc-terms-count]]
                                   [term [doc-id (double (/ term-freq doc-terms-count))]]))
                          f/cache))
Notice again how we were easily able to use the Clojure destructuring facilities on the arguments of our inline function to name parameters.
As before, we cache the results for future actions.
Inverse Document Frequency
In order to compute the inverse document frequencies, we need the total number of documents:
user=> (def num-docs (f/count doc-data))
and the number of documents that contain each term. The following step maps over the distinct [doc-id term term-freq doc-terms-count] tuples to count the documents associated with each term. This is combined with the total document count to get an RDD of [term idf] tuples:
user=> (defn calc-idf [doc-count]
         (f/fn [[term tuple-seq]]
           (let [df (count tuple-seq)]
             [term (Math/log (/ doc-count (+ 1.0 df)))])))
user=> (def idf-by-term (-> doc-term-seq
                            (f/group-by (f/fn [[_ term _ _]] term))
                            (f/map (calc-idf num-docs))
                            f/cache))
Now that we have both a term-frequency RDD of [term [doc-id tf]] tuples and an inverse-document-frequency RDD of [term idf] tuples, we perform the aforementioned join on the "terms", producing a new RDD of [term [[doc-id tf] idf]] tuples. Then, we map an inline Spark function to compute the tf-idf weight of each term per document, returning our final RDD of [doc-id term tf-idf] tuples:
user=> (def tfidf-by-term (-> (f/join tf-by-doc idf-by-term)
                              (f/map (f/fn [[term [[doc-id tf] idf]]]
                                       [doc-id term (* tf idf)]))
                              f/cache))
We again cache the RDD for future actions.
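To make the tuple shapes in the join step concrete, the following plain-Clojure sketch mimics an inner join on in-memory sequences. It is only an illustration: join-by-key is a hypothetical helper, not a flambo or Spark function, and the tf and idf numbers are invented.

```clojure
;; Invented in-memory stand-ins for the two keyed RDDs.
(def tf-pairs  [["war"   ["doc3" 0.25]]
                ["war"   ["doc4" 0.2]]
                ["civil" ["doc3" 0.25]]])
(def idf-pairs [["war" 0.28] ["civil" 0.69]])

(defn join-by-key
  "Inner join of two seqs of [k v] pairs, yielding [k [v1 v2]] per matching pair."
  [left right]
  (let [right-by-key (group-by first right)]
    (for [[k v1] left
          [_ v2] (get right-by-key k)]
      [k [v1 v2]])))

;; Same destructuring as the inline Spark function above.
(def tfidf-tuples
  (map (fn [[term [[doc-id tf] idf]]] [doc-id term (* tf idf)])
       (join-by-key tf-pairs idf-pairs)))
;; tfidf-tuples is a seq of [doc-id term weight] vectors
```

The nested destructuring form [[term [[doc-id tf] idf]]] reads directly off the joined tuple shape, which is why the term-frequency RDD was keyed by term earlier.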
Finally, to see the output of our example application we collect all the elements of our tf-idf RDD as a Clojure array, sort them by tf-idf weight, and for illustration print the top 10 to standard out:
user=> (->> tfidf-by-term
            f/collect
            ((partial sort-by last >))
            (take 10)
            clojure.pprint/pprint)
(["doc2" "created" 0.09902102579427793]
 ["doc2" "men" 0.09902102579427793]
 ["doc2" "Liberty" 0.09902102579427793]
 ["doc2" "proposition" 0.09902102579427793]
 ["doc2" "equal" 0.09902102579427793]
 ["doc3" "civil" 0.07701635339554948]
 ["doc3" "Now" 0.07701635339554948]
 ["doc3" "testing" 0.07701635339554948]
 ["doc3" "engaged" 0.07701635339554948]
 ["doc3" "whether" 0.07701635339554948])
user=>
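The printed weights can be checked by hand against the tf and idf formulas. The sketch below works backwards from the output: the term counts of 7 for doc2 and 9 for doc3 after stopword filtering are inferred from the printed scores, not stated in the tutorial, and the var names are made up for illustration.

```clojure
;; Each listed term occurs once in its document and in only one of the
;; four documents, so df = 1 and idf = ln(4 / (1 + 1)).
(def idf-unique (Math/log (/ 4 (+ 1.0 1))))

;; Inferred (not stated in the tutorial): 7 terms remain in doc2 and
;; 9 in doc3 after stopword filtering.
(def doc2-score (* (double (/ 1 7)) idf-unique))
(def doc3-score (* (double (/ 1 9)) idf-unique))

;; Compare with the printed values, allowing for floating-point rounding.
(defn close? [a b] (< (Math/abs (- a b)) 1e-12))
(close? doc2-score 0.09902102579427793)
(close? doc3-score 0.07701635339554948)
```

Terms shared across documents, such as "nation" or "war", have a higher document frequency and therefore a lower idf, which is why they do not appear in the top 10.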
You can also save the results to a text file via the flambo save-as-text-file function, or an HDFS sequence file via save-as-sequence-file, but we'll leave those APIs for you to explore.
And that's it, we're done! We hope you found this tutorial of the flambo API useful and informative.
flambo is being actively improved, so you can expect more features as Spark continues to grow and we continue to support it. We'd love to hear your feedback on flambo.
Last December Yieldbot open-sourced Marceline, our Clojure DSL for Storm’s Trident framework. We are excited to release our first major update to Marceline, version 0.2.0.
The primary additions in this release are wrappers for Storm’s built-in metrics system. Storm’s metrics API allows topologies to record and emit metrics. Read more on Storm metrics in the official documentation. We run production topologies instrumented with Marceline metrics and have found it to be stable; YMMV! Please file issues on GitHub if you encounter bugs or have ideas for how Marceline could be improved. See the Metrics section of the README for usage. Also note that Marceline’s metrics can be useful for any Clojure Storm topologies, either with vanilla Storm or Trident.
Marceline’s exposure of Storm metrics has been very useful for monitoring the behavior of Yieldbot’s topologies. Friction around instrumentation has been greatly reduced. Code smells are down. Metrics now entail fewer lines of code and less duplication. An additional architectural benefit is that dependencies on external services can be isolated to individual topology components. It is painless to add typical metrics while maintaining enough flexibility for custom metrics when necessary. We have designed Marceline’s metrics specifically with the goal to leverage Storm’s metrics API unobtrusively.
As Yieldbot’s backend scales it is increasingly crucial to monitor topologies. Simultaneously, new features require iterations on what quantities are monitored. While topology metrics are primarily interesting to developers, these metrics are often directly related to data-driven business concerns. Several of Yieldbot’s Key Performance Indicators (KPIs) are powered by Storm and Marceline, so the availability of a fantastic metrics API translates to greater transparency within the organization.
If you’re interested in such data engineering topics as this, check out some of the exciting careers at Yieldbot!
Yieldbot is pleased to announce the public release of Marceline, A Clojure DSL for Trident.
Storm plays a central role in Yieldbot's real-time data processing systems. From data collection and ETL to powering on-line machine learning algorithms, we rely heavily on Storm to process vast amounts of data efficiently. Trident is a high-level abstraction on top of Storm, analogous to Cascading for Hadoop. Trident, like Cascading, is written in Java. This simply would not do.
Clojure, a Lisp dialect which runs on the JVM, forms the base of the software stack for the data team at Yieldbot. We love Clojure because it allows us to quickly and interactively build our data processing systems and machine learning algorithms. Clojure gives us the REPL-based development of a dynamic, functional language with the performance and stability of the JVM. With Marceline, we get the best of both worlds by being able to develop and test our Trident topologies in Clojure, uberjar them up and ship them off to our production environments.
Marceline is still young, but we have been running it in production without issue. For more information about how to use it, including examples, please see the README on GitHub. Special thanks to Dan Herrera and Steven Surgnier for their help in testing, writing documentation and providing additional examples.
The first two questions we usually get asked by publishers are:
1) What do you mean by “intent”?
2) How do you capture it?
So I thought it was time to blog in a little more detail about what we do on the publisher side.
The following is what we include in our Yieldbot for Publishers User Guide.
Yieldbot for Publishers uses the word “intent” quite a bit in our User Interface. Webster’s dictionary describes intent as a “purpose” and a “state of mind with which an act is done.” Behavioral researchers have also said intent is the answer to “why.” Much like the user queries Search Engines use to understand intent before serving a page, Yieldbot extracts words and phrases to represent the visitor intent of every page view served on your site.
Since Yieldbot’s proxies for visit intent are keywords and phrases the next logical question is how we derive them.
Is Yieldbot a contextual technology? No. Is Yieldbot a semantic technology? No. Does Yieldbot use third-party intender cookies? Absolutely not!
Yieldbot is built on the collection, analytics, mining and organization of massively parallel referrer data and massively serialized session clickstream data. Our technology parses out the keywords from referring URLs – and after a decade of SEO almost every URL is keyword rich - and then diagnoses intent by crunching the data around the three dimensions of every page-view on the site. 1) What page a visitor came from 2) what page a visitor is about to view and 3) what happens when it is viewed.
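To illustrate what parsing keywords out of a referring URL can look like, here is a minimal Clojure sketch. It is not Yieldbot's implementation: referrer-keywords is a hypothetical helper, the assumption that search terms arrive in a q query parameter is ours, and the example.com URLs are placeholders.

```clojure
(require '[clojure.string :as string])
(import '(java.net URI URLDecoder))

(defn referrer-keywords
  "Hypothetical helper: lower-cased words from the q= query parameter
   (an assumed convention) followed by words from the URL path."
  [referrer]
  (let [uri   (URI. referrer)
        query (or (.getRawQuery uri) "")
        ;; find the first q=... pair and URL-decode its value
        q     (some (fn [pair]
                      (let [[k v] (string/split pair #"=" 2)]
                        (when (and (= k "q") v)
                          (URLDecoder/decode v "UTF-8"))))
                    (string/split query #"&"))
        query-words (if q (re-seq #"[A-Za-z]+" q) [])
        path-words  (re-seq #"[A-Za-z]+" (or (.getPath uri) ""))]
    (map string/lower-case (concat query-words path-words))))

(referrer-keywords "http://www.example.com/search?q=chicken+soup+recipe")
;; => ("chicken" "soup" "recipe" "search")

(referrer-keywords "http://www.example.com/recipes/Easy-Chicken-Soup")
;; => ("recipes" "easy" "chicken" "soup")
```

The second call shows the keyword-rich URL case: even without a search query, the path itself yields usable words. Scoring those words against on-page actions, the third dimension described above, is outside the scope of this sketch.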
Those first two dimensions are great pieces of data but it is coupling them with the third dimension that truly makes Yieldbot special.
We give our keyword data values derived from on-page visitor actions and provide the data to Publishers as an entirely new set of analytics that allow them to see their audience and pages in a new way – the keyword level. Additionally, our Yieldbot for Advertisers platform (launching this quarter) makes these intent analytics actionable by using these values for realtime ad match decisioning and optimization.
For example: Does the same intent bounce from one page and not another? Does the intent drive two pages deeper? Does the intent change when it hits a certain page or session depth? How does it change? These are things Yieldbot works to understand because if relevance were only about words, contextual and semantic technology would be enough. Words are not enough. Actions always speak louder.
All of this is automated and all of this is done on a publisher-by-publisher level because each publisher has unique content and a unique audience. The result is what we call an Intent Graph™ for the site with visitor intent segmented across multiple dimensions of data like bounce rate, pages per visit, return visit rate, geo or temporal.
Here’s an example of analytics on two different intent segments from two different publishers:
For every (and we mean every) visitor intent and URL we provide data and analytics on the words we see co-occurring with primary intent as well as the pages that intent is arriving at (and the analytics of what happens once it gets there). We also provide performance data on those words and pages.
Yieldbot’s analytics for intent are predictive. This means that the longer Yieldbot is on the site the smarter it becomes, both about the intent definitions and how those definitions will manifest into media consumption. And soon all the predictive analytics for the intent definitions will be updated in realtime. This is important because web sites are dynamic “living” entities that are always publishing new content, getting new visitors and receiving traffic from new sources. Not to mention people’s interests and intent are always changing.
Posted December 22nd, 2014 by Jonathan Mendez in
From Digiday posted September 23rd, 2014 in Company News
I have some bad news for real-time bidding. The Web is getting faster, and RTB is about to be left behind. Now, 120 milliseconds is becoming too long to make the necessary computations prior to page load that many of today’s systems have been built around.
From Ad Exchanger posted September 23rd, 2014 in Company News
Yieldbot, whose technology looks at a user’s clickstream and search data in order to determine likeliness to buy, is extending its business to give publishers a new way to monetize their first-party data.
From AdAge posted September 23rd, 2014 in Company News
Yieldbot, a New York based ad-tech company that lets advertisers buy display ads via search-style keywords, has raised a $18 million series B round of funding
From Digiday posted December 5th, 2013 in Company News
The most amazing thing about the Federal Trade Commission’s workshop about native advertising Wednesday morning is that it happened at all. As Yieldbot CEO Jonathan Mendez noted...
From Marketing Land posted October 3rd, 2013 in Company News
Publishers in women’s programming verticals such as food and recipes, home and garden, style and health and wellness have found a deep, high volume source of referral traffic from Pinterest.
From Ad Age posted October 3rd, 2013 in Company News
Pinterest may have quickly arrived as a major source of traffic to many websites, but those visitors may click on the ads they see there less often than others.