
Serendipity Is Not An Intent


Wired ran two amazing pieces on online advertising yesterday. While Felix Salmon’s piece, The Future of Online Advertising, could be Yieldbot’s manifesto, it is Can ‘Serendipity’ Be a Business Model? that deals more directly with our favorite topic: intent.

The piece discusses Jack Dorsey’s views on online advertising and where Twitter is going with it. I had a hard time connecting the dots.

“…all of that following, all of that interest expressed, is intent. It’s a signal that you like certain things,” 

Following a user on Twitter is not any kind of intent other than the intent to get future messages from that account. If it’s a signal that you like certain things, it’s a signal akin to the weak behavioral data gleaned from site visits.

Webster’s dictionary defines intent as a “purpose” and a “state of mind with which an act is done.” Intent is about fulfilling a specific goal, and those goals fall into two classes: recovery and discovery.

Dorsey goes on:

When it (Google AdWords) first launched, Dorsey says, “people were somewhat resistant to having these ads in their search results. But I find, and Google has found, that it makes the search results better.”

At the dawn of AdWords I sat with many searchers studying their behavior on the search engine results pages. What I learned, along with others like Gord Hotchkiss who was also studying searcher behavior at the time, was that people were not so much resistant to search ads as oblivious to them. People did not know they were ads!

Search ads make the results better because they are pull. Your inputs into the system are what pull the ads. So how does this reconcile with the core of Twitter’s ad products, which are promotions? Promos need scale to be effective. Promos are push. That is precisely the opposite of Search, where the smallest slices of inventory (exact match) produce the highest prices and best ROI.

Twitter is the greatest discovery engine ever created on the web. But discovery can be serendipitous or not. Sometimes, as Dorsey alludes to, you discover things you had no idea existed. More often, you discover things after you have intent around what you want to discover. This is an important distinction for Twitter to consider because it requires a different algorithm.

Discovery intent is not an algo about “how do we introduce you to something that would otherwise be difficult for you to find, but something that you probably have a deep interest in?” There is no “introduce” or “probably” in the discovery intent algo. Most importantly, there is no “we.” It’s an algo about “how do you discover what you’re interested in?”

Discovering more about what you’re interested in has always been Twitter’s greatest strength. It leverages both user-defined inputs and the rich content streams where context and realtime matching can occur. Just like Search.

If Twitter wants to build a discovery system for advertising it should look like this.

Using Lucene and Cascalog for Fast Text Processing at Scale

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain-specific APIs or interfaces for custom processing functions. When combined with Clojure's awesome Java interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, and Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on GitHub.

Our goal is to tokenize a string of text. This is almost always the first step in doing any sort of text processing, so it's a good place to start. For our purposes we'll define a token broadly as a basic unit of language that we'd like to analyze; typically a token is a word. There are many different methods for doing tokenization. Lucene contains many different tokenization routines which I won't cover in any detail here, but you can read the docs to learn more. We'll be using Lucene's StandardAnalyzer, which is a good basic tokenizer. It will lowercase all inputs, remove a basic list of stop words, and is pretty smart about handling punctuation and the like.

First, let's mock up our Cascalog query. Our inputs are going to be 1-tuples of a string that we would like to break into tokens.
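A sketch of that query, assuming text files as source and sink (the paths, taps, and `:distinct` setting here are illustrative):

```clojure
;; Read one string per line from in-path, tokenize each string,
;; and write one token per line to out-path.
(defn tokenize-query [in-path out-path]
  (?<- (hfs-textline out-path :sinkmode :replace)
       [?token]
       ((hfs-textline in-path) ?line)
       (tokenize-string ?line :> ?token)
       (:distinct false)))
```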

I won't waste a ton of time explaining Cascalog's syntax, since the wiki and docs are already very good at that. What we're doing here is reading in a text file that contains the strings we'd like to tokenize, one string per line. Each one of these strings will be passed into the tokenize-string function, which will emit one or more 1-tuples, one for each token generated.

Next let's write our tokenize-string function. We'll use a handy feature of Cascalog here called a stateful operation. It looks like this:
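A minimal sketch of the stateful operation (the helper functions `load-analyzer`, `tokenize-text`, and `emit-tokens` are assumed here, matching the utility functions discussed below):

```clojure
;; defmapcatop emits zero or more tuples per input tuple;
;; {:stateful true} gives us the three arities described below.
(defmapcatop tokenize-string {:stateful true}
  ;; 0-arity: called once per task; build the Lucene analyzer
  ([] (load-analyzer))
  ;; 1+n-arity: the state comes in first, then our input string
  ([analyzer text] (emit-tokens (tokenize-text analyzer text)))
  ;; 1-arity: cleanup hook
  ([analyzer] nil))
```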

The 0-arity version gets called once per task, at the beginning. We'll use this to instantiate the Lucene analyzer that will be doing our tokenization. The 1+n-arity version receives the result of the 0-arity function as its first parameter, plus any other parameters we define. This is where the actual work happens. The final 1-arity version is used for cleanup.

Next, we'll create the rest of the utility functions we need to load the Lucene analyzer, get the tokens and emit them back out.
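Sketched against the Lucene 3.x-era API (imports of `StandardAnalyzer`, `Version`, `CharTermAttribute`, and `java.io.StringReader` are assumed, and the exact calls vary by Lucene version):

```clojure
(defn load-analyzer []
  ;; StandardAnalyzer lowercases, drops stop words, and handles punctuation
  (StandardAnalyzer. Version/LUCENE_36))

(defn tokenize-text [analyzer text]
  ;; Run the analyzer over the string and collect the emitted terms
  (let [stream (.tokenStream analyzer "contents" (StringReader. text))
        term   (.addAttribute stream CharTermAttribute)]
    (loop [tokens []]
      (if (.incrementToken stream)
        (recur (conj tokens (.toString term)))
        tokens))))

(defn emit-tokens [tokens]
  ;; Cascalog operations emit sequences of tuples, so wrap each token in a 1-tuple
  (map vector tokens))
```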

We make heavy use of Clojure's awesome Java Interop here to make use of Lucene's Java API to do the heavy lifting. While this example is very simple, you can take this framework and drop in any number of the different Lucene analyzers available to do much more advanced work with little change to the Cascalog code.

By leaning on Lucene, we get battle-hardened, speedy processing without having to write a ton of glue code, thanks to Clojure. Since Cascalog code is Clojure code, we don't have to spend a ton of time switching back and forth between different build and testing environments, and a production deploy is just a `lein uberjar` away.


Recent Yieldbot Intent Streams Related to Steve Jobs

At Yieldbot our focus is on collection, organization and realtime activation of visit intent in publisher content. We do this not as a network but on a publisher-by-publisher basis because of a simple fact: every publisher has a unique audience and unique content. What that means is that even if a keyword is the same across publishers, the intent associated with it varies in each domain.

The original purpose of this post, however, was not to point out the flaws of network-based keyword buying vs. the performance advantage of Yieldbot’s publisher-direct model. Nor was the purpose to show you how much we truly understand publisher-side intent at the keyword level and how we use that intelligence in an automated way to achieve the highest degrees of relevant matching.

The original purpose of the post was to meet the request of a few people who had asked me to share more data visualizations of our Intent Streams™ after we originally shared a few in our recent blog post about our data visualization methods.

It occurred to me the other day that the best representative example over the last month was intent around “Steve Jobs” so below we are sharing our 30-day Intent Streams™ from four publishers. 

If you’re new to our streamgraphs, the width of the stream is the measure of pageviews of intent associated with the root intent “Steve Jobs.” The other useful data points in these visualizations are the emergence, increase, decrease and elimination of the associated intent over time, as well as how many terms are seen to be associated with the root intent.


Another way we visualize intent data is across a scatter plot. Here you see the performance of the “Steve Jobs tribute” intent compared to the other intent related to Steve Jobs, looking at the number of entrances (aka landings) on the y-axis and the bounce rate of that intent on the x-axis.

It’s important to note that the analytics in this scatter plot visualization are predictive. We are estimating performance forward over the next 30 days. The four streamgraph visualizations were based entirely on historical data – in their case, a 30-day look-back, as noted on their x-axes.

We hope you find this intent data as interesting as we do.


We Love the Mess of Ad Tech and Wouldn't Want it Any Other Way

My introduction to ad tech (roughly): "Ad networks are a mess, you wouldn't believe what a technical mess the industry is".

That was in my first meeting with David Cancel (then at Lookery, who since founded Performable, acquired by Hubspot) as I was bouncing an idea off of him that touched on the edges of the ad tech space.

Fast forward 6 months from then (almost exactly two years ago), I got a Twitter DM from David: "Friend is starting a new startup in the ad space. Looking for a CTO and/or help. Any interest?"

About a month later I was the CTO of Yieldbot.

Two years on I can say he was definitely right, ad tech is a mess. Part of it is how it's evolved and part of it is structural and baked into its nature.

I'll step it up and say this: ad serving may be the most complex distributed application there is.

The proof is in explaining why.

You Control Almost Nothing

There are so many degrees of freedom it can make your head spin.

You basically have a micro-application running embedded in a variety of site architectures each with their own particular constraints, whose users are distributed around the world running all manner of execution environments (browsers).

When you have your own site or application you still have to deal with (or choose what to support of) the myriad browsers and browser versions, complete with differences in language (JavaScript) versions (or even whether JavaScript is enabled) and issues like what fonts are available on client systems.

If you are creating a destination site, that keeps your team's hands full. If you're serving ads, that's just a warmup. Because you also don't control the site architecture that you are embedded in.

You might need to sit behind any number of ad servers the publisher might be running everything through.

You might be in iframes on the page.

You might need to execute code in specific places relative to other ad serving technology also embedded on the page.

Navigation through the site may or may not involve full page refreshes.

But that's not all...

Distributed Worldwide with Time Constraints

Remember those environments you don't control? They are the customers' websites, and they don't want their users' UX degraded.

Serve ads, optimize them for relevance, and don't slow down page load times.

Most websites have some level of focused geographic distribution to their users. Even if it's as broad as US or even US+Europe.

But for ad serving, your user base is the set of users aggregated across all of the sites using your service. The world is your oyster. And the footprint of what you need to service. Quickly.

But wait! There's more!

Content Relevant To The User At That Moment

At least if you want to be as cool as Yieldbot.

Look, a CDN serving up a static image can satisfy all of the above if all you want to do is serve the same image to every user across all of your customers' websites.

Scattershot low value ads picked fairly at random would approximate that level of ease as well.

But there's no sport (or value) in that!

Our goal here is actually to serve content that is the most relevant to what the user is doing at that particular moment. When done right (we do), everyone wins.

So - simply serve the content that best fits what the user is doing at that moment, where they came from, and what they've expressed interest in *right now*, on whatever they happen to be running on, and wherever they happen to be. And make it snappy, would ya?

We Wouldn't Want It Any Other Way

So, that's what I signed up for - and I love it. And so does the rest of the Yieldbot team.

We started our first intent-based ad serving on a live site a couple of months after coding began and started learning real-world lessons immediately. And 20 months later it's still going.

I've always loved to work on systems with complex dynamics, so considering all of the above it's not that surprising I ended up finding my way to ad tech.

What I love about building Yieldbot technology? All of the above is only half the story. We also do Big Data(tm) analytics for our system to learn the intent of the users coming to our publishers' sites. We provide publishers visualizations and data views that teach them what the intent coming to their sites is. And *then* we enable them to serve ads that make that intent actionable.



Hacking Display Advertising

Being as passionate as we are about the huge advances in dynamic web languages and event-based programming, it is tough to love display advertising. Display advertising was never about web programming or data networking. It was nested on the web as a rogue aggregation and delivery mechanism. The ability of display to deliver relevance remains hindered by this disjointed architecture. It is not threaded into site experiences and the realtime goals of the visits on the pages where it resides. This is exactly why we’re hacking it into something else.

Vanilla Sky

In Q2 2007, while most of the industry was living some sort of vanilla sky of Behavioral Targeting, one company came in and paid what at the time seemed to most people like way too much money to own a controlling interest in display. No, I’m not referring to AOL buying TACODA. The company I’m talking about has maintained a focus since day one on hacking what you are interested in at that very moment. Unlike other content aggregators, it tied its advertising system into the core user experience of its pages and the realtime relevance they delivered. Its stated display strategy has little to do with cookie matching and everything to do with realtime context and creative optimization, with the purpose of “capturing relevant moments.” It is now the most powerful company in display. That company is of course Google.

Lucid Dreaming

The first lesson here is about the medium itself. This is a different medium, and the old media buying and selling template breaks here. Behavioral Targeting may have changed names to the less scary “Audience Buying,” but seven years later performance expectations have not been met and it has dragged display into the mud of issues like privacy, ad verification, cookie stuffing and more.

By contrast, Search and email – the most important of the web’s applications - have little use for tracking people across the web, let alone reach and frequency measures. They are the opposite of that. Search (and the web itself) was built by hackers to solve information management and retrieval problems.

The second lesson is that in this medium three pieces of data are valuable: context, timing and performance. The rest is just pipes. Understanding the context of an impression or click at the moment the page is loading, and the ability to optimize the message, is what the web was built for. It took Search to turn it into a marketing channel, but the growth (~20% YoY with no end in sight) and size ($46B in 2013 per eMarketer estimates) of that channel show how powerful that data is and how helpful understanding it can be to consumers.

Waking Up

The fact that it is referred to as display “advertising” is reason enough to know it’s from another time. This medium kills advertising. Everything on the web is marketing. As Suzie Reider, national director of display sales for Google, said recently, “display needs to move beyond advertising and into interacting.” Yesterday, Krux CEO Tom Chavez wrote a thoughtful blog post on how it is time for display to move beyond advertising. We agree, and we're walking the walk.

This doesn’t mean that publishers will not show ‘graphical’ units, as Google calls them. Of course they will. It doesn’t mean that prime real estate isn’t going to be turned over to these units; it will be. We’re headed to a world with fewer messages that will be bigger and more interactive. But if we have learned anything from Search, it is that format and size don’t matter when the message is relevant, helpful and useful at that moment.

Billions of Relevant Moments

As long as the technology to understand context and timing keeps progressing as fast as it is (and at places like Betaworks, where realtime is the thesis of new-medium value creation, the startups are hacking away), there is a bright future. Search has proven that the web is the greatest and most democratic marketing medium ever created. The hackers working with dynamic web languages and event-driven programming can unlock an order of magnitude more relevant moments. There are literally billions of them out there waiting to be captured and created. At Yieldbot we see this scale every day in the inventory of web publishers, and if you’re a hacker and remaking the staid idea of advertising appeals to you, we would be interested in speaking with you.


One Slick Way Yieldbot Uses MongoDB

As a key/value store. Wait, what?  Yeah, MongoDB as a key/value store.

Why? Because we were already using Mongo and planned to move some of our data to a key/value store. But we also didn't want to wait until we made a decision on a specific solution to make the transitions in our code.

First, as if you couldn't have guessed, here's how easy a "Python library for MongoDB as a key/value store" is:


def kv_put(collection, key, value):
    """key/value put interface into mongo collection"""{'_id': key, 'v': value}, safe=True)


def kv_get(collection, key):
    """key/value get interface into mongo collection"""
    v = collection.find_one({'_id': key})
    if v:
        return v['v']


Note: kv_get() returns None if nothing found, so technically this doesn't gracefully handle the case where you want None to be a possible value.

What was the pain point?

We basically found that we had collections of nightly analytics results that were really big, and whose indexes were really, really big. The index requirements were going way beyond our 70GB-RAM server. We didn't want to shard our Mongo server because of the cost involved, so we decided to take a different approach. Since this data was the read-only result of analytics, where we once had collections whose entries were multiply indexed, we now have collections whose entries are pages of the old entries in a defined sort order, accessed as key/value.
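The paging idea can be sketched roughly like this (a hypothetical illustration: the field names, page size, and page-key format are all made up for the example):

```python
def paginate(entries, page_size, key_prefix):
    """Sort analytics entries once, then chunk them into pages.

    Each page becomes a single key/value document (stored via kv_put),
    so Mongo only needs its built-in _id index instead of several large
    secondary indexes over every individual entry.
    """
    ordered = sorted(entries, key=lambda e: e['score'], reverse=True)
    pages = {}
    for i in range(0, len(ordered), page_size):
        pages['%s:page:%d' % (key_prefix, i // page_size)] = ordered[i:i + page_size]
    return pages
```

A reader then fetches `prefix:page:0`, `prefix:page:1`, and so on with kv_get, walking the entries in the predefined order.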

How did the change work out? Great. We still haven't switched from MongoDB for this data. Still plan to, but in a startup once you address a pain point you move on to the next one.

You definitely can't argue that MongoDB isn't flexible.


Looking for a Few Good Devs

At Yieldbot we're a small team building incredible technology that's getting major publishers and advertisers hooked. We're always looking for the best technical talent to join our development team and work with us on getting to the next level, and as CTO I think it's only fair that you know what we're looking for. 🙂

What you need to know:

  • We're a small talented team looking to become a bigger talented team.
  • We work on tough interesting problems, use cutting-edge "big data" technology, and enjoy winning.
  • We're working on revolutionizing how web advertising works.
  • This is gonna be big.

What we need to know:

  • You like to solve tough problems and have a history of winning.
  • You can code in a few different languages and are expert with one of them.
  • That language is Python or you are py-curious.
  • You want to make big contributions on a small team and build a valuable business.

Some technology stuff:

  • Python's big here, but we rock Clojure and Javascript too.
  • We use Django. Bonus if you know it, but if you don't it ain't so hard.
  • We wrestle some serious big data, and use Hadoop to do it.
  • MongoDB and HBase compete for our love.
  • Chef and Vagrant knifed Puppet and we've never been the same.


If you're a fit, dust off your Python and contact us:


''.join(map(chr,map(lambda x:x[0]+x[1],zip(map(ord,str(dir)),yieldbot))))




How Yieldbot uses D3.js + jQuery for Streamgraph Data Visualization and Navigation

One problem we needed to solve early on at Yieldbot was understanding intent trends in the publisher data. This couldn’t just be a shallow understanding. We needed to expose multiple data trends at the same time around thresholds and similarity. Our needs:

  1. Allow what we call the “root intent” trend to be apparent.
  2. Break out the “other words” that are associated with the root intent.

Making this happen in an integrated fashion meant we needed some flexible and powerful tools. We found that d3.js and jQuery UI were the right tools for this job.

Looking through the excellent documentation and examples from d3.js we saw the potential to build exactly the type of visualization we needed. We used a stacked layout with configurable smoothing to allow good visibility into both the overall and individual trends. Smoothing the data made it very easy to follow the individual trends throughout the visualization.

Having settled on this information-rich way to visualize the data with d3, we then took the prototype static visualization and made it into a dynamic piece of our interface. It was very important to us that the data be more than just a visualization - we wanted it to be navigation. We wanted the data to be a tangible and clickable part of the interface. The result is that each of the intent layers is clickable and navigates to another, deeper level of data.

Having the core functionality in hand, we used the jQuery UI Widget Factory to provide a configurable, stateful widget that encapsulates the implementation details behind a consistent API. This makes using the widget very easy. Creating a trend visualization is just a one-liner, while the raw power and flexibility is wrapped up and contained in the implementation of the widget.
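As a rough illustration of the shape of such a widget (the widget name, options, and data format here are all assumptions for the sketch, not our actual code):

```javascript
// Define a stateful widget with the jQuery UI Widget Factory.
$.widget("yb.streamgraph", {
  options: {
    data: [],          // array of layers, each a series of {x, y} points
    width: 600,
    height: 120
  },
  _create: function () {
    // d3's stack layout with the "wiggle" offset computes the y0
    // baselines that give a streamgraph its shape.
    var layers = d3.layout.stack().offset("wiggle")(;
    var svg = d3.select(this.element[0]).append("svg")
        .attr("width",
        .attr("height", this.options.height);
    // ...area generators, layer paths, and click-to-navigate handlers...
    this._trigger("rendered");
  }
});

// Creating a trend visualization is then a one-liner:
// $("#intent-trend").streamgraph({ data: layersFromServer });
```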

Here are a few examples of this visualization in action:


With this reusable widget in hand we could use this trend visualization in numerous places across our application. This provides consistency to our interface, an extremely important UI concern for such a data-intensive product.

Our approach to developing innovative data visualizations has been consistently repeatable; we now have three additional visualizations in the product and have played around with many more than that. Each time, these are the steps we take when creating a new data viz.

Throughout the process, the flexibility that d3 provides meant we never bumped into a wall where the framework complexity jumped drastically. It appears that the wall of complexity is still far off in the distance, if it exists at all. As our understanding of d3 increased, and with the use of prototypes driven by live data, we were able to quickly iterate on ideas and design. This flexibility will continue to be one of the many long-term benefits we get from using d3.

Data visualization plays an important role in our product and we’re excited to keep using it to solve data comprehension problems. Not to mention it really brings the data to life. If you're interested in data visualization or this process we'd love to hear your thoughts.

Rise of the Publisher Arbitrage Model


The business of the web is traffic. Always has been. Always will be. That’s why I was a bit surprised with how much play Rishad Tobaccowala’s quote received last week in the WSJ:

"Most people make money pointing to content, not creating, curating or collecting content."

The original lessons of how to make money on the web got lost because at some point everyone with a web site thought they could go on forever just selling impressions. They didn’t foresee two things:

A) The massive amounts of inventory being unleashed on the web. In the last two years alone Google’s index has gone from 15B pages to 45B pages.

B) The explosive rise of the ad exchange model with its third party cookie matching business that gave advertisers the ability to reach a publisher’s audience off the publisher site and on much cheaper inventory.

Fortunately the more things change on the web the more they stay the same – especially the business models.  Traffic arbitrage, the web’s original model, is a more viable model than ever for publishers and likely the only hope to build a sustainable business in the digital age.

The web is literally built around traffic. Here’s what it looks like right now.

From Search to email to affiliate, the value and monetization of the web occurs in sending and routing these clicks, or as Tobaccowala said, “pointing,” at a higher return than your cost of acquiring the traffic.

Google of course is the biggest player in the arb business. 75% of the intent Google harvests costs them nothing. They’ve been able to leverage all the intent generation created in other channels, like TV, that shifts to Google for free. Just take a look at how much TV drives Search. But Google also pays for traffic. Last year they spent over $7.3 billion, or 25% of revenue, on what they call TAC (traffic acquisition cost) to get intent. Of course, you need this kind of blended model to be as successful with arb as Google is.

With growing (and free) intent-generating traffic flowing to publishers, it is time they got in this game. Organic search continues to drive higher percentages of traffic to top publishers (and the Panda update has pushed that even higher). YouTube, Facebook and Twitter are also sending more traffic all the time at no cost to pubs. The seeds of a huge arb model continue to be sown.

Vivek Shah, CEO of Ziff Davis, speaking to Ad Exchanger about ‘What Solutions are Still Needed For Today’s Premium, Digital Publisher,’ put it this way:

“…you need to invest in technology that can sort through terabytes of data to find true insights into a person’s intent… not just surface “behaviors.”

This speaks directly to the arb model and how it becomes a true revenue engine for publishers. Once you understand the intent that is present on your site, you can quantify its value. Once you quantify the value, you can figure out what intent you need to get more of, seed that intent with new content, and figure out how much you can spend to drive traffic to it. This is exactly the model we’ve been working on with publishers - using Yieldbot to qualify and quantify the intent on their sites.

So the future of the web looks a lot like the past. There will be marketing levers and technologies that optimize what traffic you drive in, and there will be marketing technologies that optimize where, and at what value, that same traffic goes out. Everything else you do in your business will support those value creation events. Yieldbot just wanted to “point that out” for publishers.