Filtering

January 27th, 2010

I was out drinking with my friend Allen a few nights ago. He was asking people at the table to talk about technologies that they used every day that were perhaps less fulfilling than they could be.

I hemmed and hawed. I’ve put so much time into my day-to-day technology that every step of the day is smooth and slick, aided by solid technology along the way.

There are a few arenas where the technology just doesn’t stack up – the ACM Digital Library is some sort of ColdFusion monstrosity, held together by hope and duct tape and dreams. Whenever I try to find good teaching materials, I’m consistently stumped by a half-assed collection of blogs, teacher forums, and what-have-you. And when it comes to webcomic output, none of the self-publishing solutions are any good at all.

But most of these markets are aggressively niche. I’m not a teacher, I’m not allowed to disseminate ACM materials (dammit), and while I aspire to be a webcomic author someday, I just don’t have the attention span required to write something hilarious every day – like, say, Zach Weiner. Okay, I still might end up building a better comic publishing engine. It’s totally on my list.

But then I thought of the biggest fish in the pond of websites falling off my radar.

Reddit.

Oh, yeah, it was good. It was really good in its heyday. When it first started up, its community was small and technical and smart and funny, and all of the submissions were small and technical and smart and funny. But it’s become bloated with old articles and karma whores and pictures upon pictures of cats.

That’s good for reddit. For them it means mainstream success. But reddit’s also starting to hemorrhage users because nothing on their front page is new anymore. They’ve managed to compile a looping “greatest hits” of the internet, but that’s it, the end of the game, nothing new under the sun.

So instead of hitting reddit for the new and amusing, I started hitting RSS feeds instead. Oh, RSS. I love RSS with a passion. It is a fine technology. Every day my inbox is flooded with a hundred articles, many of them fresh and relevant to my interests.

So there are blogs, which are original sources of technical information, and magazines, most of them staffed by the technically savvy. The magazine-blogs read everything and post it immediately, vast filters who produce reams of information that is relevant to your interests. BoingBoing is the big fish in this pond: every day it’s a flood of the best of the internet, annotated with clever writing. What you’re reading is maybe 10 people reading everything on the internet every day and posting their favourite bits. Most of Wired’s blogs and feeds feel the same way.

These feeds are not successful because they do the reporting themselves. These feeds are successful because they stay on top of everything on the internet and only post the best links that they can find – hot and fresh and delivered right to your browser. It’s like a reddit, but instead of a mob of cat-hungry jackasses, it’s a paid staff, and they find a lot of really good stuff.

In fact, that’s what my new project, Potater, is all about. It’s just me reading as many blogs as I can handle, filtering out the good bits, and serving them up at the end of the day like a hot platter of steaming news.

I like it. I’m having a lot of fun with the Potater project. If you’ll excuse my expression, though, it’s … well, small potaters. It’s sort of a just-me-havin’-fun project.

And most people who understand RSS feeds aren’t going to just let me do the filtering for them. Instead, they are going to add me to their burgeoning list of RSS feeds. The trick is just that I might be subscribed to something that they are not. I have even more interesting links for people who have interests along my lines.

In my opinion, the new lifeblood of the internet is not in publishing. Publishing is cheap-as-free. Blogging, images, comics, videos, whatever a user has to offer, there’s a way to get it on the internet for free or almost free. Because of it, the entire publishing industry is starting to fall on its ass. When anybody can publish something for ten bucks a month and push it to ten thousand users directly, what’s the point of going through a distribution company? They just need some sort of monetization model and they’re good to go.

No, traditional publishing is dead, or at least it will be soon. Newspapers, scientific publications, music distributors, even the big movie and television distributors are going to lose out eventually.

But this creates new problems. People aren’t getting paid for the content that they create. There’s no quality control on the internet.

These are the two big problems, now that distribution is out of the way – the people who used to be in charge of these things were the editors. These are people who used to be paid hundreds of thousands of dollars a year to sit in a room and read everything that came their way. They’d short list the good stuff, arrange it all in a neat and tidy little package, hire people to take pictures, have it typeset and nicely designed and copy-edited, and then sell that as a book or a magazine or a compact disc.

And these middle men were undeniably getting fat off of the contributions of the people who were actually producing the good stuff.

But the gravy train is starting to come to an end. Publication on the internet is free. There’s no need to work with these massive arbiters of the public taste.

And lo and behold, with the internet comes a flowering of niches that were underserved by these publications. Fanfic, hot-off-the-press technical news, cartoonists – oh BOY were cartoonists fucked over by the original system – all of these were people who just didn’t get a piece of the pie the ‘old way’ – it’s no wonder that they’re leading the charge in the free publishing world.

Here I am, sitting, looking at a copy of Communications of the ACM that I’ve just received in the mail. It is stunning. Full colour, beautifully typeset, full of imagery, free of ads, and totally relevant to my interests. It has articles about search engines and MapReduce and triple-parity RAID and streaming SQL technology and a computing museum; articles about how biology lacks proper standards, business of software articles, and about recent research in x86 sandboxing. No ads, even.

I’m not saying there’s not a market for this kind of thing. This magazine is magnificent. It’s also expensive, and monthly. There’s a new space out there, a space where news is instant and unfiltered and raw, and the new champions of this space are the BoingBoings, the TechCrunches, the reddits – the sites which act as the new magazines, deciders of taste, compilers of internet goodness.

This is the unexplored territory, a rich territory, the battlefield upon which the new publication battles of the 21st century will be fought.

And RSS is where it all happens.

———————

So, we return to my original point. Reddit is great, because it’s community-filtered, but it’s terrible, because the news is stale and the user base is starting to get tragedy-of-the-commons stupid.

On the other hand, e-Magazines like BoingBoing and Wired are great, because they’re hand-filtered, fresh, and relevant, but because of the manpower required, they don’t serve niche markets effectively and they bleed money. On top of that, they run out of news really quickly, and they rarely filter through the reams upon reams of excellent self-published material out there. (BoingBoing is pretty great, mind you, but there’s only so much BoingBoing to go around, and it’s quite the info-flood, still.)

Now, gentlemen, perhaps you are starting to get the gist of what I am proposing. Community-powered RSS-filtering. Each ‘channel’ is a package of RSS feeds, managed by a single person, or a small group, or publicly. My personal channel might be all of the feeds that go into the making of Potater. Our group channel might be an amalgam of all of the coolest tech and Vancouver news that we can find. A public channel might cover a broader topic, like ‘technology’ or ‘Ruby blogs’.

As these channels flow, flush to bursting with all of the data that they contain, individual users flag individual posts with ‘notability’ markers. “I really like this.” “I like this.” “I hate this.” (etc.) The notability markers affect the real-time view of the data – heavily liked items appear larger, they glow and pulsate with the energy of attention. Heavily disliked items fade into obscurity, both literally and figuratively.
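To make the notability idea a little more concrete, here is a minimal sketch of how markers might add up to a score and how that score might drive the display. The marker names, their weights, and the `display_scale` mapping are all assumptions made up for illustration; nothing here is decided yet.

```python
from dataclasses import dataclass, field

# Hypothetical weights: the design only names the markers, not their values.
MARKER_WEIGHTS = {"really_like": 2, "like": 1, "hate": -1}


@dataclass
class Item:
    """One post in a channel, plus every marker users have flagged it with."""
    title: str
    markers: list[str] = field(default_factory=list)

    def notability(self) -> int:
        # Sum the weight of each marker; an unflagged item scores 0.
        return sum(MARKER_WEIGHTS[m] for m in self.markers)


def display_scale(score: int, lo: float = 0.5, hi: float = 2.0) -> float:
    """Map a notability score to a size multiplier, clamped to [lo, hi].

    Liked items grow, disliked items shrink; the 0.25 step is arbitrary.
    """
    return max(lo, min(hi, 1.0 + 0.25 * score))


post = Item("Triple-parity RAID", ["really_like", "like", "hate"])
print(post.notability(), display_scale(post.notability()))
```

The same score could just as easily drive opacity for the fade-into-obscurity effect; the clamp only keeps a runaway item from taking over the whole page.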

Items get promoted when they reach certain notability thresholds. The feed-consumer, the end-user, might decide the flow (‘N links/day’ or ‘N links/hour’), which would translate into a notability requirement that the user wouldn’t see (‘5 upvotes’). When items reach that threshold, they get plunked into the “output” feeds, which can then be “input” feeds for other groups or users (or just end up directly in somebody’s feed-reader). Either that, or the feed publishers can just set their own notability thresholds.
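One way the ‘N links/day’ setting might translate into a hidden threshold is sketched below: look at the scores from the last day and pick the score of the Nth-best item. The function names and the choice of "Nth-highest recent score" are assumptions for illustration, not a settled design.

```python
def threshold_for_flow(recent_scores: list[int], links_per_day: int) -> int:
    """Return the lowest score that still admits roughly `links_per_day` items.

    `recent_scores` is assumed to hold the notability of every item seen in
    the last day. With fewer items than requested, everything gets through.
    """
    if links_per_day <= 0:
        raise ValueError("links_per_day must be positive")
    ranked = sorted(recent_scores, reverse=True)
    if len(ranked) <= links_per_day:
        return min(ranked, default=0)
    return ranked[links_per_day - 1]


def output_feed(scores_by_title: dict[str, int], threshold: int) -> list[str]:
    """Keep the items whose notability meets the threshold, in input order."""
    return [title for title, score in scores_by_title.items() if score >= threshold]


scores = {"a": 7, "b": 5, "c": 5, "d": 2, "e": -1}
cutoff = threshold_for_flow(list(scores.values()), links_per_day=2)
print(cutoff, output_feed(scores, cutoff))
```

Ties at the cutoff let a few extra items through, so the flow is approximate rather than exact. A publisher-set threshold would simply skip `threshold_for_flow` and pass a fixed number to `output_feed`.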

These links can come out raw, or annotated – so a small, clever group, a BoingBoing, they could gather, annotate, and re-publish a ball of feeds every day. For public feeds, the annotations would be more along the lines of a community discussion. Perhaps the discussion itself would constitute a feed of its own.

This is still in a fuzzy sort of design phase, but this is the direction that I want to take Potater.

I even assembled an imaginary A-Team of people who I’d want to work on Potater with, and then mentally assigned them powers from the Planeteers. Because I can.

Allen, with the power of Fire, a tech-blogger, a feed-reader, an Apple engineer, a tireless advocate of the entrepreneurial spirit, a battle-hardened coder, a front-end man.

Yangman, with the power of Water, a fucking-prolific-hacker, a pythonista, an experimenter with everything that nobody’s ever heard of, and a sexy-beast, to boot.

Demwell, with the power of Wind, with business chops and technical prowess in equal measure, with natural language processing and distributed systems experience, and a beard that just won’t quit.

and

ME, with the power of Earth, because I take forever to do anything and weigh a fucking ton. Also I can code and design some. Write a bit. Forge tools from stone to harvest simple grains. That sort of thing. I could build this sort of thing on my own, but I have the awful tendency to get super-excited about a project, do a tonne of planning and preliminary code, and then lose interest and fall asleep in some bushes. This is why I need other people to help push things forward!

I hesitate to hand out the power of Heart, because it’s a super-lame power. I’d have nominated Danly for that specific honor, but apparently he’s under a no-compete/NDA so restrictive as to render him useless for side-projects. Curse you, Danly’s source of funding!

Don’t feel bad if you’re not on the list. I still think that you are awesome, and I would have included you if I hadn’t run out of Captain Planet themed powers. I mean, I could hand out the bad-guy names, but who wants to be Looten Plunder?

Wait, can I change my power to “Being Looten Plunder”?

So, tell me what you think of this ill-defined masterpiece. Throw ideas. Throw poop if necessary. Not too much poop. Keep poop-throwing to a manageable level. Please.

4 Responses to “Filtering”


  1. Dan says:

    Hardly, I can work on side projects as long as it doesn’t interfere with work product (this doesn’t) and doesn’t interfere with my ability to be fully committed to work product (which this won’t). As a gesture of goodwill I’d likely just have to tell them I’m doing this. Considering a core programmer just left, leaving me to pick up his shit, this is a prime time for me to be making such ultimatums.


  2. lassam says:

    <Boose>:  ASK cdemwell ABOUT ideas
    <cdemwell>:  k, so
    <cdemwell>:  1. People already use existing rss picking tools like google reader’s share and facebook share and whatever else

    <cdemwell>:  they don’t want to do it AGAIN
    <lumy>:  yea.
    <cdemwell>:  2. What you really want is not a filter. What you want is a ranker

    <lumy>:  this is my problem with potatoor
    <cdemwell>:  3. Did you look at raindrop like I suggested?
    <yangman>:  The server at http://www.huge-melons.com is taking too long to respond. :(

    <Boose>:  cdemwell: Yes. Filter/Ranker. Good content is highly ranked, bad content is dropped.
    <cdemwell>:  yangman, it takes time to move that much melon
    <cdemwell>:  ok, so

    <lumy>:  hahaha
    <jeikobu>:  Carrying around huge melons can result in a bad back
    <cdemwell>:  so I hear

    <krichter>:  jeikobu, that’s why proper support helps
    <yangman>:  \OO/
    <cdemwell>:  This is why whenever I can, I try to help carry huge melons for the people who have to support them every day

    <cdemwell>:  anyhow Boose
    <cdemwell>:  4. What you want is a machine learning tool
    <jeikobu>:  Yeah, it’s just like Dirty Dancing

    <jeikobu>:  http://terrifieddad.files.wordpress.com/2009/06/dirty-dancing_l1.jpg
    <lumy>:  wait… no I disagree with 4.
    <lumy>:  I want new and exciting.

    <yangman>:  heh, one-giant-melon
    <lumy>:  I don’t think ML can really do that.
    <Boose>:  A machine learning tool? It learns to rank new news based on your opinion in regards to old news?

    <jeikobu>:  AKA melon loaf
    <cdemwell>:  lumy, what
    <cdemwell>:  are you kidding?

    <cdemwell>:  ML can learn what you think is exciting
    <lumy>:  really?
    <cdemwell>:  yes.

    <cdemwell>:  there are already tools that do sorta-that
    <lumy>:  I was unconvinced last time I looked into it.
    <cdemwell>:  http://www.stumbleupon.com/

    <lumy>:  but I’d love to be proven wrong.
    * cdemwell shrugs
    <cdemwell>:  maybe you’ve heard of a relatively successful project
    <cdemwell>:  it’s called google search?

    <Boose>:  The thing about stumbleupon is that it uses a block of other users and a large base of pre-existing links to help you find things you’ll like
    <cdemwell>:  probably too obscure for you
    <yangman>:  Google eh? tell us more

    <cdemwell>:  yeah
    <cdemwell>:  so
    <Boose>:  It’s more like Amazon’s “People who liked this also liked X”

    <cdemwell>:  yangman, well it’s basically a search engine like duckduckgo but a little more simple
    <cdemwell>:  Boose, yes, it uses user clustering
    <Boose>:  Still complicated and a technical coup, but the computer doesn’t know what you like, it just knows what people who like what you like like.

    <cdemwell>:  I’m pretty familiar with the tech
    <cdemwell>:  no no
    <cdemwell>:  you tell it what you like, and that builds up a list of things you like

    <lumy>:  crap…
    <jeikobu>:  People who liked X like Y is the aggregated version of the output
    <cdemwell>:  it then predicts whether you’d like a new thing by looking at the feature vector

    * lumy leaves for lunch
    <cdemwell>:  k, stop
    <cdemwell>:  collaborate and listen
    <Boose>:  HAMMERTI… wait, different conversation

    <jeikobu>:  Like…. it already knows that X -> Y because you told it X, and you told it Y, where you = many many people
    <cdemwell>:  ML works like this: You take in a bunch of data, which I’ll call “features”
    <jeikobu>:  This is more, you said you liked X and Y, and Z looks sort of like X and Y, so let’s see if you like that too

    <cdemwell>:  well that’s not how stumbleupon works, jeikobu
    <jeikobu>:  OK, my page might be different then yours then
    <cdemwell>:  SU works by saying “You like X and Y, and these 300 people like X and Y and Z, maybe you’ll like Z”

    <cdemwell>:  maybe they do more stuff now
    <cdemwell>:  I dunno
    <Boose>:  Like I said – it knows what people who like what you like like :p

    <cdemwell>:  but that’s what they used to do
    <cdemwell>:  but it’s stupendously effective
    <cdemwell>:  anyhow

    <cdemwell>:  so you input a bunch of features, and some of those features might be sparse
    <jeikobu>:  Obligatory whine: Someone needs a general ML engine where I feed it input key/value pairs and output key/value pairs, and it learns the associations
    <cdemwell>:  so for example one of the items in the feature set might be “ilovekittiesangie1989 likes it”

    <cdemwell>:  but another might be “pagerank”
    <Boose>:  and another might be article length,
    <Boose>:  presence of an image,

    <Boose>:  size of image,
    <jeikobu>:  SU sounds like the aggregated version/amazon
    <Boose>:  average paragraph length…

    <cdemwell>:  jeikobu, that’s just memorization
    <cdemwell>:  Boose, yes
    <cdemwell>:  then you feed that data into something that processes it with special sauce and does stats

    <cdemwell>:  this produces some stats distributions
    <jeikobu>:  cdemwell: …and gives me probable associations for a new set of inputs.
    <cdemwell>:  ding.

    <Boose>:  This is how Bayesian spam filtering works, yes?
    <cdemwell>:  uh, sorta
    <cdemwell>:  bayesian classification is a very simple classification system

    <cdemwell>:  it requires labelled data
    <Boose>:  Ah.
    <cdemwell>:  which is to say [feature1, feature2, ..., featuren, IS_SPAM]

    <cdemwell>:  that last item is the label
    <cdemwell>:  it’s the decision made
    <cdemwell>:  we can also do learning with unlabelled data

    <cdemwell>:  ie “hey go figure out some properties of this data. I’ll tell you why later”
    == rlongair|work [n=rlongair@66.46.112.60] has quit ["Leaving"]
    <cdemwell>:  it’s less effective, but it exists
    <Boose>:  Okay, so you’re suggesting that machine learning can be used to provide real-time ranking and filtering of RSS data.

    <cdemwell>:  *customized* ranking and filtering
    <cdemwell>:  and it doesn’t have to be realtime
    <Boose>:  I’m suggesting that a cloud of users can be used to provide delayed-time ranking and filtering of RSS data.

    <Boose>:  And that, the place where these two things meet is _awesome_
    <cdemwell>:  there are of course people doing this already
    <cdemwell>:  http://blog.postrank.com/about/company-info/

    <Boose>:  Because as a user, you rank and filter what you think is the best/worst, and then the ML algorithms use these to develop your own custom feed, while at the same time you provide data for other feeds and what-have-you.
    <cdemwell>:  yeah it’s easy to add a feature BOOSE_LIKES_IT
    <cdemwell>:  so I figure, fuck postrank and whatnot

    <cdemwell>:  just build something that harvests features and offers a ranked feed
    <cdemwell>:  I’m not sure how compatible this would be with google reader
    <cdemwell>:  like what if the feed is not ordered by time, but rather by score

    <cdemwell>:  hard to say, right
    <cdemwell>:  part of the input can be your google reader shared items feed
    <cdemwell>:  so it’ll see stuff coming back from its own suggestions

    <cdemwell>:  there’s the labels
    <cdemwell>:  USER_LIKES
    <cdemwell>:  but yeah, that’s generally the thing

    <cdemwell>:  this solves the reddit issue where everyone’s vote is considered as important
    <cdemwell>:  like sorry, mr “I like baseball” is not an important vote for me
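
The labelled-data idea from the log above, rows of [feature1, ..., featuren, USER_LIKES], might be sketched as a tiny naive Bayes ranker. This is only an illustration under stated assumptions: the `LikeRanker` class, the feature names, and the choice of Bernoulli naive Bayes with add-one smoothing are all hypothetical, and a real system would likely use richer features such as pagerank or article length.

```python
import math
from collections import defaultdict


class LikeRanker:
    """Tiny Bernoulli naive Bayes over binary features, add-one smoothed."""

    def __init__(self) -> None:
        self.label_counts = {True: 0, False: 0}
        self.feature_counts = {True: defaultdict(int), False: defaultdict(int)}
        self.vocabulary: set[str] = set()

    def train(self, features: set[str], user_likes: bool) -> None:
        # One labelled row: the feature set plus the USER_LIKES label.
        self.label_counts[user_likes] += 1
        for f in features:
            self.feature_counts[user_likes][f] += 1
        self.vocabulary |= features

    def score(self, features: set[str]) -> float:
        """Log-odds that the user likes the item; higher means more likely."""
        total = sum(self.label_counts.values())
        log_odds = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            n = self.label_counts[label]
            log_p = math.log((n + 1) / (total + 2))
            # Features never seen in training are ignored.
            for f in self.vocabulary:
                p = (self.feature_counts[label][f] + 1) / (n + 2)
                log_p += math.log(p if f in features else 1.0 - p)
            log_odds += sign * log_p
        return log_odds


def rank_feed(ranker: LikeRanker, items: dict[str, set[str]]) -> list[str]:
    """Order item titles by predicted score instead of by time."""
    return sorted(items, key=lambda title: ranker.score(items[title]), reverse=True)


ranker = LikeRanker()
ranker.train({"has_image", "from_boingboing"}, True)
ranker.train({"from_boingboing"}, True)
ranker.train({"about_baseball"}, False)
ranker.train({"about_baseball", "has_image"}, False)
print(rank_feed(ranker, {"game recap": {"about_baseball"},
                         "gadget post": {"from_boingboing"}}))
```

A feature like BOOSE_LIKES_IT would just be one more string in the feature set, which is how other people's votes could feed a personal ranking without every vote counting equally.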


  3. lassam says:

    This is sort of what I was thinking.


  4. PieRC says:

    Start with the linked bit and then read downwards.
