Fun Friday: The Race for AI Creative Works Control

In April of 2020, William and I wrote an article in the Cutter Business Journal entitled Moving Forward in 2020: Technology Investment in ML, AI, and Big Data. We focused on three areas: surveillance (monetization), entertainment (stickiness), and whitespace opportunities (climate, energy, transportation). This statement bears emphasis:

Instead of moving from technology to key customers with an abstracted total addressable market (TAM), we must instead quantify artificial intelligence (AI) and machine learning (ML) benefits where they specifically fit within business strategies across segment industries. By using axiomatic impacts, the fuzziness of how to incorporate AI, ML, and big data into an industry can be used as a check on traditional investment assumptions.

[For additional information on this article, please see AI, ML, and Big Data: Functional Groups That Catch the Investor’s Eye (6 May 2020, Cutter Business Technology Advisor).]

But one might be puzzled as to where generative AI tools such as ChatGPT or DALL-E fit in the AI landscape and why we should care about AI art, AI news and press releases, AI homework and essays, and even the threat of AI music like the machine-composed songs in Orwell’s 1984.

The reality is that these areas rely on easily crawled content lying around everywhere in the Internet attic. It also takes tremendous computing power to run ML over this data and turn it into something that at least appears to be sensible output. Hence, these tools will remain in the corporate hands of their creators no matter what they claim about “open source”; it’s simply too difficult for anyone but a giant corporate entity to support the huge costs involved. So this is about monetization and stickiness. Large companies are willing and able to pay the cloud costs if the customer becomes dependent on their tools. Flashback to the tool-centric sell of the 1980s, Silicon Valley style. All we need is an AI version of Dr. Dobb’s Journal and we’re all set.

Previous attempts at generative AI usually focused on small ML datasets, leading to laughable and biased results. Now companies are watching the shift from ads to AI, at Google in particular, along with Microsoft and Facebook. Everyone believes they are in a race and is frantically trying to catch up before all that sweet, sweet money is locked up by one of them.

But is there really any need to “catch up”? Is this a real trend, or just an illusion? Google made its fortune by categorizing every web page on the Internet. It had plenty of rivals back then. I was fond of AltaVista myself. But there was also everything from Ask Jeeves to Yahoo.

Now Google and Microsoft are analyzing the contents of these big pots of data with ML. But it’s not just for analyzing. It’s for creating content. Music. Art. News. Opinion. And you need an awful lot of processing power to handle all that data. So it’s now a Big Guy Game.

One of the approaches to eliminating bias is to use ML to process more and more data. The bigger the data pot, the less the bias and error. Well, that’s the assumption, anyway. But it’s a dubious assumption given the pots analyzed are often variations on the theme. Most search categorization is based on recent pages and not deep page analysis. Google is no Linkpendium.

All this, oddly, reminds me a bit of the UK mad cow fiasco, where the agricultural industry essentially cultivated prions by feeding animals dead animals. Just as Curie purified radium from pitchblende, the animals that died of this disease were processed and fed to other animals. And since prions, like radium, persist after processing, the prions were concentrated and made their way up the food chain into humans.

So in like kind the tools themselves are feeding back into the ML feedlot and being consumed again. It may take longer than a few days, but we will be back to the same problems in terms of bias and error.

However, the gimmick of having a “machine” write your essay or news blurb is very tempting. Heck, AIs are claimed to take medical exams or pass a law class or handle software programming better than people.

But being a doctor or attorney or a software engineer is much more than book learning, as anyone who’s done the job will tell you.

And of course, there is now backlash from various groups who value their creative works and are not interested in rapidly generated pale imitations polluting their space and pushing them out. They didn’t consent to have their works pulled into a training set and used by anyone. Imitation is neither sincere nor flattering, and is even legally actionable if the work is protected by copyright, trade secret, or patent. It’s not “fair use” when you suck it all in and paraphrase it slightly differently.

This isn’t new. William and I ran into this in the old days with our 386BSD code. We were happy to let people use it and modify it; what is code if not usable and modifiable? But we asked that provenance be maintained in the copyright itself by leaving the names of the authors. And we had entire modules of code that were written de novo in the days when kernel design meant new ideas. It was an amazing creative time for us.

So I remember how shocked I was when an engineer at Oracle asked me about tfork() and threading, since a Linux book he had discussed it but he could find nothing in the Linux source code. I pulled out our 386BSD Kernel book and showed him that it was novel work done for 386BSD and would not work in Linux. Upon discussion it turned out that the book had just “paraphrased” many of our chapters without even considering that Linux did not incorporate much of our work because it was a very different architectural design. It misled software designers, but I’m sure it sold a heck of a lot more books than we did by turning “386bsd” into “linux”. So it is today, but it is a heck of a lot easier for the talentless, the craven, and the criminal to steal.

Now many software designers are upset because their source code repositories are used as the models for automated coding, and they don’t like that one bit. And I don’t blame them.

We lived it. And it was a primary reason why the 386BSD project was terminated. Too many trolls and opportunists ready to take any new work and paraphrase it. So get ready to see this happen again in music, art, news, and yes, software. The age of mediocrity is upon us.

1984, here we come…