Feeding the AI Dataset: Where does the Information in ChatGPT and Other Algorithms Come From?


I stumbled across a really interesting post over at Futurizon, the site of fellow futurist Dr. Pearson, who compiled a list of futurists ranked by how much their sites feed into the AI machine. You can read it here.

Dr. Pearson comments:

Thanks to the Washington Post, we now know how the C4 dataset was made up, one of those used to train some AI chatbots. Since it is a snapshot of a sample of the web at a particular time, a few years ago, it isn’t perfectly representative, but it does give a taste of the likely makeup of AI training sets.

When someone asks their AI chatbot about the future, this list gives a broad indication of the relative influence that site would have on the answer. It’s really just a proxy for website size, but that’s as good a way as any of ranking influence. At least it’s objective, based on actual data. Ross Dawson’s top futurists list is equally objective, but ranks by social media presence. Neither is perfect. Not all futurists have big websites and not all futurists bother to cultivate large social media followings. Neither list indicates quality of analysis or insight.

When you are using ChatGPT or other LLMs (large language models), the information used in their predictive text algorithm comes from somewhere – with a good part of it being from scraping the Internet. (Other sources include scholarly journals, large-text research datasets, and more.)

The post, Top Futurist Sites by Influence on AI (C4 Dataset), took a look at how much of one particular AI dataset came from various futurists. It’s quite an interesting read and notes that my own site came in at #22 on a long, long list of fellow futurists. Why would this be? Because I’ve been blogging since 2002, and with thousands upon thousands of posts, I have a lot of information to feed the machine! Google scraped these posts as part of its process of building its search engine and also fed them into this particular AI dataset.

You can see how you are feeding the machine at the original source for this insight, a research project done at the Washington Post – find it here. Give it the address of your own site, and you’ll see where you stand. In my case, my site ranks 32,689th by the quantity of insight fed into the algorithm – among millions of sources.
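For the curious, here is a rough sketch of what a ranking like this boils down to: sort the sites by how much text each one contributed, then see where a given site lands and what share of the total it represents. The site names and counts below are made up purely for illustration, and the function name is my own; this is not the Washington Post's actual code or data.

```python
from typing import Dict, Tuple


def rank_and_share(token_counts: Dict[str, int], domain: str) -> Tuple[int, float]:
    """Return (rank, share) for `domain`, ranking sites by token count.

    `token_counts` is a hypothetical mapping of site -> amount of text
    contributed to the dataset. Rank 1 is the largest contributor.
    """
    total = sum(token_counts.values())
    # Largest contributors first
    ordered = sorted(token_counts, key=token_counts.get, reverse=True)
    rank = ordered.index(domain) + 1
    share = token_counts[domain] / total
    return rank, share


# Invented numbers for illustration only (not real C4 figures)
sample_counts = {
    "bigsite.example": 700_000,
    "midsite.example": 250_000,
    "smallblog.example": 50_000,
}

rank, share = rank_and_share(sample_counts, "midsite.example")
print(f"rank {rank}, share {share:.1%}")  # prints "rank 2, share 25.0%"
```

With millions of sites in the real dataset, even a high rank translates into a tiny share of the whole, which is worth keeping in mind when reading a number like mine.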

What does this mean? My insight will form a bigger part of the predictive text that this algorithm generates, particularly if someone is searching for future trends, innovation, inspiration, creativity, or the other things I post about. As the world of search – Google, Bing, and other systems – migrates to become more like ChatGPT-style systems, my insight will become more and more a part of the response. Will I get credit? Maybe, maybe not. Bing will often cite the source of the answers that its ChatGPT engine derives.

The Google C4 dataset that is referred to here doesn’t seem to form the basis of any particular current ChatGPT engine – but does offer an interesting look at how the information we generate feeds the machine!



THE FUTURE BELONGS TO THOSE WHO ARE FAST features the best of the insight from Jim Carroll’s blog, in which he covers issues related to creativity, innovation and future trends.