Meta Releases New Bots With Web Crawling Capabilities That Gather Data For AI Models And Related Products

Tech giant Meta has rolled out a new array of web-crawling bots that scoop up online data the company can use across its various AI products.

The bots are built in ways that make it hard for website owners to block them and keep their content from being scraped. When asked to explain, Facebook’s parent firm said the first bot, called Meta-ExternalAgent, is designed for training purposes.

It can improve the firm’s AI offerings by indexing content directly. The second bot, dubbed Meta-ExternalFetcher, is tied to the company’s AI products: it fetches specific links to support particular features.

The bots were first spotted last month by the startup Originality.ai. The finding is alarming because other crawlers have already been called out for sidestepping robots.txt, and it lands in the middle of a race in which AI firms are working around the clock to build the most powerful chatbots.

To become a frontrunner, an AI model needs to be trained on the best data, and that is where web crawlers come into play. One of the main ways to gather that data is to send bots across the web to scrape content. Crawlers from all the leading players have already been identified, including Google, Anthropic, OpenAI, and others.

Websites that want to block this rely on robots.txt, a plain-text file of rules for crawlers that has been in use since the 1990s. The file is purely advisory, though: it tells crawlers what they may and may not fetch, but it cannot technically stop a bot that chooses to ignore it.
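For illustration, here is a minimal robots.txt sketch that disallows the two bots named in this article site-wide while leaving other crawlers alone. The user agent tokens are taken from the bot names reported here; whether a given bot actually honors these rules is entirely up to the bot.

```
# Block Meta's reported crawlers from the entire site.
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

# All other crawlers may fetch everything.
User-agent: *
Disallow:
```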

Despite those rules, many AI giants have a growing appetite for training data, and several companies, including OpenAI and Anthropic, have been called out for ignoring robots.txt.

Meta’s latest offering appears to pursue the same goal, if more discreetly. The company has itself disclosed that one of the two new crawlers has this capability and can therefore bypass robots.txt rules.

That bot not only collects data for AI training but also indexes content at the same time. Website owners can block it, but only by also opting out of having their sites indexed by Facebook’s parent firm, which means fewer visitors arriving from Meta’s platforms and, in turn, lost revenue.

Bundling both jobs into a single bot makes it harder to block. Today, just 1.5% of leading websites block the bot, while an older Meta crawler dubbed FacebookBot is blocked by about 10% of top sites, Yahoo and X among them. Interestingly, the other new bot, Meta-ExternalFetcher, is blocked by less than 1% of top pages.
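As a rough sketch of how such blocking can be measured, the Python snippet below uses the standard library’s urllib.robotparser to test whether a site’s robots.txt disallows a given user agent at the site root. The domain is a placeholder, and treating “cannot fetch the root” as “blocked” is a simplifying assumption, not Originality.ai’s actual methodology.

```python
# Sketch: check whether a site's robots.txt blocks a given crawler user agent.
# Standard library only; example.com is a placeholder domain.
from urllib.robotparser import RobotFileParser

def is_blocked(domain: str, user_agent: str) -> bool:
    """Return True if robots.txt on `domain` disallows `user_agent` at the root."""
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt
    # Treat "may not fetch the site root" as a site-wide block.
    return not parser.can_fetch(user_agent, f"https://{domain}/")

if __name__ == "__main__":
    for bot in ("Meta-ExternalAgent", "Meta-ExternalFetcher", "FacebookBot"):
        print(f"{bot} blocked on example.com: {is_blocked('example.com', bot)}")
```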

According to the head of Originality.ai, companies like Meta need to give websites a real chance to block the use of their data, since it is their material. Nobody should be forced to give in to tech giants’ growing power simply out of fear of losing website visibility.

The news is alarming and suggests Meta is failing to respect other sites. While its older bots may have been mindful of robots.txt, the latest offerings are not, and the move is likely to prove a major point of controversy for the company.

Image: DIW-Aigen