This task can be performed using Thundercrawl
Thundercrawl – Turn Your Website Into AI Fuel.
Best product for this task

What to expect from an ideal product
- Thundercrawl automatically crawls your website and extracts the text content from every page, so you don't have to copy and paste it manually
- It strips out the HTML markup and gives you plain text files that machine learning models can read and process
- The extracted text comes pre-formatted in a way that works best with language models, saving you hours of data preparation work
- You get organized .txt files that are ready to feed directly into your AI training pipeline without additional formatting steps
- It handles the technical cleanup, such as removing navigation menus, ads, and other page clutter, so only the valuable content reaches your models
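
To make the cleanup step above concrete, here is a minimal sketch of the general idea of turning HTML into plain text while skipping navigation and similar page chrome. It is not Thundercrawl's implementation or API. It uses only Python's standard library, and the `SKIP_TAGS` set, the `TextExtractor` class, and the `html_to_text` function are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold non-content material on a typical page.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the kept text fragments, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<main><h1>Pricing</h1><p>Plans start small.</p></main>"
        "<footer>Contact us</footer></body></html>"
    )
    print(html_to_text(sample))
```

A tag-based filter like this only catches clutter that is marked up with semantic elements. Ads and menus built from generic `div` elements would need extra rules, which is the kind of work a dedicated tool is meant to take off your hands.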