Spring AI Web scraping ETL

Is there a feature in the pipeline to support web scraping, similar to what the LangChain library offers (https://python.langchain.com/v0.1/docs/use_cases/web_scraping/)?

The idea is basically to load HTML pages from a web URL and transform them into text before chunking and indexing them into a Vector Store.
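
To make the stages concrete, here is a minimal plain-Java sketch of the "transform to text" and "chunk" steps. It is not Spring AI API: the class `NaiveHtmlPipeline` and its methods are hypothetical names, the tag stripping is a naive regex (no entity decoding, no JavaScript rendering), and the fetch and indexing steps are left out, so the HTML is just an inline string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, library-free sketch of HTML -> text -> chunks. */
final class NaiveHtmlPipeline {

    /** Drops script/style blocks and all tags, then collapses whitespace. */
    static String htmlToText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Splits text into chunks of at most maxWords whitespace-separated words. */
    static List<String> chunk(String text, int maxWords) {
        List<String> chunks = new ArrayList<>();
        if (text.isEmpty()) {
            return chunks;
        }
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i += maxWords) {
            int end = Math.min(i + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Title</h1><p>one two three four five</p></body></html>";
        // Each printed line is one chunk that would then be embedded and indexed.
        chunk(htmlToText(html), 3).forEach(System.out::println);
    }
}
```

A real pipeline would count tokens rather than words and would hand each chunk to the Vector Store; the word-based split is only there to keep the sketch self-contained.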

Comment From: ThomasVitale

You can already load web pages into a vector database using the Tika DocumentReader, but it would be great to have dedicated support for the web scraping use case. For example, it would be useful to be able to customise the loading and the transformation/splitting of web pages in an HTML-aware way (similar to what LangChain and LlamaIndex support).

Dependency:

dependencies {
    ...
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example:

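import java.net.MalformedURLException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.core.io.UrlResource;

// Assumes `logger` and `vectorStore` are fields of the enclosing class.
public void run() throws MalformedURLException {
    List<Document> documents = new ArrayList<>();

    // Extract: fetch the page and let Tika turn the HTML into text.
    logger.info("Loading .html files as Documents");
    var documentUri = URI.create("https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/concepts.html#_models");
    var htmlReader = new TikaDocumentReader(new UrlResource(documentUri));
    documents.addAll(htmlReader.get());

    // Transform and load: split into token-sized chunks, then embed and store them.
    logger.info("Creating and storing Embeddings from Documents");
    var textSplitter = new TokenTextSplitter();
    vectorStore.add(textSplitter.split(documents));

    // Query the store to check that the page content is retrievable.
    var similarDocuments = vectorStore.similaritySearch(SearchRequest
            .query("Retrieval Augmented Generation")
            .withTopK(3)
            .withSimilarityThreshold(0.75));
    similarDocuments.forEach(doc -> System.out.println(doc.getContent()));
}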
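
As for what "HTML-aware" splitting could mean in practice, below is a rough, library-free sketch that cuts a page at `h1` to `h3` headings and prefixes each chunk with its heading, instead of splitting purely by token count. Nothing here is Spring AI, LangChain or LlamaIndex API: `HeadingSplitter` and its methods are hypothetical names, and the regex-based parsing is an assumption made to keep the example self-contained (a real implementation would use a proper HTML parser).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: one chunk per heading-delimited section. */
final class HeadingSplitter {

    // Matches <h1>..</h1>, <h2>..</h2>, <h3>..</h3>, with optional attributes.
    private static final Pattern HEADING =
            Pattern.compile("(?is)<h([1-3])\\b[^>]*>(.*?)</h\\1>");

    /** Removes tags and collapses whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Returns chunks of the form "heading\nbody"; text before the first heading has no heading line. */
    static List<String> split(String html) {
        List<String> chunks = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        String heading = "";
        int bodyStart = 0;
        while (m.find()) {
            // Close the section that ends where this heading begins.
            addChunk(chunks, heading, stripTags(html.substring(bodyStart, m.start())));
            heading = stripTags(m.group(2));
            bodyStart = m.end();
        }
        addChunk(chunks, heading, stripTags(html.substring(bodyStart)));
        return chunks;
    }

    private static void addChunk(List<String> out, String heading, String body) {
        if (heading.isEmpty() && body.isEmpty()) {
            return; // nothing to emit
        }
        if (heading.isEmpty()) {
            out.add(body);
        } else if (body.isEmpty()) {
            out.add(heading);
        } else {
            out.add(heading + "\n" + body);
        }
    }

    public static void main(String[] args) {
        String html = "<h1>Models</h1><p>AI models process information.</p>"
                + "<h2>Prompts</h2><p>Prompts guide the model.</p>";
        split(html).forEach(chunk -> System.out.println(chunk + "\n---"));
    }
}
```

Keeping the heading with its text means each chunk carries some of the page structure into the embedding, which a plain token-based splitter loses.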

Comment From: sivaprasadreddy

There are some commons-compress version incompatibilities. I had to exclude and configure it as follows:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.26.1</version>
</dependency>

Comment From: markpollack

We can include these changes in the pom.

How much more dedicated support beyond Tika is expected? The sample code reads well to me.

Comment From: markpollack

Web scraping is notoriously difficult to do correctly beyond basic pages. I'm not against a simple web scraper, but I'm not sure it really adds a lot of value, as people will complain about the many pages that can't be scraped because of blocking, JavaScript, etc. We now have jsoup (a well-respected library), which will help in this area. Hopefully it meets your needs; if not, try out firecrawl. They also have an MCP server: https://www.firecrawl.dev/mcp