How to run Puppeteer Sharp in a Linux Docker container
Puppeteer Sharp is a website crawler for C#. I personally use it to crawl websites for price information on products that I am interested in. In this article, you are going to learn about the configuration you need if you want to use the crawler in a Linux-hosted web application.
Create a new .NET Core API
When creating the new project do not forget to set up Docker support. Create a new C# class for storing the code that follows.
Use the PuppeteerSharp NuGet package
Reference the latest version of PuppeteerSharp in your project and use it in the class you created.
Crawl a webpage
Create a new method that returns the content of a webpage as a string. You can use the following code as guidance.
Pay attention to these points in the code:
- The --no-sandbox entry in Args disables the sandbox mode in Chromium so that Puppeteer can run in the Docker container, since the sandbox can cause permission issues. However, running without the sandbox can pose a security risk, so use this option with caution.
- The ExecutablePath property tells the code where to find the Chromium browser. As we are going to see when we set up the Dockerfile, we will use this path again.
- We then create a new page inside the browser and set DefaultNavigationTimeout to 0, so long-running requests do not hit the navigation timeout, which is 30 seconds by default.
- When we make the request with GoToAsync, we set the Timeout property to 0 to ensure once again that our requests do not fail due to timeouts.
- WaitUntilNavigation.Networkidle2 is a way to ensure that the information on the webpage has finished loading, since some JavaScript code might have made an internal request for data. With this option, navigation is considered finished once there have been no more than 2 network connections for at least 500 ms.
public async Task<string> GetContentAsync(string url)
{
    // Launch the Chrome binary installed in the container, without the sandbox.
    using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Args = ["--no-sandbox"], ExecutablePath = "/usr/bin/google-chrome-stable" });
    using var page = await browser.NewPageAsync();
    // 0 disables the default 30-second navigation timeout.
    page.DefaultNavigationTimeout = 0;
    // Wait until the network is nearly idle so that script-loaded data is present.
    await page.GoToAsync(url, new NavigationOptions { WaitUntil = [WaitUntilNavigation.Networkidle2], Timeout = 0 });
    return await page.GetContentAsync();
}
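Setting both timeouts to 0 means a page that never settles will keep the request open indefinitely. If that is a concern, a variant with a finite timeout might look like the sketch below. The class name BoundedCrawler, the method name TryGetContentAsync, and the 60-second value are assumptions made for illustration only. The sketch also assumes that a failed or timed-out navigation surfaces as an exception derived from PuppeteerException (or as a TimeoutException), which is worth verifying against the PuppeteerSharp version you reference.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

// Hypothetical helper class, not part of PuppeteerSharp.
public class BoundedCrawler
{
    // Illustrative values: adjust the timeout to what your target sites need.
    private const int NavigationTimeoutMs = 60_000;
    private const string ChromePath = "/usr/bin/google-chrome-stable";

    // Returns the page HTML, or null when navigation fails or times out.
    public async Task<string?> TryGetContentAsync(string url)
    {
        using var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Args = new[] { "--no-sandbox" },
            ExecutablePath = ChromePath
        });
        using var page = await browser.NewPageAsync();

        try
        {
            await page.GoToAsync(url, new NavigationOptions
            {
                WaitUntil = new[] { WaitUntilNavigation.Networkidle2 },
                Timeout = NavigationTimeoutMs
            });
            return await page.GetContentAsync();
        }
        catch (Exception ex) when (ex is PuppeteerException or TimeoutException)
        {
            // Assumption: navigation errors and timeouts arrive as one of these types.
            return null;
        }
    }
}
```

Returning null keeps the sketch short. In a real API you would probably log the exception and map it to an appropriate HTTP status code instead.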
Set up the Dockerfile
We want our code to run inside a Docker container, possibly on a Linux App Service in Azure. For that, we will instruct Docker on how to build and start the application.
You can adapt the Dockerfile based on the following configuration. Read the comments for a more detailed explanation about why we do things as we do.
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
# Puppeteer recipe
# Based on this code: https://github.com/armbues/chrome-headless/blob/master/Dockerfile
RUN apt-get update && apt-get -f install && apt-get -y install wget gnupg2 apt-utils
RUN wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN echo 'deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main' >> /etc/apt/sources.list
RUN apt-get update \
&& apt-get install -y google-chrome-stable --no-install-recommends --allow-downgrades fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf
# We are setting the same path as before in the C# code
ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/google-chrome-stable"
# The following commands are the standard configuration for restoring, building and publishing a .NET Core application.
# Replace YourApi.Api with the name of your own project
USER app
WORKDIR /app
EXPOSE 8080
EXPOSE 8081
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
ARG BUILD_CONFIGURATION=Release
WORKDIR /src
COPY ["YourApi.Api/YourApi.Api.csproj", "YourApi.Api/"]
RUN dotnet restore "YourApi.Api/YourApi.Api.csproj"
COPY . .
WORKDIR "/src/YourApi.Api"
RUN dotnet build "YourApi.Api.csproj" -c $BUILD_CONFIGURATION -o /app/build
# We set UseAppHost to false because the app is started with "dotnet YourApi.Api.dll", so no native executable is needed
FROM build AS publish
RUN dotnet publish "YourApi.Api.csproj" -c $BUILD_CONFIGURATION -o /app/publish /p:UseAppHost=false
# Define the entrypoint of the application
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "YourApi.Api.dll"]
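The Dockerfile exports PUPPETEER_EXECUTABLE_PATH, while the C# code hardcodes the same path. One way to keep the two in sync is to read the variable in code and fall back to the hardcoded path when it is not set. The following is a minimal sketch of that idea. ChromeLocator and Resolve are hypothetical names introduced here, not part of PuppeteerSharp.

```csharp
using System;

// Hypothetical helper, not part of PuppeteerSharp.
public static class ChromeLocator
{
    // Same default path that the crawl method hardcodes.
    public const string DefaultPath = "/usr/bin/google-chrome-stable";

    // Variable name matches the ENV line in the Dockerfile.
    public const string EnvVariable = "PUPPETEER_EXECUTABLE_PATH";

    // Returns the path from the environment, or the default when it is unset or blank.
    public static string Resolve()
    {
        var fromEnv = Environment.GetEnvironmentVariable(EnvVariable);
        return string.IsNullOrWhiteSpace(fromEnv) ? DefaultPath : fromEnv;
    }
}
```

With this in place, the launch options could use ExecutablePath = ChromeLocator.Resolve() so that changing the path in the Dockerfile does not require a code change.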
Conclusion
It is possible to run Puppeteer in a Linux environment. However, some specific configuration in the code and in the Dockerfile is needed to allow it. I hope this article answers some open questions you might have had.