Building a Wikipedia Indexer in C#: Fetching Data for AI Embeddings
In this post, we are going to write the code that handles the first part of our indexing process. We are going to build a basic C# console application that connects to the Wikipedia API, downloads a collection of articles, and prepares them for the next stage of our pipeline.
Eventually, we will create embeddings from these articles and store them in Pinecone, while putting the original text into a content database. But before we get to the vector database wizardry, we need the raw data.
You have likely used Wikipedia before; it is the massive online encyclopedia we all know. We are going to use Wikipedia's API to pull in information about a specific list of historical landmarks.
Setting Up the Source Data
I have created a new C# console application and cleared out the default code in Program.cs. To get started, I added a file named SourceData.cs. This is a static class containing a string array of landmark names. These names correspond directly to Wikipedia articles.
As you can see, we have landmarks like the Eiffel Tower, the Great Wall of China, and Stonehenge.

public static class SourceData
{
    public static readonly string[] LandmarkNames =
    [
        "Eiffel Tower",
        "Great Wall of China",
        "Stonehenge",
        "Edinburgh Castle",
        "Tower Bridge",
        "Buckingham Palace",
        // ... additional landmarks
    ];
}
We will write code that iterates through each of these landmarks one by one. We will use the Wikipedia API to pull down the article information, create an embedding from it, and eventually index it.
Understanding the Wikipedia API
Before writing the client, let's look at the API we are consuming. We are using the English language version at en.wikipedia.org/w/api.php.
We need to pass a few query string parameters to get the data we want. For example, if we query for "Stonehenge" and ask for JSON output, the result looks like this:
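A request along the lines of en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&exintro=1&format=json&formatversion=2&titles=Stonehenge (the exact parameters are covered later in this post) returns a payload shaped roughly like the sketch below. The pageid and extract values here are placeholders rather than the real response:

{
  "query": {
    "pages": [
      {
        "pageid": 12345,
        "title": "Stonehenge",
        "extract": "Stonehenge is a prehistoric monument..."
      }
    ]
  }
}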

The JSON response contains a query object, which contains pages, and inside that, we get an extract.
An important point to note here is the scope of the content. When we use the standard query arguments, we only get the initial section of the Wikipedia article, which is known as the "extract." Wikipedia articles can be quite long, but for now, we are focusing on this high-level overview at the top of the page. Later on, we might look at indexing the entire article, but for this post, we will just pull out this top summary.
Defining the Document Model
We need a C# type to represent the document coming back from Wikipedia. I created a folder called Models and added a file named Document.cs.
We will define this as a record called Document. It will store the data we want to keep for our search index.

namespace ChatBot.Models;

public record Document(
    string Id,
    string Title,
    string Content,
    string PageUrl
);
Here is what these fields represent:
- Id: A unique identifier for each Wikipedia article.
- Title: The title of the article (e.g., "Stonehenge").
- Content: The string content of the article (the extract we saw earlier).
- PageUrl: The direct link to the Wikipedia page.
We will map the API response into this object, create an embedding from it, and store it so that when a user searches for something, we can match a document based on the Content property and display the Title and PageUrl.
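For a concrete picture, here is what a populated Document for the Stonehenge article might look like (the Content value is abbreviated and illustrative, not the real API output):

var stonehenge = new Document(
    Id: "Stonehenge",
    Title: "Stonehenge",
    Content: "Stonehenge is a prehistoric monument...", // the intro extract, trimmed here
    PageUrl: "https://en.wikipedia.org/wiki/Stonehenge"
);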
Creating the Wikipedia Client Service
Now, let's create the client to call the API. In a new Services folder, I created WikipediaClient.cs.
We will start by defining a WikipediaClient class. A critical best practice in .NET is not to create new instances of HttpClient every time you want to use it, as this can lead to socket exhaustion. Instead, we will create a single static instance.

using System.Net;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Text.Json.Serialization;
using ChatBot.Models;

public partial class WikipediaClient
{
    private static readonly HttpClient WikipediaHttpClient = new();

    static WikipediaClient()
    {
        WikipediaHttpClient.DefaultRequestHeaders.UserAgent.Clear();
        WikipediaHttpClient.DefaultRequestHeaders.UserAgent.Add(new ProductInfoHeaderValue("AICourseBot", "1.0"));
        WikipediaHttpClient.DefaultRequestHeaders.UserAgent.Add(new ProductInfoHeaderValue("(contact:[email protected])"));
    }

    // ...
}
We use a static constructor to configure the default request headers. Specifically, we need to set the User-Agent: by default, HttpClient does not send a User-Agent header at all, and we want to be good citizens when communicating with Wikipedia.

By identifying our application ("AICourseBot") and providing contact information, we let Wikipedia know who is making these requests. If we accidentally hammer their API, they can identify us or block requests based on this user agent, or even email us to ask us to stop. It is highly recommended to do this when calling open-source community services.
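Purely as an illustrative check (not part of the original code, and it has to run inside the class because the field is private), you can print the configured values to confirm what will be sent; it should come out as the AICourseBot/1.0 product token followed by the contact comment:

// Prints the configured User-Agent values: the product token plus the contact comment.
Console.WriteLine(WikipediaHttpClient.DefaultRequestHeaders.UserAgent.ToString());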
JSON Serialization Options
The Wikipedia API returns JSON with lowercase property names such as title, pageid, and extract, while our C# properties use PascalCase. To handle this mapping automatically, we need to configure JsonSerializerOptions.

private static readonly JsonSerializerOptions JsonOpts = new()
{
    PropertyNameCaseInsensitive = true
};
Setting PropertyNameCaseInsensitive to true ensures that a JSON property like title maps correctly to our C# property Title.
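As a tiny standalone illustration (the TitleOnly record is hypothetical, not part of the project), case-insensitive matching is what lets a lowercase JSON key bind to a PascalCase member:

using System.Text.Json;

var opts = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

// The lowercase "title" key binds to the PascalCase Title property.
var page = JsonSerializer.Deserialize<TitleOnly>("{\"title\":\"Stonehenge\"}", opts);
Console.WriteLine(page?.Title); // Stonehenge

internal sealed record TitleOnly(string Title);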
Helper Classes for Deserialization
We also need some internal classes to deserialize the specific shape of the Wikipedia JSON response. The response has a query property, which contains pages, which is an array containing objects with pageid, title, and extract.

private sealed class WikiApiResponse
{
    [JsonPropertyName("query")]
    public WikiQuery? Query { get; set; }
}

private sealed class WikiQuery
{
    [JsonPropertyName("pages")]
    public List<WikiPage> Pages { get; set; } = new();
}

private sealed class WikiPage
{
    [JsonPropertyName("pageid")]
    public long? PageId { get; set; }

    [JsonPropertyName("title")]
    public string? Title { get; set; }

    [JsonPropertyName("extract")]
    public string? Extract { get; set; }

    [JsonPropertyName("missing")]
    public bool? Missing { get; set; }
}
These classes mirror the JSON structure we saw in the browser, allowing us to easily pull out the data we need.
Constructing the Request URL
We need a function to build the long URL with all the query string parameters. We will create a method called CreateWikipediaUrl.
We use the UriBuilder class, which is a safer way to construct URLs than string concatenation.

static string CreateWikipediaUrl(string pageTitle, bool full)
{
    var urlBuilder = new UriBuilder("https://en.wikipedia.org/w/api.php");

    var queryString = new Dictionary<string, string>
    {
        ["action"] = "query",
        ["prop"] = "extracts",
        ["format"] = "json",
        ["formatversion"] = "2",
        ["redirects"] = "1",
        ["explaintext"] = "1",
        ["exsectionformat"] = "wiki",
        ["titles"] = pageTitle
    };

    if (!full)
    {
        // Only fetch the introductory extract, not the whole article.
        queryString["exintro"] = "1";
    }

    urlBuilder.Query = string.Join("&", queryString.Select(kv => $"{WebUtility.UrlEncode(kv.Key)}={WebUtility.UrlEncode(kv.Value)}"));

    return urlBuilder.ToString();
}
We define parameters like explaintext and exsectionformat based on the Wikipedia API documentation. The titles parameter is where we pass in our landmark name (e.g., "Stonehenge").
If the full boolean parameter is false, we add exintro=1, which tells the API to only return the introductory extract, not the full article. Finally, we join these parameters into a query string, ensuring keys and values are URL-encoded.
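As a quick sanity check (illustrative, not code from the post), calling the builder for our Stonehenge example should produce a URL like the one in the comment below. One small quirk: UriBuilder.ToString() may keep the explicit default port, so the string can contain :443 after the host, which Wikipedia accepts.

var introUrl = CreateWikipediaUrl("Stonehenge", full: false);

// Roughly:
// https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json
//   &formatversion=2&redirects=1&explaintext=1&exsectionformat=wiki
//   &titles=Stonehenge&exintro=1
Console.WriteLine(introUrl);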
Fetching and Processing the Page
Now we can write the method that actually performs the fetch. This will be an async task that returns a Document.

static async Task<Document> GetWikipediaPage(string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, url);
    using var response = await WikipediaHttpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();

    var json = await response.Content.ReadAsStringAsync();
    var apiResponse = JsonSerializer.Deserialize<WikiApiResponse>(json, JsonOpts)
        ?? throw new InvalidOperationException("Failed to deserialize Wikipedia response.");

    var firstPage = apiResponse.Query?.Pages?.FirstOrDefault();

    if (firstPage is null || firstPage.Missing is true)
        throw new Exception($"Could not find a Wikipedia page for {url}");

    if (string.IsNullOrWhiteSpace(firstPage.Title) || string.IsNullOrWhiteSpace(firstPage.Extract))
        throw new Exception($"Empty Wikipedia page returned for {url}");

    // ... mapping logic continues
We send the request and use EnsureSuccessStatusCode to throw an exception if we get anything other than a 200 OK (or similar 2xx code). We then deserialize the JSON string into our WikiApiResponse object.
Because we are searching by a specific title, we expect the Pages list to contain a single result. We grab the first page and perform several checks. If the page is null, "missing", or has empty content, we throw an exception. We do not want to index blank or invalid pages.
Generating a Safe ID
To create a robust system, we need a clean ID for our database. I included a utility method called ToUrlSafeId in a Utils class.

public static string ToUrlSafeId(string? title)
{
    if (string.IsNullOrWhiteSpace(title))
        return string.Empty;

    var s = title!.Trim();

    // Replace any run of characters that are not word characters or hyphens with "_".
    s = Regex.Replace(s, @"[^\w\-]+", "_");

    // Collapse repeated underscores and trim any leading/trailing ones.
    s = Regex.Replace(s, "_{2,}", "_");
    s = s.Trim('_');

    if (string.IsNullOrEmpty(s))
        return Uri.EscapeDataString(title);

    return s;
}
This function takes the title and replaces special characters, spaces, and brackets with underscores using basic Regex, then collapses any repeated underscores. This ensures that our ID (e.g., "Great_Wall_of_China") is safe to use in Pinecone or as a database key.
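A couple of illustrative inputs and outputs, using the Utils class mentioned above:

var id1 = Utils.ToUrlSafeId("Great Wall of China"); // "Great_Wall_of_China"
var id2 = Utils.ToUrlSafeId("Tower Bridge");        // "Tower_Bridge"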
Mapping to the Document Record
Finally, we map the data from the API response to our Document record.

    var title = firstPage.Title!;
    var content = firstPage.Extract!.Trim();

    var id = Utils.ToUrlSafeId(title);
    var pageUrl = $"https://en.wikipedia.org/wiki/{Uri.EscapeDataString(title)}";

    return new Document(
        Id: id,
        Title: title,
        Content: content,
        PageUrl: pageUrl
    );
}
We construct the public-facing Wikipedia URL manually using the title, so we can direct users to the actual article later.
Exposing the Service
To wrap this up, we create a public instance method that our application will call.

public Task<Document> GetWikipediaPageForTitle(string title, bool full = false)
{
    var url = CreateWikipediaUrl(title, full);
    return GetWikipediaPage(url);
}
This method simply acts as a bridge. It takes a title (from our SourceData list), generates the correct API URL, and calls our internal static fetching logic.
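To see the pieces together, here is a rough sketch of how Program.cs might drive the client over our landmark list (the actual indexing loop is the subject of the next part, and this assumes the client and source data types are visible from Program.cs):

using ChatBot.Models;

var client = new WikipediaClient();

foreach (var landmark in SourceData.LandmarkNames)
{
    // Intro-only extract by default (full: false).
    Document document = await client.GetWikipediaPageForTitle(landmark);
    Console.WriteLine($"{document.Id}: {document.Content.Length} characters from {document.PageUrl}");
}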
Recap
We have successfully built the foundation for our indexing pipeline. Here is what we accomplished:
- Data Source: We established a static list of landmarks to index.
- Model: We defined a Document record to hold clean data.
- Client: We built a WikipediaClient that manages HttpClient efficiently and respects API etiquette with User-Agent headers.
- Logic: We implemented robust URL construction, JSON deserialization, and error handling to ensure we only process valid content.
In the next phase of this project, we will use this client to iterate through our source data, fetch the articles, and begin the process of generating embeddings for our vector database.
Thanks for reading, and happy coding!