Implementing a JSON to C# parser

This week I’ve had a stab at implementing a tool I make ample use of in my day-to-day development life. Json2CSharp is a very handy tool, one I’ve mentioned in numerous blogs, where you can paste in some JSON and it’ll convert it to C# models for you. It’s a life-saver, especially if you’re consuming APIs: simply paste in an example response and you’re good to go.

Here’s an example:

{
	"name": "Luke Garrigan",
    "happy": true,
	"height": 210,
	"website": {
	  "name": "codeheir.com",
	  "rating": 10
	}
}

And the C# it generates:

public class Website
{
    public string name { get; set; }
    public int rating { get; set; }
}

public class Root
{
    public string name { get; set; }
    public bool happy { get; set; }
    public int height { get; set; }
    public Website website { get; set; }
}

I tend to use the “Use Pascal Case” and “Add JsonProperty Attributes” options.

Lexical Analysis

The first step to creating a parser is lexical analysis. Lexical analysis is the process of splitting the input into tokens so that your parser has a standard format to process.

Whilst coding this I followed TDD principles — partly because it’s fun and partly because the recursive nature of parsing can get relatively complicated. It’s comforting to have a safety net.

Here’s an example test I wrote for lexical analysis:

[Test]
public void Should_Return_Boolean_Token() {
    var input = "{\"happy\":true}";
    var lexer = new Lexer(input);
    lexer.Tokens.Count().Should().Be(5);
    lexer.Tokens.ElementAt(0).Should().Be('{');
    lexer.Tokens.ElementAt(1).Should().Be("happy");
    lexer.Tokens.ElementAt(2).Should().Be(':');
    lexer.Tokens.ElementAt(3).Should().Be(true);
    lexer.Tokens.ElementAt(4).Should().Be('}');
}

Note how the tokens are of a different type — this’ll come in handy when writing the parser.

In the lexical analysis phase you generally iterate over the input one character at a time to determine each component part: is it an int, a string, a boolean? You might also strip any unnecessary whitespace here, whilst preserving whitespace within strings.

private void Lex(string input)
{
    var output = new List<object>();
    while (input.Any())
    {
        if (LexString(input, out var stringOutput))
        {
            input = input.Substring(stringOutput.Length + 2);
            output.Add(stringOutput);
        } else if (LexNumber(input, out var numberResult))
        {
            input = input.Substring(numberResult.Length);
            output.Add(int.Parse(numberResult));
        } else if (input.StartsWith("true"))
        {
            input = input.Substring(4);
            output.Add(true);
        } else if (input.StartsWith("false"))
        {
            input = input.Substring(5);
            output.Add(false);
        } else if (input.StartsWith("null"))
        {
            input = input.Substring(4);
            output.Add(null);
        } else if (jsonSyntax.Any(e => e == input[0]))
        {
            output.Add(input[0]);
            input = input.Substring(1);
        } else
        {
            throw new Exception($"Unexpected character: {input[0]}");
        }
    }
    Tokens = output;
}

I’m looping through the input, working out the type of the token at the front, adding it to the output list and then removing its length from the input so we don’t iterate over it again.

Here’s an example of how I’m lexing strings:

private static bool LexString(string currentInput, out string result)
{
    result = "";
    if (currentInput[0] == '"')
    {
        currentInput = currentInput.Substring(1);
    }
    else
    {
        return false;
    }
    
    for (var i = 0; i < currentInput.Length; i++)
    {
        if (currentInput[i] == '"')
        {
            return true;
        }

        result += currentInput[i];
    }

    return true;
}

So if the first character is a quote, I know we’re currently looking at a string; if not, I return false as it’s not a string. Then I simply iterate over each character, adding it to the result and stopping when I reach the closing quote, which marks the end of the string.

If you’re interested in how I am lexing other types, the code is published on GitHub.
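For comparison, the Lex loop above also relies on a LexNumber method. A minimal sketch in the same out-parameter style, assuming plain non-negative integers (the real version would also need to handle negatives, decimals and exponents), might look like:

```csharp
private static bool LexNumber(string currentInput, out string result)
{
    result = "";

    // Consume consecutive digits from the front of the input.
    // Sketch only: negatives, decimals and exponents are ignored.
    foreach (var character in currentInput)
    {
        if (!char.IsDigit(character))
        {
            break;
        }
        result += character;
    }

    return result.Length > 0;
}
```

The caller in Lex then strips `result.Length` characters from the input, just as it does for strings.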

Syntactic analysis

The next step is to perform syntactic analysis:

The syntax analyzer’s (basic) job is to iterate over a one-dimensional list of tokens and match groups of tokens up to pieces of the language according to the definition of the language. If, at any point during syntactic analysis, the parser cannot match the current set of tokens up to a valid grammar of the language, the parser will fail and possibly give you useful information as to what you gave, where, and what it expected from you.

https://notes.eatonphil.com/writing-a-simple-json-parser.html

So for me, this involves performing the conversion from the tokens to the C# syntax.

So, given a token, if it’s a '{' we know it’s creating an object, so I call the ParseObject() method; likewise, for a '[' I call ParseArray():

private void Parse(bool root = false)
{
    var currentToken = tokens.First();

    if (root && !currentToken.Equals('{'))
    {
        throw new Exception("Root must be an object");
    }

    if (currentToken != null && currentToken.Equals('{'))
    {
        tokens = tokens.Skip(1);
        ParseObject();
    }
    else if (currentToken != null && currentToken.Equals('['))
    {
        tokens = tokens.Skip(1);
        ParseArray();
    }
    else
    {
        tokens = tokens.Skip(1);
    }
}

If it’s an object then I need to create a new C# class. I then iterate through each token, preemptively setting the nextClassName just in case the next value is an object. Parse is then called on the value, which is either an array, object, int, bool, null, or a string.

private void ParseObject()
{
    var output = $"public class {nextClassName ?? "Root"} {{ ";
    
    var firstChar = tokens.First();
    if (firstChar.Equals('}'))
    {
        tokens = tokens.Skip(1);
        return;
    }
    while (true)
    {
        var jsonKey = tokens.First();
        
        nextClassName = CapitaliseFirstLetter(jsonKey);
        tokens = tokens.Skip(2); // also skip colon

        var type = GetType();
        Parse();
        output += $"public {type} {jsonKey} {{ get; set; }}";
        
        if (tokens.First().Equals('}'))
        {
            tokens = tokens.Skip(1);
            output += '}';
            break;
        }
        tokens = tokens.Skip(1); // skips ,
    }

    Output += output;
}
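ParseObject also leans on a CapitaliseFirstLetter helper to turn a JSON key like website into a class name like Website. A minimal sketch, assuming a non-empty key, could be:

```csharp
private static string CapitaliseFirstLetter(object jsonKey)
{
    // Sketch only: assumes the token is a non-empty string key.
    var key = jsonKey.ToString();
    return char.ToUpper(key[0]) + key.Substring(1);
}
```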

You may have noticed above there’s a call to the GetType() method; this derives the C# type from the JSON value:

private string GetType()
{
    var type = "string";
    if (tokens.First() is int)
    {
        type = "int";
    }
    else if (tokens.First() is bool)
    {
        type = "bool";
    }
    else if (tokens.First() is null)
    {
        type = "object";
    }
    else if (tokens.First().Equals('{'))
    {
        type = nextClassName;
    }
    else if (tokens.First().Equals('['))
    {
        type = GetArrayType(type);
    }
    return type;
}
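The GetArrayType call isn’t shown here. A sketch of the idea, assuming it peeks at the token after the '[' of a non-empty, homogeneous array to pick the element type:

```csharp
private string GetArrayType(string type)
{
    // Sketch only: peek at the first element after '[' and assume
    // every element in the array shares its type.
    var element = tokens.Skip(1).First();
    if (element is int)
    {
        return "List<int>";
    }
    if (element is bool)
    {
        return "List<bool>";
    }
    if (element.Equals('{'))
    {
        return $"List<{nextClassName}>";
    }
    return $"List<{type}>";
}
```

This is what produces the `List<int>` for the favNumbers example below.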

So given the following JSON:

{
   "name":"Luke",
   "address":{
      "postcode":"pe321da"
   },
   "favColor":"blue",
   "favNumbers": [8, 88, 888]
}

The C# my parser creates is:

public class Address
{
    public string postcode { get; set; }
}

public class Root
{
    public string name { get; set; }
    public Address address { get; set; }
    public string favColor { get; set; }
    public List<int> favNumbers { get; set; }
}

And, for nested objects like the following:

{
   "person": {
      "name": {
         "firstName":"Luke"
      }
   }
}

It outputs:

public class Name
{
    public string firstName { get; set; }
}

public class Person
{
    public Name name { get; set; }
}

public class Root
{
    public Person person { get; set; }
}

Finishing touches

Let’s add an extension method that performs the lexical analysis and syntactic analysis on a string, to make interfacing easy.

public static string ToCSharp(this string json)
{
    json = json.Trim();
    json = RemoveAllWhitespace(json);

    var lexer = new Lexer(json);
    var parser = new Parser(lexer.Tokens);

    return parser.Output;
}

Notice I call a RemoveAllWhitespace method prior to lexical analysis:

private static string RemoveAllWhitespace(string json)
{
    var output = "";
    for (var i = 0; i < json.Length; i++)
    {
        var currentCharacter = json[i];
        if (currentCharacter == '"')
        {
            var stringEnd = false;
            output += currentCharacter;
            var j = i + 1;
            while (!stringEnd)
            {
                output += json[j];
                if (json[j] == '"')
                {
                    stringEnd = true;
                }
                j++;
            }
            i = j-1;
            continue;
        }

        if (!char.IsWhiteSpace(currentCharacter))
        {
            output += currentCharacter;
        }
    }
    return output;
}

So now all I need to write to use the converter is:

var json = "{\"Person\":{\"Name\":{\"FirstName\":\"Luke\"}}}";
var csharp = json.ToCSharp();            

And that’s all there is to it.

There are still a few limitations. For instance, if I were to use this for C# models, I’d want everything to be in PascalCase, to align with C# coding conventions, and to use JsonProperty attributes to handle serialisation.
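For example, with PascalCase naming and JsonProperty attributes applied, the Website model from the first example would ideally come out looking something like this (using Newtonsoft.Json’s JsonProperty attribute):

```csharp
using Newtonsoft.Json;

public class Website
{
    [JsonProperty("name")]
    public string Name { get; set; }

    [JsonProperty("rating")]
    public int Rating { get; set; }
}
```

The attribute maps the camelCase JSON key to the PascalCase property during serialisation and deserialisation.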

If you liked this blog then please sign up for my newsletter and join an awesome community!
