Linq2XML

Microsoft’s Linq to XML is a great way to query XML documents. It can be used to do SQL-type operations such as left outer joins between complex types either in the same document or even in different documents. It can also be used to do aggregate queries such as min, max and count.

Contents

A great place to start is Microsoft’s landing page for LINQ to XML, here:
https://msdn.microsoft.com/en-us/library/bb387098(v=vs.140).aspx

LinqPad

Linq, if you haven’t worked with Linq before, can be a bit disconcerting. The syntax is very different than SQL or XPATH. Fortunately there is LinqPad, which is a free tool that makes working with Linq2XML a lot easier.

LinqPad allows you to run C# code dynamically, and with it you can execute Linq2XML queries in much the same way that you would use SQL Enterprise Manager to execute SQL queries.

The basic version of LinqPad is available free from http://www.linqpad.net.

LinqPad

Selecting with Linq2XML

Here is a really basic Linq2XML query that can be run from within LinqPad:

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	select new {
		author=(string)gp1.Element("author").Value,
	}).Distinct();

sampledata.Dump();

When run against the sample xml, you’ll get these 9 rows, nicely formatted by the “sampledata.Dump();” statement.

author
Gambardella, Matthew
Ralls, Kim
Corets, Eva
Randall, Cynthia
Thurman, Paula
Knorr, Stefan
Kress, Peter
O’Brien, Tim
Galos, Mike

The list can be sorted with an orderby statement:

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	orderby (string)gp1.Element("author").Value
	select new {
		author=(string)gp1.Element("author").Value,
	}).Distinct();

sampledata.Dump();

You can see that the list is now sorted:

author
Corets, Eva
Galos, Mike
Gambardella, Matthew
Knorr, Stefan
Kress, Peter
O’Brien, Tim
Ralls, Kim
Randall, Cynthia
Thurman, Paula

A where clause can be added to filter the results:

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	where ((string)gp1.Element("author").Value).StartsWith("K")==true
	orderby (string)gp1.Element("author").Value
	select new {
		author=(string)gp1.Element("author").Value,
	}).Distinct();

sampledata.Dump();

As you can see, the results now include only those authors whose last names begin with “K”:

author
Knorr, Stefan
Kress, Peter

Linq2XML and attributes

XML attributes can be returned in the results of the query, or included in where or orderby clauses. Simply use the “Attribute” collection rather than the “Element” collection, as shown here:

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	where ((string)gp1.Element("author").Value).StartsWith("K")==true
	orderby (string)gp1.Element("author").Value
	select new {
		bookid=(string)gp1.Attribute("id"),
		author=(string)gp1.Element("author").Value,
	}).Distinct();

sampledata.Dump();

Here are the results, which now include the bookid attribute.

bookid author
bk108 Knorr, Stefan
bk109 Kress, Peter

If there are multiple elements at the same level where an attribute is used as an element identifier, then a different technique is necessary.

This XML stores both ISBN and UPC in a BookIdentificationCode element with an identifying IdType attribute:

<Book>
   <Title>Hitchhiker's Guide to XML</Title>
   <BookIdentificationCode IdType="ISBN">
       <IdentificationCode>999999999</IdentificationCode>
   </BookIdentificationCode>
   <BookIdentificationCode IdType="UPC">
       <IdentificationCode>140411</IdentificationCode>
   </BookIdentificationCode>
</Book>
 This LINQ can be used:
var data = (
    from gp1 in root.Descendants("Book")
    select new
    {
	Title=gp1.Element("Title").Value,
        BookISBN = gp1.Elements("BookIdentificationCode")
             .Where(element => element.Attribute("IdType").Value == "ISBN")
                 .Elements("IdentificationCode").First().Value,
        BookUPC = gp1.Elements("BookIdentificationCode")
             .Where(element => element.Attribute("IdType").Value == "UPC")
                 .Elements("IdentificationCode").First().Value
    }
);

Linq2XML and Inner Joins

A common requirement is to be able to join two xml documents together for comparison (or join elements in one part of an xml document against elements in another). A plain inner join is simple to implement.

Here we’ll make use of a second sample XML document called “sample_inv.xml” that contains inventory and cost information about the books in sample.xml.

This query shows the price and cost of each book, where the cost is derived from a join to sample_inv.xml.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

string samplexmlinv=@"C:\Research\linq2xml\sample_inv.xml";	
XDocument sampleinvroot = XDocument.Load(samplexmlinv);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	join gp2 in sampleinvroot.Elements().Elements("book")
	on (string)gp1.Attribute("id") equals (string)gp2.Attribute("id")
	select new {
		id=(string)gp1.Attribute("id"),
		title=(string)gp1.Element("title").Value,
		price=(string)gp1.Element("price").Value,
		cost=(string)gp2.Element("cost").Value
	});

sampledata.Dump();

Note that there are only 11 records shown. Since there is no record in sample_inv.xml for book id bk112 our inner join does not include that book.

id title price cost
bk101 XML Developer’s Guide 44.95 22.95
bk102 Midnight Rain 5.95 2.95
bk103 Maeve Ascendant 5.95 2.95
bk104 Oberon’s Legacy 5.95 2.95
bk105 The Sundered Grail 5.95 2.95
bk106 Lover Birds 4.95 1.99
bk107 Splish Splash 4.95 0.69
bk108 Creepy Crawlies 4.95 3.99
bk109 Paradox Lost 6.95 1.99
bk110 Microsoft .NET: The Programming Bible 36.95 14.99
bk111 MSXML3: A Comprehensive Guide 36.95 14.99

Linq2XML and Left Joins

Here are the sample and sample_inv xml files joined together using a linq2xml left outer join.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

string samplexmlinv=@"C:\Research\linq2xml\sample_inv.xml";	
XDocument sampleinvroot = XDocument.Load(samplexmlinv);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	join gp2 in sampleinvroot.Elements().Elements("book")
	on (string)gp1.Attribute("id") equals (string)gp2.Attribute("id")
	into combined
	from d1 in combined.DefaultIfEmpty()		
	select new {
		id=(string)gp1.Attribute("id"),
		title=(string)gp1.Element("title").Value,
		price=(string)gp1.Element("price").Value,
		d1
	});

sampledata.Dump();

Note that an into clause is used with the outer query, similarly to the way it is used with aggregate queries, and DefaultIfEmpty is used to indicate that the query should return results even if no match is found in the outer data set.

If we run the above query, we see that a row returned for book id bk112 and with a null cost as there is no sample_inv element corresponding to that book id.

id title price d1
bk101 XML Developer’s Guide 44.95
<book id="bk101">
  <cost>22.95</cost>
  <instock>10</instock>
  <onorder>5</onorder>
</book>
bk102 Midnight Rain 5.95
<book id="bk102">
  <cost>2.95</cost>
  <instock>2</instock>
  <onorder>15</onorder>
</book>
bk103 Maeve Ascendant 5.95
<book id="bk103">
  <cost>2.95</cost>
  <instock>3</instock>
  <onorder>7</onorder>
</book>
bk104 Oberon’s Legacy 5.95
<book id="bk104">
  <cost>2.95</cost>
  <instock>7</instock>
  <onorder>10</onorder>
</book>
bk105 The Sundered Grail 5.95
<book id="bk105">
  <cost>2.95</cost>
  <instock>6</instock>
  <onorder>5</onorder>
</book>
bk106 Lover Birds 4.95
<book id="bk106">
  <cost>1.99</cost>
  <instock>0</instock>
  <onorder>2</onorder>
</book>
bk107 Splish Splash 4.95
<book id="bk107">
  <cost>0.69</cost>
  <instock>0</instock>
  <onorder>10</onorder>
</book>
bk108 Creepy Crawlies 4.95
<book id="bk108">
  <cost>3.99</cost>
  <instock>19</instock>
  <onorder>1</onorder>
</book>
bk109 Paradox Lost 6.95
<book id="bk109">
  <cost>1.99</cost>
  <instock>9</instock>
  <onorder>8</onorder>
</book>
bk110 Microsoft .NET: The Programming Bible 36.95
<book id="bk110">
  <cost>14.99</cost>
  <instock>4</instock>
  <onorder>11</onorder>
</book>
bk111 MSXML3: A Comprehensive Guide 36.95
<book id="bk111">
  <cost>14.99</cost>
  <instock>2</instock>
  <onorder>2</onorder>
</book>
bk112 Visual Studio 7: A Comprehensive Guide 49.95 null

Notice too that the value in the d1 column is straight XML. To show the cost value where available, we will need to implement a null value check.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

string samplexmlinv=@"C:\Research\linq2xml\sample_inv.xml";	
XDocument sampleinvroot = XDocument.Load(samplexmlinv);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	join gp2 in sampleinvroot.Elements().Elements("book")
	on (string)gp1.Attribute("id") equals (string)gp2.Attribute("id")
	into combined
	from d1 in combined.DefaultIfEmpty()		
	select new {
		id=(string)gp1.Attribute("id"),
		title=(string)gp1.Element("title").Value,
		price=(string)gp1.Element("price").Value,
		cost=d1!=null?d1.Element("cost").Value:null
	});

sampledata.Dump();

Now when we execute the query, we get the cost for the 11 books where it is available, and null when it is not.

id title price cost
bk101 XML Developer’s Guide 44.95 22.95
bk102 Midnight Rain 5.95 2.95
bk103 Maeve Ascendant 5.95 2.95
bk104 Oberon’s Legacy 5.95 2.95
bk105 The Sundered Grail 5.95 2.95
bk106 Lover Birds 4.95 1.99
bk107 Splish Splash 4.95 0.69
bk108 Creepy Crawlies 4.95 3.99
bk109 Paradox Lost 6.95 1.99
bk110 Microsoft .NET: The Programming Bible 36.95 14.99
bk111 MSXML3: A Comprehensive Guide 36.95 14.99
bk112 Visual Studio 7: A Comprehensive Guide 49.95 null

Linq2XML and Computing Aggregates

It’s possible to use Linq2XML to perform aggregate queries on xml documents. You can count elements, find min and max, and even include aggregate values in where clauses.

Here is an example of a simple aggregate query that simply counts the number of books that exist for each genre.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	group gp1 by new {
		Genre=(string)gp1.Element("genre").Value
	}	
	into gpk
	select new {
		genre=(string)gpk.Key.Genre,
		count=gpk.Count()
	});

sampledata.Dump();

Run the query and you’ll see this list:

genre countΞΞ
Computer 4
Fantasy 4
Romance 2
Horror 1
Science Fiction 1

The aggregate is defined in the “group” clause where you define a new object consisting of key values. These key values are like the “group by” values in a sql statement.

The key values can then be referenced not just in the select but also in where and order by clauses as seen here.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var sampledata=(
	from gp1 in sampleroot.Elements().Elements("book")
	group gp1 by new {
		Genre=(string)gp1.Element("genre").Value
	}	
	into gpk
	where gpk.Count()>1
	orderby gpk.Count() 
	select new {
		genre=(string)gpk.Key.Genre,
		count=gpk.Count()
	});

sampledata.Dump();

Run this query to see only those genres that contain at least one title, sorted by count of books.

genre countΞΞ
Romance 2
Computer 4
Fantasy 4

There are a couple of ways to determine aggregate values such as min, max or average on an XML file using Linq2XML. Here is the most basic.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var minprice=sampleroot.Root.Elements().Min(e=>Convert.ToDouble(e.Element("price").Value));
var maxprice=sampleroot.Root.Elements().Max(e=>Convert.ToDouble(e.Element("price").Value));
var avgprice=sampleroot.Root.Elements().Average(e=>Convert.ToDouble(e.Element("price").Value));

minprice.Dump();
maxprice.Dump();
avgprice.Dump();

These aggregate functions can be incorporated into a query. Here is a query that returns min, max, average and sum values for the prices element in the book catalog XML.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

var result=(
	from gp1 in sampleroot.Elements().Elements("book")
	group gp1 by new {
		genre=(string)gp1.Element("genre") 
	}
	into gpk
	orderby gpk.Key.genre
	select new 
	{
		genre=gpk.Key.genre,
		Max=gpk.Select(e=>Convert.ToDouble(e.Element("price").Value)).Max(),
		Min=gpk.Select(e=>Convert.ToDouble(e.Element("price").Value)).Min(),
		Sum=gpk.Select(e=>Convert.ToDouble(e.Element("price").Value)).Sum(),
		Average=gpk.Select(e=>Convert.ToDouble(e.Element("price").Value)).Average()
	}
);

result.Dump();

When we execute the query these are the results that are returned.

genre MaxΞΞ MinΞΞ SumΞΞ AverageΞΞ
Computer 49.95 36.95 168.8 42.2
Fantasy 5.95 5.95 23.8 5.95
Horror 4.95 4.95 4.95 4.95
Romance 4.95 4.95 9.9 4.95
Science Fiction 6.95 6.95 6.95 6.95

Here is the query modified to use an outer join to retrieve aggregate values based on cost (from samplexml.inv.xml) instead of price.

string samplexml=@"C:\Research\linq2xml\sample.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

string samplexml_inv=@"C:\Research\linq2xml\sample_inv.xml";	
XDocument samplerootinv = XDocument.Load(samplexml_inv);

var result=(
	from gp1 in sampleroot.Elements().Elements("book")
	join gp2 in samplerootinv.Elements().Elements("book")
		on (string)gp1.Attribute("id") equals (string)gp2.Attribute("id")
	group gp2 by new {
		genre=(string)gp1.Element("genre") 
	}
	into gpk
	orderby gpk.Key.genre
	select new 
	{
		genre=gpk.Key.genre,
		Max=gpk.Select(e=>Convert.ToDouble(e.Element("cost").Value)).Max(),
		Min=gpk.Select(e=>Convert.ToDouble(e.Element("cost").Value)).Min(),
		Sum=gpk.Select(e=>Convert.ToDouble(e.Element("cost").Value)).Sum(),
		Average=gpk.Select(e=>Convert.ToDouble(e.Element("cost").Value)).Average()
	}
);

result.Dump();

Here are the results that you will see when the above query is executed.

genre CountΞΞ MaxΞΞ MinΞΞ SumΞΞ AverageΞΞ
Computer 3 22.95 14.99 52.93 17.6433333333333
Fantasy 4 2.95 2.95 11.8 2.95
Horror 1 3.99 3.99 3.99 3.99
Romance 2 1.99 0.69 2.68 1.34
Science Fiction 1 1.99 1.99 1.99 1.99

Note that if you wish your Linq query to return the XML associated with a group you can use an expression similar to this:

var data = (
   from gp1 in root.Descendants("Book")
      group gp1 by new 
	{
	   Title=gp1.Element("Title").Value
	}
      into gpk		  
      select new
      {
	Title=gpk.Key.Title,
	count=cpk.Count(),
	xml=gpk.ToList()[0].ToString()
      }
   );

 

Linq2XML and namespaces

Sometimes it’s necessary to create Linq2XML queries on XML documents that contain namespaces. Doing so is not difficult. Simply create an XNamespace for the necessary namespace and precede tag names with the namespace as shown here.

string samplexml=@"C:\Research\linq2xml\sample-NS.xml";	
XDocument sampleroot = XDocument.Load(samplexml);

XNamespace ns = "http://www.jlion.com/sample";

var sampledata=(
	from gp1 in sampleroot.Elements().Elements(ns + "book")
	select new {
		author=(string)gp1.Element(ns + "author").Value,
	}).Distinct();

sampledata.Dump();

 

Tips & Tricks

Use this bit of Linq2Sql to list all top-level element names in an XML document. It’s namespace agnostic.

XDocument root = XDocument.Load(file);

ListcomplexTypes=new List();

foreach (XElement dc in root.Descendants().Where(child => child.Parent == root.Root))
{
	if (!complexTypes.Contains(dc.Name.LocalName.ToString()))
		complexTypes.Add(dc.Name.LocalName.ToString());
}

complexTypes.Dump();

If executed against this XML:

<Book>
   <Title>Hitchhiker's Guide to XML</Title>
   <BookIdentificationCode IdType="ISBN">
       <IdentificationCode>999999999</IdentificationCode>
   </BookIdentificationCode>
   <BookIdentificationCode IdType="UPC">
       <IdentificationCode>140411</IdentificationCode>
   </BookIdentificationCode>
</Book>

You would receive these results:

Title
BookIdentificationCode