26 November 2011
Quickly Load Sample data into SharePoint from Wikipedia
Do remember that this application is designed solely to create sample content, not to replicate the service that Wikipedia provides. The content appears unformatted and without images. It is not intended for public use (or even internal intranet use), so please adhere to the Wikipedia Terms of Service, and do also consider making a donation to the Wikimedia Foundation to support the freely available content that we have all come to take for granted in these modern times.
Need to load up a couple of hundred pages into your SharePoint publishing site for testing search? Bored of Lorem Ipsum, or need to test out different structures of content?
Run this code in a console application: it gets a random page from Wikipedia and then saves it to SharePoint. There is a slight pause (approximately one second) as it fetches and saves each document, so this shouldn't constitute heavy traffic to Wikipedia. As mentioned, the content is added unformatted; in fact, the entire HTML document is added to the Page Content field (including all the 'edit' links). That is not best practice in the slightest, but all I was after was searchable content in my site.
Do note, however, that Wikipedia offers archives of its content here: http://dumps.wikimedia.org/backup-index.html. If you intend to populate hundreds of thousands to millions of documents to test out large-scale search solutions, use those archives combined with the bulk load tool (available from here: http://code.msdn.microsoft.com/windowsdesktop/Load-Bulk-Content-to-3f379974). That method requires downloading the entire archive (7GB worth) in one go, then converting and uploading it to SharePoint directly. Use the method in this blog post when you only need a couple of hundred sample documents, without downloading the entire Wikipedia archive.
In SharePoint Enterprise Search, there is the option to add an external web site as a content source and include it in your search index. I opted for the method in this post over that, because a) indexing potentially hundreds of thousands of pages on websites I don't run would add significantly more traffic than running this code, and b) my environment doesn't have the disk space for that kind of index.
Finally, thanks to Todd Klindt and Shaun O’Callaghan for a couple of pointers and considerations for this article.
Inspiration for this method came from an answer on this StackExchange question.
Create a SharePoint Publishing Portal, change the URL in the code to point to your new site, and simply run it. You could add logic to the GetPageLayout method to retrieve a random page layout each time, for testing out different information architectures, for example with the Refinements web part.
```csharp
using System;
using System.Net;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Publishing;

namespace SampleContentGetter
{
    // WebClient subclass that exposes the final URI after redirects,
    // so we can recover the article title from Special:Random.
    class MyWebClient : WebClient
    {
        Uri _responseUri;

        public Uri ResponseUri
        {
            get { return _responseUri; }
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            _responseUri = response.ResponseUri;
            return response;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string site = "http://demolab-sps2010/sites/content";
            int docsToGet = 50;

            Console.WriteLine("Opening SharePoint Site...");
            using (SPSite oSite = new SPSite(site))
            using (SPWeb oWeb = oSite.OpenWeb())
            {
                PublishingWeb pWeb = PublishingWeb.GetPublishingWeb(oWeb);
                PageLayout[] pLayouts = pWeb.GetAvailablePageLayouts();
                PageLayout layout = GetPageLayout(pLayouts, "PageFromDocLayout.aspx");

                for (int i = 0; i < docsToGet; i++)
                {
                    try
                    {
                        Console.WriteLine(string.Format("Getting next page ({0} of {1})...", i + 1, docsToGet));
                        string title;
                        string page1;
                        using (MyWebClient wc = new MyWebClient())
                        {
                            wc.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                            page1 = wc.DownloadString("http://en.wikipedia.org/wiki/Special:Random");

                            // Special:Random redirects to a random article; take the
                            // title from the path after "/wiki/" (6 characters).
                            string url = wc.ResponseUri.AbsolutePath;
                            title = Uri.UnescapeDataString(url.Substring(6));
                            Console.WriteLine("Got document: " + title);
                        }

                        Console.WriteLine("Adding to SharePoint Site...");
                        PublishingPage newPage = pWeb.AddPublishingPage(title + ".aspx", layout);
                        newPage.ListItem["PublishingPageContent"] = page1;
                        newPage.Update();
                        newPage.CheckIn("Adding content from Wikipedia");
                        newPage.ListItem.File.Publish("Adding content from Wikipedia");
                        newPage.ListItem.File.Approve("Adding content from Wikipedia");
                    }
                    catch (Exception e)
                    {
                        // Titles with characters that are invalid in SharePoint URLs
                        // will fail here; just move on to the next random page.
                        Console.WriteLine(e.Message + "\nSomething odd happened, trying again....");
                    }
                }
            }
        }

        private static PageLayout GetPageLayout(PageLayout[] layouts, string name)
        {
            foreach (PageLayout p in layouts)
            {
                if (p.Name == name)
                {
                    return p;
                }
            }
            return null;
        }
    }
}
```
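The random page layout idea suggested above could be sketched roughly as follows. This is a hypothetical helper (the name `GetRandomPageLayout` is my own, not from the original code); it simply picks one of the layouts returned by `GetAvailablePageLayouts`, so each generated page can exercise a different template:

```csharp
using System;
using Microsoft.SharePoint.Publishing;

// Hypothetical variant of GetPageLayout: instead of matching a layout by
// name, pick a random one from those available on the publishing web.
static class LayoutPicker
{
    static readonly Random _rng = new Random();

    public static PageLayout GetRandomPageLayout(PageLayout[] layouts)
    {
        if (layouts == null || layouts.Length == 0)
        {
            return null;
        }
        return layouts[_rng.Next(layouts.Length)];
    }
}
```

You would then call `LayoutPicker.GetRandomPageLayout(pLayouts)` inside the loop, rather than resolving a single layout once before it, so that each `AddPublishingPage` call receives a potentially different layout.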