Sunday, March 19, 2017

Visual Studio 2017 and C# 7 - what they have to offer

I attended the Microsoft VS 2017 launch event in Portland last week @ViewPoint, and was impressed by some of the new features presented in VS 2017 and, especially, in C# 7.

In VS 2017, performance and productivity are the two areas where you will see quite a bit of improvement. Compared to VS 2015, installation of 2017 runs significantly faster, and you will experience much faster start-up times and a smaller memory footprint. The new Live Unit Testing and Exception Helper features will make your debugging experience more efficient. On the navigation side, the new Ctrl+T super-find feature lets you find any text in your files or jump to any line of your code with a single keyboard shortcut. Super cool!

For C# 7 (the new language version), below are the enhancements I really like -

1) For exception handling, the new Exception Helper now shows you the original trouble-maker right in the pop-up window, without making you dive through the chain of inner exceptions yourself. A great time-saver!
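The chain the debugger walks is built with the standard InnerException mechanism. A minimal sketch of wrapping a low-level failure and then recovering the original exception (the class and messages are made up for illustration):

```csharp
using System;

static class ExceptionDemo
{
    // Wraps a low-level failure, preserving the original as InnerException.
    public static string CaptureInnerMessage()
    {
        try
        {
            try
            {
                throw new FormatException("bad input");            // the original trouble-maker
            }
            catch (FormatException ex)
            {
                // Re-throw a higher-level exception, chaining the original
                throw new InvalidOperationException("processing failed", ex);
            }
        }
        catch (InvalidOperationException outer)
        {
            return outer.InnerException?.Message;                  // "bad input"
        }
    }
}
```

The Exception Helper surfaces that innermost message for you, instead of making you expand InnerException by hand in the watch window.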

2) For functions, you can now return a tuple. I love this feature, since there are times I want to return multiple values with mixed types as a quick dump. The tuple option is quite handy, so I no longer need to write a struct or class for that purpose.

     var (x, y, z) = myFunction();
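Fleshing out that one-liner, here is a small sketch with a hypothetical function returning a named tuple (on frameworks before .NET 4.7 you also need the System.ValueTuple NuGet package):

```csharp
using System;

static class TupleDemo
{
    // Hypothetical function returning three values of mixed types as a named tuple
    public static (int Count, double Average, string Label) Summarize()
        => (3, 2.5, "sample");

    public static void Main()
    {
        // Deconstruct the tuple directly into three local variables
        var (count, average, label) = Summarize();
        Console.WriteLine($"{count} items, avg {average} ({label})");
    }
}
```

You can also keep the tuple whole (`var result = Summarize();`) and access the values by name as `result.Count`, `result.Average` and `result.Label`.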

3) Fakes - In unit testing, we've learned to implement dependency injection to inject mock objects through interfaces. Fakes let you achieve the same objective with less code and more flexibility. One example demoed at the event was a unit test that depends on the current date: using a fake date makes your test results much more manageable. Of course, Microsoft cautioned against over-using it. As with TypeMock, when you fake something, you are trusting it, so the fake itself is not tested.
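The date example from the demo looks roughly like this with Microsoft Fakes shims (requires Visual Studio Enterprise and a generated Fakes assembly for System; the test name and date are my own illustration):

```csharp
using System;
using Microsoft.QualityTools.Testing.Fakes;   // Microsoft Fakes shim support

public class OrderTests
{
    public void Expired_Order_Is_Rejected()
    {
        using (ShimsContext.Create())
        {
            // Every call to DateTime.Now inside this scope returns a fixed date,
            // so the test result no longer depends on when it runs.
            System.Fakes.ShimDateTime.NowGet = () => new DateTime(2017, 3, 1);

            // ... exercise the code under test that reads DateTime.Now ...
        }
    }
}
```

Note the shim is scoped to the `using` block; once `ShimsContext` is disposed, `DateTime.Now` behaves normally again.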

4) Nested named functions (local functions) - if you are familiar with R, this will be nothing new to you. Yes, now you can do nested named functions in C# too. The main difference between these and delegates is readability.
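A minimal sketch of a local function (the factorial example is my own, not from the event):

```csharp
using System;

static class LocalFunctionDemo
{
    public static long Factorial(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));

        // Nested named (local) function: scoped to Factorial, reads more
        // naturally than a delegate, and can recurse by its own name.
        long Inner(int k) => k <= 1 ? 1 : k * Inner(k - 1);

        return Inner(n);
    }

    public static void Main() => Console.WriteLine(Factorial(5)); // 120
}
```

Compared to assigning a lambda to a `Func<int, long>` delegate, the local function has a real name, can be declared after its first use, and supports recursion without the awkward self-referencing delegate dance.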

Overall, this was a very good event with great speakers! I really enjoyed it!

Tuesday, November 1, 2016

Hadoop in a nutshell

The Hadoop framework creates a lot of buzz nowadays in the data science field. What is it?

Hadoop is a computational platform built to tackle big data problems, with data that can be both structured and unstructured. The main idea is to bring computation to the data instead of bringing data to the computation. Its file system (Hadoop Distributed File System - HDFS) breaks data into small chunks, then saves and replicates them across clustered data nodes on low-cost computers and disks for high data throughput and extensive computation. Such tasks were only possible on expensive supercomputers in the past.


What are the key components in Hadoop?

HBase - Hadoop's non-relational database. It stores data as key-value pairs at a very large scale.


Sqoop - a data transfer tool that moves data between relational databases and Hadoop.


Pig / Hive - Pig is a high-level data-flow language that runs on top of MapReduce. Hive uses SQL-like syntax (HiveQL) for data summarization and ad-hoc querying.
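To give a feel for the Hive side, a tiny HiveQL sketch (the table and column names are made up):

```sql
-- Define a tab-delimited table over files in HDFS
CREATE TABLE page_views (
    user_id   STRING,
    url       STRING,
    view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Ad-hoc query: top 10 most-viewed URLs; Hive compiles this into MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

The point is that an analyst who knows SQL can query HDFS data without writing any Java MapReduce code.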


MapReduce - the execution engine that maps and reduces data and returns results. One of the drawbacks of MapReduce is that it reads data from disk between steps. MapReduce can be used through its native Java APIs or through REST APIs.


Spark - an enhanced MapReduce-style engine that utilizes in-memory technology to cache data. It has a wide range of applications in ETL, machine learning and data streaming. Spark can be used from Java, Python, and, in the near future, R.


Cloudera - a software company that provides a big array of Hadoop-based big data tools. Its single-node VM can give you a jump-start for testing, demoing or learning the Hadoop framework and the tools mentioned above.

Friday, October 28, 2016

Data Lake, Data Warehouse and Data Mart

Getting dizzy with these big terms? Let's take a close look at them to hopefully clear up some confusion.

Data Lake is the newest term among the three. It is a data storage architecture for BIG data. All kinds of raw data are stored in it as blobs or objects with unique keys. Data modeling, cleansing and transformation steps are taken only when needs arise, and are applied only to the subset of relevant data objects. In technical terms, this modeling method is called schema-on-read. A data lake serves a broad range of users, who can sample and dive into the lake for their specific needs at any time they see appropriate.

The data warehouse has been around for decades. It is almost the opposite of a data lake in terms of how data is stored. A laborious data modeling and ETL process must happen before data is loaded into it. The data modeling is tailored to answer particular questions and target specific audiences. Because of the up-front effort invested, the data is usually well formatted and ready for querying, slicing and dicing. This data modeling technique is called schema-on-write. I was involved in a well-funded enterprise data warehouse initiative as a data modeler. The magnitude of effort was a big deal and very impressive, and documentation played a huge role in the process.

A data mart is a small version of a data warehouse. It is smaller in size and more agile to implement, and the targeted audience is consequently smaller as well. Data in a data mart is also pre-transformed, cleansed and well structured. Compared to a data warehouse, a data mart is a better fit for a small-to-medium business without a big IT budget. With a few capable hands, the business can get answers to some critical, particular questions at a much faster pace. The downside is that data marts are often disconnected, without the keys needed to link them together to provide a holistic view of your organization's data.


Sunday, November 10, 2013

IO Resource Governance in SQL Server 2014

CPU and memory resource governance has been available since SQL Server 2008; however, IO resource governance is new in SQL Server 2014.

Here is a very good blog post about "IO Resource Governance in SQL Server 2014" -

"By configuring the IO settings on the Resource Pools we can control the resources we want to provide for each tenant. This allows us to guarantee predictable performance regardless of the activity from other tenants or even provide differentiation in SLA for the Database Service based on the amount of the resources customers sign up to reserve.

Another scenario that many of you might find applicable is isolating your OLTP workload from any maintenance operations that need to run in the database. Rebuilding an index, for example, is a common operation that can trigger a large number of IO requests, as it needs to scan the whole index or table. By using IO Resource Governance we can limit the number of IO operations these tasks can perform and guarantee predictable performance for concurrent OLTP workload."
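A sketch of what that maintenance-isolation setup might look like in T-SQL (the pool and group names are hypothetical; MIN_IOPS_PER_VOLUME and MAX_IOPS_PER_VOLUME are the options new in SQL Server 2014):

```sql
-- Cap physical IO for maintenance work at 100 IOPS per disk volume
CREATE RESOURCE POOL MaintenancePool
    WITH (MAX_IOPS_PER_VOLUME = 100);
GO

-- Workload group that maintenance sessions will be routed into
CREATE WORKLOAD GROUP MaintenanceGroup
    USING MaintenancePool;
GO

-- Apply the new configuration
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
```

You would still need a classifier function to route index-rebuild sessions (for example, by login name) into MaintenanceGroup; the OLTP workload stays in the default pool with its IO unthrottled.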


Saturday, March 9, 2013

automated way to generate classes from a database

Problem:
Is there a quick way to generate classes based on DB schema by a few clicks in Visual Studio?

This seems a common enough problem to be worth having an automated solution.

Solution: It turns out Microsoft already has something to offer. Its DbContext generator in the ADO.NET Entity Data Model does exactly what I was looking for.

Prerequisites:
- Create your relational database first on SQL Server, and apply physical foreign keys (FKs). With FKs, the generator will add child collections to their parent classes. Below is my sample db schema.



- Have the DbContext generator template activated in your VS. If not, find it by searching the online templates.

Steps:

1. Create a new class library project, and add a new item: "Data" - "ADO.NET Entity Data Model".

2. Follow through a few prompts, connect to the database above, and select all tables.

3. Under the model's ".edmx" branch, expand your ".tt" file to see the auto-generated classes.

Interestingly, the class name is changed from the plural "Parents" to the singular "Parent". But it didn't bother to change "Children" to "Child". Obviously Microsoft didn't invest money here in an English dictionary. Lol.

4. Here you go - the nice and neat auto-generated code. Nullable types stay nullable.
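The generated classes look roughly like this (a simplified sketch; the property names follow my sample schema and are not exactly what the template emits for yours):

```csharp
using System.Collections.Generic;

// Shape of the POCO classes the .tt template generates from the schema
public partial class Parent
{
    public Parent() { Children = new HashSet<Children>(); }

    public int ParentId { get; set; }
    public string Name { get; set; }

    // Child collection added automatically because of the FK
    public virtual ICollection<Children> Children { get; set; }
}

public partial class Children
{
    public int ChildId { get; set; }
    public int? ParentId { get; set; }          // nullable column stays nullable
    public string Name { get; set; }

    public virtual Parent Parent { get; set; }
}
```

The `partial` keyword means you can add your own members in a separate file without touching the generated code, which gets overwritten on every regeneration.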



Very cool. Happy coding by not coding too much!

Sunday, December 2, 2012

understanding SQL server 2012 columnstore index


In the past few months, I spent some time playing with the SQL Server 2012 ColumnStore (CS) index. One of my tests was to apply a CS index on tables with 50 million and 500 million rows. On average, query execution time dropped from 10-15 minutes to less than 10 seconds. I realized later that even with CS, you might still need to apply traditional row-based indices to maximize query optimization. One simple way is to let the query optimizer decide when to use which. In general, CS improves performance on large tables with joins and aggregations. For look-ups, a row-based index doing seeks still works faster in many cases.

CS works so fast due to its in-memory technology. It compresses tables by column and puts them in memory as a whole. Columns with many repeating values compress the most if they are integers, date-times or numerics with precision less than 18. If you include many text fields with high cardinality, the CS compression won't help much; those texts also increase storage space and consume a lot of memory. One interesting observation: when I opened a tabular project in Visual Studio 2010, the tabular model loaded into memory instantly. It consumed about 2 GB of memory for about 1.5 GB of data.

SQL Server 2012 SSAS Tabular and PowerPivot have built-in CS indices. Tabular runs on a server, while PowerPivot runs on a client PC. If you import millions and millions of rows into PowerPivot in Excel, you need a minimum of 4 GB of RAM according to Microsoft, or 6-8 GB based on my testing, and your PC must have a 64-bit OS.

Some other basics I learned about CS -

- A CS index cannot be applied to views.

- After you apply CS to a table, the table becomes read only.

- CS cannot be combined with page / row compression.
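Putting the basics together, creating the index looks like this (the table and column names are hypothetical; in SQL Server 2012 only the nonclustered flavor exists, and it makes the table read-only):

```sql
-- Nonclustered columnstore index on a large fact table
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_CS
ON dbo.FactSales (SaleDate, ProductId, StoreId, Quantity, Amount);
```

To load new data afterwards, you have to drop or disable the index, insert, and rebuild it, or switch new partitions in.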

reference - http://msdn.microsoft.com/en-us/library/gg492088.aspx

Wednesday, April 18, 2012

Customize SharePoint top navigation bar by working with Master page, CSS and JQuery

I worked on a SharePoint 2010 top navigation drop-down menu for a project lately. It was quite straightforward to use SharePoint Designer to edit the master page. However, don't overwrite the original v4.master page without making a copy for roll-back. Enabling the drop-down menu turned out to be easier than I thought: all you need to do is change the "MaximumDynamicDisplayLevels" attribute of the SharePoint:AspMenu control with ID "TopNavigationMenuV4" from "1" to "2". Done. You can then update CSS, combined with jQuery, to apply some visual effects and themes. Make sure to register your new CSS file in the master page and give it the correct order; usually you want your custom CSS file to load after the corev4.css file. The "dynamic" class is used on the UL and LI elements to apply the drop-down menu themes.
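For reference, the relevant control in the copied master page looks roughly like this (the surrounding attributes are carried over from v4.master and may vary by install; only MaximumDynamicDisplayLevels changes):

```xml
<!-- Top navigation menu in the copied master page -->
<SharePoint:AspMenu
    ID="TopNavigationMenuV4"
    Runat="server"
    EnableViewState="false"
    DataSourceID="topSiteMap"
    UseSimpleRendering="true"
    UseSeparateCss="false"
    Orientation="Horizontal"
    StaticDisplayLevels="2"
    MaximumDynamicDisplayLevels="2" />  <!-- was "1": no fly-outs -->
```

Setting the value to "2" lets one level of child sites/links fly out under each top navigation item.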

One thing that got me: after I completed the master page, I didn't realize that I needed to publish it to the master page document library as a major version at least once for approval. Once it is approved, the new master page becomes available to the site collection, sites and sub-sites.

It's nice to give your entire site a face-lift by re-using the traditional .NET master page technology. :)