Master Data Management And Data Governance

Master data management

What is Master Data Management

The objective of Master Data Management (MDM) is to provide and maintain a consistent view of an organisation's core business entities, which may involve data that is scattered across a range of application systems. The types of data vary by industry; examples include Customers, Suppliers, Products, Employees and Finances. Presently many MDM applications concentrate on the handling of customer records and data, because this aids the sales and marketing process. A customer-focused MDM solution is called Customer Data Integration (CDI).

MDM and CDI are presented as technology but in reality they are business applications. The objective of both MDM and CDI is to provide a consistent view of dispersed reference data. This view is created using data integration techniques and technologies, and may be used by business transaction applications and analytic applications. Data integration techniques include the following (a small consolidation sketch follows the list):

  • Data Consolidation – Captures data from multiple source systems.
  • Data Federation – Provides a single virtual view of one or more source data files.
  • Data Propagation – Copies data from one location to another.
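
As a rough illustration of data consolidation (the first technique above), the R sketch below merges customer records from two hypothetical source systems into one master table; the system names, fields and matching key are all invented for the example.

```r
# Hypothetical customer records from two source systems (a CRM and a billing system).
crm     <- data.frame(cust_id = c(1, 2, 3),
                      name    = c("A. Byrne", "B. Kelly", "C. Murphy"))
billing <- data.frame(cust_id = c(2, 3, 4),
                      address = c("Dublin", "Cork", "Galway"))

# Consolidate into one master customer view, keeping every customer
# that appears in either system (a full outer join on the shared key).
master <- merge(crm, billing, by = "cust_id", all = TRUE)
print(master)
```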

Benefits of Implementing Master Data Management

  • Operational Efficiency.
  • Improved decision making.
  • Regulatory compliance.
  • Strategic M&A.

Steps To Implementing Master Data Management

Step 1 assesses the current mastering capabilities. During this step you should assess the MDM maturity of the records in scope. In order to quantify the impact of the MDM technology, it is important to have a relative point of comparison.

Step 2 involves envisioning the future data mastering capabilities and the solution plan to support them. In addition, it is important to define the implementation plan and understand the cost of implementation.

Step 3 is where we truly understand the benefits of MDM technology to the business. It is during this stage that we quantify the business value of the technology. Using the investment costs from Step 2 and the quantified value from Step 3, we are ready to calculate the ROI of MDM.

Master Data Management and Data Governance

Effective Data Governance serves an important function within the enterprise, setting the parameters for Data Management and usage, creating processes for resolving data issues and enabling business users to make decisions based on high quality data and well managed information assets. Implementing a data Governance framework is not easy. Factors that come into play are Data ownership, Data inconsistencies across different departments and the expanding collection and use of big data in companies.

Businesses cannot do Master Data Management without Governance. MDM unites multiple users and data sources, while Governance creates an agreement on the rules of interaction among systems. Governance enables MDM's success by providing business context and frameworks, and ensures that MDM is not treated as a simple IT project: it brings users together to discuss business rules for data usage. MDM in turn makes Data Governance more relevant, because those Governance policies become tangible. Both Master Data Management and Governance need to merge to become Master Data Governance (MDG).

Enterprise level Governance that spans both data and process is increasingly a key requirement put forth by IT Executive Management.


Think of Governance as a component model, where several inter-related yet distinct components seamlessly interact to provide a connected environment that fosters ownership and accountability.

Business Intelligence – Hadoop


Hadoop

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an Apache project.

Hadoop is generally seen as having two core parts, a file system (the Hadoop Distributed File System) and a programming model (MapReduce), supported by a set of shared libraries known as Hadoop Common.

To understand Hadoop, you must understand the underlying infrastructure of the file system and the MapReduce programming Model.

The Hadoop Distributed File System (HDFS) allows data, and the applications that process it, to be spread across multiple servers. Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed throughout the cluster. In this way, the Map and Reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.

The goal is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For higher performance, MapReduce tries to assign workloads to the servers where the data to be processed is stored. Think of a file that contains the phone numbers for everyone in Ireland: the surnames starting with A might be stored on server 1, those starting with B on server 2, and so on. In Hadoop, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook your program would need the blocks from every server in the cluster. To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default. All of Hadoop's data placement logic is managed by a special server called the NameNode, which keeps track of all data files in HDFS, such as where the blocks are stored.
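
As a toy illustration of this placement idea (not real HDFS, just a sketch with invented server names and a made-up file), the R snippet below splits a set of records into blocks and copies each block to three of five hypothetical servers, mirroring the default replication factor described above.

```r
# Toy illustration of block placement; not actual HDFS behaviour.
records <- paste0("phone_entry_", 1:20)                      # pretend file contents
blocks  <- split(records, ceiling(seq_along(records) / 5))   # four blocks of five records

set.seed(42)                                                  # make the placement reproducible
servers   <- paste0("server", 1:5)
placement <- lapply(blocks, function(b) sample(servers, 3))   # three replicas per block

placement   # which servers hold a copy of each block
```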

The Basics of MapReduce

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. MapReduce refers to two tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
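
The classic word-count example captures the two phases. The minimal R sketch below (with made-up input lines) mimics them in miniature: the map step emits a (word, 1) pair for every word, and the reduce step sums the values for each key.

```r
lines <- c("big data big value", "data drives value")   # example input

# Map: split each line into words and emit a (word, 1) pair per word.
words <- unlist(strsplit(lines, " "))
pairs <- data.frame(key = words, value = 1)

# Reduce: combine all pairs that share a key by summing their values.
word_counts <- tapply(pairs$value, pairs$key, sum)
word_counts
```

In a real Hadoop job the map and reduce functions run in parallel on the servers that hold each block of data, rather than on a single machine as here.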

The Hadoop MapReduce APIs are called from Java, which requires skilled programmers.

Pig and Pig Latin

Pig was initially developed at Yahoo to allow people using Hadoop to focus more on analysing large data sets and spend less time having to write mapper and reducer programs.

Hive

Facebook developed Hive, a runtime Hadoop support structure that allows anyone fluent in SQL to leverage the Hadoop platform. It allows SQL developers to write Hive Query Language (HQL) statements. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.

Jaql

Jaql is primarily a query language for JavaScript Object Notation (JSON), but it supports more than just JSON. It allows you to process both structured and non-traditional data, and was donated by IBM to the open source community. Jaql allows you to select, join, group and filter data that is stored in HDFS.

There are many other open source projects that fall under the Hadoop umbrella, either as Hadoop subprojects or as top-level Apache projects: ZooKeeper, HBase, Oozie, Lucene and many more.

 

References:

http://public.dhe.ibm.com/common/ssi/ecm/en/iml14296usen/IML14296USEN.PDF

Big Data And The 3 V’s


As the 3 V's of big data (Volume, Velocity and Variety) continue to grow, so too does the opportunity for finance sector firms to capitalise on this data for strategic advantage.

Finance professionals are accomplished in collecting, analysing and benchmarking data, so they are in a unique position to provide a new and critical service: making big data more manageable while condensing vast amounts of information into actionable business insights.

It was not always like this. The most recognisable incident was the collapse of Lehman Brothers in 2008, which would have benefitted from better data analysis. When Lehman Brothers went down, it was called the Pearl Harbour moment of the US financial crisis, yet it took the industry days to fully understand how firms were exposed to that kind of risk. Today, with advances in big data analytics and data processing, whenever a trader makes a trade, financial firms with the right infrastructure know what is going to happen in real time through risk management.


Volume

Volume presents the most immediate challenge to conventional IT infrastructure. It calls for scalable storage and a distributed approach to querying. If you could run a forecast taking into account 300 factors rather than 6, could you predict any better? Assuming that the volumes of data are larger than those conventional database infrastructures can cope with, processing options break down into a choice between massively parallel architectures (data warehouses or databases such as Greenplum) and Apache Hadoop-based solutions. Data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process. At its core, Hadoop is a platform for distributing computing problems across a number of servers.

The vast majority of capital markets firms are, however, cautious about the use of public cloud technology in commercially sensitive areas. Security remains a concern for most firms, and as big data is used to deliver insights for revenue-generating functions, senior managers may decide against handing over sensitive information to cloud providers. Private clouds tend to be the norm, but these services are expensive.

Velocity

The increasing rate at which data flows into an organisation has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialised companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. The internet and smartphone era means that the way we deliver and consume products and services increasingly generates a data flow back to the provider. The New York Stock Exchange captures 1 terabyte of information each day, and by 2016 there will be an estimated 18.9 billion network connections, roughly 2.5 connections per person on earth. Financial institutions can differentiate themselves from the competition by focusing on efficiently and quickly processing trades. Source: http://www.investopedia.com/

Variety

Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse and does not fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these come ready for integration into an application. Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy.

Data Mining And The Financial Markets


Data Mining

In an article in the Financial Times dated 20th May 2013, US regulators were reported as saying that High Frequency Trading (HFT) contributed to the fall in the Dow Jones Industrial Average in May 2010. However, the HFT of today is very different from that of three years ago.

This is because of “Big Data”. Financial markets are big producers of big data: trades, quotes, earnings statements and consumer research reports.

About two years ago, it became common for hedge funds to extract market sentiment from social media. The idea, which came from Zhang et al., was to develop algorithms based on the millions of messages posted by users of Twitter in order to detect public sentiment and trends in relation to individual companies. Within the past couple of years it has also become popular to develop algorithms that fire off orders as soon as unscheduled information, such as a natural disaster or terrorist attack, is published. This is hardly a crazy concept; the stock market is fuelled largely by the perceptions of investors and how those investors react to news.

What happens when it goes wrong was evidenced by the so-called hash crash of 23rd April 2013, when the market dropped by 143 points because of a bogus tweet about a terrorist attack on Barack Obama, sent from the hacked Twitter feed of the much respected Associated Press.

Unlike the crash of 2010, when heavy selling triggered further selling, this was not a speed crash: it was a “big data” crash. The panic, however brief, demonstrates how tightly intertwined Wall Street has become with Twitter, a site that acts as both a chat room and a news service, where journalists and publications regularly send out breaking news. There were also concerns over what many suggested was the lurking menace of trading algorithms that scan the news and trade quickly, causing flash crashes.

Text Mining

Text mining is the analysis of natural language (articles, books and so on), using text as a form of data. It is often combined with data mining, the numerical analysis of data, and the two together are referred to as Text and Data Mining (TDM). TDM involves using advanced software that allows computers to read and digest digital information far more quickly than a human can. TDM software breaks digital information down into raw data and text, analyses it, and surfaces patterns in stock markets and commodities.
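
As a toy illustration of the idea, the R snippet below scores a few invented headlines against small hand-picked lists of positive and negative words; real sentiment engines are far more sophisticated, and the headlines and word lists here are made up for the example.

```r
headlines <- c("company beats profit forecast",
               "regulator launches fraud probe into company",
               "strong growth lifts shares")

positive <- c("beats", "strong", "growth", "lifts", "profit")
negative <- c("fraud", "probe", "loss", "falls")

# Crude sentiment score: positive word count minus negative word count.
score_headline <- function(text) {
  words <- unlist(strsplit(tolower(text), " "))
  sum(words %in% positive) - sum(words %in% negative)
}

sapply(headlines, score_headline)
```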

Regression is the main statistical technique used to quantify the relationship between two or more variables.
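
As a quick illustration in R, using the built-in cars data set (stopping distance versus speed) rather than financial data:

```r
# Fit a simple linear regression: stopping distance as a function of speed.
model <- lm(dist ~ speed, data = cars)
summary(model)          # the coefficients quantify the relationship

plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(model)           # overlay the fitted regression line
```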

According to a recent Gartner report, the business intelligence market is growing at 9% per year and will exceed 80 billion by 2015, with about 50% of that coming from predictive analytics. Despite all this, the best opportunity is still the one you feel most passionate about and know best, and that no one else has recognised. The possibilities are endless.

Recommended Further Research

God and Norman Bloom

The Signal and the Noise: The Art and Science of Prediction, by Nate Silver.

Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets, by Nassim Nicholas Taleb.

Pi (film, 1998), written and directed by Darren Aronofsky.

References:

http://www.thestreet.com/story/13044694/1/how-traders-are-using-text-and-data-mining-to-beat-the-market.html

http://searchsqlserver.techtarget.com/definition/data-mining

http://nerdsonwallstreet.com/stupid-data-miner-tricks-quantitative-finance-85/

www.gartner.com/it-glossary/masterdatamanagement-mdm

Data Mining


Data mining is an analytical process designed to explore large amounts of data in search of consistent patterns and systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.

The main goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business application.

What Can Data Mining Do

Companies in a wide range of industries are already using data mining tools and techniques to take advantage of historical data. By using pattern recognition technology and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognise relationships, trends, patterns, exceptions and anomalies. For businesses, data mining is used to discover sales trends, develop smarter marketing campaigns and accurately predict customer loyalty.

Specific Uses Of Data Mining Are

  • Market segmentation – Identify common characteristics of customers who buy the same products from your company.
  • Fraud detection – Identify transactions that are most likely to be fraudulent.
  • Direct marketing – Identify which prospects should be included in a mailing list to obtain the highest response rate.
  • Interactive marketing – Predict what each individual accessing a web site is most likely interested in seeing.
  • Market basket analysis – Understand what products or services are commonly purchased together, e.g. bread and butter.
  • Trend analysis – Reveal the difference between a typical customer this month and last month.

Data Mining Process

 

Problem Definition

Data mining projects are often structured around the specific needs of an industry sector, or even tailored and built for a single organisation. A successful data mining project starts from a well-defined question or need.

Data Gathering and Preparation

Data gathering and preparation is about constructing a dataset from one or more data sources and getting familiar with the data. Data preparation is usually a time-consuming process that is prone to errors.

Model Building And Evaluation

Predictive modelling is the process by which a model is created to predict an outcome. If the outcome is categorical it is called classification, and if the outcome is numerical it is called regression.

Descriptive modelling or clustering is the assignment of observations into clusters so that observations in the same cluster are similar. Association rules can find interesting associations amongst observations.

Knowledge Deployment

The knowledge gained will need to be organised and presented in a way that the customer can use. It will be mainly up to the customer to decide on and carry out the deployment steps.

Methods

  1. Data Mining Tools
  2. Programming Language (Java, C, VB, R)
  3. Database SQL Script
  4. PMML (Predictive Model Markup Language)

Data Mining Algorithms

There are many different algorithms that organisations can use in predictive modelling; I am going to list just a few of them.

Decision Trees

Decision trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (dependent) variable based on the values of several input (independent) variables. The structure of the decision tree reflects structure that may be hidden in your data.
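
As a small sketch, the R snippet below grows a classification tree on the built-in iris data set using the rpart package (shipped with standard R distributions), predicting the species from the flower measurements.

```r
library(rpart)

# Grow a classification tree: Species is the target, all other columns are inputs.
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                               # the learned splitting rules

# Predict the class of the first few observations.
predict(tree, head(iris), type = "class")
```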

Clustering – The K-means Algorithm

Clustering is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters. There are a large number of clustering algorithms, of which K-means is one of the most widely used.
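
A minimal K-means example in R, using the built-in iris measurements and asking for three clusters:

```r
set.seed(123)                                   # K-means starts from random centres

# Cluster the four numeric measurements into three groups.
km <- kmeans(iris[, 1:4], centers = 3)

km$size                                              # observations per cluster
table(cluster = km$cluster, species = iris$Species)  # compare clusters with the known species
```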

Association Analysis

Association Analysis is the task of uncovering relationships among data.

An association rule model identifies how data items are associated with each other. It is used in retail sales to identify products that are frequently purchased together, and is sometimes referred to as Market Basket Analysis.
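
The sketch below computes the support and confidence of one hypothetical rule, {bread} => {butter}, from a handful of made-up shopping baskets; dedicated R packages such as arules automate this search over all candidate rules.

```r
# A few made-up shopping baskets.
baskets <- list(c("bread", "butter", "milk"),
                c("bread", "butter"),
                c("bread", "jam"),
                c("milk", "butter"),
                c("bread", "butter", "jam"))

n <- length(baskets)
has_bread        <- sapply(baskets, function(b) "bread" %in% b)
has_bread_butter <- sapply(baskets, function(b) all(c("bread", "butter") %in% b))

support    <- sum(has_bread_butter) / n               # how often the pair occurs overall
confidence <- sum(has_bread_butter) / sum(has_bread)  # P(butter | bread)
c(support = support, confidence = confidence)
```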

References:

http://www.statsoft.com/textbook/data-mining-techniques

http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

Try R Programming Language

Introduction To R Programming Language

Because of the vast amount of big data available to corporations and businesses in today's technology world, the R programming language has become an increasingly popular choice among programmers and organisations. R is very powerful when it comes to exploring data, visualising data and developing new statistical models. R is an implementation of the S programming language, though there are differences; it is an object-oriented language, and much of its underlying code is written in C and Fortran.

R first appeared in 1996, when statistics professors Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand released the code as a free software package.

R is particularly useful because it contains a number of built-in mechanisms for organising data, running calculations on the information and creating graphical representations of data sets.

Packages written for R add advanced algorithms, coloured and textured graphs and mining techniques to dig deeper into databases. The Financial Services community has demonstrated a particular affinity for R; dozens of packages exist for derivatives analysis alone.

“The great beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, Chief Economist at Google.

R is open source and free to download: simply go to www.r-project.org and follow the download instructions for your PC or laptop.

R supports matrix mathematics, and its data structures include vectors, matrices, arrays, data frames (similar to tables) and lists. R can even be used as a calculator: for example, if you input 1+1 at the R command prompt you will receive the answer 2.
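
A few lines typed at the R prompt show these ideas in action; the values are arbitrary.

```r
1 + 1                                  # R as a calculator: returns 2

v <- c(2, 4, 6)                        # a numeric vector
m <- matrix(1:6, nrow = 2)             # a 2 x 3 matrix
m %*% t(m)                             # matrix multiplication

df <- data.frame(county   = c("Dublin", "Cork"),   # a data frame, similar to a table
                 rainfall = c(75.2, 90.1))
str(df)                                # inspect its structure
```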

Creating Maps

Luckily, R has some sample data sets to play around with. One of these is volcano, topographic data for a dormant volcano in New Zealand.

It is simply an 87 x 61 matrix of elevation values, but it shows the power of R's matrix visualisations.

The image function will create a heat map.
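
The exact commands were captured only as screenshots, but they would have been along these lines (the perspective plot is one common way to show the data in 3D):

```r
dim(volcano)          # the 87 x 61 matrix of elevation values
volcano[1:5, 1:5]     # peek at one corner of the matrix

image(volcano)                                      # heat map of the elevations
persp(volcano, theta = 30, phi = 30, expand = 0.5)  # a 3D perspective view
```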

Plotting A Graph in R

For this part I downloaded a .csv file from the Met Éireann website containing the monthly rainfall in Dublin for the past 30 years. The file was too big, so I kept just the 12 months of 2014 to make it easier to work with.

In order to plot a graph in R, the data has to be in long format, with one row per month; to get it there I put in a short script, and a reconstruction of the plotting step is sketched below.
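
The original script and graph survive only as screenshots, so the following is a reconstruction under assumptions: the file name and column names are guesses at what the Met Éireann extract might have looked like, and the reshaping itself is not reproduced because the original layout is unknown; the sketch assumes the data already has one row per month, in calendar order.

```r
# Hypothetical file layout: twelve rows (January to December 2014) with rainfall in millimetres.
rain <- read.csv("dublin_rainfall_2014.csv")   # e.g. columns: month, rainfall_mm

barplot(rain$rainfall_mm, names.arg = as.character(rain$month),
        xlab = "Month", ylab = "Rainfall (mm)",
        main = "Monthly rainfall in Dublin, 2014")
```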

The resulting graph shows the amount of rainfall in Dublin for 2014, with the amounts given monthly. I found the R language to be very unforgiving at times, mainly because of the number of error messages that came up. The graph is not as good as it should be; I will have to keep practising the R language.

 

Google Fusion Tables

For this assignment the aim was to create a map of Ireland with data obtained from the 2011 Census of Population, broken down by the counties of Ireland. The map was created with Google Fusion Tables. The following are the steps I took to complete the assignment.

  1. I opened an account with Google Fusion Tables; it is free to use.
  2. I then downloaded the Irish KML data file from http://www.independent.ie/editorial/test/map_lead.kml. KML is a file format used to display geographic data in an earth browser, such as Google Earth, Google Maps and Google Maps for mobile.
  3. I opened Google Fusion Tables and clicked Create new table. In the Import new table dialog box I chose File, selected the population data from http://www.cso.ie/cn/statistics/populationofeachprovincecountyandcity2011/ and clicked Next.
  4. I clicked Tools, then Change map, then Change feature styles; under Polygons, Fill colour, I chose a colour, clicked Buckets and chose 5, then clicked Legends.
  5. To publish or export the map to a website, click Tools, then Publish, and copy the HTML code in the box.
  6. Paste the HTML code into the website or blog's HTML editing feature.
The file downloaded from the Central Statistics Office is a table consisting of four columns headed County or City, Male, Female and Total Population, giving a breakdown of the total population of Ireland. The heat map shows the population by rank, with each county coloured according to its rank.

How the Table Could be Developed

The map that I completed did not tell us a whole lot, and no decisions could be made from the information obtained. However, the table could be developed further with information broken down by the following:

  • Age.
  • Occupation.
  • Average annual pay per household.
  • Farm holdings.
  • Industry employment.

With this extra information a lot more insight could be gained, graphed or included in the map. It could indicate what type of services or goods could be marketed in the different counties, and where best to concentrate a marketing campaign.

Because Google Fusion Tables is free it is a very valuable tool to use; graphs are relatively easy to produce, and with practice a lot of information can be broken down and presented very professionally.