Thursday, August 16, 2012

Data Parsing Gets at the Roots of Your Legacy Data


Getting to the root of your data conversion problems may require designing a custom data parsing algorithm. This is an iterative process that requires an experienced programmer and a user who knows the actual data: what it looks like, its anomalies and the way in which it originated. The business decisions involved depart from those of a standard programming job, because the data being parsed generally passes from one system to another only once. That being the case, some of the parsing may be done manually where that is the cheapest option.

A successful data migration project means transforming the data format and data integrity rules (or lack of rules) of the source system into a different format and a different set of rules: those that make up the database integrity of your new system. The process of evaluating, standardizing and interpreting the source data so that it can be properly reformatted and stored in the target system is sometimes referred to as "parsing". All data from the source system must be examined before it is migrated to the new system. Nothing should be taken for granted, no matter how well defined the data entry methods for the old system may have been. Given the nature of the data and the flexibility of use designed into such systems, parsing is necessary for most data elements.

Perhaps the most universally used data elements that require a parsing algorithm are name and address. Name and address parsing is based on the concept that name and address information is made up of numerous components with common, identifiable characteristics. Although the process is not foolproof, a high degree of success can be achieved in parsing names and addresses so that they can be properly reformatted for use in a system whose formatting requirements differ from those of the system that originally captured the data. One of the most common reasons such parsing gives a less than perfect result is inconsistent data entry.

A simplified overview of name and address parsing:

A name and address block is composed of three main components: name lines, address lines, and city/state/zip lines. Any of these may occur more than once, or may be absent. Each has particular characteristics by which it can be identified, and each is made up of its own set of components. Alternatively, two or all three may be combined in a single data field.

Name lines will usually come first and consist of a name prefix (e.g. Mr., Mrs., Ms.), a first name, middle name and surname, and a name suffix (e.g. Dr., DDS). Compound names may be recognized by keywords or characters like "and" or "&". Corporate names may be identified by words like "Company", "Inc.", etc. To be successful, the parser must take into account such things as spelling errors, plural forms, abbreviations, and hyphenated names. To allow flexibility for the unique characteristics of a particular region or business, the identification of these components is table-driven. The parser must also have options for dealing with names stored in reverse order, i.e. "Smith, John" instead of "John Smith", and in the last-name-first case it must allow for various delimiters marking the surname, e.g. "," or "#" or ".".
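As a rough illustration of the table-driven idea, here is a minimal Python sketch of a name-line parser. Everything in it is an assumption made for the example: the function name `parse_name`, the tiny lookup tables, and the rule that the last remaining word is the surname. A production parser would need far larger tables and would also have to cope with misspellings and plural forms.

```python
# Illustrative lookup tables (assumptions for this sketch, not a complete list).
PREFIXES = {"mr", "mrs", "ms"}
SUFFIXES = {"dds", "jr", "sr"}
CORPORATE_WORDS = {"company", "inc", "corp", "llc"}
COMPOUND_MARKERS = {"and", "&"}


def _key(token):
    """Normalize a token for table lookup: trim, drop edge punctuation, lowercase."""
    return token.strip().strip(".,").lower()


def parse_name(line, surname_delims=","):
    """Split one name line into components using the tables above."""
    text = line.strip()
    keys = [_key(t) for t in text.split()]
    if any(k in CORPORATE_WORDS for k in keys):
        return {"type": "corporate", "name": text}
    if any(k in COMPOUND_MARKERS for k in keys):
        return {"type": "compound", "name": text}

    result = {"type": "person", "prefix": "", "first": "",
              "middle": "", "last": "", "suffix": ""}

    # Reverse order ("Smith, John"): the surname is the text before the
    # delimiter, unless what follows is only a suffix ("John Smith, DDS").
    for delim in surname_delims:
        head, found, tail = text.partition(delim)
        if found and head.strip() and tail.strip() and _key(tail) not in SUFFIXES:
            result["last"] = head.strip()
            text = tail.strip()
            break

    tokens = text.split()
    if tokens and _key(tokens[0]) in PREFIXES:
        result["prefix"] = tokens.pop(0)
    if tokens and _key(tokens[-1]) in SUFFIXES:
        result["suffix"] = tokens.pop()
    if not result["last"] and tokens:
        # Assumption: in normal order the last remaining word is the surname.
        result["last"] = tokens.pop().strip(",")
    if tokens:
        result["first"] = tokens.pop(0)
    result["middle"] = " ".join(tokens)
    return result
```

Calling `parse_name("Smith#John", surname_delims=",#")` treats "#" as a surname delimiter as well. The delimiter list is a parameter rather than a constant because "." also ends prefixes like "Mr.", so it should only be switched on for files known to use it.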

The components of the city lines are city, state, zip and country. City lines are generally recognizable by their position (last), the presence of a recognizable state name or abbreviation, and the presence of a 5- or 9-digit number (the zip code). Provisions must be made for dealing with foreign countries, missing zip codes, misspelled state names, and other data entry errors.
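The city line rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_city_line`, the four-state table and the alias table are invented for the example, and foreign countries are not handled at all. The zip code is taken from the end of the line, then the state is looked up in the tables, and whatever remains is treated as the city.

```python
import re

# Illustrative subsets only (assumptions); real tables would cover every state
# plus the misspellings actually found in the source data.
STATES = {"CA", "NY", "TX", "OH"}
STATE_ALIASES = {"CALIFORNIA": "CA", "CALIF": "CA", "NEW YORK": "NY",
                 "TEXAS": "TX", "OHIO": "OH"}

# A 5-digit zip, optionally followed by 4 more digits, at the end of the line.
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?:-?(\d{4}))?\s*$")


def parse_city_line(line):
    """Split a city/state/zip line; missing parts are left as empty strings."""
    text = line.strip()
    result = {"city": "", "state": "", "zip": ""}

    match = ZIP_RE.search(text)
    if match:
        plus4 = match.group(2)
        result["zip"] = match.group(1) + ("-" + plus4 if plus4 else "")
        text = text[:match.start()].strip(" ,")

    tokens = text.replace(",", " ").split()
    # Try a two-word state name first, then a one-word name or abbreviation.
    for size in (2, 1):
        if len(tokens) >= size:
            candidate = " ".join(tokens[-size:]).upper().strip(".")
            state = candidate if candidate in STATES else STATE_ALIASES.get(candidate)
            if state:
                result["state"] = state
                tokens = tokens[:-size]
                break

    result["city"] = " ".join(tokens)
    return result
```

A line with no zip code, such as "Sacramento Calif.", still yields a city and a state. A line where nothing is recognized comes back with an empty state and zip, which is a useful signal to route that record to manual review.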

Address lines are generally recognizable by their position (between the name and city lines) and by the presence of numeric values, keywords and abbreviations (street, avenue, box, etc.). The components of address lines are more complex and include the house number, street name, street direction, street type, etc.
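A minimal Python sketch of an address line parser along these lines follows. The function name `parse_address_line`, the keyword tables and the output field names are all assumptions for the example. It recognizes a box number first, then peels the house number, directions and street type off the ends of the line, leaving the street name in the middle.

```python
import re

# Illustrative keyword tables (assumptions, far from complete).
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
              "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"STREET", "ST", "AVENUE", "AVE", "ROAD", "RD",
                "BOULEVARD", "BLVD", "DRIVE", "DR", "LANE", "LN"}

# "Box 12", "PO Box 12", "P.O. Box 12"
BOX_RE = re.compile(r"^\s*(?:P\.?\s*O\.?\s*)?BOX\s+(\w+)\s*$", re.IGNORECASE)


def _key(token):
    """Normalize a token for table lookup: uppercase, no trailing period."""
    return token.upper().strip(".")


def parse_address_line(line):
    """Split one address line into box or street components."""
    match = BOX_RE.match(line)
    if match:
        return {"kind": "box", "box": match.group(1)}

    result = {"kind": "street", "number": "", "pre_direction": "",
              "name": "", "type": "", "post_direction": ""}
    tokens = line.replace(",", " ").split()

    if tokens and tokens[0][0].isdigit():
        result["number"] = tokens.pop(0)
    if len(tokens) > 1 and _key(tokens[-1]) in DIRECTIONS:
        result["post_direction"] = tokens.pop()
    if len(tokens) > 1 and _key(tokens[-1]) in STREET_TYPES:
        result["type"] = tokens.pop()
    # Keep at least one word for the street name ("45 East Avenue").
    if len(tokens) > 1 and _key(tokens[0]) in DIRECTIONS:
        result["pre_direction"] = tokens.pop(0)

    result["name"] = " ".join(tokens)
    return result
```

The `len(tokens) > 1` guards are a design choice: a word that looks like a direction or street type is only removed when something is left over to serve as the street name.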

Once the parsing algorithm has correctly identified the components of the address, the individual parts can be reassembled in the form required by the target system. Specific components can be standardized, if desired, by using standard abbreviations and correcting spelling errors. These options often take care of a significant part of the data "scrubbing" that would otherwise have to be done manually.
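To make the reassembly step concrete, here is a small Python sketch. It assumes the street components have already been identified and stored in a dictionary with the hypothetical keys `number`, `name` and `type`, and it assumes a target layout of upper-case "NUMBER NAME TYPE". The abbreviation and spelling tables are invented examples.

```python
# Illustrative tables (assumptions): known misspellings and standard abbreviations.
SPELLING_FIXES = {"AVENEU": "AVENUE", "STREEET": "STREET"}
STANDARD_TYPES = {"STREET": "ST", "STR": "ST", "AVENUE": "AVE", "AV": "AVE",
                  "ROAD": "RD", "BOULEVARD": "BLVD"}


def standardize_type(raw):
    """Correct a known misspelling, then map to the standard abbreviation."""
    key = raw.upper().strip(".")
    key = SPELLING_FIXES.get(key, key)
    return STANDARD_TYPES.get(key, key)


def reassemble(parts):
    """Rebuild a street line in the assumed target layout: NUMBER NAME TYPE."""
    fields = [
        parts.get("number", ""),
        parts.get("name", "").upper(),
        standardize_type(parts.get("type", "")),
    ]
    # Skip empty components so no double spaces appear in the output.
    return " ".join(field for field in fields if field)
```

For example, `reassemble({"number": "123", "name": "Main", "type": "Aveneu"})` returns `"123 MAIN AVE"`. Values that are not in either table pass through unchanged apart from upper-casing, so nothing is lost when the tables are incomplete.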

Conversion tables also play an important role in correcting data entry errors or inconsistencies, and in modifying data elements that are not wrong but do not conform to the data integrity rules of the new system. An example would be changing abbreviated insurance company names where no specific rules were followed and the data does not lend itself easily to a programmable solution.
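A conversion table of this kind can be as simple as a dictionary lookup. The Python sketch below is an assumption-laden illustration: the function name `apply_conversion_table` and the company names in the table are hypothetical. Values found in the table are replaced, and values that are not found are kept as they are and reported, so the user who knows the data can extend the table on the next pass.

```python
# Hypothetical conversion table: abbreviations as keyed in the old system,
# mapped to the form the new system requires.
INSURER_TABLE = {
    "ACME INS": "Acme Insurance Company",
    "ACME INS CO": "Acme Insurance Company",
    "NTL MUTUAL": "National Mutual Insurance",
}


def apply_conversion_table(values, table):
    """Return (converted values, sorted list of values missing from the table)."""
    converted = []
    unmatched = set()
    for value in values:
        # Normalize case and spacing before the lookup.
        key = " ".join(value.upper().split())
        if key in table:
            converted.append(table[key])
        else:
            converted.append(value)
            unmatched.add(value)
    return converted, sorted(unmatched)
```

Returning the unmatched values alongside the converted ones fits the iterative nature of the work: each run shrinks the list of exceptions until what remains is cheap enough to fix by hand.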

Summary:

This explanation has focused on the name and address data component; however, the same general concepts apply to any other data that requires transformation between systems.

It is important to know that extensive development of a parsing tool only comes with a good deal of time and many conversions.

Other Uses:

In some cases, the parsing tool can be used to identify data items that have not been correlated in the database, but where the correlation exists in the transaction records or through another set of conditions and data.

An example:

By extracting and correlating part numbers and their associated part names, we built an electronic translation table that we then used in the conversion. This could have been done manually, but the example cited involved more than one million transactions and hundreds of part numbers and their associated part names. It may seem absurd that a system would be developed and put into production where something as simple as tying the part name to the part number in the database was actually missing. But the conversion was done in our office and we had to build the translation table as described.
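The original tool is not shown here, but the general idea can be sketched in Python. The sketch assumes each transaction has already been reduced to a (part number, part name) pair, and it settles conflicting spellings by taking the name seen most often for each number. Both the function name `build_translation_table` and that tie-breaking rule are assumptions for the example.

```python
from collections import Counter, defaultdict


def build_translation_table(transactions):
    """Map each part number to the part name most often seen with it."""
    counts = defaultdict(Counter)
    for part_number, part_name in transactions:
        number = part_number.strip().upper()
        # Collapse repeated spaces and ignore case so variants count together.
        name = " ".join(part_name.split()).upper()
        if number and name:
            counts[number][name] += 1
    return {number: names.most_common(1)[0][0]
            for number, names in counts.items()}
```

A short usage example with made-up data:

```python
sample = [
    ("a-100", "Left  bracket"),
    ("A-100", "LEFT BRACKET"),
    ("A-100", "LFT BRACKET"),
    ("B-200", "Hinge pin"),
]
table = build_translation_table(sample)
# table["A-100"] is "LEFT BRACKET"; table["B-200"] is "HINGE PIN"
```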

This is just one of many examples from my experience of almost 2000 data conversions. I am sure the reader knows of at least one such case, and probably more, not counting those that "got away".

A well thought out and well planned data conversion that brings together good tools, a talented programmer and an experienced user can save many hours before go-live, and even more after the system is in use.
