BigData - Basics
Data - Data is stored and accessed in many ways: as text or strings, as arrays/lists, and as hash/lookup tables.
                    Python           JAVA                LiveCode                C    C++
in lists/arrays     tuples, lists    arrays, lists       tables
in Lookup tables    dictionary       maps, hash tables   (associative) arrays
Data can be stored and manipulated in lists, tables, arrays and other data structures. This is a separate topic of study in most colleges. Each data structure has its advantages and disadvantages. How you will use the data determines how you should store it. With big data this choice is critical, because processing times often grow much faster than the data itself.
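As a small illustration of the trade-off, the sketch below stores the same records in a list and in a dictionary (lookup table). The records and the helper name `find_in_list` are made up for this example and are not taken from the BigData program.

```python
# Made-up sample records for illustration: (name, gender, count).
records = [
    ("Jacob", "M", 300),
    ("Emily", "F", 250),
    ("Michael", "M", 200),
]


def find_in_list(rows, name):
    """Scan the list item by item until the name matches."""
    for row in rows:
        if row[0] == name:
            return row
    return None


# Dictionary (lookup table): the name is the key, so no scan is needed.
by_name = {row[0]: row for row in records}

print(find_in_list(records, "Emily"))  # found by scanning the list
print(by_name.get("Emily"))            # found directly by key
```

The list keeps the records in order and is easy to sort, but a search has to walk through it. The dictionary finds a record by its key directly, at the cost of extra memory and of needing a unique key.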
The following program illustrates some of the timing differences. Although it is written in LiveCode, the relative results should be similar in most other languages. It uses 3 scenarios:
- a plain text file, searched with a simple "Find" statement. The sorts use the same statement except for the 2nd Gender sort. This shows how different commands can take wildly different times to complete. (Be patient, it takes a long while.)
- a table/Data Grid, which is like a spreadsheet and is self-sorting
- a Lookup table/(Associative) Array, which is much like a database
The program has a timer in milliseconds so that you can compare the times. To see for yourself:
- download the program listed at the bottom: BigData.exe2
- rename the file to BigData.exe (Google does not allow executable files to be uploaded, so I had to rename it from exe to exe2)
It is preloaded with names from the year 2000 (nearly 30,000 records). To download a different year, use the following link: Download All - Download the entire set of all "Names" files since 1897
The source code is provided for you to examine.
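For readers who cannot run the executable, the three scenarios can be roughly approximated in Python. This is only a sketch and not a port of the LiveCode program: the 30,000 names are synthetic stand-ins, and the function names are invented for the example.

```python
import bisect
import time

# Synthetic stand-in data (an assumption): 30,000 made-up "names".
names = [f"name{i:05d}" for i in range(30_000)]

text = "\n".join(names)                        # scenario 1: plain text
sorted_names = sorted(names)                   # scenario 2: sorted table
lookup = {n: i for i, n in enumerate(names)}   # scenario 3: lookup table


def find_in_text(target):
    """Linear scan through the text, like a simple "Find"."""
    return text.find(target) != -1


def find_in_sorted(target):
    """Binary search in the sorted list."""
    i = bisect.bisect_left(sorted_names, target)
    return i < len(sorted_names) and sorted_names[i] == target


def find_in_lookup(target):
    """Direct lookup by key."""
    return target in lookup


def time_ms(func, target, repeats=200):
    """Total time in milliseconds for `repeats` calls of func(target)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(target)
    return (time.perf_counter() - start) * 1000.0


target = names[-1]  # last entry: the worst case for the linear scan
for label, func in [
    ("plain text", find_in_text),
    ("sorted list", find_in_sorted),
    ("lookup table", find_in_lookup),
]:
    print(f"{label}: {time_ms(func, target):.3f} ms")
```

The actual numbers depend on the machine, so none are quoted here. The point is to compare the three timings against each other on the same data.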
Data is shared (exported) in a number of ways:
- Word Processors - as text files - data.txt
- Spreadsheets - as tab-delimited files - data.xls
- Databases - as comma-separated fields - data.csv
To see examples of these files, see Mashups.
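The difference between the tab-delimited and comma-separated forms can be sketched with Python's standard `csv` module. The rows and the helper `to_delimited` are made up for illustration, and the output is built in memory rather than written to data.xls or data.csv.

```python
import csv
import io

# Made-up rows for illustration: a header line plus two records.
rows = [
    ["name", "gender", "count"],
    ["Emily", "F", "250"],
    ["Jacob", "M", "300"],
]


def to_delimited(table, delimiter):
    """Return the table as delimited text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


tab_text = to_delimited(rows, "\t")  # what a spreadsheet export might hold
csv_text = to_delimited(rows, ",")   # what a database export might hold

print(tab_text)
print(csv_text)
```

To produce a real file, the same writer can be pointed at a file opened with `open("data.csv", "w", newline="")` instead of the in-memory buffer.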
Testing and Benchmarks
Often we need data for testing purposes or to do time measurements.
We might want to see if:
- For loops are faster than enhanced for loops.
- Nested if statements are faster than separate if statements or else-if statements.
- Lists or Arrays are faster
- One piece of code is faster than another.
- One sort is faster than the other.
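A question such as the first one in the list can be tested with a small benchmark. The sketch below uses Python's standard `timeit` module to compare an index-based loop with direct iteration, which is roughly the Python counterpart of a for loop versus an enhanced for loop. The function names and the data size are assumptions chosen for the example.

```python
import timeit

# Test data: the size is an arbitrary choice for this sketch.
data = list(range(10_000))


def index_loop(values):
    """Sum the values using an index, like a classic for loop."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def direct_loop(values):
    """Sum the values by iterating directly, like an enhanced for loop."""
    total = 0
    for v in values:
        total += v
    return total


def best_ms(func, repeats=5, number=20):
    """Best time per call in milliseconds over several repeated runs."""
    times = timeit.repeat(lambda: func(data), repeat=repeats, number=number)
    return min(times) / number * 1000.0


print(f"index loop:  {best_ms(index_loop):.4f} ms per call")
print(f"direct loop: {best_ms(direct_loop):.4f} ms per call")
```

Taking the best of several repeats reduces the effect of other programs running at the same time. Both functions should also be checked to return the same result, otherwise the comparison is meaningless.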
When dealing with sorting large amounts of data, the time increases are not linear. Two different sorts may be close for small amounts of data but differ widely as the size of our data increases. As the data doubles, one sort may take twice the time but the other takes 10 times as long.
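The effect of doubling the data can be seen without a stopwatch by counting comparisons instead of measuring time. The sketch below is an illustration under stated assumptions: it uses an insertion sort and a merge sort (the text above does not name specific sorts), and it uses reverse-ordered numbers as the worst case for the insertion sort. The exact ratios depend on the algorithms and on the data.

```python
def insertion_sort_steps(values):
    """Insertion sort; returns (sorted list, number of comparisons)."""
    items = list(values)
    steps = 0
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        while j >= 0:
            steps += 1  # one comparison between two items
            if items[j] <= current:
                break
            items[j + 1] = items[j]  # shift the larger item right
            j -= 1
        items[j + 1] = current
    return items, steps


def merge_sort_steps(values):
    """Merge sort; returns (sorted list, number of comparisons)."""
    items = list(values)
    if len(items) <= 1:
        return items, 0
    mid = len(items) // 2
    left, left_steps = merge_sort_steps(items[:mid])
    right, right_steps = merge_sort_steps(items[mid:])
    merged = []
    steps = left_steps + right_steps
    i = j = 0
    while i < len(left) and j < len(right):
        steps += 1  # one comparison between two items
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, steps


# Double the data size each time and compare the comparison counts.
for n in (250, 500, 1000):
    reversed_data = list(range(n, 0, -1))
    _, slow = insertion_sort_steps(reversed_data)
    _, fast = merge_sort_steps(reversed_data)
    print(f"n={n}: insertion sort {slow} comparisons, merge sort {fast}")
```

On reverse-ordered input the insertion sort needs n(n-1)/2 comparisons, so doubling the data roughly quadruples the work, while the merge sort count grows only a little faster than the data.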