Tuesday 3 December 2013

Timsort

Timsort is the default sorting implementation in Python. It is a hybrid sort that uses binary insertion sort and merge sort. It is also adaptive. This means that the algorithm will behave differently based on the input factors. In a sense, the algorithm tries to select the optimal sorting approach to a particular input based on different properties of that input such as input size. The main idea behind timsort is to leverage natural runs. Natural runs are subsets of the initial input data that are ordered. In other words, the input data may have subsections that are already sorted. For a completely random distribution this approach does not offer any improvement, but most data sets are will have ordered subsets. Tim sort has minrun variable that defines the smallest size of an ordered subset such that it is treated as a natural run. Runs that have are below this threshold are build up using binary insertion. Once runs are identified the merge subroutine combines consecutive runs. (note a run could be as small as two elements)
In summary, timsort uses the idea of natural runs to increase performance when operating on sets that have ordered subsets. Runs are build up from ordered subsets and then combined using the merge subroutine. So essentially search space is split up into smaller chunks which are then sorted using insertion sort and everything is put back together using merge sort. Something to note here, is that timsort is not optimized for totally random distributions, and the adaptive overhead of the algorithm will probably slow it down when dealing with random data.

Tuesday 26 November 2013

Context managers in python --verbose

Context manager is a piece of code that wraps around another block of code, creating a local run time context for that block. A context type is an object with two methods .__entry__() and .__exit__(exc_type, exc_val, exc_tb). Basically these two methods define the setup and tear down of the state. This state is insulated from outer scopes and context managers are used to execute code inside this 'sandbox' environment to prevent any changes to outer namespaces. Context managers are also useful when refactoring code. If you have code that follows a similar pattern where you need to preform some form of setup and tear down each time, then you can abstract that into a context manager.
Context manager types are often used with the 'with' statement to wrap code. The with statement executes .__exit__() callable of the context type and assigns the return value to the 'as' target (if any). When the wrapped code finishes execution, the .__exit__(exc_type, exc_value, exc_tb) is called. This code performs clean up and handles any exceptions that were raised. To be more precise, the return type of this callable is a boolean, true to signal that all exceptions raised will be suppressed (either during the exec
ution of the block statement or the execution of the __exit__()) or False to indicate that exceptions should propagate further.

Thursday 21 November 2013

Python Decorators

A word about decorators in Python. A 'decorators' is a piece of code that modifies other code (often refered to as metaprogramming). There are two types of decorators: a function and a class decorators, modifying functions and classes respectively. Decorators take in a callable type and return a callable type. Lets restrict our attention to the following subset of callable types: user defined functions, builtin function and generator functions. Functions, like everything else in Python, is an object. The most simple decorator may take a user defined function as input and add an extra attribute to this function, returning this new version. In this case, the action of the underlying callable object does not change, we simply use the decorator to attach more information to it. A more complicated example of a decorator is one that changes the action of the 'decorated' callable, returning a new modified callable with different behaviour. An example of a decorator is the property builtin that was introduced in class. This function 'decorates' variables allowing to modify/restrict access, change doc strings etc. Note that variables are also objects, instances of more 'fundamental' classes such as int, float etc and these themselves are instances of a factory object 'type'. Factory as in a sense that this object assembles/creates 'fundamental' classes.

Monday 11 November 2013

Adobe data leak

Several days ago major news organizations picked up the story of the adobe data leak. Adobe released a security notice on Oct 3 notifying its customers that their data has been stolen. According to the security notice, 2.9 customers were affected; however, 38 million users were notified and it is believed that as much as 150 million passwords were stolen. Stolen data consisted of emails, encrypted passwords and plaintext hints (user hints to remind themselves of their password in case they forget it). Adobe claimed that all confidential data that was lost was encrypted. Recent analysis by Paul Ducklin seems to indicate that stolen data was not salted or hashed. Worse, it appears that the passwords were encrypted using a single key. In other words, if someone gets the secret key then they would have be able to decrypt every single password. The discussion around this story comes down to how badly Adobe mismanaged user passwords (using block cypher instead of hash and salt combination) and the sheer scale of this breach. I would like to mention my take on this story. If a significant amount of passwords are recovered than this event has the potential to have the scale and magnitude of 2009 rockyou.com leak. In other words, this could be a giant data sample that could be used to refine and improve password crackers. Password crackers don't simply brute force passwords, but they exploit the common features of passwords (ie substituting numbers for letters, appending number to words). In other words, crackers look for common patterns in passwords and write algorithms that exploit those patterns, significantly reducing the time needed to break/crack passwords. This leak could be a potential gold mine for passwords. Sure most of these (even all) have been changed but understanding common themes in how people chose passwords may be a lot more valuable then the initial breach.

Monday 28 October 2013

Cryptolocker


Cryptolocker is a piece of ransomware that encrypts data on the infected host. Cryptolocker uses 2048-bit RSA encryption scheme which has been considered to be secure against brute force attack. In other words, there are no decryption techniques. The purpose of this malware is to hold your sensitive and confidential data as hostage with the intent to extort money for the victim. When infected, the user is asked to pay a ransom of 100-300 dollars/euros. The user is given a specified amount of time to transfer the funds, usually 72 hr. If the user does not pay, the decryption keys are deleted and the data is unrecoverable.
Cryptolocker usually disguises itself as a legitimate company email, asking the user to fill out a travel expense (or some other company form). Of course this is not a pdf but a .exe file. When cryptolocker is up and running on your system it tries to contact one of its command and control servers This server generates the encryption keys and sends it the host. Cryptolocker uses these keys to encrypt your data, it filters by file extension. Formats such as images, video or documents will be encrypted.
Currently, the security industry can only offer advice on how to avoid this infection. However, once infected even security professionals seem to agree that the best option is to pay the ransom. Ironically, people who have paid the ransom report that cryptolocker in fact does decrypt your data and remove itself from your system.

Monday 14 October 2013

Recursion

In the previous post I talked about OOP. The main advantages of the OOP paradigm is that it offers data encapsulation and polymorphism. Data encapsulation separates interface from implementation and polymorphism lets a whole system of data behave as one. A competing language design paradigm is 'functional programing'. In this framework objects have no notion of identity or mutable state. The consequence of this last feature is heavy use of recursion. Recursion is the idea that we can solve a big problem by breaking them down into smaller, self-similar, subproblem. Therefore recursive functions take in the return value of their own return call (recursive call). Recursion is related to mathematical induction and in the most simple case a recursive needs a base case and a recursive case. Base case produces a result without recurring and recursive case divides the problem into a smaller subproblem. The base case is sometimes called the terminating case and the job of the recursive case is to breakdown the problem such that the terminating case is eventually reached and the problem is solved.

OOP

Object oriented programming allows the creation of abstract data types with data fields and methods. Data fields are attributes that describe the object and methods are functions that describe its behaviour. For example, Python is an OOP language. In python we often have to create a class of objects that represent an abstract data type. For example if we are creating a program for a car lot then we might be interested in the car class of objects. Each instance of this class might have the following attributes: model, colour, price and various technical specifications. Some of the methods might include moving the car from one dealership to another, going for a test drive or putting it in the showroom. So an object is just a collection of defining characteristics and a set of of ways we can interact with and manipulate the object. The reason why OOP is a very popular paradigm is because its meaning making process is similar to how we create and assign meaning in the natural languages. We have stronger intuition about this paradigm and it has features that make it suitable for most tasks.