Learning about Software Coding Practices

Tuesday, July 28, 2020

So, this post is about the new things that I learned about coding principles in general in Udacity's AWS Machine Learning Foundations Course.

Software Coding Practices

These are practices that help you write clean, efficient, well-tested code. The different aspects that I learned about these practices are listed below:

  • Write clean and modular code
  • Improve code efficiency
  • Add effective documentation
  • Use version control
  • Testing
  • Logging
  • Code Reviews

Writing Clean and Modular Code

Clean and modular code divides a large program into small parts, which makes it easier to write as well as to understand and debug.

Learnt that we should avoid adding redundant comments by using descriptive variable names: avoid abbreviations and single letters, but also avoid overly long names (only relevant information should be conveyed via the name).

There were various points regarding the creation of modular code. Learnt about the Don't Repeat Yourself (DRY) principle: if you are repeating code that you had written earlier, chances are you can refactor your code and use modularization to reuse it. Also, checking that a function does only one thing and doesn't take lots of arguments ensures that you don't write an overly complicated function. We should also abstract out logic to ensure code readability, while making sure that we don't over-engineer the code.
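A minimal sketch of the DRY idea (the `normalize` function and the sample data here are my own hypothetical example, not from the course): the repeated min/max arithmetic is pulled into one small, single-purpose function that gets reused.

```python
def normalize(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Instead of repeating the scaling arithmetic for each dataset,
# the same single-purpose function is reused:
heights = normalize([150, 160, 175, 190])
weights = normalize([50, 68, 80, 95])
```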

The PEP 8 Style Guide is a great guide on how to write properly formatted code that is easier to read.

Code Optimization

Learnt about how making small changes can alter the time taken to process large amounts of data, and how some functions can be better than others depending on the specific use case, such as using vector operations over loops when possible (using numpy's intersect1d method to find the common elements of two arrays). Also, knowing your data structures and which of their methods are faster (using the set's intersection method to get the common elements of two arrays) can help optimize code.
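A quick sketch of the three approaches mentioned above (the sample arrays are made up for illustration):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([4, 5, 6, 7, 8])

# Loop-based approach: slow for large arrays
common_loop = [x for x in a if x in b]

# Vectorized approach with numpy
common_np = np.intersect1d(a, b)

# Set-based approach: fast membership checks
common_set = set(a).intersection(b)
```

All three return the common elements 4, 5, and 6, but the numpy and set versions scale far better as the arrays grow.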

Also learnt how refactoring code can improve its efficiency; for example, selecting numpy array indices based on a condition is faster than looping through the array to select certain elements and then summing them.
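As a small sketch of that refactor (the `prices` data and the threshold are hypothetical), the boolean-mask version replaces the explicit loop:

```python
import numpy as np

prices = np.array([12.5, 99.0, 45.3, 7.8, 150.0])

# Loop version: select elements above a threshold, then sum
total_loop = 0.0
for p in prices:
    if p > 20:
        total_loop += p

# Vectorized version: select indices by condition, then sum
total_np = prices[prices > 20].sum()
```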

Documentation

Writing documentation is also an important part of software development. Documentation can be divided into 3 levels: inline, function-level, and project-level, each with its own purpose.

  • Inline documentation should be written in order to explain parts that are not self-explanatory. Learnt how this documentation needs to strike a balance and should be used when refactoring code is not enough. An example can be using a non-conventional method to prevent a bug that has crept in due to external conditions.

  • Function-level documentation, or docstrings, should be written in order to explain what a function is doing. Learnt how this documentation should be present in every function and involves a single-line description along with what arguments the function takes and what it returns. In case the function is complicated enough, a paragraph explaining it can be added after the single-line description.

  • Project documentation is the documentation that we write in the README files 😊. Learnt how this documentation should tell what the code is about, list the dependencies, as well as give instructions for use. The course then mentions some READMEs of popular projects - Bootstrap, Scikit-Learn, Stack Overflow Blog
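Putting the docstring advice into a small sketch (the function itself is a hypothetical example): a single-line description, the arguments, and the return value.

```python
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population (int): The population of the area.
        land_area (float): The land area in square kilometres.

    Returns:
        float: Population per square kilometre.
    """
    return population / land_area
```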

Version Control

The course gave a bit of a refresher on the git version control system and shared links that explain git branching and handling merge conflicts.

What was new to me was how model versioning can help in data science: by adding the model's score to each commit, you can keep track of model versions and save the effort of separately tracking the different hyperparameters, amounts of data, seeds, etc. Here is a link to the blog shared in the course that talks about data version control.

Testing and Test-Driven Development

I was aware of the concept of incorporating testing while writing code for software development, but not from the perspective of data science.

Problems in data science can include wrong encodings, features being used inappropriately, as well as data breaking the assumptions of our model.

Here is a link regarding testing in data science. Data-breaking assumptions can become a very real problem, especially when development spans many days and assumptions are not properly documented.

Unit Testing and Integration Testing are the most common types of testing [check if there are other types of testing]

Unit tests are tests that cover a unit of code, usually a single function, independently from the rest of the program.

The advantage is that they are isolated from the rest of the program, so no dependencies are involved; they don't require access to databases, APIs, or other external sources of information. However, this also means they can't ensure that the program as a whole will run successfully. To show that the parts of the program are working together as intended, we use integration testing.

Learnt about test-driven development, a development process where you write tests for the tasks that you plan to code before implementing them. Was introduced to pytest, a tool to run test cases. Unit test functions are defined in files whose names start with test_ (this can be changed in the configuration). The pytest tool looks for these files, runs the tests, and outputs which test functions failed. It is thus recommended to have only one assert statement per test. A test run doesn't stop due to failed assert statements, but it may fail due to syntax errors.
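A minimal sketch of what such a test file could look like (the `days_until_launch` function and the file name are hypothetical examples, not from the course):

```python
# test_launch.py -- pytest discovers files and functions prefixed with test_
# and runs every test, reporting which ones failed.

def days_until_launch(current_day, launch_day):
    """Return the number of days left, never negative."""
    return max(launch_day - current_day, 0)

# One assert per test, so a failure pinpoints exactly one behaviour:
def test_days_until_launch_future():
    assert days_until_launch(10, 15) == 5

def test_days_until_launch_past():
    assert days_until_launch(20, 15) == 0
```

Running `pytest` in the directory picks this file up automatically; in real TDD the tests would be written first and would fail until the function is implemented.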

Logging

Logging helps you understand the events that occurred while running the program. It is the process of recording messages that describe those events.

While most of my logging consisted of printing random statements in order to find what part of my code wasn't working properly 😛, in most data science libraries you will often find completed events mentioned in the logs, as well as progress reports when training models.

I learnt about 3 main logging levels: DEBUG for any event that happens in the program, ERROR for recording any errors that occur, and INFO for recording user-driven or system-specific actions. Creating clear, concise, and professional logs is good practice.

Code Reviews

Got to know about code reviews. Code reviews are conducted to improve the code written by catching errors, ensuring code readability, sharing knowledge, and making sure that the code conforms to the style guide and that best practices are being followed. Using a linter, writing objective comments, providing code examples, and explaining issues and giving suggestions can help ensure a better and smoother code review.

Some questions you can ask to review the code by yourself (improving it further before the group review 😊):

  • Is the code clean and modular?
  • Is the code efficient?
  • Is documentation effective?
  • Is the code well tested?
  • Is the logging effective?

I got to know about how each organization has its own coding style and how we can use a code linter to ensure that we write clean code. Although not mentioned in the course, I found a code formatter by the name of Black that formats code. It's currently in beta, do check it out.

Conclusion

I learnt about a different side of software engineering, apart from LOC and software development cycles. This was a more hands-on, practical side of it. I'll talk about more things that I learn in another blog. See you in the next post! 😁

