We used Amazon SageMaker for model training and inference. SageMaker's managed GPU infrastructure, tools, and workflows let you build, train, and deploy machine learning models for a wide range of use cases.

Introduction

In today’s rapidly evolving technological landscape, enterprises must keep their software systems up-to-date and adaptable. Large language models like CodeLlama play a crucial role in this by enabling legacy system modernization and cross-platform development.

Large language models like CodeLlama can help modernize legacy codebases by translating them from languages such as Java to Python, saving time and resources. They can also streamline cross-platform development by converting core logic written in one language into the target languages of other platforms, accelerating development cycles, promoting code consistency, and reducing platform-specific bugs.

What is Code Llama?

Code Llama is a code-specialized version of Llama 2, created by further training Llama 2 on code-specific datasets. It can generate both code and natural language about code.

Code Llama comes in three variants for different purposes: Code Llama, Code Llama - Python, and Code Llama - Instruct. Each variant is available in four sizes, with 7B, 13B, 34B, and 70B parameters, addressing different serving and latency requirements.
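
If you want to try the base model before any fine-tuning, the sketch below loads the 7B Python variant with the Hugging Face transformers library; the Hub model ID and the toy prompt are illustrative choices, not part of the original workflow.

```python
# Minimal sketch: load the 7B Python variant of Code Llama and generate a completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"  # assumed Hugging Face Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Toy prompt purely for illustration.
prompt = "# Write a function that reverses a string\ndef reverse_string(s):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```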

The Challenge of Language Conversion

Each programming language has its own set of conventions, idioms, and paradigms, which can pose challenges during the translation process. Nuanced language constructs, such as exception handling mechanisms, loop structures, and built-in functions, demand a sophisticated approach to ensure accurate translation.
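
Exception handling is a good example: Java's checked exceptions and exception types have no direct Python equivalent, so a translator must map the construct rather than the syntax. The snippet below is our own illustrative comparison.

```python
# Java original (shown as a comment for comparison):
#   int value;
#   try {
#       value = Integer.parseInt(text);
#   } catch (NumberFormatException e) {
#       value = 0;
#   }
#
# Idiomatic Python translation: exceptions are unchecked, and the
# corresponding error type is ValueError, not NumberFormatException.
def parse_or_default(text: str, default: int = 0) -> int:
    try:
        return int(text)
    except ValueError:
        return default
```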

Considering these challenges, we refine the code translation process with a focus on real-world data, aiming to enhance the outputs of Code Llama for Java-to-Python conversion.

In this blog, we use Amazon SageMaker to fine-tune the 7-billion-parameter Python variant of Code Llama (CodeLlama-python-7b) to improve its responses when converting code from Java to Python.
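
A minimal sketch of how such a fine-tuning job can be launched with the SageMaker Python SDK is shown below; the entry point script, instance type, container versions, and hyperparameters are illustrative assumptions rather than our exact configuration.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hypothetical training script and settings; the container versions must
# match an available SageMaker Hugging Face deep learning container.
estimator = HuggingFace(
    entry_point="train.py",         # assumed fine-tuning script
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",  # example GPU instance
    instance_count=1,
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={
        "model_id": "codellama/CodeLlama-7b-Python-hf",
        "epochs": 5,
        "per_device_train_batch_size": 2,
    },
)

# Training data location is a placeholder; point this at your own S3 prefix.
estimator.fit({"train": "s3://your-bucket/java-to-python-dataset/"})
```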

Dataset Creation and Context Comprehension

Creating a custom dataset is crucial for producing relevant outputs for models like CodeLlama. This dataset is built by gathering and curating open-source data from various internet sources. Each entry is carefully vetted for relevance, accuracy, and diversity, making it a robust resource for Java to Python translation.

The dataset is structured around four essential columns:

  1. Java context: Provides contextual information about the Java code snippet that is being translated, including its purpose and functionality.
  2. Java code: Contains the actual Java code snippet that needs to be translated.
  3. Python context: Offers contextual information specific to the Python translation, bridging the gap between Java and Python.
  4. Python code: Presents the translated Python code corresponding to the Java snippet.

This structured format ensures our fine-tuned model produces consistent outputs, leveraging a diverse dataset with Java code snippets from various domains.
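
For illustration, a single record in this four-column format might look like the following; the field names and code snippets are our own assumptions, not entries from the actual dataset.

```python
# One illustrative record in the four-column format described above.
record = {
    "java_context": "Utility method that checks whether a string is a palindrome.",
    "java_code": (
        "public static boolean isPalindrome(String s) {\n"
        "    String reversed = new StringBuilder(s).reverse().toString();\n"
        "    return s.equals(reversed);\n"
        "}"
    ),
    "python_context": "Equivalent helper using Python's slice-based string reversal.",
    "python_code": "def is_palindrome(s: str) -> bool:\n    return s == s[::-1]",
}
```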

Prompt Engineering for Instruction Fine-Tuning

In addition to dataset creation, prompt engineering plays a major role in refining the performance of CodeLlama models. By carefully designing the instruction prompts, we shape the structure of the outputs the model generates. The structure of such a prompt is illustrated below.
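
The template below is a minimal sketch of an instruction-style prompt assembled from the four dataset columns, with wording and section tags that are illustrative assumptions rather than the exact prompt we used.

```python
# Illustrative instruction-style prompt template built from the dataset columns.
PROMPT_TEMPLATE = """### Instruction:
Translate the following Java code to Python.

### Java context:
{java_context}

### Java code:
{java_code}

### Python context:
{python_context}

### Python code:
{python_code}"""

def build_prompt(record: dict) -> str:
    """Fill the template with one dataset record for instruction fine-tuning."""
    return PROMPT_TEMPLATE.format(**record)
```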

A prompt of this form pairs contextual explanation with implementations in both Java and Python. Exposure to such diverse prompts helps our fine-tuned model better understand the intricacies of Java-to-Python conversion, ultimately improving its translation capabilities.

Training and Optimization

Foundation models like CodeLlama demand substantial CPU and GPU resources. By using techniques such as quantization, Fully Sharded Data Parallel (FSDP), and Low-Rank Adaptation (LoRA), we optimize computational resource utilization without compromising performance.

Quantization reduces the model’s memory footprint and computational overhead, enabling efficient deployment.

LoRA improves fine-tuning efficiency by training two small low-rank matrices that approximate the update to the full weight matrix, lowering the barrier to entry and achieving performance comparable to end-to-end fine-tuning without increasing inference latency.

FSDP enhances memory efficiency by sharding model parameters, gradients, and optimizer states across GPUs, improving computational efficiency by overlapping communication with forward and backward passes.

We trained CodeLlama-python-7b for five epochs, refining its parameters and internal representations to improve translation accuracy and reliability. Integrating quantization, LoRA, and FSDP into the training phase optimizes resource use and accelerates training, ensuring that CodeLlama-python-7b delivers enhanced and reliable outputs.
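
The sketch below shows one way quantized loading and LoRA adapters can be combined using the Hugging Face transformers and peft libraries; the hyperparameter values are illustrative assumptions, and FSDP itself is typically enabled through the distributed training launcher rather than in this code.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "codellama/CodeLlama-7b-Python-hf"

# 4-bit quantization shrinks the memory footprint of the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA: train two small low-rank matrices per targeted layer instead of the
# full weight matrices. Rank and target modules here are example values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```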

Demonstrating Effectiveness Through a Case Study

To showcase the capabilities of the fine-tuned CodeLlama-python-7b, we present a real-world case study of successful Java-to-Python code conversions. By comparing the original Java code with its Python counterpart generated by both the base and fine-tuned models, we highlight the accuracy and fidelity of the translation process, demonstrating the model’s efficacy.

Example Prompt:


Response from CodeLlama-python-7b (Base Model: No Fine-Tuning)


Response from CodeLlama-python-7b (Fine-Tuned Model)


Conclusion

Fine-tuning CodeLlama-python-7b has improved Java-to-Python code conversion. Using a curated dataset, prompt engineering, and advanced training optimizations, we enhanced the model’s performance in this domain.

Contact Us

Our skilled, ML-specialized consultants can help you implement ML solutions from start to finish. For inquiries, collaborations, or to learn more about our services, please contact us at info@daimlinc.com.
