Loading files in Python

We are going to see code examples of how to load different types of files. The cells that read from a file path will only run if that file exists in your working directory; otherwise, you can use them as a reference.

CSV

A CSV (Comma-Separated Values) file represents information in table format: columns are usually separated by a comma (,), although other delimiters such as semicolons or tabs are also common, and rows are separated by a line break.

Normally, whenever we want to read a CSV we will load it into a Pandas DataFrame, which the following code does:

In [ ]:
import pandas as pd

# Set the path to the file you want to read
file = "input.csv"

# The 'read_csv' function allows this reading to be carried out, transforming the file into a DataFrame
df = pd.read_csv(file, sep = ",")

The read_csv function of Pandas has many parameters that let you adapt the reading to the characteristics of the file (separator, encoding, header row, and so on). They are described in the Pandas documentation for this function.
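
As an illustration of those parameters, the following sketch reads semicolon-separated content that uses a comma as the decimal mark. The sample content and column names are made up for the example, and the text is read from an in-memory buffer (io.StringIO) instead of a file on disk so that the cell can run on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content: ';' separates the columns and ',' is the decimal mark
raw = "product;price\nkeyboard;19,5\nmouse;7,25\n"

# 'sep' sets the column separator and 'decimal' the decimal mark,
# so the 'price' column is parsed as floating-point numbers
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

print(df.dtypes)
print(df)
```

With a real file, the path would simply replace the io.StringIO object as the first argument.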

Excel (XLSX, XLS)

A Microsoft Excel file (with XLSX or XLS extension) is a workbook made up of one or more sheets, each of which holds a table, so a sheet can also be loaded into a Pandas DataFrame:

In [ ]:
import pandas as pd

# Set the path to the file you want to read
file = "input.xlsx"

# The 'read_excel' function allows this reading to be carried out, transforming the sheet into a DataFrame
# 'sheet_name' must match the name of a sheet in the workbook exactly
df = pd.read_excel(file, sheet_name = "Sheet 1")

The read_excel function of Pandas has many parameters that let you adapt the reading to the characteristics of the file. They are described in the Pandas documentation for this function.

JSON

A JSON (JavaScript Object Notation) file is a format for transmitting structured information. Like XML, it organizes its elements hierarchically, but with a different notation: objects are written as key-value pairs and lists as arrays.

This type of file can be read in several ways, since there is a direct relationship between Python dictionaries and this type of file. We can also transform it into a Pandas DataFrame if it has a regular structure.

1. File to dictionary

Assume, for example, the following structure:

{
    "filename": "invoice.pdf",
    "numPages": 3,
    "fields": {
        "customerName": "Telefónica S.A.",
        "invoiceNumber": "1234ABCD",
        "totalAmount": "15.000",
        "currency": "EUR"
    }
}

This type of JSON element maps naturally to a Python dictionary. It does not make sense to read it as a Pandas DataFrame, since it describes a single record with nested fields rather than a table of rows, as we will see later. We can read this file with Python's json package:

In [ ]:
import json

# Set the path to the file you want to read
file = "input.json"

with open(file, "r", encoding="utf-8") as f:
    data = json.load(f)
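
Once loaded, the content is handled like any other dictionary. The following sketch parses the structure shown above from a string (so that no file is needed) and accesses some of its values; the variable names and the "dueDate" key are hypothetical and only serve the illustration.

```python
import json

# Same structure as the example above, held in a string so the cell is self-contained
raw = """
{
    "filename": "invoice.pdf",
    "numPages": 3,
    "fields": {
        "customerName": "Telefónica S.A.",
        "invoiceNumber": "1234ABCD",
        "totalAmount": "15.000",
        "currency": "EUR"
    }
}
"""

# json.loads parses a string; json.load does the same for an open file
data = json.loads(raw)

# Nested JSON objects become nested dictionaries
customer = data["fields"]["customerName"]

# Unquoted JSON numbers become int or float; quoted values stay strings
num_pages = data["numPages"]
total_amount = data["fields"]["totalAmount"]

# .get() returns a default value instead of raising KeyError for a missing key
due_date = data["fields"].get("dueDate", None)

print(customer, num_pages, total_amount, due_date)
```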

2. File to DataFrame

Suppose, for example, the following structure:

{
    "files": [
        {
            "filename": "invoice1.pdf",
            "numPages": 3
        },
        {
            "filename": "invoice2.docx",
            "numPages": 10
        },
        {
            "filename": "invoice3.pdf",
            "numPages": 2
        }
    ],
    "status": 200
}

This JSON example replicates a response from a server after a query has been sent to it. Part of its content (the part we are actually interested in) has a table structure: each element of the "files" list (a dictionary) represents a row, and each key of those dictionaries represents a column. Thus, we would transform it into a DataFrame:

In [ ]:
import json

import pandas as pd

# Set the path to the file you want to read
file = "input.json"

# First we read the JSON content
with open(file, "r", encoding="utf-8") as f:
    data = json.load(f)

# The list under "files" is the tabular part: each dictionary becomes a row
df = pd.DataFrame(data["files"])
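
The sketch below runs the same idea end to end on the JSON shown above, held in a string so that no file is needed. It builds the DataFrame from the list stored under "files", since that list is the tabular part. The variable names, and the use of pd.json_normalize to keep "status" as an extra column, are illustrative choices rather than part of the original example.

```python
import json

import pandas as pd

# Same structure as the server response above, held in a string
raw = """
{
    "files": [
        {"filename": "invoice1.pdf", "numPages": 3},
        {"filename": "invoice2.docx", "numPages": 10},
        {"filename": "invoice3.pdf", "numPages": 2}
    ],
    "status": 200
}
"""
data = json.loads(raw)

# A list of dictionaries maps directly to rows (one per dictionary)
# and columns (one per key)
df = pd.DataFrame(data["files"])

# Alternative: json_normalize flattens the list under 'record_path'
# and repeats the top-level 'status' value on every row
df_with_status = pd.json_normalize(data, record_path="files", meta=["status"])

print(df)
print(df_with_status)
```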

TXT

A TXT (plain text) file is a flat file format containing structured or unstructured information. A TXT file can hold CSV content, JSON content, and so on, so the reading methods seen previously also apply to it. To read its raw content, Python offers a very simple way:

In [ ]:
# Set the path to the file you want to read
file = "input.txt"

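# We read the content of the TXT in three different ways.
# Each call continues from where the previous one stopped,
# so 'seek(0)' moves back to the start of the file before the next read.
with open(file, "r") as f:
    text = f.read()
    f.seek(0)
    first_chars = f.readline(10)
    f.seek(0)
    lines = f.readlines()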

In the above example, three methods of the file object are used, each returning a different result. Suppose the file had the following contents:

Hello, how are you?
This file is an example document
To read it through Python

read() function

This method returns the entire contents of the file as a single string, including the line break characters ("\n"). In the above example, the result would be:

"Hello, how are you?\nThis file is an example document\nTo read it through Python".

readline(10) function

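This method reads at most n characters from the current line, stopping earlier if it reaches the end of the line. Called without an argument, it returns the whole line. In the above example, when called at the start of the file, the result would be: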

"Hello, how".

Since we pass 10 as the argument, it reads the first 10 characters of the first line.

readlines() function

This method reads all the content of the file and returns it as a list with one string per line. Each string keeps its trailing line break. In the above example, the result would be:

["Hello, how are you?\n", "This file is an example document\n", "To read it through Python"]
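
To check these three results without preparing a file by hand, the following sketch writes the sample content to a temporary file and reads it back. The temporary directory and the variable names are assumptions made for the example. Because every read advances the position in the file, seek(0) is called to return to the start before the next method.

```python
import os
import tempfile

content = "Hello, how are you?\nThis file is an example document\nTo read it through Python"

# Write the sample content to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "input.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

    with open(path, "r", encoding="utf-8") as f:
        whole_text = f.read()         # the full content as one string
        f.seek(0)                     # go back to the start of the file
        first_chars = f.readline(10)  # at most 10 characters of the first line
        f.seek(0)
        lines = f.readlines()         # one string per line, line breaks kept

# splitlines() gives the lines without the trailing line break characters
clean_lines = whole_text.splitlines()

print(repr(whole_text))
print(repr(first_chars))
print(lines)
print(clean_lines)
```

When the line breaks are not wanted, str.splitlines() (or calling strip() on each line) is a common way to remove them.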