Tracking calories and macronutrients is useful for maintaining a healthy lifestyle. To address the shortcomings of single-image nutrition label parsers, I built a scanner that reads from a live video feed.
In the fitness world, one of the biggest factors in body composition is food. Our diets mean everything, and for those trying to lose weight or put on muscle, tracking what enters our bodies is crucial to success. The most effective way to stay on top of our diet is by tracking calories and macronutrients (carbohydrates, proteins, fats), which are found on the nutrition labels of the food we eat.
Current solutions demand an ideal photo in which all text is clearly visible. Capturing one can be challenging and time-consuming for people in a rush, and the buildup of nutrition label photos quickly becomes annoying to manage.
My solution uses a live video feed to incrementally scan, detect, and parse all elements of a nutrition label, even under imperfect conditions (bad lighting, glare, camera motion). As long as the user shows the label from enough angles, the program will eventually reconstruct the full label and terminate.
The process is based on this paper and adapted for live video.
Proposal due. Began researching current solutions and OCR models, including EasyOCR, PaddleOCR, and Tesseract.
Built a prototype able to read data from an ideal image of a nutrition label (flat, clear lighting), matching existing industry solutions.
Added label detection and homography warping to detect and flatten any warped, rotated labels. Also added image thresholding to increase contrast for OCR.
Added live video stream capture. Able to quickly detect and preprocess the label for fast, accurate OCR results. Basic regex parsing of the output.
Added data aggregation so partial results from each frame persist across the capture. Added a counter system to reinforce good data and ensure one-off OCR errors don't affect the final output. More forgiving regex filtering.
Preprocessed label can be read from various lighting and angle conditions
Efficient preprocessing allows scanning at 8-10 frames/second
Simple counter system ignores any one-time OCR mistakes
Info is saved across frames, meaning labels can be rebuilt over time
The implementation is in Python, with the OpenCV and PyTesseract libraries as the backbone, plus NumPy, PIL, and regex. The program uses OpenCV to capture from the webcam and to find, warp, and threshold the image as the preprocessing step. To find the label, I detect the largest quadrilateral in the frame. Then I use homography techniques taught in lecture to warp the label flat. Finally, I apply basic thresholding to boost the contrast for OCR.
From there, I send the preprocessed label to the OCR engine. Since the label is clean and high contrast, the engine returns results faster than it would for the original frame. I then use regex rules to parse the text; they account for common OCR errors (like 1 vs. l vs. I) and return the measurement associated with each macronutrient. I save the results in the backend, where they are aggregated over time, so even if one frame is missing a few pieces of data, that data can be filled in later. As another safety measure, I keep a history of each macronutrient's values in a list. The final result is only overridden if a majority of the recent measurements agree, which allows one-time errors to be ignored.
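The parsing and majority-vote aggregation could be sketched like this. The patterns, the look-alike translation table, and the `MacroTracker` class are hypothetical stand-ins for the real, more forgiving rules; only the overall scheme (regex extraction, per-key history, strict-majority override) follows the description above.

```python
import re
from collections import Counter, deque

# Map common OCR digit look-alikes (assumed set, for illustration).
OCR_FIXES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0",
                           "S": "5", "s": "5"})
DIGITS = r"([0-9lIOoSs]+)"  # digits plus look-alikes, fixed after matching

# Hypothetical patterns; the real rules are more forgiving.
PATTERNS = {
    "calories": re.compile(r"calories\W*" + DIGITS, re.I),
    "fat": re.compile(r"(?:total\s*)?fat\W*" + DIGITS + r"\s*g", re.I),
    "carbs": re.compile(r"carb[a-z]*\W*" + DIGITS + r"\s*g", re.I),
    "protein": re.compile(r"protein\W*" + DIGITS, re.I),
}

def parse_frame(text):
    # Extract whatever macronutrient values appear in one frame's OCR text.
    found = {}
    for key, pat in PATTERNS.items():
        m = pat.search(text)
        if m:
            found[key] = int(m.group(1).translate(OCR_FIXES))
    return found

class MacroTracker:
    # Aggregate per-frame readings; a value is only written (or overridden)
    # when a strict majority of the recent reads agree.
    def __init__(self, history=5):
        self.history = {k: deque(maxlen=history) for k in PATTERNS}
        self.final = {}

    def update(self, frame_values):
        for key, value in frame_values.items():
            h = self.history[key]
            h.append(value)
            best, count = Counter(h).most_common(1)[0]
            if count > len(h) // 2:  # strict majority of recent reads
                self.final[key] = best

    def complete(self):
        # The scan can terminate once every macronutrient is settled.
        return len(self.final) == len(PATTERNS)
```

With this scheme, a frame that misreads "Calories 280" once is outvoted by the surrounding frames that read 200, so the stored value never flickers.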
One drawback of detecting the label as a quadrilateral is that the label must be entirely in frame. The label's edges also need to be unobstructed, or its contours may not be detected. While these aren't severe issues, they are worth noting for the user, and I would like to solve them to make the scanner as approachable and easy to use as possible.